pt.io - Reading/writing files
This module provides utility methods for reading and writing files, including support for standard formats such as TREC-formatted topics files and run files.
- pyterrier.io.autoopen(filename, mode='rb', **kwargs)
A drop-in for open() that applies automatic compression for .gz, .bz2 and .lz4 file extensions
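The extension-based dispatch described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not PyTerrier's implementation: the helper name open_by_extension is hypothetical, and .lz4 is left out because it needs a third-party package.

```python
import bz2
import gzip
import os
import tempfile

def open_by_extension(filename, mode="rb", **kwargs):
    """Hypothetical stand-in for autoopen: pick an opener from the file extension."""
    if filename.endswith(".gz"):
        return gzip.open(filename, mode, **kwargs)
    if filename.endswith(".bz2"):
        return bz2.open(filename, mode, **kwargs)
    return open(filename, mode, **kwargs)

# Round-trip some text through a gzip-compressed file.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "example.txt.gz")
with open_by_extension(gz_path, "wt") as f:
    f.write("some text")
with open_by_extension(gz_path, "rt") as f:
    round_trip = f.read()

# Read the first two raw bytes to confirm the file on disk is gzip data.
with open(gz_path, "rb") as f:
    magic = f.read(2)
```

The caller never names a codec; the file extension alone decides whether the data is compressed.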
- pyterrier.io.find_files(dir)
Returns all the files present in a directory and its subdirectories
- Parameters:
dir – The directory containing the files
- Returns:
A list of the paths to the files
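A recursive listing of this kind can be sketched with os.walk. The helper name list_files_recursively is hypothetical, and sorting the result is a choice made here for reproducibility rather than documented behaviour of find_files.

```python
import os
import tempfile

def list_files_recursively(root):
    """Hypothetical helper: collect the paths of all files under root."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)

# Build a small directory tree to walk.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sub"))
for rel in ("a.txt", os.path.join("sub", "b.txt")):
    with open(os.path.join(root, rel), "w") as f:
        f.write("x")

found = list_files_recursively(root)
```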
- pyterrier.io.finalized_open(path, mode)
Opens a file for writing, but reverts it if there was an error in the process.
- Parameters:
path (str) – Path of file to open
mode (str) – Either t or b, for text or binary mode
- Return type:
ContextManager[BufferedIOBase]
Example
Returns a contextmanager that provides a file object, so should be used in a “with” statement. E.g.:
    with pt.io.finalized_open("file.txt", "t") as f:
        f.write("some text")
    # file.txt exists with contents "some text"
If there is an error when writing, the file is reverted:
    with pt.io.finalized_open("file.txt", "t") as f:
        f.write("some other text")
        raise Exception("an error")
    # file.txt remains unchanged (if existed, contents unchanged; if didn't exist, still doesn't)
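One common way to get this write-or-revert behaviour is to write to a temporary file in the same directory and rename it over the target only on success. The sketch below illustrates that technique with the standard library. It is not taken from PyTerrier's source, and the name write_or_revert is hypothetical.

```python
import contextlib
import os
import tempfile

@contextlib.contextmanager
def write_or_revert(path, mode):
    """Hypothetical sketch: write to a temporary sibling file, publish on success."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w" + mode) as f:
            yield f
        os.replace(tmp_path, path)  # only reached if the body did not raise
    except BaseException:
        # Discard the partial output so the target is left untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

work_dir = tempfile.mkdtemp()
target = os.path.join(work_dir, "file.txt")

# A successful write publishes the file.
with write_or_revert(target, "t") as f:
    f.write("some text")

# A failed write leaves the earlier contents in place.
try:
    with write_or_revert(target, "t") as f:
        f.write("some other text")
        raise RuntimeError("an error")
except RuntimeError:
    pass

with open(target) as f:
    contents = f.read()
```

Because the rename happens only after the body completes, a reader of the target path never sees a half-written file.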
- pyterrier.io.finalized_autoopen(path, mode)
Opens a file for writing with autoopen, but reverts it if there was an error in the process.
- Parameters:
path (str) – Path of file to open
mode (str) – Either t or b, for text or binary mode
- Return type:
ContextManager[BufferedIOBase]
Example
Returns a contextmanager that provides a file object, so should be used in a “with” statement. E.g.:
    with pt.io.finalized_autoopen("file.gz", "t") as f:
        f.write("some text")
    # file.gz exists with contents "some text"
If there is an error when writing, the file is reverted:
    with pt.io.finalized_autoopen("file.gz", "t") as f:
        f.write("some other text")
        raise Exception("an error")
    # file.gz remains unchanged (if existed, contents unchanged; if didn't exist, still doesn't)
- pyterrier.io.ok_filename(fname)
Checks to see if a filename is valid.
- Return type:
bool
- Parameters:
fname (str)
- pyterrier.io.touch(fname, mode=438, dir_fd=None, **kwargs)
Equivalent to the touch command on Linux. Implementation from https://stackoverflow.com/a/1160227
- Parameters:
fname (str)
- pyterrier.io.read_results(filename, format='trec', topics=None, dataset=None, **kwargs)
Reads a file into a results dataframe.
- Parameters:
filename (str) – The filename of the file to be read. Compressed files are handled automatically. A URL is also supported for the “trec” format.
format – The format of the results file: one of “trec”, “letor”. Default is “trec”.
topics (DataFrame|None) – If provided, the topics are merged into the results. This is helpful for providing query text. Cannot be used in conjunction with the dataset argument.
dataset (Dataset|None) – If provided, loads topics from the dataset (or dataset ID) and merges them into the results. This is helpful for providing query text. Cannot be used in conjunction with the topics argument.
kwargs – Other arguments for the internal method
- Return type:
DataFrame
- Returns:
dataframe with usual qid, docno, score columns etc
Examples:
    # a dataframe of results can be used directly in a pt.Experiment
    pt.Experiment(
        [ pt.io.read_results("/path/to/baselines-results.res.gz") ],
        topics,
        qrels,
        ["map"]
    )

    # make a transformer from a results dataframe, include the query text
    first_pass = pt.Transformer.from_df( pt.io.read_results("/path/to/results.gz", topics=topics) )

    # make a max_passage retriever based on previously saved results
    max_passage = (first_pass
        >> pt.text.get_text(dataset)
        >> pt.text.sliding()
        >> pt.text.scorer()
        >> pt.text.max_passage()
    )
- pyterrier.io.write_results(res, filename, format='trec', append=False, **kwargs)
Write a results dataframe to a file.
- Parameters:
res (DataFrame) – A results dataframe, with usual columns of qid, docno etc
filename (str) – The filename of the file to be written. Compressed files are handled automatically.
format (Literal['trec','letor','minimal']) – The format of the results file: one of “trec”, “letor”, “minimal”
append – Append to an existing file. Defaults to False.
kwargs – Other arguments for the internal method
- Supported Formats:
“trec” – output columns are $qid Q0 $docno $rank $score $runname, space separated
“letor” – This follows the LETOR and MSLR datasets, in that output columns are $label qid:$qid [$fid:$value]+ # docno=$docno
“minimal” – output columns are $qid $docno $rank, tab-separated. This is used for submissions to the MSMARCO leaderboard.
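To make the “trec” and “minimal” column layouts concrete, the sketch below formats two result rows by hand. The rows, the rank values and the run name "pyterrier" are made up for illustration, and the helper functions are hypothetical. In practice write_results produces the file for you.

```python
# Made-up result rows; a real results dataframe would supply these columns.
rows = [
    {"qid": "1", "docno": "d7", "rank": 0, "score": 12.5},
    {"qid": "1", "docno": "d3", "rank": 1, "score": 9.25},
]

def trec_line(row, run_name="pyterrier"):
    # $qid Q0 $docno $rank $score $runname, space separated
    return f'{row["qid"]} Q0 {row["docno"]} {row["rank"]} {row["score"]} {run_name}'

def minimal_line(row):
    # $qid $docno $rank, tab-separated
    return f'{row["qid"]}\t{row["docno"]}\t{row["rank"]}'

trec_lines = [trec_line(r) for r in rows]
minimal_lines = [minimal_line(r) for r in rows]
```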
- pyterrier.io.read_topics(filename, format='trec', **kwargs)
Reads a file containing topics.
- Parameters:
filename (str) – The filename of the topics file. A URL is supported for the “trec” and “singleline” formats.
format (Literal['trec','trecxml','singleline']) – One of “trec”, “trecxml” or “singleline”. Default is “trec”
- Return type:
DataFrame
- Returns:
pandas.DataFrame with columns=['qid', 'query'], where both columns have type str.
- Supported Formats:
“trec” – an SGML-formatted TREC topics file. Delimited by TOP tags, each having NUM and TITLE tags; DESC and NARR tags are skipped by default. Control using whitelist and blacklist kwargs
“trecxml” – a more modern XML formatted topics file. Delimited by topic tags, each having number tags. query, question and narrative tags are parsed by default. Control using tags kwarg.
“singleline” – one query per line, with the query text preceded by the query id and a space or colon. Tokenised by default; use the tokenise=False kwarg to prevent tokenisation.
- pyterrier.io.read_qrels(file_path)
Reads a file containing qrels (relevance assessments)
- Parameters:
file_path (str) – The path to the qrels file. A URL is also supported.
- Return type:
DataFrame
- Returns:
pandas.DataFrame with columns=['qid', 'docno', 'label'] with column types string, string, and int
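The shape of the returned data can be illustrated by parsing a few qrels lines by hand. This sketch assumes the conventional four-column TREC qrels layout (query id, iteration, document id, label). The sample data and the helper parse_qrels are hypothetical, and plain dictionaries stand in for dataframe rows.

```python
# Made-up qrels in the four-column TREC layout: qid iteration docno label
sample_qrels = """\
1 0 d7 1
1 0 d3 0
2 0 d9 2
"""

def parse_qrels(text):
    """Hypothetical parser: keep qid and docno as strings, label as int."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        qid, _iteration, docno, label = line.split()
        records.append({"qid": qid, "docno": docno, "label": int(label)})
    return records

qrels = parse_qrels(sample_qrels)
```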
- pyterrier.io.pyterrier_home()
Returns pyterrier’s home directory. By default this is ~/.pyterrier, but it can also be set with the PYTERRIER_HOME env variable.
- Return type:
str
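The lookup order described above (environment variable first, then the default) can be sketched as follows. The function resolve_home is hypothetical and takes the environment as an argument so the example does not depend on the machine it runs on. It is not PyTerrier's own code.

```python
import os

def resolve_home(environ):
    """Hypothetical sketch of the lookup: PYTERRIER_HOME first, then ~/.pyterrier."""
    override = environ.get("PYTERRIER_HOME")
    if override:
        return override
    return os.path.join(os.path.expanduser("~"), ".pyterrier")

default_home = resolve_home({})
custom_home = resolve_home({"PYTERRIER_HOME": "/data/pyterrier"})
```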
- pyterrier.io.finalized_directory(path)
Creates a directory, but reverts it if there was an error in the process.
- Return type:
Generator[str,None,None]
- Parameters:
path (str)
- pyterrier.io.download(url, path, *, expected_sha256=None, verbose=True, headers={})
Downloads a file from a URL to a local path.
- Return type:
None
- Parameters:
url (str)
path (str)
expected_sha256 (str | None)
verbose (bool)
- pyterrier.io.download_stream(url, *, expected_sha256=None, headers=None, verbose=True)
Downloads a file from a URL to a stream.
- Return type:
Generator[BufferedIOBase,None,None]
- Parameters:
url (str)
expected_sha256 (str | None)
headers (Dict[str, str] | None)
verbose (bool)
- pyterrier.io.open_or_download_stream(path_or_url, *, expected_sha256=None, headers=None, verbose=True)
Opens a file or downloads a file from a URL to a stream.
- Return type:
Generator[BufferedIOBase,None,None]
- Parameters:
path_or_url (str)
expected_sha256 (str | None)
headers (Dict[str, str] | None)
verbose (bool)
- class pyterrier.io.HashReader(reader, *, hashfn=<built-in function openssl_sha256>, expected=None)
A reader that computes the sha256 hash of the data read.
Create a HashReader.
- Parameters:
reader (BufferedIOBase)
hashfn (Callable)
expected (str | None)
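The idea of hashing data as it streams through a reader can be sketched as below. This is a minimal illustration, not the HashReader class itself: the class name Sha256Reader and its hexdigest and verify methods are assumptions made for the example.

```python
import hashlib
import io

class Sha256Reader:
    """Hypothetical wrapper: hash bytes as they are read from an underlying stream."""

    def __init__(self, reader, expected=None):
        self.reader = reader
        self.expected = expected
        self.hash = hashlib.sha256()

    def read(self, size=-1):
        chunk = self.reader.read(size)
        self.hash.update(chunk)  # fold every chunk into the running digest
        return chunk

    def hexdigest(self):
        return self.hash.hexdigest()

    def verify(self):
        # Raise if an expected digest was given and does not match.
        if self.expected is not None and self.hexdigest() != self.expected.lower():
            raise ValueError("sha256 mismatch")

payload = b"some text"
reader = Sha256Reader(io.BytesIO(payload), expected=hashlib.sha256(payload).hexdigest())
while reader.read(4):  # consume the stream in small chunks
    pass
reader.verify()
digest = reader.hexdigest()
```

Hashing while reading avoids a second pass over the data, which matters when the stream is a download that cannot be rewound.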
- class pyterrier.io.HashWriter(writer, *, hashfn=<built-in function openssl_sha256>)
A writer that computes the sha256 hash of the data written.
Create a HashWriter.
- Parameters:
writer (BufferedIOBase)
hashfn (Callable)
- class pyterrier.io.TqdmReader(reader, *, total=None, desc=None, disable=False)
A reader that displays a progress bar.
Create a TqdmReader.
- Parameters:
reader (BufferedIOBase)
total (int | None)
desc (str | None)
disable (bool)
- class pyterrier.io.CallbackReader(reader, callback)
A reader that calls a callback with the data read.
Create a CallbackReader.
- Parameters:
reader (BufferedIOBase)
callback (Callable)
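A reader that reports each chunk to a callback can be sketched in a few lines. The class ObservedReader below is a hypothetical stand-in written for illustration. Whether the real CallbackReader passes the raw bytes or something else to its callback is not stated here, so passing the chunk is an assumption.

```python
import io

class ObservedReader:
    """Hypothetical wrapper: hand every chunk read to a callback."""

    def __init__(self, reader, callback):
        self.reader = reader
        self.callback = callback

    def read(self, size=-1):
        chunk = self.reader.read(size)
        self.callback(chunk)  # assumption: the callback receives the raw bytes
        return chunk

# Record the size of each chunk as it passes through.
seen_sizes = []
reader = ObservedReader(io.BytesIO(b"abcdefghij"), lambda chunk: seen_sizes.append(len(chunk)))

data = b""
while True:
    chunk = reader.read(4)
    if not chunk:
        break
    data += chunk
```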