pyterrier.io - Reading/writing files
This module provides useful utility methods for reading and writing files. In particular, it also provides support for reading and writing standard formats, such as TREC-formatted topics files or run files.
- pyterrier.io.autoopen(filename, mode='rb', **kwargs)
A drop-in replacement for open() that automatically applies compression or decompression based on the .gz, .bz2 or .lz4 file extension.
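To illustrate the general idea of choosing an opener from the file extension, here is a minimal standard-library sketch. The name `sketch_autoopen` is hypothetical and this is not PyTerrier's implementation; it only covers .gz and .bz2, because .lz4 would need a third-party package.

```python
import bz2
import gzip
import os
import tempfile


def sketch_autoopen(filename, mode="rb", **kwargs):
    """Hypothetical helper: pick an opener based on the file extension."""
    if filename.endswith(".gz"):
        return gzip.open(filename, mode, **kwargs)
    if filename.endswith(".bz2"):
        return bz2.open(filename, mode, **kwargs)
    return open(filename, mode, **kwargs)


def roundtrip(filename, payload):
    """Write payload through the sketch, then read it back."""
    with sketch_autoopen(filename, "wb") as f:
        f.write(payload)
    with sketch_autoopen(filename, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    gz_result = roundtrip(os.path.join(tmp, "demo.gz"), b"some text")
    bz2_result = roundtrip(os.path.join(tmp, "demo.bz2"), b"some text")
    plain_result = roundtrip(os.path.join(tmp, "demo.txt"), b"some text")
```

The caller never has to mention the codec: the same read and write code works for compressed and uncompressed paths.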
- pyterrier.io.find_files(dir)
Returns all the files present in a directory and its subdirectories
- Parameters:
dir (str) – The directory containing the files
- Returns:
A list of the paths to the files
- Return type:
paths(list)
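A recursive listing of this kind can be sketched with `os.walk`. The function name `sketch_find_files` is hypothetical, and returning the paths sorted is an assumption made here so the output is deterministic; it is not taken from the PyTerrier source.

```python
import os
import tempfile


def sketch_find_files(directory):
    """Hypothetical helper: list every file under directory, recursively."""
    paths = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            paths.append(os.path.join(root, name))
    return sorted(paths)  # sorting is an assumption, for stable output


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(tmp, rel), "w") as f:
            f.write("x")
    # Report paths relative to the temporary directory for readability.
    found = [os.path.relpath(p, tmp) for p in sketch_find_files(tmp)]
```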
- pyterrier.io.finalized_open(path, mode)
Opens a file for writing, but reverts it if there was an error in the process.
- Return type:
ContextManager[BufferedIOBase]
- Parameters:
path (str) – Path of file to open
mode (str) – Either t or b, for text or binary mode
Example
Returns a context manager that provides a file object, so it should be used in a “with” statement. E.g.:
with pt.io.finalized_open("file.txt", "t") as f:
    f.write("some text")
# file.txt exists with contents "some text"
If there is an error when writing, the file is reverted:
with pt.io.finalized_open("file.txt", "t") as f:
    f.write("some other text")
    raise Exception("an error")
# file.txt remains unchanged (if existed, contents unchanged; if didn't exist, still doesn't)
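One common way to get this revert-on-error behaviour is to write to a temporary file in the same directory and rename it into place only when the block completes. The sketch below shows that pattern; `sketch_finalized_open` is a hypothetical name, and the temp-file-plus-rename strategy is an assumption, not a description of PyTerrier's internals.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def sketch_finalized_open(path, mode):
    """Hypothetical helper: write to a temp file, rename only on success."""
    assert mode in ("t", "b")
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w" + mode) as f:
            yield f
        os.replace(tmp_path, path)  # swap the finished file into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # discard the partial output
        raise


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "file.txt")
    with sketch_finalized_open(target, "t") as f:
        f.write("some text")
    try:
        with sketch_finalized_open(target, "t") as f:
            f.write("some other text")
            raise RuntimeError("an error")
    except RuntimeError:
        pass
    with open(target) as f:
        contents = f.read()
    leftovers = os.listdir(tmp)
```

Because the target path is only touched by the final rename, a failure part-way through leaves any earlier version of the file intact.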
- pyterrier.io.finalized_autoopen(path, mode)
Opens a file for writing with autoopen, but reverts it if there was an error in the process.
- Return type:
ContextManager[BufferedIOBase]
- Parameters:
path (str) – Path of file to open
mode (str) – Either t or b, for text or binary mode
Example
Returns a context manager that provides a file object, so it should be used in a “with” statement. E.g.:
with pt.io.finalized_autoopen("file.gz", "t") as f:
    f.write("some text")
# file.gz exists with contents "some text"
If there is an error when writing, the file is reverted:
with pt.io.finalized_autoopen("file.gz", "t") as f:
    f.write("some other text")
    raise Exception("an error")
# file.gz remains unchanged (if existed, contents unchanged; if didn't exist, still doesn't)
- pyterrier.io.touch(fname, mode=438, dir_fd=None, **kwargs)
Equivalent to the touch command on Linux. Implementation from https://stackoverflow.com/a/1160227
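The default mode of 438 is the decimal form of the octal permission 0o666. A simplified touch can be sketched as follows; `sketch_touch` is a hypothetical name and this version omits the dir_fd and extra keyword handling, so treat it as an illustration rather than the referenced implementation.

```python
import os
import tempfile


def sketch_touch(fname, mode=0o666):
    """Hypothetical helper: create fname if missing, then update its times."""
    fd = os.open(fname, os.O_CREAT | os.O_APPEND, mode)
    os.close(fd)
    os.utime(fname, None)  # set access and modification times to now


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "marker")
    sketch_touch(target)
    created_empty = os.path.getsize(target) == 0
    with open(target, "w") as f:
        f.write("keep me")
    sketch_touch(target)  # touching an existing file keeps its contents
    with open(target) as f:
        contents_after = f.read()
```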
- pyterrier.io.read_results(filename, format='trec', topics=None, dataset=None, **kwargs)
Reads a file into a results dataframe.
- Parameters:
filename (str) – The filename of the file to be read. Compressed files are handled automatically. A URL is also supported for the “trec” format.
format (str) – The format of the results file: one of “trec”, “letor”. Default is “trec”.
topics (None or pandas.DataFrame) – If provided, the topics are merged into the results. This is helpful for providing query text. Cannot be used in conjunction with the dataset argument.
dataset (None, str or pyterrier.datasets.Dataset) – If provided, loads topics from the dataset (or dataset ID) and merges them into the results. This is helpful for providing query text. Cannot be used in conjunction with the topics argument.
**kwargs (dict) – Other arguments for the internal method
- Returns:
dataframe with usual qid, docno, score columns etc
Examples:
# a dataframe of results can be used directly in a pt.Experiment
pt.Experiment(
    [pt.io.read_results("/path/to/baselines-results.res.gz")],
    topics,
    qrels,
    ["map"]
)

# make a transformer from a results dataframe, include the query text
first_pass = pt.Transformer.from_df(
    pt.io.read_results("/path/to/results.gz", topics=topics)
)

# make a max_passage retriever based on previously saved results
max_passage = (
    first_pass
    >> pt.text.get_text(dataset)
    >> pt.text.sliding()
    >> pt.text.scorer()
    >> pt.text.max_passage()
)
- pyterrier.io.write_results(res, filename, format='trec', append=False, **kwargs)
Write a results dataframe to a file.
- Parameters:
res (DataFrame) – A results dataframe, with usual columns of qid, docno etc
filename (str) – The filename of the file to be written. Compressed files are handled automatically.
format (str) – The format of the results file: one of “trec”, “letor”, “minimal”
append (bool) – Append to an existing file. Defaults to False.
**kwargs (dict) – Other arguments for the internal method
- Supported Formats:
“trec” – output columns are $qid Q0 $docno $rank $score $runname, space separated
“letor” – This follows the LETOR and MSLR datasets, in that output columns are $label qid:$qid [$fid:$value]+ # docno=$docno
“minimal” – output columns are $qid $docno $rank, tab-separated. This is used for submissions to the MSMARCO leaderboard.
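The three line layouts listed above can be made concrete with plain string formatting. The helper names, the default run name "sketch" and the sample values below are all hypothetical; the snippet only mirrors the column orders described here and does not call PyTerrier.

```python
def format_trec_line(qid, docno, rank, score, runname="sketch"):
    """Space-separated: $qid Q0 $docno $rank $score $runname."""
    return f"{qid} Q0 {docno} {rank} {score} {runname}"


def format_letor_line(label, qid, features, docno):
    """$label qid:$qid [$fid:$value]+ # docno=$docno."""
    feats = " ".join(f"{fid}:{value}" for fid, value in features)
    return f"{label} qid:{qid} {feats} # docno={docno}"


def format_minimal_line(qid, docno, rank):
    """Tab-separated: $qid $docno $rank."""
    return f"{qid}\t{docno}\t{rank}"


# Illustrative values only.
trec_line = format_trec_line("q1", "d7", 0, 12.5)
letor_line = format_letor_line(1, "q1", [(1, 0.5), (2, 3.0)], "d7")
minimal_line = format_minimal_line("q1", "d7", 0)
```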
- pyterrier.io.read_topics(filename, format='trec', **kwargs)
Reads a file containing topics.
- Parameters:
filename (str) – The filename of the topics file. A URL is supported for the “trec” and “singleline” formats.
format (str) – One of “trec”, “trecxml” or “singleline”. Default is “trec”
- Returns:
pandas.DataFrame with columns=['qid', 'query']; both columns have type string
- Supported Formats:
“trec” – an SGML-formatted TREC topics file. Delimited by TOP tags, each having NUM and TITLE tags; DESC and NARR tags are skipped by default. Control which tags are used with the whitelist and blacklist kwargs.
“trecxml” – a more modern XML-formatted topics file. Delimited by topic tags, each having number tags. The query, question and narrative tags are parsed by default. Control which tags are used with the tags kwarg.
“singleline” – one query per line, preceded by a space or colon. Tokenised by default; pass the tokenise=False kwarg to prevent tokenisation.
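As a rough picture of the “singleline” layout, the sketch below assumes each line holds a query ID, then a space or colon, then the query text. That reading of the format, the `parse_singleline` name and the sample queries are assumptions; the sketch performs no tokenisation and is not PyTerrier's parser.

```python
import re


def parse_singleline(lines):
    """Hypothetical parser for 'qid query text' or 'qid:query text' lines."""
    topics = []
    for line in lines:
        line = line.strip()
        # Split once, on the first space or colon.
        parts = re.split(r"[ :]", line, maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank lines and lines without a delimiter
        qid, query = parts
        topics.append({"qid": qid, "query": query.strip()})
    return topics


topics = parse_singleline(["1 chemical reactions", "2:information retrieval", ""])
```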
- pyterrier.io.read_qrels(file_path)
Reads a file containing qrels (relevance assessments)
- Parameters:
file_path (str) – The path to the qrels file. A URL is also supported.
- Returns:
pandas.DataFrame with columns=['qid', 'docno', 'label'], with column types string, string and int
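For orientation, the sketch below parses lines in the standard four-column TREC qrels layout (query ID, iteration, document ID, label) into records with the three columns named above. `parse_qrels` and the sample lines are hypothetical, and plain dictionaries stand in for the DataFrame to keep the example dependency-free.

```python
def parse_qrels(lines):
    """Hypothetical parser for 'qid iteration docno label' lines."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _iteration, docno, label = parts
        # qid and docno stay strings; the label becomes an int.
        rows.append({"qid": qid, "docno": docno, "label": int(label)})
    return rows


qrels = parse_qrels(["q1 0 doc1 1", "q1 0 doc2 0", ""])
```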
- pyterrier.io.pyterrier_home()
Returns pyterrier’s home directory. By default this is ~/.pyterrier, but it can also be set with the PYTERRIER_HOME env variable.
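The lookup order described above (environment variable first, then the default under the user's home directory) might look like the following. `sketch_home` is a hypothetical name, the example path is made up, and the exact resolution logic inside PyTerrier may differ.

```python
import os


def sketch_home(env=None):
    """Hypothetical helper: PYTERRIER_HOME if set, else ~/.pyterrier."""
    env = os.environ if env is None else env
    return env.get("PYTERRIER_HOME") or os.path.expanduser("~/.pyterrier")


# Passing a dict instead of os.environ keeps the demo free of side effects.
default_home = sketch_home({})
custom_home = sketch_home({"PYTERRIER_HOME": "/data/pt_home"})
```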
- Return type:
str
- pyterrier.io.finalized_directory(path)
Creates a directory, but reverts it if there was an error in the process.
- Return type:
Generator[str, None, None]
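The same revert-on-error idea can be applied to a whole directory: build the contents in a temporary sibling directory, then rename it to the final path only if nothing failed. The sketch below assumes that strategy and that the target does not already exist; `sketch_finalized_directory` is a hypothetical name, not PyTerrier's code.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def sketch_finalized_directory(path):
    """Hypothetical helper: build in a temp dir, rename only on success."""
    parent = os.path.dirname(os.path.abspath(path))
    tmp_dir = tempfile.mkdtemp(dir=parent, suffix=".tmp")
    try:
        yield tmp_dir  # the caller writes into the temporary directory
        os.replace(tmp_dir, path)
    except BaseException:
        shutil.rmtree(tmp_dir, ignore_errors=True)  # discard partial output
        raise


with tempfile.TemporaryDirectory() as base:
    good = os.path.join(base, "index")
    with sketch_finalized_directory(good) as d:
        with open(os.path.join(d, "data.bin"), "wb") as f:
            f.write(b"ok")
    bad = os.path.join(base, "broken")
    try:
        with sketch_finalized_directory(bad) as d:
            raise RuntimeError("an error")
    except RuntimeError:
        pass
    entries = sorted(os.listdir(base))
    good_files = os.listdir(good)
```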
- pyterrier.io.download(url, path, *, expected_sha256=None, verbose=True)
Downloads a file from a URL to a local path.
- Return type:
None
- pyterrier.io.download_stream(url, *, expected_sha256=None, headers=None, verbose=True)
Downloads a file from a URL to a stream.
- Return type:
Generator[BufferedIOBase, None, None]
- pyterrier.io.open_or_download_stream(path_or_url, *, expected_sha256=None, headers=None, verbose=True)
Opens a file or downloads a file from a URL to a stream.
- Return type:
Generator[BufferedIOBase, None, None]
- class pyterrier.io.HashReader(reader, *, hashfn=<built-in function openssl_sha256>, expected=None)
A reader that computes the sha256 hash of the data read.
Create a HashReader.
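A hashing reader of this kind can be sketched as a thin wrapper that feeds every chunk it reads into a hash object. The class `SketchHashReader` and its `hexdigest` and `verify` methods are hypothetical names chosen for illustration; only the constructor arguments mirror the signature shown above.

```python
import hashlib
import io


class SketchHashReader:
    """Hypothetical wrapper: hashes everything read through it."""

    def __init__(self, reader, *, hashfn=hashlib.sha256, expected=None):
        self.reader = reader
        self.hash = hashfn()
        self.expected = expected

    def read(self, size=-1):
        chunk = self.reader.read(size)
        self.hash.update(chunk)
        return chunk

    def hexdigest(self):
        return self.hash.hexdigest()

    def verify(self):
        """Raise if an expected digest was given and does not match."""
        if self.expected is not None and self.hexdigest() != self.expected.lower():
            raise ValueError("hash mismatch")


payload = b"some text"
expected = hashlib.sha256(payload).hexdigest()
reader = SketchHashReader(io.BytesIO(payload), expected=expected)
while reader.read(4):  # consume the stream in small chunks
    pass
reader.verify()
digest = reader.hexdigest()

bad = SketchHashReader(io.BytesIO(payload), expected="0" * 64)
bad.read()
try:
    bad.verify()
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Hashing while reading avoids a second pass over the data, which matters when the source is a download stream rather than a local file.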
- class pyterrier.io.HashWriter(writer, *, hashfn=<built-in function openssl_sha256>)
A writer that computes the sha256 hash of the data written.
Create a HashWriter.
- class pyterrier.io.TqdmReader(reader, *, total=None, desc=None, disable=False)
A reader that displays a progress bar.
Create a TqdmReader.
- class pyterrier.io.CallbackReader(reader, callback)
A reader that calls a callback with the data read.
Create a CallbackReader.
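The callback pattern can be illustrated with a small wrapper that hands each chunk to a user-supplied function before returning it. `SketchCallbackReader` is a hypothetical class written for this example, and whether the real class also invokes the callback for the final empty read is not something this sketch claims to reproduce.

```python
import io


class SketchCallbackReader:
    """Hypothetical wrapper: passes each chunk read to a callback."""

    def __init__(self, reader, callback):
        self.reader = reader
        self.callback = callback

    def read(self, size=-1):
        chunk = self.reader.read(size)
        self.callback(chunk)
        return chunk


# Track how many bytes have passed through the reader.
chunk_sizes = []
reader = SketchCallbackReader(io.BytesIO(b"abcdef"), lambda c: chunk_sizes.append(len(c)))
while reader.read(4):
    pass
total_bytes = sum(chunk_sizes)
```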