pyterrier.io - Reading/writing files
This module provides utility methods for reading and writing files, including support for standard formats such as TREC-formatted topics files and run files.
- pyterrier.io.autoopen(filename, mode='rb')
A drop-in replacement for open() that automatically applies compression or decompression for files with .gz, .bz2 and .lz4 extensions.
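The extension-based dispatch described above can be sketched with the standard library. This is only an illustration, not PyTerrier's implementation: the helper name `open_by_extension` is hypothetical, and `.lz4` is left out because it would need a third-party package.

```python
import bz2
import gzip
import os
import tempfile


def open_by_extension(filename, mode="rb"):
    """Hypothetical helper: choose an opener from the file extension."""
    if filename.endswith(".gz"):
        return gzip.open(filename, mode)
    if filename.endswith(".bz2"):
        return bz2.open(filename, mode)
    # Anything else is treated as an uncompressed file.
    return open(filename, mode)


# Round-trip some bytes through a gzip-compressed temporary file.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "example.txt.gz")
with open_by_extension(gz_path, "wb") as f:
    f.write(b"some text")
with open_by_extension(gz_path, "rb") as f:
    round_trip = f.read()
```

The caller never names the compression scheme; the extension alone decides which opener is used.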
- pyterrier.io.find_files(dir)
Returns all the files present in a directory and its subdirectories
- Parameters:
dir (str) – The directory containing the files
- Returns:
A list of the paths to the files
- Return type:
list
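A recursive listing of this kind can be sketched with `os.walk`. The helper name `list_files_recursively` is hypothetical, and the sorting is an assumption made here only so the output is deterministic; the source does not say how the library orders its result.

```python
import os
import tempfile


def list_files_recursively(root):
    """Hypothetical helper: collect the paths of all files under root."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


# Build a small directory tree and list it.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sub"))
for rel in ("a.txt", os.path.join("sub", "b.txt")):
    with open(os.path.join(root, rel), "w") as f:
        f.write("x")
found = list_files_recursively(root)
```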
- pyterrier.io.finalized_open(path, mode)
Opens a file for writing, but reverts it if there was an error in the process.
- Parameters:
path (str) – Path of file to open
mode (str) – Either t or b, for text or binary mode
Example
Returns a context manager that provides a file object, so it should be used in a “with” statement. For example:
with pt.io.finalized_open("file.txt", "t") as f:
    f.write("some text")
# file.txt exists with contents "some text"
If there is an error when writing, the file is reverted:
with pt.io.finalized_open("file.txt", "t") as f:
    f.write("some other text")
    raise Exception("an error")
# file.txt remains unchanged (if existed, contents unchanged; if didn't exist, still doesn't)
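One common way to get this revert-on-error behaviour is to write to a temporary file and rename it into place only when the block finishes without an exception. The sketch below illustrates that technique under stated assumptions; the helper name `write_then_commit` and the `.tmp` suffix are hypothetical, and the source does not say that PyTerrier works this way.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def write_then_commit(path, mode="t"):
    """Hypothetical helper: write to a temp file, rename only on success."""
    tmp_path = path + ".tmp"  # assumed naming scheme for the scratch file
    try:
        with open(tmp_path, "w" + mode) as f:
            yield f
    except BaseException:
        # Discard the partial output; the target path is never touched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    # No error occurred: move the finished file into place.
    os.replace(tmp_path, path)


target = os.path.join(tempfile.mkdtemp(), "file.txt")
with write_then_commit(target, "t") as f:
    f.write("some text")

try:
    with write_then_commit(target, "t") as f:
        f.write("some other text")
        raise RuntimeError("an error")
except RuntimeError:
    pass  # the failed write must not have changed the target

with open(target) as f:
    final_contents = f.read()
```

Because the target is replaced only after the block succeeds, readers of the target path never see a half-written file.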
- pyterrier.io.finalized_autoopen(path, mode)
Opens a file for writing with autoopen, but reverts it if there was an error in the process.
- Parameters:
path (str) – Path of file to open
mode (str) – Either t or b, for text or binary mode
Example
Returns a context manager that provides a file object, so it should be used in a “with” statement. For example:
with pt.io.finalized_autoopen("file.gz", "t") as f:
    f.write("some text")
# file.gz exists with contents "some text"
If there is an error when writing, the file is reverted:
with pt.io.finalized_autoopen("file.gz", "t") as f:
    f.write("some other text")
    raise Exception("an error")
# file.gz remains unchanged (if existed, contents unchanged; if didn't exist, still doesn't)
- pyterrier.io.touch(fname, mode=438, dir_fd=None, **kwargs)
Equivalent to the touch command on Linux. Implementation from https://stackoverflow.com/a/1160227
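The default `mode=438` in the signature is the decimal form of the octal permission value `0o666`. A touch-like operation can be sketched as follows; `touch_sketch` is a hypothetical, simplified helper and not the implementation referenced above (for instance, it ignores `dir_fd`).

```python
import os
import tempfile


def touch_sketch(fname, mode=0o666):
    """Hypothetical helper: create fname if missing, then update its timestamps."""
    # O_CREAT creates the file if needed; O_APPEND avoids truncating it.
    fd = os.open(fname, os.O_CREAT | os.O_APPEND, mode)
    os.close(fd)
    # Passing None sets access and modification times to the current time.
    os.utime(fname, None)


path = os.path.join(tempfile.mkdtemp(), "marker")
touch_sketch(path)  # creates an empty file
created_size = os.path.getsize(path)

with open(path, "w") as f:
    f.write("keep me")
touch_sketch(path)  # leaves existing contents alone
with open(path) as f:
    contents_after_touch = f.read()
```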
- pyterrier.io.read_results(filename, format='trec', topics=None, dataset=None, **kwargs)
Reads a file into a results dataframe.
- Parameters:
filename (str) – The filename of the file to be read. Compressed files are handled automatically. A URL is also supported for the “trec” format.
format (str) – The format of the results file: one of “trec”, “letor”. Default is “trec”.
topics (None or pandas.DataFrame) – If provided, the topics are merged into the results. This is helpful for providing query text. Cannot be used in conjunction with the dataset argument.
dataset (None, str or pyterrier.datasets.Dataset) – If provided, loads topics from the dataset (or dataset ID) and merges them into the results. This is helpful for providing query text. Cannot be used in conjunction with the topics argument.
**kwargs (dict) – Other arguments for the internal method
- Returns:
A dataframe with the usual columns: qid, docno, score, etc.
Examples:
# a dataframe of results can be used directly in a pt.Experiment
pt.Experiment(
    [ pt.io.read_results("/path/to/baselines-results.res.gz") ],
    topics,
    qrels,
    ["map"]
)

# make a transformer from a results dataframe, include the query text
first_pass = pt.Transformer.from_df( pt.io.read_results("/path/to/results.gz", topics=topics) )

# make a max_passage retriever based on previously saved results
max_passage = (first_pass
    >> pt.text.get_text(dataset)
    >> pt.text.sliding()
    >> pt.text.scorer()
    >> pt.text.max_passage()
)
- pyterrier.io.write_results(res, filename, format='trec', append=False, **kwargs)
Write a results dataframe to a file.
- Parameters:
res (DataFrame) – A results dataframe, with usual columns of qid, docno etc
filename (str) – The filename of the file to be written. Compressed files are handled automatically.
format (str) – The format of the results file: one of “trec”, “letor”, “minimal”
append (bool) – Append to an existing file. Defaults to False.
**kwargs (dict) – Other arguments for the internal method
- Supported Formats:
“trec” – output columns are $qid Q0 $docno $rank $score $runname, space separated
“letor” – This follows the LETOR and MSLR datasets, in that output columns are $label qid:$qid [$fid:$value]+ # docno=$docno
“minimal” – output columns are $qid $docno $rank, tab-separated. This is used for submissions to the MSMARCO leaderboard.
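To make the three layouts concrete, the sketch below formats a single result row in each of them. The function names are hypothetical, and two details are assumptions not stated in the source: feature ids are numbered from 1, and scores and feature values are written with Python's default formatting.

```python
def format_trec(qid, docno, rank, score, runname):
    """Hypothetical helper: one space-separated TREC run line."""
    return f"{qid} Q0 {docno} {rank} {score} {runname}"


def format_letor(label, qid, features, docno):
    """Hypothetical helper: one LETOR line; feature ids assumed to start at 1."""
    feats = " ".join(f"{fid}:{value}" for fid, value in enumerate(features, start=1))
    return f"{label} qid:{qid} {feats} # docno={docno}"


def format_minimal(qid, docno, rank):
    """Hypothetical helper: one tab-separated line with qid, docno and rank."""
    return f"{qid}\t{docno}\t{rank}"


trec_line = format_trec("1", "d7", 0, 12.5, "bm25")
letor_line = format_letor(1, "1", [0.5, 2.0], "d7")
minimal_line = format_minimal("1", "d7", 0)
```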
- pyterrier.io.read_topics(filename, format='trec', **kwargs)
Reads a file containing topics.
- Parameters:
filename (str) – The filename of the topics file. A URL is supported for the “trec” and “singleline” formats.
format (str) – One of “trec”, “trecxml” or “singleline”. Default is “trec”
- Returns:
pandas.DataFrame with columns=[‘qid’, ‘query’]; both columns have type string.
- Supported Formats:
“trec” – an SGML-formatted TREC topics file. Delimited by TOP tags, each having NUM and TITLE tags; DESC and NARR tags are skipped by default. Control which tags are used with the whitelist and blacklist kwargs.
“trecxml” – a more modern XML-formatted topics file. Delimited by topic tags, each having number tags. The query, question and narrative tags are parsed by default. Control which tags are used with the tags kwarg.
“singleline” – one query per line, with the query id first, separated from the query text by a space or colon. Tokenised by default; use the tokenise=False kwarg to prevent tokenisation.
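As a rough illustration of the single-line layout, the sketch below splits a line into a query id and query text. It assumes each line starts with the query id, followed by a space or colon and then the query; the helper name `parse_singleline` is hypothetical, and no tokenisation is applied here.

```python
import re


def parse_singleline(line):
    """Hypothetical helper: split 'qid query' or 'qid:query' into a pair."""
    match = re.match(r"\s*([^\s:]+)[\s:]+(.*\S)\s*$", line)
    if match is None:
        raise ValueError(f"cannot parse topic line: {line!r}")
    return match.group(1), match.group(2)


lines = ["1 chemical reactions", "2:information retrieval\n"]
topics = [parse_singleline(line) for line in lines]
```

Both values in each pair are strings, which matches the string-typed qid and query columns described above.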