pyterrier.io - Reading/writing files

This module provides utility methods for reading and writing files, including standard formats such as TREC-formatted topics files and run files.

pyterrier.io.autoopen(filename, mode='rb')[source]

A drop-in replacement for open() that automatically handles compression for files with .gz, .bz2 and .lz4 extensions.
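
The extension-based dispatch described above can be sketched with the standard library alone. This is an illustrative sketch, not PyTerrier's implementation: `autoopen_sketch` is a hypothetical name, and .lz4 is left out because it needs a third-party package.

```python
import bz2
import gzip
import os
import tempfile

def autoopen_sketch(filename, mode="rb"):
    """Pick an opener from the file extension (hypothetical helper, not PyTerrier code)."""
    if filename.endswith(".gz"):
        return gzip.open(filename, mode)
    if filename.endswith(".bz2"):
        return bz2.open(filename, mode)
    # .lz4 is omitted in this sketch; anything else falls back to the builtin open()
    return open(filename, mode)

# Round-trip some bytes through a gzip-compressed file in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt.gz")
with autoopen_sketch(demo_path, "wb") as f:
    f.write(b"some text")
with autoopen_sketch(demo_path, "rb") as f:
    round_trip = f.read()

# The bytes on disk are compressed: a gzip stream starts with the magic bytes 1f 8b.
with open(demo_path, "rb") as raw:
    magic = raw.read(2)
```

The caller reads and writes plain bytes in both cases; only the bytes on disk differ.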

pyterrier.io.find_files(dir)[source]

Returns all the files present in a directory and its subdirectories

Parameters:

dir (str) – The directory containing the files

Returns:

A list of the paths to the files

Return type:

list
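
A recursive listing of this kind is commonly written with `os.walk`. The sketch below assumes that approach and uses a hypothetical name, `find_files_sketch`; the ordering of the returned paths (sorted here) is also an assumption.

```python
import os
import tempfile

def find_files_sketch(directory):
    """Collect the paths of all files under directory, recursively (hypothetical helper)."""
    paths = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            paths.append(os.path.join(root, name))
    return sorted(paths)  # sorting is a choice made for this sketch

# Build a small directory tree to search.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "sub"))
for rel in ("a.txt", os.path.join("sub", "b.txt")):
    with open(os.path.join(base, rel), "w") as f:
        f.write("x")

found = find_files_sketch(base)
relative = [os.path.relpath(p, base) for p in found]
```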

pyterrier.io.finalized_open(path, mode)[source]

Opens a file for writing, but reverts it if there was an error in the process.

Parameters:
  • path (str) – Path of file to open

  • mode (str) – Either t or b, for text or binary mode

Example

Returns a contextmanager that provides a file object, so should be used in a “with” statement. E.g.:

import pyterrier as pt

with pt.io.finalized_open("file.txt", "t") as f:
    f.write("some text")
# file.txt exists with contents "some text"

If there is an error when writing, the file is reverted:

with pt.io.finalized_open("file.txt", "t") as f:
    f.write("some other text")
    raise Exception("an error")
# file.txt remains unchanged (if existed, contents unchanged; if didn't exist, still doesn't)
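
One common way to get this revert-on-error behaviour is to write to a temporary file and rename it over the target only when the block finishes cleanly. The sketch below assumes that strategy; `finalized_open_sketch` and the `.tmp` suffix are hypothetical, and PyTerrier's own implementation may differ.

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def finalized_open_sketch(path, mode):
    """Write to a temporary sibling file and rename it into place only on success."""
    assert mode in ("t", "b")
    tmp_path = path + ".tmp"  # hypothetical naming scheme for the temporary file
    try:
        with open(tmp_path, "w" + mode) as f:
            yield f
        os.replace(tmp_path, path)  # reached only if the with-body did not raise
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # discard the partial output
        raise

target = os.path.join(tempfile.mkdtemp(), "file.txt")

# Successful write: the target now holds "some text".
with finalized_open_sketch(target, "t") as f:
    f.write("some text")

# Failed write: the partial output is discarded and the target is untouched.
try:
    with finalized_open_sketch(target, "t") as f:
        f.write("some other text")
        raise RuntimeError("an error")
except RuntimeError:
    pass

with open(target) as f:
    final_contents = f.read()
```

Because the target is replaced in a single rename, a reader never sees a half-written file.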

pyterrier.io.finalized_autoopen(path, mode)[source]

Opens a file for writing with autoopen, but reverts it if there was an error in the process.

Parameters:
  • path (str) – Path of file to open

  • mode (str) – Either t or b, for text or binary mode

Example

Returns a contextmanager that provides a file object, so should be used in a “with” statement. E.g.:

import pyterrier as pt

with pt.io.finalized_autoopen("file.gz", "t") as f:
    f.write("some text")
# file.gz exists with contents "some text"

If there is an error when writing, the file is reverted:

with pt.io.finalized_autoopen("file.gz", "t") as f:
    f.write("some other text")
    raise Exception("an error")
# file.gz remains unchanged (if existed, contents unchanged; if didn't exist, still doesn't)

pyterrier.io.ok_filename(fname)[source]

Checks to see if a filename is valid.

Return type:

bool

pyterrier.io.touch(fname, mode=438, dir_fd=None, **kwargs)[source]

Equivalent to the touch command on Linux. Implementation from https://stackoverflow.com/a/1160227
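
The default mode=438 in the signature is the decimal form of the octal permission 0o666. A simplified sketch of touch semantics follows (create the file if it is missing, then update its timestamps); `touch_sketch` is a hypothetical name and it omits the dir_fd and extra keyword arguments of the real function.

```python
import os
import tempfile

def touch_sketch(fname, mode=0o666):
    """Create fname if it does not exist, then set its timestamps to the current time."""
    fd = os.open(fname, os.O_CREAT | os.O_APPEND, mode)
    os.close(fd)
    os.utime(fname, None)  # None means "use the current time"

default_mode_decimal = 0o666  # the same value as 438 in the signature above

marker = os.path.join(tempfile.mkdtemp(), "marker")
existed_before = os.path.exists(marker)
touch_sketch(marker)
existed_after = os.path.exists(marker)

# Push the timestamps back to the epoch, then touch again to refresh them.
os.utime(marker, (0, 0))
touch_sketch(marker)
refreshed_mtime = os.path.getmtime(marker)
```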

pyterrier.io.read_results(filename, format='trec', topics=None, dataset=None, **kwargs)[source]

Reads a file into a results dataframe.

Parameters:
  • filename (str) – The filename of the file to be read. Compressed files are handled automatically. A URL is also supported for the “trec” format.

  • format (str) – The format of the results file: one of “trec”, “letor”. Default is “trec”.

  • topics (None or pandas.DataFrame) – If provided, merges the topics into the results. This is helpful for providing query text. Cannot be used in conjunction with the dataset argument.

  • dataset (None, str or pyterrier.datasets.Dataset) – If provided, loads topics from the dataset (or dataset ID) and merges them into the results. This is helpful for providing query text. Cannot be used in conjunction with the topics argument.

  • **kwargs (dict) – Other arguments for the internal method

Returns:

A dataframe with the usual qid, docno, score columns, etc.

Examples:

import pyterrier as pt

# topics and qrels are assumed to be dataframes loaded elsewhere
# a dataframe of results can be used directly in a pt.Experiment
pt.Experiment(
    [ pt.io.read_results("/path/to/baselines-results.res.gz") ],
    topics,
    qrels,
    ["map"]
)

# make a transformer from a results dataframe, include the query text
first_pass = pt.Transformer.from_df( pt.io.read_results("/path/to/results.gz", topics=topics) )
# make a max_passage retriever based on previously saved results
max_passage = (first_pass
    >> pt.text.get_text(dataset)
    >> pt.text.sliding()
    >> pt.text.scorer()
    >> pt.text.max_passage()
)

pyterrier.io.write_results(res, filename, format='trec', append=False, **kwargs)[source]

Write a results dataframe to a file.

Parameters:
  • res (DataFrame) – A results dataframe, with usual columns of qid, docno etc

  • filename (str) – The filename of the file to be written. Compressed files are handled automatically.

  • format (str) – The format of the results file: one of “trec”, “letor”, “minimal”. Default is “trec”.

  • append (bool) – Append to an existing file. Defaults to False.

  • **kwargs (dict) – Other arguments for the internal method

Supported Formats:
  • “trec” – output columns are $qid Q0 $docno $rank $score $runname, space-separated.

  • “letor” – follows the LETOR and MSLR datasets, in that output columns are $label qid:$qid [$fid:$value]+ # docno=$docno

  • “minimal” – output columns are $qid $docno $rank, tab-separated. This is used for submissions to the MSMARCO leaderboard.
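
To make the “trec” and “minimal” column layouts concrete, the sketch below formats two result rows by hand. It only illustrates the line layouts listed above and is not how write_results is implemented; the run name "myrun", the sample rows and the rank numbering are made up for the example.

```python
# Two hypothetical result rows for a single query.
rows = [
    {"qid": "1", "docno": "d7", "rank": 0, "score": 12.5},
    {"qid": "1", "docno": "d3", "rank": 1, "score": 9.25},
]

def trec_line(row, run_name="myrun"):
    """$qid Q0 $docno $rank $score $runname, space-separated."""
    return f'{row["qid"]} Q0 {row["docno"]} {row["rank"]} {row["score"]} {run_name}'

def minimal_line(row):
    """$qid $docno $rank, tab-separated."""
    return "\t".join([row["qid"], row["docno"], str(row["rank"])])

trec_lines = [trec_line(r) for r in rows]
minimal_lines = [minimal_line(r) for r in rows]
```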

pyterrier.io.read_topics(filename, format='trec', **kwargs)[source]

Reads a file containing topics.

Parameters:
  • filename (str) – The filename of the topics file. A URL is supported for the “trec” and “singleline” formats.

  • format (str) – One of “trec”, “trecxml” or “singleline”. Default is “trec”

Returns:

A pandas.DataFrame with columns ['qid', 'query']; both columns have type string.

Supported Formats:
  • “trec” – an SGML-formatted TREC topics file. Delimited by TOP tags, each having NUM and TITLE tags; DESC and NARR tags are skipped by default. Control using the whitelist and blacklist kwargs.

  • “trecxml” – a more modern XML-formatted topics file. Delimited by topic tags, each having number tags. The query, question and narrative tags are parsed by default. Control using the tags kwarg.

  • “singleline” – one query per line, each preceded by its query id and a space or colon. Tokenised by default; use the tokenise=False kwarg to prevent tokenisation.
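
As an illustration of the “singleline” layout, the sketch below splits each line at the first space or colon. It assumes the query id comes first on the line and skips tokenisation entirely; `parse_singleline_sketch` and the sample lines are hypothetical, and this is not the parser that read_topics uses.

```python
import re

def parse_singleline_sketch(lines):
    """Split each line into (qid, query) at the first space or colon (assumed layout)."""
    topics = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parts = re.split(r"[ :]", line, maxsplit=1)
        if len(parts) != 2:
            continue  # no separator found; skip the malformed line
        qid, query = parts
        topics.append({"qid": qid, "query": query.strip()})
    return topics

sample_lines = ["1 chemical reactions", "2:information retrieval", ""]
parsed_topics = parse_singleline_sketch(sample_lines)
```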

pyterrier.io.read_qrels(file_path)[source]

Reads a file containing qrels (relevance assessments)

Parameters:

file_path (str) – The path to the qrels file. A URL is also supported.

Returns:

A pandas.DataFrame with columns ['qid', 'docno', 'label'], with column types string, string and int.
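
For orientation, a TREC-style qrels file conventionally has four whitespace-separated fields per line: query id, an iteration field (usually 0), document number and relevance label. The sketch below parses that layout with the standard library into the three columns named above; `parse_qrels_sketch` and the sample data are hypothetical, and read_qrels itself returns a dataframe rather than a list of dicts.

```python
# Hypothetical qrels content: qid, iteration, docno, label.
sample_qrels = """\
1 0 d7 1
1 0 d3 0
2 0 d9 2
"""

def parse_qrels_sketch(text):
    """Parse 'qid iteration docno label' lines, keeping qid, docno and an int label."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _iteration, docno, label = parts
        rows.append({"qid": qid, "docno": docno, "label": int(label)})
    return rows

qrels_rows = parse_qrels_sketch(sample_qrels)
```

The qid and docno values stay as strings, matching the column types listed above, so ids such as "001" keep their leading zeros.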