pt.io - Reading/writing files

This module provides utility methods for reading and writing files, including support for standard formats such as TREC-formatted topics files and run files.

pyterrier.io.autoopen(filename, mode='rb', **kwargs)

A drop-in replacement for open() that transparently handles compression for files with the .gz, .bz2 and .lz4 extensions.
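The idea of choosing an opener from the file extension can be sketched with the standard library alone. This is a minimal illustration, not pyterrier's implementation: open_by_extension is a hypothetical name, and .lz4 is left out because it needs a third-party package.

```python
import bz2
import gzip
import os
import tempfile

# Hypothetical helper (not part of pyterrier): pick an opener by extension.
def open_by_extension(filename, mode="rb", **kwargs):
    if filename.endswith(".gz"):
        return gzip.open(filename, mode, **kwargs)
    if filename.endswith(".bz2"):
        return bz2.open(filename, mode, **kwargs)
    # Anything else falls back to the built-in open().
    return open(filename, mode, **kwargs)

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "example.txt.gz")

# Write and read back text through the gzip-aware opener.
with open_by_extension(path, "wt") as f:
    f.write("some text")
with open_by_extension(path, "rt") as f:
    roundtrip = f.read()

# Read the raw bytes to confirm the file on disk really is gzip-compressed.
with open(path, "rb") as f:
    magic = f.read(2)
```

Callers use the same code path for compressed and plain files; only the filename changes.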

pyterrier.io.find_files(dir)

Returns all the files present in a directory and its subdirectories.

Parameters: dir – The directory containing the files
Returns: A list of the paths to the files
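A recursive listing of this kind can be sketched with os.walk. The helper name list_files_recursively is hypothetical, and sorting the result is a choice made here for reproducibility; the source does not say whether find_files orders its output.

```python
import os
import tempfile

# Hypothetical stand-in for the documented behaviour, built on os.walk.
def list_files_recursively(directory):
    paths = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            paths.append(os.path.join(root, name))
    # Sorting is an assumption of this sketch, not documented behaviour.
    return sorted(paths)

# Build a tiny directory tree to walk.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "sub"))
for rel in ("a.txt", os.path.join("sub", "b.txt")):
    with open(os.path.join(base, rel), "w") as f:
        f.write("x")

found = list_files_recursively(base)
relative = [os.path.relpath(p, base) for p in found]
```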

pyterrier.io.finalized_open(path, mode)

Opens a file for writing, but reverts it if there was an error in the process.

Parameters:

path (str) – Path of file to open
mode (str) – Either t or b, for text or binary mode

Return type:

ContextManager[BufferedIOBase]

Example

Returns a context manager that provides a file object, so it should be used in a “with” statement. E.g.:

with pt.io.finalized_open("file.txt", "t") as f:
    f.write("some text")
# file.txt exists with contents "some text"

If there is an error when writing, the file is reverted:

with pt.io.finalized_open("file.txt", "t") as f:
    f.write("some other text")
    raise Exception("an error")
# file.txt remains unchanged (if existed, contents unchanged; if didn't exist, still doesn't)
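One common way to get this revert-on-error behaviour is to write to a temporary file and only move it into place on success. The sketch below assumes that approach; atomic_write is a hypothetical name and this is not necessarily how finalized_open is implemented.

```python
import contextlib
import os
import tempfile

# Hypothetical sketch of the write-then-rename pattern.
@contextlib.contextmanager
def atomic_write(path, mode):
    # mode is "t" or "b", mirroring the documented parameter.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w" + mode) as f:
            yield f
        # Only reached when the body finished without raising.
        os.replace(tmp_path, path)
    except BaseException:
        # Discard the partial output and leave the target untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "file.txt")

with atomic_write(target, "t") as f:
    f.write("some text")

try:
    with atomic_write(target, "t") as f:
        f.write("some other text")
        raise RuntimeError("an error")
except RuntimeError:
    pass

with open(target) as f:
    contents = f.read()
leftovers = os.listdir(workdir)
```

Creating the temporary file in the target's own directory keeps the final os.replace on a single filesystem.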

pyterrier.io.finalized_autoopen(path, mode)

Opens a file for writing with autoopen, but reverts it if there was an error in the process.

Parameters:

path (str) – Path of file to open
mode (str) – Either t or b, for text or binary mode

Return type:

ContextManager[BufferedIOBase]

Example

Returns a context manager that provides a file object, so it should be used in a “with” statement. E.g.:

with pt.io.finalized_autoopen("file.gz", "t") as f:
    f.write("some text")
# file.gz exists with contents "some text"

If there is an error when writing, the file is reverted:

with pt.io.finalized_autoopen("file.gz", "t") as f:
    f.write("some other text")
    raise Exception("an error")
# file.gz remains unchanged (if existed, contents unchanged; if didn't exist, still doesn't)

pyterrier.io.ok_filename(fname)

Checks to see if a filename is valid.

Return type: bool
Parameters: fname (str)

pyterrier.io.touch(fname, mode=438, dir_fd=None, **kwargs)

Equivalent to the touch command on Linux. Implementation from https://stackoverflow.com/a/1160227. The default mode of 438 is 0o666 in octal.

Parameters: fname (str)
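The effect of a touch can be sketched as follows: create the file if it is missing, leave existing contents alone, and refresh the timestamps. This simplified version (touch_sketch is a hypothetical name) ignores dir_fd and extra keyword arguments, so it is not the referenced implementation.

```python
import os
import tempfile

# Simplified, hypothetical sketch of "touch" semantics.
def touch_sketch(fname, mode=0o666):
    # Create the file if it is missing, without truncating existing contents.
    fd = os.open(fname, os.O_CREAT | os.O_APPEND, mode)
    os.close(fd)
    # Passing None sets access and modification times to the current time.
    os.utime(fname, None)

workdir = tempfile.mkdtemp()
marker = os.path.join(workdir, "done.marker")

# Touching a missing file creates it empty.
touch_sketch(marker)
created_size = os.path.getsize(marker)

# Touching an existing file keeps its contents.
with open(marker, "w") as f:
    f.write("keep me")
touch_sketch(marker)
with open(marker) as f:
    kept = f.read()
```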

pyterrier.io.read_results(filename, format='trec', topics=None, dataset=None, **kwargs)

Reads a file into a results dataframe.

Parameters:

filename (str) – The filename of the file to be read. Compressed files are handled automatically. A URL is also supported for the “trec” format.
format – The format of the results file: one of “trec”, “letor”. Default is “trec”.
topics (DataFrame | None) – If provided, the topics are merged into the results. This is helpful for providing query text. Cannot be used in conjunction with the dataset argument.
dataset (Dataset | None) – If provided, loads topics from the dataset (or dataset ID) and merges them into the results. This is helpful for providing query text. Cannot be used in conjunction with the topics argument.
kwargs – Other arguments for the internal method

Return type:

DataFrame

Returns:

dataframe with usual qid, docno, score columns etc

Examples:

# a dataframe of results can be used directly in a pt.Experiment
pt.Experiment(
    [ pt.io.read_results("/path/to/baselines-results.res.gz") ],
    topics,
    qrels,
    ["map"]
)

# make a transformer from a results dataframe, include the query text
first_pass = pt.Transformer.from_df( pt.io.read_results("/path/to/results.gz", topics=topics) )
# make a max_passage retriever based on previously saved results
max_passage = (first_pass
    >> pt.text.get_text(dataset)
    >> pt.text.sliding()
    >> pt.text.scorer()
    >> pt.text.max_passage()
)

pyterrier.io.write_results(res, filename, format='trec', append=False, **kwargs)

Write a results dataframe to a file.

Parameters:

res (DataFrame) – A results dataframe, with usual columns of qid, docno etc
filename (str) – The filename of the file to be written. Compressed files are handled automatically.
format (Literal['trec', 'letor', 'minimal']) – The format of the results file: one of “trec”, “letor”, “minimal”
append – Append to an existing file. Defaults to False.
kwargs – Other arguments for the internal method

Supported Formats:

“trec” – output columns are $qid Q0 $docno $rank $score $runname, space separated
“letor” – This follows the LETOR and MSLR datasets, in that output columns are $label qid:$qid [$fid:$value]+ # docno=$docno
“minimal” – output columns are $qid $docno $rank, tab-separated. This is used for submissions to the MSMARCO leaderboard.
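The three line layouts above can be made concrete with plain string formatting. This sketch only builds example lines and does not call write_results; the helper names, the run name "myrun", the sample rows, and the choice to number feature IDs from 1 are all assumptions made for illustration.

```python
# Made-up result rows for illustration only.
rows = [
    {"qid": "1", "docno": "d7", "rank": 0, "score": 12.5},
    {"qid": "1", "docno": "d3", "rank": 1, "score": 9.25},
]

def trec_line(row, run_name="myrun"):
    # $qid Q0 $docno $rank $score $runname, space separated
    return f"{row['qid']} Q0 {row['docno']} {row['rank']} {row['score']} {run_name}"

def minimal_line(row):
    # $qid $docno $rank, tab-separated
    return f"{row['qid']}\t{row['docno']}\t{row['rank']}"

def letor_line(label, qid, features, docno):
    # $label qid:$qid [$fid:$value]+ # docno=$docno
    # Feature IDs start at 1 here, which is an assumption of this sketch.
    feats = " ".join(f"{fid}:{value}" for fid, value in enumerate(features, start=1))
    return f"{label} qid:{qid} {feats} # docno={docno}"

trec_example = trec_line(rows[0])
minimal_example = minimal_line(rows[1])
letor_example = letor_line(1, "1", [0.5, 3.0], "d7")
```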

pyterrier.io.read_topics(filename, format='trec', **kwargs)

Reads a file containing topics.

Parameters:

filename (str) – The filename of the topics file. A URL is supported for the “trec” and “singleline” formats.
format (Literal['trec', 'trecxml', 'singleline']) – One of “trec”, “trecxml” or “singleline”. Default is “trec”

Return type:

DataFrame

Returns:

pandas.DataFrame with columns=['qid', 'query'], where both columns have type str.

Supported Formats:

“trec” – an SGML-formatted TREC topics file. Delimited by TOP tags, each having NUM and TITLE tags; DESC and NARR tags are skipped by default. Control this using the whitelist and blacklist kwargs.
“trecxml” – a more modern XML-formatted topics file. Delimited by topic tags, each having number tags. The query, question and narrative tags are parsed by default. Control this using the tags kwarg.
“singleline” – one query per line, preceded by its query ID and a space or colon. Tokenised by default; use the tokenise=False kwarg to prevent tokenisation.
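To illustrate the “singleline” layout, the sketch below splits each line at the first space or colon into a query ID and query text. It assumes the ID comes first on the line, skips the tokenisation step entirely, and uses a hypothetical helper name; it is not the parser that read_topics uses.

```python
import re

# Hypothetical parser for one "singleline" topic line.
def parse_singleline(line):
    # Split once, at the first space or colon, into ID and query text.
    qid, query = re.split(r"[ :]", line.strip(), maxsplit=1)
    return {"qid": qid, "query": query.strip()}

# Made-up topic lines for illustration.
lines = [
    "101:chemical reactions",
    "102 information retrieval",
]
topics_sketch = [parse_singleline(line) for line in lines]
```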

pyterrier.io.read_qrels(file_path)

Reads a file containing qrels (relevance assessments).

Parameters: file_path (str) – The path to the qrels file. A URL is also supported.
Return type: DataFrame
Returns: pandas.DataFrame with columns=['qid', 'docno', 'label'] with column types string, string, and int
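The returned columns map onto the conventional four-column TREC qrels layout (query ID, iteration, document number, label). The sketch below assumes that layout, which this section does not spell out, and uses a hypothetical helper that returns plain dictionaries rather than a dataframe.

```python
# Hypothetical parser for whitespace-separated TREC-style qrels lines.
def parse_qrels(lines):
    records = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # The iteration column is conventionally ignored.
        qid, _iteration, docno, label = line.split()
        records.append({"qid": qid, "docno": docno, "label": int(label)})
    return records

# Made-up assessments for illustration.
sample = [
    "301 0 FBIS3-10082 1",
    "301 0 FBIS3-10169 0",
    "",
]
qrels_sketch = parse_qrels(sample)
```

The IDs stay as strings and only the label is converted to int, matching the column types listed above.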

pyterrier.io.pyterrier_home()

Returns pyterrier’s home directory. By default this is ~/.pyterrier, but it can also be set with the PYTERRIER_HOME environment variable.

Return type: str
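The lookup order described above (environment variable first, then the default under the user's home directory) can be sketched as below. resolve_home is a hypothetical name, and the environ parameter exists only so that the example can run without touching the real environment.

```python
import os

# Hypothetical sketch of the documented lookup order.
def resolve_home(environ=None):
    environ = os.environ if environ is None else environ
    # An explicit PYTERRIER_HOME takes precedence over the default.
    if "PYTERRIER_HOME" in environ:
        return environ["PYTERRIER_HOME"]
    return os.path.expanduser(os.path.join("~", ".pyterrier"))

custom = resolve_home({"PYTERRIER_HOME": "/data/pyterrier"})
default = resolve_home({})
```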

pyterrier.io.finalized_directory(path)

Creates a directory, but reverts it if there was an error in the process.

Return type: Generator[str, None, None]
Parameters: path (str)

pyterrier.io.download(url, path, *, expected_sha256=None, verbose=True, headers={})

Downloads a file from a URL to a local path.

Return type:

None

Parameters:

url (str)
path (str)
expected_sha256 (str | None)
verbose (bool)

pyterrier.io.download_stream(url, *, expected_sha256=None, headers=None, verbose=True)

Downloads a file from a URL to a stream.

Return type:

Generator[BufferedIOBase, None, None]

Parameters:

url (str)
expected_sha256 (str | None)
headers (Dict[str, str] | None)
verbose (bool)

pyterrier.io.open_or_download_stream(path_or_url, *, expected_sha256=None, headers=None, verbose=True)

Opens a file or downloads a file from a URL to a stream.

Return type:

Generator[BufferedIOBase, None, None]

Parameters:

path_or_url (str)
expected_sha256 (str | None)
headers (Dict[str, str] | None)
verbose (bool)

class pyterrier.io.HashReader(reader, *, hashfn=<built-in function openssl_sha256>, expected=None)

A reader that computes the sha256 hash of the data read.

Create a HashReader.

Parameters:

reader (BufferedIOBase)
hashfn (Callable)
expected (str | None)

on_data(data)

Called when data is read.

Return type: None
Parameters: data (bytes)

hexdigest()

Return the hexdigest of the hash.

Return type: str

close()

Close the reader and check the hash.

Return type: None
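The hash-while-reading idea can be sketched with hashlib: update a digest on every read and compare it with the expected value on close. HashingReader is a hypothetical class, and raising ValueError on a mismatch is an assumption of this sketch; the source does not say what HashReader does when the hash differs.

```python
import hashlib
import io

# Hypothetical wrapper illustrating the pattern; not pyterrier's class.
class HashingReader:
    def __init__(self, reader, expected=None):
        self.reader = reader
        self.hash = hashlib.sha256()
        self.expected = expected

    def read(self, size=-1):
        data = self.reader.read(size)
        self.hash.update(data)  # every byte read feeds the digest
        return data

    def hexdigest(self):
        return self.hash.hexdigest()

    def close(self):
        self.reader.close()
        # ValueError is this sketch's choice of error type.
        if self.expected is not None and self.hexdigest() != self.expected.lower():
            raise ValueError("sha256 mismatch")

payload = b"hello world"
good = hashlib.sha256(payload).hexdigest()

# Matching digest: close() succeeds.
reader = HashingReader(io.BytesIO(payload), expected=good)
data = reader.read()
reader.close()

# Wrong digest: close() reports the mismatch.
bad_reader = HashingReader(io.BytesIO(payload), expected="0" * 64)
bad_reader.read()
try:
    bad_reader.close()
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Checking at close time means the whole stream must be consumed before the digest is meaningful.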

class pyterrier.io.HashWriter(writer, *, hashfn=<built-in function openssl_sha256>)

A writer that computes the sha256 hash of the data written.

Create a HashWriter.

Parameters:

writer (BufferedIOBase)
hashfn (Callable)

on_data(data)

Called when data is written.

Return type: None
Parameters: data (bytes | bytearray | memoryview)

hexdigest()

Return the hexdigest of the hash.

Return type: str

class pyterrier.io.TqdmReader(reader, *, total=None, desc=None, disable=False)

A reader that displays a progress bar.

Create a TqdmReader.

Parameters:

reader (BufferedIOBase)
total (int | None)
desc (str | None)
disable (bool)

on_data(data)

Called when data is read.

Return type: None
Parameters: data (bytes)

close()

Close the reader and the progress bar.

Return type: None

class pyterrier.io.CallbackReader(reader, callback)

A reader that calls a callback with the data read.

Create a CallbackReader.

Parameters:

reader (BufferedIOBase)
callback (Callable)

on_data(data)

Called when data is read.

Return type: None
Parameters: data (bytes)

class pyterrier.io.MultiReader(readers)

A reader that reads from multiple readers in sequence.

Create a MultiReader.

Parameters: readers (Iterable[BufferedIOBase])

read1(size=None)

Read a single chunk of data.

Return type: bytes
Parameters: size (int | None)

read(size=None)

Read data.

Return type: bytes
Parameters: size (int | None)
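Reading from several readers in sequence can be sketched as below: read1 returns one chunk from the current reader and moves on when it is exhausted, while read keeps calling read1 until the requested size is met or everything is consumed. ChainedReader is a hypothetical class written for illustration and may differ from MultiReader in its details.

```python
import io

# Hypothetical sketch of sequential reading across several readers.
class ChainedReader:
    def __init__(self, readers):
        self._readers = iter(readers)
        self._current = next(self._readers, None)

    def read1(self, size=None):
        if size == 0:
            return b""
        while self._current is not None:
            chunk = self._current.read(size)
            if chunk:
                return chunk
            # Current reader is exhausted; move on to the next one.
            self._current = next(self._readers, None)
        return b""

    def read(self, size=None):
        bounded = size is not None and size >= 0
        remaining = size
        chunks = []
        while not bounded or remaining > 0:
            chunk = self.read1(remaining)
            if not chunk:
                break  # all readers are exhausted
            chunks.append(chunk)
            if bounded:
                remaining -= len(chunk)
        return b"".join(chunks)

chained = ChainedReader([io.BytesIO(b"abc"), io.BytesIO(b"def")])
first = chained.read(5)   # spans the boundary between the two readers
rest = chained.read()     # whatever is left
after_end = chained.read()
```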

pyterrier.io.path_is_under_base(path, base)

Returns True if the path is under the base directory.

Return type:

bool

Parameters:

path (str)
base (str)
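A containment check of this kind is typically done on normalised paths, so that ".." segments cannot escape the base. The sketch below uses os.path.realpath and os.path.commonpath; is_under_base is a hypothetical name, and treating the base itself as "under" the base is a choice made here, not something the source states.

```python
import os
import tempfile

# Hypothetical containment check on normalised paths.
def is_under_base(path, base):
    # Resolve symlinks and ".." segments before comparing.
    path = os.path.realpath(path)
    base = os.path.realpath(base)
    # The longest shared prefix must be the base directory itself.
    return os.path.commonpath([path, base]) == base

base = os.path.join(tempfile.mkdtemp(), "index")

inside = is_under_base(os.path.join(base, "data", "file.bin"), base)
escaped = is_under_base(os.path.join(base, "..", "secret.txt"), base)
# A sibling directory sharing a name prefix is not inside the base.
sibling = is_under_base(base + "2", base)
```

Comparing whole path components rather than raw string prefixes is what rejects the sibling case.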