Caching Indexing Pipeline Results

IndexerCache saves the sequence of documents encountered in an indexing pipeline. It allows you to repeat that sequence without needing to re-execute the computations up to that point.

Example use case: I want to test how different retrieval engines perform over learned sparse representations, but I don’t want to re-compute the representations each time.

You use an IndexerCache the same way you would use an indexer: as the last component of a pipeline. Rather than building an index of the data, the IndexerCache will save your results to a file on disk. This file can be re-read by iterating over the cache object with iter(cache).

Example:

Caching the results of an expensive transformer using IndexerCache

import pyterrier as pt
from pyterrier_caching import IndexerCache

# Setup
cache = IndexerCache('path/to/cache')
dataset = pt.get_dataset('irds:msmarco-passage')

# Use the cache object just as you would an indexer.
# MyExpensiveTransformer is a placeholder for your own (slow) transformer.
cache_pipeline = MyExpensiveTransformer() >> cache

# The following line will save the results of MyExpensiveTransformer() to path/to/cache
cache_pipeline.index(dataset.get_corpus_iter())

# Now you can build multiple indexes over the results of MyExpensiveTransformer without
# needing to re-run it each time
indexer1 = ... # e.g., pt.IterDictIndexer('./path/to/index.terrier')
indexer1.index(iter(cache))
indexer2 = ... # e.g., pyterrier_pisa.PisaIndex('./path/to/index.pisa')
indexer2.index(iter(cache))

IndexerCache also supports other operations over the cached results, such as random access by index or docno, conversion to a DataFrame, and loading text by docno. See the API documentation below for more details.
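To make the benefit of the pattern above concrete, here is a minimal, library-free sketch. `ToyCache` and `expensive_transform` are hypothetical stand-ins (they are not part of pyterrier_caching), and the toy cache lives in memory rather than on disk. The sketch only illustrates that the expensive step runs once per document while several consumers replay the same records.

```python
from typing import Any, Dict, Iterable, Iterator, List

calls = 0  # counts how often the expensive step runs


def expensive_transform(docs: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for an expensive transformer."""
    global calls
    for doc in docs:
        calls += 1
        yield {**doc, "n_tokens": len(doc["text"].split())}


class ToyCache:
    """In-memory stand-in for an indexer-style cache (illustration only)."""

    def __init__(self) -> None:
        self._records: List[Dict[str, Any]] = []

    def index(self, it: Iterable[Dict[str, Any]]) -> None:
        # Consume the pipeline output exactly once and remember every record.
        self._records = list(it)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        # Replay the stored records; no upstream computation happens here.
        return iter(self._records)


corpus = [
    {"docno": "d1", "text": "caching saves time"},
    {"docno": "d2", "text": "index once read many times"},
]

cache = ToyCache()
cache.index(expensive_transform(corpus))

# Two independent consumers read the same cached sequence.
docnos = [rec["docno"] for rec in cache]
total_tokens = sum(rec["n_tokens"] for rec in cache)
```

After both passes, `calls` still equals the number of documents, because iterating the cache never re-invokes the upstream step.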

API Documentation

pyterrier_caching.IndexerCache: alias of Lz4PickleIndexerCache

class pyterrier_caching.Lz4PickleIndexerCache(path=None)

An IndexerCache that stores records as pickled dictionaries compressed with lz4.

Parameters: path (str | Path | None) – The path to the cache. If None, a temporary cache will be created that is deleted when closed.
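As a rough illustration of what storing records as compressed, pickled dictionaries can look like, here is a minimal sketch of a length-prefixed record file. It is not the library's actual file layout: the 4-byte length framing, the helper names `write_records` and `read_records`, and the use of the standard library's zlib in place of lz4 are all assumptions, chosen so the example runs without extra dependencies.

```python
import os
import pickle
import struct
import tempfile
import zlib
from typing import Any, Dict, Iterable, Iterator


def write_records(path: str, records: Iterable[Dict[str, Any]]) -> int:
    """Write each record as a compressed pickle with a 4-byte length prefix."""
    count = 0
    with open(path, "wb") as f:
        for record in records:
            blob = zlib.compress(pickle.dumps(record))  # zlib stands in for lz4
            f.write(struct.pack("<I", len(blob)))
            f.write(blob)
            count += 1
    return count


def read_records(path: str) -> Iterator[Dict[str, Any]]:
    """Yield records back in the order they were written."""
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break  # clean end of file
            (size,) = struct.unpack("<I", header)
            yield pickle.loads(zlib.decompress(f.read(size)))


records = [
    {"docno": "d1", "toks": {"cache": 1.5, "index": 0.7}},
    {"docno": "d2", "toks": {"sparse": 2.0}},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.bin")
    written = write_records(path, records)
    restored = list(read_records(path))
```

Because unpickling can execute arbitrary code, any pickle-based cache should only be read from sources you trust.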

indexer(mode=ArtifactBuilderMode.create, skip_docno_lookup=False)

Returns an Indexer for this cache. The indexer can be used to create the cache.

Return type: Indexer

Parameters:

mode (str | ArtifactBuilderMode) – The mode to use for the indexer. Must be ‘create’.
skip_docno_lookup (bool) – If True, skip creating a docno lookup.

index(it)

Indexes the provided records to this cache.

Return type: None
Parameters: it (Iterator[Dict[str, Any]])

__iter__()

Iterates over the records stored in the cache.

Return type: Iterator[Dict[str, Any]]

__len__()

Returns the number of records stored in the cache.

Return type: int

get_corpus_iter(verbose=False, fields=None, start=None, stop=None)

Iterates over the records stored in the cache.

Return type: Iterator[Dict[str, Any]]

Parameters:

verbose (bool) – If True, show a progress bar.
fields (List[str] | None) – If not None, only return these fields.
start (int | None) – If not None, start at this record number.
stop (int | None) – If not None, stop at this record number.
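
The `fields`, `start`, and `stop` parameters can be pictured as a projection plus a slice over the record stream. The sketch below is a hypothetical re-implementation over a plain list, not the library's code. In particular, it assumes `stop` is exclusive, as in Python slicing, which the documentation above does not state.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List, Optional


def corpus_iter(
    records: Iterable[Dict[str, Any]],
    fields: Optional[List[str]] = None,
    start: Optional[int] = None,
    stop: Optional[int] = None,
) -> Iterator[Dict[str, Any]]:
    """Hypothetical helper: slice a record stream and keep selected fields."""
    for record in islice(records, start, stop):  # assumes an exclusive stop
        if fields is None:
            yield record
        else:
            yield {name: record[name] for name in fields}


records = [{"docno": f"d{i}", "text": f"text {i}", "score": i} for i in range(5)]
subset = list(corpus_iter(records, fields=["docno", "score"], start=1, stop=3))
everything = list(corpus_iter(records))
```

Restricting `fields` this way is useful when records carry large payloads (for example full text) that a downstream consumer does not need.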

to_dataframe(verbose=False, fields=None, start=None, stop=None)

Converts the results in this cache to a DataFrame.

Return type: DataFrame

Parameters:

verbose (bool) – If True, show a progress bar.
fields (List[str] | None) – If not None, only return these fields.
start (int | None) – If not None, start at this record number.
stop (int | None) – If not None, stop at this record number.

__getitem__(items)

Returns the record(s) stored in the cache by the provided index, docno, or range.

Parameters: items (int | str | slice)
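
Accepting an int, a str, or a slice in `__getitem__` is commonly done by dispatching on the argument type. The sketch below shows one way this could look. `ToyRecords` is a hypothetical in-memory class, and returning a single dict for an int or docno and a list for a slice is an assumption, not documented behaviour of the library.

```python
from typing import Any, Dict, List, Union

Record = Dict[str, Any]


class ToyRecords:
    """Hypothetical container supporting lookup by position, docno, or range."""

    def __init__(self, records: List[Record]) -> None:
        self._records = records
        # Map each docno to its position for string lookups.
        self._pos = {rec["docno"]: i for i, rec in enumerate(records)}

    def __getitem__(self, items: Union[int, str, slice]) -> Union[Record, List[Record]]:
        if isinstance(items, str):
            return self._records[self._pos[items]]  # KeyError if docno is unknown
        if isinstance(items, slice):
            return self._records[items]
        if isinstance(items, int):
            return self._records[items]
        raise TypeError(f"unsupported key type: {type(items).__name__}")


toy = ToyRecords([{"docno": "a"}, {"docno": "b"}, {"docno": "c"}])
by_index = toy[1]
by_docno = toy["c"]
by_range = toy[0:2]
```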

built()

Returns True if the cache is built.

Return type: bool

text_loader(fields='*', *, verbose=False)

Returns a Transformer that loads the text from the cache based on docno.

Return type: Transformer

Parameters:

fields (str | List[str] | Literal['*']) – If not ‘*’, only return these fields.
verbose (bool) – If True, show a progress bar.
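
Conceptually, a text loader joins cached fields onto retrieval results by docno. The following sketch uses plain lists of dicts instead of DataFrames and a hypothetical `load_text` function. It is meant only to show the lookup-and-merge idea under those assumptions, not how the returned Transformer is implemented.

```python
from typing import Any, Dict, List, Union


def load_text(
    results: List[Dict[str, Any]],
    store: Dict[str, Dict[str, Any]],
    fields: Union[str, List[str]] = "*",
) -> List[Dict[str, Any]]:
    """Hypothetical helper: attach cached fields to each result row by docno."""
    out = []
    for row in results:
        cached = store[row["docno"]]
        if fields == "*":
            wanted = [name for name in cached if name != "docno"]
        elif isinstance(fields, str):
            wanted = [fields]
        else:
            wanted = list(fields)
        # Keep the original result columns and add the requested cached fields.
        out.append({**row, **{name: cached[name] for name in wanted}})
    return out


store = {
    "d1": {"docno": "d1", "text": "first passage", "title": "One"},
    "d2": {"docno": "d2", "text": "second passage", "title": "Two"},
}
results = [{"qid": "q1", "docno": "d2", "score": 3.2}]
with_text = load_text(results, store, fields=["text"])
with_all = load_text(results, store)
```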

docnos(): Returns a Lookup for the docnos stored in the cache.

close(): Closes any open files used by this cache.