Caching Indexing Pipeline Results

IndexerCache saves the sequence of documents encountered in an indexing pipeline. It allows you to repeat that sequence without needing to re-execute the computations up to that point.

Example use case: I want to test how different retrieval engines perform over learned sparse representations, but I don’t want to re-compute the representations each time.

You use an IndexerCache the same way you would use an indexer: as the last component of a pipeline. Rather than building an index of the data, the IndexerCache will save your results to a file on disk. This file can be re-read by iterating over the cache object with iter(cache).

Example:

Caching the results of an expensive transformer using IndexerCache
import pyterrier as pt
from pyterrier_caching import IndexerCache

# Setup
cache = IndexerCache('path/to/cache')
dataset = pt.get_dataset('irds:msmarco-passage')

# Use the IndexerCache cache object just as you would an indexer
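# (MyExpensiveTransformer is a placeholder for your own expensive transformer,
#  e.g., one that computes learned sparse representations)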
cache_pipeline = MyExpensiveTransformer() >> cache

# The following line will save the results of MyExpensiveTransformer() to path/to/cache
cache_pipeline.index(dataset.get_corpus_iter())

# Now you can build multiple indexes over the results of MyExpensiveTransformer without
# needing to re-run it each time
indexer1 = ... # e.g., pt.IterDictIndexer('./path/to/index.terrier')
indexer1.index(iter(cache))
indexer2 = ... # e.g., pyterrier_pisa.PisaIndex('./path/to/index.pisa')
indexer2.index(iter(cache))

IndexerCache provides a variety of other functionality over the cached results. See the API documentation below for more details.

API Documentation

pyterrier_caching.IndexerCache

alias of Lz4PickleIndexerCache

class pyterrier_caching.Lz4PickleIndexerCache(path=None)[source]

An IndexerCache that stores records as pickled dictionaries compressed with lz4.

Parameters:
  • path – The path to the cache. If None, a temporary cache will be created that is deleted when closed.
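
For example, a cache can be opened at a persistent path or, if no path is given, created as a temporary cache. A minimal sketch:

from pyterrier_caching import IndexerCache

cache = IndexerCache('path/to/cache')   # persistent cache at this path
temp_cache = IndexerCache()             # temporary cache, deleted when closed
temp_cache.close()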

indexer(mode=ArtifactBuilderMode.create, skip_docno_lookup=False)[source]

Returns an Indexer for this cache. The indexer can be used to create the cache.

Return type: Indexer

Parameters:
  • mode – The mode to use for the indexer. Must be ‘create’.

  • skip_docno_lookup – If True, skip creating a docno lookup.
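
A sketch of building a cache through indexer() directly, which is equivalent to piping into the cache object as in the example above (the dataset is illustrative):

import pyterrier as pt
from pyterrier_caching import IndexerCache

cache = IndexerCache('path/to/cache')
dataset = pt.get_dataset('irds:msmarco-passage')
indexer = cache.indexer()                 # mode defaults to 'create'
indexer.index(dataset.get_corpus_iter())  # writes the records to path/to/cache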

index(it)[source]

Indexes the provided records into this cache.

Return type: None
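
For instance, a small iterable of dicts could be written to the cache directly (the field names here are illustrative):

from pyterrier_caching import IndexerCache

cache = IndexerCache('path/to/cache')
cache.index([
    {'docno': 'd1', 'text': 'caching indexing pipeline results'},
    {'docno': 'd2', 'text': 'without re-running expensive transformers'},
])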

__iter__()[source]

Iterates over the records stored in the cache.

Return type: Iterator[Dict[str, Any]]

__len__()[source]

Returns the number of records stored in the cache.

Return type: int
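
Together, these let a built cache be treated as a sequence of records; a quick sketch (assuming the cache has already been built and stores a docno field):

from pyterrier_caching import IndexerCache

cache = IndexerCache('path/to/cache')
print(len(cache))             # number of cached records
for record in cache:          # same as iterating over iter(cache)
    print(record['docno'])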

get_corpus_iter(verbose=False, fields=None, start=None, stop=None)[source]

Iterates over the records stored in the cache.

Return type: Iterator[Dict[str, Any]]

Parameters:
  • verbose – If True, show a progress bar.

  • fields – If not None, only return these fields.

  • start – If not None, start at this record number.

  • stop – If not None, stop at this record number.
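
A sketch of iterating over a subset of the cached records (the field names are illustrative):

from pyterrier_caching import IndexerCache

cache = IndexerCache('path/to/cache')
# only the docno and text fields of the first 100 records, with a progress bar
for record in cache.get_corpus_iter(verbose=True, fields=['docno', 'text'], stop=100):
    print(record)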

to_dataframe(verbose=False, fields=None, start=None, stop=None)[source]

Converts the results in this cache to a DataFrame.

Return type: DataFrame

Parameters:
  • verbose – If True, show a progress bar.

  • fields – If not None, only return these fields.

  • start – If not None, start at this record number.

  • stop – If not None, stop at this record number.
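
A sketch of loading part of the cache into a DataFrame (field names are illustrative):

from pyterrier_caching import IndexerCache

cache = IndexerCache('path/to/cache')
df = cache.to_dataframe(fields=['docno', 'text'], stop=1000)  # first 1,000 records
print(df.head())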

__getitem__(items)[source]

Returns the record(s) stored in the cache by the provided index, docno, or range.
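
For example (assuming a built cache that contains a document with docno 'd1'):

from pyterrier_caching import IndexerCache

cache = IndexerCache('path/to/cache')
record = cache[0]        # by position
record = cache['d1']     # by docno
records = cache[0:10]    # by range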

built()[source]

Returns True if the cache is built.

Return type: bool
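
This can be used to avoid re-building an existing cache; a sketch:

import pyterrier as pt
from pyterrier_caching import IndexerCache

cache = IndexerCache('path/to/cache')
dataset = pt.get_dataset('irds:msmarco-passage')
if not cache.built():
    cache.index(dataset.get_corpus_iter())  # only build if the cache does not already exist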

text_loader(fields='*', *, verbose=False)[source]

Returns a Transformer that loads the text from the cache based on docno.

Return type: Transformer

Parameters:
  • fields – If not ‘*’, only return these fields.

  • verbose – If True, show a progress bar.
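
One way this might be used is to re-attach cached text to retrieval results by docno; the retriever and field name below are illustrative:

from pyterrier_caching import IndexerCache

cache = IndexerCache('path/to/cache')
retriever = ...  # e.g., a BM25 retriever that returns a docno column
pipeline = retriever >> cache.text_loader(fields=['text'])
results = pipeline.search('caching pipelines')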

docnos()[source]

Returns a Lookup for the docnos stored in the cache.

close()[source]

Closes any open files used by this cache.