Caching Indexing Pipeline Results
IndexerCache saves the sequence of documents encountered in an indexing pipeline.
It allows you to repeat that sequence without needing to re-execute the computations up to that point.
Example use case: I want to test how different retrieval engines perform over learned sparse representations, but I don’t want to re-compute the representations each time.
You use an IndexerCache the same way you would use an indexer: as the last component of a pipeline.
Rather than building an index of the data, the IndexerCache saves your results to a file on disk.
This file can be re-read by iterating over the cache object with iter(cache).
Example:
import pyterrier as pt
from pyterrier_caching import IndexerCache
# Setup
cache = IndexerCache('path/to/cache')
dataset = pt.get_dataset('irds:msmarco-passage')
# Use the cache object just as you would an indexer
# (MyExpensiveTransformer is a placeholder for your own transformer)
cache_pipeline = MyExpensiveTransformer() >> cache
# The following line will save the results of MyExpensiveTransformer() to path/to/cache
cache_pipeline.index(dataset.get_corpus_iter())
# Now you can build multiple indexes over the results of MyExpensiveTransformer without
# needing to re-run it each time
indexer1 = ... # e.g., pt.IterDictIndexer('./path/to/index.terrier')
indexer1.index(iter(cache))
indexer2 = ... # e.g., pyterrier_pisa.PisaIndex('./path/to/index.pisa')
indexer2.index(iter(cache))
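
The write-once, replay-many idea behind the example above can be sketched with the standard library alone. This is a minimal illustration, not the pyterrier_caching implementation: TinyRecordCache, expensive_stream, and the length-prefixed file layout are hypothetical, and zlib stands in for lz4 so that the sketch needs no third-party packages.

```python
import pickle
import tempfile
import zlib
from pathlib import Path
from typing import Any, Dict, Iterable, Iterator


class TinyRecordCache:
    """Hypothetical stand-in for an indexer cache (not the library's code)."""

    def __init__(self, path: Path):
        self.path = Path(path)

    def index(self, records: Iterable[Dict[str, Any]]) -> None:
        # Consume the upstream iterator once, writing each record as a
        # length-prefixed, compressed pickle.
        with self.path.open('wb') as fout:
            for record in records:
                blob = zlib.compress(pickle.dumps(record))
                fout.write(len(blob).to_bytes(4, 'little'))
                fout.write(blob)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        # Replay the stored records in their original order.
        with self.path.open('rb') as fin:
            while header := fin.read(4):
                size = int.from_bytes(header, 'little')
                yield pickle.loads(zlib.decompress(fin.read(size)))


calls = 0  # counts how many records the "expensive" step produced


def expensive_stream() -> Iterator[Dict[str, Any]]:
    """Hypothetical expensive computation; each record is produced once."""
    global calls
    for i in range(3):
        calls += 1
        yield {'docno': f'd{i}', 'text': f'document {i}'}


with tempfile.TemporaryDirectory() as tmp_dir:
    cache = TinyRecordCache(Path(tmp_dir) / 'records.bin')
    cache.index(expensive_stream())  # the expensive step runs here, once
    first_pass = list(cache)         # replayed from disk
    second_pass = list(cache)        # replayed again, no recomputation
```

Both passes read the same records from disk, while the counter shows the upstream generator ran only once.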
IndexerCache provides a variety of other functionality over the cached results.
See the API documentation below for more details.
API Documentation
- pyterrier_caching.IndexerCache
alias of Lz4PickleIndexerCache
- class pyterrier_caching.Lz4PickleIndexerCache(path=None)
An IndexerCache that stores records as pickled dictionaries compressed with lz4.
- Parameters:
path – The path to the cache. If None, a temporary cache will be created that is deleted when closed.
- indexer(mode=ArtifactBuilderMode.create, skip_docno_lookup=False)
Returns an Indexer for this cache. The indexer can be used to create the cache.
- Return type:
Indexer
- Parameters:
mode – The mode to use for the indexer. Must be ‘create’.
skip_docno_lookup – If True, skip creating a docno lookup.
- __iter__()
Iterates over the records stored in the cache.
- Return type:
Iterator[Dict[str, Any]]
- get_corpus_iter(verbose=False, fields=None, start=None, stop=None)
Iterates over the records stored in the cache.
- Return type:
Iterator[Dict[str, Any]]
- Parameters:
verbose – If True, show a progress bar.
fields – If not None, only return these fields.
start – If not None, start at this record number.
stop – If not None, stop at this record number.
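
The effect of fields, start, and stop can be illustrated with a small stand-alone sketch. The helper select_records is hypothetical and is not part of pyterrier_caching. It assumes that record numbers are zero-based and that stop is exclusive, as in Python slicing; check the library's behavior before relying on either assumption.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, Optional, Sequence


def select_records(
    records: Iterable[Dict[str, Any]],
    fields: Optional[Sequence[str]] = None,
    start: Optional[int] = None,
    stop: Optional[int] = None,
) -> Iterator[Dict[str, Any]]:
    """Hypothetical helper mirroring the documented parameters."""
    # islice treats None as "no bound", matching the "If not None" wording.
    # Assumption: stop is exclusive, like a Python slice.
    for record in islice(records, start, stop):
        if fields is None:
            yield record
        else:
            # Keep only the requested fields.
            yield {name: record[name] for name in fields}


corpus = [{'docno': f'd{i}', 'text': f'document {i}'} for i in range(5)]
subset = list(select_records(corpus, fields=['docno'], start=1, stop=3))
everything = list(select_records(corpus))
```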
- to_dataframe(verbose=False, fields=None, start=None, stop=None)
Converts the results in this cache to a DataFrame.
- Return type:
DataFrame
- Parameters:
verbose – If True, show a progress bar.
fields – If not None, only return these fields.
start – If not None, start at this record number.
stop – If not None, stop at this record number.
- __getitem__(items)
Returns the record(s) stored in the cache by the provided index, docno, or range.
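
Lookup by index, docno, or range can be pictured with the following sketch. RecordLookup is a hypothetical in-memory class, not the library's code. It assumes that an index is an int, a docno is a str, and a range is a Python slice; the types the real method accepts may differ.

```python
from typing import Any, Dict, Iterable, List, Union

Record = Dict[str, Any]


class RecordLookup:
    """Hypothetical in-memory illustration of index/docno/range access."""

    def __init__(self, records: Iterable[Record]):
        self._records: List[Record] = list(records)
        # A docno lookup maps each docno to its position in the sequence.
        self._by_docno = {r['docno']: i for i, r in enumerate(self._records)}

    def __getitem__(self, item: Union[int, str, slice]):
        if isinstance(item, (int, slice)):
            # Position or range: delegate to list indexing.
            return self._records[item]
        if isinstance(item, str):
            # docno: resolve to a position first.
            return self._records[self._by_docno[item]]
        raise TypeError(f'unsupported key type: {type(item).__name__}')


lookup = RecordLookup({'docno': f'd{i}', 'text': f'document {i}'} for i in range(4))
by_index = lookup[0]
by_docno = lookup['d2']
by_range = lookup[1:3]
```

In this sketch the docno branch depends on the position map built in the constructor, so access by docno is unavailable without such a lookup.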
- text_loader(fields='*', *, verbose=False)
Returns a Transformer that loads the text from the cache based on docno.
- Return type:
Transformer
- Parameters:
fields – If not ‘*’, only return these fields.
verbose – If True, show a progress bar.
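
Conceptually, a text loader joins stored fields onto result rows by docno. The sketch below shows that join with plain Python lists and dicts rather than the DataFrames PyTerrier transformers operate on. The function make_text_loader and the dictionary store are hypothetical and do not reflect the library's implementation.

```python
from typing import Any, Callable, Dict, List, Mapping, Sequence, Union

Row = Dict[str, Any]


def make_text_loader(
    store: Mapping[str, Row],
    fields: Union[str, Sequence[str]] = '*',
) -> Callable[[List[Row]], List[Row]]:
    """Hypothetical factory: returns a function that adds stored fields to rows."""

    def load(results: List[Row]) -> List[Row]:
        loaded = []
        for row in results:
            record = store[row['docno']]
            if fields == '*':
                # Copy every stored field except the join key itself.
                extra = {k: v for k, v in record.items() if k != 'docno'}
            else:
                extra = {name: record[name] for name in fields}
            loaded.append({**row, **extra})
        return loaded

    return load


store = {
    'd1': {'docno': 'd1', 'text': 'hello world', 'title': 'Greeting'},
    'd2': {'docno': 'd2', 'text': 'goodbye', 'title': 'Farewell'},
}
results = [{'qid': 'q1', 'docno': 'd2', 'score': 1.5}]
with_all = make_text_loader(store)(results)
with_text = make_text_loader(store, fields=['text'])(results)
```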