Caching Retriever Results
RetrieverCache stores a retriever's results, keyed on the field values
of each input row. When the same row is encountered again, the results are read from the cache
instead of being retrieved again.
Example use case: I want to test several different re-ranking models over the same initial set of documents, and I want to save time by not re-running the queries each time.
You use a RetrieverCache in place of the retriever in a pipeline. It holds a reference to
the retriever so that it can retrieve results for queries that are missing from the cache.
Warning
Important Caveats:
- RetrieverCache saves results based on all the input columns by default. Changes in any of the values will result in a cache miss, even if the column does not affect the retriever's output. You can specify a subset of columns using the on parameter.
- DBM does not support concurrent reads/writes from multiple threads or processes. Keep only a single RetrieverCache pointing to a cache file location open at a time.
- A RetrieverCache represents the cross between a retriever and a corpus. Do not try to use a single cache across multiple retrievers or corpora; doing so produces unexpected or invalid results.
Example:
import pyterrier as pt
from pyterrier_caching import RetrieverCache
# Setup
cached_retriever = RetrieverCache('path/to/cache', MyRetriever())
dataset = pt.get_dataset('some-dataset') # e.g., 'irds:msmarco-passage'
# Use the RetrieverCache object just as you would a retriever
cached_pipeline = cached_retriever >> MySecondStage()
cached_pipeline(dataset.get_topics())
# Will be faster when you run it a second time, since all values are cached
cached_pipeline(dataset.get_topics())
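The first caveat above (keys are built from all input columns unless on is given) can be illustrated with a small standalone sketch. This is not the library's implementation: cache_key, toy_retriever, cached_retrieve, and the run_tag column are hypothetical names invented for the illustration, and a plain dict stands in for the on-disk cache.

```python
import json

def cache_key(row, on=None):
    """Build a key from a row (dict of column -> value).

    Hypothetical helper: uses every column when `on` is None,
    otherwise only the named column(s).
    """
    if on is None:
        columns = sorted(row)
    elif isinstance(on, str):
        columns = [on]
    else:
        columns = list(on)
    return json.dumps([[c, row[c]] for c in columns])

calls = []  # records every time the (toy) retriever actually runs

def toy_retriever(row):
    calls.append(row['query'])
    return [('d1', 1.0)]

def cached_retrieve(cache, row, on=None):
    key = cache_key(row, on)
    if key not in cache:          # cache miss: run the retriever
        cache[key] = toy_retriever(row)
    return cache[key]

# Two rows that differ only in a column the retriever ignores.
row_a = {'qid': '1', 'query': 'chemical reactions', 'run_tag': 'monday'}
row_b = {'qid': '1', 'query': 'chemical reactions', 'run_tag': 'tuesday'}

# Keyed on all columns: the second row misses because run_tag differs.
all_columns_cache = {}
cached_retrieve(all_columns_cache, row_a)
cached_retrieve(all_columns_cache, row_b)
misses_all_columns = len(calls)

# Keyed on a subset: the second row hits the cache.
calls.clear()
subset_cache = {}
cached_retrieve(subset_cache, row_a, on=['qid', 'query'])
cached_retrieve(subset_cache, row_b, on=['qid', 'query'])
misses_subset = len(calls)
```

With the signature documented below, the equivalent choice for a real cache would be to pass on=['qid', 'query'] when constructing the RetrieverCache.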
API Documentation
- pyterrier_caching.RetrieverCache
alias of DbmRetrieverCache
- class pyterrier_caching.DbmRetrieverCache(path=None, retriever=None, *, on=None, verbose=False)
A RetrieverCache that stores retrieved results in dbm.dumb database files.
Parameters:
path (str | Path | None) – The path to the cache.
retriever (Transformer | None) – The retriever that is cached.
on (str | List[str] | None) – The column(s) to use as the key for the cache. If None, all columns will be used.
verbose (bool) – If True, print progress information.
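For readers unfamiliar with dbm.dumb, the following minimal sketch shows how Python's standard-library module stores and reads back byte values under byte keys. The key b'q1', the pickled list of (docno, score) pairs, and the temporary path are assumptions made for the illustration; the actual key and value encoding used by DbmRetrieverCache is not described here.

```python
import dbm.dumb
import os
import pickle
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cache')  # hypothetical cache location

    # 'c' opens the database for reading and writing, creating it if needed.
    with dbm.dumb.open(path, 'c') as db:
        # Keys and values are bytes; here the value is a pickled result list.
        db[b'q1'] = pickle.dumps([('d3', 12.5), ('d7', 9.1)])

    # Reopen the same files and read the entry back.
    with dbm.dumb.open(path, 'r') as db:
        restored = pickle.loads(db[b'q1'])
        has_q2 = b'q2' in db  # a key that was never written
```

Only one handle is open at a time in this sketch, in line with the caveat that DBM does not support concurrent reads and writes.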