Caching Retriever Results
RetrieverCache saves retrieved results keyed on the fields of each row. When the same row is encountered again, the results are read from the cache instead of being retrieved again.
Example use case: I want to test several different re-ranking models over the same initial set of documents, and I want to save time by not re-running the queries each time.
Use a RetrieverCache in place of the retriever in a pipeline. It holds a reference to the retriever so that it can retrieve results for queries that are missing from the cache.
Warning
Important Caveats:
- RetrieverCache saves scores based on all the input columns by default. A change in any of the values will result in a cache miss, even if the column does not affect the retriever’s output. You can specify a subset of columns using the on parameter.
- DBM does not support concurrent reads/writes from multiple threads or processes. Keep only a single RetrieverCache pointing to a cache file location open at a time.
- A RetrieverCache represents the cross between a retriever and a corpus. Do not try to use a single cache across multiple retrievers or corpora – you’ll get unexpected/invalid results.
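The first caveat can be illustrated with a small standalone sketch. It is not the library's implementation: the cache_key helper, the row contents, and the JSON key encoding are all assumptions made for illustration. It only shows why keying on every column causes a miss when an unrelated column changes, and why restricting the key to a subset of columns avoids that.

```python
import json

def cache_key(row, on=None):
    """Hypothetical key builder: serialise the chosen columns of a row.

    If `on` is None, every column takes part in the key, mirroring the
    default behaviour described above.
    """
    cols = sorted(row) if on is None else list(on)
    return json.dumps([(c, row[c]) for c in cols])

# Two rows that differ only in a column the retriever would not use.
row_a = {"qid": "1", "query": "chemical reactions", "note": "run A"}
row_b = {"qid": "1", "query": "chemical reactions", "note": "run B"}

# Keyed on all columns: the keys differ, so row_b would be a cache miss.
all_cols_match = cache_key(row_a) == cache_key(row_b)

# Keyed on a subset: the keys agree, so row_b would be a cache hit.
subset_match = (cache_key(row_a, on=["qid", "query"])
                == cache_key(row_b, on=["qid", "query"]))
```

With the real class, the equivalent choice is made by passing the relevant column names through the on parameter listed in the API documentation.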
Example:
import pyterrier as pt
from pyterrier_caching import RetrieverCache
# Setup
cached_retriever = RetrieverCache('path/to/cache', MyRetriever())
dataset = pt.get_dataset('some-dataset') # e.g., 'irds:msmarco-passage'
# Use the RetrieverCache object just as you would a retriever
cached_pipeline = cached_retriever >> MySecondStage()
cached_pipeline(dataset.get_topics())
# Will be faster when you run it a second time, since all values are cached
cached_pipeline(dataset.get_topics())
API Documentation
- pyterrier_caching.RetrieverCache
  alias of DbmRetrieverCache
- class pyterrier_caching.DbmRetrieverCache(path, retriever=None, on=None, verbose=False)
  A RetrieverCache that stores retrieved results in dbm.dumb database files.
  Parameters:
  - path – The path to the cache.
  - retriever – The retriever that is cached.
  - on – The column(s) to use as the key for the cache. If None, all columns will be used.
  - verbose – If True, print progress information.
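As a rough picture of what caching results in a dbm.dumb file can look like, here is a minimal sketch that uses only the Python standard library. It is an illustration under stated assumptions, not the DbmRetrieverCache source: toy_retriever, cached_retrieve, the use of the query string as the key, and the JSON value encoding are all hypothetical.

```python
import dbm.dumb
import json
import os
import tempfile

calls = []  # records which queries actually reached the stand-in retriever

def toy_retriever(query):
    """Hypothetical stand-in for a real retriever: returns [docno, score] pairs."""
    calls.append(query)
    return [["d1", 2.0], ["d2", 1.0]]

def cached_retrieve(db, query):
    """Return cached results for `query`, retrieving and storing them on a miss."""
    key = query.encode("utf-8")
    if key in db:
        # Cache hit: decode the stored results without calling the retriever.
        return json.loads(db[key].decode("utf-8"))
    # Cache miss: fall back to the retriever, then persist the results.
    results = toy_retriever(query)
    db[key] = json.dumps(results).encode("utf-8")
    return results

with tempfile.TemporaryDirectory() as tmp:
    # Open a single handle on the cache file, in line with the caveat about
    # concurrent access.
    db = dbm.dumb.open(os.path.join(tmp, "cache"), "c")
    try:
        first = cached_retrieve(db, "chemical reactions")   # miss
        second = cached_retrieve(db, "chemical reactions")  # hit
    finally:
        db.close()
```

The second call returns the same results without invoking the retriever again, which is the behaviour the cache is meant to provide when a pipeline is re-run over the same topics.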