Caching Retriever Results

RetrieverCache saves the retrieved results based on the fields of each row. When the same row is encountered again, the value is read from the cache, avoiding retrieving again.

Example use case: I want to test several different re-ranking models over the same initial set of documents, and I want to save time by not re-running the queries each time.

You use a RetrieverCache in place of the retriever in a pipeline. It holds a reference to the retriever so that it can retrieve results for queries that are missing from the cache.

Warning

Important Caveats:

  • RetrieverCache saves scores based on all the input columns by default. Changes in any of the values will result in a cache miss, even if the column does not affect the retriever’s output. You can specify a subset of columns using the on parameter.

  • DBM does not support concurrent reads/writes from multiple threads or processes. Keep only a single RetrieverCache pointing to a cache file location open at a time.

  • A RetrieverCache represents the cross between a retriever and a corpus. Do not try to use a single cache across multiple retrievers or corpora – you’ll get unexpected/invalid results.

Example:

import pyterrier as pt
from pyterrier_caching import RetrieverCache

# Setup
cached_retriever = RetrieverCache('path/to/cache', MyRetriever())
dataset = pt.get_dataset('some-dataset') # e.g., 'irds:msmarco-passage'

# Use the RetrieverCache cache object just as you would a retriever
cached_pipeline = cached_retriever >> MySecondStage()

cached_pipeline(dataset.get_topics())
# Will be faster when you run it a second time, since all values are cached
cached_pipeline(dataset.get_topics())

API Documentation

pyterrier_caching.RetrieverCache

alias of DbmRetrieverCache

class pyterrier_caching.DbmRetrieverCache(path, retriever=None, on=None, verbose=False)[source]

A RetrieverCache that stores retrieved results in dbm.dumb database files.

Parameters:
  • path – The path to the cache.

  • retriever – The retriever that is cached.

  • on – The column(s) to use as the key for the cache. If None, all columns will be used.

  • verbose – If True, print progress information.