Caching Retriever Results
====================================

:class:`~pyterrier_caching.RetrieverCache` saves retrieved results, keyed on the fields of each input row. When the
same row is encountered again, the results are read from the cache instead of being retrieved a second time.

**Example use case:** I want to test several different re-ranking models over the same initial set of documents,
and I want to save time by not re-running the queries each time.

Use a ``RetrieverCache`` in place of the retriever in a pipeline. It holds a reference to
the retriever so that it can retrieve results for queries that are missing from the cache.

.. warning::

    **Important Caveats**:

    - ``RetrieverCache`` keys cached results on **all** the input columns by default. A change in any
      value results in a cache miss, even if that column does not affect the retriever's output. You can
      restrict the key to a subset of columns using the ``on`` parameter.
    - The DBM backend (see ``DbmRetrieverCache``) does not support concurrent reads/writes from multiple
      threads or processes. Keep only a single ``RetrieverCache`` open on a given cache file location at a time.
    - A ``RetrieverCache`` represents the combination of one retriever and one corpus. Do not share a
      single cache across multiple retrievers or corpora; doing so yields unexpected or invalid results.

Example:

.. code-block:: python

    import pyterrier as pt
    from pyterrier_caching import RetrieverCache

    # Setup. MyRetriever and MySecondStage are placeholders for your own transformers.
    cached_retriever = RetrieverCache('path/to/cache', MyRetriever())
    dataset = pt.get_dataset('some-dataset') # e.g., 'irds:msmarco-passage'

    # Use the RetrieverCache object just as you would a retriever
    cached_pipeline = cached_retriever >> MySecondStage()

    # First run: cache misses are passed to the retriever and the results are stored
    cached_pipeline(dataset.get_topics())

    # Second run: faster, since all values are read from the cache
    cached_pipeline(dataset.get_topics())

API Documentation
--------------------------

.. autoclass:: pyterrier_caching.RetrieverCache
    :members:

.. autoclass:: pyterrier_caching.DbmRetrieverCache
    :members:
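
The first caveat in the warning (results are keyed on every input column unless ``on`` is given) can be
illustrated with a small standalone sketch. This is a toy model of column-based keying, not the
implementation used by ``pyterrier_caching``: the ``cache_key`` helper, the hashing scheme, and the
``note`` column are all hypothetical and exist only for illustration.

.. code-block:: python

    import hashlib
    import json

    def cache_key(row, on=None):
        """Build a deterministic key from a row (a dict of column -> value).

        Hypothetical helper: with on=None every column contributes to the key;
        otherwise only the listed columns do.
        """
        columns = sorted(row) if on is None else sorted(on)
        payload = json.dumps([[c, row[c]] for c in columns])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    # Two rows that differ only in a column the retriever would not use.
    row_a = {"qid": "1", "query": "chemical reactions", "note": "run-1"}
    row_b = {"qid": "1", "query": "chemical reactions", "note": "run-2"}

    # Keyed on all columns: the differing "note" value produces different keys.
    all_columns_match = cache_key(row_a) == cache_key(row_b)

    # Keyed on a subset: the irrelevant column no longer affects the key.
    subset = ["qid", "query"]
    subset_match = cache_key(row_a, on=subset) == cache_key(row_b, on=subset)

In this sketch the two rows miss each other when all columns are used and match once the key is
restricted to ``qid`` and ``query``, which is the behaviour the ``on`` parameter is described as controlling.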