Caching Scorer / Re-Ranker Results
====================================

Scorers (Re-Rankers) are a common type of :class:`~pyterrier.Transformer` that re-order a set
of results with a new scoring function. Scorers are often neural cross-encoders, e.g.
:class:`pyterrier_dr.ElectraScorer`.

Scorers can be expensive to execute, so it can be helpful to cache the results throughout the
course of experimentation. For example, you may want to test how a neural relevance model performs
over several first-stage retrieval models that give back many of the same results.

:class:`~pyterrier_caching.ScorerCache` saves the ``score`` based on ``query`` and ``docno`` [#names]_.
When the same ``query``-``docno`` combination is encountered again, the score is read from the cache,
avoiding re-computation.

You use a ``ScorerCache`` in place of the scorer in a pipeline. It holds a reference to
the scorer so that it can compute values that are missing from the cache.
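
The look-up-then-fall-back behaviour described above can be sketched in plain Python. This is an
illustrative toy only, not the ``pyterrier_caching`` implementation: ``ToyScorerCache`` and
``slow_scorer`` are hypothetical names, and an in-memory ``dict`` stands in for the on-disk cache.

.. code-block:: python
   :caption: A toy sketch of scoring only the pairs that are missing from a cache

   from typing import Callable, Dict, List, Tuple

   Pair = Tuple[str, str]  # (query, docno)

   class ToyScorerCache:
       """Hypothetical in-memory stand-in for a scorer cache."""

       def __init__(self, scorer: Callable[[List[Pair]], List[float]]):
           self.scorer = scorer  # kept so that missing values can be computed
           self.store: Dict[Pair, float] = {}

       def score(self, pairs: List[Pair]) -> List[float]:
           # Pairs that have never been scored (de-duplicated, order preserved)
           missing = [p for p in dict.fromkeys(pairs) if p not in self.store]
           if missing:
               for pair, value in zip(missing, self.scorer(missing)):
                   self.store[pair] = float(value)
           # Every requested pair is now available from the store
           return [self.store[p] for p in pairs]

   calls = []  # records what the underlying scorer was actually asked to score

   def slow_scorer(pairs):
       calls.append(list(pairs))
       # Toy score derived from the query length and the docno suffix
       return [float(len(q) * 10 + int(d[1:])) for q, d in pairs]

   cache = ToyScorerCache(slow_scorer)
   first = cache.score([("q1", "d1"), ("q1", "d2")])
   second = cache.score([("q1", "d2"), ("q1", "d3")])  # only ("q1", "d3") is new

On the second call the underlying scorer is invoked for ``("q1", "d3")`` alone; the score for
``("q1", "d2")`` comes from the store.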

.. warning::
   **Important Caveats**:

   - ``ScorerCache`` saves scores based on **only** the value of the ``query`` and ``docno`` [#names]_.
     All other information is ignored (e.g., the text of the document). The cache is therefore
     suitable only when each score depends solely on the ``query`` and ``docno`` of a single
     record (e.g., Mono-style models), and not for pairwise or listwise scoring
     (e.g., Duo-style models).
   - ``ScorerCache`` only stores the ``score`` column. All other outputs of the scorer
     are discarded. (``rank`` is also given in the output, but it is calculated by the cache, not the scorer.)
   - Scores are saved as ``float64`` values. The cache attempts to cast values of other types to ``float64``.
   - A ``ScorerCache`` represents the cross between a scorer and a corpus. Do not try to 
     use a single cache across multiple scorers or corpora -- you'll get unexpected/invalid 
     results.
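
The first caveat can be illustrated with a small, self-contained sketch. It assumes a hypothetical
store keyed on ``(query, docno)`` only, as described above; ``toy_score`` and ``cached_score`` are
made-up helpers and are not part of ``pyterrier_caching``.

.. code-block:: python
   :caption: A toy sketch of a stale score when the document text changes under the same ``docno``

   cache = {}  # hypothetical store keyed on (query, docno) only

   def toy_score(query, text):
       # Toy relevance: number of query terms that appear in the text
       return float(sum(term in text.split() for term in query.split()))

   def cached_score(query, docno, text):
       key = (query, docno)  # the text is deliberately not part of the key
       if key not in cache:
           cache[key] = toy_score(query, text)
       return cache[key]

   before = cached_score("neural ranking", "d1", "neural ranking models")
   # Same docno, different text: the cached value is returned unchanged
   after = cached_score("neural ranking", "d1", "cooking recipes")
   fresh = toy_score("neural ranking", "cooking recipes")

Here ``after`` equals ``before`` even though a fresh computation would give a different score,
which is why a cache should not be reused across corpora or scorers.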

Example:

.. code-block:: python
   :caption: Caching MonoElectra results using :class:`~pyterrier_caching.ScorerCache`

   import pyterrier as pt
   from pyterrier_caching import ScorerCache
   from pyterrier_dr import ElectraScorer
   from pyterrier_pisa import PisaIndex

   # Setup
   dataset = pt.get_dataset('irds:msmarco-passage/dev/small')
   index = PisaIndex.from_hf('macavaney/msmarco-passage.pisa')
   scorer = dataset.text_loader() >> ElectraScorer()
   cached_scorer = ScorerCache('electra.cache', scorer)

   # Use the ScorerCache object just as you would a scorer
   cached_pipeline = index.bm25() >> cached_scorer
   cached_pipeline(dataset.get_topics())

   # Will be faster when you run it a second time, since all values are cached
   cached_pipeline(dataset.get_topics())

   # Will only compute scores for query-docno pairs that were not returned by BM25
   another_cached_pipeline = index.qld() >> cached_scorer
   another_cached_pipeline(dataset.get_topics())

Advanced
--------------------------

Caching Learning-to-rank Features
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can cache learning-to-rank features by setting ``value="features"`` and ``pickle=True`` when constructing
the cache.

Example:

.. code-block:: python
   :caption: Cache learning-to-rank features with ``ScorerCache``

   from pyterrier_caching import ScorerCache
   feature_extractor = ... # a transformer that extracts features based on query and docno
   cache = ScorerCache('mycache', feature_extractor, value="features", pickle=True)
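
A feature vector is not a single ``float64``, so it has to be serialized before it can be stored.
The sketch below illustrates that idea with Python's standard ``pickle`` and ``sqlite3`` modules. It
is an assumption-laden toy and says nothing about how ``ScorerCache`` lays out its data internally;
the table layout and the ``put`` and ``get`` helpers are hypothetical.

.. code-block:: python
   :caption: A toy sketch of storing pickled feature vectors keyed on query and docno

   import pickle
   import sqlite3

   conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
   conn.execute(
       "CREATE TABLE cache (query TEXT, docno TEXT, value BLOB, "
       "PRIMARY KEY (query, docno))"
   )

   def put(query, docno, features):
       # Serialize the feature vector to bytes so it fits in a single column
       conn.execute(
           "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
           (query, docno, pickle.dumps(features)),
       )

   def get(query, docno):
       row = conn.execute(
           "SELECT value FROM cache WHERE query = ? AND docno = ?",
           (query, docno),
       ).fetchone()
       # None signals a cache miss; otherwise restore the original object
       return None if row is None else pickle.loads(row[0])

   put("q1", "d1", [0.5, 1.25, 3.0])
   restored = get("q1", "d1")
   missing = get("q1", "d2")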


API Documentation
--------------------------

.. autoclass:: pyterrier_caching.ScorerCache
   :members:

.. autoclass:: pyterrier_caching.SparseScorerCache
   :members:

.. autoclass:: pyterrier_caching.Sqlite3ScorerCache
   :members:

.. autoclass:: pyterrier_caching.DenseScorerCache
   :members:

.. autoclass:: pyterrier_caching.Hdf5ScorerCache
   :members:

--------------------------

.. [#names] These fields can be configured with ``group`` (query), ``key`` (docno), and ``value`` (score) settings.