Caching Scorer / Re-Ranker Results

Scorers (re-rankers) are a common type of Transformer that re-orders a set of results using a new scoring function. Scorers are often neural cross-encoders, e.g., pyterrier_dr.ElectraScorer.

Scorers can be expensive to execute, so it can be helpful to cache their results over the course of experimentation. For example, you may want to test how a neural relevance model performs over several first-stage retrieval models that return many of the same documents.

ScorerCache saves each score keyed on the query and docno. When the same query–docno combination is encountered again, the score is read from the cache, avoiding re-computation.

You use a ScorerCache in place of the scorer in a pipeline. It holds a reference to the scorer so that it can compute values that are missing from the cache.

Warning

Important Caveats:

  • ScorerCache saves scores based only on the value of the query and docno. All other information (e.g., the text of the document) is ignored. This strategy makes it suitable only when each score depends solely on the query and docno of a single record (e.g., mono-style models), and not for pairwise or listwise scoring (e.g., duo-style models).

  • ScorerCache only stores the result of the score column. All other outputs of the scorer are discarded. (A rank column is also given in the output, but it is calculated by the cache, not the scorer.)

  • Scores are saved as float64 values. Other numeric types are cast to float64 where possible.

  • A ScorerCache is tied to a single scorer and a single corpus. Do not try to use one cache across multiple scorers or corpora – you’ll get unexpected/invalid results.
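The caveats above follow from the cache's keying scheme: everything is looked up by (query, docno) alone. A minimal stand-alone sketch of that idea using the stdlib sqlite3 module — the names and schema here are hypothetical illustrations, not pyterrier_caching's actual implementation:

```python
import sqlite3

# Hypothetical sketch of a (query, docno) -> score cache; NOT the
# library's actual schema.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE cache (query TEXT, docno TEXT, score REAL, "
    "PRIMARY KEY (query, docno))")

def score_with_cache(query, docno, scorer):
    """Return a cached score, computing and storing it on a miss."""
    row = con.execute(
        "SELECT score FROM cache WHERE query=? AND docno=?",
        (query, docno)).fetchone()
    if row is not None:
        return row[0]  # cache hit: the scorer is never called
    score = scorer(query, docno)
    con.execute("INSERT INTO cache VALUES (?, ?, ?)", (query, docno, score))
    return score

calls = []
def expensive_scorer(query, docno):
    # Stand-in for a neural scorer; records each invocation.
    calls.append((query, docno))
    return float(len(query) + len(docno))

score_with_cache("q1", "d1", expensive_scorer)
score_with_cache("q1", "d1", expensive_scorer)  # hit: no second scorer call
assert len(calls) == 1
```

Because only (query, docno) is consulted, any change to the scorer or the underlying document text is invisible to the lookup — which is exactly why a cache must not be shared across scorers or corpora.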

Example:

Caching MonoElectra results using ScorerCache
import pyterrier as pt
from pyterrier_caching import ScorerCache
from pyterrier_dr import ElectraScorer
from pyterrier_pisa import PisaIndex

# Setup
dataset = pt.get_dataset('irds:msmarco-passage/dev/small')
index = PisaIndex.from_hf('macavaney/msmarco-passage.pisa')
scorer = dataset.text_loader() >> ElectraScorer()
cached_scorer = ScorerCache('electra.cache', scorer)

# Use the ScorerCache object just as you would the scorer itself
cached_pipeline = index.bm25() >> cached_scorer
cached_pipeline(dataset.get_topics())

# Will be faster when you run it a second time, since all values are cached
cached_pipeline(dataset.get_topics())

# Only computes scores for docnos that were not already cached from the bm25 run
another_cached_pipeline = index.qld() >> cached_scorer
another_cached_pipeline(dataset.get_topics())

Advanced

Caching Learning-to-rank Features

You can cache learning-to-rank features by setting value="features" and pickle=True when constructing the cache.

Example:

Cache learning-to-rank features with ScorerCache
from pyterrier_caching import ScorerCache
feature_extractor = ... # a transformer that extracts features based on query and docno
cache = ScorerCache('mycache', feature_extractor, value="features", pickle=True)
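With pickle=True, arbitrary Python values (such as a feature vector) are serialized into the cache rather than a single float64. A minimal sketch of that idea using the stdlib pickle and sqlite3 modules — a hypothetical schema for illustration, not the library's actual storage format:

```python
import pickle
import sqlite3

# Hypothetical (query, docno) -> pickled-value store.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE cache (query TEXT, docno TEXT, value BLOB, "
    "PRIMARY KEY (query, docno))")

def put(query, docno, features):
    # Pickling allows storing arbitrary values (e.g., a feature vector),
    # not just a single float64 score.
    con.execute("INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
                (query, docno, pickle.dumps(features)))

def get(query, docno):
    row = con.execute(
        "SELECT value FROM cache WHERE query=? AND docno=?",
        (query, docno)).fetchone()
    return pickle.loads(row[0]) if row is not None else None

put("q1", "d1", [0.2, 1.5, 3.0])
assert get("q1", "d1") == [0.2, 1.5, 3.0]
```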

API Documentation

pyterrier_caching.ScorerCache

alias of Sqlite3ScorerCache

pyterrier_caching.SparseScorerCache

alias of Sqlite3ScorerCache

class pyterrier_caching.Sqlite3ScorerCache(path, scorer=None, *, group=None, key=None, value=None, pickle=None, verbose=False)[source]

A cache for storing and retrieving scores for documents, backed by a SQLite3 database.

This is a sparse scorer cache, meaning that space is only allocated for documents that have been scored. If a large proportion of the corpus is expected to be scored, a dense cache (e.g., Hdf5ScorerCache) may be more appropriate.

Parameters:
  • path – The path to the directory where the cache should be stored.

  • scorer – The scorer to use to score documents that are missing from the cache.

  • group – The name of the column in the input DataFrame that contains the group identifier (default: query)

  • key – The name of the column in the input DataFrame that contains the document identifier (default: docno)

  • value – The name of the column in the input DataFrame that contains the value to cache (default: score)

  • pickle – Whether to pickle the value before storing it in the cache (default: False)

  • verbose – Whether to print verbose output when scoring documents.

If a cache does not yet exist at the provided path, a new one is created.

Changed in version 0.3.0: added pickle option to support caching non-numeric values

close()[source]

Closes this cache, releasing the sqlite connection that it holds.

transform(inp)[source]

Scores the input DataFrame using cached values, scoring any missing ones and adding them to the cache.

Return type:

DataFrame

merge_from(other)[source]

Merges the cached values from another Sqlite3ScorerCache instance into this one.

Any keys that appear in both self and other will be replaced with the value from other.
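The merge contract (values from other overwrite values in self on overlapping keys) can be sketched with plain dicts — an illustration of the semantics only, not the library's implementation:

```python
# Two hypothetical caches keyed on (query, docno); `other` wins on conflicts.
self_cache = {("q1", "d1"): 0.9, ("q1", "d2"): 0.4}
other_cache = {("q1", "d2"): 0.7, ("q2", "d3"): 0.8}

def merge_from(dst, src):
    # dict.update already gives "src wins" semantics on overlapping keys
    dst.update(src)

merge_from(self_cache, other_cache)
assert self_cache[("q1", "d2")] == 0.7  # replaced by the value from other
assert self_cache[("q1", "d1")] == 0.9  # non-conflicting keys are kept
assert len(self_cache) == 3
```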

pyterrier_caching.DenseScorerCache

alias of Hdf5ScorerCache

class pyterrier_caching.Hdf5ScorerCache(*args, **kwargs)[source]

A cache for storing and retrieving scores for documents, backed by an HDF5 file.

This is a dense scorer cache, meaning that space for all documents is allocated ahead of time. Dense caches are more suitable than sparse ones when a large proportion of the corpus (or the entire corpus) is expected to be scored. If only a small proportion of the corpus is expected to be scored, a sparse cache (e.g., Sqlite3ScorerCache) may be more appropriate.
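The dense/sparse trade-off can be sketched with plain Python containers — an illustration only; the actual dense cache is backed by an HDF5 file:

```python
import math

# Dense: one preallocated slot per document; NaN means "not yet scored".
# O(1) lookup by document index, but space proportional to the corpus size.
dense = [math.nan] * 10  # tiny corpus for illustration
dense[3] = 0.75

# Sparse: only scored (query, docno) pairs take space.
sparse = {("q1", "d3"): 0.75}

assert not math.isnan(dense[3])
assert math.isnan(dense[0])        # unscored slot still occupies space
assert ("q1", "d9") not in sparse  # unscored pairs cost nothing
```

When most of the corpus will be scored, the dense layout wins; when only a small fraction will be, the sparse layout avoids allocating space for unscored documents.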

Parameters:
  • path – The path to the directory where the cache should be stored.

  • scorer – The scorer to use to score documents that are missing from the cache.

  • verbose – Whether to print verbose output when scoring documents.

built()[source]

Returns whether this cache has been built.

Return type:

bool

build(corpus_iter=None, docnos_file=None)[source]

Builds this cache.

corpus_count()[source]

Returns the number of documents in the corpus that this cache was built from.

Return type:

int

cached_scorer()[source]

Returns a scorer that uses this cache to store and retrieve scores.

Return type:

Transformer

cached_retriever(num_results=1000)[source]

Returns a retriever that uses this cache to store and retrieve scores for every document in the corpus.

This transformer will raise an error if the entire corpus has not been scored (e.g., using score_all()).

Return type:

Transformer

close()[source]

Closes this cache, releasing the file pointer that it holds and writing any new results to disk.

merge_from(other)[source]

Merges the cached values from another Hdf5ScorerCache instance into this one.

Any records that appear in both self and other will be replaced with the value from other.

score_all(dataset, *, batch_size=1024)[source]

Scores all topics for the entire corpus, storing the results in this cache.