Caching Scorer / Re-Ranker Results¶
Scorers (re-rankers) are a common type of Transformer that re-orders a set
of results using a new scoring function. Scorers are often neural cross-encoders, e.g.,
pyterrier_dr.ElectraScorer.
Scorers can be expensive to execute, so it can be helpful to cache their results over the course of experimentation. For example, you may want to test how a neural relevance model performs over several first-stage retrieval models that return many of the same results.
ScorerCache saves the score based on query and docno [1].
When the same query-docno combination is encountered again, the score is read from the cache,
avoiding re-computation.
You use a ScorerCache in place of the scorer in a pipeline. It holds a reference to
the scorer so that it can compute values that are missing from the cache.
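The look-up-then-compute behaviour described above can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: ToyScorerCache, slow_scorer, and the in-memory dict are hypothetical stand-ins for ScorerCache, a real scorer, and the on-disk store.

```python
from typing import Callable, Dict, List, Tuple


class ToyScorerCache:
    """Hypothetical stand-in: caches scores keyed on (query, docno)."""

    def __init__(self, scorer: Callable[[str, str], float]):
        self.scorer = scorer  # only called for pairs missing from the cache
        self.store: Dict[Tuple[str, str], float] = {}

    def score(self, pairs: List[Tuple[str, str]]) -> List[float]:
        out = []
        for query, docno in pairs:
            key = (query, docno)
            if key not in self.store:  # cache miss: compute and remember
                self.store[key] = float(self.scorer(query, docno))
            out.append(self.store[key])
        return out


calls = []  # records every pair the underlying scorer actually sees


def slow_scorer(query: str, docno: str) -> float:
    # Hypothetical expensive model; the formula is arbitrary.
    calls.append((query, docno))
    return float(len(query) + int(docno[1:]))


cache = ToyScorerCache(slow_scorer)
first = cache.score([("q1", "d1"), ("q1", "d2")])
second = cache.score([("q1", "d2"), ("q1", "d3")])  # ("q1", "d2") is a hit
```

After both calls, slow_scorer has been invoked three times rather than four, because the repeated query-docno pair was served from the cache.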
Warning
Important Caveats:
- ScorerCache saves scores based only on the value of the query and docno [1]. All other information is ignored (e.g., the text of the document). This makes it suitable only when each score depends solely on the query and docno of a single record (e.g., Mono-style models), and not for cases that perform pairwise or listwise scoring (e.g., Duo-style models).
- ScorerCache only stores the result of the score column. All other outputs of the scorer are discarded. (Rank is also given in the output, but it is calculated by the cache, not the scorer.)
- Scores are saved as float64 values. The cache attempts to cast other values up/down to float64.
- A ScorerCache represents the cross between a scorer and a corpus. Do not try to use a single cache across multiple scorers or corpora; doing so gives unexpected/invalid results.
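To make the first caveat concrete, the sketch below shows how a cache keyed only on query and docno can return a stale value when the scorer is listwise. Everything here is hypothetical: listwise_scorer is a toy model that normalises each raw score by the whole candidate list, and cached_listwise imitates key-only caching with a dict.

```python
from typing import Dict, Tuple


def listwise_scorer(query: str, docs: Dict[str, float]) -> Dict[str, float]:
    # Hypothetical listwise model: each score depends on the full candidate list.
    total = sum(docs.values())
    return {docno: raw / total for docno, raw in docs.items()}


store: Dict[Tuple[str, str], float] = {}


def cached_listwise(query: str, docs: Dict[str, float]) -> Dict[str, float]:
    # Key-only caching: only pairs missing from the store are sent to the scorer.
    missing = {d: r for d, r in docs.items() if (query, d) not in store}
    if missing:
        for docno, score in listwise_scorer(query, missing).items():
            store[(query, docno)] = score
    return {d: store[(query, d)] for d in docs}


first = cached_listwise("q", {"d1": 1.0, "d2": 3.0})
second = cached_listwise("q", {"d1": 1.0, "d3": 1.0})  # d1 comes from the cache
truth = listwise_scorer("q", {"d1": 1.0, "d3": 1.0})   # what the model would say
```

Here second gives d1 the score it received alongside d2, and d3 a score computed in isolation, so neither matches truth. A pointwise scorer does not have this problem because its output for a pair never depends on the other candidates.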
Example:
ScorerCache¶
import pyterrier as pt
from pyterrier_caching import ScorerCache
from pyterrier_dr import ElectraScorer
from pyterrier_pisa import PisaIndex
# Setup
dataset = pt.get_dataset('irds:msmarco-passage/dev/small')
index = PisaIndex.from_hf('macavaney/msmarco-passage.pisa')
scorer = dataset.text_loader() >> ElectraScorer()
cached_scorer = ScorerCache('electra.cache', scorer)
# Use the ScorerCache object just as you would a scorer
cached_pipeline = index.bm25() >> cached_scorer
cached_pipeline(dataset.get_topics())
# Will be faster when you run it a second time, since all values are cached
cached_pipeline(dataset.get_topics())
# Will only compute scores for query-docno pairs that were not already returned by bm25
another_cached_pipeline = index.qld() >> cached_scorer
another_cached_pipeline(dataset.get_topics())
Advanced¶
Caching Learning-to-rank Features¶
You can cache learning-to-rank features by setting value="features" and pickle=True when constructing
the cache.
Example:
ScorerCache¶
from pyterrier_caching import ScorerCache
feature_extractor = ... # a transformer that extracts features based on query and docno
cache = ScorerCache('mycache', feature_extractor, value="features", pickle=True)
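As a rough illustration of why pickling is needed for non-numeric values such as feature vectors, the sketch below stores a pickled list in a SQLite BLOB column and restores it. The table layout, the column names, and the put/get helpers are assumptions made for this example; they are not taken from the library's actual schema.

```python
import pickle
import sqlite3

# Hypothetical schema: one row per (query, docno) pair with an opaque payload.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cache ("
    "query_text TEXT, docno TEXT, payload BLOB, "
    "PRIMARY KEY (query_text, docno))"
)


def put(query, docno, features):
    # Serialise the feature vector to bytes so it fits in a single column.
    conn.execute(
        "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
        (query, docno, pickle.dumps(features)),
    )


def get(query, docno):
    row = conn.execute(
        "SELECT payload FROM cache WHERE query_text = ? AND docno = ?",
        (query, docno),
    ).fetchone()
    return None if row is None else pickle.loads(row[0])


put("q1", "d1", [0.5, 12.0, 3.0])
restored = get("q1", "d1")
missing = get("q1", "d2")  # never stored, so nothing is returned
```

As with any use of pickle, a cache written this way should only be loaded from sources you trust.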
API Documentation¶
- pyterrier_caching.ScorerCache¶
alias of Sqlite3ScorerCache
- pyterrier_caching.SparseScorerCache¶
alias of Sqlite3ScorerCache
- class pyterrier_caching.Sqlite3ScorerCache(path=None, scorer=None, *, group=None, key=None, value=None, pickle=None, verbose=False)[source]¶
A cache for storing and retrieving scores for documents, backed by a SQLite3 database.
This is a sparse scorer cache, meaning that space is only allocated for documents that have been scored. If a large proportion of the corpus is expected to be scored, a dense cache (e.g., Hdf5ScorerCache) may be more appropriate.
- Parameters:
path (str | Path | None) – The path to the directory where the cache should be stored, or None to create a temporary cache.
scorer (Transformer) – The scorer to use to score documents that are missing from the cache.
group (str | None) – The name of the column in the input DataFrame that contains the group identifier (default: query)
key (str | None) – The name of the column in the input DataFrame that contains the document identifier (default: docno)
value (str | None) – The name of the column in the input DataFrame that contains the value to cache (default: score)
pickle (bool | None) – Whether to pickle the value before storing it in the cache (default: False)
verbose (bool) – Whether to print verbose output when scoring documents.
If a cache does not yet exist at the provided path, a new one is created.
Changed in version 0.3.0: added pickle option to support caching non-numeric values
- transform(inp)[source]¶
Scores the input DataFrame using cached values, scoring any missing ones and adding them to the cache.
- Return type:
DataFrame
- Parameters:
inp (DataFrame)
- merge_from(other)[source]¶
Merges the cached values from another Sqlite3ScorerCache instance into this one.
Any keys that appear in both self and other will be replaced with the value from other.
- Parameters:
other (Sqlite3ScorerCache)
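The overwrite rule for merging can be pictured with two plain dictionaries standing in for two caches. This is only a sketch of the documented semantics (values from other win on conflict); the dicts and their contents are made up for illustration.

```python
# Hypothetical cache contents, keyed on (query, docno).
mine = {("q1", "d1"): 0.2, ("q1", "d2"): 0.9}
other = {("q1", "d2"): 0.4, ("q2", "d7"): 0.1}

# Merge `other` into `mine`: shared keys take the value from `other`.
mine.update(other)
```

After the merge, mine holds three entries, and the shared key ("q1", "d2") carries the value 0.4 from other.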
- pyterrier_caching.DenseScorerCache¶
alias of Hdf5ScorerCache
- class pyterrier_caching.Hdf5ScorerCache(path=None, scorer=None, *, verbose=False)[source]¶
A cache for storing and retrieving scores for documents, backed by an HDF5 file.
This is a dense scorer cache, meaning that space for all documents is allocated ahead of time. Dense caches are more suitable than sparse ones when a large proportion of the corpus (or the entire corpus) is expected to be scored. If only a small proportion of the corpus is expected to be scored, a sparse cache (e.g., Sqlite3ScorerCache) may be more appropriate.
Creates a new Hdf5ScorerCache instance.
- Parameters:
path (str | Path | None) – The path to the directory where the cache should be stored, or None to create a temporary cache.
scorer (Transformer | None) – The scorer to use to score documents that are missing from the cache.
verbose (bool) – Whether to print verbose output.
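The dense versus sparse trade-off can be illustrated with standard-library containers. This sketch assumes a tiny five-document corpus and a single query; the NaN placeholder for unscored documents is a choice made for this example, not a statement about how the HDF5 file is laid out.

```python
import math
from array import array

corpus = ["d0", "d1", "d2", "d3", "d4"]
docno_to_idx = {docno: i for i, docno in enumerate(corpus)}

# Dense: one float64 slot per document, allocated up front (NaN = not scored).
dense = array("d", [math.nan]) * len(corpus)
# Sparse: an entry exists only once a document has been scored.
sparse = {}

for docno, score in [("d3", 1.5), ("d1", 0.25)]:
    dense[docno_to_idx[docno]] = score
    sparse[docno] = score
```

With two of five documents scored, the dense array still holds five slots while the sparse mapping holds two entries. The dense layout pays off once most slots are filled, since it avoids storing a key per entry.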
- corpus_count()[source]¶
Returns the number of documents in the corpus that this cache was built from.
- Return type:
int
- cached_scorer()[source]¶
Returns a scorer that uses this cache to store and retrieve scores.
- Return type:
- cached_retriever(num_results=1000)[source]¶
Returns a retriever that uses this cache to store and retrieve scores for every document in the corpus.
This transformer will raise an error if the entire corpus has not been scored (e.g., using score_all()).
- Return type:
- Parameters:
num_results (int)
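The idea of retrieving directly from a fully scored cache can be sketched as a top-k selection that refuses to run on incomplete data. The function name toy_cached_retriever, the NaN convention for unscored documents, and the ValueError are all assumptions for illustration; the library's own error type and internals may differ.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def toy_cached_retriever(
    docnos: Sequence[str],
    scores: Sequence[float],
    num_results: int = 1000,
) -> List[Tuple[str, float]]:
    # Ranking over a partially scored corpus would silently drop documents,
    # so fail loudly instead.
    if any(math.isnan(s) for s in scores):
        raise ValueError("corpus is not fully scored")
    # Return the num_results highest-scoring (docno, score) pairs, best first.
    return heapq.nlargest(num_results, zip(docnos, scores), key=lambda p: p[1])


top2 = toy_cached_retriever(["d0", "d1", "d2"], [0.1, 0.9, 0.4], num_results=2)
```

Calling the function with any NaN score raises the error instead of returning a partial ranking.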
- close(delete_tmp=True)[source]¶
Closes this cache, releasing the file pointer that it holds and writing any new results to disk.
- Parameters:
delete_tmp (bool)
- merge_from(other)[source]¶
Merges the cached values from another Hdf5ScorerCache instance into this one.
Any records that appear in both self and other will be replaced with the value from other.
- Parameters:
other (Hdf5ScorerCache)