Caching Scorer / Re-Ranker Results¶
Scorers (Re-Rankers) are a common type of Transformer that re-orders a set of results with a new scoring function. Scorers are often neural cross-encoders, e.g., pyterrier_dr.ElectraScorer.
Scorers can be expensive to execute, so it can be helpful to cache the results throughout the course of experimentation. For example, you may want to test how a neural relevance model performs over several first-stage retrieval models that give back many of the same results.
ScorerCache saves the score based on query and docno [1]. When the same query-docno combination is encountered again, the score is read from the cache, avoiding re-computation.
You use a ScorerCache in place of the scorer in a pipeline. It holds a reference to the scorer so that it can compute values that are missing from the cache.
Warning
Important Caveats:
- ScorerCache saves scores based only on the values of query and docno [1]. All other information is ignored (e.g., the text of the document). This makes it suitable only when each score depends solely on the query and docno of a single record (e.g., Mono-style models), not for pairwise or listwise scoring (e.g., Duo-style models).
- ScorerCache only stores the result of the score column. All other outputs of the scorer are discarded. (A rank is also given in the output, but it is calculated by the cache, not the scorer.)
- Scores are saved as float64 values. Values of other types are cast up/down to float64 where possible.
- A ScorerCache represents the cross between a scorer and a corpus. Do not try to use a single cache across multiple scorers or corpora – you'll get unexpected/invalid results.
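To follow the last caveat, you would keep one cache per scorer-corpus pair. A minimal sketch (the paths and the second scorer below are hypothetical):
# one cache per (scorer, corpus) pair; never shared between them
electra_cache = ScorerCache('electra.msmarco.cache', electra_scorer)
monot5_cache = ScorerCache('monot5.msmarco.cache', monot5_scorer)  # hypothetical second scorer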
Example:
import pyterrier as pt
from pyterrier_caching import ScorerCache
from pyterrier_dr import ElectraScorer
from pyterrier_pisa import PisaIndex
# Setup
dataset = pt.get_dataset('irds:msmarco-passage/dev/small')
index = PisaIndex.from_hf('macavaney/msmarco-passage.pisa')
scorer = dataset.text_loader() >> ElectraScorer()
cached_scorer = ScorerCache('electra.cache', scorer)
# Use the ScorerCache cache object just as you would a scorer
cached_pipeline = index.bm25() >> cached_scorer
cached_pipeline(dataset.get_topics())
# Will be faster when you run it a second time, since all values are cached
cached_pipeline(dataset.get_topics())
# Will only compute scores for docnos that were not returned by bm25
another_cached_pipeline = index.qld() >> cached_scorer
another_cached_pipeline(dataset.get_topics())
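The two cached pipelines can also be compared directly in an evaluation. A sketch, assuming the dataset provides qrels and using pt.Experiment:
# evaluate both pipelines; cached scores are re-used across runs
pt.Experiment(
    [cached_pipeline, another_cached_pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    ['map', 'recip_rank'],
    names=['BM25 >> Electra', 'QLD >> Electra'],
)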
Advanced¶
Caching Learning-to-rank Features¶
You can cache learning-to-rank features by setting value="features" and pickle=True when constructing the cache.
Example:
from pyterrier_caching import ScorerCache
feature_extractor = ... # a transformer that extracts features based on query and docno
cache = ScorerCache('mycache', feature_extractor, value="features", pickle=True)
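The cached feature extractor can then be dropped into a learning-to-rank pipeline as usual. The sketch below is illustrative only: it assumes a first-stage retriever named bm25, training topics/qrels, and a scikit-learn regressor applied via pt.ltr.apply_learned_model:
import pyterrier as pt
from sklearn.ensemble import RandomForestRegressor
# hypothetical LTR pipeline: features are computed once, then served from the cache
ltr_pipeline = bm25 >> cache >> pt.ltr.apply_learned_model(RandomForestRegressor())
ltr_pipeline.fit(train_topics, train_qrels)
results = ltr_pipeline(test_topics)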
API Documentation¶
- pyterrier_caching.ScorerCache¶
alias of Sqlite3ScorerCache
- pyterrier_caching.SparseScorerCache¶
alias of Sqlite3ScorerCache
- class pyterrier_caching.Sqlite3ScorerCache(path, scorer=None, *, group=None, key=None, value=None, pickle=None, verbose=False)[source]¶
A cache for storing and retrieving scores for documents, backed by a SQLite3 database.
This is a sparse scorer cache, meaning that space is only allocated for documents that have been scored. If a large proportion of the corpus is expected to be scored, a dense cache (e.g., Hdf5ScorerCache) may be more appropriate.
- Parameters:
path – The path to the directory where the cache should be stored.
scorer – The scorer to use to score documents that are missing from the cache.
group – The name of the column in the input DataFrame that contains the group identifier (default: query)
key – The name of the column in the input DataFrame that contains the document identifier (default: docno)
value – The name of the column in the input DataFrame that contains the value to cache (default: score)
pickle – Whether to pickle the value before storing it in the cache (default: False)
verbose – Whether to print verbose output when scoring documents.
If a cache does not yet exist at the provided path, a new one is created.
Changed in version 0.3.0: added pickle option to support caching non-numeric values.
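As an illustration of the column options above (the file path and scorer here are placeholders), a cache keyed on custom column names might be constructed as:
from pyterrier_caching import Sqlite3ScorerCache
# assumes the input frames use 'qid' as the group column instead of the default 'query'
cache = Sqlite3ScorerCache('my.cache', my_scorer, group='qid', key='docno', value='score')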
- pyterrier_caching.DenseScorerCache¶
alias of Hdf5ScorerCache
- class pyterrier_caching.Hdf5ScorerCache(*args, **kwargs)[source]¶
A cache for storing and retrieving scores for documents, backed by an HDF5 file.
This is a dense scorer cache, meaning that space for all documents is allocated ahead of time. Dense caches are more suitable than sparse ones when a large proportion of the corpus (or the entire corpus) is expected to be scored. If only a small proportion of the corpus is expected to be scored, a sparse cache (e.g., Sqlite3ScorerCache) may be more appropriate.
- Parameters:
path – The path to the directory where the cache should be stored.
scorer – The scorer to use to score documents that are missing from the cache.
verbose – Whether to print verbose output when scoring documents.
- corpus_count()[source]¶
Returns the number of documents in the corpus that this cache was built from.
- Return type: int
- cached_scorer()[source]¶
Returns a scorer that uses this cache to store and retrieve scores.
- Return type:
- cached_retriever(num_results=1000)[source]¶
Returns a retriever that uses this cache to store and retrieve scores for every document in the corpus.
This transformer will raise an error if the entire corpus is not scored (e.g., from score_all()).
- Return type:
- close()[source]¶
Closes this cache, releasing the file pointer that it holds and writing any new results to disk.
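Putting these pieces together, a dense cache might be used as sketched below. The path is illustrative, and the sketch assumes the whole corpus has been scored (e.g., via score_all()) before cached_retriever() is used:
from pyterrier_caching import DenseScorerCache  # alias of Hdf5ScorerCache
dense_cache = DenseScorerCache('electra.dense.cache', scorer)
# ... score the full corpus first (e.g., using score_all()) ...
retriever = dense_cache.cached_retriever(num_results=100)
results = retriever(dataset.get_topics())
dense_cache.close()  # flush new scores to disk and release the file handle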