SPLADE Overview
pyterrier-splade lets you construct learned sparse indexing and retrieval pipelines. SPLADE maps queries and documents into sparse bags of weighted BERT WordPiece tokens, combining the efficiency of sparse retrieval with the effectiveness of learned term weighting and expansion.
A pipeline is built from two main pieces: a SPLADE model, which encodes text into weighted tokens, and a sparse index (Terrier or PISA), which stores those tokens and retrieves over them.
SPLADE Model
A Splade model is a factory for the transformers used to encode and score text.
import pyterrier_splade
splade = pyterrier_splade.Splade()
By default this loads naver/splade-cocondenser-ensembledistil; you can pass any other SPLADE model name or path instead.
A SPLADE model can produce several transformers:
Encode documents into weighted tokens using its doc_encoder().
Encode queries into weighted tokens using its query_encoder().
Re-rank results by scoring query/document pairs using its scorer().
Indexing
Indexing takes place as a pipeline: the SPLADE document encoder maps raw text into a dictionary of BERT WordPiece tokens
and corresponding weights, and the underlying sparse indexer stores them. Terrier is configured to index the tokens
unchanged, without further tokenisation or stemming (pretokenised=True).
import pyterrier as pt
import pyterrier_splade
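splade = pyterrier_splade.Splade()
# corpus to index (the same dataset used in the PISA example)
dataset = pt.get_dataset('irds:msmarco-passage')
# pretokenised=True: store the SPLADE tokens as-is
indexer = pt.IterDictIndexer('./msmarco_psg', pretokenised=True)
idx_pipeline = splade.doc_encoder() >> indexer
idx_pipeline.index(dataset.get_corpus_iter(), batch_size=128)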
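To make the indexed representation concrete, here is a minimal plain-Python sketch of what a pretokenised sparse index conceptually stores. It is not how Terrier is implemented. The documents, tokens, and integer weights are invented for illustration, and build_inverted_index is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical encoder output: one record per document, with a dict of
# WordPiece token -> weight. All tokens and weights are made up.
encoded_docs = [
    {"docno": "d1", "toks": {"chemical": 132, "reaction": 118, "##s": 41}},
    {"docno": "d2", "toks": {"reaction": 95, "time": 87}},
]

def build_inverted_index(docs):
    """Map each token to a postings list of (docno, weight) pairs."""
    index = defaultdict(list)
    for doc in docs:
        for token, weight in doc["toks"].items():
            # Tokens are stored verbatim: no lowercasing, stemming, or
            # re-tokenisation, so the WordPiece '##s' stays '##s'.
            index[token].append((doc["docno"], weight))
    return dict(index)

postings = build_inverted_index(encoded_docs)
print(postings["reaction"])
```

Storing tokens unchanged matters because the query encoder emits tokens from the same WordPiece vocabulary; any extra stemming on the index side would break the match between the two.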
Retrieval
Similarly, SPLADE encodes the query into weighted BERT WordPieces, which are passed to a sparse retriever. Scoring the
SPLADE term weights with wmodel='Tf' recovers the SPLADE dot-product score.
splade_retr = splade.query_encoder() >> pt.terrier.Retriever('./msmarco_psg', wmodel='Tf')
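The link between term-frequency scoring and the dot product can be illustrated with a small plain-Python sketch. It assumes that the retriever multiplies each stored document weight by the matching query-side weight and sums the results. The postings, query weights, and the tf_retrieve helper are all hypothetical.

```python
# Toy postings lists: token -> [(docno, stored weight)]. Values are invented.
postings = {
    "chemical": [("d1", 132)],
    "reaction": [("d1", 118), ("d2", 95)],
    "time": [("d2", 87)],
}

# Hypothetical query encoding: WordPiece token -> weight.
query_toks = {"reaction": 2.0, "time": 1.5, "speed": 0.5}

def tf_retrieve(query_toks, postings):
    """Accumulate query_weight * stored_weight per document, term at a time."""
    scores = {}
    for token, q_weight in query_toks.items():
        # Tokens absent from the index (here 'speed') contribute nothing.
        for docno, tf in postings.get(token, []):
            scores[docno] = scores.get(docno, 0.0) + q_weight * tf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = tf_retrieve(query_toks, postings)

# The same score for d2, computed directly as a sparse dot product.
d2_toks = {"reaction": 95, "time": 87}
d2_dot = sum(w * d2_toks.get(t, 0) for t, w in query_toks.items())
print(ranking, d2_dot)
```

Because only tokens shared by the query and a document contribute, the accumulated score for each document equals the dot product of the two sparse vectors.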
Scoring
SPLADE can also be used as a text scoring (re-ranking) function over the results of an earlier stage.
first_stage = ... # e.g., BM25, dense retrieval, etc.
splade_scorer = first_stage >> dataset.text_loader() >> splade.scorer()
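As a rough illustration of re-ranking, the sketch below re-scores only the candidates returned by an earlier stage, using a sparse dot product over invented token weights. The rerank function and all data are hypothetical stand-ins and do not reflect the internals of scorer().

```python
def rerank(candidates, query_toks, doc_toks_by_id):
    """Re-score first-stage candidates by sparse dot product and sort them."""
    rescored = []
    for docno in candidates:
        doc_toks = doc_toks_by_id[docno]
        score = sum(w * doc_toks.get(t, 0.0) for t, w in query_toks.items())
        rescored.append((docno, score))
    # Highest score first; documents outside `candidates` are never considered.
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored

# Invented encodings: token -> weight.
doc_toks_by_id = {
    "d1": {"reaction": 1.0, "chemical": 1.5},
    "d2": {"reaction": 0.5, "time": 1.0},
    "d3": {"weather": 1.0},
    "d4": {"reaction": 4.0},  # relevant, but missed by the first stage
}
query_toks = {"reaction": 2.0, "time": 1.5}

first_stage_candidates = ["d3", "d1", "d2"]  # e.g. the order from an earlier ranker
reranked = rerank(first_stage_candidates, query_toks, doc_toks_by_id)
print(reranked)
```

Note that d4 never appears in the output: a re-ranker can only reorder what the first stage retrieved, so first-stage recall bounds the quality of the final ranking.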
PISA
For faster retrieval, you can use the PISA backend in place of Terrier.
import pyterrier as pt
import pyterrier_splade
from pyterrier_pisa import PisaIndex
splade = pyterrier_splade.Splade()
dataset = pt.get_dataset('irds:msmarco-passage')
index = PisaIndex('./msmarco-passage-splade', stemmer='none')
# indexing
idx_pipeline = splade.doc_encoder() >> index.toks_indexer()
idx_pipeline.index(dataset.get_corpus_iter())
# retrieval
retr_pipeline = splade.query_encoder() >> index.quantized()
BMP
BMP (Block-Max Pruning) is another fast sparse retrieval backend that works
well with learned sparse models like SPLADE. Install it with pip install bmp[pyterrier].
import pyterrier as pt
import pyterrier_splade
from bmp.pyterrier import BmpIndex
splade = pyterrier_splade.Splade()
dataset = pt.get_dataset('irds:msmarco-passage')
index = BmpIndex('./msmarco-passage-splade.bmp')  # the .bmp extension is optional
# indexing
idx_pipeline = splade.doc_encoder() >> index.indexer()
idx_pipeline.index(dataset.get_corpus_iter())
# retrieval
retr_pipeline = splade.query_encoder() >> index.retriever()
retr_pipeline.search('my query')
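For intuition about block-max pruning, the following is a heavily simplified plain-Python sketch of the general idea. It is not the bmp library's implementation. It assumes non-negative weights, fixed-size blocks of consecutive documents, and block maxima computed on the fly (a real index would precompute them); all names and data are hypothetical.

```python
import heapq

def block_max_search(query, docs, block_size=2, k=1):
    """Toy exact top-k search that skips blocks whose upper bound cannot compete."""
    # Split the collection into fixed-size blocks of consecutive documents.
    blocks = [docs[i:i + block_size] for i in range(0, len(docs), block_size)]

    # Per block, record the maximum weight of each token in that block.
    block_max = []
    for block in blocks:
        maxima = {}
        for _, toks in block:
            for token, weight in toks.items():
                maxima[token] = max(maxima.get(token, 0), weight)
        block_max.append(maxima)

    # Upper bound on the score of any document in each block for this query.
    bounds = [sum(qw * m.get(t, 0) for t, qw in query.items()) for m in block_max]
    order = sorted(range(len(blocks)), key=lambda b: bounds[b], reverse=True)

    top = []      # min-heap of (score, docno) holding the current top-k
    visited = 0   # number of blocks actually scored
    for b in order:
        if len(top) == k and bounds[b] <= top[0][0]:
            break  # no remaining block can beat the current k-th best score
        visited += 1
        for docno, toks in blocks[b]:
            score = sum(qw * toks.get(t, 0) for t, qw in query.items())
            if len(top) < k:
                heapq.heappush(top, (score, docno))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, docno))
    return sorted(top, reverse=True), visited

# Invented encodings: (docno, token -> weight).
docs = [
    ("d1", {"reaction": 3, "chemical": 4}),
    ("d2", {"reaction": 5, "time": 2}),
    ("d3", {"weather": 6}),
    ("d4", {"time": 1}),
]
results, visited = block_max_search({"reaction": 2, "time": 1}, docs)
print(results, visited)
```

In this toy run the second block has an upper bound of 1, which cannot beat the best score found in the first block, so it is skipped without scoring any of its documents.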
See also
Step-by-step recipes for these pipelines are available on the SPLADE How-To Guides page.