SPLADE How-To Guides
How do I index documents with SPLADE?
import pyterrier as pt
import pyterrier_splade
splade = pyterrier_splade.Splade() # [1]
indexer = pt.IterDictIndexer('./msmarco_psg', pretokenised=True) # [2]
idx_pipeline = splade.doc_encoder() >> indexer # [3]
# `dataset` is assumed to be a PyTerrier dataset object, e.g. one returned by pt.get_dataset(...)
idx_pipeline.index(dataset.get_corpus_iter(), batch_size=128) # [4]
[1] Specify the SPLADE model to use; defaults to naver/splade-cocondenser-ensembledistil.
[2] pretokenised=True tells Terrier to index the SPLADE tokens unchanged, without further tokenisation or stemming.
[3] Create an indexing pipeline by chaining the SPLADE document encoder and the indexer.
[4] get_corpus_iter() can be any iterable of documents, including generators. This allows you to index collections that are too large to fit in memory at once.
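As a minimal sketch of the generator case, the function below streams documents one at a time from tab-separated lines. The line layout (docno, a tab, then the text) and the name iter_tsv_docs are assumptions made for illustration; the docno and text keys follow PyTerrier's usual dictionary convention for documents, which the SPLADE document encoder is assumed to read by default.

```python
import csv
from typing import Iterable, Iterator


def iter_tsv_docs(lines: Iterable[str]) -> Iterator[dict]:
    """Lazily yield one document dict per 'docno<TAB>text' line.

    `lines` can be an open text file, so only one line is held in
    memory at a time. (Hypothetical helper, not part of PyTerrier.)
    """
    reader = csv.reader(lines, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield {'docno': row[0], 'text': row[1]}
```

An open file handle can be passed as lines, and the resulting generator can then be given to idx_pipeline.index(...) in place of dataset.get_corpus_iter().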
How do I retrieve with SPLADE?
This example assumes that you have already built a SPLADE index for your collection (see the indexing guide above). For faster retrieval, see the PISA guide below.
import pyterrier as pt
import pyterrier_splade
splade = pyterrier_splade.Splade() # [1]
retr = splade.query_encoder() >> pt.terrier.Retriever('./msmarco_psg', wmodel='Tf') # [2]
results = retr.search('a single query')
# or
results = retr([
{'qid': '1', 'query': 'multiple queries'},
{'qid': '2', 'query': 'can be passed as a list of dicts'},
])
[1] Specify the model used to create the index.
[2] Create a retrieval pipeline by chaining the SPLADE query encoder with a Terrier retriever. wmodel='Tf' scores documents by the SPLADE term weights.
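To make the role of wmodel='Tf' more concrete, here is a minimal sketch of the kind of scoring it implies, assuming a document's score is the sum, over the terms it shares with the query, of the query term weight times the stored document term weight. The tokens and weights are made up, and tf_score is a hypothetical helper, not a function of PyTerrier or pyterrier_splade.

```python
def tf_score(query_weights: dict, doc_weights: dict) -> float:
    """Sum query_weight * doc_weight over terms shared by query and document."""
    return sum(
        weight * doc_weights[term]
        for term, weight in query_weights.items()
        if term in doc_weights
    )


# Made-up sparse representations (token -> weight), for illustration only.
query = {'sparse': 1.5, 'retrieval': 2.0, 'neural': 0.5}
docs = {
    'd1': {'sparse': 3, 'retrieval': 2, 'index': 4},
    'd2': {'neural': 6, 'network': 5},
}

# Score every document, then rank by descending score.
scores = {docno: tf_score(query, weights) for docno, weights in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Under this assumption no length normalisation or IDF is applied at retrieval time, since the SPLADE weights already encode term importance.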
How do I re-rank with SPLADE?
import pyterrier as pt
import pyterrier_splade
splade = pyterrier_splade.Splade() # [1]
first_stage = pt.terrier.Retriever('./msmarco_psg', wmodel='BM25') # [2]
# `dataset` is assumed to be the PyTerrier dataset object for the indexed collection
retr = first_stage >> dataset.text_loader() >> splade.scorer() # [3]
retr.search('my query')
[1] Specify the model you want to use as a re-ranker.
[2] In this example, we use BM25 over a sparse index for initial retrieval.
[3] Create a re-ranking pipeline by chaining an initial retriever, a text loader, and the SPLADE scorer. text_loader loads the document text required by the scorer.
How do I retrieve faster with PISA?
For faster retrieval, you can use the PISA backend instead of Terrier.
import pyterrier as pt
import pyterrier_splade
from pyterrier_pisa import PisaIndex
splade = pyterrier_splade.Splade() # [1]
index = PisaIndex('./msmarco-passage-splade', stemmer='none') # [2]
idx_pipeline = splade.doc_encoder() >> index.toks_indexer() # [3]
idx_pipeline.index(dataset.get_corpus_iter())
retr = splade.query_encoder() >> index.quantized() # [4]
retr.search('my query')
[1] Specify the SPLADE model to use.
[2] stemmer='none' keeps the SPLADE tokens unchanged.
[3] Create an indexing pipeline by chaining the SPLADE document encoder and the PISA tokens indexer.
[4] Create a retrieval pipeline by chaining the SPLADE query encoder with the quantized PISA retriever.
How do I retrieve faster with BMP?
BMP (Block-Max Pruning) is another fast sparse retrieval backend. Install it
with pip install bmp[pyterrier].
import pyterrier as pt
import pyterrier_splade
from bmp.pyterrier import BmpIndex
splade = pyterrier_splade.Splade() # [1]
index = BmpIndex('./msmarco-passage-splade.bmp') # [2]
idx_pipeline = splade.doc_encoder() >> index.indexer() # [3]
idx_pipeline.index(dataset.get_corpus_iter())
retr = splade.query_encoder() >> index.retriever() # [4]
retr.search('my query')
[1] Specify the SPLADE model to use.
[2] Specify the path of the index; the .bmp extension is optional.
[3] Create an indexing pipeline by chaining the SPLADE document encoder and the BMP indexer.
[4] Create a retrieval pipeline by chaining the SPLADE query encoder with the BMP retriever.