SPLADE How-To Guides


How do I index documents with SPLADE?

Indexing documents with SPLADE into a Terrier index
import pyterrier as pt
import pyterrier_splade

splade = pyterrier_splade.Splade() # [1]
indexer = pt.IterDictIndexer('./msmarco_psg', pretokenised=True) # [2]
idx_pipeline = splade.doc_encoder() >> indexer # [3]

dataset = pt.get_dataset('irds:msmarco-passage') # assumed corpus; substitute your own dataset
idx_pipeline.index(dataset.get_corpus_iter(), batch_size=128) # [4]
  1. Specify the SPLADE model to use; defaults to naver/splade-cocondenser-ensembledistil.

  2. pretokenised=True tells Terrier to index the SPLADE tokens unchanged, without further tokenisation or stemming.

  3. Create an indexing pipeline by chaining the SPLADE document encoder and the indexer.

  4. index() accepts any iterable of documents, including generators; get_corpus_iter() is just one source. This lets you index collections that are too large to fit in memory at once.
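
As an illustration of a generator input, the following is a minimal sketch that streams documents from a two-column, tab-separated file. The function name iter_tsv_docs and the file layout (document ID, tab, text) are assumptions made for this example; the 'docno' and 'text' keys follow the usual PyTerrier document convention.

```python
from typing import Dict, Iterator


def iter_tsv_docs(path: str) -> Iterator[Dict[str, str]]:
    """Lazily yield one document dict per line of a 'docno<TAB>text' file.

    Hypothetical helper: only one line is held in memory at a time.
    """
    with open(path, encoding='utf-8') as f:
        for line in f:
            # Split on the first tab only, so tabs inside the text survive.
            docno, sep, text = line.rstrip('\n').partition('\t')
            if not sep:
                continue  # skip blank lines and lines without a tab
            yield {'docno': docno, 'text': text}
```

Under these assumptions, the generator could be passed to the pipeline in place of get_corpus_iter(), for example idx_pipeline.index(iter_tsv_docs('collection.tsv'), batch_size=128), where collection.tsv is a placeholder file name.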


How do I retrieve with SPLADE?

This example assumes that you have already built a SPLADE index for your collection (see the indexing guide above). For faster retrieval, see the PISA and BMP guides below.

Retrieving over a SPLADE Terrier index
import pyterrier as pt
import pyterrier_splade

splade = pyterrier_splade.Splade() # [1]
retr = splade.query_encoder() >> pt.terrier.Retriever('./msmarco_psg', wmodel='Tf') # [2]

results = retr.search('a single query')
# or
results = retr([
    {'qid': '1', 'query': 'multiple queries'},
    {'qid': '2', 'query': 'can be passed as a list of dicts'},
])
  1. Specify the same SPLADE model that was used to create the index.

  2. Create a retrieval pipeline by chaining the SPLADE query encoder with a Terrier retriever. wmodel='Tf' scores documents by the SPLADE term weights.
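
To make the scoring step concrete, here is a minimal sketch of the computation it amounts to. It assumes that the index stores each SPLADE document weight as an integer term frequency and that a document's score is the sum, over the tokens shared with the query, of query weight times stored document weight. The function name tf_score and all weights are made up for illustration; this is not the Terrier implementation.

```python
from typing import Dict


def tf_score(query_toks: Dict[str, float], doc_toks: Dict[str, int]) -> float:
    """Sum (query weight x stored document weight) over the shared tokens."""
    return sum(
        weight * doc_toks[tok]
        for tok, weight in query_toks.items()
        if tok in doc_toks
    )


# Made-up weights, for illustration only.
query = {'chem': 1.5, 'reaction': 2.0, 'fast': 0.5}
doc_a = {'chem': 120, 'reaction': 80, 'lab': 40}
doc_b = {'fast': 200, 'car': 150}

print(tf_score(query, doc_a))  # 1.5*120 + 2.0*80 = 340.0
print(tf_score(query, doc_b))  # 0.5*200 = 100.0
```

In this toy example doc_a outranks doc_b because it shares the two most heavily weighted query tokens.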


How do I re-rank with SPLADE?

Re-ranking initial results with the SPLADE scorer
import pyterrier as pt
import pyterrier_splade

splade = pyterrier_splade.Splade() # [1]
first_stage = pt.terrier.Retriever('./msmarco_psg', wmodel='BM25') # [2]
dataset = pt.get_dataset('irds:msmarco-passage') # assumed corpus; substitute your own dataset
retr = first_stage >> dataset.text_loader() >> splade.scorer() # [3]

retr.search('my query')
  1. Specify the model you want to use as a re-ranker.

  2. In this example, we use BM25 over a sparse index for initial retrieval.

  3. Create a re-ranking pipeline by chaining an initial retriever, a text loader, and the SPLADE scorer. text_loader loads the document text required by the scorer.
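
The three stages above can be sketched in plain Python to show how the data flows. This is a toy illustration only: rerank and overlap_score are hypothetical names, and the word-overlap scorer is a stand-in for the SPLADE scorer, not an approximation of it.

```python
from typing import Callable, Dict, List


def overlap_score(query: str, text: str) -> float:
    """Toy stand-in scorer: number of query words that occur in the text."""
    words = set(text.lower().split())
    return float(sum(1 for w in query.lower().split() if w in words))


def rerank(
    query: str,
    candidates: List[Dict],
    texts: Dict[str, str],
    score_fn: Callable[[str, str], float],
) -> List[Dict]:
    """Load text for each first-stage candidate, rescore, and re-sort."""
    rescored = []
    for cand in candidates:
        text = texts[cand['docno']]  # text loading stage
        rescored.append({
            'docno': cand['docno'],
            'text': text,
            'score': score_fn(query, text),  # scoring stage replaces the old score
        })
    rescored.sort(key=lambda row: row['score'], reverse=True)
    for rank, row in enumerate(rescored):
        row['rank'] = rank  # 0-based rank after re-sorting
    return rescored


# First-stage results (e.g. from BM25) and a tiny document store, both made up.
candidates = [{'docno': 'd1', 'score': 12.0}, {'docno': 'd2', 'score': 9.5}]
texts = {
    'd1': 'cats sleep all day',
    'd2': 'sparse retrieval with learned term weights',
}

results = rerank('learned sparse retrieval', candidates, texts, overlap_score)
print([(r['docno'], r['score']) for r in results])  # [('d2', 3.0), ('d1', 0.0)]
```

Note that the re-ranker only reorders the candidates it is given, so a document missed by the first stage cannot be recovered.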


How do I retrieve faster with PISA?

For faster retrieval, you can use the PISA backend instead of Terrier.

Indexing and retrieving with SPLADE over a PISA index
import pyterrier as pt
import pyterrier_splade
from pyterrier_pisa import PisaIndex

splade = pyterrier_splade.Splade() # [1]
index = PisaIndex('./msmarco-passage-splade', stemmer='none') # [2]

idx_pipeline = splade.doc_encoder() >> index.toks_indexer() # [3]
dataset = pt.get_dataset('irds:msmarco-passage') # assumed corpus; substitute your own dataset
idx_pipeline.index(dataset.get_corpus_iter())

retr = splade.query_encoder() >> index.quantized() # [4]
retr.search('my query')
  1. Specify the SPLADE model to use.

  2. stemmer='none' disables stemming, so the SPLADE tokens are indexed unchanged.

  3. Create an indexing pipeline by chaining the SPLADE document encoder and the PISA tokens indexer.

  4. Create a retrieval pipeline by chaining the SPLADE query encoder with the quantized PISA retriever.


How do I retrieve faster with BMP?

BMP (Block-Max Pruning) is another fast sparse retrieval backend. Install it with pip install bmp[pyterrier].

Indexing and retrieving with SPLADE over a BMP index
import pyterrier as pt
import pyterrier_splade
from bmp.pyterrier import BmpIndex

splade = pyterrier_splade.Splade() # [1]
index = BmpIndex('./msmarco-passage-splade.bmp') # [2]

idx_pipeline = splade.doc_encoder() >> index.indexer() # [3]
dataset = pt.get_dataset('irds:msmarco-passage') # assumed corpus; substitute your own dataset
idx_pipeline.index(dataset.get_corpus_iter())

retr = splade.query_encoder() >> index.retriever() # [4]
retr.search('my query')
  1. Specify the SPLADE model to use.

  2. Specify the path where the index is stored; the .bmp extension is optional.

  3. Create an indexing pipeline by chaining the SPLADE document encoder and the BMP indexer.

  4. Create a retrieval pipeline by chaining the SPLADE query encoder with the BMP retriever.
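
For intuition about why block-max pruning can be fast, the following toy sketch shows the general idea: documents are grouped into blocks, each block stores the maximum weight of every term it contains, and at query time blocks are visited in order of their score upper bound until no remaining block can enter the top k. This is an illustration under simplifying assumptions (non-negative weights, in-memory dicts, hypothetical function names), not the BMP library's implementation.

```python
import heapq
from typing import Dict, List, Tuple

Doc = Dict[str, float]


def score(query: Doc, doc: Doc) -> float:
    """Sparse dot product between query weights and document weights."""
    return sum(qw * doc.get(term, 0.0) for term, qw in query.items())


def build_block_maxes(docs: List[Doc], block_size: int) -> List[Doc]:
    """For each block of consecutive documents, store each term's max weight."""
    block_maxes = []
    for start in range(0, len(docs), block_size):
        maxes: Doc = {}
        for doc in docs[start:start + block_size]:
            for term, w in doc.items():
                maxes[term] = max(maxes.get(term, 0.0), w)
        block_maxes.append(maxes)
    return block_maxes


def block_max_search(
    query: Doc, docs: List[Doc], block_maxes: List[Doc], block_size: int, k: int
) -> Tuple[List[Tuple[float, int]], int]:
    """Return the top-k (score, doc_id) pairs and the number of docs scored."""
    # With non-negative weights, scoring the per-term maxima gives an
    # upper bound on the score of every document inside the block.
    bounds = sorted(
        ((score(query, maxes), b) for b, maxes in enumerate(block_maxes)),
        reverse=True,
    )
    top: List[Tuple[float, int]] = []  # min-heap holding the current top k
    scored = 0
    for bound, b in bounds:
        if len(top) == k and bound <= top[0][0]:
            break  # no remaining block can beat the current k-th best score
        start = b * block_size
        for doc_id in range(start, min(start + block_size, len(docs))):
            s = score(query, docs[doc_id])
            scored += 1
            if len(top) < k:
                heapq.heappush(top, (s, doc_id))
            elif s > top[0][0]:
                heapq.heapreplace(top, (s, doc_id))
    return sorted(top, reverse=True), scored


# Made-up collection of six documents, grouped into blocks of two.
docs = [
    {'a': 1.0, 'b': 0.5},
    {'a': 0.2},
    {'c': 3.0},
    {'a': 2.0, 'c': 1.0},
    {'b': 0.1},
    {'d': 5.0},
]
query = {'a': 1.0, 'c': 2.0}
block_maxes = build_block_maxes(docs, block_size=2)

top, scored = block_max_search(query, docs, block_maxes, block_size=2, k=2)
print(top)     # [(6.0, 2), (4.0, 3)]
print(scored)  # 2: only the block with the highest upper bound was scored
```

In this example only one of the three blocks is scored, because the upper bounds of the other two (1.0 and 0.0) cannot beat the second-best score of 4.0.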