SPLADE How-To Guides

How do I index documents with SPLADE?

Indexing documents with SPLADE into a Terrier index

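import pyterrier as pt
import pyterrier_splade

# Assumption: this guide never defines `dataset`; MS MARCO passage is used here to match the index path.
dataset = pt.get_dataset('irds:msmarco-passage')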

splade = pyterrier_splade.Splade() # [1]
indexer = pt.IterDictIndexer('./msmarco_psg', pretokenised=True) # [2]
idx_pipeline = splade.doc_encoder() >> indexer # [3]

idx_pipeline.index(dataset.get_corpus_iter(), batch_size=128) # [4]

[1] Specify the SPLADE model to use; defaults to naver/splade-cocondenser-ensembledistil.
[2] pretokenised=True tells Terrier to index the SPLADE tokens unchanged, without further tokenisation or stemming.
[3] Create an indexing pipeline by chaining the SPLADE document encoder and the indexer.
[4] The input to index() can be any iterable of documents, including generators such as the one returned by get_corpus_iter(). This lets you index collections that are too large to fit in memory at once.
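
Annotation [4] notes that the indexing pipeline accepts any iterable of documents. As a minimal sketch, assuming a hypothetical tab-separated file with one docno<TAB>text pair per line, a generator can stream documents without loading the whole file into memory. The docno and text field names follow PyTerrier's usual conventions; the file layout and the iter_tsv_docs helper are illustrative assumptions, not part of pyterrier_splade.

```python
import os
import tempfile
from typing import Dict, Iterator


def iter_tsv_docs(path: str) -> Iterator[Dict[str, str]]:
    """Lazily yield one document dict per line of a docno<TAB>text file (hypothetical layout)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            docno, text = line.split("\t", 1)
            yield {"docno": docno, "text": text}


# Write a tiny demo file so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False, encoding="utf-8") as tmp:
    tmp.write("d1\tsparse neural retrieval\n")
    tmp.write("d2\tlearned term weights\n")
    demo_path = tmp.name

docs = list(iter_tsv_docs(demo_path))  # only materialised here for the demo
os.remove(demo_path)
print(docs)
```

Such a generator could be passed in place of dataset.get_corpus_iter(), for example idx_pipeline.index(iter_tsv_docs('corpus.tsv'), batch_size=128), where corpus.tsv is a placeholder path.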

How do I retrieve with SPLADE?

This example assumes that you have already built a SPLADE index for your collection (see the guide above). For faster retrieval, see the PISA guide below.

Retrieving over a SPLADE Terrier index

import pyterrier as pt
import pyterrier_splade

splade = pyterrier_splade.Splade() # [1]
retr = splade.query_encoder() >> pt.terrier.Retriever('./msmarco_psg', wmodel='Tf') # [2]

results = retr.search('a single query')
# or
results = retr([
    {'qid': '1', 'query': 'multiple queries'},
    {'qid': '2', 'query': 'can be passed as a list of dicts'},
])

[1] Specify the same SPLADE model that was used to create the index.
[2] Create a retrieval pipeline by chaining the SPLADE query encoder with a Terrier retriever. wmodel='Tf' scores documents by the SPLADE term weights stored in the index.
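
To make annotation [2] more concrete, the toy sketch below mimics what scoring by term weights roughly amounts to: a sparse dot product between the query's token weights and each document's token weights. The token dictionaries and the score_tf and rank helpers are made up for illustration; real weights come from the SPLADE model, and the real scoring happens inside Terrier.

```python
from typing import Dict, List, Tuple


def score_tf(query_toks: Dict[str, float], doc_toks: Dict[str, float]) -> float:
    """Sparse dot product over the terms shared by the query and the document."""
    return sum(w * doc_toks[t] for t, w in query_toks.items() if t in doc_toks)


def rank(query_toks: Dict[str, float], index: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Score every document, drop non-matching ones, and sort by descending score."""
    scored = [(docno, score_tf(query_toks, toks)) for docno, toks in index.items()]
    matching = [(docno, s) for docno, s in scored if s > 0]
    return sorted(matching, key=lambda pair: pair[1], reverse=True)


# Toy "index": docno -> {token: weight}. All values are invented.
index = {
    "d1": {"splade": 2.0, "sparse": 1.0},
    "d2": {"dense": 3.0, "sparse": 0.5},
    "d3": {"cooking": 1.5},
}
query_toks = {"splade": 1.0, "sparse": 2.0}

ranking = rank(query_toks, index)
print(ranking)
```

Here d1 scores 1.0 * 2.0 + 2.0 * 1.0 = 4.0, d2 scores 2.0 * 0.5 = 1.0, and d3 shares no terms with the query, so it is not returned.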

How do I re-rank with SPLADE?

Re-ranking initial results with the SPLADE scorer

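import pyterrier as pt
import pyterrier_splade

# Assumption: this guide never defines `dataset`; MS MARCO passage is used here to match the index path.
dataset = pt.get_dataset('irds:msmarco-passage')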

splade = pyterrier_splade.Splade() # [1]
first_stage = pt.terrier.Retriever('./msmarco_psg', wmodel='BM25') # [2]
retr = first_stage >> dataset.text_loader() >> splade.scorer() # [3]

retr.search('my query')

[1] Specify the model you want to use as a re-ranker.
[2] In this example, we use BM25 over a sparse index for initial retrieval.
[3] Create a re-ranking pipeline by chaining an initial retriever, a text loader, and the SPLADE scorer. text_loader() loads the document text required by the scorer.
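
The three stages of the pipeline can be mimicked with plain Python to show how the data flows. In this sketch, load_text plays the role of the text loader and toy_scorer is a stand-in that counts overlapping terms; it is not the SPLADE scorer. All names and data below are hypothetical.

```python
from typing import Dict, List


def load_text(results: List[dict], docstore: Dict[str, str]) -> List[dict]:
    """Attach the document text to each first-stage result (the text loader's role)."""
    return [{**r, "text": docstore[r["docno"]]} for r in results]


def toy_scorer(query: str, text: str) -> float:
    """Stand-in for a neural scorer: number of distinct query terms found in the text."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def rerank(query: str, results: List[dict]) -> List[dict]:
    """Replace first-stage scores with scorer output, then re-sort and re-number ranks."""
    rescored = [{**r, "score": toy_scorer(query, r["text"])} for r in results]
    rescored.sort(key=lambda r: r["score"], reverse=True)
    for rank, r in enumerate(rescored):
        r["rank"] = rank  # ranks start at 0 in this sketch
    return rescored


docstore = {
    "d1": "terrier is a search engine",
    "d2": "splade learns sparse term weights",
    "d3": "sparse retrieval with splade and terrier",
}
# Pretend first-stage output, ordered by an initial (e.g. lexical) score.
first_stage = [
    {"docno": "d1", "score": 9.1},
    {"docno": "d2", "score": 7.4},
    {"docno": "d3", "score": 5.0},
]

query = "splade sparse retrieval"
reranked = rerank(query, load_text(first_stage, docstore))
order = [r["docno"] for r in reranked]
print(order)
```

The scorer needs the text column, which is why the text loader sits between the first-stage retriever and the scorer in the pipeline.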

How do I retrieve faster with PISA?

For faster retrieval, you can use the PISA backend instead of Terrier.

Indexing and retrieving with SPLADE over a PISA index

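import pyterrier as pt
import pyterrier_splade
from pyterrier_pisa import PisaIndex

# Assumption: this guide never defines `dataset`; MS MARCO passage is used here to match the index path.
dataset = pt.get_dataset('irds:msmarco-passage')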

splade = pyterrier_splade.Splade() # [1]
index = PisaIndex('./msmarco-passage-splade', stemmer='none') # [2]

idx_pipeline = splade.doc_encoder() >> index.toks_indexer() # [3]
idx_pipeline.index(dataset.get_corpus_iter())

retr = splade.query_encoder() >> index.quantized() # [4]
retr.search('my query')

[1] Specify the SPLADE model to use.
[2] stemmer='none' keeps the SPLADE tokens unchanged.
[3] Create an indexing pipeline by chaining the SPLADE document encoder and the PISA tokens indexer.
[4] Create a retrieval pipeline by chaining the SPLADE query encoder with the quantized PISA retriever.
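
Inverted indexes typically store integer term frequencies, so learned float weights have to be mapped to integers at some point. The sketch below shows one common way to do that: multiply by a constant and round. The scale factor of 100 and the quantize helper are assumptions chosen for illustration; they are not necessarily what pyterrier_splade or PISA do internally.

```python
from typing import Dict


def quantize(toks: Dict[str, float], scale: int = 100) -> Dict[str, int]:
    """Map float term weights to positive integer impacts; drop terms that round to 0."""
    out: Dict[str, int] = {}
    for term, weight in toks.items():
        q = int(round(weight * scale))
        if q > 0:
            out[term] = q
    return out


# Invented weights for a single document.
doc_toks = {"splade": 1.37, "sparse": 0.42, "the": 0.004}
impacts = quantize(doc_toks)
print(impacts)
```

With this scale, very small weights such as 0.004 round to zero and the term is dropped, which trades a little precision for a smaller index.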

How do I retrieve faster with BMP?

BMP (Block-Max Pruning) is another fast sparse retrieval backend. Install it with pip install bmp[pyterrier].

Indexing and retrieving with SPLADE over a BMP index

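import pyterrier as pt
import pyterrier_splade
from bmp.pyterrier import BmpIndex

# Assumption: this guide never defines `dataset`; MS MARCO passage is used here to match the index path.
dataset = pt.get_dataset('irds:msmarco-passage')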

splade = pyterrier_splade.Splade() # [1]
index = BmpIndex('./msmarco-passage-splade.bmp') # [2]

idx_pipeline = splade.doc_encoder() >> index.indexer() # [3]
idx_pipeline.index(dataset.get_corpus_iter())

retr = splade.query_encoder() >> index.retriever() # [4]
retr.search('my query')

[1] Specify the SPLADE model to use.
[2] Specify the path where the index is stored; the .bmp extension is optional.
[3] Create an indexing pipeline by chaining the SPLADE document encoder and the BMP indexer.
[4] Create a retrieval pipeline by chaining the SPLADE query encoder with the BMP retriever.
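
For intuition about why block-max pruning can be fast, here is a heavily simplified sketch of the general idea: documents are grouped into blocks, each block stores the maximum weight of every term it contains, and blocks whose score upper bound cannot beat the current top-k are skipped. This is an illustration under those assumptions only, not the BMP library's implementation; all function names and data are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

Doc = Dict[str, float]


def build_block_max(docs: List[Doc], block_size: int) -> List[Dict[str, float]]:
    """For each block of consecutive documents, record every term's maximum weight."""
    blocks = []
    for start in range(0, len(docs), block_size):
        maxima: Dict[str, float] = {}
        for doc in docs[start:start + block_size]:
            for term, w in doc.items():
                maxima[term] = max(maxima.get(term, 0.0), w)
        blocks.append(maxima)
    return blocks


def dot(query: Dict[str, float], doc: Doc) -> float:
    """Sparse dot product; assumes non-negative weights."""
    return sum(qw * doc.get(t, 0.0) for t, qw in query.items())


def search(query, docs, block_max, block_size, k) -> Tuple[List[Tuple[int, float]], int]:
    """Return the top-k (docid, score) pairs and the number of blocks actually scored."""
    # The dot product with a block's maxima bounds every document score in that block.
    bounds = sorted(((dot(query, m), b) for b, m in enumerate(block_max)), reverse=True)
    heap: List[Tuple[float, int]] = []  # min-heap of (score, docid)
    visited = 0
    for bound, b in bounds:
        if len(heap) == k and bound <= heap[0][0]:
            break  # no remaining block can beat the current k-th best score
        visited += 1
        start = b * block_size
        for docid in range(start, min(start + block_size, len(docs))):
            s = dot(query, docs[docid])
            if len(heap) < k:
                heapq.heappush(heap, (s, docid))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, docid))
    top = sorted(heap, key=lambda x: (-x[0], x[1]))
    return [(docid, s) for s, docid in top], visited


# Invented document vectors; docids are list positions.
docs = [
    {"splade": 2.0, "sparse": 1.0},
    {"sparse": 0.5},
    {"cooking": 3.0},
    {"pasta": 1.0},
    {"splade": 0.5},
    {"dense": 2.0},
]
block_size = 2
block_max = build_block_max(docs, block_size)
results, visited = search({"splade": 1.0, "sparse": 1.0}, docs, block_max, block_size, k=2)
print(results, visited)
```

In this toy run only the first of the three blocks is scored: after it, the best remaining upper bound (0.5) cannot exceed the second-best score already found, so the other blocks are skipped.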