SPLADE Overview

pyterrier-splade lets you construct learned sparse indexing and retrieval pipelines. SPLADE maps queries and documents into sparse bags of weighted BERT WordPiece tokens, combining the efficiency of sparse retrieval with the effectiveness of learned term weighting and expansion.

A pipeline is built from two main pieces: a SPLADE model, which encodes text into weighted tokens, and a sparse index (Terrier or PISA), which stores those tokens and retrieves over them.

SPLADE Model

A Splade model is a factory for the transformers used to encode and score text.

Loading a SPLADE model

import pyterrier_splade
splade = pyterrier_splade.Splade() # [1]

[1] You can pass any SPLADE model name or path; the default is naver/splade-cocondenser-ensembledistil.

A SPLADE model can produce several transformers:

Encode documents into weighted tokens using its doc_encoder().
Encode queries into weighted tokens using its query_encoder().
Re-rank results by scoring query/document pairs using its scorer().

Indexing

Indexing takes place as a pipeline: the SPLADE document encoder maps raw text into a dictionary of BERT WordPiece tokens and corresponding weights, and the underlying sparse indexer stores them. Terrier is configured to index the tokens unchanged, without further tokenisation or stemming (pretokenised=True).

Indexing documents with SPLADE

import pyterrier as pt
import pyterrier_splade

splade = pyterrier_splade.Splade()
dataset = pt.get_dataset('irds:msmarco-passage')  # corpus to index
indexer = pt.IterDictIndexer('./msmarco_psg', pretokenised=True)

idx_pipeline = splade.doc_encoder() >> indexer
idx_pipeline.index(dataset.get_corpus_iter(), batch_size=128)
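To make the data flow concrete, here is a minimal pure-Python sketch of what pretokenised indexing amounts to. It does not use pyterrier-splade or Terrier. The record layout (a `toks` dictionary per document), the token weights, and the `build_postings` helper are illustrative assumptions, not the library's internals.

```python
from collections import defaultdict

# Hypothetical encoder output: one record per document, with a 'toks'
# dictionary mapping WordPiece tokens to weights (all values are made up).
encoded_docs = [
    {"docno": "d1", "toks": {"chemical": 12, "reaction": 9, "##s": 3}},
    {"docno": "d2", "toks": {"reaction": 4, "time": 11}},
]

def build_postings(docs):
    """Store every token exactly as given, with its weight as the frequency."""
    postings = defaultdict(list)
    for doc in docs:
        for token, weight in doc["toks"].items():
            # No tokenisation or stemming: '##s' stays '##s'.
            postings[token].append((doc["docno"], weight))
    return dict(postings)

postings = build_postings(encoded_docs)
```

The point of the sketch is that the indexer never sees raw text: it only stores the tokens and weights the encoder emitted, which is why further tokenisation or stemming has to be switched off.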

Retrieval

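Similarly, SPLADE encodes the query into weighted BERT WordPieces, which are passed to a sparse retriever. Because the index stores each document's SPLADE weight as a term frequency, scoring with wmodel='Tf' sums the product of query weight and stored weight over the matching tokens, which recovers the SPLADE dot-product score.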

Retrieving with SPLADE

splade_retr = splade.query_encoder() >> pt.terrier.Retriever('./msmarco_psg', wmodel='Tf')
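The accumulation behind this can be sketched in a few lines of plain Python. This is an illustration under assumptions, not Terrier's implementation: the toy postings, the query weights, and the `tf_retrieve` helper are all hypothetical.

```python
from collections import defaultdict

# Toy postings: token -> list of (docno, stored weight). Values are made up.
postings = {
    "chemical": [("d1", 12)],
    "reaction": [("d1", 9), ("d2", 4)],
    "time": [("d2", 11)],
}

# Hypothetical encoded query: token -> query weight.
query_toks = {"reaction": 2.0, "time": 1.5}

def tf_retrieve(query_toks, postings):
    """Accumulate query_weight * stored_weight over matching postings."""
    scores = defaultdict(float)
    for token, q_weight in query_toks.items():
        for docno, d_weight in postings.get(token, []):
            scores[docno] += q_weight * d_weight
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

results = tf_retrieve(query_toks, postings)
print(results)  # [('d2', 24.5), ('d1', 18.0)]
```

Each accumulated score equals the dot product of the sparse query and document vectors, restricted to the tokens they share.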

Scoring

SPLADE can also be used as a text scoring (re-ranking) function over the results of an earlier stage.

Re-ranking with the SPLADE scorer

first_stage = ... # e.g., BM25, dense retrieval, etc.
splade_scorer = first_stage >> dataset.text_loader() >> splade.scorer()
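As a rough picture of what re-ranking does to a candidate list, the sketch below re-scores first-stage candidates and re-sorts them. It assumes the query and documents are already encoded into token weights (the real scorer works from the loaded text), and the `dot` and `rerank` helpers, the weights, and the zero-based `rank` field are illustrative assumptions.

```python
def dot(query_toks, doc_toks):
    """Sparse dot product over the tokens present in the query."""
    return sum(w * doc_toks.get(tok, 0.0) for tok, w in query_toks.items())

def rerank(query_toks, candidates):
    """Re-score (docno, doc_toks) candidates and sort by the new score."""
    scored = [(docno, dot(query_toks, toks)) for docno, toks in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [
        {"docno": docno, "score": score, "rank": rank}
        for rank, (docno, score) in enumerate(scored)
    ]

# Hypothetical encoded query and first-stage candidates, in first-stage order.
query_toks = {"reaction": 2.0, "time": 1.5}
candidates = [
    ("d1", {"chemical": 12, "reaction": 9}),
    ("d2", {"reaction": 4, "time": 11}),
    ("d3", {"kinetics": 7}),
]

reranked = rerank(query_toks, candidates)
```

Unlike retrieval, nothing is looked up in an index here: only the documents the earlier stage returned are scored, so the cost grows with the candidate list rather than the collection.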

PISA

For faster retrieval, you can use the PISA backend in place of Terrier.

Indexing and retrieving with SPLADE over a PISA index

import pyterrier as pt
import pyterrier_splade
from pyterrier_pisa import PisaIndex

splade = pyterrier_splade.Splade()
dataset = pt.get_dataset('irds:msmarco-passage')
index = PisaIndex('./msmarco-passage-splade', stemmer='none')

# indexing
idx_pipeline = splade.doc_encoder() >> index.toks_indexer()
idx_pipeline.index(dataset.get_corpus_iter())

# retrieval
retr_pipeline = splade.query_encoder() >> index.quantized()

BMP

BMP (Block-Max Pruning) is another fast sparse retrieval backend that works well with learned sparse models like SPLADE. Install it with pip install bmp[pyterrier].

Indexing and retrieving with SPLADE over a BMP index

import pyterrier as pt
import pyterrier_splade
from bmp.pyterrier import BmpIndex

splade = pyterrier_splade.Splade()
dataset = pt.get_dataset('irds:msmarco-passage')  # corpus to index
index = BmpIndex('./msmarco-passage-splade.bmp') # [1]

# indexing
idx_pipeline = splade.doc_encoder() >> index.indexer()
idx_pipeline.index(dataset.get_corpus_iter())

# retrieval
retr_pipeline = splade.query_encoder() >> index.retriever()
retr_pipeline.search('my query')

[1] The .bmp extension is optional.
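For intuition about why block-max pruning is fast, here is a simplified pure-Python sketch of the general idea. It is not the bmp library's implementation: documents are grouped into fixed-size blocks, each block keeps the maximum weight per token, and at query time blocks are visited in order of their score upper bound until no remaining block can beat the current top-k. The function name, block size, and data are assumptions, and weights are assumed non-negative.

```python
import heapq

def block_max_search(query, docs, block_size=2, k=1):
    """docs: list of token->weight dicts; the doc id is the list position."""
    n_blocks = (len(docs) + block_size - 1) // block_size
    # Index time: per-block maximum weight of every token.
    block_max = [dict() for _ in range(n_blocks)]
    for doc_id, toks in enumerate(docs):
        maxima = block_max[doc_id // block_size]
        for tok, weight in toks.items():
            maxima[tok] = max(maxima.get(tok, 0.0), weight)
    # Query time: upper bound on any document score inside each block.
    bounds = sorted(
        ((sum(qw * m.get(tok, 0.0) for tok, qw in query.items()), b)
         for b, m in enumerate(block_max)),
        reverse=True,
    )
    top = []  # min-heap of (score, doc_id), at most k entries
    scored_blocks = 0
    for bound, b in bounds:
        if len(top) == k and bound <= top[0][0]:
            break  # no remaining block can beat the current top-k
        scored_blocks += 1
        start = b * block_size
        for doc_id in range(start, min(start + block_size, len(docs))):
            score = sum(qw * docs[doc_id].get(tok, 0.0)
                        for tok, qw in query.items())
            if len(top) < k:
                heapq.heappush(top, (score, doc_id))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, doc_id))
    return sorted(top, reverse=True), scored_blocks

# Hypothetical encoded documents and query (values are made up).
docs = [
    {"reaction": 9, "chemical": 12},
    {"reaction": 4, "time": 11},
    {"kinetics": 7},
    {"time": 1},
]
query = {"reaction": 2.0, "time": 1.5}

hits, scored_blocks = block_max_search(query, docs, block_size=2, k=1)
```

In this toy run only the first block is scored: its best document already exceeds the second block's upper bound, so the second block is skipped without touching its documents.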