SPLADE How-To Guides
============================================================

.. how-to:: How do I index documents with SPLADE?

    .. code-block:: python
        :caption: Indexing documents with SPLADE into a Terrier index

        import pyterrier as pt
        import pyterrier_splade
        dataset = pt.get_dataset('irds:msmarco-passage')  # assumed corpus for this example; substitute your own
        splade = pyterrier_splade.Splade() # :footnote: Specify the SPLADE model to use; defaults to ``naver/splade-cocondenser-ensembledistil``.
        indexer = pt.IterDictIndexer('./msmarco_psg', pretokenised=True) # :footnote: ``pretokenised=True`` tells Terrier to index the SPLADE tokens unchanged, without further tokenisation or stemming.
        idx_pipeline = splade.doc_encoder() >> indexer # :footnote: Create an indexing pipeline by chaining the SPLADE document encoder and the indexer.
        idx_pipeline.index(dataset.get_corpus_iter(), batch_size=128) # :footnote: ``get_corpus_iter()`` can be replaced by any *iterable* of documents, including generators. This allows you to index collections that are too large to fit in memory at once.

.. how-to:: How do I retrieve with SPLADE?

    .. _pyterrier-splade:how-to:terrier-retrieval:

    This example assumes that you have already built a SPLADE index for your collection (see the indexing guide above). For faster retrieval, check out :ref:`the PISA guide `.

    .. code-block:: python
        :caption: Retrieving over a SPLADE Terrier index

        import pyterrier as pt
        import pyterrier_splade
        splade = pyterrier_splade.Splade() # :footnote: Specify the model used to create the index.
        retr = splade.query_encoder() >> pt.terrier.Retriever('./msmarco_psg', wmodel='Tf') # :footnote: Create a retrieval pipeline by chaining the SPLADE query encoder with a Terrier retriever. ``wmodel='Tf'`` scores documents by the SPLADE term weights.
        results = retr.search('a single query')
        # or
        results = retr([
            {'qid': '1', 'query': 'multiple queries'},
            {'qid': '2', 'query': 'can be passed as a list of dicts'},
        ])

.. how-to:: How do I re-rank with SPLADE?

    .. code-block:: python
        :caption: Re-ranking initial results with the SPLADE scorer

        import pyterrier as pt
        import pyterrier_splade
        dataset = pt.get_dataset('irds:msmarco-passage')  # assumed corpus for this example; substitute your own
        splade = pyterrier_splade.Splade() # :footnote: Specify the model you want to use as a re-ranker.
        first_stage = pt.terrier.Retriever('./msmarco_psg', wmodel='BM25') # :footnote: In this example, we use BM25 over a sparse index for initial retrieval.
        retr = first_stage >> dataset.text_loader() >> splade.scorer() # :footnote: Create a re-ranking pipeline by chaining an initial retriever, a text loader, and the SPLADE scorer. ``text_loader`` loads the document text required by the scorer.
        retr.search('my query')

.. how-to:: How do I retrieve faster with PISA?

    .. _pyterrier-splade:how-to:pisa:

    For faster retrieval, you can use the `PISA `__ backend instead of Terrier.

    .. code-block:: python
        :caption: Indexing and retrieving with SPLADE over a PISA index

        import pyterrier as pt
        import pyterrier_splade
        from pyterrier_pisa import PisaIndex
        dataset = pt.get_dataset('irds:msmarco-passage')  # assumed corpus for this example; substitute your own
        splade = pyterrier_splade.Splade() # :footnote: Specify the SPLADE model to use.
        index = PisaIndex('./msmarco-passage-splade', stemmer='none') # :footnote: ``stemmer='none'`` keeps the SPLADE tokens unchanged.
        idx_pipeline = splade.doc_encoder() >> index.toks_indexer() # :footnote: Create an indexing pipeline by chaining the SPLADE document encoder and the PISA tokens indexer.
        idx_pipeline.index(dataset.get_corpus_iter())

        retr = splade.query_encoder() >> index.quantized() # :footnote: Create a retrieval pipeline by chaining the SPLADE query encoder with the quantized PISA retriever.
        retr.search('my query')

.. how-to:: How do I retrieve faster with BMP?

    .. _pyterrier-splade:how-to:bmp:

    `BMP (Block-Max Pruning) `__ is another fast sparse retrieval backend. Install it with ``pip install bmp[pyterrier]``.

    .. code-block:: python
        :caption: Indexing and retrieving with SPLADE over a BMP index

        import pyterrier as pt
        import pyterrier_splade
        from bmp.pyterrier import BmpIndex
        dataset = pt.get_dataset('irds:msmarco-passage')  # assumed corpus for this example; substitute your own
        splade = pyterrier_splade.Splade() # :footnote: Specify the SPLADE model to use.
        index = BmpIndex('./msmarco-passage-splade.bmp') # :footnote: Specify the path to index to; the ``.bmp`` extension is optional.
        idx_pipeline = splade.doc_encoder() >> index.indexer() # :footnote: Create an indexing pipeline by chaining the SPLADE document encoder and the BMP indexer.
        idx_pipeline.index(dataset.get_corpus_iter())

        retr = splade.query_encoder() >> index.retriever() # :footnote: Create a retrieval pipeline by chaining the SPLADE query encoder with the BMP retriever.
        retr.search('my query')
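

The indexing examples in this guide all consume ``dataset.get_corpus_iter()``, and the Terrier example notes that any iterable of documents works, including generators. The sketch below shows one way to write such a generator for a collection that is not available as a dataset object. It is an illustration under stated assumptions, not part of ``pyterrier_splade``: the function name ``iter_jsonl_docs`` and the input fields ``id`` and ``contents`` are hypothetical, and it assumes the usual PyTerrier convention of one dict per document with ``docno`` and ``text`` keys.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_jsonl_docs(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Lazily yield one document dict per non-empty JSON line.

    Hypothetical helper: the input field names ("id", "contents") are
    assumptions about your data, not something the library prescribes.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing on them
        record = json.loads(line)
        # Map the assumed input fields onto the docno/text keys.
        yield {"docno": str(record["id"]), "text": record["contents"]}


# A small in-memory sample stands in for a large JSON-lines file.
sample = [
    '{"id": 1, "contents": "first passage"}',
    '',
    '{"id": 2, "contents": "second passage"}',
]
docs = list(iter_jsonl_docs(sample))
```

Because the function is a generator, only one line is held in memory at a time. With a real collection you would likely pass an open file handle instead of ``sample`` and hand the generator to the indexing pipeline in place of ``dataset.get_corpus_iter()``.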