Anserini How-To Guides

This page provides a set of how-to guides for using Anserini with PyTerrier.


How do I index a standard corpus?

Indexing a standard corpus with Anserini
import pyterrier as pt
from pyterrier_anserini import AnseriniIndex
dataset = pt.datasets.get_dataset("irds:msmarco-passage") # [1]
my_index = AnseriniIndex('/path/to/index/location.anserini') # [2]
my_index.index(dataset.get_corpus_iter()) # [3]
  1. Select your dataset here. If the corpus is not available in PyTerrier datasets, see "How do I index a custom collection?" below.

  2. Specify the location where you want to store the Anserini index. The location must not yet exist. We recommend using the .anserini extension, though this is not required.

  3. This performs indexing with default settings. If you need more control over the indexing settings, see indexer() for advanced options.
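
Because the index location must not already exist, it can help to check the path before starting a long indexing run. The following is a minimal sketch using only the Python standard library; the helper name ensure_new_index_path is hypothetical and not part of pyterrier_anserini.

```python
from pathlib import Path


def ensure_new_index_path(location: str) -> Path:
    """Return `location` as a Path, raising if something already exists there.

    Hypothetical helper for illustration; not part of pyterrier_anserini.
    """
    path = Path(location)
    if path.exists():
        # Fail early rather than partway through an indexing run.
        raise FileExistsError(f"Index location already exists: {path}")
    return path
```

The checked path could then be handed to the index constructor as a string, for example AnseriniIndex(str(ensure_new_index_path('/path/to/index/location.anserini'))).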


How do I index a custom collection?

Indexing a custom collection with Anserini
import pyterrier as pt
from pyterrier_anserini import AnseriniIndex
my_collection = [ # [1]
    {"docno": "doc1", "title": "This is the title of document one.", "body": "This is the body of document one."},
    {"docno": "doc2", "title": "This is the title of document two.", "body": "This is the body of document two."},
    {"docno": "doc3", "title": "This is the title of document three.", "body": "This is the body of document three."}
]
my_index = AnseriniIndex('/path/to/index/location.anserini') # [2]
indexer = my_index.indexer(fields=["title", "body"]) # [3]
indexer.index(my_collection)
  1. Each document should be a dictionary with docno (a unique identifier) and additional text fields. Your collection can be any iterable type (list, generator, etc.).

  2. Specify the location where you want to store the Anserini index. The location must not yet exist. We recommend using the .anserini extension, though this is not required.

  3. fields=... lets you specify which fields to index. If you do not specify this option, all string fields will be merged automatically.
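
Since the collection can be any iterable, a generator lets you stream documents from disk instead of holding them all in memory. The sketch below assumes a hypothetical JSON Lines file in which each line has id, title, and body keys; the function name and file layout are illustrative assumptions, not part of PyTerrier or pyterrier_anserini.

```python
import json
from typing import Dict, Iterator


def iter_jsonl_docs(path: str) -> Iterator[Dict[str, str]]:
    """Yield one document dictionary per non-empty line of a JSON Lines file.

    Assumes (hypothetically) that each line has "id", "title", and "body" keys.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            yield {
                "docno": str(record["id"]),  # docno must be a unique identifier
                "title": record.get("title", ""),
                "body": record.get("body", ""),
            }
```

The resulting generator could be passed to the indexer in place of the list, for example indexer.index(iter_jsonl_docs('/path/to/docs.jsonl')), where the file path is a placeholder.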


How do I retrieve documents from an Anserini index with BM25?

Retrieving documents from an Anserini index with BM25
import pandas as pd
import pyterrier as pt
from pyterrier_anserini import AnseriniIndex
index = AnseriniIndex('/path/to/index/location.anserini') # [1]
bm25 = index.bm25() # [2]
queries = pd.DataFrame([{"qid": "1", "query": "example query"}]) # [3]
results = bm25(queries) # [4]
  1. Specify the location of your Anserini index.

  2. This creates a BM25 retriever. You can also create other retrievers or re-rankers using the corresponding factory methods in AnseriniIndex.

  3. Create a DataFrame with your queries. The DataFrame must have a qid column for query IDs and a query column for query text.

  4. This runs retrieval and returns a DataFrame with results. By default, the results will include the docno and score fields, but you can include additional fields using the include_fields=... option when creating the retriever.
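
A query DataFrame with the required qid and query columns can be built from any mapping of IDs to query text. This is a minimal sketch using pandas; the helper name make_queries and the example topics are hypothetical.

```python
import pandas as pd


def make_queries(topics: dict) -> pd.DataFrame:
    """Build a DataFrame with string `qid` and `query` columns from {id: text}.

    Hypothetical helper for illustration; not part of PyTerrier.
    """
    rows = [{"qid": str(qid), "query": text} for qid, text in topics.items()]
    # Passing columns explicitly keeps the schema stable even for an empty mapping.
    return pd.DataFrame(rows, columns=["qid", "query"])


queries = make_queries({1: "example query", 2: "another example query"})
```

The resulting frame has the same shape as the single-query example above, so it could be passed to the retriever in the same way with bm25(queries).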