Anserini How-To Guides
This page provides a set of how-to guides for using Anserini with PyTerrier.
How do I index a standard corpus?
import pyterrier as pt
from pyterrier_anserini import AnseriniIndex
dataset = pt.datasets.get_dataset("irds:msmarco-passage") # [1]
my_index = AnseriniIndex('/path/to/index/location.anserini') # [2]
my_index.index(dataset.get_corpus_iter()) # [3]
[1] Select your dataset here. If the corpus is not available in PyTerrier datasets, see "How do I index a custom collection?" below.
[2] Specify the location where you want to store the Anserini index. The location must not yet exist. We recommend using the .anserini extension, though this is not required.
[3] This performs indexing with default settings. If you need more control over the indexing settings, see indexer() for advanced options.
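Because the index location must not already exist, it can help to check the path before starting a long indexing run. The following is a minimal sketch using only the Python standard library; the helper name check_new_index_path and the example file name are hypothetical and are not part of pyterrier_anserini.

```python
from pathlib import Path
import tempfile


def check_new_index_path(path: str) -> Path:
    """Return `path` as a Path, raising if something already exists there.

    Hypothetical helper for illustration; not part of pyterrier_anserini.
    """
    index_path = Path(path)
    if index_path.exists():
        raise FileExistsError(f"Index location already exists: {index_path}")
    if index_path.suffix != ".anserini":
        # The extension is only a recommendation, so this is a note, not an error.
        print(f"Note: {index_path.name} does not use the recommended .anserini extension")
    return index_path


# Demonstration inside a temporary directory, so no real index is touched.
with tempfile.TemporaryDirectory() as tmp:
    fresh = check_new_index_path(f"{tmp}/msmarco-passage.anserini")
    print(fresh.name)
```

The returned path could then be converted with str() and passed to AnseriniIndex as in the example above.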
How do I index a custom collection?
import pyterrier as pt
from pyterrier_anserini import AnseriniIndex
my_collection = [ # [1]
{"docno": "doc1", "title": "This is the text of document one.", "body": "This is the body of document one."},
{"docno": "doc2", "title": "This is the text of document two.", "body": "This is the body of document two."},
{"docno": "doc3", "title": "This is the text of document three.", "body": "This is the body of document three."}
]
my_index = AnseriniIndex('/path/to/index/location.anserini') # [2]
indexer = my_index.indexer(fields=["title", "body"]) # [3]
indexer.index(my_collection)
[1] Each document should be a dictionary with docno (a unique identifier) and additional text fields. Your collection can be any iterable type (list, generator, etc.).
[2] Specify the location where you want to store the Anserini index. The location must not yet exist. We recommend using the .anserini extension, though this is not required.
[3] fields=... lets you specify which fields to index. If you do not specify this option, all string fields will be merged automatically.
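Since the collection can be any iterable, a generator lets you stream documents instead of holding them all in memory. The sketch below assumes a hypothetical JSON Lines source whose records have id, title, and body keys; the function name iter_documents and that input layout are illustrative assumptions, not something the library prescribes.

```python
import io
import json
from typing import Iterable, Iterator


def iter_documents(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one document dict per non-empty JSON line.

    Assumes each line is a JSON object with an "id" key and optional
    "title" and "body" keys (a hypothetical input layout).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield {
            "docno": str(record["id"]),  # docno is the unique identifier
            "title": record.get("title", ""),
            "body": record.get("body", ""),
        }


# Stand-in for an open file; a real collection would iterate over open(path).
sample = io.StringIO(
    '{"id": 1, "title": "First title", "body": "First body"}\n'
    '{"id": 2, "title": "Second title", "body": "Second body"}\n'
)
docs = list(iter_documents(sample))
print([d["docno"] for d in docs])
```

A generator like this could be passed to indexer.index(...) in place of the in-memory list shown above.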
How do I retrieve documents from an Anserini index with BM25?
import pandas as pd
import pyterrier as pt
from pyterrier_anserini import AnseriniIndex
index = AnseriniIndex('/path/to/index/location.anserini') # [1]
bm25 = index.bm25() # [2]
queries = pd.DataFrame([{"qid": "1", "query": "example query"}]) # [3]
results = bm25(queries) # [4]
[1] Specify the location of your Anserini index.
[2] This creates a BM25 retriever. You can also create other retrievers or re-rankers using the corresponding factory methods in AnseriniIndex.
[3] Create a DataFrame with your queries. The DataFrame must have a qid column for query IDs and a query column for query text.
[4] This runs retrieval and returns a DataFrame with results. By default, the results will include the docno and score fields, but you can include additional fields using the include_fields=... option when creating the retriever.
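When queries come from another source, such as a dictionary of topics, they first need to be reshaped into rows with qid and query keys. The following sketch uses only the standard library; the helper name to_query_records is hypothetical, and converting IDs to strings simply mirrors the example above rather than a documented requirement.

```python
def to_query_records(topics: dict) -> list:
    """Convert {query_id: query_text} into rows with 'qid' and 'query' keys.

    Hypothetical helper for illustration; not part of PyTerrier.
    """
    records = []
    for qid, text in topics.items():
        text = " ".join(str(text).split())  # collapse stray whitespace
        if not text:
            continue  # skip empty queries
        records.append({"qid": str(qid), "query": text})
    return records


topics = {1: "example query", 2: "  another   example ", 3: ""}
records = to_query_records(topics)
print(records)
```

Passing the resulting list to pd.DataFrame(records) yields a frame with the qid and query columns described above.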