pyterrier_anserini API Documentation

AnseriniIndex is the primary class for interacting with Anserini indexes in PyTerrier. It acts as a factory for creating transformers that can index, retrieve, and re-rank. It also has methods for providing information about the index and for downloading pre-built indexes.

class pyterrier_anserini.AnseriniIndex(path)[source]

An Anserini index.

An Anserini index is a directory containing a Lucene index built with Anserini.

This object can be used to construct retrieval transformers.

Initializes a new Anserini index.

Parameters:

path – The path to the index.

built()[source]

Checks if this index is built.

Return type:

bool

Returns:

True if this index is built, False otherwise.

indexer(*, fields='*', verbose=False)[source]

Provides an indexer for this index.

Return type:

Indexer

Parameters:
  • fields – The fields to index. If ‘*’ (default), all fields are indexed. Otherwise, the values of the fields provided in this argument are concatenated and indexed.

  • verbose – Whether to display a progress bar when indexing.
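As a usage sketch, built() and indexer() can be combined so that indexing runs only when the index has not been built yet. The helper name ensure_index and the document format (dictionaries with 'docno' and 'text' keys, following the usual PyTerrier convention) are assumptions, not part of the documented API.

```python
def ensure_index(index, docs):
    """Index `docs` only if `index` has not been built yet.

    `index` is expected to behave like AnseriniIndex: it needs built() and
    indexer(). `ensure_index` itself is a hypothetical helper.
    """
    if not index.built():
        # index() consumes an iterable of documents and returns the index.
        index.indexer(fields='*', verbose=False).index(docs)
    return index


# Intended use (requires pyterrier_anserini and Java; the path is a placeholder):
#   from pyterrier_anserini import AnseriniIndex
#   docs = [{'docno': 'd1', 'text': 'an example document'}]  # assumed format
#   index = ensure_index(AnseriniIndex('path/to/index'), docs)
```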

retriever(similarity, similarity_args=None, *, num_results=1000, include_fields=None, verbose=False)[source]

Provides a retriever that uses the specified similarity function.

Return type:

Transformer

Parameters:
  • similarity – The similarity function to use.

  • similarity_args – The arguments to the similarity function. Defaults to None (no arguments).

  • num_results – The number of results to return. Defaults to 1000.

  • include_fields – A list of the fields to include in the results. If None (default), no extra fields are included. If '*', all fields are included.

  • verbose – Output verbose logging. Defaults to False.

Returns:

A transformer that can be used to retrieve documents from this index.

bm25(*, k1=0.9, b=0.4, num_results=1000, include_fields=None, verbose=False)[source]

Provides a retriever that uses BM25 over this index.

Return type:

Transformer

Parameters:
  • k1 – The BM25 k1 parameter. Defaults to 0.9.

  • b – The BM25 b parameter. Defaults to 0.4.

  • num_results – The number of results to return. Defaults to 1000.

  • include_fields – A list of the fields to include in the results. If None (default), no extra fields are included. If '*', all fields are included.

  • verbose – Output verbose logging. Defaults to False.

Returns:

A transformer that can be used to retrieve documents from this index using BM25.
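For intuition about the k1 and b parameters, the sketch below scores a single query term with one common textbook formulation of BM25. It is not Anserini's or Lucene's implementation, which may differ in details such as how document lengths are stored; the function name and its arguments are illustrative.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=0.9, b=0.4):
    """Textbook BM25 contribution of one query term to one document."""
    # Inverse document frequency: rarer terms get a larger weight.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalisation: b=0 ignores document length, b=1 applies it fully.
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    # k1 controls how quickly repeated occurrences of the term saturate.
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)


# Same term frequency, different document lengths (illustrative numbers).
short_doc = bm25_term_score(tf=2, doc_len=50, avg_doc_len=100, n_docs=1000, doc_freq=10)
long_doc = bm25_term_score(tf=2, doc_len=400, avg_doc_len=100, n_docs=1000, doc_freq=10)
```

With the same term frequency, the shorter document receives the higher score; setting b=0 removes that length effect.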

qld(*, mu=1000.0, num_results=1000, include_fields=None, verbose=False)[source]

Provides a retriever that uses Query Likelihood with Dirichlet smoothing over this index.

Return type:

Transformer

Parameters:
  • mu – The Dirichlet smoothing parameter. Defaults to 1000.

  • num_results – The number of results to return. Defaults to 1000.

  • include_fields – A list of the fields to include in the results. If None (default), no extra fields are included. If '*', all fields are included.

  • verbose – Output verbose logging. Defaults to False.

Returns:

A transformer that can be used to retrieve documents from this index using qld.
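The role of mu can be seen in the textbook form of query likelihood with Dirichlet smoothing, sketched below. This is an illustration only: implementations such as Lucene's may use a rearranged variant, so scores returned by the retriever need not match these values. The function names and the dictionary-based inputs are assumptions.

```python
import math


def qld_term_score(tf, doc_len, collection_prob, mu=1000.0):
    """Log-probability of one query term under a Dirichlet-smoothed document model."""
    # The observed count is blended with mu pseudo-counts from the collection model.
    return math.log((tf + mu * collection_prob) / (doc_len + mu))


def qld_score(query_terms, doc_tf, doc_len, collection_probs, mu=1000.0):
    """Sum the per-term log-probabilities over all query terms."""
    return sum(
        qld_term_score(doc_tf.get(term, 0), doc_len, collection_probs[term], mu)
        for term in query_terms
    )
```

A larger mu pulls the document model towards the collection statistics, which matters most for short documents.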

tfidf(*, num_results=1000, include_fields=None, verbose=False)[source]

Provides a TF-IDF retriever over this index.

Return type:

Transformer

Parameters:
  • num_results – The number of results to return. Defaults to 1000.

  • include_fields – A list of the fields to include in the results. If None (default), no extra fields are included. If '*', all fields are included.

  • verbose – Output verbose logging. Defaults to False.

Returns:

A transformer that can be used to retrieve documents from this index using TF-IDF.

impact(*, num_results=1000, include_fields=None, verbose=False)[source]

Provides a retriever for pre-computed impact scores.

This is called “quantized” scoring in PISA or “TF” scoring in Terrier.

Return type:

Transformer

Parameters:
  • num_results – The number of results to return. Defaults to 1000.

  • include_fields – A list of the fields to include in the results. If None (default), no extra fields are included. If '*', all fields are included.

  • verbose – Output verbose logging. Defaults to False.

Returns:

A transformer that can be used to retrieve documents from this index using impact scores.
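With impact scoring, term weights are computed ahead of time and stored in the index, so scoring a document reduces to summing stored weights. A minimal sketch of that idea follows, assuming integer (quantized) weights held in plain dictionaries; the data layout and the function name are hypothetical and do not reflect how Anserini stores impacts.

```python
def impact_score(query_weights, doc_impacts):
    """Sum the stored impacts of the query terms, scaled by query-side weights."""
    return sum(
        weight * doc_impacts.get(term, 0)  # terms absent from the document add 0
        for term, weight in query_weights.items()
    )


doc_impacts = {'neural': 7, 'ranking': 3, 'index': 1}  # pre-computed at indexing time
query_weights = {'neural': 1, 'ranking': 2, 'missing': 1}
score = impact_score(query_weights, doc_impacts)
```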

reranker(similarity, similarity_args=None, *, verbose=False)[source]

Provides a reranker that uses the specified similarity function.

Return type:

Transformer

Parameters:
  • similarity – The similarity function to use.

  • similarity_args – The arguments to the similarity function. Defaults to None (no arguments).

  • verbose – Output verbose logging. Defaults to False.

Returns:

A transformer that can be used to score (rerank) documents from this index.

text_loader(fields='*', *, verbose=False)[source]

Provides a transformer that can be used to load the text from this index for each document.

Return type:

Transformer

Parameters:
  • fields – The fields to extract. When the literal ‘*’ (default), extracts all available fields.

  • verbose – Output verbose logging. Defaults to False.

Returns:

A transformer that can be used to load the text from this index for each document.
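Because the retriever, reranker, and text loader factory methods all return PyTerrier transformers, their results can be chained with PyTerrier's >> operator. The sketch below retrieves with BM25 and then attaches the stored text. The helper name bm25_with_text is hypothetical, and index is expected to be an AnseriniIndex.

```python
def bm25_with_text(index, num_results=10):
    """Compose BM25 retrieval with text loading into a single pipeline."""
    retriever = index.bm25(num_results=num_results)
    loader = index.text_loader('*')  # '*' requests all available fields
    return retriever >> loader


# Intended use (requires pyterrier_anserini and Java; the path is a placeholder):
#   from pyterrier_anserini import AnseriniIndex
#   pipeline = bm25_with_text(AnseriniIndex('path/to/index'))
#   results = pipeline.search('example query')
```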

Transformers and Indexers

The following transformer classes are returned by corresponding factory methods in AnseriniIndex.

class pyterrier_anserini.AnseriniIndexer(index, *, fields='*', verbose=False)[source]

An indexer for Anserini indexes.

Initializes the indexer.

Parameters:
  • index – The index to index to. If a string, an AnseriniIndex object is created for the path.

  • fields – The fields to index. If ‘*’ (default), all fields are indexed. Otherwise, the values of the fields provided in this argument are concatenated and indexed.

  • verbose – Whether to display a progress bar when indexing.

index(inp)[source]

Indexes the input documents to the index.

Return type:

Artifact

Parameters:

inp – An iterable of documents to index.

Returns:

The index that was indexed to.

class pyterrier_anserini.AnseriniRetriever(index, similarity='BM25', similarity_args=None, *, num_results=1000, include_fields=None, verbose=False)[source]

Retrieves from an Anserini index.

Constructs an AnseriniRetriever, which retrieves using pyserini.search.lucene.LuceneSearcher.

Parameters:
  • index – The Anserini index.

  • similarity – The similarity function to use.

  • similarity_args – model-specific arguments, like bm25.k1.

  • num_results – number of results to return. Default is 1000.

  • include_fields – a list of extra stored fields to include for each result. None indicates no extra fields.

  • verbose – whether to show a progress bar during retrieval.

transform(inp)[source]

Performs retrieval.

Return type:

DataFrame

Parameters:

inp – A pandas.DataFrame

Returns:

pandas.DataFrame with columns=[‘qid’, ‘query’, ‘docno’, ‘rank’, ‘score’]

class pyterrier_anserini.AnseriniReRanker(index, similarity, similarity_args=None, *, verbose=False)[source]

A transformer that scores (i.e., re-ranks) the provided documents from an Anserini index.

Initializes the scorer.

Parameters:
  • index – The index to score from. If a string, an AnseriniIndex object is created for the path.

  • similarity – The similarity function to use for scoring.

  • similarity_args – A dictionary of arguments to use for the similarity function.

  • verbose – Whether to display a progress bar when scoring.

transform(inp)[source]

Scores (i.e., re-ranks) documents from the index for each query in inp.

Return type:

DataFrame

Parameters:

inp – A DataFrame with a ‘query’ column containing queries and a ‘docno’ column containing document IDs.

Returns:

A DataFrame containing the scored documents, with any columns included in inp, plus the ‘score’ and ‘rank’ of the scored documents.
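To illustrate the output shape, the sketch below assigns a rank to each scored row within its query, best score first. It uses plain dictionaries rather than a DataFrame, and it assumes that ranks start at 0 and that rows are grouped by a 'qid' key. Both are assumptions based on common PyTerrier conventions, not statements about this class.

```python
from collections import defaultdict


def add_ranks(rows):
    """Return copies of `rows` with a 'rank' per query, highest score first."""
    by_query = defaultdict(list)
    for row in rows:
        by_query[row['qid']].append(row)
    ranked = []
    for query_rows in by_query.values():
        ordered = sorted(query_rows, key=lambda r: r['score'], reverse=True)
        for rank, row in enumerate(ordered):
            ranked.append({**row, 'rank': rank})  # existing keys are preserved
    return ranked


rows = [
    {'qid': 'q1', 'docno': 'd1', 'score': 1.5},
    {'qid': 'q1', 'docno': 'd2', 'score': 3.0},
    {'qid': 'q2', 'docno': 'd1', 'score': 0.2},
]
ranked = add_ranks(rows)
```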

class pyterrier_anserini.AnseriniTextLoader(index, fields, *, verbose=False)[source]

A transformer that provides access to text fields from an Anserini index.

Initializes the text loader.

Parameters:
  • index – The index to provide text from. If a string, an AnseriniIndex object is created for the path.

  • fields – The fields to load.

  • verbose – Whether to display a progress bar when providing text.

transform(inp)[source]

Provides text from the index for each document in inp.

Return type:

DataFrame

Parameters:

inp – A DataFrame with a ‘docno’ column containing document IDs.

Miscellaneous

enum pyterrier_anserini.AnseriniSimilarity(value)[source]

An enum representing the similarity functions available in Anserini.

Valid values are as follows:

bm25 = 'BM25'
qld = 'QLD'
tfidf = 'TFIDF'
impact = 'Impact'
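Assuming the class follows standard Python Enum semantics, a member can be looked up either by name or by string value. The stand-in below mirrors only the four documented values using the standard library. The class name SimilaritySketch is hypothetical; it is not the real AnseriniSimilarity and does not provide its methods.

```python
from enum import Enum


class SimilaritySketch(Enum):
    """Stand-in with the same member names and values as AnseriniSimilarity."""
    bm25 = 'BM25'
    qld = 'QLD'
    tfidf = 'TFIDF'
    impact = 'Impact'


by_value = SimilaritySketch('QLD')  # look up a member from its value
by_name = SimilaritySketch['qld']   # look up a member from its name
```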

The Enum and its members also have the following methods:

to_lucene_sim(sim_args=None)[source]

Provides a Lucene similarity object that represents this similarity function, including any provided arguments.

Parameters:

sim_args – The arguments of this similarity function. Default values will be used when they are not provided.

Returns:

A pyjnius binding to a org.apache.lucene.search.similarities.Similarity object.

pyterrier_anserini.set_version(version=None)[source]

Set the version of Anserini to use.

If version is None (default), the version of Anserini distributed with the pyserini package is used. Otherwise, the specified version is downloaded from Maven and used instead.

Note that this function must be run before Java is initialized.