pyterrier_anserini
API Documentation
AnseriniIndex is the primary class for interacting with Anserini indexes in PyTerrier. It acts as a factory for
creating transformers that can index, retrieve, and re-rank. It also has methods for providing information about the
index and for downloading pre-built indexes.
- class pyterrier_anserini.AnseriniIndex(path)
An Anserini index.
An Anserini index is a directory containing a Lucene index built with Anserini.
This object can be used to construct retrieval transformers.
Initializes a new Anserini index.
- Parameters:
path – The path to the index.
- built()
Checks if this index is built.
- Return type:
bool
- Returns:
True if this index is built, False otherwise.
- indexer(*, fields='*', verbose=False)
Provides an indexer for this index.
- Return type:
- Parameters:
fields – The fields to index. If ‘*’ (default), all fields are indexed. Otherwise, the values of the
fields provided in this argument are concatenated and indexed.
verbose – Whether to display a progress bar when indexing.
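As a rough illustration of how built() and indexer() might be combined, the sketch below indexes a small in-memory collection only when no index exists at the given path. The helper names, the document field names (docno, text), and the .index() call (the usual PyTerrier indexer entry point) are assumptions and are not taken from this reference.

```python
def sample_docs():
    """Yield toy documents; the 'docno' and 'text' field names are assumed."""
    yield {"docno": "d1", "text": "anserini builds lucene indexes"}
    yield {"docno": "d2", "text": "pyterrier composes retrieval pipelines"}


def build_if_needed(path, docs):
    """Index `docs` into `path` unless an index has already been built there."""
    # Imported lazily so this sketch can be loaded even where the package
    # (and its Java dependencies) is not installed.
    from pyterrier_anserini import AnseriniIndex

    index = AnseriniIndex(path)
    if not index.built():
        # fields='*' indexes all fields. The .index() call is assumed to
        # follow the standard PyTerrier indexer interface.
        index.indexer(fields="*", verbose=True).index(docs)
    return index
```

Calling build_if_needed("path/to/index", sample_docs()) a second time should skip indexing, since built() would then report an existing index.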
- retriever(similarity, similarity_args=None, *, num_results=1000, include_fields=None, verbose=False)
Provides a retriever that uses the specified similarity function.
- Return type:
- Parameters:
similarity – The similarity function to use.
similarity_args – The arguments to the similarity function. Defaults to None (no arguments).
num_results – The number of results to return. Defaults to 1000.
include_fields – A list of the fields to include in the results. If None (default), no extra fields are
included. If '*', all fields are included.
verbose – Output verbose logging. Defaults to False.
- Returns:
A transformer that can be used to retrieve documents from this index.
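A minimal sketch of calling retriever() with an explicit similarity and arguments follows. The 'bm25.k1' key style comes from the AnseriniRetriever parameter description; the 'bm25.b' key is assumed by analogy, and both helper names are hypothetical.

```python
def bm25_args(k1=0.9, b=0.4):
    """Build a similarity_args dict.

    'bm25.k1' mirrors the key style mentioned in this reference;
    'bm25.b' is an assumption made by analogy.
    """
    return {"bm25.k1": k1, "bm25.b": b}


def make_retriever(index_path, num_results=100):
    """Create a BM25 retriever with non-default parameters."""
    # Lazy import: keeps the sketch loadable without the package installed.
    from pyterrier_anserini import AnseriniIndex

    index = AnseriniIndex(index_path)
    # 'BM25' is the documented value of AnseriniSimilarity.bm25.
    return index.retriever("BM25", bm25_args(k1=1.2, b=0.75), num_results=num_results)
```

For plain BM25, the bm25() shortcut is likely the more convenient entry point; the generic form is mainly useful when the similarity is chosen at run time.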
- bm25(*, k1=0.9, b=0.4, num_results=1000, include_fields=None, verbose=False)
Provides a retriever that uses BM25 over this index.
- Return type:
- Parameters:
k1 – The BM25 k1 parameter. Defaults to 0.9.
b – The BM25 b parameter. Defaults to 0.4.
num_results – The number of results to return. Defaults to 1000.
include_fields – A list of the fields to include in the results. If None (default), no extra fields are
included. If '*', all fields are included.
verbose – Output verbose logging. Defaults to False.
- Returns:
A transformer that can be used to retrieve documents from this index using BM25.
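To give a feel for what k1 and b control, here is a sketch of a textbook BM25 term score using the defaults above. It is an illustration only: the function name and arguments are hypothetical, and Lucene's actual implementation may differ in details such as the idf form or constant factors.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, df, num_docs, k1=0.9, b=0.4):
    """Textbook BM25 contribution of one query term to one document.

    tf: term frequency in the document
    df: number of documents containing the term
    """
    # Adding 1 inside the log keeps the idf non-negative.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # b controls how strongly document length is normalised (b=0 disables it).
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences saturate.
    return idf * tf * (k1 + 1) / (tf + norm)
```

In this form, a larger k1 lets repeated occurrences of a term keep adding to the score for longer, while a larger b penalises documents that are longer than average more heavily.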
- qld(*, mu=1000.0, num_results=1000, include_fields=None, verbose=False)
Provides a retriever that uses Query Likelihood with Dirichlet smoothing over this index.
- Return type:
- Parameters:
mu – The Dirichlet smoothing parameter. Defaults to 1000.
num_results – The number of results to return. Defaults to 1000.
include_fields – A list of the fields to include in the results. If None (default), no extra fields are
included. If '*', all fields are included.
verbose – Output verbose logging. Defaults to False.
- Returns:
A transformer that can be used to retrieve documents from this index using qld.
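The role of mu can be seen in a sketch of the textbook Dirichlet-smoothed query likelihood for a single term. The function and argument names are hypothetical, and Lucene computes a rearranged form of this quantity, so absolute scores from the retriever need not match these values.

```python
import math


def qld_term_score(tf, doc_len, collection_prob, mu=1000.0):
    """Log-probability of one query term under a Dirichlet-smoothed document model.

    collection_prob: probability of the term in the whole collection
    mu: smoothing strength; larger values pull the estimate toward the
        collection model, smaller values toward the raw document counts
    """
    return math.log((tf + mu * collection_prob) / (doc_len + mu))
```

Because of the smoothing, a term that is absent from a document (tf = 0) still receives a finite score rather than negative infinity.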
- tfidf(*, num_results=1000, include_fields=None, verbose=False)
Provides a TF-IDF retriever over this index.
- Return type:
- Parameters:
num_results – The number of results to return. Defaults to 1000.
include_fields – A list of the fields to include in the results. If None (default), no extra fields are
included. If '*', all fields are included.
verbose – Output verbose logging. Defaults to False.
- Returns:
A transformer that can be used to retrieve documents from this index using TF-IDF.
- impact(*, num_results=1000, include_fields=None, verbose=False)
Provides a retriever for pre-computed impact scores.
This is called “quantized” scoring in PISA or “TF” scoring in Terrier.
- Return type:
- Parameters:
num_results – The number of results to return. Defaults to 1000.
include_fields – A list of the fields to include in the results. If None (default), no extra fields are
included. If '*', all fields are included.
verbose – Output verbose logging. Defaults to False.
- Returns:
A transformer that can be used to retrieve documents from this index using impact scores.
- reranker(similarity, similarity_args=None, *, verbose=False)
Provides a reranker that uses the specified similarity function.
- Return type:
- Parameters:
similarity – The similarity function to use.
similarity_args – The arguments to the similarity function. Defaults to None (no arguments).
verbose – Output verbose logging. Defaults to False.
- Returns:
A transformer that can be used to score (rerank) documents from this index.
- text_loader(fields='*', *, verbose=False)
Provides a transformer that can be used to load the text from this index for each document.
- Return type:
- Parameters:
fields – The fields to extract. When the literal ‘*’ (default), extracts all available fields.
verbose – Output verbose logging. Defaults to False.
- Returns:
A transformer that can be used to load the text from this index for each document.
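One plausible use of text_loader() is to attach stored text to retrieved results by chaining it after a retriever. The sketch below assumes the standard PyTerrier `>>` composition operator and the Transformer.search() helper; the function names are hypothetical.

```python
def bm25_with_text(index_path, num_results=10):
    """Build a pipeline that retrieves with BM25, then loads the text fields."""
    # Lazy import: keeps the sketch loadable without the package installed.
    from pyterrier_anserini import AnseriniIndex

    index = AnseriniIndex(index_path)
    # `>>` is PyTerrier's pipeline composition operator (assumed available
    # on the transformers returned by these factory methods).
    return index.bm25(num_results=num_results) >> index.text_loader(fields="*")


def search(index_path, query):
    """Run a single query through the pipeline and return the results."""
    return bm25_with_text(index_path).search(query)
```

Passing include_fields to the retriever may achieve a similar result in one step; the separate loader is mostly useful when text is needed only after later pipeline stages have filtered the results.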
Transformers and Indexers
The following transformer classes are returned by corresponding factory methods in AnseriniIndex.
- class pyterrier_anserini.AnseriniIndexer(index, *, fields='*', verbose=False)
An indexer for Anserini indexes.
Initializes the indexer.
- Parameters:
index – The index to index to. If a string, an AnseriniIndex object is created for the path.
fields – The fields to index. If ‘*’ (default), all fields are indexed. Otherwise, the values of the fields provided in this argument are concatenated and indexed.
verbose – Whether to display a progress bar when indexing.
- class pyterrier_anserini.AnseriniRetriever(index, similarity='BM25', similarity_args=None, *, num_results=1000, include_fields=None, verbose=False)
Retrieves from an Anserini index.
Constructs an AnseriniRetriever, which retrieves through pyserini.search.lucene.LuceneSearcher.
- Parameters:
index – The Anserini index.
similarity – The similarity function to use.
similarity_args – Model-specific arguments, such as bm25.k1.
num_results – The number of results to return. Defaults to 1000.
include_fields – A list of extra stored fields to include for each result. If None, no extra fields are included.
verbose – Whether to show a progress bar during retrieval.
- class pyterrier_anserini.AnseriniReRanker(index, similarity, similarity_args=None, *, verbose=False)
A transformer that scores (i.e., re-ranks) the provided documents from an Anserini index.
Initializes the scorer.
- Parameters:
index – The index to score from. If a string, an AnseriniIndex object is created for the path.
similarity – The similarity function to use for scoring.
similarity_args – A dictionary of arguments to use for the similarity function.
verbose – Whether to display a progress bar when scoring.
- transform(inp)
Scores (i.e., re-ranks) documents from the index for each query in inp.
- Return type:
DataFrame
- Parameters:
inp – A DataFrame with a ‘query’ column containing queries and a ‘docno’ column containing document IDs.
- Returns:
A DataFrame containing the scored documents, with any columns included in inp, plus the ‘score’ and ‘rank’ of the scored documents.
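The input expected by transform() can be sketched as follows: one row per query-document pair, converted to a DataFrame before scoring. The helper names are hypothetical, and the qid column is an assumption based on common PyTerrier conventions; this reference itself only names the 'query' and 'docno' columns.

```python
def rerank_rows(query, docnos, qid="q1"):
    """Build one input row per (query, document) pair.

    'query' and 'docno' are the documented columns; 'qid' is assumed
    from the usual PyTerrier data model.
    """
    return [{"qid": qid, "query": query, "docno": docno} for docno in docnos]


def rerank(index_path, rows):
    """Score the given rows with BM25 and return the resulting DataFrame."""
    # Lazy imports: keep the sketch loadable without pandas or the package.
    import pandas as pd
    from pyterrier_anserini import AnseriniReRanker, AnseriniSimilarity

    # A string path is accepted for `index`, per the parameter description.
    scorer = AnseriniReRanker(index_path, AnseriniSimilarity.bm25)
    return scorer.transform(pd.DataFrame(rows))
```

The returned frame should carry the input columns plus 'score' and 'rank', as described above.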
- class pyterrier_anserini.AnseriniTextLoader(index, fields, *, verbose=False)
A transformer that provides access to text fields from an Anserini index.
Initializes the text loader.
- Parameters:
index – The index to provide text from. If a string, an AnseriniIndex object is created for the path.
fields – The fields to load.
verbose – Whether to display a progress bar when providing text.
Miscellaneous
- enum pyterrier_anserini.AnseriniSimilarity(value)
An enum representing the similarity functions available in Anserini.
Valid values are as follows:
- bm25 = 'BM25'
- qld = 'QLD'
- tfidf = 'TFIDF'
- impact = 'Impact'
The Enum and its members also have the following methods:
- to_lucene_sim(sim_args=None)
Provides a Lucene similarity object that represents this similarity function, including provided arguments.
- Parameters:
sim_args – The arguments of this similarity function. Default values will be used when they are not provided.
- Returns:
A pyjnius binding to an org.apache.lucene.search.similarities.Similarity object.
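Because the enum values are plain strings, a similarity can be looked up by value in the standard Python Enum manner. The stand-in enum below only mirrors the documented members so that the lookup can be shown without Java; it is not the library's class, and whether the library normalises strings in the same way is an assumption.

```python
from enum import Enum


class SimilaritySketch(Enum):
    """Stand-in mirroring the documented members; NOT the library's class."""
    bm25 = "BM25"
    qld = "QLD"
    tfidf = "TFIDF"
    impact = "Impact"


def parse_similarity(value):
    """Accept either an enum member or its string value."""
    if isinstance(value, SimilaritySketch):
        return value
    # Standard Enum lookup by value; it is case-sensitive and raises
    # ValueError for unknown strings.
    return SimilaritySketch(value)
```

With this lookup, "QLD" resolves to the qld member, whereas a lowercase "qld" would be rejected.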
- pyterrier_anserini.set_version(version=None)
Set the version of Anserini to use.
If version is None (default), the version of Anserini distributed with the pyserini package is used. Otherwise, the specified version is downloaded from Maven and used instead.
Note that this function must be run before Java is initialized.