pyterrier_anserini API Documentation
AnseriniIndex is the primary class for interacting with Anserini indexes in PyTerrier. It acts as a factory for
creating transformers that can index, retrieve, and re-rank. It also has methods for providing information about the
index and for downloading pre-built indexes.
- class pyterrier_anserini.AnseriniIndex(path)
An Anserini index.
An Anserini index is a directory containing a Lucene index built with Anserini.
This object can be used to construct retrieval transformers.
Initializes a new Anserini index.
- Parameters:
path (str) – The path to the index.
- built()
Checks if this index is built.
- Return type:
bool
- Returns:
True if this index is built, False otherwise.
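As a rough usage sketch (not taken from the package itself), the helper below opens an index and checks built() before handing it back. The name open_built_index and its index_cls parameter are hypothetical. index_cls exists only so the sketch can be exercised without a Java runtime; when it is omitted, the documented AnseriniIndex class is imported.

```python
def open_built_index(path, index_cls=None):
    """Return an index object for `path`, or raise if it has not been built.

    `index_cls` is a hypothetical injection point for testing; when omitted,
    the documented AnseriniIndex class is imported and used.
    """
    if index_cls is None:
        # Imported lazily so that defining this helper does not require Java.
        from pyterrier_anserini import AnseriniIndex
        index_cls = AnseriniIndex
    index = index_cls(path)
    if not index.built():
        raise FileNotFoundError(f"No built Anserini index found at {path!r}")
    return index
```

With the package installed, open_built_index('path/to/index').bm25() would then give a BM25 retriever; the path here is a placeholder.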
- indexer(*, fields='*', store_positions=False, verbose=False)
Provides an indexer for this index.
- Return type:
- Parameters:
fields (List[str] | str | Literal['*']) – The fields to index. If ‘*’ (default), all fields are indexed. Otherwise, the values of the fields provided in this argument are concatenated and indexed.
store_positions (bool) – Whether to store positions in the index. This is required for phrase queries and proximity queries, but increases index size.
verbose (bool) – Whether to display a progress bar when indexing.
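To illustrate what concatenating fields means here, the following is a minimal sketch in plain Python. It is not the indexer's actual code: the helper name concat_fields, the single-space separator, and the exclusion of the docno column when fields is '*' are all assumptions made for illustration.

```python
def concat_fields(doc, fields="*", id_field="docno"):
    """Build the text that would be indexed for one document (illustrative only)."""
    if fields == "*":
        # Assumption: every field except the document identifier is indexed.
        names = [name for name in doc if name != id_field]
    elif isinstance(fields, str):
        names = [fields]
    else:
        names = list(fields)
    # Assumption: field values are joined with a single space.
    return " ".join(str(doc[name]) for name in names)


doc = {
    "docno": "d1",
    "title": "Form W-9",
    "body": "Request for taxpayer identification number",
}
print(concat_fields(doc, ["title", "body"]))
print(concat_fields(doc, "title"))
```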
- retriever(similarity, similarity_args=None, *, num_results=1000, include_fields=None, verbose=False)
Provides a retriever that uses the specified similarity function.
- Return type:
- Parameters:
similarity (str | AnseriniSimilarity) – The similarity function to use.
similarity_args (Dict[str, Any] | None) – The arguments to the similarity function. Defaults to None (no arguments).
num_results (int) – The number of results to return. Defaults to 1000.
include_fields (List[str] | str | Literal['*'] | None) – A list of the fields to include in the results. If None (default), no extra fields are included. If '*', all fields are included.
verbose (bool) – Output verbose logging. Defaults to False.
- Returns:
A transformer that can be used to retrieve documents from this index.
- bm25(*, k1=0.9, b=0.4, num_results=1000, include_fields=None, verbose=False)
Provides a retriever that uses BM25 over this index.
- Return type:
- Parameters:
k1 (float) – The BM25 k1 parameter. Defaults to 0.9.
b (float) – The BM25 b parameter. Defaults to 0.4.
num_results (int) – The number of results to return. Defaults to 1000.
include_fields (List[str] | str | Literal['*'] | None) – A list of the fields to include in the results. If None (default), no extra fields are included. If '*', all fields are included.
verbose (bool) – Output verbose logging. Defaults to False.
- Returns:
A transformer that can be used to retrieve documents from this index using BM25.
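For orientation, the roles of k1 and b can be seen in the textbook BM25 term score, sketched below in plain Python. This is the standard formulation and may differ in details (such as constant factors or document length encoding) from what Lucene computes; the function name and argument names are illustrative.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=0.9, b=0.4):
    """Textbook BM25 contribution of a single query term to one document."""
    # Inverse document frequency: rarer terms contribute more.
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly the score is normalized by document length.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences of the term saturate.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)
```

A larger k1 lets repeated occurrences keep adding to the score for longer, and b=0 disables length normalization entirely.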
- qld(*, mu=1000.0, num_results=1000, include_fields=None, verbose=False)
Provides a retriever that uses Query Likelihood with Dirichlet smoothing over this index.
- Return type:
- Parameters:
mu (float) – The Dirichlet smoothing parameter. Defaults to 1000.
num_results (int) – The number of results to return. Defaults to 1000.
include_fields (List[str] | str | Literal['*'] | None) – A list of the fields to include in the results. If None (default), no extra fields are included. If '*', all fields are included.
verbose (bool) – Output verbose logging. Defaults to False.
- Returns:
A transformer that can be used to retrieve documents from this index using qld.
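The effect of mu can be seen in the textbook query likelihood score with Dirichlet smoothing, sketched below. This is the standard formula rather than a copy of Anserini's or Lucene's code, which may compute a differently normalized variant; the function name and the dictionary-based inputs are assumptions for illustration.

```python
import math


def qld_score(query_terms, doc_tf, doc_len, collection_prob, mu=1000.0):
    """Log query likelihood of a document under Dirichlet smoothing (textbook form)."""
    score = 0.0
    for term in query_terms:
        # Probability of the term in the whole collection (the background model).
        p_coll = collection_prob[term]
        # Document counts are smoothed with mu pseudo-counts from the background model.
        smoothed = (doc_tf.get(term, 0) + mu * p_coll) / (doc_len + mu)
        score += math.log(smoothed)
    return score
```

Larger values of mu pull every document's estimate toward the collection model, which reduces the influence of the document's own term counts.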
- tfidf(*, num_results=1000, include_fields=None, verbose=False)
Provides a TF-IDF retriever over this index.
- Return type:
- Parameters:
num_results (int) – The number of results to return. Defaults to 1000.
include_fields (List[str] | str | Literal['*'] | None) – A list of the fields to include in the results. If None (default), no extra fields are included. If '*', all fields are included.
verbose (bool) – Output verbose logging. Defaults to False.
- Returns:
A transformer that can be used to retrieve documents from this index using TF-IDF.
- impact(*, num_results=1000, include_fields=None, verbose=False)
Provides a retriever for pre-computed impact scores.
This is called “quantized” scoring in PISA or “TF” scoring in Terrier.
- Return type:
- Parameters:
num_results (int) – The number of results to return. Defaults to 1000.
include_fields (List[str] | str | Literal['*'] | None) – A list of the fields to include in the results. If None (default), no extra fields are included. If '*', all fields are included.
verbose (bool) – Output verbose logging. Defaults to False.
- Returns:
A transformer that can be used to retrieve documents from this index using impact scores.
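With impact scoring, term weights are computed ahead of time and stored in the index, so query-time scoring reduces to summing them. The sketch below illustrates that idea with a toy in-memory posting structure; the data layout and names are assumptions and do not reflect Anserini's internals.

```python
def impact_scores(query_terms, postings):
    """Rank documents by the sum of stored impacts of the query terms."""
    scores = {}
    for term in query_terms:
        # Each posting maps a document ID to its pre-computed impact for this term.
        for docno, impact in postings.get(term, {}).items():
            scores[docno] = scores.get(docno, 0) + impact
    # Highest total impact first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


postings = {
    "irs": {"d1": 3, "d2": 1},
    "w9": {"d1": 2, "d3": 4},
}
ranking = impact_scores(["irs", "w9"], postings)
print(ranking)
```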
- reranker(similarity, similarity_args=None, *, verbose=False)
Provides a reranker that uses the specified similarity function.
- Return type:
- Parameters:
similarity (str | AnseriniSimilarity) – The similarity function to use.
similarity_args (Dict[str, Any] | None) – The arguments to the similarity function. Defaults to None (no arguments).
verbose (bool) – Output verbose logging. Defaults to False.
- Returns:
A transformer that can be used to score (rerank) documents from this index.
- text_loader(fields='*', *, verbose=False)
Provides a transformer that can be used to load the text from this index for each document.
- Return type:
- Parameters:
fields (List[str] | str | Literal['*']) – The fields to extract. When the literal ‘*’ (default), extracts all available fields.
verbose (bool) – Output verbose logging. Defaults to False.
- Returns:
A transformer that can be used to load the text from this index for each document.
Transformers and Indexers
The following transformer classes are returned by corresponding factory methods in
AnseriniIndex.
- class pyterrier_anserini.AnseriniIndexer(index, *, fields='*', store_positions=False, verbose=False)
An indexer for Anserini indexes.
Initializes the indexer.
- Parameters:
index (AnseriniIndex | str) – The index to index to. If a string, an AnseriniIndex object is created for the path.
fields (List[str] | Literal['*']) – The fields to index. If ‘*’ (default), all fields are indexed. Otherwise, the values of the fields provided in this argument are concatenated and indexed.
store_positions (bool) – Whether to store positions in the index. This is required for phrase queries and proximity queries, but increases index size.
verbose (bool) – Whether to display a progress bar when indexing.
- class pyterrier_anserini.AnseriniRetriever(index, similarity='BM25', similarity_args=None, *, num_results=1000, include_fields=None, verbose=False)
Retrieves from an Anserini index.
Constructs an AnseriniRetriever, which retrieves using pyserini.search.lucene.LuceneSearcher.
- Parameters:
index (AnseriniIndex | str) – The Anserini index.
similarity (AnseriniSimilarity | str) – The similarity function to use.
similarity_args (Dict[str, Any]) – Model-specific arguments, like bm25.k1.
num_results (int) – The number of results to return. Defaults to 1000.
include_fields (List[str] | None) – A list of extra stored fields to include for each result. None indicates no extra fields.
verbose (bool) – Whether to show a progress bar during retrieval.
- class pyterrier_anserini.AnseriniReRanker(index, similarity, similarity_args=None, *, verbose=False)
A transformer that scores (i.e., re-ranks) the provided documents from an Anserini index.
Initializes the scorer.
- Parameters:
index (AnseriniIndex | str) – The index to score from. If a string, an AnseriniIndex object is created for the path.
similarity (str | AnseriniSimilarity) – The similarity function to use for scoring.
similarity_args (Dict) – A dictionary of arguments to use for the similarity function.
verbose (bool) – Whether to display a progress bar when scoring.
- transform(inp)
Scores (i.e., re-ranks) documents from the index for each query in inp.
- Return type:
DataFrame
- Parameters:
inp (DataFrame) – A DataFrame with a ‘query’ column containing queries and a ‘docno’ column containing document IDs.
- Returns:
A DataFrame containing the scored documents, with any columns included in inp, plus the ‘score’ and ‘rank’ of the scored documents.
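The input and output shape described above can be illustrated without an index. The sketch below uses plain dictionaries as a stand-in for DataFrame rows and a caller-supplied scoring function in place of the real similarity. The helper name add_scores_and_ranks, grouping by the query text, and ranks starting at 0 are assumptions of this sketch, not documented behavior.

```python
def add_scores_and_ranks(rows, score_fn):
    """Attach 'score' and 'rank' to each row, keeping all existing columns."""
    # Copy each row so that the caller's input is left unmodified.
    scored = [dict(row, score=score_fn(row["query"], row["docno"])) for row in rows]
    by_query = {}
    for row in scored:
        by_query.setdefault(row["query"], []).append(row)
    out = []
    for group in by_query.values():
        # Within each query, the highest-scoring document gets rank 0 (assumed convention).
        group.sort(key=lambda r: r["score"], reverse=True)
        for rank, row in enumerate(group):
            row["rank"] = rank
            out.append(row)
    return out


rows = [
    {"query": "irs w9", "docno": "d1", "text": "a"},
    {"query": "irs w9", "docno": "d2", "text": "b"},
]
toy_scores = {"d1": 0.5, "d2": 1.5}  # made-up scores for illustration
result = add_scores_and_ranks(rows, lambda query, docno: toy_scores[docno])
```

Extra columns such as text pass through untouched, which mirrors the "any columns included in inp" wording above.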
- class pyterrier_anserini.AnseriniTextLoader(index, fields, *, verbose=False)
A transformer that provides access to text fields from an Anserini index.
Initializes the text loader.
- Parameters:
index (AnseriniIndex | str) – The index to provide text from. If a string, an AnseriniIndex object is created for the path.
fields (List[str]) – The fields to load.
verbose (bool) – Whether to display a progress bar when providing text.
Miscellaneous
- enum pyterrier_anserini.AnseriniSimilarity(value)
An enum representing the similarity functions available in Anserini.
Valid values are as follows:
- bm25 = 'BM25'
- qld = 'QLD'
- tfidf = 'TFIDF'
- impact = 'Impact'
The Enum and its members also have the following methods:
- to_lucene_sim(sim_args=None)
Provides a Lucene similarity object that represents this similarity function, including the provided arguments.
- Parameters:
sim_args (Dict[str, float] | None) – The arguments of this similarity function. Default values will be used when they are not provided.
- Returns:
A pyjnius binding to an org.apache.lucene.search.similarities.Similarity object.
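Because the similarity parameters accept either a string or an enum member, it can help to see how the two relate. The sketch below defines a stand-in enum, SimilaritySketch, that mirrors the values listed above using Python's standard enum module. It is not the package's class, and the resolve helper is hypothetical; whether the real package resolves strings by enum value in the same way is an assumption.

```python
from enum import Enum


class SimilaritySketch(Enum):
    """Stand-in mirroring the documented AnseriniSimilarity values."""
    bm25 = "BM25"
    qld = "QLD"
    tfidf = "TFIDF"
    impact = "Impact"


def resolve(similarity):
    """Accept either a member or its string value and return the member."""
    if isinstance(similarity, SimilaritySketch):
        return similarity
    # Enum lookup by value raises ValueError for unknown names.
    return SimilaritySketch(similarity)
```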
- pyterrier_anserini.set_version(version=None)
Set the version of Anserini to use.
If version is None (default), the version of Anserini distributed with the pyserini package is used. Otherwise, the specified version is downloaded from Maven and used instead.
Note that this function must be run before Java is initialized.
- Parameters:
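version (str | None) – The Anserini version to use. If None (default), the version distributed with the pyserini package is used.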
- pyterrier_anserini.gss_to_lucene(query, analyzer, *, content_field='contents', mode='strict', boost_rate=10.0)
Parses a query using a subset of Google Search Syntax (GSS) conventions into a Lucene query object.
- Return type:
object
- Parameters:
query (str)
analyzer (object)
content_field (str)
mode (Literal['strict', 'boost'])
boost_rate (float)
- This implementation supports the following GSS features:
literal query terms: irs w9
terms that must match: irs "w9" (here w9 must occur in the document, but irs may or may not occur)
phrase matching: "irs w9" (here irs and w9 must occur in the document, and they must be adjacent to each other in the specified order)
negative term matching: irs w9 -1040 (1040 must NOT occur in the document, but irs and w9 may or may not occur)
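The four features above can be sketched as a small classifier in plain Python. This is an illustration of the syntax only, not the parser used by gss_to_lucene: the function name classify_gss, the regular expression, and the clause labels are assumptions, no analyzer is applied, and constructs not listed above (such as a negated phrase) are not handled.

```python
import re

# A token is either a double-quoted span or a run of non-whitespace characters.
_TOKEN = re.compile(r'"[^"]+"|\S+')


def classify_gss(query):
    """Split a query into (kind, text) pairs: should, must, phrase, or must_not."""
    clauses = []
    for token in _TOKEN.findall(query):
        if token.startswith('"') and token.endswith('"') and len(token) > 2:
            inner = token[1:-1].strip()
            if not inner:
                continue  # quotes around whitespace only carry no term
            # Several quoted words form a phrase; one quoted word is a required term.
            kind = "phrase" if len(inner.split()) > 1 else "must"
            clauses.append((kind, inner))
        elif token.startswith("-") and len(token) > 1:
            clauses.append(("must_not", token[1:]))
        else:
            # Everything else, including unsupported operators, is a normal term.
            clauses.append(("should", token))
    return clauses


print(classify_gss('irs "w9" -1040'))
```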
- The method should behave somewhat similarly to Google Search, but a few known differences are present:
Many features are missing, e.g., domain filtering (site:irs.gov). Unsupported operators are treated as normal query terms.
The parsing almost certainly does not work in exactly the same way, especially for edge cases.
Quoted matches are exact in Google Search; here the analyzer is applied (e.g., with stemming).
The same holds for negations: the analyzer is applied.
- There are two “modes”:
“strict”: the query is parsed as described above, with MUST and MUST_NOT clauses. The “must” terms are required to match, while the “should” terms are not. This best matches actual Google Search behavior, but may result in overly strict matching in some cases.
“boost”: the query is parsed in a more lenient way, where all terms are added as SHOULD clauses, and the “must” and “phrase” matches are boosted by a factor of boost_rate. This means that documents that match those terms will be scored higher, but documents that don’t match them may still be retrieved if they match other SHOULD terms.
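The difference between the two modes can be sketched as a mapping from already-classified terms to clause types and boosts. The code below is a plain-Python illustration, not the library's implementation: the function name to_clauses and the tuple layout are hypothetical, and keeping negated terms as MUST_NOT in "boost" mode is an assumption, since the description above does not say how negations are treated there.

```python
def to_clauses(parsed, mode="strict", boost_rate=10.0):
    """Map (kind, text) pairs to (occur, text, boost) triples for the given mode."""
    if mode not in ("strict", "boost"):
        raise ValueError(f"unknown mode: {mode!r}")
    clauses = []
    for kind, text in parsed:
        if kind == "must_not":
            # Assumption: negations stay MUST_NOT in both modes.
            clauses.append(("MUST_NOT", text, 1.0))
        elif kind in ("must", "phrase"):
            if mode == "strict":
                clauses.append(("MUST", text, 1.0))
            else:
                # In boost mode the term is optional but weighted by boost_rate.
                clauses.append(("SHOULD", text, boost_rate))
        else:
            clauses.append(("SHOULD", text, 1.0))
    return clauses


parsed = [("should", "irs"), ("must", "w9"), ("must_not", "1040")]
print(to_clauses(parsed, mode="strict"))
print(to_clauses(parsed, mode="boost"))
```

In "boost" mode a document matching only irs can still be retrieved, whereas in "strict" mode it would be excluded for lacking w9.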