Terrier API Reference¶
This page provides API documentation for the Terrier integration in PyTerrier.
High-Level API¶
TerrierIndex provides a high-level API. We recommend it for
most use cases.
- class pyterrier.terrier.TerrierIndex(path, *, memory=False, _index_ref=None, _index_obj=None)[source]¶
Represents a Terrier index.
A Terrier index is a sparse inverted index structure that supports a variety of operations. It can be used to create transformers that perform retrieval, re-ranking, pseudo-relevance feedback, and other operations.
- Parameters:
path (str | Path) – The path to the index on disk.
memory (bool) – Whether to load the index fully into memory.
_index_ref (object) – For internal use only. The Java IndexRef object for this index.
_index_obj (object) – For internal use only. The Java Index object for this index.
Retrieval¶
- retriever(model, model_args=None, *, num_results=1000, include_fields=None, threads=1, verbose=False)[source]¶
Creates a retriever transformer for this index.
- Return type:
- Parameters:
model (TerrierModel | str) – The weighting model to use for scoring.
model_args (Dict[str, Any] | None) – The arguments to pass to the weighting model.
num_results (int) – The maximum number of results to return per query.
include_fields (List[str] | None) – The metadata fields to return for each search result.
threads (int) – The number of threads to use during retrieval.
verbose (bool) – Whether to show progress information during retrieval.
Example Pipeline:
index.retriever('BM25', num_results=10)
Terrier retrievers can also perform re-ranking when they receive a result frame as input:
# As a re-ranker
index.retriever('BM25', num_results=10)
- bm25(*, k1=1.2, b=0.75, num_results=1000, include_fields=None, threads=1, verbose=False)[source]¶
Creates a BM25 retriever for this index.
- Return type:
- Parameters:
k1 (float) – BM25’s k1 parameter, which controls TF saturation.
b (float) – BM25’s b parameter, which controls the length penalty.
num_results (int) – The maximum number of results to return per query.
include_fields (List[str] | None) – The metadata fields to return for each search result.
threads (int) – The number of threads to use during retrieval.
verbose (bool) – Whether to show progress information during retrieval.
Example Pipeline:
index.bm25()
Citation
Robertson et al. Okapi at TREC-3. TREC 1994.
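To illustrate what k1 and b control, the following is a minimal sketch of the textbook BM25 term score. It is not Terrier's implementation, which may use a different IDF variant and constants; the function name bm25_term_score and the toy statistics are assumptions made for illustration.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one query term in one document (illustrative only)."""
    # Inverse document frequency: rarer terms get a larger weight.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b interpolates between no length penalty (b=0) and full normalisation (b=1).
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences of the term saturate.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)

# Toy collection statistics (assumed values).
stats = dict(avg_doc_len=100, n_docs=1000, doc_freq=50)

low = bm25_term_score(tf=1, doc_len=100, **stats)
high = bm25_term_score(tf=10, doc_len=100, **stats)
long_doc = bm25_term_score(tf=1, doc_len=400, **stats)
no_len_penalty = bm25_term_score(tf=1, doc_len=400, b=0.0, **stats)
```

Ten occurrences score higher than one, but far less than ten times higher (saturation), and the same term frequency in a longer document scores lower unless b is set to 0.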
- dph(*, num_results=1000, include_fields=None, threads=1, verbose=False)[source]¶
Creates a DPH retriever for this index.
- Return type:
- Parameters:
num_results (int) – The maximum number of results to return per query.
include_fields (List[str] | None) – The metadata fields to return for each search result.
threads (int) – The number of threads to use during retrieval.
verbose (bool) – Whether to show progress information during retrieval.
Example Pipeline:
index.dph()
Citation
Amati. Frequentist and Bayesian Approach to Information Retrieval. ECIR 2006. [link]
@inproceedings{DBLP:conf/ecir/Amati06, author = {Giambattista Amati}, editor = {Mounia Lalmas and Andy MacFarlane and Stefan M. R{\"{u}}ger and Anastasios Tombros and Theodora Tsikrika and Alexei Yavlinsky}, title = {Frequentist and Bayesian Approach to Information Retrieval}, booktitle = {Advances in Information Retrieval, 28th European Conference on {IR} Research, {ECIR} 2006, London, UK, April 10-12, 2006, Proceedings}, series = {Lecture Notes in Computer Science}, volume = {3936}, pages = {13--24}, publisher = {Springer}, year = {2006}, url = {https://doi.org/10.1007/11735106\_3}, doi = {10.1007/11735106\_3}, timestamp = {Tue, 14 May 2019 10:00:37 +0200}, biburl = {https://dblp.org/rec/conf/ecir/Amati06.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- pl2(*, c=1.0, num_results=1000, include_fields=None, threads=1, verbose=False)[source]¶
Creates a PL2 retriever for this index.
- Return type:
- Parameters:
c (float) – PL2’s c parameter, which controls the length normalization.
num_results (int) – The maximum number of results to return per query.
include_fields (List[str] | None) – The metadata fields to return for each search result.
threads (int) – The number of threads to use during retrieval.
verbose (bool) – Whether to show progress information during retrieval.
Example Pipeline:
index.pl2()
Citation
Amati and Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 2002. [link]
@article{DBLP:journals/tois/AmatiR02, author = {Gianni Amati and C. J. van Rijsbergen}, title = {Probabilistic models of information retrieval based on measuring the divergence from randomness}, journal = {{ACM} Trans. Inf. Syst.}, volume = {20}, number = {4}, pages = {357--389}, year = {2002}, url = {http://doi.acm.org/10.1145/582415.582416}, doi = {10.1145/582415.582416}, timestamp = {Tue, 01 Jun 2021 09:58:08 +0200}, biburl = {https://dblp.org/rec/journals/tois/AmatiR02.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
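In the divergence-from-randomness framework, the c parameter enters through the term-frequency normalisation usually called Normalisation 2. The sketch below shows only that normalisation step, not the full PL2 score and not Terrier's code; the function name normalised_tf is an assumption.

```python
import math

def normalised_tf(tf, doc_len, avg_doc_len, c=1.0):
    """Normalisation 2: rescale tf according to document length (illustrative only)."""
    return tf * math.log2(1.0 + c * avg_doc_len / doc_len)

average = normalised_tf(tf=3, doc_len=100, avg_doc_len=100)        # average-length document
longer = normalised_tf(tf=3, doc_len=400, avg_doc_len=100)         # longer document
larger_c = normalised_tf(tf=3, doc_len=400, avg_doc_len=100, c=8.0)
```

With c=1, a document of exactly average length keeps its raw term frequency; longer documents are discounted, and changing c changes how strongly document length affects the normalised frequency.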
- dirichlet_lm(*, mu=2500.0, num_results=1000, include_fields=None, threads=1, verbose=False)[source]¶
Creates a Dirichlet Language Model retriever for this index.
- Return type:
- Parameters:
mu (float) – Dirichlet LM’s mu parameter, which controls the strength of the prior.
num_results (int) – The maximum number of results to return per query.
include_fields (List[str] | None) – The metadata fields to return for each search result.
threads (int) – The number of threads to use during retrieval.
verbose (bool) – Whether to show progress information during retrieval.
Example Pipeline:
index.dirichlet_lm()
Citation
Zhai and Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst. 2004. [link]
@article{DBLP:journals/tois/ZhaiL04, author = {ChengXiang Zhai and John D. Lafferty}, title = {A study of smoothing methods for language models applied to information retrieval}, journal = {{ACM} Trans. Inf. Syst.}, volume = {22}, number = {2}, pages = {179--214}, year = {2004}, url = {https://doi.org/10.1145/984321.984322}, doi = {10.1145/984321.984322}, timestamp = {Tue, 06 Nov 2018 12:51:56 +0100}, biburl = {https://dblp.org/rec/journals/tois/ZhaiL04.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
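As a rough illustration of what "strength of the prior" means, here is a minimal sketch of Dirichlet-smoothed term probability as described by Zhai and Lafferty. The function name dirichlet_prob and the toy numbers are assumptions; Terrier's scoring code may arrange the computation differently.

```python
def dirichlet_prob(tf, doc_len, coll_prob, mu=2500.0):
    """Dirichlet-smoothed probability of a term in a document (illustrative only).

    coll_prob is the term's relative frequency in the whole collection.
    """
    return (tf + mu * coll_prob) / (doc_len + mu)

# A term that occurs 5 times in a 100-token document and is rare in the collection.
unsmoothed = dirichlet_prob(tf=5, doc_len=100, coll_prob=0.001, mu=0.0)
default = dirichlet_prob(tf=5, doc_len=100, coll_prob=0.001)
heavy = dirichlet_prob(tf=5, doc_len=100, coll_prob=0.001, mu=1e9)
```

With mu=0 the estimate is the plain document frequency ratio; as mu grows, the estimate is pulled towards the collection probability.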
- hiemstra_lm(*, Lambda=0.15, num_results=1000, include_fields=None, threads=1, verbose=False)[source]¶
Creates a Hiemstra Language Model retriever for this index.
- Return type:
- Parameters:
Lambda (float) – Hiemstra LM’s lambda parameter, which controls the interpolation weight.
num_results (int) – The maximum number of results to return per query.
include_fields (List[str] | None) – The metadata fields to return for each search result.
threads (int) – The number of threads to use during retrieval.
verbose (bool) – Whether to show progress information during retrieval.
Example Pipeline:
index.hiemstra_lm()
Citation
Hiemstra. Using Language Models for Information Retrieval. 2001. [link]
@phdthesis{DBLP:phd/basesearch/Hiemstra01, author = {Djoerd Hiemstra}, title = {Using Language Models for Information Retrieval}, school = {University of Twente, Enschede, Netherlands}, year = {2001}, url = {http://eprints.eemcs.utwente.nl/6563/}, timestamp = {Thu, 18 May 2017 09:17:27 +0200}, biburl = {https://dblp.org/rec/phd/basesearch/Hiemstra01.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- tf(*, num_results=1000, include_fields=None, threads=1, verbose=False)[source]¶
Creates a raw Term Frequency (TF) retriever for this index.
This is typically useful for retrieving from learned sparse models.
- Return type:
- Parameters:
num_results (int) – The maximum number of results to return per query.
include_fields (List[str] | None) – The metadata fields to return for each search result.
threads (int) – The number of threads to use during retrieval.
verbose (bool) – Whether to show progress information during retrieval.
Example Pipeline:
index.tf()
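Raw TF scoring can be pictured as summing the stored frequencies of the query terms. The sketch below assumes documents are already represented as term-to-frequency dictionaries (as a learned sparse encoder might produce); the helper tf_score and the toy data are hypothetical, not part of PyTerrier.

```python
def tf_score(query_terms, doc_toks):
    """Sum the stored frequency of each query term in the document."""
    return sum(doc_toks.get(term, 0) for term in query_terms)

# Toy pre-tokenised documents: term -> frequency (or learned weight).
docs = {
    "d1": {"chemic": 3, "reaction": 1},
    "d2": {"chemic": 1, "brother": 2},
}
query = ["chemic", "reaction"]

# Rank documents by descending score.
ranking = sorted(docs, key=lambda docno: tf_score(query, docs[docno]), reverse=True)
```

Because no IDF or length normalisation is applied, the ranking depends entirely on the weights stored in the index, which is why this model suits indexes whose frequencies were produced by a learned sparse encoder.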
- tf_idf(*, k1=1.2, b=0.75, num_results=1000, include_fields=None, threads=1, verbose=False)[source]¶
Creates a TF-IDF retriever for this index.
This retriever uses the Robertson formulation of TF and the Sparck Jones formulation of IDF.
- Return type:
- Parameters:
k1 (float) – TF-IDF’s k1 parameter, which controls TF saturation.
b (float) – TF-IDF’s b parameter, which controls the length penalty.
num_results (int) – The maximum number of results to return per query.
include_fields (List[str] | None) – The metadata fields to return for each search result.
threads (int) – The number of threads to use during retrieval.
verbose (bool) – Whether to show progress information during retrieval.
Example Pipeline:
index.tf_idf()
Citation
Robertson et al. Okapi at TREC-7: Automatic Ad Hoc, Filtering, VLC and Interactive. TREC 1998.
@inproceedings{DBLP:conf/trec/RobertsonWB98, author = {Stephen E. Robertson and Steve Walker and Micheline Hancock{-}Beaulieu}, editor = {Ellen M. Voorhees and Donna K. Harman}, title = {Okapi at {TREC-7:} Automatic Ad Hoc, Filtering, {VLC} and Interactive}, booktitle = {Proceedings of The Seventh Text REtrieval Conference, {TREC} 1998, Gaithersburg, Maryland, USA, November 9-11, 1998}, series = {{NIST} Special Publication}, volume = {500-242}, pages = {199--210}, publisher = {National Institute of Standards and Technology {(NIST)}}, year = {1998}, timestamp = {Wed, 07 Jul 2021 16:44:22 +0200}, biburl = {https://dblp.org/rec/conf/trec/RobertsonWB98.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
Citation
Jones. A statistical interpretation of term specificity and its application in retrieval. J. Documentation 2004. [link]
@article{DBLP:journals/jd/Jones04, author = {Karen Sp{\"{a}}rck Jones}, title = {A statistical interpretation of term specificity and its application in retrieval}, journal = {J. Documentation}, volume = {60}, number = {5}, pages = {493--502}, year = {2004}, url = {https://doi.org/10.1108/00220410410560573}, doi = {10.1108/00220410410560573}, timestamp = {Sun, 06 Sep 2020 16:55:45 +0200}, biburl = {https://dblp.org/rec/journals/jd/Jones04.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
Query Expansion & Reformulation¶
- rm3(*, fb_terms=10, fb_docs=3, fb_lambda=0.6)[source]¶
Creates an RM3 pseudo-relevance feedback transformer for this index.
- Return type:
- Parameters:
fb_terms (int) – The number of feedback terms to use.
fb_docs (int) – The number of feedback documents to use.
fb_lambda (float) – The interpolation weight between the original query and the feedback model.
Example Pipeline:
index.bm25() >> index.rm3() >> index.bm25() >> pt.rewrite.reset()
Note
pt.rewrite.reset() is needed after the feedback step to reset the query to its original form.
Citation
Jaleel et al. UMass at TREC 2004: Novelty and HARD. TREC 2004. [link]
@inproceedings{DBLP:conf/trec/JaleelACDLLSW04, author = {Nasreen Abdul Jaleel and James Allan and W. Bruce Croft and Fernando Diaz and Leah S. Larkey and Xiaoyan Li and Mark D. Smucker and Courtney Wade}, editor = {Ellen M. Voorhees and Lori P. Buckland}, title = {UMass at {TREC} 2004: Novelty and {HARD}}, booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004}, series = {{NIST} Special Publication}, volume = {500-261}, publisher = {National Institute of Standards and Technology {(NIST)}}, year = {2004}, url = {http://trec.nist.gov/pubs/trec13/papers/umass.novelty.hard.pdf}, timestamp = {Wed, 07 Jul 2021 16:44:22 +0200}, biburl = {https://dblp.org/rec/conf/trec/JaleelACDLLSW04.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
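The roles of fb_terms and fb_lambda can be sketched in plain Python. This is a simplified illustration, not Terrier's RM3 code: it weights all feedback documents uniformly (a real relevance model weights them by their retrieval scores), and it assumes that fb_lambda is the weight given to the original query, which the parameter description above does not state explicitly. The function name rm3_expand is hypothetical.

```python
from collections import Counter

def rm3_expand(query_terms, feedback_docs, fb_terms=10, fb_lambda=0.6):
    """Interpolate the original query with a feedback term distribution (simplified)."""
    # Original query model: relative frequency of each query term.
    q_model = {t: c / len(query_terms) for t, c in Counter(query_terms).items()}

    # Feedback model: average relative term frequency over the feedback documents.
    # Simplification: every feedback document gets the same weight.
    fb_model = Counter()
    for doc in feedback_docs:
        for term, count in Counter(doc).items():
            fb_model[term] += count / len(doc) / len(feedback_docs)

    # Keep only the fb_terms strongest feedback terms and renormalise them.
    top = dict(fb_model.most_common(fb_terms))
    norm = sum(top.values())
    top = {t: w / norm for t, w in top.items()}

    # Assumption: fb_lambda weights the original query, (1 - fb_lambda) the feedback model.
    terms = set(q_model) | set(top)
    return {t: fb_lambda * q_model.get(t, 0.0) + (1 - fb_lambda) * top.get(t, 0.0)
            for t in terms}

# The caller passes the top fb_docs documents of the first-pass ranking.
feedback = [["chemical", "reaction", "acid"], ["chemical", "acid"]]
expanded = rm3_expand(["chemical"], feedback, fb_terms=2)
```

The expanded query keeps the original term with the highest weight and adds the strongest feedback term; the weaker feedback term is dropped because fb_terms=2.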
- bo1(*, fb_terms=10, fb_docs=3)[source]¶
Creates a Bo1 pseudo-relevance feedback transformer for this index.
- Return type:
- Parameters:
fb_terms (int) – The number of feedback terms to use.
fb_docs (int) – The number of feedback documents to use.
Example Pipeline:
index.bm25() >> index.bo1() >> index.bm25() >> pt.rewrite.reset()
Note
pt.rewrite.reset() is needed after the feedback step to reset the query to its original form.
Citation
Amati. Probability models for information retrieval based on divergence from randomness. 2003. [link]
@phdthesis{DBLP:phd/ethos/Amati03, author = {Giambattista Amati}, title = {Probability models for information retrieval based on divergence from randomness}, school = {University of Glasgow, {UK}}, year = {2003}, url = {http://theses.gla.ac.uk/1570/}, timestamp = {Tue, 05 Apr 2022 10:59:13 +0200}, biburl = {https://dblp.org/rec/phd/ethos/Amati03.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- kl(*, fb_terms=10, fb_docs=3)[source]¶
Creates a KL-Divergence pseudo-relevance feedback transformer for this index.
- Return type:
- Parameters:
fb_terms (int) – The number of feedback terms to use.
fb_docs (int) – The number of feedback documents to use.
Example Pipeline:
index.bm25() >> index.kl() >> index.bm25() >> pt.rewrite.reset()
Note
pt.rewrite.reset() is needed after the feedback step to reset the query to its original form.
Citation
Amati. Probability models for information retrieval based on divergence from randomness. 2003. [link]
@phdthesis{DBLP:phd/ethos/Amati03, author = {Giambattista Amati}, title = {Probability models for information retrieval based on divergence from randomness}, school = {University of Glasgow, {UK}}, year = {2003}, url = {http://theses.gla.ac.uk/1570/}, timestamp = {Tue, 05 Apr 2022 10:59:13 +0200}, biburl = {https://dblp.org/rec/phd/ethos/Amati03.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- sdm()[source]¶
Creates a Sequential Dependence Model (SDM) query expansion transformer.
Requires that the index was built with positional information (see the store_positions option of indexer()).
Citation
Metzler and Croft. A Markov random field model for term dependencies. SIGIR 2005. [link]
@inproceedings{DBLP:conf/sigir/MetzlerC05, author = {Donald Metzler and W. Bruce Croft}, editor = {Ricardo A. Baeza{-}Yates and Nivio Ziviani and Gary Marchionini and Alistair Moffat and John Tait}, title = {A Markov random field model for term dependencies}, booktitle = {{SIGIR} 2005: Proceedings of the 28th Annual International {ACM} {SIGIR} Conference on Research and Development in Information Retrieval, Salvador, Brazil, August 15-19, 2005}, pages = {472--479}, publisher = {{ACM}}, year = {2005}, url = {https://doi.org/10.1145/1076034.1076115}, doi = {10.1145/1076034.1076115}, timestamp = {Tue, 06 Nov 2018 11:07:23 +0100}, biburl = {https://dblp.org/rec/conf/sigir/MetzlerC05.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
Loading¶
- text_loader(fields='*', *, verbose=False)[source]¶
Creates a transformer that loads stored text content from this index.
- Return type:
- Parameters:
fields (List[str] | str | Literal['*']) – The metadata fields to load for each document. If "*", loads all available fields.
Example Pipeline:
index.text_loader()
Indexing¶
- index(iter, **kwargs)[source]¶
Indexes the given input data, creating the index if it does not yet exist, or raising an error if it does.
This method inspects the first document in the input to try to infer reasonable settings for indexing:
If a toks column is present, it is assumed to contain pre-tokenized text, and toks_indexer() is used. Otherwise, all str columns (except docno) are indexed as raw text using indexer().
Text fields are stored as metadata.
The maximum metadata lengths are predicted based on the lengths present in the first document.
- Parameters:
iter (Iterable[Dict[str, Any]]) – The documents to index as an iterable of dicts.
kwargs (Any) – Ignored
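The inference rules described above can be sketched as follows. This is an illustration of the documented behaviour, not the library's code: the helper infer_index_settings is hypothetical, and using the exact string lengths of the first document as the metadata lengths is an assumption (the real prediction may add headroom).

```python
def infer_index_settings(first_doc):
    """Guess indexing settings from the first document (hypothetical helper)."""
    if "toks" in first_doc:
        # Pre-tokenised input: the toks column already holds the terms.
        mode, text_attrs = "toks_indexer", []
    else:
        # Raw text input: index every string column except the identifier.
        mode = "indexer"
        text_attrs = [key for key, value in first_doc.items()
                      if isinstance(value, str) and key != "docno"]
    # Assumption: reserve exactly the lengths seen in the first document.
    meta = {key: len(value) for key, value in first_doc.items()
            if isinstance(value, str)}
    return {"mode": mode, "text_attrs": text_attrs, "meta": meta}

raw = infer_index_settings({"docno": "d1", "text": "hello world"})
pretok = infer_index_settings({"docno": "d1", "toks": {"hello": 1, "world": 1}})
```

Since the settings come from a single document, it may be safer to call indexer() or toks_indexer() directly with explicit meta lengths when later documents can be much longer than the first one.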
- indexer(*, meta={'docno': 20}, text_attrs=['text'], tokeniser='english', stemmer='porter', stopwords='terrier', store_separate_fields=False, store_positions=False, threads=1)[source]¶
Returns an indexer that is used to build this index.
- Return type:
- Parameters:
meta (Dict) – The fields to store as metadata for each document. The keys are the metadata field names, and the values are the maximum lengths for each field.
text_attrs (List[str]) – The text fields to index as text for each document.
tokeniser (str | TerrierTokeniser) – The tokeniser to use.
stemmer (None | str | TerrierStemmer) – The stemmer to apply to each token.
stopwords (None | TerrierStopwords | str | List[str]) – The set of words to remove as stopwords.
store_separate_fields (bool) – Whether to store each text attribute as a separate field in the index. This allows for fielded retrieval, but increases index size.
store_positions (bool) – Whether to store position information (i.e., blocks) in the index. This allows for positional queries, but increases index size and retrieval time.
threads (int) – The number of threads to use during indexing.
Example Pipeline:
index = pt.terrier.TerrierIndex('my_index.terrier')
index.indexer()
- toks_indexer(*, meta={'docno': 20}, threads=1)[source]¶
Returns an indexer that indexes pre-tokenised documents into this index.
- Return type:
- Parameters:
meta (Dict) – The fields to store as metadata for each document. The keys are the metadata field names, and the values are the maximum lengths for each field.
threads (int) – The number of threads to use during indexing.
Example Pipeline:
index = pt.terrier.TerrierIndex('my_index.terrier')
index.toks_indexer()
Index Data¶
- collection_statistics()[source]¶
Returns the collection statistics for this index.
Example:
Show collection statistics for a Terrier index.
>>> stats = index.collection_statistics()
>>> print(stats)
Number of documents: 11429
Number of terms: 7756
Number of postings: 224573
Number of fields: 0
Number of tokens: 271581
Field names: []
Positions: false
In this example, the index has 11429 documents, which contained 271581 word occurrences. 7756 unique words were identified. The total number of postings in the inverted index is 224573. This index did not record fields during indexing (which can be useful for models such as BM25F). Similarly, positions, which are used for phrasal queries or proximity models, were not recorded.
- Returns:
A Java CollectionStatistics object for this index.
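To make the relationship between documents, terms, postings, and tokens concrete, the following sketch computes the same four counts for a toy corpus in plain Python. The corpus and variable names are assumptions for illustration; this does not call Terrier.

```python
from collections import Counter

# Toy corpus of already-stemmed tokens.
docs = {
    "d1": ["chemic", "reaction", "chemic"],
    "d2": ["chemic", "brother"],
}

# Per-document term frequencies.
doc_tfs = {docno: Counter(toks) for docno, toks in docs.items()}

num_docs = len(docs)                                        # documents in the collection
num_tokens = sum(len(toks) for toks in docs.values())       # total word occurrences
num_terms = len(set().union(*doc_tfs.values()))             # unique terms (lexicon size)
num_postings = sum(len(tfs) for tfs in doc_tfs.values())    # distinct (term, document) pairs
```

A term that occurs twice in one document contributes two tokens but only one posting, which is why the number of postings is never larger than the number of tokens.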
- lexicon()[source]¶
The lexicon for this index.
Note that the terms in the lexicon include all pre-processing, such as stemming. For example, the term ‘chemical’ would be stored as ‘chemic’ when using the default Porter stemmer.
- Returns:
A Java Lexicon object for this index.
- inverted_index()[source]¶
The inverted posting index for this index.
- Returns:
A Java PostingIndex object for this index’s inverted index.
- document_index()[source]¶
The document index for this index.
- Returns:
A Java DocumentIndex object for this index.
- meta_index()[source]¶
The meta index for this index.
Example:
Show metadata fields in a Terrier index.
>>> print(index.meta_index().getKeys())
['docno', 'text']
In this example, the index contains two metadata fields: docno, which contains the document identifiers, and text, which contains the raw text of each document.
- Returns:
A Java MetaIndex object for this index.
- direct_index()[source]¶
The direct (forward) index for this index.
- Returns:
A Java PostingIndex object for this index’s direct index.
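The difference between the inverted and the direct index can be sketched with two dictionaries built from the same toy corpus. This is a conceptual illustration only: Terrier stores compressed posting lists keyed by integer term and document ids, whereas the sketch uses strings.

```python
from collections import Counter, defaultdict

# Toy corpus of already-stemmed tokens.
docs = {
    "d1": ["chemic", "reaction", "chemic"],
    "d2": ["chemic", "brother"],
}

# Direct (forward) index: document -> {term: frequency}.
direct = {docno: dict(Counter(toks)) for docno, toks in docs.items()}

# Inverted index: term -> {document: frequency}.
inverted = defaultdict(dict)
for docno, term_freqs in direct.items():
    for term, tf in term_freqs.items():
        inverted[term][docno] = tf
```

Retrieval looks up query terms in the inverted index, while operations that need the terms of a known document, such as pseudo-relevance feedback, read the direct index.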
Miscellaneous¶
- get_corpus_iter(return_toks=True)[source]¶
Returns an iterable over the documents in this index’s corpus.
- Return type:
Iterable[Dict[str,Any]]
- Parameters:
return_toks (bool) – Whether to return tokenised text (list of strings) or raw text (string).
- A corpus iter from a Terrier index can be used for various purposes, including:
indexing the pre-tokenised Terrier index directly in another indexing pipeline
extracting document metadata for ingestion into another indexing pipeline
- classmethod coerce(cls, index_like)[source]¶
Attempts to build a TerrierIndex from the given object.
- Return type:
- Parameters:
index_like (object) – The object to coerce into a TerrierIndex. If a str or Path, it loads the index at the provided path. If a pt.terrier.J.IndexRef or pt.terrier.J.Index, it creates a TerrierIndex from the Java object. If a pt.terrier.TerrierIndex, it returns itself.
- enum pyterrier.terrier.TerrierModel(value)[source]¶
A built-in Terrier weighting (scoring) model.
This enum is primarily used with TerrierIndex.retriever() to specify the weighting model to use.
Valid values are as follows:
- bm25 = <TerrierModel.bm25: 'bm25'>¶
- dph = <TerrierModel.dph: 'dph'>¶
- pl2 = <TerrierModel.pl2: 'pl2'>¶
- dirichlet_lm = <TerrierModel.dirichlet_lm: 'dirichlet_lm'>¶
- hiemstra_lm = <TerrierModel.hiemstra_lm: 'hiemstra_lm'>¶
- tf = <TerrierModel.tf: 'tf'>¶
- tf_idf = <TerrierModel.tf_idf: 'tf_idf'>¶
- enum pyterrier.terrier.TerrierTokeniser(value)[source]¶
A built-in Terrier tokeniser.
This enum is primarily used with indexer.
Valid values are as follows:
- whitespace = <TerrierTokeniser.whitespace: 'whitespace'>¶
- english = <TerrierTokeniser.english: 'english'>¶
- utf = <TerrierTokeniser.utf: 'utf'>¶
- twitter = <TerrierTokeniser.twitter: 'twitter'>¶
- identity = <TerrierTokeniser.identity: 'identity'>¶
- enum pyterrier.terrier.TerrierStemmer(value)[source]¶
A built-in Terrier stemmer.
The stemming configuration is saved in the index and loaded at retrieval time. Snowball stemmers for various languages are available in Terrier.
This enum is primarily used with indexer.
Valid values are as follows:
- none = <TerrierStemmer.none: 'none'>¶
- porter = <TerrierStemmer.porter: 'porter'>¶
- weakporter = <TerrierStemmer.weakporter: 'weakporter'>¶
- danish = <TerrierStemmer.danish: 'danish'>¶
- finnish = <TerrierStemmer.finnish: 'finnish'>¶
- german = <TerrierStemmer.german: 'german'>¶
- hungarian = <TerrierStemmer.hungarian: 'hungarian'>¶
- norwegian = <TerrierStemmer.norwegian: 'norwegian'>¶
- portugese = <TerrierStemmer.portugese: 'portugese'>¶
- spanish = <TerrierStemmer.spanish: 'spanish'>¶
- swedish = <TerrierStemmer.swedish: 'swedish'>¶
- turkish = <TerrierStemmer.turkish: 'turkish'>¶
- enum pyterrier.terrier.TerrierStopwords(value)[source]¶
The stopword configuration to use for Terrier.
This enum is primarily used with indexer.
Valid values are as follows:
- none = <TerrierStopwords.none: 'none'>¶
- terrier = <TerrierStopwords.terrier: 'terrier'>¶
- custom = <TerrierStopwords.custom: 'custom'>¶
Mid-Level API¶
The Mid-Level API provides more control over Terrier functionality.
Indexing¶
- class pyterrier.terrier.IterDictIndexer(index_path, *, meta={'docno': 20}, text_attrs=['text'], meta_reverse=['docno'], pretokenised=False, fields=False, threads=1, **kwargs)¶
Use this Indexer if you wish to index an iter of dicts (possibly with multiple fields). This version is optimized by using multiple threads and POSIX fifos to transfer data, which ends up being much faster.
- Parameters:
index_path (str) – Directory to store index. Ignored for IndexingType.MEMORY.
meta (Dict[str,int]) – What metadata for each document to record in the index, and what length to reserve. Metadata values will be truncated to this length. Defaults to {“docno” : 20}.
text_attrs (List[str]) – List of columns of the input data that should be indexed. These are concatenated in the document representation. Defaults to [“text”].
meta_reverse (List[str]) – Which metadata fields should be resolvable back to a docid. Defaults to [“docno”].
pretokenised (bool) – Whether to index pre-tokenized text, e.g., through a Learned Sparse encoder. If True, will ignore text_attrs and instead index the dictionary contained in the toks column.
fields (bool) – Whether a fields-indexer should be used, i.e. whether the frequency in each attribute should be recorded separately in the Terrier index. This allows application of weighting models such as BM25F.
threads (int) – Number of threads to use for indexing. Defaults to 1.
kwargs – Additional keyword arguments passed to TerrierIndexer.
- index(it, fields=None)¶
Index the specified iter of dicts with the (optional) specified fields
- Parameters:
it – an iter of document dicts to be indexed
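The effect of the meta parameter on stored values can be pictured with a small sketch. It assumes that lengths are counted in characters and that keys absent from meta are simply not stored; the helper apply_meta_limits is hypothetical and is not part of PyTerrier.

```python
def apply_meta_limits(doc, meta):
    """Keep only the configured metadata keys, truncated to their reserved lengths."""
    return {key: str(doc.get(key, ""))[:max_len] for key, max_len in meta.items()}

# A document dict of the kind passed to IterDictIndexer.index().
doc = {"docno": "d" * 25, "text": "chemical reactions"}

# Reserve 20 characters for docno and only 8 for text.
stored = apply_meta_limits(doc, {"docno": 20, "text": 8})
```

Here both values are cut short, so an identifier longer than its reserved length would no longer match the original docno. Choosing meta lengths that cover the longest expected value avoids this.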
- class pyterrier.terrier.TRECCollectionIndexer(index_path, collection='trec', verbose=False, meta={'docno': 20}, meta_reverse=['docno'], meta_tags={}, **kwargs)[source]¶
Use this Indexer if you wish to index a TREC formatted collection
Init method
- Parameters:
index_path (str) – Directory to store index. Ignored for IndexingType.MEMORY.
blocks – Create indexer with blocks if true, else without blocks. Default is False.
overwrite – If index already present at index_path, True would overwrite it, False throws an Exception. Default is False.
type – the specific indexing procedure to use. Default is IndexingType.CLASSIC.
collection (str) – name, or Class instance, or one of “trec”, “trecweb”, “warc”. Default is “trec”.
meta (Dict[str,int]) – What metadata for each document to record in the index, and what length to reserve. Metadata fields will be truncated to this length. Defaults to {“docno” : 20}.
meta_reverse (List[str]) – Which metadata fields should be resolvable back to a docid. Defaults to [“docno”].
meta_tags (Dict[str,str]) – For collections formed using tagged data (e.g. HTML), which tags correspond to which metadata. This is useful for recording the text of documents for use in neural rankers - see Working with Document Texts.
verbose (bool)
- class pyterrier.terrier.FilesIndexer(index_path, *, meta={'docno': 20, 'filename': 512}, meta_reverse=['docno'], meta_tags={}, **kwargs)[source]¶
Use this Indexer if you wish to index files such as pdf, docx, or txt.
- Parameters:
index_path (str) – Directory to store index. Ignored for IndexingType.MEMORY.
blocks (bool) – Create indexer with blocks if true, else without blocks. Default is False.
type (IndexingType) – the specific indexing procedure to use. Default is IndexingType.CLASSIC.
meta (Dict[str,int]) – What metadata for each document to record in the index, and what length to reserve. Metadata fields will be truncated to this length. Defaults to {“docno” : 20, “filename” : 512}.
meta_reverse (List[str]) – Which metadata fields should be resolvable back to a docid. Defaults to [“docno”].
meta_tags (Dict[str,str]) – For collections formed using tagged data (e.g. HTML), which tags correspond to which metadata. Defaults to empty. This is useful for recording the text of documents for use in neural rankers - see Working with Document Texts.
- enum pyterrier.terrier.IndexingType(value)[source]¶
This enum is used to determine the type of index built by Terrier. The default is CLASSIC. For more information, see the relevant Terrier indexer and realtime documentation.
Valid values are as follows:
- CLASSIC = <IndexingType.CLASSIC: 1>¶
- SINGLEPASS = <IndexingType.SINGLEPASS: 2>¶
- MEMORY = <IndexingType.MEMORY: 3>¶
- pyterrier.terrier.treccollection2textgen(files, meta=['docno'], meta_tags={'text': 'ELSE'}, verbose=False, num_docs=None, tag_text_length=4096)[source]¶
Creates a generator of dictionaries on parsing TREC formatted files. This is useful for parsing TREC-formatted corpora in indexers like IterDictIndexer, or similar indexers in other plugins (e.g. ColBERTIndexer).
- Arguments:
files (List[str]) – list of files to parse in TREC format.
meta (List[str]) – list of attributes to expose in the dictionaries as metadata.
meta_tags (Dict[str,str]) – mapping of TREC tags as metadata.
tag_text_length (int) – maximum length of metadata. Defaults to 4096.
verbose (bool) – set to True to show a TQDM progress bar. Defaults to False.
num_docs (int) – a hint for TQDM to size the progress bar based on document counts rather than file count.
Example:
files = pt.io.find_files("/path/to/Disk45")
gen = pt.index.treccollection2textgen(files)
index = pt.IterDictIndexer("./index45").index(gen)
- Parameters:
files (List[str])
meta (List[str])
meta_tags (Dict[str, str])
tag_text_length (int)
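The kind of dictionaries such a generator yields can be sketched with a small regex-based parser. This is a simplified stand-in, not the library's parser: it reads from a string rather than from files, ignores meta_tags and tag_text_length, and the function name trec_text_to_dicts is hypothetical.

```python
import re

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")

def trec_text_to_dicts(trec_text):
    """Yield one {'docno', 'text'} dict per <DOC> element (simplified)."""
    for match in DOC_RE.finditer(trec_text):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        if docno is None:
            continue  # skip documents without an identifier
        # Drop the DOCNO element, strip remaining tags, and collapse whitespace.
        rest = DOCNO_RE.sub(" ", body)
        text = " ".join(TAG_RE.sub(" ", rest).split())
        yield {"docno": docno.group(1), "text": text}

sample = """<DOC>
<DOCNO> D1 </DOCNO>
<TEXT>
chemical reactions
</TEXT>
</DOC>
<DOC>
<DOCNO>D2</DOCNO>
<TEXT>chemical brothers</TEXT>
</DOC>"""

parsed = list(trec_text_to_dicts(sample))
```

Each yielded dictionary has the shape that iter-of-dict indexers expect: a docno plus one or more text fields.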
Retrieval & Scoring¶
- class pyterrier.terrier.Retriever(index_location, controls=None, properties=None, metadata=['docno'], num_results=None, wmodel=None, tokeniser=TerrierTokeniser.english, threads=1, verbose=False)[source]¶
Use this class for retrieval by Terrier
Init method
- Parameters:
index_location (str | Any) – An index-like object - An Index, an IndexRef, or a String that can be resolved to an IndexRef
controls (Dict[str,str] | None) – A dictionary with the control names and values
properties (Dict[str,str] | None) – A dictionary with the property keys and values
verbose (bool) – If True transform method will display progress
num_results (int | None) – Number of results to retrieve.
metadata (List[str]) – What metadata to retrieve. Default is [“docno”].
wmodel (str | Callable | None)
tokeniser (str | TerrierTokeniser)
threads (int)
- class pyterrier.terrier.FeaturesRetriever(index_location, features, controls=None, properties=None, threads=1, **kwargs)[source]¶
Use this class for retrieval with multiple features
Init method
- Parameters:
index_location (str | Any) – An index-like object - An Index, an IndexRef, or a String that can be resolved to an IndexRef
features (List[str]) – List of features to use
controls (Dict[str,str] | None) – A dictionary with the control names and values
properties (Dict[str,str] | None) – A dictionary with the property keys and values
verbose – If True transform method will display progress
num_results – Number of results to retrieve.
threads (int)
- transform(queries)[source]¶
Performs the retrieval with multiple features
- Parameters:
queries – A pandas.DataFrame with columns=[‘qid’, ‘query’]. For re-ranking, the DataFrame may also have a ‘docid’ and/or ‘docno’ column.
- Returns:
a pandas.DataFrame with columns=[‘qid’, ‘docno’, ‘score’, ‘rank’, ‘features’]
- class pyterrier.terrier.TextScorer(*args, **kwargs)[source]¶
A re-ranker class, which takes the queries and the contents of documents, indexes the contents of the documents using a MemoryIndex, and performs ranking of those documents with respect to the queries. Unknown kwargs are passed to Retriever.
- Parameters:
takes – configuration - what is needed as input: “queries”, or “docs”. Default is “docs” since v0.8.
returns – configuration - what is needed as output: “queries”, or “docs”. Default is “docs”.
body_attr – what dataframe input column contains the text of the document. Default is “body”.
wmodel – name of the weighting model to use for scoring.
background_index – An optional background index to use for term and collection statistics. If a weighting model such as BM25 or TF_IDF or PL2 is used without setting the background_index, the background statistics will be calculated from the dataframe, which is usually not the desired behaviour.
Example:
import pandas as pd
import pyterrier as pt

# Each row has four values, so four column names are needed (including docno).
df = pd.DataFrame(
    [
        ["q1", "chemical reactions", "d1", "professor protor poured the chemicals"],
        ["q1", "chemical reactions", "d2", "chemical brothers turned up the beats"],
    ], columns=["qid", "query", "docno", "text"])
textscorer = pt.TextScorer(takes="docs", body_attr="text", wmodel="Tf")
rtr = textscorer.transform(df)
# rtr will score each document by term frequency for the query "chemical reactions" based on the provided document contents
Example:
# Each row has four values, so four column names are needed (including docno).
df = pd.DataFrame(
    [
        ["q1", "chemical reactions", "d1", "professor protor poured the chemicals"],
        ["q1", "chemical reactions", "d2", "chemical brothers turned up the beats"],
    ], columns=["qid", "query", "docno", "text"])
existing_index = pt.IndexFactory.of(...)
textscorer = pt.TextScorer(takes="docs", body_attr="text", wmodel="TF_IDF", background_index=existing_index)
rtr = textscorer.transform(df)
# rtr will score each document by TF_IDF for the query "chemical reactions" based on the provided document contents
Query Expansion¶
- class pyterrier.terrier.rewrite.SDM(verbose=0, remove_stopwords=True, prox_model=None, tokeniser=TerrierTokeniser.english, **kwargs)[source]¶
Implements the sequential dependence model, which Terrier supports using its Indri/Galago-compatible matchop query language. The rewritten query is derived using the Terrier class DependenceModelPreProcess.
This transformer changes the query. It must be followed by a terrier.Retriever() transformer. The original query is saved in the “query_0” column, which can be restored using pt.rewrite.reset().
Citation
Metzler and Croft. A Markov random field model for term dependencies. SIGIR 2005. [link]
@inproceedings{DBLP:conf/sigir/MetzlerC05,
  author = {Donald Metzler and W. Bruce Croft},
  editor = {Ricardo A. Baeza{-}Yates and Nivio Ziviani and Gary Marchionini and Alistair Moffat and John Tait},
  title = {A Markov random field model for term dependencies},
  booktitle = {{SIGIR} 2005: Proceedings of the 28th Annual International {ACM} {SIGIR} Conference on Research and Development in Information Retrieval, Salvador, Brazil, August 15-19, 2005},
  pages = {472--479},
  publisher = {{ACM}},
  year = {2005},
  url = {https://doi.org/10.1145/1076034.1076115},
  doi = {10.1145/1076034.1076115},
  timestamp = {Tue, 06 Nov 2018 11:07:23 +0100},
  biburl = {https://dblp.org/rec/conf/sigir/MetzlerC05.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
- Parameters:
tokeniser (str | TerrierTokeniser)
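As a rough illustration of what a sequential dependence rewrite looks like, the sketch below builds an Indri-style query from unigrams, exact adjacent pairs, and unordered windows. It is a generic rendering of the Metzler and Croft formulation, not the output of Terrier's DependenceModelPreProcess: the function name sdm_query, the 0.85/0.10/0.05 weights, the window size of 8 and the exact operator spelling are all assumptions.

```python
def sdm_query(query: str, w_t: float = 0.85, w_o: float = 0.10, w_u: float = 0.05) -> str:
    """Build an Indri-style SDM query string (illustrative only).

    w_t, w_o, w_u weight the unigram, ordered-pair and unordered-window parts.
    """
    terms = query.split()
    if len(terms) < 2:
        # No adjacent pairs exist, so there is nothing to add
        return query
    pairs = list(zip(terms, terms[1:]))  # adjacent term pairs
    unigrams = " ".join(terms)
    ordered = " ".join(f"#1({a} {b})" for a, b in pairs)
    unordered = " ".join(f"#uw8({a} {b})" for a, b in pairs)
    return (
        f"#weight( {w_t} #combine({unigrams}) "
        f"{w_o} #combine({ordered}) "
        f"{w_u} #combine({unordered}) )"
    )


rewritten = sdm_query("white house rose garden")
print(rewritten)
```

A four-term query yields three adjacent pairs, each appearing once as an exact phrase and once as an unordered window.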
- class pyterrier.terrier.rewrite.RM3(*args, **kwargs)[source]¶
Performs query expansion using RM3 relevance models.
This transformer must be followed by a terrier.Retriever() transformer. The original query is saved in the “query_0” column, which can be restored using pt.rewrite.reset().
- Instance Attributes:
fb_terms(int): number of feedback terms. Defaults to 10
fb_docs(int): number of feedback documents. Defaults to 3
fb_lambda(float): lambda in RM3, i.e. importance of relevance model viz feedback model. Defaults to 0.6.
Example:
bm25 = pt.terrier.Retriever(index, wmodel="BM25")
rm3_pipe = bm25 >> pt.rewrite.RM3(index) >> bm25
pt.Experiment([bm25, rm3_pipe],
    dataset.get_topics(),
    dataset.get_qrels(),
    ["map"]
)
Citation
Jaleel et al. UMass at TREC 2004: Novelty and HARD. TREC 2004. [link]
@inproceedings{DBLP:conf/trec/JaleelACDLLSW04,
  author = {Nasreen Abdul Jaleel and James Allan and W. Bruce Croft and Fernando Diaz and Leah S. Larkey and Xiaoyan Li and Mark D. Smucker and Courtney Wade},
  editor = {Ellen M. Voorhees and Lori P. Buckland},
  title = {UMass at {TREC} 2004: Novelty and {HARD}},
  booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
  series = {{NIST} Special Publication},
  volume = {500-261},
  publisher = {National Institute of Standards and Technology {(NIST)}},
  year = {2004},
  url = {http://trec.nist.gov/pubs/trec13/papers/umass.novelty.hard.pdf},
  timestamp = {Wed, 07 Jul 2021 16:44:22 +0200},
  biburl = {https://dblp.org/rec/conf/trec/JaleelACDLLSW04.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
- Parameters:
index_like – the Terrier index to use
fb_terms (int) – number of terms to add to the query. Terrier’s default setting is 10 expansion terms.
fb_docs (int) – number of feedback documents to consider. Terrier’s default setting is 3 feedback documents.
fb_lambda (float) – lambda in RM3, i.e. importance of relevance model viz feedback model. Defaults to 0.6.
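The role of fb_terms and fb_lambda can be sketched as a simple interpolation between the original query and a set of weighted feedback terms. This is a simplified illustration, not Terrier's RM3 code: the function name rm3_interpolate is hypothetical, the feedback weights are taken as given rather than estimated from fb_docs documents, and which side of the interpolation fb_lambda applies to is an assumption made here (it weights the original query).

```python
from collections import Counter
from typing import Dict, List


def rm3_interpolate(
    query_terms: List[str],
    feedback_weights: Dict[str, float],
    fb_terms: int = 10,
    fb_lambda: float = 0.6,
) -> Dict[str, float]:
    """Interpolate an original query model with the top feedback terms."""
    # Original query model: relative frequency of each query term
    counts = Counter(query_terms)
    original = {t: c / len(query_terms) for t, c in counts.items()}

    # Keep only the fb_terms highest-weighted feedback terms and renormalise
    top = sorted(feedback_weights.items(), key=lambda kv: kv[1], reverse=True)[:fb_terms]
    norm = sum(w for _, w in top)
    feedback = {t: w / norm for t, w in top} if norm > 0 else {}

    # Assumption: fb_lambda weights the original query, (1 - fb_lambda) the feedback terms
    vocab = set(original) | set(feedback)
    return {
        t: fb_lambda * original.get(t, 0.0) + (1 - fb_lambda) * feedback.get(t, 0.0)
        for t in vocab
    }


expanded = rm3_interpolate(
    ["chemical", "reactions"],
    {"chemical": 0.5, "reaction": 0.3, "acid": 0.2},  # made-up feedback weights
    fb_terms=2,
)
print(sorted(expanded.items()))
```

With fb_terms=2 the lowest-weighted feedback term ("acid") is dropped, and the resulting term weights still sum to 1.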
- class pyterrier.terrier.rewrite.Bo1QueryExpansion(*args, **kwargs)[source]¶
Applies the Bo1 query expansion model from the Divergence from Randomness Framework, as provided by Terrier. It must be followed by a terrier.Retriever() transformer. The original query is saved in the “query_0” column, which can be restored using pt.rewrite.reset().
- Instance Attributes:
fb_terms(int): number of feedback terms. Defaults to 10
fb_docs(int): number of feedback documents. Defaults to 3
Citation
Amati and Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 2002. [link]
@article{DBLP:journals/tois/AmatiR02,
  author = {Gianni Amati and C. J. van Rijsbergen},
  title = {Probabilistic models of information retrieval based on measuring the divergence from randomness},
  journal = {{ACM} Trans. Inf. Syst.},
  volume = {20},
  number = {4},
  pages = {357--389},
  year = {2002},
  url = {http://doi.acm.org/10.1145/582415.582416},
  doi = {10.1145/582415.582416},
  timestamp = {Tue, 01 Jun 2021 09:58:08 +0200},
  biburl = {https://dblp.org/rec/journals/tois/AmatiR02.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
- Parameters:
index_like – the Terrier index to use.
fb_terms (int) – number of terms to add to the query. Terrier’s default setting is 10 expansion terms.
fb_docs (int) – number of feedback documents to consider. Terrier’s default setting is 3 feedback documents.
- class pyterrier.terrier.rewrite.KLQueryExpansion(*args, **kwargs)[source]¶
Applies the KL query expansion model from the Divergence from Randomness Framework, as provided by Terrier. This transformer must be followed by a terrier.Retriever() transformer. The original query is saved in the “query_0” column, which can be restored using pt.rewrite.reset().
- Instance Attributes:
fb_terms(int): number of feedback terms. Defaults to 10
fb_docs(int): number of feedback documents. Defaults to 3
- Parameters:
index_like – the Terrier index to use
fb_terms (int) – number of terms to add to the query. Terrier’s default setting is 10 expansion terms.
fb_docs (int) – number of feedback documents to consider. Terrier’s default setting is 3 feedback documents.
- pyterrier.terrier.rewrite.reset()[source]¶
Undoes a previous query rewriting operation. This results in the query formulation stored in the “query_0” attribute being moved to the “query” attribute, and, if present, the “query_1” being moved to “query_0” and so on. This transformation is useful if you have rewritten the query for the purposes of one retrieval stage, but wish a subsequent transformer to be applied to the original formulation.
Internally, this function applies pt.model.pop_queries().
Example:
firststage = pt.rewrite.SDM() >> pt.terrier.Retriever(index, wmodel="DPH")
secondstage = pyterrier_bert.cedr.CEDRPipeline()
fullranker = firststage >> pt.rewrite.reset() >> secondstage
- Return type:
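The column shuffling described for reset() can be pictured on a single query row represented as a plain dict. This is only a sketch of the described behaviour, not the PyTerrier implementation; push_query and pop_query are hypothetical helper names.

```python
def push_query(row: dict, new_query: str) -> dict:
    """Rewrite the query, saving older formulations in query_0, query_1, ..."""
    out = dict(row)
    depth = 0
    while f"query_{depth}" in out:
        depth += 1
    # Shift existing saved formulations up by one: query_0 -> query_1, and so on
    for i in range(depth, 0, -1):
        out[f"query_{i}"] = out[f"query_{i - 1}"]
    out["query_0"] = out["query"]
    out["query"] = new_query
    return out


def pop_query(row: dict) -> dict:
    """Undo one rewrite: query_0 -> query, query_1 -> query_0, and so on."""
    if "query_0" not in row:
        raise KeyError("no previous query formulation to restore")
    out = dict(row)
    out["query"] = out.pop("query_0")
    i = 1
    while f"query_{i}" in out:
        out[f"query_{i - 1}"] = out.pop(f"query_{i}")
        i += 1
    return out


row = {"qid": "q1", "query": "chemical reactions"}
once = push_query(row, "rewritten a")
twice = push_query(once, "rewritten b")
print(twice)
print(pop_query(twice))
```

Popping as many times as the query was rewritten returns the row to its original form.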
- pyterrier.terrier.rewrite.tokenise(tokeniser='english', matchop=False)[source]¶
Applies tokenisation to the query. Until PyTerrier 1.0, queries obtained from pt.get_dataset().get_topics() were generally tokenised.
- Return type:
- Parameters:
tokeniser (Union[str,TerrierTokeniser,FunctionType]) – Defines what tokeniser should be used - either a Java tokeniser name in Terrier, a TerrierTokeniser instance, or a function that takes a str as input and returns a list of str.
matchop (bool) – Whether query terms should be wrapped in matchops, to ensure they can be parsed by a Terrier Retriever transformer.
Example - use default tokeniser:
pipe = pt.rewrite.tokenise() >> pt.terrier.Retriever()
pipe.search("Question with 'capitals' and other stuff?")
Example - roll your own tokeniser:
poortokenisation = pt.rewrite.tokenise(lambda query: query.split(" ")) >> pt.terrier.Retriever()
Example - for non-English languages, tokenise on standard UTF non-alphanumeric characters:
utftokenised = pt.rewrite.tokenise(pt.TerrierTokeniser.utf) >> pt.terrier.Retriever()
utftokenised = pt.rewrite.tokenise("utf") >> pt.terrier.Retriever()
Example - tokenising queries using a HuggingFace tokenizer
# this assumes the index was created in a pretokenised manner
from transformers import AutoTokenizer
br = pt.terrier.Retriever(indexref)
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
query_toks = pt.rewrite.tokenise(tok.tokenize, matchop=True)
retr_pipe = query_toks >> br
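Since the tokeniser argument may be any function that takes a str and returns a list of str, a custom tokeniser can be developed and checked on its own before it goes into a pipeline. The sketch below is one hypothetical choice (simple_tokeniser is not part of PyTerrier); it lowercases and keeps runs of word characters, which is not claimed to match what Terrier's own tokenisers do.

```python
import re
from typing import List


def simple_tokeniser(query: str) -> List[str]:
    """Lowercase the query and split it into runs of word characters."""
    return re.findall(r"\w+", query.lower())


tokens = simple_tokeniser("Question with 'capitals' and other stuff?")
print(tokens)
```

A function with this shape could then be passed as the tokeniser argument of pt.rewrite.tokenise(), in the same way as the lambda in the "roll your own tokeniser" example.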
- pyterrier.terrier.rewrite.stash_results(clear=True)[source]¶
Stashes (saves) the current retrieved documents for each query into the column “stashed_results_0”, thereby converting a dataframe of retrieved documents into one of queries. The documents can be restored later using pt.rewrite.reset_results().
- Parameters:
clear (bool) – whether to drop the document and retrieved document related columns. Defaults to True.
- Return type:
- pyterrier.terrier.rewrite.reset_results()[source]¶
Applies a transformer that undoes a pt.rewrite.stash_results() transformer, thereby restoring the ranked documents.
- Return type:
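The round trip between stash_results() and reset_results() can be sketched with plain lists of dicts standing in for dataframes. This illustrates the idea only: the helper names, the choice of document columns, and the way stashed documents are stored inside “stashed_results_0” are assumptions, not the PyTerrier implementation. The sketch mirrors the clear=True behaviour.

```python
from typing import Dict, List

# Assumption: these are the document-related columns that get stashed
DOC_COLS = ("docno", "score", "rank")


def stash(results: List[dict]) -> List[dict]:
    """Collapse a result list into one row per query, stashing its documents."""
    by_query: Dict[str, dict] = {}
    for row in results:
        entry = by_query.setdefault(
            row["qid"],
            {"qid": row["qid"], "query": row["query"], "stashed_results_0": []},
        )
        entry["stashed_results_0"].append({c: row[c] for c in DOC_COLS})
    return list(by_query.values())


def unstash(queries: List[dict]) -> List[dict]:
    """Expand stashed documents back into one row per (query, document)."""
    return [
        {"qid": q["qid"], "query": q["query"], **doc}
        for q in queries
        for doc in q["stashed_results_0"]
    ]


results = [
    {"qid": "q1", "query": "chemical reactions", "docno": "d1", "score": 2.0, "rank": 0},
    {"qid": "q1", "query": "chemical reactions", "docno": "d2", "score": 1.0, "rank": 1},
    {"qid": "q2", "query": "beats", "docno": "d2", "score": 3.0, "rank": 0},
]
stashed = stash(results)
restored = unstash(stashed)
print(len(stashed), len(restored))
```

Between the two steps the frame has one row per query, which is the shape that query rewriting transformers operate on.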
- pyterrier.terrier.rewrite.linear(weightCurrent, weightPrevious, format='terrierql', **kwargs)[source]¶
Creates a linear combination of the current and previous query formulations. The implementation is tied to the underlying query language used by the retrieval/re-ranker transformers. Two of Terrier’s query language formats are supported by the format kwarg, namely “terrierql” and “matchopql”. Their respective formats are detailed in the Terrier documentation.
- Return type:
- Parameters:
weightCurrent (float) – weight to apply to the current query formulation.
weightPrevious (float) – weight to apply to the previous query formulation.
format (str) – which query language to use to rewrite the queries, one of “terrierql” or “matchopql”.
Example:
pipeTQL = pt.apply.query(lambda row: "az") >> pt.rewrite.linear(0.75, 0.25, format="terrierql")
pipeMQL = pt.apply.query(lambda row: "az") >> pt.rewrite.linear(0.75, 0.25, format="matchopql")
pipeTQL.search("a")
pipeMQL.search("a")
Example outputs of pipeTQL and pipeMQL corresponding to the query “a” above:
Terrier QL output: “(az)^0.750000 (a)^0.250000”
MatchOp QL output: “#combine:0:0.750000:1:0.250000(#combine(az) #combine(a))”
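The two output strings above follow a simple pattern, which the sketch below reproduces with plain string formatting. It is an illustration derived from the example outputs only; linear_combine is a hypothetical helper, not the function PyTerrier uses.

```python
def linear_combine(current: str, previous: str,
                   weight_current: float, weight_previous: float,
                   fmt: str = "terrierql") -> str:
    """Combine two query formulations into one weighted query string."""
    if fmt == "terrierql":
        # Each formulation is bracketed and boosted with ^weight
        return f"({current})^{weight_current:f} ({previous})^{weight_previous:f}"
    if fmt == "matchopql":
        # Weights are attached to the outer #combine by position (0 and 1)
        return (
            f"#combine:0:{weight_current:f}:1:{weight_previous:f}"
            f"(#combine({current}) #combine({previous}))"
        )
    raise ValueError(f"unsupported format: {fmt}")


tql = linear_combine("az", "a", 0.75, 0.25, fmt="terrierql")
mql = linear_combine("az", "a", 0.75, 0.25, fmt="matchopql")
print(tql)
print(mql)
```

Here "az" plays the role of the current (rewritten) query and "a" the previous one, matching the example pipelines.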
Loading¶
- class pyterrier.terrier.TerrierTextLoader(index, fields='*', *, verbose=False)[source]¶
A transformer that loads textual metadata from a Terrier index into a DataFrame by docid or docno.
Initialise the transformer with the index to load metadata from.
- Parameters:
index (pyterrier.terrier.J.Index) – The index to load metadata from.
fields (List[str] | str | Literal['*']) – The fields to load from the index. If ‘*’, all fields will be loaded.
verbose – Whether to print debug information.
Low-Level (Java) API¶
Some functions return Java object wrappers (e.g., TerrierIndex.index_obj())
that provide direct low-level API access to Terrier classes. You can find documentation for these classes in the
Terrier Documentation.
Tip
Pyjnius Java object wrappers show which class they wrap in their string representation. For instance,
str(index.index_obj()) = "<org.terrier.structures.Index at 0x10cd8ba60 ...>", showing that it
wraps an instance of org.terrier.structures.Index.
- class pyterrier.terrier.IndexFactory[source]¶
The of() method of this factory class loads a Terrier index.
NB: This class “shades” the native Terrier IndexFactory class: it offers essentially the same API, except that the of() method accepts a memory kwarg that can be used to load additional index data structures into memory.
- Terrier data structures that can be loaded into memory:
‘inverted’ - the inverted index, contains posting lists for each term. In the default configuration, this is read in from disk in chunks.
‘lexicon’ - the dictionary. By default, a binary search of the on-disk structure is used, so loading into memory can enhance speed.
‘meta’ - metadata about documents. Used as the final stage of retrieval, one seek for each retrieved document.
‘direct’ - contains posting lists for each document. No speed advantage for loading into memory unless pseudo-relevance feedback is being used.
‘document’ - contains document lengths, which are anyway loaded into memory. No speed advantage for loading into memory unless pseudo-relevance feedback is being used.