Terrier API Reference

This page provides API documentation for the Terrier integration in PyTerrier.

High-Level API

TerrierIndex provides a high-level API. We recommend it for most use cases.

class pyterrier.terrier.TerrierIndex(path, *, memory=False, _index_ref=None, _index_obj=None)[source]

Represents a Terrier index.

A Terrier index is a sparse inverted index structure that supports a variety of operations. It can be used to create transformers that perform retrieval, re-ranking, pseudo-relevance feedback, and other operations.

Parameters:
  • path (str | Path) – The path to the index on disk.

  • memory (bool) – Whether to load the index fully into memory.

  • _index_ref (object) – For internal use only. The Java IndexRef object for this index.

  • _index_obj (object) – For internal use only. The Java Index object for this index.

Retrieval

retriever(model, model_args=None, *, num_results=1000, include_fields=None, threads=1, verbose=False)[source]

Creates a retriever transformer for this index.

Return type:

Transformer

Parameters:
  • model (TerrierModel | str) – The weighting model to use for scoring.

  • model_args (Dict[str, Any] | None) – The arguments to pass to the weighting model.

  • num_results (int) – The maximum number of results to return per query.

  • include_fields (List[str] | None) – The metadata fields to return for each search result.

  • threads (int) – The number of threads to use during retrieval.

  • verbose (bool) – Whether to display progress information during retrieval.

Example Pipeline:

index.retriever('BM25', num_results=10)

Terrier retrievers can also perform re-ranking when they receive a result frame as input:

# As a re-ranker
index.retriever('BM25', num_results=10)

See also

There are shorthand methods for creating common retrievers: bm25(), dph().

bm25(*, k1=1.2, b=0.75, num_results=1000, include_fields=None, threads=1, verbose=False)[source]

Creates a BM25 retriever for this index.

Return type:

Transformer

Parameters:
  • k1 (float) – BM25’s k1 parameter, which controls TF saturation.

  • b (float) – BM25’s b parameter, which controls the length penalty.

  • num_results (int) – The maximum number of results to return per query.

  • include_fields (List[str] | None) – The metadata fields to return for each search result.

  • threads (int) – The number of threads to use during retrieval.

  • verbose (bool) – Whether to display progress information during retrieval.

Example Pipeline:

index.bm25()
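
The effect of k1 and b can be illustrated with a minimal sketch of the textbook BM25 term score. This is not Terrier's implementation, which may differ in details such as the IDF variant; the function name bm25_term_score and its argument names are hypothetical and chosen only for this illustration.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one query term in one document (illustrative only)."""
    # One common smoothed IDF variant; assumed here, not taken from Terrier.
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    # b scales how strongly documents longer than average are penalised.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 bounds the TF component: it approaches (k1 + 1) as tf grows.
    tf_component = tf * (k1 + 1.0) / (tf + k1 * length_norm)
    return idf * tf_component


# TF saturation: going from 5 to 10 occurrences adds less than going from 1 to 2.
low_gain = bm25_term_score(2, 100, 100, 1000, 10) - bm25_term_score(1, 100, 100, 1000, 10)
high_gain = bm25_term_score(10, 100, 100, 1000, 10) - bm25_term_score(5, 100, 100, 1000, 10)
print(low_gain > high_gain)
```

Setting b=0 in this sketch removes the length penalty entirely, while larger k1 values let repeated occurrences keep contributing for longer.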

dph(*, num_results=1000, include_fields=None, threads=1, verbose=False)[source]

Creates a DPH retriever for this index.

Return type:

Transformer

Parameters:
  • num_results (int) – The maximum number of results to return per query.

  • include_fields (List[str] | None) – The metadata fields to return for each search result.

  • threads (int) – The number of threads to use during retrieval.

  • verbose (bool) – Whether to display progress information during retrieval.

Example Pipeline:

index.dph()

Citation

Amati. Frequentist and Bayesian Approach to Information Retrieval. ECIR 2006. [link]
@inproceedings{DBLP:conf/ecir/Amati06,
  author       = {Giambattista Amati},
  editor       = {Mounia Lalmas and
                  Andy MacFarlane and
                  Stefan M. R{\"{u}}ger and
                  Anastasios Tombros and
                  Theodora Tsikrika and
                  Alexei Yavlinsky},
  title        = {Frequentist and Bayesian Approach to Information Retrieval},
  booktitle    = {Advances in Information Retrieval, 28th European Conference on {IR}
                  Research, {ECIR} 2006, London, UK, April 10-12, 2006, Proceedings},
  series       = {Lecture Notes in Computer Science},
  volume       = {3936},
  pages        = {13--24},
  publisher    = {Springer},
  year         = {2006},
  url          = {https://doi.org/10.1007/11735106\_3},
  doi          = {10.1007/11735106\_3},
  timestamp    = {Tue, 14 May 2019 10:00:37 +0200},
  biburl       = {https://dblp.org/rec/conf/ecir/Amati06.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
pl2(*, c=1.0, num_results=1000, include_fields=None, threads=1, verbose=False)[source]

Creates a PL2 retriever for this index.

Return type:

Transformer

Parameters:
  • c (float) – PL2’s c parameter, which controls the length normalization.

  • num_results (int) – The maximum number of results to return per query.

  • include_fields (List[str] | None) – The metadata fields to return for each search result.

  • threads (int) – The number of threads to use during retrieval.

  • verbose (bool) – Whether to display progress information during retrieval.

Example Pipeline:

index.pl2()
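
As a rough illustration of what c controls, the sketch below applies the standard "Normalisation 2" from the divergence-from-randomness framework to a raw term frequency. It assumes that standard formula and covers only the length-normalisation step, not the full PL2 score; the function name pl2_normalised_tf is hypothetical.

```python
import math


def pl2_normalised_tf(tf, doc_len, avg_doc_len, c=1.0):
    """Length-normalised term frequency (Normalisation 2), illustrative only."""
    # Documents longer than average get a smaller normalised frequency;
    # a larger c softens that reduction.
    return tf * math.log2(1.0 + c * avg_doc_len / doc_len)


# With c=1 and an average-length document, the frequency is unchanged.
print(pl2_normalised_tf(3, doc_len=100, avg_doc_len=100))
# A document four times the average length is discounted.
print(pl2_normalised_tf(3, doc_len=400, avg_doc_len=100))
```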

Citation

Amati and Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 2002. [link]
@article{DBLP:journals/tois/AmatiR02,
  author       = {Gianni Amati and
                  C. J. van Rijsbergen},
  title        = {Probabilistic models of information retrieval based on measuring the
                  divergence from randomness},
  journal      = {{ACM} Trans. Inf. Syst.},
  volume       = {20},
  number       = {4},
  pages        = {357--389},
  year         = {2002},
  url          = {http://doi.acm.org/10.1145/582415.582416},
  doi          = {10.1145/582415.582416},
  timestamp    = {Tue, 01 Jun 2021 09:58:08 +0200},
  biburl       = {https://dblp.org/rec/journals/tois/AmatiR02.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
dirichlet_lm(*, mu=2500.0, num_results=1000, include_fields=None, threads=1, verbose=False)[source]

Creates a Dirichlet Language Model retriever for this index.

Return type:

Transformer

Parameters:
  • mu (float) – Dirichlet LM’s mu parameter, which controls the strength of the prior.

  • num_results (int) – The maximum number of results to return per query.

  • include_fields (List[str] | None) – The metadata fields to return for each search result.

  • threads (int) – The number of threads to use during retrieval.

  • verbose (bool) – Whether to display progress information during retrieval.

Example Pipeline:

index.dirichlet_lm()
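
To make the role of mu concrete, here is a minimal sketch of Dirichlet-smoothed term probability as usually defined for language-model retrieval. It is an illustration under that standard definition, not Terrier's code; the function name dirichlet_term_prob and the coll_prob argument (the term's probability in the whole collection) are hypothetical.

```python
def dirichlet_term_prob(tf, doc_len, coll_prob, mu=2500.0):
    """Dirichlet-smoothed probability of a term given a document (illustrative only)."""
    # mu acts like a number of pseudo-tokens drawn from the collection model:
    # the larger mu is, the more the estimate is pulled towards coll_prob.
    return (tf + mu * coll_prob) / (doc_len + mu)


# A term absent from the document still receives a non-zero probability.
print(dirichlet_term_prob(0, doc_len=100, coll_prob=0.01))
# With mu=0 the estimate reduces to the raw relative frequency tf / doc_len.
print(dirichlet_term_prob(2, doc_len=100, coll_prob=0.01, mu=0.0))
```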

Citation

Zhai and Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst. 2004. [link]
@article{DBLP:journals/tois/ZhaiL04,
  author       = {ChengXiang Zhai and
                  John D. Lafferty},
  title        = {A study of smoothing methods for language models applied to information
                  retrieval},
  journal      = {{ACM} Trans. Inf. Syst.},
  volume       = {22},
  number       = {2},
  pages        = {179--214},
  year         = {2004},
  url          = {https://doi.org/10.1145/984321.984322},
  doi          = {10.1145/984321.984322},
  timestamp    = {Tue, 06 Nov 2018 12:51:56 +0100},
  biburl       = {https://dblp.org/rec/journals/tois/ZhaiL04.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
hiemstra_lm(*, Lambda=0.15, num_results=1000, include_fields=None, threads=1, verbose=False)[source]

Creates a Hiemstra Language Model retriever for this index.

Return type:

Transformer

Parameters:
  • Lambda (float) – Hiemstra LM’s lambda parameter, which controls the interpolation weight.

  • num_results (int) – The maximum number of results to return per query.

  • include_fields (List[str] | None) – The metadata fields to return for each search result.

  • threads (int) – The number of threads to use during retrieval.

  • verbose (bool) – Whether to display progress information during retrieval.

Example Pipeline:

index.hiemstra_lm()

Citation

Hiemstra. Using Language Models for Information Retrieval. 2001. [link]
@phdthesis{DBLP:phd/basesearch/Hiemstra01,
  author       = {Djoerd Hiemstra},
  title        = {Using Language Models for Information Retrieval},
  school       = {University of Twente, Enschede, Netherlands},
  year         = {2001},
  url          = {http://eprints.eemcs.utwente.nl/6563/},
  timestamp    = {Thu, 18 May 2017 09:17:27 +0200},
  biburl       = {https://dblp.org/rec/phd/basesearch/Hiemstra01.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
tf(*, num_results=1000, include_fields=None, threads=1, verbose=False)[source]

Creates a raw Term Frequency (TF) retriever for this index.

This is typically useful for retrieving from learned sparse models.

Return type:

Transformer

Parameters:
  • num_results (int) – The maximum number of results to return per query.

  • include_fields (List[str] | None) – The metadata fields to return for each search result.

  • threads (int) – The number of threads to use during retrieval.

  • verbose (bool) – Whether to display progress information during retrieval.

Example Pipeline:

index.tf()
tf_idf(*, k1=1.2, b=0.75, num_results=1000, include_fields=None, threads=1, verbose=False)[source]

Creates a TF-IDF retriever for this index.

This retriever uses the Robertson formulation of TF and the Sparck Jones formulation of IDF.

Return type:

Transformer

Parameters:
  • k1 (float) – TF-IDF’s k1 parameter, which controls TF saturation.

  • b (float) – TF-IDF’s b parameter, which controls the length penalty.

  • num_results (int) – The maximum number of results to return per query.

  • include_fields (List[str] | None) – The metadata fields to return for each search result.

  • threads (int) – The number of threads to use during retrieval.

  • verbose (bool) – Whether to display progress information during retrieval.

Example Pipeline:

index.tf_idf()

Citation

Robertson et al. Okapi at TREC-7: Automatic Ad Hoc, Filtering, VLC and Interactive. TREC 1998.
@inproceedings{DBLP:conf/trec/RobertsonWB98,
  author       = {Stephen E. Robertson and
                  Steve Walker and
                  Micheline Hancock{-}Beaulieu},
  editor       = {Ellen M. Voorhees and
                  Donna K. Harman},
  title        = {Okapi at {TREC-7:} Automatic Ad Hoc, Filtering, {VLC} and Interactive},
  booktitle    = {Proceedings of The Seventh Text REtrieval Conference, {TREC} 1998,
                  Gaithersburg, Maryland, USA, November 9-11, 1998},
  series       = {{NIST} Special Publication},
  volume       = {500-242},
  pages        = {199--210},
  publisher    = {National Institute of Standards and Technology {(NIST)}},
  year         = {1998},
  timestamp    = {Wed, 07 Jul 2021 16:44:22 +0200},
  biburl       = {https://dblp.org/rec/conf/trec/RobertsonWB98.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Citation

Jones. A statistical interpretation of term specificity and its application in retrieval. J. Documentation 2004. [link]
@article{DBLP:journals/jd/Jones04,
  author       = {Karen Sp{\"{a}}rck Jones},
  title        = {A statistical interpretation of term specificity and its application
                  in retrieval},
  journal      = {J. Documentation},
  volume       = {60},
  number       = {5},
  pages        = {493--502},
  year         = {2004},
  url          = {https://doi.org/10.1108/00220410410560573},
  doi          = {10.1108/00220410410560573},
  timestamp    = {Sun, 06 Sep 2020 16:55:45 +0200},
  biburl       = {https://dblp.org/rec/journals/jd/Jones04.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Query Expansion & Reformulation

rm3(*, fb_terms=10, fb_docs=3, fb_lambda=0.6)[source]

Creates an RM3 pseudo-relevance feedback transformer for this index.

Return type:

Transformer

Parameters:
  • fb_terms (int) – The number of feedback terms to use.

  • fb_docs (int) – The number of feedback documents to use.

  • fb_lambda (float) – The interpolation weight between the original query and the feedback model.

Example Pipeline:

index.bm25() >> index.rm3() >> index.bm25() >> pt.rewrite.reset()

Note

pt.rewrite.reset() is needed after the feedback step to reset the query to its original form.
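
The interplay of fb_terms and fb_lambda can be sketched as a simple interpolation of two term distributions. This is a conceptual illustration only: it assumes that fb_lambda is the weight given to the original query (the direction is not stated above), that the feedback term weights have already been estimated from the top fb_docs documents, and the function name rm3_interpolate is hypothetical.

```python
def rm3_interpolate(query_weights, feedback_weights, fb_terms=10, fb_lambda=0.6):
    """Mix original query terms with the top feedback terms (illustrative only)."""

    def normalise(weights):
        total = sum(weights.values())
        return {term: weight / total for term, weight in weights.items()}

    # Keep only the fb_terms highest-weighted feedback terms.
    ranked = sorted(feedback_weights.items(), key=lambda kv: kv[1], reverse=True)
    top_feedback = normalise(dict(ranked[:fb_terms]))
    original = normalise(query_weights)

    # ASSUMPTION: fb_lambda weights the original query, (1 - fb_lambda) the feedback model.
    return {
        term: fb_lambda * original.get(term, 0.0)
        + (1.0 - fb_lambda) * top_feedback.get(term, 0.0)
        for term in set(original) | set(top_feedback)
    }


expanded = rm3_interpolate(
    {"terrier": 1.0},
    {"terrier": 0.5, "retrieval": 0.3, "index": 0.2},
    fb_terms=2,
)
print(sorted(expanded))
```

The expanded query carries new terms and weights, which is why the pipeline above resets the query after the second retrieval stage.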

Citation

Jaleel et al. UMass at TREC 2004: Novelty and HARD. TREC 2004. [link]
@inproceedings{DBLP:conf/trec/JaleelACDLLSW04,
  author       = {Nasreen Abdul Jaleel and
                  James Allan and
                  W. Bruce Croft and
                  Fernando Diaz and
                  Leah S. Larkey and
                  Xiaoyan Li and
                  Mark D. Smucker and
                  Courtney Wade},
  editor       = {Ellen M. Voorhees and
                  Lori P. Buckland},
  title        = {UMass at {TREC} 2004: Novelty and {HARD}},
  booktitle    = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004,
                  Gaithersburg, Maryland, USA, November 16-19, 2004},
  series       = {{NIST} Special Publication},
  volume       = {500-261},
  publisher    = {National Institute of Standards and Technology {(NIST)}},
  year         = {2004},
  url          = {http://trec.nist.gov/pubs/trec13/papers/umass.novelty.hard.pdf},
  timestamp    = {Wed, 07 Jul 2021 16:44:22 +0200},
  biburl       = {https://dblp.org/rec/conf/trec/JaleelACDLLSW04.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
bo1(*, fb_terms=10, fb_docs=3)[source]

Creates a Bo1 pseudo-relevance feedback transformer for this index.

Return type:

Transformer

Parameters:
  • fb_terms (int) – The number of feedback terms to use.

  • fb_docs (int) – The number of feedback documents to use.

Example Pipeline:

index.bm25() >> index.bo1() >> index.bm25() >> pt.rewrite.reset()

Note

pt.rewrite.reset() is needed after the feedback step to reset the query to its original form.

Citation

Amati. Probability models for information retrieval based on divergence from randomness. 2003. [link]
@phdthesis{DBLP:phd/ethos/Amati03,
  author       = {Giambattista Amati},
  title        = {Probability models for information retrieval based on divergence from
                  randomness},
  school       = {University of Glasgow, {UK}},
  year         = {2003},
  url          = {http://theses.gla.ac.uk/1570/},
  timestamp    = {Tue, 05 Apr 2022 10:59:13 +0200},
  biburl       = {https://dblp.org/rec/phd/ethos/Amati03.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
kl(*, fb_terms=10, fb_docs=3)[source]

Creates a KL-Divergence pseudo-relevance feedback transformer for this index.

Return type:

Transformer

Parameters:
  • fb_terms (int) – The number of feedback terms to use.

  • fb_docs (int) – The number of feedback documents to use.

Example Pipeline:

index.bm25() >> index.kl() >> index.bm25() >> pt.rewrite.reset()

Note

pt.rewrite.reset() is needed after the feedback step to reset the query to its original form.

Citation

Amati. Probability models for information retrieval based on divergence from randomness. 2003. [link]
@phdthesis{DBLP:phd/ethos/Amati03,
  author       = {Giambattista Amati},
  title        = {Probability models for information retrieval based on divergence from
                  randomness},
  school       = {University of Glasgow, {UK}},
  year         = {2003},
  url          = {http://theses.gla.ac.uk/1570/},
  timestamp    = {Tue, 05 Apr 2022 10:59:13 +0200},
  biburl       = {https://dblp.org/rec/phd/ethos/Amati03.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
sdm()[source]

Creates a Sequential Dependence Model (SDM) query expansion transformer.

Requires that the index was built with positional information.
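
The need for positions follows from the model itself: as described in the paper cited below, SDM scores adjacent query term pairs (as exact phrases and within unordered windows) in addition to single terms. The sketch below only enumerates those components for a whitespace-separated query. It is not PyTerrier's implementation and does not produce Terrier query syntax; the function name sdm_components is hypothetical.

```python
def sdm_components(query):
    """List the single terms and adjacent term pairs that SDM would score."""
    terms = query.split()
    # Each adjacent pair is matched as a phrase or within a window,
    # which requires term positions to be stored in the index.
    adjacent_pairs = list(zip(terms, terms[1:]))
    return {"unigrams": terms, "adjacent_pairs": adjacent_pairs}


components = sdm_components("information retrieval models")
print(components["adjacent_pairs"])
```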

Citation

Metzler and Croft. A Markov random field model for term dependencies. SIGIR 2005. [link]
@inproceedings{DBLP:conf/sigir/MetzlerC05,
  author       = {Donald Metzler and
                  W. Bruce Croft},
  editor       = {Ricardo A. Baeza{-}Yates and
                  Nivio Ziviani and
                  Gary Marchionini and
                  Alistair Moffat and
                  John Tait},
  title        = {A Markov random field model for term dependencies},
  booktitle    = {{SIGIR} 2005: Proceedings of the 28th Annual International {ACM} {SIGIR}
                  Conference on Research and Development in Information Retrieval, Salvador,
                  Brazil, August 15-19, 2005},
  pages        = {472--479},
  publisher    = {{ACM}},
  year         = {2005},
  url          = {https://doi.org/10.1145/1076034.1076115},
  doi          = {10.1145/1076034.1076115},
  timestamp    = {Tue, 06 Nov 2018 11:07:23 +0100},
  biburl       = {https://dblp.org/rec/conf/sigir/MetzlerC05.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Loading

text_loader(fields='*', *, verbose=False)[source]

Creates a transformer that loads stored text content from this index.

Return type:

Transformer

Parameters:

fields (List[str] | str | Literal['*']) – The metadata fields to load for each document. If "*", loads all available fields.

Example Pipeline:

index.text_loader()

Indexing

index(iter, **kwargs)[source]

Indexes the given input data, creating the index if it does not yet exist, or raising an error if it does.

This method inspects the first document in the input to try to infer reasonable settings for indexing:

  • If a toks column is present, it is assumed to contain pre-tokenized text, and toks_indexer() is used. Otherwise, all str columns (except docno) are indexed as raw text using indexer().

  • Text fields are stored as metadata.

  • The maximum metadata lengths are predicted based on the lengths present in the first document.
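
The inference rules above can be sketched in plain Python as follows. This is an illustration of the described behaviour, not the library's code: the function name infer_index_settings is hypothetical, and the headroom multiplier is an assumption, since the exact rule used to predict metadata lengths is not stated here.

```python
def infer_index_settings(first_doc, headroom=2):
    """Guess indexing settings from the first document (illustrative only)."""
    # All str columns other than docno are treated as text fields.
    text_fields = [
        key for key, value in first_doc.items()
        if isinstance(value, str) and key != "docno"
    ]
    # ASSUMPTION: reserve `headroom` times the length seen in the first document.
    meta = {"docno": len(str(first_doc["docno"])) * headroom}
    for field in text_fields:
        meta[field] = len(first_doc[field]) * headroom

    pretokenised = "toks" in first_doc
    return {
        "indexer": "toks_indexer" if pretokenised else "indexer",
        "text_attrs": [] if pretokenised else text_fields,
        "meta": meta,
    }


settings = infer_index_settings({"docno": "d1", "title": "Hello", "text": "Hello world"})
print(settings)
```

Because the settings come from a single document, an unusually short first document may lead to metadata lengths that are too small for the rest of the collection; calling indexer() directly with an explicit meta argument avoids relying on the guess.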

Parameters:
  • iter (Iterable[Dict[str, Any]]) – The documents to index as an iterable of dicts.

  • kwargs (Any) – Ignored

indexer(*, meta={'docno': 20}, text_attrs=['text'], tokeniser='english', stemmer='porter', stopwords='terrier', store_separate_fields=False, store_positions=False, threads=1)[source]

Returns an indexer that is used to build this index.

Return type:

Indexer

Parameters:
  • meta (Dict) – The fields to store as metadata for each document. The keys are the metadata field names, and the values are the maximum lengths for each field.

  • text_attrs (List[str]) – The text fields to index as text for each document.

  • tokeniser (str | TerrierTokeniser) – The tokeniser to use.

  • stemmer (None | str | TerrierStemmer) – The stemmer to apply to each token.

  • stopwords (None | TerrierStopwords | str | List[str]) – The set of words to remove as stopwords.

  • store_separate_fields (bool) – Whether to store each text attribute as a separate field in the index. This allows for fielded retrieval, but increases index size.

  • store_positions (bool) – Whether to store position information (i.e., blocks) in the index. This allows for positional queries, but increases index size and retrieval time.

  • threads (int) – The number of threads to use during indexing.

Example Pipeline:

index = pt.terrier.TerrierIndex('my_index.terrier')
index.indexer()
toks_indexer(*, meta={'docno': 20}, threads=1)[source]

Returns an indexer that indexes pre-tokenised documents into this index.

Return type:

Indexer

Parameters:
  • meta (Dict) – The fields to store as metadata for each document. The keys are the metadata field names, and the values are the maximum lengths for each field.

  • threads (int) – The number of threads to use during indexing.

Example Pipeline:

index = pt.terrier.TerrierIndex('my_index.terrier')
index.toks_indexer()

Index Data

collection_statistics()[source]

Returns the collection statistics for this index.

Example:

Show collection statistics for a Terrier index.
>>> stats = index.collection_statistics()
>>> print(stats)
Number of documents: 11429
Number of terms: 7756
Number of postings: 224573
Number of fields: 0
Number of tokens: 271581
Field names: []
Positions:   false

In this example, the index has 11429 documents, which contained 271581 word occurrences. 7756 unique words were identified. The total number of postings in the inverted index is 224573. This index did not record fields during indexing (which can be useful for models such as BM25F). Similarly, positions, which are used for phrasal queries or proximity models, were not recorded.
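
A few derived quantities follow directly from these numbers. The sketch below copies the values from the example output into a plain dict (it does not call the Java CollectionStatistics object) and computes averages that are often useful when reasoning about weighting models; the variable names are chosen for this illustration.

```python
# Values copied from the example output above.
stats = {"documents": 11429, "terms": 7756, "postings": 224573, "tokens": 271581}

# Average document length in tokens.
avg_doc_len = stats["tokens"] / stats["documents"]
# A posting is one (term, document) pair, so this is the average
# number of distinct terms per document.
avg_unique_terms_per_doc = stats["postings"] / stats["documents"]
# Average number of documents each term occurs in.
avg_doc_freq = stats["postings"] / stats["terms"]

print(f"{avg_doc_len:.1f} tokens per document")
print(f"{avg_unique_terms_per_doc:.1f} distinct terms per document")
print(f"{avg_doc_freq:.1f} documents per term")
```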

Returns:

A Java CollectionStatistics object for this index.

lexicon()[source]

The lexicon for this index.

Note that the terms in the lexicon include all pre-processing, such as stemming. For example, the term ‘chemical’ would be stored as ‘chemic’ when using the default Porter stemmer.

Returns:

A Java Lexicon object for this index.

inverted_index()[source]

The inverted posting index for this index.

Returns:

A Java PostingIndex object for this index’s inverted index.

document_index()[source]

The document index for this index.

Returns:

A Java DocumentIndex object for this index.

meta_index()[source]

The meta index for this index.

Example:

Show metadata fields in a Terrier index.
>>> print(index.meta_index().getKeys())
['docno', 'text']

In this example, the index contains two metadata fields: docno, which contains the document identifiers, and text, which contains the raw text of each document.

Returns:

A Java MetaIndex object for this index.

direct_index()[source]

The direct (forward) index for this index.

Returns:

A Java PostingIndex object for this index’s direct index.

index_ref()[source]

The internal Java index reference object for this index.

Returns:

A Java IndexRef object for this index.

index_obj()[source]

The internal Java index object for this index.

Returns:

A Java Index object for this index.

Miscellaneous

built()[source]

Returns whether the index has been built (or is a built in-memory index).

get_corpus_iter(return_toks=True)[source]

Returns an iterable over the documents in this index’s corpus.

Return type:

Iterable[Dict[str, Any]]

Parameters:

return_toks (bool) – Whether to return tokenised text (list of strings) or raw text (string).

A corpus iter from a Terrier index can be used for various purposes, including:
  • indexing the pre-tokenised Terrier index directly in another indexing pipeline

  • extracting document metadata for ingestion into another indexing pipeline

classmethod coerce(cls, index_like)[source]

Attempts to build a TerrierIndex from the given object.

Return type:

TerrierIndex

Parameters:

index_like (object) – The object to coerce into a TerrierIndex. If a str or Path, it loads the index at the provided path. If a pt.terrier.J.IndexRef or pt.terrier.J.Index, it creates a TerrierIndex from the Java object. If a pt.terrier.TerrierIndex, it returns itself.

Sharing

See also

You can share Terrier indices using the Artifacts API:

enum pyterrier.terrier.TerrierModel(value)[source]

A built-in Terrier weighting (scoring) model.

This enum is primarily used with TerrierIndex.retriever() to specify the weighting model to use.

Valid values are as follows:

bm25 = <TerrierModel.bm25: 'bm25'>
dph = <TerrierModel.dph: 'dph'>
pl2 = <TerrierModel.pl2: 'pl2'>
dirichlet_lm = <TerrierModel.dirichlet_lm: 'dirichlet_lm'>
hiemstra_lm = <TerrierModel.hiemstra_lm: 'hiemstra_lm'>
tf = <TerrierModel.tf: 'tf'>
tf_idf = <TerrierModel.tf_idf: 'tf_idf'>
enum pyterrier.terrier.TerrierTokeniser(value)[source]

A built-in Terrier tokeniser.

This enum is primarily used with indexer.

Valid values are as follows:

whitespace = <TerrierTokeniser.whitespace: 'whitespace'>
english = <TerrierTokeniser.english: 'english'>
utf = <TerrierTokeniser.utf: 'utf'>
twitter = <TerrierTokeniser.twitter: 'twitter'>
identity = <TerrierTokeniser.identity: 'identity'>
enum pyterrier.terrier.TerrierStemmer(value)[source]

A built-in Terrier stemmer.

The stemming configuration is saved in the index and loaded at retrieval time. Snowball stemmers for various languages are available in Terrier.

This enum is primarily used with indexer.

Valid values are as follows:

none = <TerrierStemmer.none: 'none'>
porter = <TerrierStemmer.porter: 'porter'>
weakporter = <TerrierStemmer.weakporter: 'weakporter'>
danish = <TerrierStemmer.danish: 'danish'>
finnish = <TerrierStemmer.finnish: 'finnish'>
german = <TerrierStemmer.german: 'german'>
hungarian = <TerrierStemmer.hungarian: 'hungarian'>
norwegian = <TerrierStemmer.norwegian: 'norwegian'>
portugese = <TerrierStemmer.portugese: 'portugese'>
spanish = <TerrierStemmer.spanish: 'spanish'>
swedish = <TerrierStemmer.swedish: 'swedish'>
turkish = <TerrierStemmer.turkish: 'turkish'>

The Enum and its members also have the following methods:

stem(tok)[source]

Stem a single token using this stemmer.

stemmer = pt.TerrierStemmer.porter
stemmed_word = stemmer.stem('abandoned')
enum pyterrier.terrier.TerrierStopwords(value)[source]

The stopword configuration to use for Terrier.

This enum is primarily used with indexer.

Valid values are as follows:

none = <TerrierStopwords.none: 'none'>
terrier = <TerrierStopwords.terrier: 'terrier'>
custom = <TerrierStopwords.custom: 'custom'>

Mid-Level API

The Mid-Level API provides more control over Terrier functionality.

Indexing

class pyterrier.terrier.IterDictIndexer(index_path, *, meta={'docno': 20}, text_attrs=['text'], meta_reverse=['docno'], pretokenised=False, fields=False, threads=1, **kwargs)

Use this Indexer if you wish to index an iter of dicts (possibly with multiple fields). This version is optimized by using multiple threads and POSIX fifos to transfer data, which ends up being much faster.

Parameters:
  • index_path (str) – Directory to store index. Ignored for IndexingType.MEMORY.

  • meta (Dict[str, int]) – What metadata for each document to record in the index, and what length to reserve. Metadata values will be truncated to this length. Defaults to {“docno” : 20}.

  • text_attrs (List[str]) – List of columns of the input data that should be indexed. These are concatenated in the document representation. Defaults to [“text”].

  • meta_reverse (List[str]) – What metadata should we be able to resolve back to a docid. Defaults to [“docno”].

  • pretokenised (bool) – Whether to index pre-tokenized text, e.g., through a Learned Sparse encoder. If True, will ignore text_attrs and instead index the dictionary contained in the toks column.

  • fields (bool) – Whether a fields-indexer should be used, i.e. whether the frequency in each attribute should be recorded separately in the Terrier index. This allows application of weighting models such as BM25F.

  • threads (int) – Number of threads to use for indexing. Defaults to 1.

  • kwargs – Additional keyword arguments passed to TerrierIndexer.
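
The effect of meta and text_attrs on a single document can be sketched in plain Python. This is an illustration of the behaviour described in the parameter list, not the indexer's code; it assumes metadata lengths are counted in characters and that text attributes are joined with a space, and the function name prepare_document is hypothetical.

```python
def prepare_document(doc, meta=None, text_attrs=None):
    """Show what would be stored as metadata and what would be indexed as text."""
    meta = {"docno": 20} if meta is None else meta
    text_attrs = ["text"] if text_attrs is None else text_attrs

    # Metadata values are truncated to the reserved length
    # (ASSUMPTION: measured in characters).
    stored_meta = {key: str(doc[key])[:max_len] for key, max_len in meta.items()}
    # Text attributes are concatenated into one document representation.
    indexed_text = " ".join(doc[attr] for attr in text_attrs)
    return stored_meta, indexed_text


stored, text = prepare_document(
    {"docno": "doc-0001", "title": "Terrier", "text": "An IR platform"},
    meta={"docno": 4, "title": 20},
    text_attrs=["title", "text"],
)
print(stored, text)
```

In this sketch the docno is cut to four characters, which shows why the reserved lengths should cover the longest value in the collection: truncated docno values may no longer be unique.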

index(it, fields=None)

Index the specified iter of dicts with the (optional) specified fields

Parameters:

it – an iter of document dicts to be indexed

class pyterrier.terrier.TRECCollectionIndexer(index_path, collection='trec', verbose=False, meta={'docno': 20}, meta_reverse=['docno'], meta_tags={}, **kwargs)[source]

Use this Indexer if you wish to index a TREC formatted collection

Init method

Parameters:
  • index_path (str) – Directory to store index. Ignored for IndexingType.MEMORY.

  • blocks – Create indexer with blocks if true, else without blocks. Default is False.

  • overwrite – If index already present at index_path, True would overwrite it, False throws an Exception. Default is False.

  • type – the specific indexing procedure to use. Default is IndexingType.CLASSIC.

  • collection (str) – A collection name, a Class instance, or one of “trec”, “trecweb”, “warc”. Default is “trec”.

  • meta (Dict[str, int]) – What metadata for each document to record in the index, and what length to reserve. Metadata fields will be truncated to this length. Defaults to {“docno” : 20}.

  • meta_reverse (List[str]) – What metadata should we be able to resolve back to a docid. Defaults to [“docno”].

  • meta_tags (Dict[str, str]) – For collections formed using tagged data (e.g. HTML), which tags correspond to which metadata. This is useful for recording the text of documents for use in neural rankers - see Working with Document Texts.

  • verbose (bool)

index(files_path)[source]

Index the specified TREC formatted files

Parameters:

files_path (str | List[str]) – A path string, or a list of path strings for multiple files.

class pyterrier.terrier.FilesIndexer(index_path, *, meta={'docno': 20, 'filename': 512}, meta_reverse=['docno'], meta_tags={}, **kwargs)[source]

Use this Indexer if you wish to index files such as pdf, docx, or txt.

Parameters:
  • index_path (str) – Directory to store index. Ignored for IndexingType.MEMORY.

  • blocks (bool) – Create indexer with blocks if true, else without blocks. Default is False.

  • type (IndexingType) – the specific indexing procedure to use. Default is IndexingType.CLASSIC.

  • meta (Dict[str,int]) – What metadata for each document to record in the index, and what length to reserve. Metadata fields will be truncated to this length. Defaults to {“docno” : 20, “filename” : 512}.

  • meta_reverse (List[str]) – What metadata should we be able to resolve back to a docid. Defaults to [“docno”].

  • meta_tags (Dict[str,str]) – For collections formed using tagged data (e.g. HTML), which tags correspond to which metadata. Defaults to empty. This is useful for recording the text of documents for use in neural rankers - see Working with Document Texts.

index(files_path)[source]

Index the specified files.

Parameters:

files_path (str | List[str]) – A path string, or a list of path strings for multiple files.

enum pyterrier.terrier.IndexingType(value)[source]

This enum is used to determine the type of index built by Terrier. The default is CLASSIC. For more information, see the relevant Terrier indexer and realtime documentation.

Valid values are as follows:

CLASSIC = <IndexingType.CLASSIC: 1>
SINGLEPASS = <IndexingType.SINGLEPASS: 2>
MEMORY = <IndexingType.MEMORY: 3>
pyterrier.terrier.treccollection2textgen(files, meta=['docno'], meta_tags={'text': 'ELSE'}, verbose=False, num_docs=None, tag_text_length=4096)[source]

Creates a generator of dictionaries by parsing TREC formatted files. This is useful for parsing TREC-formatted corpora in indexers like IterDictIndexer, or similar indexers in other plugins (e.g. ColBERTIndexer).

Arguments:
  • files (List[str]) – list of files to parse in TREC format.

  • meta (List[str]) – list of attributes to expose in the dictionaries as metadata.

  • meta_tags (Dict[str,str]) – mapping of TREC tags as metadata.

  • tag_text_length (int) – maximum length of metadata. Defaults to 4096.

  • verbose (bool) – set to true to show a TQDM progress bar. Defaults to False.

  • num_docs (int) – a hint for TQDM to size the progress bar based on document counts rather than file count.

Example:

files = pt.io.find_files("/path/to/Disk45")
gen = pt.index.treccollection2textgen(files)
index = pt.IterDictIndexer("./index45").index(gen)
Parameters:
  • files (List[str])

  • meta (List[str])

  • meta_tags (Dict[str, str])

  • tag_text_length (int)
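
To give a feel for the shape of the dictionaries such a generator yields, here is a minimal pure-Python sketch that parses TREC-style markup from a string. It is not the library's parser (which reads files and supports configurable tag mappings); the function name parse_trec_text is hypothetical, and the sketch simply puts all non-DOCNO text into a text key.

```python
import re

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def parse_trec_text(content, tag_text_length=4096):
    """Yield one dict per <DOC> element found in a TREC-formatted string."""
    for match in DOC_RE.finditer(content):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        if docno is None:
            continue  # skip malformed documents without an identifier
        # Drop the DOCNO element, strip remaining tags, and collapse whitespace.
        remainder = DOCNO_RE.sub("", body)
        text = " ".join(TAG_RE.sub(" ", remainder).split())
        yield {"docno": docno.group(1), "text": text[:tag_text_length]}


sample = (
    "<DOC>\n<DOCNO> D1 </DOCNO>\n<TEXT>Hello <b>world</b></TEXT>\n</DOC>\n"
    "<DOC><DOCNO>D2</DOCNO><TEXT>Second doc</TEXT></DOC>"
)
docs = list(parse_trec_text(sample))
print(docs)
```

Dictionaries of this shape, with a docno key and one or more text keys, are what an iter-of-dicts indexer such as IterDictIndexer consumes.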

Retrieval & Scoring

class pyterrier.terrier.Retriever(index_location, controls=None, properties=None, metadata=['docno'], num_results=None, wmodel=None, tokeniser=TerrierTokeniser.english, threads=1, verbose=False)[source]

Use this class for retrieval by Terrier

Init method

Parameters:
  • index_location (str | Any) – An index-like object - An Index, an IndexRef, or a String that can be resolved to an IndexRef

  • controls (Dict[str, str] | None) – A dictionary with the control names and values

  • properties (Dict[str, str] | None) – A dictionary with the property keys and values

  • verbose (bool) – If True, the transform method displays progress.

  • num_results (int | None) – Number of results to retrieve.

  • metadata (List[str]) – What metadata to retrieve. Default is [“docno”].

  • wmodel (str | Callable | None)

  • tokeniser (str | TerrierTokeniser)

  • threads (int)

transform(queries)[source]

Performs the retrieval

Parameters:

queries – a pandas.DataFrame with columns=[‘qid’, ‘query’]. For re-ranking, the DataFrame may also have a ‘docid’ and/or ‘docno’ column.

Returns:

pandas.DataFrame with columns=[‘qid’, ‘docno’, ‘rank’, ‘score’]
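As a rough illustration of the output columns, the following sketch turns per-document scores into rows with qid, docno, rank and score. It uses plain dictionaries rather than a pandas.DataFrame so that it runs without any dependencies. The helper name to_result_rows is hypothetical, and the sketch assumes that ranks start at 0 and that higher scores rank first.

```python
from typing import Dict, List


def to_result_rows(qid: str, scores: Dict[str, float], num_results: int = 1000) -> List[dict]:
    """Sort documents by descending score and attach a 0-based rank."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [
        {"qid": qid, "docno": docno, "rank": rank, "score": score}
        for rank, (docno, score) in enumerate(ranked[:num_results])
    ]


# Made-up scores for three documents; only the best two are kept.
rows = to_result_rows("q1", {"d1": 1.5, "d2": 3.0, "d3": 0.2}, num_results=2)
```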

class pyterrier.terrier.FeaturesRetriever(index_location, features, controls=None, properties=None, threads=1, **kwargs)[source]

Use this class for retrieval with multiple features

Init method

Parameters:
  • index_location (str | Any) – An index-like object - An Index, an IndexRef, or a String that can be resolved to an IndexRef

  • features (List[str]) – List of features to use

  • controls (Dict[str, str] | None) – A dictionary with the control names and values

  • properties (Dict[str, str] | None) – A dictionary with the property keys and values

  • verbose – If True, the transform method displays progress.

  • num_results – Number of results to retrieve.

  • threads (int)

transform(queries)[source]

Performs the retrieval with multiple features

Parameters:

queries – A pandas.DataFrame with columns=[‘qid’, ‘query’]. For re-ranking, the DataFrame may also have a ‘docid’ and/or ‘docno’ column.

Returns:

a pandas.DataFrame with columns=[‘qid’, ‘docno’, ‘score’, ‘rank’, ‘features’]

class pyterrier.terrier.TextScorer(*args, **kwargs)[source]

A re-ranker class, which takes the queries and the contents of documents, indexes the contents of the documents using a MemoryIndex, and performs ranking of those documents with respect to the queries. Unknown kwargs are passed to Retriever.

Parameters:
  • takes – configuration - what is needed as input: “queries”, or “docs”. Default is “docs” since v0.8.

  • returns – configuration - what is needed as output: “queries”, or “docs”. Default is “docs”.

  • body_attr – what dataframe input column contains the text of the document. Default is “body”.

  • wmodel – name of the weighting model to use for scoring.

  • background_index – An optional background index to use for term and collection statistics. If a weighting model such as BM25, TF_IDF or PL2 is used without setting the background_index, the background statistics will be calculated from the dataframe, which is usually not the desired behaviour.

Example:

import pandas as pd
import pyterrier as pt

df = pd.DataFrame(
    [
        ["q1", "chemical reactions", "d1", "professor protor poured the chemicals"],
        ["q1", "chemical reactions", "d2", "chemical brothers turned up the beats"],
    ], columns=["qid", "query", "docno", "text"])  # one column name per value in each row
textscorer = pt.TextScorer(takes="docs", body_attr="text", wmodel="Tf")
rtr = textscorer.transform(df)
# rtr will score each document by term frequency for the query "chemical reactions" based on the provided document contents

Example:

df = pd.DataFrame(
    [
        ["q1", "chemical reactions", "d1", "professor protor poured the chemicals"],
        ["q1", "chemical reactions", "d2", "chemical brothers turned up the beats"],
    ], columns=["qid", "query", "docno", "text"])  # one column name per value in each row
existing_index = pt.IndexFactory.of(...)
textscorer = pt.TextScorer(takes="docs", body_attr="text", wmodel="TF_IDF", background_index=existing_index)
rtr = textscorer.transform(df)
# rtr will score each document by TF_IDF for the query "chemical reactions" based on the provided document contents
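The idea of scoring supplied text by term frequency can be sketched without Terrier. This is an illustration only, not what TextScorer does internally: it assumes lower-cased whitespace tokenisation with no stemming or stopword removal, and the helper tf_score is hypothetical.

```python
from collections import Counter


def tf_score(query: str, text: str) -> int:
    """Sum the occurrences in `text` of each distinct query term."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in set(query.lower().split()))


rows = [
    {"qid": "q1", "query": "chemical reactions", "docno": "d1",
     "text": "professor protor poured the chemicals"},
    {"qid": "q1", "query": "chemical reactions", "docno": "d2",
     "text": "chemical brothers turned up the beats"},
]
# Copy each row and add a score computed from its own text.
scored = [dict(row, score=tf_score(row["query"], row["text"])) for row in rows]
```

In this sketch "chemicals" in d1 does not match the query term "chemical" because no stemming is applied, so scores from a real Terrier pipeline with stemming enabled may differ.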

Query Expansion

class pyterrier.terrier.rewrite.SDM(verbose=0, remove_stopwords=True, prox_model=None, tokeniser=TerrierTokeniser.english, **kwargs)[source]

Implements the sequential dependence model, which Terrier supports using its Indri/Galago-compatible matchop query language. The rewritten query is derived using the Terrier class DependenceModelPreProcess.

This transformer changes the query. It must be followed by a terrier.Retriever() transformer. The original query is saved in the “query_0” column, which can be restored using pt.rewrite.reset().

Citation

Metzler and Croft. A Markov random field model for term dependencies. SIGIR 2005.
@inproceedings{DBLP:conf/sigir/MetzlerC05,
  author       = {Donald Metzler and
                  W. Bruce Croft},
  editor       = {Ricardo A. Baeza{-}Yates and
                  Nivio Ziviani and
                  Gary Marchionini and
                  Alistair Moffat and
                  John Tait},
  title        = {A Markov random field model for term dependencies},
  booktitle    = {{SIGIR} 2005: Proceedings of the 28th Annual International {ACM} {SIGIR}
                  Conference on Research and Development in Information Retrieval, Salvador,
                  Brazil, August 15-19, 2005},
  pages        = {472--479},
  publisher    = {{ACM}},
  year         = {2005},
  url          = {https://doi.org/10.1145/1076034.1076115},
  doi          = {10.1145/1076034.1076115},
  timestamp    = {Tue, 06 Nov 2018 11:07:23 +0100},
  biburl       = {https://dblp.org/rec/conf/sigir/MetzlerC05.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
Parameters:

tokeniser (str | TerrierTokeniser)
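To make the shape of a sequential dependence query concrete, the sketch below builds one in generic Indri-style syntax: a unigram component, an exact-phrase component (#1) over adjacent term pairs, and an unordered-window component (#uw8) over the same pairs. This is an assumption-laden illustration, not the output of DependenceModelPreProcess: the function name sdm_query is hypothetical, the weights 0.85/0.1/0.05 are commonly used values assumed here, and Terrier's actual rewritten query may use different operators and weights.

```python
from typing import List


def sdm_query(terms: List[str], w_t: float = 0.85, w_o: float = 0.1, w_u: float = 0.05) -> str:
    """Build an Indri-style SDM query from unigrams and adjacent term pairs."""
    unigrams = "#combine(" + " ".join(terms) + ")"
    pairs = list(zip(terms, terms[1:]))
    if not pairs:
        # A single-term query has no dependencies to model.
        return unigrams
    ordered = "#combine(" + " ".join(f"#1({a} {b})" for a, b in pairs) + ")"
    unordered = "#combine(" + " ".join(f"#uw8({a} {b})" for a, b in pairs) + ")"
    return f"#weight( {w_t} {unigrams} {w_o} {ordered} {w_u} {unordered} )"


rewritten = sdm_query(["markov", "random", "field"])
```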

class pyterrier.terrier.rewrite.RM3(*args, **kwargs)[source]

Performs query expansion using RM3 relevance models.

This transformer must be followed by a terrier.Retriever() transformer. The original query is saved in the “query_0” column, which can be restored using pt.rewrite.reset().

Instance Attributes:
  • fb_terms(int): number of feedback terms. Defaults to 10

  • fb_docs(int): number of feedback documents. Defaults to 3

  • fb_lambda(float): lambda in RM3, i.e. importance of relevance model vs. feedback model. Defaults to 0.6.

Example:

bm25 = pt.terrier.Retriever(index, wmodel="BM25")
rm3_pipe = bm25 >> pt.rewrite.RM3(index) >> bm25
pt.Experiment(
    [bm25, rm3_pipe],
    dataset.get_topics(),
    dataset.get_qrels(),
    ["map"],
)

Citation

Jaleel et al. UMass at TREC 2004: Novelty and HARD. TREC 2004.
@inproceedings{DBLP:conf/trec/JaleelACDLLSW04,
  author       = {Nasreen Abdul Jaleel and
                  James Allan and
                  W. Bruce Croft and
                  Fernando Diaz and
                  Leah S. Larkey and
                  Xiaoyan Li and
                  Mark D. Smucker and
                  Courtney Wade},
  editor       = {Ellen M. Voorhees and
                  Lori P. Buckland},
  title        = {UMass at {TREC} 2004: Novelty and {HARD}},
  booktitle    = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004,
                  Gaithersburg, Maryland, USA, November 16-19, 2004},
  series       = {{NIST} Special Publication},
  volume       = {500-261},
  publisher    = {National Institute of Standards and Technology {(NIST)}},
  year         = {2004},
  url          = {http://trec.nist.gov/pubs/trec13/papers/umass.novelty.hard.pdf},
  timestamp    = {Wed, 07 Jul 2021 16:44:22 +0200},
  biburl       = {https://dblp.org/rec/conf/trec/JaleelACDLLSW04.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
Parameters:
  • index_like – the Terrier index to use

  • fb_terms (int) – number of terms to add to the query. Terrier’s default setting is 10 expansion terms.

  • fb_docs (int) – number of feedback documents to consider. Terrier’s default setting is 3 feedback documents.

  • fb_lambda (float) – lambda in RM3, i.e. importance of relevance model vs. feedback model. Defaults to 0.6.

class pyterrier.terrier.rewrite.Bo1QueryExpansion(*args, **kwargs)[source]

Applies the Bo1 query expansion model from the Divergence from Randomness Framework, as provided by Terrier. It must be followed by a terrier.Retriever() transformer. The original query is saved in the “query_0” column, which can be restored using pt.rewrite.reset().

Instance Attributes:
  • fb_terms(int): number of feedback terms. Defaults to 10

  • fb_docs(int): number of feedback documents. Defaults to 3

Citation

Amati and Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 2002.
@article{DBLP:journals/tois/AmatiR02,
  author       = {Gianni Amati and
                  C. J. van Rijsbergen},
  title        = {Probabilistic models of information retrieval based on measuring the
                  divergence from randomness},
  journal      = {{ACM} Trans. Inf. Syst.},
  volume       = {20},
  number       = {4},
  pages        = {357--389},
  year         = {2002},
  url          = {http://doi.acm.org/10.1145/582415.582416},
  doi          = {10.1145/582415.582416},
  timestamp    = {Tue, 01 Jun 2021 09:58:08 +0200},
  biburl       = {https://dblp.org/rec/journals/tois/AmatiR02.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
Parameters:
  • index_like – the Terrier index to use.

  • fb_terms (int) – number of terms to add to the query. Terrier’s default setting is 10 expansion terms.

  • fb_docs (int) – number of feedback documents to consider. Terrier’s default setting is 3 feedback documents.
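The role of fb_terms and fb_docs can be sketched with the Bo1 term weight as it is usually stated: w(t) = tf_x · log2((1 + P_n) / P_n) + log2(1 + P_n), with P_n = F / N, where tf_x is the term's frequency in the feedback documents, F its frequency in the collection and N the number of documents in the collection. The code below is a sketch under those assumptions and not Terrier's code path; the function names and the collection statistics are made up for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bo1_weight(tf_x: int, coll_freq: int, num_docs: int) -> float:
    """Bo1 weight of a term seen tf_x times in the feedback documents."""
    p_n = coll_freq / num_docs
    return tf_x * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)


def expansion_terms(feedback_docs: List[List[str]], coll_freq: Dict[str, int],
                    num_docs: int, fb_terms: int = 10) -> List[Tuple[str, float]]:
    """Rank terms from the feedback documents and keep the fb_terms best."""
    tf_x = Counter(term for doc in feedback_docs for term in doc)
    weighted = [(t, bo1_weight(f, coll_freq[t], num_docs)) for t, f in tf_x.items()]
    return sorted(weighted, key=lambda tw: tw[1], reverse=True)[:fb_terms]


# Two already-tokenised feedback documents (fb_docs=2 in this toy setting).
feedback = [["chemical", "reaction", "the"], ["chemical", "the", "the"]]
stats = {"chemical": 50, "reaction": 20, "the": 900}  # made-up collection frequencies
top = expansion_terms(feedback, stats, num_docs=1000, fb_terms=2)
```

Although "the" is the most frequent term in the feedback documents, its high collection frequency gives it the lowest weight, so it is not selected.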

class pyterrier.terrier.rewrite.KLQueryExpansion(*args, **kwargs)[source]

Applies the KL query expansion model from the Divergence from Randomness Framework, as provided by Terrier. This transformer must be followed by a terrier.Retriever() transformer. The original query is saved in the “query_0” column, which can be restored using pt.rewrite.reset().

Instance Attributes:
  • fb_terms(int): number of feedback terms. Defaults to 10

  • fb_docs(int): number of feedback documents. Defaults to 3

Parameters:
  • index_like – the Terrier index to use

  • fb_terms (int) – number of terms to add to the query. Terrier’s default setting is 10 expansion terms.

  • fb_docs (int) – number of feedback documents to consider. Terrier’s default setting is 3 feedback documents.

pyterrier.terrier.rewrite.reset()[source]

Undoes a previous query rewriting operation. This results in the query formulation stored in the “query_0” attribute being moved to the “query” attribute, and, if present, the “query_1” being moved to “query_0” and so on. This transformation is useful if you have rewritten the query for the purposes of one retrieval stage, but want a subsequent transformer to be applied to the original formulation.

Internally, this function applies pt.model.pop_queries().

Example:

firststage = pt.rewrite.SDM() >> pt.terrier.Retriever(index, wmodel="DPH")
secondstage = pyterrier_bert.cedr.CEDRPipeline()
fullranker = firststage >> pt.rewrite.reset() >> secondstage
Return type:

Transformer
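The column shuffling described for reset() can be sketched on a single row represented as a dictionary. This is only an illustration of the described behaviour, not the code behind pt.model.pop_queries(); the helper name pop_query is hypothetical, and it assumes the row has a "query_0" entry.

```python
import re
from typing import Dict


def pop_query(row: Dict[str, object]) -> Dict[str, object]:
    """Restore 'query' from 'query_0' and shift query_1 -> query_0, and so on."""
    # Start from every column that is not part of the query stack.
    out = {k: v for k, v in row.items()
           if k != "query" and not re.fullmatch(r"query_\d+", k)}
    out["query"] = row["query_0"]  # raises KeyError if nothing was rewritten
    for key, value in row.items():
        m = re.fullmatch(r"query_(\d+)", key)
        if m and int(m.group(1)) > 0:
            out[f"query_{int(m.group(1)) - 1}"] = value
    return out


row = {"qid": "q1", "query": "rewritten twice",
       "query_0": "rewritten once", "query_1": "original"}
restored = pop_query(row)
```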

pyterrier.terrier.rewrite.tokenise(tokeniser='english', matchop=False)[source]

Applies tokenisation to the query. Until PyTerrier 1.0, queries obtained from pt.get_dataset().get_topics() were generally tokenised.

Return type:

Transformer

Parameters:
  • tokeniser (Union[str,TerrierTokeniser,FunctionType]) – Defines what tokeniser should be used - either a Java tokeniser name in Terrier, a TerrierTokeniser instance, or a function that takes a str as input and returns a list of str.

  • matchop (bool) – Whether query terms should be wrapped in matchops, to ensure they can be parsed by a Terrier Retriever transformer.

Example - use default tokeniser:

pipe = pt.rewrite.tokenise() >> pt.terrier.Retriever(index)
pipe.search("Question with 'capitals' and other stuff?")

Example - roll your own tokeniser:

poortokenisation = pt.rewrite.tokenise(lambda query: query.split(" ")) >> pt.terrier.Retriever(index)

Example - for non-English languages, tokenise on standard UTF non-alphanumeric characters:

utftokenised = pt.rewrite.tokenise(pt.TerrierTokeniser.utf) >> pt.terrier.Retriever(index)
# equivalently, naming the tokeniser as a string
utftokenised = pt.rewrite.tokenise("utf") >> pt.terrier.Retriever(index)

Example - tokenising queries using a HuggingFace tokenizer

# this assumes the index was created in a pretokenised manner
from transformers import AutoTokenizer

br = pt.terrier.Retriever(indexref)
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
query_toks = pt.rewrite.tokenise(tok.tokenize, matchop=True)
retr_pipe = query_toks >> br
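The tokeniser parameter accepts any function that takes a str and returns a list of str. As a sketch of that contract, the hypothetical function below lower-cases the query and keeps runs of ASCII letters and digits; it is an example of a custom callable, not a reimplementation of any Terrier tokeniser.

```python
import re
from typing import List


def simple_tokeniser(query: str) -> List[str]:
    """Lower-case the query and keep runs of ASCII letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", query.lower())


tokens = simple_tokeniser("Question with 'capitals' and other stuff?")
```

A function like this could then be passed as the tokeniser argument in the same way as the lambda in the "roll your own tokeniser" example.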
pyterrier.terrier.rewrite.stash_results(clear=True)[source]

Stashes (saves) the current retrieved documents for each query into the column “stashed_results_0”, thereby converting a dataframe of retrieved documents into one of queries. The stashed documents can be restored later using pt.rewrite.reset_results().

Parameters:
  • clear (bool) – whether to drop the document and retrieved document related columns. Defaults to True.

Return type:

Transformer

pyterrier.terrier.rewrite.reset_results()[source]

Applies a transformer that undoes a pt.rewrite.stash_results() transformer, thereby restoring the ranked documents.

Return type:

Transformer
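The round trip between stash_results() and reset_results() can be pictured with plain dictionaries. The sketch below is an assumption about the general idea only: it stores each query's result rows as a list under "stashed_results_0" and expands them again. The helper names stash and restore are hypothetical, the stored layout in PyTerrier may differ, and the sketch always drops the document columns (as with clear=True).

```python
from typing import Dict, List


def stash(results: List[dict]) -> List[dict]:
    """Group result rows by qid and store them under 'stashed_results_0'."""
    by_qid: Dict[str, dict] = {}
    for row in results:
        entry = by_qid.setdefault(
            row["qid"],
            {"qid": row["qid"], "query": row["query"], "stashed_results_0": []},
        )
        entry["stashed_results_0"].append(
            {k: row[k] for k in ("docno", "rank", "score")})
    return list(by_qid.values())


def restore(queries: List[dict]) -> List[dict]:
    """Expand the stashed rows back into one row per (query, document)."""
    return [
        {"qid": q["qid"], "query": q["query"], **doc}
        for q in queries
        for doc in q["stashed_results_0"]
    ]


results = [
    {"qid": "q1", "query": "a", "docno": "d1", "rank": 0, "score": 2.0},
    {"qid": "q1", "query": "a", "docno": "d2", "rank": 1, "score": 1.0},
]
queries = stash(results)      # one row per query
round_trip = restore(queries)  # back to one row per document
```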

pyterrier.terrier.rewrite.linear(weightCurrent, weightPrevious, format='terrierql', **kwargs)[source]

Creates a linear combination of the current and previous query formulations. The implementation is tied to the underlying query language used by the retrieval/re-ranker transformers. Two of Terrier’s query language formats are supported by the format kwarg, namely “terrierql” and “matchopql”. Their respective formats are detailed in the Terrier documentation.

Return type:

Transformer

Parameters:
  • weightCurrent (float) – weight to apply to the current query formulation.

  • weightPrevious (float) – weight to apply to the previous query formulation.

  • format (str) – which query language to use to rewrite the queries, one of “terrierql” or “matchopql”.

Example:

pipeTQL = pt.apply.query(lambda row: "az") >> pt.rewrite.linear(0.75, 0.25, format="terrierql")
pipeMQL = pt.apply.query(lambda row: "az") >> pt.rewrite.linear(0.75, 0.25, format="matchopql")
pipeTQL.search("a")
pipeMQL.search("a")

Example outputs of pipeTQL and pipeMQL corresponding to the query “a” above:

  • Terrier QL output: “(az)^0.750000 (a)^0.250000”

  • MatchOp QL output: “#combine:0:0.750000:1:0.250000(#combine(az) #combine(a))”
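The two example outputs above follow a simple string pattern, which the sketch below reproduces with plain string formatting. The function names are hypothetical and the templates are inferred only from the example outputs shown here, so this should be read as an illustration of the output shape rather than PyTerrier's implementation.

```python
def linear_terrierql(current: str, previous: str, w_cur: float, w_prev: float) -> str:
    """Weight each formulation with the ^ operator."""
    return "(%s)^%f (%s)^%f" % (current, w_cur, previous, w_prev)


def linear_matchopql(current: str, previous: str, w_cur: float, w_prev: float) -> str:
    """Weight each formulation inside an outer #combine operator."""
    return "#combine:0:%f:1:%f(#combine(%s) #combine(%s))" % (
        w_cur, w_prev, current, previous)


# Current query "az", previous query "a", as in the example above.
tql = linear_terrierql("az", "a", 0.75, 0.25)
mql = linear_matchopql("az", "a", 0.75, 0.25)
```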

Loading

class pyterrier.terrier.TerrierTextLoader(index, fields='*', *, verbose=False)[source]

A transformer that loads textual metadata from a Terrier index into a DataFrame by docid or docno.

Initialise the transformer with the index to load metadata from.

Parameters:
  • index (pyterrier.terrier.J.Index) – The index to load metadata from.

  • fields (List[str] | str | Literal['*']) – The fields to load from the index. If ‘*’, all fields will be loaded.

  • verbose – Whether to print debug information.

transform(inp)[source]

Load metadata from the index into the input DataFrame.

Return type:

DataFrame

Parameters:

inp (DataFrame) – The input DataFrame. Must contain either ‘docid’ or ‘docno’.

Returns:

A new DataFrame with the metadata columns appended.
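Conceptually, the transform is a lookup of each row's document in a metadata store, with the requested fields appended to a copy of the row. The sketch below shows that idea with dictionaries keyed by docno. The names load_text and meta_store are hypothetical stand-ins, and nothing here reflects how the Terrier metadata structure is actually accessed.

```python
from typing import Dict, List


def load_text(rows: List[dict], meta: Dict[str, Dict[str, str]],
              fields: List[str]) -> List[dict]:
    """Return new rows with the requested metadata fields appended, keyed by docno."""
    return [
        {**row, **{field: meta[row["docno"]][field] for field in fields}}
        for row in rows
    ]


meta_store = {  # stand-in for the index's metadata structure
    "d1": {"docno": "d1", "text": "professor protor poured the chemicals"},
    "d2": {"docno": "d2", "text": "chemical brothers turned up the beats"},
}
results = [{"qid": "q1", "docno": "d2", "score": 1.0}]
with_text = load_text(results, meta_store, fields=["text"])
```

The input rows are left unmodified, matching the description of a new DataFrame being returned.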

Low-Level (Java) API

Some functions return Java object wrappers (e.g., TerrierIndex.index_obj()) that provide direct low-level API access to Terrier classes. You can find documentation for these classes in the Terrier Documentation.

Tip

Pyjnius Java object wrappers show which class they wrap in their string representation. For instance, str(index.index_obj()) = "<org.terrier.structures.Index at 0x10cd8ba60 ...>", showing that it wraps an instance of org.terrier.structures.Index.

class pyterrier.terrier.IndexFactory[source]

The of() method of this factory class loads a Terrier Index.

NB: This class “shades” the native Terrier IndexFactory class. It offers essentially the same API, except that the of() method accepts a memory kwarg, which can be used to load additional index data structures into memory.

Terrier data structures that can be loaded into memory:
  • ‘inverted’ - the inverted index, contains posting lists for each term. In the default configuration, this is read in from disk in chunks.

  • ‘lexicon’ - the dictionary. By default, a binary search of the on-disk structure is used, so loading into memory can enhance speed.

  • ‘meta’ - metadata about documents. Used as the final stage of retrieval, one seek for each retrieved document.

  • ‘direct’ - contains posting lists for each document. No speed advantage for loading into memory unless pseudo-relevance feedback is being used.

  • ‘document’ - contains document lengths, which are loaded into memory anyway. No speed advantage for loading into memory unless pseudo-relevance feedback is being used.

of(indexlike, memory=False)[source]

Loads an index. Returns a Terrier Index object.

Parameters:
  • indexlike – The location of the index. This can be a string, or an IndexRef object.

  • memory (bool | List[str]) – If the index should be loaded into memory. Use True for all structures, or a list of structure names.

Returns:

A Terrier Index object.
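The two accepted forms of the memory argument (a bool, or a list of structure names) can be illustrated with a small helper that resolves either form into a list of names. The helper structures_to_load is hypothetical and is not part of PyTerrier; in particular, the validation of unknown names is an assumption added for the sketch.

```python
from typing import List, Union

# Structure names as listed in the description of IndexFactory above.
STRUCTURES = ["inverted", "lexicon", "meta", "direct", "document"]


def structures_to_load(memory: Union[bool, List[str]]) -> List[str]:
    """Resolve the memory argument into a list of structure names."""
    if memory is True:
        return list(STRUCTURES)  # load every structure
    if memory is False:
        return []                # leave everything on disk
    unknown = [name for name in memory if name not in STRUCTURES]
    if unknown:
        raise ValueError(f"unknown index structures: {unknown}")
    return list(memory)


selected = structures_to_load(["lexicon", "meta"])
```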