Terrier How-To Guides

This page provides a set of how-to guides for common tasks when using Terrier with PyTerrier.

How do I index a standard corpus?

Indexing a standard corpus with Terrier

import pyterrier as pt
dataset = pt.datasets.get_dataset("irds:msmarco-passage") # [1]
my_index = pt.terrier.TerrierIndex('/path/to/index/location.terrier') # [2]
my_index.index(dataset.get_corpus_iter()) # [3]

[1] Select your dataset here. If the corpus is not available in PyTerrier datasets, see "How do I index a custom collection?" below.
[2] Specify the location where you want to store the Terrier index. The location must not yet exist. We recommend using the .terrier extension, though this is not required.
[3] This performs indexing with default settings. If you need more control over the indexing settings, see indexer() and IterDictIndexer for advanced options.

How do I index a custom collection?

Indexing a custom collection with Terrier

import pyterrier as pt
my_collection = [ # [1]
    {"docno": "doc1", "title": "This is the text of document one.", "body": "This is the body of document one."},
    {"docno": "doc2", "title": "This is the text of document two.", "body": "This is the body of document two."},
    {"docno": "doc3", "title": "This is the text of document three.", "body": "This is the body of document three."}
]
my_index = pt.terrier.TerrierIndex('/path/to/index/location.terrier') # [2]
indexer = my_index.indexer(fields=["title", "body"]) # [3]
indexer.index(my_collection)

[1] Each document should be a dictionary with docno (a unique identifier) and additional text fields. Your collection can be any iterable type (list, generator, etc.).
[2] Specify the location where you want to store the Terrier index. The location must not yet exist. We recommend using the .terrier extension, though this is not required.
[3] fields=... lets you specify which fields to index. The "text" field is the default.
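
Because the collection can be any iterable, a large collection does not need to be loaded into memory first. The following is a minimal sketch of a generator that streams documents from a tab-separated file. The function name iter_tsv_collection and the file layout (docno, title, body, one document per line, no header row) are assumptions made for illustration, not part of PyTerrier.

```python
from typing import Dict, Iterator


def iter_tsv_collection(path: str) -> Iterator[Dict[str, str]]:
    """Yield one document dict per non-empty line of a tab-separated file.

    Assumed (hypothetical) layout: docno <TAB> title <TAB> body, no header row.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Split on the first two tabs only, so tabs inside the body are kept.
            docno, title, body = line.split("\t", 2)
            yield {"docno": docno, "title": title, "body": body}
```

The generator could then be passed to indexer.index(...) in place of my_collection, for example indexer.index(iter_tsv_collection('/path/to/collection.tsv')).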

How do I index and retrieve languages other than English?

Terrier provides built-in support for several other languages (see list in TerrierStemmer). If your target language is supported, you just need to be sure to set appropriate tokenisation, stemming, and stopword removal options during indexing. Here is an example for German:

Indexing German text with Terrier

import pyterrier as pt
my_collection = [
    {"docno": "doc1", "text": "Dies ist der Text von Dokument eins."},
    {"docno": "doc2", "text": "Dies ist der Text von Dokument zwei."},
    {"docno": "doc3", "text": "Dies ist der Text von Dokument drei."}
]
my_index = pt.terrier.TerrierIndex('/pfad/zum/indexort.terrier')

# Indexing
indexer = my_index.indexer(
    tokeniser=pt.terrier.TerrierTokeniser.utf, # [1]
    stopwords=pt.terrier.TerrierStopwords.none,
    stemmer=pt.terrier.TerrierStemmer.german, # [2]
)
indexer.index(my_collection)

# Retrieval
retriever = my_index.bm25()
retriever.search('Dokumente')

[1] Be sure to specify pyterrier.terrier.TerrierTokeniser.utf and pyterrier.terrier.TerrierStopwords.none for non-English text – the default English settings do not work well for other languages.
[2] Specify the appropriate stemmer for your target language.

If your target language does not have built-in support, you can apply custom pre-processing steps in the pipeline. Here is an example using spaCy for Czech:

Indexing Czech text with Terrier

import spacy
import pyterrier as pt

nlp = spacy.blank("cs")
def cs_preprocess(text): # [1]
    doc = nlp(text)
    toks = [str(token) for token in doc if not token.is_stop]
    return ' '.join(toks) # combine toks back into a string

my_collection = [
    {"docno": "doc1", "text": "Toto je text prvního dokumentu."},
    {"docno": "doc2", "text": "Toto je text druhého dokumentu."},
    {"docno": "doc3", "text": "Toto je text třetího dokumentu."}
]
my_index = pt.terrier.TerrierIndex('/cesta/k/indexu/umístění.terrier')

# Indexing
indexer = my_index.indexer(
    tokeniser=pt.terrier.TerrierTokeniser.utf,
    stopwords=pt.terrier.TerrierStopwords.none, # [2]
    stemmer=pt.terrier.TerrierStemmer.none,
)
indexer_pipeline = pt.apply.text(lambda d: cs_preprocess(d['text'])) >> indexer # [3]
indexer_pipeline.index(my_collection)

# Retrieval
retriever = my_index.bm25()
retriever_pipeline = pt.apply.query(lambda d: cs_preprocess(d['query'])) >> retriever # [3]
retriever_pipeline.search('dokumentu')

[1] Here we define a function that performs the necessary pre-processing steps (in this case, Czech tokenisation and stopword removal).
[2] Since we are applying custom pre-processing, we disable stopword removal and stemming in Terrier by setting them to pyterrier.terrier.TerrierStopwords.none and pyterrier.terrier.TerrierStemmer.none.
[3] Include the pre-processing steps as stages of both the indexing and retrieval pipelines, so that documents and queries are processed identically.
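
If spaCy is not available, the pre-processing function can be any callable that maps a string to a string. The following is a minimal sketch using only the standard library. The name simple_preprocess and the stopword set are hypothetical; the set is a deliberately tiny sample, and a real list would be much longer.

```python
import re

# Hypothetical, deliberately tiny stopword sample for illustration only.
CS_STOPWORDS = {"a", "je", "na", "se", "toto", "v", "že"}

# Matches runs of Unicode word characters (letters, digits, underscore).
_TOKEN_RE = re.compile(r"\w+")


def simple_preprocess(text: str) -> str:
    """Lowercase, tokenise on word characters, drop stopwords, re-join with spaces."""
    tokens = _TOKEN_RE.findall(text.lower())
    return " ".join(tok for tok in tokens if tok not in CS_STOPWORDS)
```

Such a function could stand in for cs_preprocess in both pipelines above. Whatever function is used, it should be the same one at indexing and retrieval time, otherwise query terms may not match the indexed terms.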

How do I loop over all documents in an index?

TerrierIndex.get_corpus_iter() provides an iterator over all documents in a Terrier index.

Looping over all documents in a Terrier index

import pyterrier as pt
index = pt.terrier.TerrierIndex('/path/to/index/location.terrier')
for doc in index.get_corpus_iter(): # [1]
    print(doc)
    # do something with doc

[1] This creates an iterator over all documents in the specified Terrier index.
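
For a large index, printing every document is rarely practical. The following is a small sketch of a helper that inspects only the first few items of any iterator. The name peek is hypothetical, and the fields present in each record depend on how the index was built.

```python
from itertools import islice


def peek(corpus_iter, n=3):
    """Return the first n items of an iterator, leaving the remaining items unread."""
    return list(islice(corpus_iter, n))
```

For example, peek(index.get_corpus_iter(), 5) would return at most five documents as a list.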

How do I access the terms in an index?

TerrierIndex.lexicon() provides access to the Lexicon of a Terrier index.

Accessing the Lexicon of a Terrier index

import pyterrier as pt
index = pt.terrier.TerrierIndex('/path/to/index/location.terrier')
lexicon = index.lexicon()
for term, entry in lexicon: # [1]
    print(term)
    print(entry.getDocumentFrequency()) # [2]

print("frequency for 'chemic':", lexicon["chemic"].getDocumentFrequency()) # [3]

[1] You can iterate over all terms in the Lexicon.
[2] Lexicon provides low-level API access through Java bindings. getDocumentFrequency() is defined in the Java LexiconEntry class.
[3] You can also access statistics for a specific term.
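
A common use of lexicon iteration is finding the most widespread terms. The following is a minimal sketch, assuming only what the loop above shows: an iterable of (term, entry) pairs whose entries expose getDocumentFrequency(). The function name top_terms_by_df is hypothetical.

```python
import heapq


def top_terms_by_df(lexicon_items, k=10):
    """Return the k (term, document frequency) pairs with the highest document frequency.

    lexicon_items: any iterable of (term, entry) pairs whose entries expose
    getDocumentFrequency().
    """
    pairs = ((term, entry.getDocumentFrequency()) for term, entry in lexicon_items)
    # nlargest keeps only k candidates in memory, which matters for large lexicons.
    return heapq.nlargest(k, pairs, key=lambda pair: pair[1])
```

With the lexicon object from the example above, top_terms_by_df(lexicon, k=20) would list the 20 terms that occur in the most documents.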

LexiconEntry objects provide various statistics about terms in the index, including the number of documents a term occurs in (getDocumentFrequency()) and the total number of times it occurs in the collection (getFrequency()). You can use these to derive further statistics, such as the (un-smoothed) probability of a term occurring in the collection, computed below:

Computing term probabilities from a Terrier Lexicon

term = 'chemic'
lexicon = index.lexicon()
collection_stats = index.collection_statistics()
if term in lexicon:
    prob = lexicon[term].getFrequency() / collection_stats.getNumberOfTokens()
else:
    prob = 0.0
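
Written as a formula, the snippet above computes the maximum-likelihood (relative frequency) estimate of a term's collection probability. The symbols cf(t) and |C| are notation introduced here for illustration: cf(t) corresponds to getFrequency() for term t, and |C| to getNumberOfTokens().

```latex
P(t \mid C) =
\begin{cases}
  \dfrac{\mathrm{cf}(t)}{|C|} & \text{if } t \text{ is in the lexicon} \\[1ex]
  0 & \text{otherwise}
\end{cases}
```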

How do I manually traverse the postings of an index?

Traversing postings lists in a Terrier index

import pyterrier as pt
index = pt.terrier.TerrierIndex('/path/to/index/location.terrier')
term = 'chemic'
meta = index.meta_index()
inv = index.inverted_index()
lexicon = index.lexicon()

for posting in inv.getPostings(lexicon[term]): # [1]
    docno = meta.getItem("docno", posting.getId()) # [2]
    print(f"{docno} has a frequency of {posting.getFrequency()}")

[1] Look up the posting list using the pointer from the lexicon entry.
[2] Here we load the docno (document identifier) from the meta index.

How do I look up the terms that occur in a document?

Accessing terms in a document from a Terrier index

import pyterrier as pt
index = pt.terrier.TerrierIndex('/path/to/index/location.terrier')
docid = 10 # [1]
di = index.direct_index()
doi = index.document_index()
lexicon = index.lexicon()

for posting in di.getPostings(doi.getDocumentEntry(docid)):
    termid = posting.getId()
    lee = lexicon.getLexiconEntry(termid)
    print(f"{lee.getKey()} with frequency {posting.getFrequency()}")

[1] Document IDs are zero-based, so this will return the 11th document in the index.
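
Conceptually, the direct index holds the same postings as the inverted index, keyed by document instead of by term. The following toy sketch illustrates that relationship with plain dictionaries. It is not how Terrier stores its data; the data and the function name invert are hypothetical.

```python
from collections import defaultdict

# Toy inverted index: term -> {docid: term frequency}
inverted = {
    "chemic": {0: 2, 2: 1},
    "reaction": {0: 1, 1: 3},
}


def invert(index):
    """Swap the two key levels of a nested dict (outer -> inner becomes inner -> outer)."""
    flipped = defaultdict(dict)
    for outer, inner_map in index.items():
        for inner, freq in inner_map.items():
            flipped[inner][outer] = freq
    return dict(flipped)


# Toy direct index: docid -> {term: term frequency}
direct = invert(inverted)
```

Here direct[0] lists every term of document 0 with its frequency, which is the same question the direct-index loop above answers for a real index.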

How do I manually compute the scores for a weighting model?

Manually computing weighting model scores using Terrier

import pyterrier as pt
index = pt.terrier.TerrierIndex('/path/to/index/location.terrier')
term = "chemic"
inv = index.inverted_index()
meta = index.meta_index()
lex = index.lexicon()
le = lex.getLexiconEntry(term)
wmodel = pt.autoclass("org.terrier.matching.models.PL2")() # [1]
wmodel.setCollectionStatistics(index.collection_statistics()) # [2]
wmodel.setEntryStatistics(le)
wmodel.setKeyFrequency(1)
wmodel.prepare()
for posting in inv.getPostings(le):
    docno = meta.getItem("docno", posting.getId())
    score = wmodel.score(posting)
    print(f"{docno} with score {score:0.4f}")

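[1] Here we use the Java class name for the PL2 weighting model. You can replace this with any other Terrier weighting model class.
[2] The weighting model requires some setup before it can be used: it needs the collection statistics, the lexicon entry statistics for the term, and the query term frequency, followed by a call to prepare().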

Note that this is less efficient than using the built-in retriever transformers such as bm25() or pl2().
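
The example above scores a single term. For a query with several terms, one common approach is to repeat the loop per term and sum the per-term scores for each document. The following toy sketch shows that accumulation with plain Python data. The postings, the function names, and the simple TF-IDF weight are all hypothetical stand-ins for illustration; this is not Terrier's PL2 or its matching implementation.

```python
import math
from collections import defaultdict

# Toy data (hypothetical): term -> list of (docno, term frequency) postings
toy_postings = {
    "chemic": [("doc1", 1), ("doc3", 1)],
    "reaction": [("doc1", 1), ("doc2", 3)],
}


def tf_idf(tf, df, n_docs):
    """Simple TF-IDF weight; a stand-in for a real weighting model such as PL2."""
    return tf * math.log(n_docs / df)


def score_query(terms, postings, n_docs):
    """Sum per-term weights into one score per document, highest score first."""
    scores = defaultdict(float)
    for term in terms:
        plist = postings.get(term, [])  # unknown terms contribute nothing
        for docno, tf in plist:
            scores[docno] += tf_idf(tf, len(plist), n_docs)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = score_query(["chemic", "reaction"], toy_postings, n_docs=3)
```

In this toy data, doc1 receives a contribution from both terms, while doc2 and doc3 each receive a contribution from one term only.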