Terrier How-To Guides
This page provides a set of how-to guides for common tasks when using Terrier with PyTerrier.
How do I index a standard corpus?
import pyterrier as pt
dataset = pt.datasets.get_dataset("irds:msmarco-passage") # [1]
my_index = pt.terrier.TerrierIndex('/path/to/index/location.terrier') # [2]
my_index.index(dataset.get_corpus_iter()) # [3]
[1] Select your dataset here. If the corpus is not available in PyTerrier datasets, see "How do I index a custom collection?" below.
[2] Specify the location where you want to store the Terrier index. The location must not yet exist. We recommend using the .terrier extension, though this is not required.
[3] This performs indexing with default settings. If you need more control over the indexing settings, see indexer() and IterDictIndexer for advanced options.
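Because the index location must not yet exist, it can be useful to check the path before starting a long indexing run. The following is a minimal sketch using only the Python standard library; the helper name `assert_fresh_index_location` is hypothetical and is not part of PyTerrier.

```python
from pathlib import Path


def assert_fresh_index_location(location):
    """Return the location as a Path, or raise if something already exists there.

    Hypothetical helper for illustration; not a PyTerrier function.
    """
    path = Path(location)
    if path.exists():
        raise FileExistsError(f"Index location already exists: {path}")
    return path
```

The returned path could then be converted with `str(...)` and passed to `pt.terrier.TerrierIndex(...)` as in the example above.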
How do I index a custom collection?
import pyterrier as pt
my_collection = [ # [1]
    {"docno": "doc1", "title": "This is the text of document one.", "body": "This is the body of document one."},
    {"docno": "doc2", "title": "This is the text of document two.", "body": "This is the body of document two."},
    {"docno": "doc3", "title": "This is the text of document three.", "body": "This is the body of document three."}
]
my_index = pt.terrier.TerrierIndex('/path/to/index/location.terrier') # [2]
indexer = my_index.indexer(fields=["title", "body"]) # [3]
indexer.index(my_collection)
[1] Each document should be a dictionary with docno (a unique identifier) and additional text fields. Your collection can be any iterable type (list, generator, etc.).
[2] Specify the location where you want to store the Terrier index. The location must not yet exist. We recommend using the .terrier extension, though this is not required.
[3] fields=... lets you specify which fields to index. The "text" field is the default.
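Since the collection can be any iterable, a generator avoids holding every document in memory. The sketch below assumes a hypothetical tab-separated input format (docno, title, body per line); the function name `iter_tsv_documents` and the format are illustrative assumptions, not something PyTerrier prescribes.

```python
def iter_tsv_documents(lines):
    """Yield one document dict per non-empty line of 'docno<TAB>title<TAB>body'."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        docno, title, body = line.split("\t", 2)
        yield {"docno": docno, "title": title, "body": body}


# Toy input standing in for the lines of an open file (assumed format).
sample_lines = [
    "doc1\tFirst title\tThis is the body of document one.\n",
    "\n",
    "doc2\tSecond title\tThis is the body of document two.\n",
]
docs = list(iter_tsv_documents(sample_lines))
```

In practice the generator itself, rather than a materialised list, could be passed to `indexer.index(...)` in place of `my_collection`.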
How do I index and retrieve languages other than English?
Terrier provides built-in support for several other languages (see list in TerrierStemmer).
If your target language is supported, you just need to be sure to set appropriate tokenisation,
stemming, and stopword removal options during indexing. Here is an example for German:
import pyterrier as pt
my_collection = [
    {"docno": "doc1", "text": "Dies ist der Text von Dokument eins."},
    {"docno": "doc2", "text": "Dies ist der Text von Dokument zwei."},
    {"docno": "doc3", "text": "Dies ist der Text von Dokument drei."}
]
my_index = pt.terrier.TerrierIndex('/pfad/zum/indexort.terrier')
# Indexing
indexer = my_index.indexer(
    tokeniser=pt.terrier.TerrierTokeniser.utf, # [1]
    stopwords=pt.terrier.TerrierStopwords.none,
    stemmer=pt.terrier.TerrierStemmer.german, # [2]
)
indexer.index(my_collection)
# Retrieval
retriever = my_index.bm25()
retriever.search('Dokumente')
[1] Be sure to specify pyterrier.terrier.TerrierTokeniser.utf and pyterrier.terrier.TerrierStopwords.none for non-English text, because the default English settings do not work well for other languages.
[2] Specify the appropriate stemmer for your target language.
If your target language does not have built-in support, you can apply custom pre-processing steps in the pipeline. Here is an example using spaCy for Czech:
import spacy
import pyterrier as pt
nlp = spacy.blank("cs")
def cs_preprocess(text): # [1]
    doc = nlp(text)
    toks = [str(token) for token in doc if not token.is_stop]
    return ' '.join(toks) # combine toks back into a string
my_collection = [
    {"docno": "doc1", "text": "Toto je text prvního dokumentu."},
    {"docno": "doc2", "text": "Toto je text druhého dokumentu."},
    {"docno": "doc3", "text": "Toto je text třetího dokumentu."}
]
my_index = pt.terrier.TerrierIndex('/cesta/k/indexu/umístění.terrier')
# Indexing
indexer = my_index.indexer(
    tokeniser=pt.terrier.TerrierTokeniser.utf,
    stopwords=pt.terrier.TerrierStopwords.none, # [2]
    stemmer=pt.terrier.TerrierStemmer.none,
)
indexer_pipeline = pt.apply.text(lambda d: cs_preprocess(d['text'])) >> indexer # [3]
indexer_pipeline.index(my_collection)
# Retrieval
retriever = my_index.bm25()
retriever_pipeline = pt.apply.query(lambda d: cs_preprocess(d['query'])) >> retriever # [3]
retriever_pipeline.search('dokumentu')
[1] Here we define a function that performs the necessary pre-processing steps (in this case, Czech tokenisation and stopword removal).
[2] Since we are applying custom pre-processing, we disable stopword removal and stemming in Terrier by setting them to pyterrier.terrier.TerrierStopwords.none and pyterrier.terrier.TerrierStemmer.none.
[3] Include the pre-processing steps as stages of both the indexing and retrieval pipelines, so that documents and queries are processed in the same way.
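The pre-processing function only needs to map a string to a string, so it does not have to rely on spaCy. As a dependency-free sketch, the function below lowercases the text, tokenises on word characters, and drops stopwords. The two-word stopword set is a toy assumption for illustration, not a real Czech stopword list, and the name `simple_preprocess` is hypothetical.

```python
import re

# Toy stopword set for illustration only; a real list would be much longer.
TOY_STOPWORDS = {"toto", "je"}


def simple_preprocess(text):
    """Lowercase, split on word characters, and remove toy stopwords."""
    tokens = re.findall(r"\w+", text.lower())  # \w is Unicode-aware for str
    kept = [tok for tok in tokens if tok not in TOY_STOPWORDS]
    return " ".join(kept)  # combine tokens back into a string


processed = simple_preprocess("Toto je text prvního dokumentu.")
```

Such a function could take the place of `cs_preprocess` in both pipelines, provided the same function is applied to documents and queries.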
How do I loop over all documents in an index?
TerrierIndex.get_corpus_iter() provides an iterator over all documents in a Terrier index.
import pyterrier as pt
index = pt.terrier.TerrierIndex('/path/to/index/location.terrier')
for doc in index.get_corpus_iter(): # [1]
    print(doc)
    # do something with doc
[1] This creates an iterator over all documents in the specified Terrier index.
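For a large index, printing every document is rarely practical. A small sketch for inspecting only the first few documents of any document iterator follows; it assumes the iterator yields dictionaries like those in the collections above, and the helper name `peek_documents` is hypothetical. A toy generator stands in for the index here.

```python
from itertools import islice


def peek_documents(doc_iter, n=3):
    """Return up to the first n documents from any iterable of documents."""
    return list(islice(doc_iter, n))


# Toy generator standing in for index.get_corpus_iter().
toy_corpus = ({"docno": f"doc{i}", "text": f"text {i}"} for i in range(1, 6))
first_two = peek_documents(toy_corpus, n=2)
```

Because `islice` stops pulling from the iterator after `n` items, the rest of the corpus is never read.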
How do I access the terms in an index?
TerrierIndex.lexicon() provides access to the Lexicon of a Terrier index.
import pyterrier as pt
index = pt.terrier.TerrierIndex('/path/to/index/location.terrier')
lexicon = index.lexicon()
for term, entry in lexicon: # [1]
    print(term)
    print(entry.getDocumentFrequency()) # [2]
print("frequency for 'chemic':", lexicon["chemic"].getDocumentFrequency()) # [3]
[1] You can iterate over all terms in the Lexicon.
[2] Lexicon provides low-level API access through Java bindings. getDocumentFrequency() is defined in the Java LexiconEntry class.
[3] You can also access statistics for a specific term.
LexiconEntry objects provide various statistics about terms in the index, including the number of documents the term occurs in (getDocumentFrequency()) and the total number of times the term occurs in the collection (getFrequency()). You can combine these with the collection statistics to derive other quantities. For example, the code below computes the (un-smoothed) probability of a term occurring in the collection:
term = 'chemic'
lexicon = index.lexicon()
collection_stats = index.collection_statistics()
if term in lexicon:
    prob = lexicon[term].getFrequency() / collection_stats.getNumberOfTokens()
else:
    prob = 0.0
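To make the difference between the two frequency statistics concrete, here is a pure-Python sketch over a toy, already-tokenised collection. It does not use Terrier at all; the dictionaries only mimic what getFrequency(), getDocumentFrequency(), and getNumberOfTokens() report, and all names in it are illustrative assumptions.

```python
from collections import Counter

# Toy collection: docno -> list of (already stemmed) tokens.
toy_docs = {
    "doc1": ["chemic", "reaction", "chemic"],
    "doc2": ["chemic", "bond"],
    "doc3": ["physic", "law"],
}

collection_freq = Counter()  # total occurrences, analogous to getFrequency()
document_freq = Counter()    # documents containing the term, analogous to getDocumentFrequency()
for tokens in toy_docs.values():
    collection_freq.update(tokens)
    document_freq.update(set(tokens))

num_tokens = sum(collection_freq.values())  # analogous to getNumberOfTokens()


def term_probability(term):
    """Un-smoothed probability of a term occurring in the toy collection."""
    if term in collection_freq:
        return collection_freq[term] / num_tokens
    return 0.0
```

Here "chemic" occurs three times across two documents in a collection of seven tokens, so its collection frequency is 3, its document frequency is 2, and its un-smoothed probability is 3/7.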
How do I manually traverse the postings of an index?
term = 'chemic'
meta = index.meta_index()
inv = index.inverted_index()
lexicon = index.lexicon()
for posting in inv.getPostings(lexicon[term]): # [1]
    docno = meta.getItem("docno", posting.getId()) # [2]
    print(f"{docno} has a frequency of {posting.getFrequency()}")
[1] Look up the posting list using the pointer from the lexicon entry.
[2] Here we load the docno (document identifier) from the meta index.
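The traversal above fails if the term is not in the lexicon. One way to guard against that is to wrap the loop in a helper that first checks membership, using the same `term in lexicon` test as the probability example earlier on this page. This is a sketch built only from the calls shown in this guide; the function name `postings_for_term` is hypothetical.

```python
def postings_for_term(index, term):
    """Return (docno, frequency) pairs for a term, or an empty list if it is absent."""
    lexicon = index.lexicon()
    if term not in lexicon:
        return []
    meta = index.meta_index()
    inv = index.inverted_index()
    return [
        (meta.getItem("docno", posting.getId()), posting.getFrequency())
        for posting in inv.getPostings(lexicon[term])
    ]
```

Calling `postings_for_term(index, 'chemic')` would then return the same information that the loop above prints, as a list of tuples.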
How do I look up the terms that occur in a document?
docid = 10 # [1]
di = index.direct_index()
doi = index.document_index()
lexicon = index.lexicon()
for posting in di.getPostings(doi.getDocumentEntry(docid)):
    termid = posting.getId()
    lee = lexicon.getLexiconEntry(termid)
    print(f"{lee.getKey()} with frequency {posting.getFrequency()}")
[1] Document IDs are zero-based, so docid = 10 refers to the 11th document in the index.
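The same loop can be turned into a helper that returns the most frequent terms of a document instead of printing them. This is a sketch that relies only on the calls used in the example above; the function name `top_terms_in_document` and the tie-breaking rule are illustrative choices.

```python
def top_terms_in_document(index, docid, k=5):
    """Return up to k (term, frequency) pairs for a document, most frequent first."""
    di = index.direct_index()
    doi = index.document_index()
    lexicon = index.lexicon()
    term_freqs = []
    for posting in di.getPostings(doi.getDocumentEntry(docid)):
        entry = lexicon.getLexiconEntry(posting.getId())
        term_freqs.append((entry.getKey(), posting.getFrequency()))
    # Sort by descending frequency, breaking ties alphabetically.
    term_freqs.sort(key=lambda pair: (-pair[1], pair[0]))
    return term_freqs[:k]
```

The term and frequency are read from each posting inside the loop, before moving on to the next one, so nothing depends on a posting object remaining valid after iteration advances.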
How do I manually compute the scores for a weighting model?
term = "chemic"
inv = index.inverted_index()
meta = index.meta_index()
lex = index.lexicon()
le = lex.getLexiconEntry(term)
wmodel = pt.autoclass("org.terrier.matching.models.PL2")() # [1]
wmodel.setCollectionStatistics(index.collection_statistics()) # [2]
wmodel.setEntryStatistics(le)
wmodel.setKeyFrequency(1)
wmodel.prepare()
for posting in inv.getPostings(le):
    docno = meta.getItem("docno", posting.getId())
    score = wmodel.score(posting)
    print(f"{docno} with score {score:0.4f}")
[1] Here we use the Java class name for the PL2 weighting model. You can replace this with any other Terrier weighting model class.
[2] The weighting model requires some setup before it can be used.
Note that this is less efficient than using the built-in retriever transformers such as bm25() or pl2().
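If only the highest-scoring documents are of interest, the scoring loop can be combined with `heapq.nlargest` from the standard library. The sketch below reuses the setup calls from the example above and takes an already instantiated weighting model as an argument. The function name `top_k_scored` is hypothetical, and it assumes that `getLexiconEntry` returns `None` for a term that is not in the index.

```python
import heapq


def top_k_scored(index, wmodel, term, k=10):
    """Score every posting of `term` and return the k best (score, docno) pairs."""
    le = index.lexicon().getLexiconEntry(term)
    if le is None:  # assumption: unknown terms yield None
        return []
    # Same setup steps as in the example above.
    wmodel.setCollectionStatistics(index.collection_statistics())
    wmodel.setEntryStatistics(le)
    wmodel.setKeyFrequency(1)
    wmodel.prepare()
    meta = index.meta_index()
    inv = index.inverted_index()
    # Score and docno are computed per posting, before the iterator advances.
    scored = (
        (wmodel.score(posting), meta.getItem("docno", posting.getId()))
        for posting in inv.getPostings(le)
    )
    return heapq.nlargest(k, scored)
```

This remains a single-term illustration; for real retrieval the built-in retriever transformers are still the more efficient choice.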