Terrier How-To Guides
============================================================
This page provides a set of how-to guides for common tasks when using Terrier with PyTerrier.
.. how-to:: How do I index a standard corpus?
.. _terrier:how-to:index-standard:
.. related:: pyterrier.terrier.TerrierIndex.index
.. code-block:: python
:caption: Indexing a standard corpus with Terrier
import pyterrier as pt
dataset = pt.datasets.get_dataset("irds:msmarco-passage") # :footnote: Select your dataset here. If the corpus is not available in PyTerrier datasets, see :ref:`terrier:how-to:index-custom`
my_index = pt.terrier.TerrierIndex('/path/to/index/location.terrier') # :footnote: Specify the location where you want to store the Terrier index. The location must not yet exist. We recommend using the ``.terrier`` extension, though this is not required.
my_index.index(dataset.get_corpus_iter()) # :footnote: This performs indexing with default settings. If you need more control over the indexing settings, see :meth:`~pyterrier.terrier.TerrierIndex.indexer` and :class:`~pyterrier.terrier.IterDictIndexer` for advanced options.
.. how-to:: How do I index a custom collection?
.. _terrier:how-to:index-custom:
.. related:: pyterrier.terrier.TerrierIndex.indexer
.. code-block:: python
:caption: Indexing a custom collection with Terrier
import pyterrier as pt
my_collection = [ # :footnote: Each document should be a dictionary with ``docno`` (a unique identifier) and additional text fields. Your collection can be any iterable type (list, generator, etc.).
{"docno": "doc1", "title": "This is the title of document one.", "body": "This is the body of document one."},
{"docno": "doc2", "title": "This is the title of document two.", "body": "This is the body of document two."},
{"docno": "doc3", "title": "This is the title of document three.", "body": "This is the body of document three."}
]
my_index = pt.terrier.TerrierIndex('/path/to/index/location.terrier') # :footnote: Specify the location where you want to store the Terrier index. The location must not yet exist. We recommend using the ``.terrier`` extension, though this is not required.
indexer = my_index.indexer(fields=["title", "body"]) # :footnote: ``fields=...`` lets you specify which fields to index. The ``"text"`` field is the default.
indexer.index(my_collection)
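
Because the collection can be any iterable, a generator lets you stream documents from disk instead of building a list in memory. The following is a minimal sketch, assuming a hypothetical JSON Lines source whose records have ``id``, ``title``, and ``body`` keys; ``iter_docs`` is an illustrative helper, not part of PyTerrier.

```python
import json
from typing import Iterable, Iterator


def iter_docs(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one indexable dict per non-empty JSON line (hypothetical record layout)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield {
            "docno": str(record["id"]),  # converted to a string identifier here
            "title": record.get("title", ""),
            "body": record.get("body", ""),
        }


# Stand-in for an open file handle; any iterable of strings works.
sample_lines = [
    '{"id": 1, "title": "First title", "body": "First body"}',
    '',
    '{"id": 2, "title": "Second title", "body": "Second body"}',
]
docs = list(iter_docs(sample_lines))
```

In practice you would pass the generator itself to ``indexer.index(...)`` rather than materialising it with ``list``, so that documents are consumed one at a time.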
.. how-to:: How do I index and retrieve languages other than English?
.. _terrier:how-to:langs:
.. related:: pyterrier.terrier.TerrierTokeniser
.. related:: pyterrier.terrier.TerrierStopwords
.. related:: pyterrier.terrier.TerrierStemmer
Terrier provides built-in support for several languages other than English (see the list in :class:`~pyterrier.terrier.TerrierStemmer`).
If your target language is supported, set the appropriate tokenisation,
stemming, and stopword removal options during indexing. Here is an example for German:
.. code-block:: python
:caption: Indexing German text with Terrier
import pyterrier as pt
my_collection = [
{"docno": "doc1", "text": "Dies ist der Text von Dokument eins."},
{"docno": "doc2", "text": "Dies ist der Text von Dokument zwei."},
{"docno": "doc3", "text": "Dies ist der Text von Dokument drei."}
]
my_index = pt.terrier.TerrierIndex('/pfad/zum/indexort.terrier')
# Indexing
indexer = my_index.indexer(
tokeniser=pt.terrier.TerrierTokeniser.utf, # :footnote: Be sure to specify :attr:`pyterrier.terrier.TerrierTokeniser.utf` and :attr:`pyterrier.terrier.TerrierStopwords.none` for non-English text -- the default English settings do not work well for other languages.
stopwords=pt.terrier.TerrierStopwords.none,
stemmer=pt.terrier.TerrierStemmer.german, # :footnote: Specify the appropriate stemmer for your target language.
)
indexer.index(my_collection)
# Retrieval
retriever = my_index.bm25()
retriever.search('Dokumente')
If your target language does not have built-in support, you can apply custom pre-processing
steps in the pipeline. Here is an example using spaCy for Czech:
.. code-block:: python
:caption: Indexing Czech text with Terrier
import spacy
import pyterrier as pt
nlp = spacy.blank("cs")
def cs_preprocess(text): # :footnote: Here we define a function that performs the necessary pre-processing steps (in this case, Czech tokenization and stopword removal).
    doc = nlp(text)
    toks = [str(token) for token in doc if not token.is_stop]
    return ' '.join(toks) # combine toks back into a string
my_collection = [
{"docno": "doc1", "text": "Toto je text prvního dokumentu."},
{"docno": "doc2", "text": "Toto je text druhého dokumentu."},
{"docno": "doc3", "text": "Toto je text třetího dokumentu."}
]
my_index = pt.terrier.TerrierIndex('/cesta/k/indexu/umístění.terrier')
# Indexing
indexer = my_index.indexer(
tokeniser=pt.terrier.TerrierTokeniser.utf,
stopwords=pt.terrier.TerrierStopwords.none, # :footnote: Since we are applying custom pre-processing, we disable stopword removal and stemming in Terrier by setting them to :attr:`pyterrier.terrier.TerrierStopwords.none` and :attr:`pyterrier.terrier.TerrierStemmer.none`.
stemmer=pt.terrier.TerrierStemmer.none,
)
indexer_pipeline = pt.apply.text(lambda d: cs_preprocess(d['text'])) >> indexer
indexer_pipeline.index(my_collection)
# Retrieval
retriever = my_index.bm25()
retriever_pipeline = pt.apply.query(lambda d: cs_preprocess(d['query'])) >> retriever # :footnote: Include the pre-processing steps as stages of the retrieval and indexing pipelines.
retriever_pipeline.search('dokumentu')
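
The pre-processing function does not have to rely on spaCy: any callable that maps a string to a string can be used in the same position. As a minimal sketch, the function below lowercases text, keeps only word characters, and drops a tiny hand-picked stopword set. Both ``simple_preprocess`` and ``CS_STOPWORDS`` are hypothetical names introduced for illustration, and the stopword set is far from complete.

```python
import re

# Illustrative only: a real stopword list would be much longer.
CS_STOPWORDS = {"toto", "je", "a", "v", "na"}


def simple_preprocess(text: str) -> str:
    """Lowercase, split into word tokens, and remove the stopwords above."""
    tokens = re.findall(r"\w+", text.lower())  # \w matches Unicode letters in Python 3
    kept = [tok for tok in tokens if tok not in CS_STOPWORDS]
    return " ".join(kept)  # combine tokens back into a string


processed = simple_preprocess("Toto je text prvního dokumentu.")
```

Such a function would take the place of ``cs_preprocess`` in both the indexing and the retrieval pipeline, so that documents and queries are normalised identically.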
.. how-to:: How do I loop over all documents in an index?
.. _terrier:how-to:loop-docs:
.. related:: pyterrier.terrier.TerrierIndex.get_corpus_iter
:meth:`TerrierIndex.get_corpus_iter() <pyterrier.terrier.TerrierIndex.get_corpus_iter>` provides an iterator over all documents in a Terrier index.
.. code-block:: python
:caption: Looping over all documents in a Terrier index
import pyterrier as pt
index = pt.terrier.TerrierIndex('/path/to/index/location.terrier')
for doc in index.get_corpus_iter(): # :footnote: This creates an iterator over all documents in the specified Terrier index.
    print(doc)
    # do something with doc
.. how-to:: How do I access the terms in an index?
.. _terrier:how-to:access-lexicon:
.. related:: pyterrier.terrier.TerrierIndex.lexicon
:meth:`TerrierIndex.lexicon() <pyterrier.terrier.TerrierIndex.lexicon>` provides access to the Lexicon of a Terrier index.
.. code-block:: python
:caption: Accessing the Lexicon of a Terrier index
import pyterrier as pt
index = pt.terrier.TerrierIndex('/path/to/index/location.terrier')
lexicon = index.lexicon()
for term, entry in lexicon: # :footnote: You can iterate over all terms in the Lexicon
    print(term)
    print(entry.getDocumentFrequency()) # :footnote: Lexicon provides low-level API access through Java bindings. ``getDocumentFrequency()`` is defined in the Java ``LexiconEntry`` class.
print("frequency for 'chemic':", lexicon["chemic"].getDocumentFrequency()) # :footnote: You can also access statistics for a specific term
``LexiconEntry`` objects provide statistics about each term in the index, including
the number of documents the term occurs in (``getDocumentFrequency()``) and
the total number of times the term occurs in the collection (``getFrequency()``). You can combine these with
collection-level statistics to derive other quantities. For example, the code below computes the (un-smoothed)
probability of a term occurring in the collection:
.. code-block:: python
:caption: Computing term probabilities from a Terrier Lexicon
term = 'chemic'
lexicon = index.lexicon()
collection_stats = index.collection_statistics()
if term in lexicon:
    prob = lexicon[term].getFrequency() / collection_stats.getNumberOfTokens()
else:
    prob = 0.0
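
The same pattern extends to other statistics. As a hedged sketch, the helper below computes a plain inverse document frequency, log(N / df), from integer counts. The numbers are made up; with a real index they would come from ``getDocumentFrequency()`` on the lexicon entry and ``getNumberOfDocuments()`` on the collection statistics. ``idf`` is an illustrative helper, not a PyTerrier function.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """Plain inverse document frequency; returns 0.0 for terms that never occur."""
    if doc_freq <= 0:
        return 0.0
    return math.log(num_docs / doc_freq)


# Hypothetical counts: the term occurs in 10 of 1000 documents.
example_idf = idf(doc_freq=10, num_docs=1000)
```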
.. how-to:: How do I manually traverse the postings of an index?
.. _terrier:how-to:traverse-postings:
.. related:: pyterrier.terrier.TerrierIndex.inverted_index
.. code-block:: python
:caption: Traversing postings lists in a Terrier index
term = 'chemic'
meta = index.meta_index()
inv = index.inverted_index()
lexicon = index.lexicon()
for posting in inv.getPostings(lexicon[term]): # :footnote: Look up the posting list using the pointer from the lexicon entry
    docno = meta.getItem("docno", posting.getId()) # :footnote: Here we load the ``docno`` (document identifier) from the meta index
    print(f"{docno} has a frequency of {posting.getFrequency()}")
.. how-to:: How do I look up the terms that occur in a document?
.. _terrier:how-to:direct-index:
.. related:: pyterrier.terrier.TerrierIndex.direct_index
.. code-block:: python
:caption: Accessing terms in a document from a Terrier index
docid = 10 # :footnote: Document IDs are zero-based, so this will return the 11th document in the index
di = index.direct_index()
doi = index.document_index()
lexicon = index.lexicon()
for posting in di.getPostings(doi.getDocumentEntry(docid)):
    termid = posting.getId()
    lee = lexicon.getLexiconEntry(termid)
    print(f"{lee.getKey()} with frequency {posting.getFrequency()}")
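
The inverted and direct indices are two views of the same term-document data: the former maps each term to the documents containing it, the latter maps each document to the terms it contains. The toy model below builds both from tokenised documents in plain Python. It is only a conceptual sketch under simplifying assumptions (whitespace tokenisation, no stemming, in-memory dictionaries) and does not reflect how Terrier stores its structures.

```python
from collections import Counter, defaultdict

# Toy collection keyed by zero-based docid (made-up content).
docs = {
    0: "chemic reaction chemic",
    1: "reaction rate",
}

# Direct (forward) index: docid -> {term: frequency in that document}
direct = {docid: Counter(text.split()) for docid, text in docs.items()}

# Inverted index: term -> list of (docid, frequency) postings
inverted = defaultdict(list)
for docid, term_freqs in direct.items():
    for term, freq in term_freqs.items():
        inverted[term].append((docid, freq))

chemic_postings = inverted["chemic"]  # which documents contain 'chemic'?
doc1_terms = dict(direct[1])          # which terms occur in document 1?
```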
.. how-to:: How do I manually compute the scores for a weighting model?
.. _terrier:how-to:manual-wmodel:
.. code-block:: python
:caption: Manually computing weighting model scores using Terrier
term = "chemic"
inv = index.inverted_index()
meta = index.meta_index()
lex = index.lexicon()
le = lex.getLexiconEntry(term)
wmodel = pt.autoclass("org.terrier.matching.models.PL2")() # :footnote: Here we use the Java class name for the PL2 weighting model. You can replace this with any other Terrier weighting model class.
wmodel.setCollectionStatistics(index.collection_statistics()) # :footnote: The weighting model requires some setup before it can be used
wmodel.setEntryStatistics(le)
wmodel.setKeyFrequency(1)
wmodel.prepare()
for posting in inv.getPostings(le):
    docno = meta.getItem("docno", posting.getId())
    score = wmodel.score(posting)
    print(f"{docno} with score {score:0.4f}")
Note that this is less efficient than using the built-in retriever transformers such as
:meth:`~pyterrier.terrier.TerrierIndex.bm25` or :meth:`~pyterrier.terrier.TerrierIndex.pl2`.
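
To see which inputs a weighting model draws on, it can help to write one out by hand. The function below is a sketch of a common textbook BM25 formulation (with a non-negative IDF), not Terrier's implementation: Terrier's weighting model classes may use different constants, logarithm bases, or query-term factors, so scores will generally not match. All names and example numbers are illustrative. Conceptually, the collection-level arguments correspond to what ``setCollectionStatistics()`` supplies, the document frequency to ``setEntryStatistics()``, and the per-document arguments to each posting.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs, k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one document (illustrative only)."""
    # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5))
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalisation: longer-than-average documents are penalised
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)


# Same term frequency in a short and a long document (made-up statistics).
short_doc = bm25_term_score(tf=3, doc_len=50, avg_doc_len=100, doc_freq=10, num_docs=1000)
long_doc = bm25_term_score(tf=3, doc_len=200, avg_doc_len=100, doc_freq=10, num_docs=1000)
```

With equal term frequency, the shorter document receives the higher score, which is the effect of the length normalisation controlled by ``b``.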