Terrier How-To Guides
============================================================

This page provides a set of how-to guides for common tasks when using Terrier with PyTerrier.

.. how-to:: How do I index a standard corpus?

    .. _terrier:how-to:index-standard:

    .. related:: pyterrier.terrier.TerrierIndex.index

    .. code-block:: python
        :caption: Indexing a standard corpus with Terrier

        import pyterrier as pt
        dataset = pt.datasets.get_dataset("irds:msmarco-passage") # :footnote: Select your dataset here. If the corpus is not available in PyTerrier datasets, see :ref:`terrier:how-to:index-custom`
        my_index = pt.terrier.TerrierIndex('/path/to/index/location.terrier') # :footnote: Specify the location where you want to store the Terrier index. The location must not yet exist. We recommend using the ``.terrier`` extension, though this is not required.
        my_index.index(dataset.get_corpus_iter()) # :footnote: This performs indexing with default settings. If you need more control over the indexing settings, see :meth:`~pyterrier.terrier.TerrierIndex.indexer` and :class:`~pyterrier.terrier.IterDictIndexer` for advanced options.

.. how-to:: How do I index a custom collection?

    .. _terrier:how-to:index-custom:

    .. related:: pyterrier.terrier.TerrierIndex.indexer

    .. code-block:: python
        :caption: Indexing a custom collection with Terrier

        import pyterrier as pt
        my_collection = [ # :footnote: Each document should be a dictionary with ``docno`` (a unique identifier) and additional text fields. Your collection can be any iterable type (list, generator, etc.).
            {"docno": "doc1", "title": "This is the title of document one.", "body": "This is the body of document one."},
            {"docno": "doc2", "title": "This is the title of document two.", "body": "This is the body of document two."},
            {"docno": "doc3", "title": "This is the title of document three.", "body": "This is the body of document three."}
        ]
        my_index = pt.terrier.TerrierIndex('/path/to/index/location.terrier') # :footnote: Specify the location where you want to store the Terrier index. The location must not yet exist. We recommend using the ``.terrier`` extension, though this is not required.
        indexer = my_index.indexer(fields=["title", "body"]) # :footnote: ``fields=...`` lets you specify which fields to index. The ``"text"`` field is the default.
        indexer.index(my_collection)

.. how-to:: How do I index and retrieve languages other than English?

    .. _terrier:how-to:langs:

    .. related:: pyterrier.terrier.TerrierTokeniser

    .. related:: pyterrier.terrier.TerrierStopwords

    .. related:: pyterrier.terrier.TerrierStemmer

    Terrier provides built-in support for several other languages (see the list in :class:`~pyterrier.terrier.TerrierStemmer`). If your target language is supported, you only need to set appropriate tokenisation, stemming, and stopword removal options during indexing. Here is an example for German:

    .. code-block:: python
        :caption: Indexing German text with Terrier

        import pyterrier as pt
        my_collection = [
            {"docno": "doc1", "text": "Dies ist der Text von Dokument eins."},
            {"docno": "doc2", "text": "Dies ist der Text von Dokument zwei."},
            {"docno": "doc3", "text": "Dies ist der Text von Dokument drei."}
        ]
        my_index = pt.terrier.TerrierIndex('/pfad/zum/indexort.terrier')

        # Indexing
        indexer = my_index.indexer(
            tokeniser=pt.terrier.TerrierTokeniser.utf, # :footnote: Be sure to specify :attr:`pyterrier.terrier.TerrierTokeniser.utf` and :attr:`pyterrier.terrier.TerrierStopwords.none` for non-English text -- the default English settings do not work well for other languages.
            stopwords=pt.terrier.TerrierStopwords.none,
            stemmer=pt.terrier.TerrierStemmer.german, # :footnote: Specify the appropriate stemmer for your target language.
        )
        indexer.index(my_collection)

        # Retrieval
        retriever = my_index.bm25()
        retriever.search('Dokumente')

    If your target language does not have built-in support, you can apply custom pre-processing steps in the pipeline. Here is an example using `Spacy `__ for Czech:

    .. code-block:: python
        :caption: Indexing Czech text with Terrier

        import spacy
        import pyterrier as pt
        nlp = spacy.blank("cs")
        def cs_preprocess(text): # :footnote: Here we define a function that performs the necessary pre-processing steps (in this case, Czech tokenization and stopword removal).
            doc = nlp(text)
            toks = [str(token) for token in doc if not token.is_stop]
            return ' '.join(toks) # combine toks back into a string

        my_collection = [
            {"docno": "doc1", "text": "Toto je text prvního dokumentu."},
            {"docno": "doc2", "text": "Toto je text druhého dokumentu."},
            {"docno": "doc3", "text": "Toto je text třetího dokumentu."}
        ]
        my_index = pt.terrier.TerrierIndex('/cesta/k/indexu/umístění.terrier')

        # Indexing
        indexer = my_index.indexer(
            tokeniser=pt.terrier.TerrierTokeniser.utf,
            stopwords=pt.terrier.TerrierStopwords.none, # :footnote: Since we are applying custom pre-processing, we disable stopword removal and stemming in Terrier by setting them to :attr:`pyterrier.terrier.TerrierStopwords.none` and :attr:`pyterrier.terrier.TerrierStemmer.none`.
            stemmer=pt.terrier.TerrierStemmer.none,
        )
        # pre-processing runs as a pipeline stage ahead of the indexer
        indexer_pipeline = pt.apply.text(lambda d: cs_preprocess(d['text'])) >> indexer
        indexer_pipeline.index(my_collection)

        # Retrieval
        retriever = my_index.bm25()
        retriever_pipeline = pt.apply.query(lambda d: cs_preprocess(d['query'])) >> retriever # :footnote: Include the pre-processing steps as stages of the retrieval and indexing pipelines.
        retriever_pipeline.search('dokumentu')

.. how-to:: How do I loop over all documents in an index?

    .. _terrier:how-to:loop-docs:

    .. related:: pyterrier.terrier.TerrierIndex.get_corpus_iter

    :meth:`TerrierIndex.get_corpus_iter() <pyterrier.terrier.TerrierIndex.get_corpus_iter>` provides an iterator over all documents in a Terrier index.

    .. code-block:: python
        :caption: Looping over all documents in a Terrier index

        import pyterrier as pt
        index = pt.terrier.TerrierIndex('/path/to/index/location.terrier')
        for doc in index.get_corpus_iter(): # :footnote: This creates an iterator over all documents in the specified Terrier index.
            print(doc) # do something with doc

.. how-to:: How do I access the terms in an index?

    .. _terrier:how-to:access-lexicon:

    .. related:: pyterrier.terrier.TerrierIndex.lexicon

    :meth:`TerrierIndex.lexicon() <pyterrier.terrier.TerrierIndex.lexicon>` provides access to the Lexicon of a Terrier index.

    .. code-block:: python
        :caption: Accessing the Lexicon of a Terrier index

        import pyterrier as pt
        index = pt.terrier.TerrierIndex('/path/to/index/location.terrier')
        lexicon = index.lexicon()
        for term, entry in lexicon: # :footnote: You can iterate over all terms in the Lexicon
            print(term)
            print(entry.getDocumentFrequency()) # :footnote: Lexicon provides low-level API access through Java bindings. ``getDocumentFrequency()`` is defined in the Java `LexiconEntry `__ class.
        print("frequency for 'chemic':", lexicon["chemic"].getDocumentFrequency()) # :footnote: You can also access statistics for a specific term

    `LexiconEntry `_ objects provide various statistics about terms in the index, including the number of documents the term occurs in (``getDocumentFrequency()``) and the total number of times the term occurs in the collection (``getFrequency()``). You can use these to derive further statistics, such as the (un-smoothed) probability of a term occurring in the collection:

    .. code-block:: python
        :caption: Computing term probabilities from a Terrier Lexicon

        term = 'chemic'
        lexicon = index.lexicon()
        collection_stats = index.collection_statistics()
        if term in lexicon:
            prob = lexicon[term].getFrequency() / collection_stats.getNumberOfTokens()
        else:
            prob = 0.0

.. how-to:: How do I manually traverse the postings of an index?

    .. _terrier:how-to:traverse-postings:

    .. related:: pyterrier.terrier.TerrierIndex.inverted_index

    .. code-block:: python
        :caption: Traversing postings lists in a Terrier index

        import pyterrier as pt
        index = pt.terrier.TerrierIndex('/path/to/index/location.terrier')
        term = 'chemic'
        meta = index.meta_index()
        inv = index.inverted_index()
        lexicon = index.lexicon()
        for posting in inv.getPostings(lexicon[term]): # :footnote: Look up the posting list using the pointer from the lexicon entry
            docno = meta.getItem("docno", posting.getId()) # :footnote: Here we load the ``docno`` (document identifier) from the meta index
            print(f"{docno} has a frequency of {posting.getFrequency()}")

.. how-to:: How do I look up the terms that occur in a document?

    .. _terrier:how-to:direct-index:

    .. related:: pyterrier.terrier.TerrierIndex.direct_index

    .. code-block:: python
        :caption: Accessing terms in a document from a Terrier index

        import pyterrier as pt
        index = pt.terrier.TerrierIndex('/path/to/index/location.terrier')
        docid = 10 # :footnote: Document IDs are zero-based, so this will return the 11th document in the index
        di = index.direct_index()
        doi = index.document_index()
        lexicon = index.lexicon()
        for posting in di.getPostings(doi.getDocumentEntry(docid)):
            termid = posting.getId()
            lee = lexicon.getLexiconEntry(termid)
            print(f"{lee.getKey()} with frequency {posting.getFrequency()}")

.. how-to:: How do I manually compute the scores for a weighting model?

    .. _terrier:how-to:manual-wmodel:

    .. code-block:: python
        :caption: Manually computing weighting model scores using Terrier

        import pyterrier as pt
        index = pt.terrier.TerrierIndex('/path/to/index/location.terrier')
        term = "chemic"
        inv = index.inverted_index()
        meta = index.meta_index()
        lex = index.lexicon()
        le = lex.getLexiconEntry(term)
        wmodel = pt.autoclass("org.terrier.matching.models.PL2")() # :footnote: Here we use the Java class name for the PL2 weighting model. You can replace this with any other Terrier weighting model class.
        wmodel.setCollectionStatistics(index.collection_statistics()) # :footnote: Using the weighting model requires some setup before it can be used
        wmodel.setEntryStatistics(le)
        wmodel.setKeyFrequency(1)
        wmodel.prepare()
        for posting in inv.getPostings(le):
            docno = meta.getItem("docno", posting.getId())
            score = wmodel.score(posting)
            print(f"{docno} with score {score:0.4f}")

    Note that this is less efficient than using the built-in retriever transformers such as :meth:`~pyterrier.terrier.TerrierIndex.bm25` or :meth:`~pyterrier.terrier.TerrierIndex.pl2`.
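
To see what the lexicon statistics and the scoring loop compute without opening a Terrier index, the following plain-Python sketch may help. It does not use the Terrier API: ``postings``, ``doc_lengths``, ``collection_probability`` and ``tf_idf_scores`` are hypothetical names over toy data, and a simple TF-IDF weight stands in for PL2, so the numbers will not match what a Terrier weighting model returns.

```python
import math

# Toy stand-ins for index structures (hypothetical data, not the Terrier API).
# postings maps each term to a list of (docno, term frequency in that document).
postings = {
    "chemic": [("doc1", 3), ("doc3", 1)],
    "reaction": [("doc1", 1), ("doc2", 2), ("doc3", 1)],
}
doc_lengths = {"doc1": 10, "doc2": 8, "doc3": 6}

num_docs = len(doc_lengths)
num_tokens = sum(doc_lengths.values())


def collection_probability(term):
    """Un-smoothed probability of a term occurring in the collection."""
    total_tf = sum(tf for _, tf in postings.get(term, []))
    return total_tf / num_tokens


def tf_idf_scores(term):
    """Score every document containing `term` with a simple TF-IDF weight."""
    plist = postings.get(term, [])
    if not plist:
        return {}
    # The document frequency is the length of the term's postings list.
    idf = math.log(num_docs / len(plist))
    return {docno: tf * idf for docno, tf in plist}


scores = tf_idf_scores("chemic")
for docno, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{docno} with score {score:0.4f}")
```

The loop has the same shape as the Terrier version above: look up the postings for a term, compute one score per posting, and report it against the document identifier. A real weighting model additionally uses collection statistics such as document length.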