Terrier Indexing

Indexing Basics

Common indexing options in Terrier are exposed through TerrierIndex.indexer. Its parameters let you adjust various aspects of the indexing process, such as tokenisation, stemming, position indexing, and more:

Creating an indexer from a TerrierIndex
import pyterrier as pt
my_index = pt.terrier.TerrierIndex("my_index.terrier")
indexer = my_index.indexer(tokeniser='twitter', store_positions=True) # [1]
indexer.index([
    {"docno" : "tweet1", "text" : "This is a tweet! #pyterrier"},
    {"docno" : "tweet2", "text" : "Another tweet, with a link: https://example.com"}
])
  1. See TerrierIndex.indexer for the full list of parameters.

Note that Terrier indexes do not support adding documents after the initial indexing process.

You do not need to load all documents into memory at once when indexing. Indexers support any “iterable”, including generators that yield one document at a time. Here is an example of indexing documents from a generator function:

Indexing from a generator function
def generate_docs():
    # enumerate supplies a running counter that is used as the docno
    with open('my_humongous_collection.txt') as f:
        for i, line in enumerate(f):
            yield {"docno" : str(i), "text" : line.strip()}

indexer.index(generate_docs())
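
Before handing a generator to an indexer, it can help to confirm that it yields dictionaries of the expected shape without reading the whole collection. The following is a minimal sketch in plain Python that makes no PyTerrier calls. The helper name generate_docs_from and the sample lines are hypothetical; itertools.islice is used to pull only the first few documents.

```python
import itertools
import os
import tempfile

def generate_docs_from(path):
    """Yield one {'docno', 'text'} dict per line, reading lazily."""
    with open(path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            yield {"docno": str(i), "text": line.strip()}

# Write a tiny stand-in collection (hypothetical sample data).
with tempfile.NamedTemporaryFile("wt", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("first document\nsecond document\nthird document\n")
    sample_path = tmp.name

# Peek at the first two documents only; the third line is never turned into a dict.
preview = list(itertools.islice(generate_docs_from(sample_path), 2))
print(preview)
os.remove(sample_path)
```

Because the generator is consumed lazily, the same function can be passed to an indexer without the collection ever being held in memory as a whole.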

The indexer can function as the final stage of an indexing pipeline. For example, the following pipeline expands documents with Doc2Query before indexing them:

import pyterrier as pt
from pyterrier_doc2query import Doc2Query
index = pt.terrier.TerrierIndex("my_index.terrier")
pipeline = Doc2Query(append=True) >> index.indexer()

Pretokenisation

Sometimes you want more fine-grained control over the tokenisation directly within PyTerrier. In this case, each document to be indexed can contain, in the toks column, a dictionary that maps each pre-tokenised term to its count:

Indexing pre-tokenised text
my_index = pt.terrier.TerrierIndex('my_index.terrier')
my_index.index([
    {'docno' : 'd1', 'toks' : {'a' : 1, '##2' : 2}},
    {'docno' : 'd2', 'toks' : {'a' : 2, '##2' : 1}}
])

# or

indexer = my_index.toks_indexer() # [1]
indexer.index([
    {'docno' : 'd1', 'toks' : {'a' : 1, '##2' : 2}},
    {'docno' : 'd2', 'toks' : {'a' : 2, '##2' : 1}}
])
  1. Using toks_indexer() lets you configure settings about the indexing process.

Note

When supplying pre-tokenised text during indexing, Terrier bypasses its tokeniser, stemmer, and stopword removal. You will need to apply the same tokenisation rules at retrieval time, for instance using pt.rewrite.tokenise.

This allows tokenisation using, for instance, the HuggingFace tokenizers:

from transformers import AutoTokenizer
from collections import Counter

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# This creates a new column called 'toks', where each row contains
# a dictionary of the BERT WordPiece tokens of the 'text' column.
# This simple example tokenises one row at a time; it could be
# made more efficient by using the batching support in the tokeniser.
token_row_apply = pt.apply.toks(lambda row: Counter(tok.tokenize(row['text'])))

my_index = pt.terrier.TerrierIndex('my_index.terrier')
index_pipe = token_row_apply >> my_index
index_pipe.index([
    {'docno' : 'd1', 'text' : 'do goldfish grow?'},
    {'docno' : 'd2', 'text' : ''}
])

At retrieval time, WordPieces that contain special characters (e.g. ‘##w’, ‘[SEP]’) need to be encoded so as to avoid Terrier’s tokeniser. We use pt.rewrite.tokenise() to apply a tokeniser to the query, setting matchop to True, such that pt.terrier.Retriever.matchop() is called to ensure that rewritten query terms are properly encoded:

br = pt.terrier.Retriever(indexref)
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
query_toks = pt.rewrite.tokenise(tok.tokenize, matchop=True)
retr_pipe = query_toks >> br
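
The requirement that index-time and query-time tokenisation agree can be sketched without any external tokeniser. In this plain-Python illustration, simple_tokenise is a hypothetical stand-in for a real tokeniser such as the BERT WordPiece one; only the shape of the toks column (a dictionary from term to count) is taken from the text above, and the matching loop is a toy, not how Terrier retrieves.

```python
from collections import Counter

def simple_tokenise(text):
    """Hypothetical tokeniser: lowercase and split on whitespace."""
    return text.lower().split()

docs = [
    {"docno": "d1", "text": "Do goldfish grow"},
    {"docno": "d2", "text": "goldfish goldfish bowl"},
]

# Build the 'toks' column: a dict mapping each token to its count.
pretokenised = [
    {"docno": d["docno"], "toks": dict(Counter(simple_tokenise(d["text"])))}
    for d in docs
]

# The query must pass through the same function, or its terms may not
# match what was stored at indexing time.
query_terms = simple_tokenise("Goldfish GROW")
matching = [d["docno"] for d in pretokenised
            if all(t in d["toks"] for t in query_terms)]
print(pretokenised)
print(matching)
```

Had the query been left as "Goldfish GROW", neither term would match the lowercased tokens stored for d1.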

What’s in a Terrier index?

A Terrier index contains several data structures. These structures provide low-level API access to the indexed data. The data structures that can be present in a Terrier index are:

Collection Statistics

TerrierIndex.collection_statistics() provides global statistics of the index, such as the number of documents, number of terms, etc.

Lexicon

TerrierIndex.lexicon() provides an entry for each unique term in the index, which contains the corresponding statistics of each term (frequency etc), and a pointer to the inverted index posting list for that term.

Inverted Index

TerrierIndex.inverted_index() provides access to the posting list for each term, which records the documents that a given term appears in, and with what frequency for each document.

Document Index

TerrierIndex.document_index() provides access to the length of the document (and other field lengths).

Meta Index

TerrierIndex.meta_index() provides access to document metadata, such as the docno, and optionally the raw text and the URL of each document.

Direct Index (Forward Index)

TerrierIndex.direct_index() provides a posting list for each document, detailing which terms occur in that document and with which frequency. The presence of the direct index depends on the IndexingType that has been applied - single-pass and some memory indices do not provide a direct index.
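
To make the relationship between these structures concrete, here is a conceptual sketch in plain Python. It is a toy model under simplifying assumptions (whitespace tokenisation, everything held in dictionaries and lists) and does not reflect how Terrier stores or exposes its data; all variable names are illustrative.

```python
from collections import Counter, defaultdict

docs = [
    {"docno": "d1", "text": "a b a"},
    {"docno": "d2", "text": "b c"},
]

meta_index = []                      # docid -> docno
document_index = []                  # docid -> document length
direct_index = []                    # docid -> {term: frequency}
inverted_index = defaultdict(dict)   # term -> {docid: frequency}

for docid, doc in enumerate(docs):
    term_freqs = Counter(doc["text"].split())
    meta_index.append(doc["docno"])
    document_index.append(sum(term_freqs.values()))
    direct_index.append(dict(term_freqs))
    for term, tf in term_freqs.items():
        inverted_index[term][docid] = tf

# Lexicon: per-term statistics (document frequency, total frequency).
lexicon = {
    term: {"df": len(postings), "tf": sum(postings.values())}
    for term, postings in inverted_index.items()
}

collection_statistics = {
    "num_docs": len(docs),
    "num_terms": len(lexicon),
    "num_tokens": sum(document_index),
}
print(lexicon["a"], inverted_index["b"], collection_statistics)
```

The inverted index answers "which documents contain this term?", while the direct index answers "which terms does this document contain?"; the lexicon and collection statistics are summaries derived from the same counts.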

Advanced: IterDictIndexer

IterDictIndexer is a flexible Terrier indexer implementation. It is returned from TerrierIndex.indexer(), but can be constructed manually if additional functionality is required.

Examples using IterDictIndexer

An iterdict can just be a list of dictionaries:

docs = [ { 'docno' : 'doc1', 'text' : 'a b c' }  ]
iter_indexer = pt.IterDictIndexer("./index", meta={'docno': 20, 'text': 4096})
indexref1 = iter_indexer.index(docs)

A dataframe can also be used, by virtue of its .to_dict() method:

import pandas as pd
df = pd.DataFrame([['doc1', 'a b c']], columns=['docno', 'text'])
iter_indexer = pt.IterDictIndexer("./index")
indexref2 = iter_indexer.index(df.to_dict(orient="records"))

However, the main power of using IterDictIndexer is for processing indefinite iterables, such as those returned by generator functions. For example, the tsv file of the MSMARCO Passage Ranking corpus can be indexed as follows:

dataset = pt.get_dataset("trec-deep-learning-passages")
def msmarco_generate():
    with pt.io.autoopen(dataset.get_corpus()[0], 'rt') as corpusfile:
        for l in corpusfile:
            docno, passage = l.split("\t")
            yield {'docno' : docno, 'text' : passage}

iter_indexer = pt.IterDictIndexer("./passage_index", meta={'docno': 20, 'text': 4096})
indexref3 = iter_indexer.index(msmarco_generate())

IterDictIndexer can be used in connection with Indexing Pipelines.

Similarly, indexing JSONL files takes only a few lines of Python:

def iter_file(filename):
  import json
  with open(filename, 'rt') as file:
    for l in file:
      # assumes that each line contains 'docno', 'text' attributes
      # yields a dictionary for each json line
      yield json.loads(l)

indexref4 = pt.IterDictIndexer("./index", meta={'docno': 20, 'text': 4096}).index(iter_file("/path/to/file.jsonl"))

NB: Use pt.io.autoopen() as a drop-in replacement for open() that supports files compressed by gzip etc.
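
The idea of a drop-in opener for compressed files can be illustrated with the standard library alone. The sketch below is an assumption-laden stand-in, not the implementation of pt.io.autoopen(): the helper open_text is hypothetical and only recognises a .gz extension, and the sample documents are made up.

```python
import gzip
import json
import os
import tempfile

def open_text(filename):
    """Hypothetical helper: open plain or gzip-compressed text by extension."""
    if filename.endswith(".gz"):
        return gzip.open(filename, "rt", encoding="utf-8")
    return open(filename, "rt", encoding="utf-8")

def iter_jsonl(filename):
    """Yield one dict per non-empty JSON line."""
    with open_text(filename) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Round-trip a small gzip-compressed JSONL file (hypothetical sample data).
path = os.path.join(tempfile.mkdtemp(), "docs.jsonl.gz")
with gzip.open(path, "wt", encoding="utf-8") as out:
    out.write(json.dumps({"docno": "d1", "text": "a b c"}) + "\n")
    out.write(json.dumps({"docno": "d2", "text": "d e f"}) + "\n")

loaded = list(iter_jsonl(path))
print(loaded)
```

A generator such as iter_jsonl yields dictionaries of the shape that the indexers in this section accept.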

Indexing TREC-formatted files using IterDictIndexer

If you have TREC-formatted files that you wish to use with an IterDictIndexer-like indexer, treccollection2textgen() can be used as a helper function to aid in parsing such files.

Example using Indexing Pipelines:

files = pt.io.find_files("/path/to/Disk45")
gen = pt.index.treccollection2textgen(files)
indexer = pt.text.sliding() >> pt.IterDictIndexer("./index45")
index = indexer.index(gen)

Threading

On UNIX-based systems, IterDictIndexer can also perform multi-threaded indexing:

iter_indexer = pt.IterDictIndexer("./passage_index_8", meta={'docno': 20, 'text': 4096}, threads=8)
indexref6 = iter_indexer.index(msmarco_generate())

Note that the resulting index ordering with multiple threads is non-deterministic; if you need deterministic behavior, you must index in single-threaded mode. Furthermore, indexing can only go as quickly as the document iterator, so to take full advantage of multi-threaded indexing, you will need to keep the iterator function lightweight. Many datasets provide a fast corpus iteration function (get_corpus_iter()); see Importing Datasets for more information.

Advanced: Specialized Indexers

In most cases, you will want to use IterDictIndexer (e.g., using TerrierIndex.indexer()). However, several specialized indexers are available for specific use-cases.

  • TRECCollectionIndexer lets you index TREC-formatted collections by passing in a list of file paths. For example:

    Indexing a TREC collection using TRECCollectionIndexer
    import pyterrier as pt
    # list of filenames to index
    files = pt.io.find_files("/path/to/WT2G/wt2g-corpus/")
    
    # build the index
    indexer = pt.TRECCollectionIndexer("./wt2g_index", verbose=True, blocks=False)
    indexref = indexer.index(files)
    
    # load the index, print the statistics
    index = pt.IndexFactory.of(indexref)
    print(index.getCollectionStatistics().toString())
    
  • FilesIndexer lets you index a list of files in various formats such as pdf, docx, and txt.

Warning

The specialized indexers in this section are not compatible with indexing pipelines.

Advanced: Indexing Configuration

Our aim is to expose all conventional Terrier indexing configuration through PyTerrier, for instance as constructor arguments to the Indexer classes. However, as Terrier is a legacy platform, some changes will take time to integrate into Terrier. Moreover, the configuration needed varies a little between the Indexer classes. In the following, we list common indexing configurations, and how to apply them when indexing using PyTerrier, noting any differences between the Indexer classes.

Choice of Indexer

Terrier has three different types of indexer. The choice of indexer is exposed using the type kwarg to the indexer class. The indexer type can be set using the IndexingType enum.

Stemming configuration or stopwords

The default Terrier indexing configuration is to apply an English stopword list, and Porter’s stemmer. You can configure this using the stemmer and stopwords kwargs for the various indexers:

indexer = pt.IterDictIndexer("./index", stemmer='SpanishSnowballStemmer', stopwords=None)

See also the org.terrier.terms package for a list of the available term pipeline objects provided by Terrier.

Similarly, Terrier’s English stopword list can be disabled by setting the stopwords kwarg to None, as in the example above.

A custom stopword list can be set by setting the stopwords kwarg to a list of words:

indexer = pt.IterDictIndexer("./index", stopwords=['a', 'an', 'the'])

Languages and Tokenisation

Similarly, the choice of tokeniser can be controlled in the indexer constructor using the tokeniser kwarg. EnglishTokeniser is the default tokeniser. Other tokenisers are listed in the org.terrier.indexing.tokenisation package. For instance, it is common to use UTFTokeniser when indexing non-English text:

indexer = pt.IterDictIndexer("./index", stemmer=None, stopwords=None, tokeniser="UTFTokeniser")

Positions (aka blocks)

All indexer classes expose a blocks boolean constructor argument that allows position information to be recorded in the index. It defaults to False, i.e. positions are not recorded.
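
What recording positions makes possible can be sketched with a toy positional index in plain Python. This is a conceptual illustration only, assuming whitespace tokenisation; the structure and the phrase_match helper are hypothetical and are not Terrier's block implementation.

```python
from collections import defaultdict

docs = {"d1": "new york city", "d2": "york is new"}

# term -> {docno: [positions at which the term occurs]}
positional_index = defaultdict(lambda: defaultdict(list))
for docno, text in docs.items():
    for pos, term in enumerate(text.split()):
        positional_index[term][docno].append(pos)

def phrase_match(first, second):
    """Return docnos where `second` occurs immediately after `first`."""
    hits = []
    for docno, first_positions in positional_index[first].items():
        second_positions = set(positional_index[second].get(docno, []))
        if any(p + 1 in second_positions for p in first_positions):
            hits.append(docno)
    return hits

print(phrase_match("new", "york"))
```

Both documents contain "new" and "york", but only with positions can the index tell that they are adjacent, and in that order, in d1 alone. Without positions, phrase and proximity matching of this kind is not possible.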

Fields

Fields refers to storing the frequency of a term’s occurrences in different parts of a document, e.g. title vs. body vs. anchor text.

IterDictIndexer can be configured to record fields by passing the fields=True kwarg to the constructor. For instance, if a document has two different fields:

docs = [ {'docno' : 'd1', 'title': 'This is the title', 'text' : 'This is the main document'} ]
indexref = pt.IterDictIndexer("./index_fields", text_attrs=['text', 'title'], fields=True).index(docs)
index = pt.IndexFactory.of(indexref)
print(index.getCollectionStatistics().getNumberOfFields()) # will print 2
# make a BM25F retriever, places twice as much weight on the title as the main body
bm25 = pt.terrier.Retriever(index, wmodel='BM25F', controls={'w.0' : 1, 'w.1' : 2, 'c.0' : 0.75, 'c.1' : 0.5})

See the Terrier indexing documentation on fields for more information.

NB: Since PyTerrier 0.13, IterDictIndexer no longer records fields by default. This speeds up indexing and retrieval when field-based models such as BM25F are not required.
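
The notion of per-field frequencies can be illustrated in plain Python. The sketch below is deliberately simplified: it counts a term separately in each field and combines the counts with per-field weights, in the spirit of the w.0 and w.1 controls above. It omits the per-field length normalisation that a model such as BM25F also applies, and the sample document and weighted_tf helper are hypothetical.

```python
from collections import Counter

doc = {
    "docno": "d1",
    "title": "terrier indexing",
    "text": "indexing with terrier is fast indexing",
}
field_names = ["text", "title"]  # field 0 is 'text', field 1 is 'title'

# Per-field term frequencies, in the same order as field_names.
field_tfs = [Counter(doc[name].split()) for name in field_names]

def weighted_tf(term, weights):
    """Combine per-field frequencies using one weight per field."""
    return sum(w * tfs[term] for w, tfs in zip(weights, field_tfs))

print(field_tfs[0]["indexing"], field_tfs[1]["indexing"])
print(weighted_tf("indexing", weights=[1, 2]))
```

With fields disabled, only the combined frequency would be available, so a title occurrence could not be weighted differently from a body occurrence.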

Changing the tags parsed by TREC Collection

Use the relevant properties listed in the Terrier indexing documentation.

MetaIndex configuration

Metadata refers to the arbitrary strings associated with each document recorded in a Terrier index. These can range from the “docno” attribute of each document, as used to support experimentation, to other attributes such as the URL of the documents, or even the raw text of the document. Indeed, storing the raw text of each document is a trick often used when applying additional re-rankers such as BERT (see pyterrier_bert for more information on integrating PyTerrier with BERT-based re-rankers). Indexers now expose meta and meta_tags constructor kwargs to make this easier.

Reverse MetaIndex configuration

On occasion, there is a need to look up documents in a Terrier index based on their metadata, e.g. “docno”. The meta_reverse constructor kwarg allows meta keys that support reverse lookup to be specified.
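
The difference between forward and reverse metadata lookup can be sketched in plain Python. This is a conceptual toy, not Terrier's MetaIndex: the records, the reverse_keys list, and the dictionary layout are all illustrative assumptions.

```python
# Forward metadata: docid -> metadata record.
meta = [
    {"docno": "doc1", "url": "https://example.com/1"},
    {"docno": "doc2", "url": "https://example.com/2"},
]

# Reverse lookup is built only for the keys that need it.
reverse_keys = ["docno"]
reverse = {
    key: {record[key]: docid for docid, record in enumerate(meta)}
    for key in reverse_keys
}

# Find the internal docid for a docno, then read its other metadata.
docid = reverse["docno"]["doc2"]
print(docid, meta[docid]["url"])
```

Building a reverse mapping costs extra space for every key it covers, which is one reason to enable it only for the keys that are actually looked up.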