Terrier Indexing
Indexing Basics
Common indexing options in Terrier are exposed through TerrierIndex.indexer.
Its parameters let you adjust various aspects of the indexing process, such as tokenisation, stemming, position indexing, and more:
import pyterrier as pt
my_index = pt.terrier.TerrierIndex("my_index.terrier")
indexer = my_index.indexer(tokeniser='twitter', store_positions=True)
indexer.index([
{"docno" : "tweet1", "text" : "This is a tweet! #pyterrier"},
{"docno" : "tweet2", "text" : "Another tweet, with a link: https://example.com"}
])
See TerrierIndex.indexer for the full list of parameters.
Note that Terrier indexes do not support adding additional documents after the initial indexing process.
You do not need to load all documents into memory at once when indexing. Indexers support any “iterable”, including generators that yield one document at a time. Here is an example of indexing documents from a generator function:
def generate_docs():
    # enumerate() supplies a running counter that serves as the docno
    with open('my_humongous_collection.txt') as f:
        for i, line in enumerate(f):
            yield {"docno" : str(i), "text" : line.strip()}
indexer.index(generate_docs())
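Before handing a generator to an indexer, it can help to inspect its first few documents. The following is a minimal sketch using only the standard library; the temporary file and its contents are made up for illustration, and no Terrier call is involved.

```python
import itertools
import os
import tempfile

def generate_docs(path):
    """Yield one document dict per line, without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            yield {"docno": str(i), "text": line.strip()}

# Build a small stand-in collection file (made-up data for this sketch).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("first document\nsecond document\nthird document\n")
    collection_path = tmp.name

# Take only the first two documents; later lines are never turned into dicts.
docs = generate_docs(collection_path)
preview = list(itertools.islice(docs, 2))
docs.close()  # release the file handle held by the suspended generator
os.remove(collection_path)

print(preview)
```

Because the generator is lazy, the same function can be passed to an indexer afterwards by calling it again to obtain a fresh iterator.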
The indexer can function as the final stage of an indexing pipeline. For example, the following expands documents with
Doc2Query before indexing them:
import pyterrier as pt
from pyterrier_doc2query import Doc2Query
index = pt.terrier.TerrierIndex("my_index.terrier")
pipeline = Doc2Query(append=True) >> index.indexer()
Pretokenisation
Sometimes you want more fine-grained control over the tokenisation directly within PyTerrier.
In this case, each document to be indexed can carry a toks column: a dictionary that maps each pre-tokenised term to its count in that document:
my_index = pt.terrier.TerrierIndex('my_index.terrier')
my_index.index([
{'docno' : 'd1', 'toks' : {'a' : 1, '##2' : 2}},
{'docno' : 'd2', 'toks' : {'a' : 2, '##2' : 1}}
])
# or
indexer = my_index.toks_indexer()
indexer.index([
{'docno' : 'd1', 'toks' : {'a' : 1, '##2' : 2}},
{'docno' : 'd2', 'toks' : {'a' : 2, '##2' : 1}}
])
Using toks_indexer() lets you configure settings for the indexing process.
Note
When supplying pre-tokenised text during indexing, Terrier bypasses its tokeniser, stemmer, and stopword removal. You will need
to apply the same tokenisation rules at retrieval time, for instance using pt.rewrite.tokenise.
This allows tokenisation using, for instance, the HuggingFace tokenizers:
from transformers import AutoTokenizer
from collections import Counter
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# This creates a new column called 'toks', where each row contains
# a dictionary of the BERT WordPiece tokens of the 'text' column.
# This simple example tokenises one row at a time; it could be
# made more efficient by using the batching support in the tokeniser.
token_row_apply = pt.apply.toks(lambda row: Counter(tok.tokenize(row['text'])))
my_index = pt.terrier.TerrierIndex('my_index.terrier')
index_pipe = token_row_apply >> my_index
index_pipe.index([
{'docno' : 'd1', 'text' : 'do goldfish grow?'},
{'docno' : 'd2', 'text' : ''}
])
At retrieval time, WordPieces that contain special characters (e.g. ‘##w’ ‘[SEP]’) need to be encoded so as to avoid Terrier’s tokeniser.
We use pt.rewrite.tokenise() to apply a tokeniser to the query, setting matchop to True, such that pt.terrier.Retriever.matchop()
is called to ensure that rewritten query terms are properly encoded:
br = pt.terrier.Retriever(indexref)
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
query_toks = pt.rewrite.tokenise(tok.tokenize, matchop=True)
retr_pipe = query_toks >> br
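The toks column used in this section is simply a mapping from term to count, so it does not have to come from a library tokeniser. As a dependency-free sketch (simple_tokenise and to_toks are hypothetical helpers, not part of PyTerrier), such dictionaries could be produced like this:

```python
import re
from collections import Counter

def simple_tokenise(text):
    """Hypothetical tokeniser: lowercase, keep runs of word characters and '#'."""
    return re.findall(r"[#\w]+", text.lower())

def to_toks(doc):
    """Return a new document dict whose 'toks' maps each term to its count."""
    counts = Counter(simple_tokenise(doc["text"]))
    return {"docno": doc["docno"], "toks": dict(counts)}

docs = [
    {"docno": "d1", "text": "Do goldfish grow? Goldfish do."},
    {"docno": "d2", "text": "#pyterrier indexes tweets"},
]
pretokenised = [to_toks(d) for d in docs]
print(pretokenised[0])
```

Whatever function plays the role of simple_tokenise here would also need to be applied to queries, since Terrier does not tokenise pre-tokenised documents itself.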
What’s in a Terrier index?
A Terrier index contains several data structures. These structures provide low-level API access to the indexed data. The data structures that can be present in a Terrier index are:
- Collection Statistics
TerrierIndex.collection_statistics() provides global statistics of the index, such as the number of documents, number of terms, etc.
- Lexicon
TerrierIndex.lexicon() provides an entry for each unique term in the index, which contains the corresponding statistics of each term (frequency etc.), and a pointer to the inverted index posting list for that term.
- Inverted Index
TerrierIndex.inverted_index() provides access to the posting list for each term, which records the documents that a given term appears in, and with what frequency for each document.
- Document Index
TerrierIndex.document_index() provides access to the length of the document (and other field lengths).
- Meta Index
TerrierIndex.meta_index() provides access to document metadata, such as the docno, and optionally the raw text and the URL of each document.
- Direct Index (Forward Index)
TerrierIndex.direct_index() provides a posting list for each document, detailing which terms occur in that document and with which frequency. The presence of the direct index depends on the IndexingType that has been applied - single-pass and some memory indices do not provide a direct index.
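To make the relationship between these structures concrete, here is a toy model built from plain Python containers. It is only a conceptual sketch: the variable names mirror the list above, but none of this is Terrier's API or on-disk format, and the whitespace tokenisation is an assumption made for brevity.

```python
from collections import Counter, defaultdict

docs = [
    {"docno": "d1", "text": "a b a"},
    {"docno": "d2", "text": "b c"},
]

meta_index = []                      # docid -> metadata (here only the docno)
document_index = []                  # docid -> document length in tokens
direct_index = []                    # docid -> {term: frequency in that document}
inverted_index = defaultdict(list)   # term -> [(docid, frequency), ...]

for docid, doc in enumerate(docs):
    counts = Counter(doc["text"].split())
    meta_index.append({"docno": doc["docno"]})
    document_index.append(sum(counts.values()))
    direct_index.append(dict(counts))
    for term, tf in counts.items():
        inverted_index[term].append((docid, tf))

# Lexicon: per-term statistics. In this toy model the term itself acts as
# the "pointer" into inverted_index.
lexicon = {
    term: {"doc_freq": len(postings), "total_freq": sum(tf for _, tf in postings)}
    for term, postings in inverted_index.items()
}

collection_statistics = {
    "num_docs": len(docs),
    "num_terms": len(lexicon),
    "num_tokens": sum(document_index),
}

print(collection_statistics)
print(lexicon["a"], inverted_index["b"], direct_index[0])
```

The inverted index answers "which documents contain this term?", while the direct index answers "which terms does this document contain?"; both are derived from the same per-document counts.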
Advanced: IterDictIndexer
IterDictIndexer is a flexible Terrier indexer implementation. It is returned from
TerrierIndex.indexer(), but can be constructed manually if additional functionality
is required.
Examples using IterDictIndexer
An iterdict can just be a list of dictionaries:
docs = [ { 'docno' : 'doc1', 'text' : 'a b c' } ]
iter_indexer = pt.IterDictIndexer("./index", meta={'docno': 20, 'text': 4096})
indexref1 = iter_indexer.index(docs)
A dataframe can also be used, by virtue of its .to_dict() method:
import pandas as pd
df = pd.DataFrame([['doc1', 'a b c']], columns=['docno', 'text'])
iter_indexer = pt.IterDictIndexer("./index")
indexref2 = iter_indexer.index(df.to_dict(orient="records"))
However, the main power of using IterDictIndexer is for processing indefinite iterables, such as those returned by generator functions. For example, the tsv file of the MSMARCO Passage Ranking corpus can be indexed as follows:
dataset = pt.get_dataset("trec-deep-learning-passages")
def msmarco_generate():
    with pt.io.autoopen(dataset.get_corpus()[0], 'rt') as corpusfile:
        for l in corpusfile:
            docno, passage = l.split("\t")
            yield {'docno' : docno, 'text' : passage}
iter_indexer = pt.IterDictIndexer("./passage_index", meta={'docno': 20, 'text': 4096})
indexref3 = iter_indexer.index(msmarco_generate())
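When parsing tab-separated lines by hand, two details are easy to miss: each line still ends with a newline, and a plain split fails to unpack if the text itself contains a tab. The sketch below shows one defensive variant on in-memory sample data; parse_tsv is a hypothetical helper, and whether a given corpus needs this care is an assumption to check against the data.

```python
import io

def parse_tsv(lines):
    """Yield {'docno', 'text'} dicts from 'docno<TAB>text' lines."""
    for line in lines:
        # rstrip drops the trailing newline; maxsplit=1 keeps any further
        # tabs inside the passage text instead of raising on unpacking.
        docno, passage = line.rstrip("\n").split("\t", 1)
        yield {"docno": docno, "text": passage}

# Made-up sample standing in for a corpus file.
sample = io.StringIO("0\tThe first passage.\n1\tA passage\twith a tab.\n")
parsed = list(parse_tsv(sample))
print(parsed)
```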
IterDictIndexer can be used in connection with Indexing Pipelines.
Similarly, indexing JSONL files takes only a few lines of Python:
def iter_file(filename):
    import json
    with open(filename, 'rt') as file:
        for l in file:
            # assumes that each line contains 'docno', 'text' attributes
            # yields a dictionary for each json line
            yield json.loads(l)
indexref4 = pt.IterDictIndexer("./index", meta={'docno': 20, 'text': 4096}).index(iter_file("/path/to/file.jsonl"))
NB: Use pt.io.autoopen() as a drop-in replacement for open() that supports files compressed by gzip etc.
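For readers who want to see the idea without PyTerrier installed, here is a standard-library sketch of reading a possibly gzip-compressed JSONL file while skipping blank lines and checking for the expected keys. open_maybe_gzip is a hypothetical stand-in written for this example; it is not how pt.io.autoopen() is implemented, and it only handles gzip.

```python
import gzip
import json
import os
import tempfile

def open_maybe_gzip(filename):
    """Hypothetical helper: open a plain or .gz text file for reading."""
    if filename.endswith(".gz"):
        return gzip.open(filename, "rt", encoding="utf-8")
    return open(filename, "rt", encoding="utf-8")

def iter_jsonl(filename):
    """Yield one dict per non-empty line; each must carry 'docno' and 'text'."""
    with open_maybe_gzip(filename) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            doc = json.loads(line)
            if "docno" not in doc or "text" not in doc:
                raise ValueError(f"line {lineno}: missing 'docno' or 'text'")
            yield doc

# Write a tiny gzip-compressed JSONL file to demonstrate (made-up data).
path = os.path.join(tempfile.mkdtemp(), "docs.jsonl.gz")
with gzip.open(path, "wt", encoding="utf-8") as out:
    out.write('{"docno": "d1", "text": "a b c"}\n\n{"docno": "d2", "text": "d e"}\n')

loaded = list(iter_jsonl(path))
print(loaded)
```

Validating keys inside the generator surfaces malformed lines with a line number, rather than as a harder-to-trace failure during indexing.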
Indexing TREC-formatted files using IterDictIndexer
If you have TREC-formatted files that you wish to use with an IterDictIndexer-like indexer, treccollection2textgen() can be used
as a helper function to aid in parsing such files.
Example using Indexing Pipelines:
files = pt.io.find_files("/path/to/Disk45")
gen = pt.index.treccollection2textgen(files)
indexer = pt.text.sliding() >> pt.IterDictIndexer("./index45")
index = indexer.index(gen)
Threading
On UNIX-based systems, IterDictIndexer can also perform multi-threaded indexing:
iter_indexer = pt.IterDictIndexer("./passage_index_8", meta={'docno': 20, 'text': 4096}, threads=8)
indexref6 = iter_indexer.index(msmarco_generate())
Note that the resulting index ordering with multiple threads is non-deterministic; if you need
deterministic behavior you must index in single-threaded mode. Furthermore, indexing can only go
as quickly as the document iterator, so to take full advantage of multi-threaded indexing, you
will need to keep the iterator function light-weight. Many datasets provide a fast corpus iteration
function (get_corpus_iter()); see Importing Datasets for more information.
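One way to judge whether an iterator is light-weight enough is to time it on its own, with no indexer attached. The helper below is a hypothetical sketch (measure_throughput and sample_docs are not PyTerrier functions) that consumes a bounded number of documents and reports a rough rate.

```python
import time

def measure_throughput(doc_iter, limit=10_000):
    """Consume up to `limit` documents and return (count, docs_per_second)."""
    start = time.perf_counter()
    count = 0
    for _ in doc_iter:
        count += 1
        if count >= limit:
            break
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate

def sample_docs(n):
    """Made-up stand-in for a real corpus iterator."""
    for i in range(n):
        yield {"docno": str(i), "text": f"passage number {i}"}

count, rate = measure_throughput(sample_docs(50_000), limit=10_000)
print(f"{count} docs at roughly {rate:,.0f} docs/sec")
```

If the measured rate is low, adding indexing threads is unlikely to help much, because the iterator remains the bottleneck.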
Advanced: Specialized Indexers
In most cases, you will want to use IterDictIndexer (e.g., using
TerrierIndex.indexer()). However, several specialized indexers are available
for specific use-cases.
TRECCollectionIndexer lets you index TREC-formatted collections by passing in a list of file paths. For example:
Indexing a TREC collection using TRECCollectionIndexer
import pyterrier as pt
# list of filenames to index
files = pt.io.find_files("/path/to/WT2G/wt2g-corpus/")
# build the index
indexer = pt.TRECCollectionIndexer("./wt2g_index", verbose=True, blocks=False)
indexref = indexer.index(files)
# load the index, print the statistics
index = pt.IndexFactory.of(indexref)
print(index.getCollectionStatistics().toString())
FilesIndexer lets you index a list of files in various formats such as pdf, docx, and txt.
Warning
The specialized indexers in this section are not compatible with indexing pipelines.
Advanced: Indexing Configuration
Our aim is to expose all conventional Terrier indexing configuration through PyTerrier, for instance as constructor arguments to the Indexer classes. However, as Terrier is a legacy platform, some changes will take time to integrate into Terrier. Moreover, the manner of configuration varies a little between the Indexer classes. In the following, we list common indexing configurations and how to apply them when indexing using PyTerrier, noting any differences between the Indexer classes.
Choice of Indexer
Terrier has three different types of indexer. The choice of indexer is exposed using the type kwarg
to the indexer class. The indexer type can be set using the IndexingType enum.
Stemming configuration or stopwords
The default Terrier indexing configuration is to apply an English stopword list, and Porter’s stemmer. You can configure this using the stemmer and stopwords kwargs for the various indexers:
indexer = pt.IterDictIndexer("./index", stemmer='SpanishSnowballStemmer', stopwords=None)
See also the org.terrier.terms package for a list of the available term pipeline objects provided by Terrier.
Similarly, the use of Terrier’s English stopword list can be disabled by setting the stopwords kwarg to None.
A custom stopword list can be set by setting the stopwords kwarg to a list of words:
indexer = pt.IterDictIndexer("./index", stopwords=['a', 'an', 'the'])
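To illustrate what stopword removal and stemming do to a token stream, here is a toy term pipeline in plain Python. It is a conceptual sketch only: make_term_pipeline and strip_plural are hypothetical, the ordering (stopwords first, then stemming) is an assumption for this example, and the "stemmer" is a deliberately crude stand-in rather than Porter's algorithm.

```python
def make_term_pipeline(stopwords=None, stemmer=None):
    """Return a function that filters stopwords and then applies a stemmer."""
    stopset = set(stopwords or [])

    def pipeline(tokens):
        out = []
        for tok in tokens:
            if tok in stopset:
                continue  # drop stopwords before stemming
            out.append(stemmer(tok) if stemmer else tok)
        return out

    return pipeline

def strip_plural(term):
    """Toy stand-in for a stemmer; NOT Porter's algorithm."""
    return term[:-1] if term.endswith("s") and len(term) > 3 else term

pipe = make_term_pipeline(stopwords=["a", "an", "the"], stemmer=strip_plural)
processed = pipe("the dogs chase a ball".split())
print(processed)

# With both stages disabled, tokens pass through unchanged.
passthrough = make_term_pipeline()(["the", "dogs"])
print(passthrough)
```

Whatever configuration is chosen at indexing time generally has to be mirrored at query time, otherwise query terms will not match the indexed forms.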
Languages and Tokenisation
Similarly, the choice of tokeniser can be controlled in the indexer constructor using the tokeniser kwarg.
EnglishTokeniser is the default tokeniser. Other tokenisers are listed in the org.terrier.indexing.tokenisation
package. For instance, it’s common to use UTFTokeniser when indexing non-English text:
indexer = pt.IterDictIndexer("./index", stemmer=None, stopwords=None, tokeniser="UTFTokeniser")
Positions (aka blocks)
All indexer classes expose a blocks boolean constructor argument that allows position information to be recorded in the index. It defaults to False, i.e. positions are not recorded.
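Recording positions is what makes phrase and proximity matching possible. The toy positional index below sketches the idea in plain Python; build_positional_index and phrase_match are hypothetical names, and the structure shown is not Terrier's block format.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {docno: [token positions]}."""
    index = defaultdict(dict)
    for doc in docs:
        for pos, term in enumerate(doc["text"].lower().split()):
            index[term].setdefault(doc["docno"], []).append(pos)
    return index

def phrase_match(index, first, second):
    """Return docnos where `second` occurs immediately after `first`."""
    hits = []
    for docno, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(docno, []))
        if any(p + 1 in following for p in positions):
            hits.append(docno)
    return hits

docs = [
    {"docno": "d1", "text": "new york city"},
    {"docno": "d2", "text": "york is not new"},
]
pos_index = build_positional_index(docs)
print(phrase_match(pos_index, "new", "york"))
```

Both documents contain both terms, so an index without positions could not tell them apart for the phrase "new york"; the stored positions are what separate d1 from d2.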
Fields
Fields refers to storing the frequency of a term’s occurrences in different parts of a document, e.g. title vs. body vs. anchor text.
IterDictIndexer can be configured to record fields by setting the fields=True kwarg to the constructor. For instance, if we have two different fields to a document:
docs = [ {'docno' : 'd1', 'title': 'This is the title', 'text' : 'This is the main document'} ]
indexref = pt.IterDictIndexer("./index_fields", text_attrs=['text', 'title'], fields=True).index(docs)
index = pt.IndexFactory.of(indexref)
print(index.getCollectionStatistics().getNumberOfFields()) # will print 2
# make a BM25F retriever, places twice as much weight on the title as the main body
bm25 = pt.terrier.Retriever(index, wmodel='BM25F', controls={'w.0' : 1, 'w.1' : 2, 'c.0' : 0.75, 'c.1' : 0.5})
See the Terrier indexing documentation on fields for more information.
NB: Since PyTerrier 0.13, IterDictIndexer no longer records fields by default. This speeds up indexing and retrieval when field-based models such as BM25F are not required.
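Conceptually, a fielded index keeps one frequency per field for every term in a document, in the order given by text_attrs. The plain-Python sketch below shows what that bookkeeping might look like for the two-field document used earlier; field_frequencies is a hypothetical helper, and whitespace tokenisation without stemming or stopword removal is an assumption for brevity.

```python
from collections import Counter

# Field 0 is 'text' and field 1 is 'title', matching the order of text_attrs.
text_attrs = ["text", "title"]

def field_frequencies(doc, attrs):
    """Return {term: [tf_in_field_0, tf_in_field_1, ...]} for one document."""
    per_field = [Counter(doc.get(attr, "").lower().split()) for attr in attrs]
    terms = set().union(*per_field)
    return {t: [counts[t] for counts in per_field] for t in sorted(terms)}

doc = {"docno": "d1", "title": "This is the title", "text": "This is the main document"}
freqs = field_frequencies(doc, text_attrs)
for term, tfs in freqs.items():
    print(term, tfs)
```

A field-aware weighting model can then treat a match in the title differently from a match in the body, which is not possible once the counts have been summed into a single frequency.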
Changing the tags parsed by TREC Collection
Use the relevant properties listed in the Terrier indexing documentation.
MetaIndex configuration
Metadata refers to the arbitrary strings associated with each document recorded in a Terrier index. These can range from the “docno” attribute of each document, as used to support experimentation, to other attributes such as the URL of the documents, or even the raw text of the document. Indeed, storing the raw text of each document is a trick often used when applying additional re-rankers such as BERT (see pyterrier_bert for more information on integrating PyTerrier with BERT-based re-rankers). Indexers now expose meta and meta_tags constructor kwargs to make this easier.
Reverse MetaIndex configuration
On occasion, there is a need to look up documents in a Terrier index based on their metadata, e.g. “docno”. The meta_reverse constructor kwarg specifies which meta keys support reverse lookup.
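The difference between forward and reverse metadata lookup can be sketched with plain dictionaries. This is a conceptual illustration only: the names below are hypothetical, returning -1 for an unknown value is a choice made for this example, and nothing here reflects how Terrier stores its meta index.

```python
# Forward lookup: docid -> metadata entry.
meta_index = [
    {"docno": "doc1", "url": "https://example.com/1"},
    {"docno": "doc2", "url": "https://example.com/2"},
]

# Only keys listed here get a reverse mapping (value -> docid).
reverse_keys = ["docno"]
reverse_meta = {
    key: {entry[key]: docid for docid, entry in enumerate(meta_index)}
    for key in reverse_keys
}

def docid_for(key, value):
    """Return the docid whose metadata `key` equals `value`, or -1 if absent."""
    if key not in reverse_meta:
        raise KeyError(f"no reverse lookup built for meta key {key!r}")
    return reverse_meta[key].get(value, -1)

print(meta_index[1]["docno"])       # forward: docid 1 -> its docno
print(docid_for("docno", "doc2"))   # reverse: docno -> docid
```

Building a reverse mapping costs extra space per key, which is one reason to enable it only for the keys that actually need it.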