Terrier Indexing¶
PyTerrier has a number of useful classes for creating Terrier indices, which can be used for retrieval, query expansion, etc.
Indexer Classes¶
There are three indexer classes:
- You can create an index from TREC-formatted files, such as a TREC test collection, using TRECCollectionIndexer.
- For indexing TXT, PDF, Microsoft Word, etc. files, you can use FilesIndexer.
- For any arbitrary iterable of dictionaries, or a Pandas DataFrame, you can use IterDictIndexer.
There are also different types of indexing supported in Terrier that are exposed in PyTerrier. We explain both the indexing types and the indexer classes below, with examples. Further worked examples of indexing are provided in the example indexing notebook.
TerrierIndexer¶
All indexer classes extend TerrierIndexer. Common constructor arguments for all indexers are shown below.
- class pyterrier.terrier.TerrierIndexer(index_path, *, blocks=False, overwrite=False, verbose=False, meta_reverse=['docno'], stemmer=TerrierStemmer.porter, stopwords=TerrierStopwords.terrier, tokeniser=TerrierTokeniser.english, type=IndexingType.CLASSIC, properties={})[source]¶
This is the super class for all of the Terrier-based indexers exposed by PyTerrier. It hosts common configuration for all index types.
Constructor called by all indexer subclasses. All arguments listed below are available in IterDictIndexer, DFIndexer, TRECCollectionIndexer and FilesIndexer.
- Parameters:
index_path (str) – Directory to store index. Ignored for IndexingType.MEMORY.
blocks (bool) – Create indexer with blocks if true, else without blocks. Default is False.
overwrite (bool) – If index already present at index_path, True would overwrite it, False throws an Exception. Default is False.
verbose (bool) – Provide progress bars if possible. Default is False.
stemmer (TerrierStemmer) – the stemmer to apply. Default is TerrierStemmer.porter.
stopwords (TerrierStopwords) – the stopwords list to apply. Default is TerrierStopwords.terrier.
tokeniser (TerrierTokeniser) – the tokeniser to apply. Default is TerrierTokeniser.english.
type (IndexingType) – the specific indexing procedure to use. Default is IndexingType.CLASSIC.
properties (dict) – Terrier properties that you wish to override.
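For illustration, here is a minimal sketch passing these common arguments to one of the subclasses; the index path is a placeholder, and the values shown simply make the defaults explicit (assuming the enums are exposed at the top level as pt.TerrierStemmer etc., as in the examples later on this page):
import pyterrier as pt
# common TerrierIndexer arguments, applied here to IterDictIndexer;
# blocks=True additionally records term positions (not the default)
indexer = pt.IterDictIndexer(
    "./example_index",
    blocks=True,
    overwrite=True,
    stemmer=pt.TerrierStemmer.porter,
    stopwords=pt.TerrierStopwords.terrier,
    tokeniser=pt.TerrierTokeniser.english,
)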
TRECCollectionIndexer¶
- class pyterrier.terrier.TRECCollectionIndexer(index_path, collection='trec', verbose=False, meta={'docno': 20}, meta_reverse=['docno'], meta_tags={}, **kwargs)[source]¶
Use this Indexer if you wish to index a TREC-formatted collection.
- Parameters:
index_path (str) – Directory to store index. Ignored for IndexingType.MEMORY.
blocks (bool) – Create indexer with blocks if true, else without blocks. Default is False.
overwrite (bool) – If index already present at index_path, True would overwrite it, False throws an Exception. Default is False.
type (IndexingType) – the specific indexing procedure to use. Default is IndexingType.CLASSIC.
collection (Class name, or Class instance, or one of "trec", "trecweb", "warc")
meta (Dict[str,int]) – What metadata for each document to record in the index, and what length to reserve. Metadata fields will be truncated to this length. Defaults to {“docno” : 20}.
meta_reverse (List[str]) – What metadata should we be able to resolve back to a docid. Defaults to [“docno”].
meta_tags (Dict[str,str]) – For collections formed using tagged data (e.g. HTML), which tags correspond to which metadata. This is useful for recording the text of documents for use in neural rankers - see Working with Document Texts.
Example indexing the TREC WT2G corpus:
import pyterrier as pt
# list of filenames to index
files = pt.io.find_files("/path/to/WT2G/wt2g-corpus/")
# build the index
indexer = pt.TRECCollectionIndexer("./wt2g_index", verbose=True, blocks=False)
indexref = indexer.index(files)
# load the index, print the statistics
index = pt.IndexFactory.of(indexref)
print(index.getCollectionStatistics().toString())
FilesIndexer¶
- class pyterrier.terrier.FilesIndexer(index_path, *, meta={'docno': 20, 'filename': 512}, meta_reverse=['docno'], meta_tags={}, **kwargs)[source]¶
Use this Indexer if you wish to index pdf, docx, txt, etc. files.
- Parameters:
index_path (str) – Directory to store index. Ignored for IndexingType.MEMORY.
blocks (bool) – Create indexer with blocks if true, else without blocks. Default is False.
type (IndexingType) – the specific indexing procedure to use. Default is IndexingType.CLASSIC.
meta (Dict[str,int]) – What metadata for each document to record in the index, and what length to reserve. Metadata fields will be truncated to this length. Defaults to {“docno” : 20, “filename” : 512}.
meta_reverse (List[str]) – What metadata should we be able to resolve back to a docid. Defaults to [“docno”].
meta_tags (Dict[str,str]) – For collections formed using tagged data (e.g. HTML), which tags correspond to which metadata. Defaults to empty. This is useful for recording the text of documents for use in neural rankers - see Working with Document Texts.
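For example, a directory of such files can be indexed as follows (a minimal sketch; the directory path is a placeholder):
import pyterrier as pt
# collect the files to index from a placeholder directory
files = pt.io.find_files("/path/to/files")
indexer = pt.FilesIndexer("./files_index", meta={'docno': 20, 'filename': 512})
indexref = indexer.index(files)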
IterDictIndexer¶
- class pyterrier.terrier.IterDictIndexer(index_path, *, meta={'docno': 20}, text_attrs=['text'], meta_reverse=['docno'], pretokenised=False, fields=False, threads=1, **kwargs)¶
Use this Indexer if you wish to index an iter of dicts (possibly with multiple fields). This version is optimized by using multiple threads and POSIX fifos to transfer data, which ends up being much faster.
- Parameters:
index_path (str) – Directory to store index. Ignored for IndexingType.MEMORY.
meta (Dict[str,int]) – What metadata for each document to record in the index, and what length to reserve. Metadata values will be truncated to this length. Defaults to {“docno” : 20}.
text_attrs (List[str]) – List of columns of the input data that should be indexed. These are concatenated in the document representation. Defaults to [“text”].
meta_reverse (List[str]) – What metadata should we be able to resolve back to a docid. Defaults to [“docno”].
pretokenised (bool) – Whether to index pre-tokenised text, e.g., produced by a learned sparse encoder. If True, text_attrs will be ignored, and instead the dictionary contained in the toks column will be indexed.
fields (bool) – Whether a fields-indexer should be used, i.e. whether the frequency in each attribute should be recorded separately in the Terrier index. This allows application of weighting models such as BM25F.
threads (int) – Number of threads to use for indexing. Defaults to 1.
kwargs – Additional keyword arguments passed to TerrierIndexer.
- index(it, fields=None)¶
Index the specified iter of dicts with the (optional) specified fields
- Parameters:
it (iter[dict]) – an iter of document dicts to be indexed
Examples using IterDictIndexer
An iterdict can just be a list of dictionaries:
docs = [ { 'docno' : 'doc1', 'text' : 'a b c' } ]
iter_indexer = pt.IterDictIndexer("./index", meta={'docno': 20, 'text': 4096})
indexref1 = iter_indexer.index(docs)
A dataframe can also be used, by virtue of its .to_dict() method:
df = pd.DataFrame([['doc1', 'a b c']], columns=['docno', 'text'])
iter_indexer = pt.IterDictIndexer("./index")
indexref2 = iter_indexer.index(df.to_dict(orient="records"))
However, the main power of using IterDictIndexer is for processing indefinite iterables, such as those returned by generator functions. For example, the tsv file of the MSMARCO Passage Ranking corpus can be indexed as follows:
dataset = pt.get_dataset("trec-deep-learning-passages")
def msmarco_generate():
    with pt.io.autoopen(dataset.get_corpus()[0], 'rt') as corpusfile:
        for l in corpusfile:
            # strip the trailing newline before splitting docno from passage
            docno, passage = l.rstrip("\n").split("\t")
            yield {'docno' : docno, 'text' : passage}
iter_indexer = pt.IterDictIndexer("./passage_index", meta={'docno': 20, 'text': 4096})
indexref3 = iter_indexer.index(msmarco_generate())
IterDictIndexer can be used in connection with Indexing Pipelines.
Similarly, indexing JSONL files takes only a few lines of Python:
def iter_file(filename):
    import json
    with open(filename, 'rt') as file:
        for l in file:
            # assumes that each line contains 'docno', 'text' attributes
            # yields a dictionary for each json line
            yield json.loads(l)
indexref4 = pt.IterDictIndexer("./index", meta={'docno': 20, 'text': 4096}).index(iter_file("/path/to/file.jsonl"))
NB: Use pt.io.autoopen() as a drop-in replacement for open() that supports files compressed by gzip etc.
Indexing TREC-formatted files using IterDictIndexer
If you have TREC-formatted files that you wish to use with an IterDictIndexer-like indexer, pt.index.treccollection2textgen() can be used as a helper function to aid in parsing such files.
- pyterrier.terrier.treccollection2textgen(files, meta=['docno'], meta_tags={'text': 'ELSE'}, verbose=False, num_docs=None, tag_text_length=4096)[source]¶
Creates a generator of dictionaries on parsing TREC formatted files. This is useful for parsing TREC-formatted corpora in indexers like IterDictIndexer, or similar indexers in other plugins (e.g. ColBERTIndexer).
- Parameters:
files – list of files to parse in TREC format.
meta – list of attributes to expose in the dictionaries as metadata.
meta_tags – mapping of TREC tags as metadata.
tag_text_length – maximum length of metadata. Defaults to 4096.
verbose – set to True to show a TQDM progress bar. Defaults to False.
num_docs – a hint for TQDM to size the progress bar based on document counts rather than file count.
Example:
files = pt.io.find_files("/path/to/Disk45")
gen = pt.index.treccollection2textgen(files)
index = pt.IterDictIndexer("./index45").index(gen)
Example using Indexing Pipelines:
files = pt.io.find_files("/path/to/Disk45")
gen = pt.index.treccollection2textgen(files)
indexer = pt.text.sliding() >> pt.IterDictIndexer("./index45")
index = indexer.index(gen)
Threading
On UNIX-based systems, IterDictIndexer can also perform multi-threaded indexing:
iter_indexer = pt.IterDictIndexer("./passage_index_8", meta={'docno': 20, 'text': 4096}, threads=8)
indexref6 = iter_indexer.index(msmarco_generate())
Note that the resulting index ordering with multiple threads is non-deterministic; if you need deterministic behavior you must index in single-threaded mode. Furthermore, indexing can only go as quickly as the document iterator, so to take full advantage of multi-threaded indexing, you will need to keep the iterator function light-weight. Many datasets provide a fast corpus iteration function (get_corpus_iter()); see the Importing Datasets documentation for more information.
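For example, such a corpus iterator can be passed directly to IterDictIndexer (a sketch; the dataset name and thread count are illustrative):
dataset = pt.get_dataset('irds:vaswani')
iter_indexer = pt.IterDictIndexer('./vaswani_index', threads=4)
indexref = iter_indexer.index(dataset.get_corpus_iter())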
Indexing Configuration¶
Our aim is to expose all conventional Terrier indexing configuration through PyTerrier, for instance as constructor arguments to the Indexer classes. However, as Terrier is a legacy platform, some changes will take time to integrate into Terrier. Moreover, the manner of the configuration needed varies a little between the Indexer classes. In the following, we list common indexing configurations, and how to apply them when indexing using PyTerrier, noting any differences between the Indexer classes.
Choice of Indexer
Terrier has three different types of indexer. The choice of indexer is exposed using the type kwarg to the indexer class. The indexer type can be set using the IndexingType enum.
- class pyterrier.terrier.IndexingType(value)[source]¶
This enum is used to determine the type of index built by Terrier. The default is CLASSIC. For more information, see the relevant Terrier indexer and realtime documentation.
- CLASSIC = 1¶
A classical indexing regime, which also creates a direct index structure, useful for query expansion
- SINGLEPASS = 2¶
A single-pass indexing regime, which builds an inverted index directly. No direct index structure is created. This is typically faster than classical indexing.
- MEMORY = 3¶
An in-memory index. No persistent index is created.
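For instance, here is a sketch of selecting the single-pass and in-memory regimes (the paths are illustrative; the enum is referenced via pt.terrier.IndexingType, per the class name above):
# single-pass indexing - faster, but no direct index for query expansion
indexer = pt.IterDictIndexer("./index_sp", type=pt.terrier.IndexingType.SINGLEPASS)
# in-memory indexing - the index_path is ignored
mem_indexer = pt.IterDictIndexer("./ignored", type=pt.terrier.IndexingType.MEMORY)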
Stemming configuration or stopwords
The default Terrier indexing configuration is to apply an English stopword list, and Porter’s stemmer. You can configure this using the stemmer and stopwords kwargs for the various indexers:
indexer = pt.IterDictIndexer("./index", stemmer='SpanishSnowballStemmer', stopwords=None)
- class pyterrier.terrier.TerrierStemmer(value)[source]¶
This enum provides an API for the stemmers available in Terrier. The stemming configuration is saved in the index and loaded at retrieval time. Snowball stemmers for various languages are available in Terrier.
It can also be used to access the stemmer:
stemmer = pt.TerrierStemmer.porter
stemmed_word = stemmer.stem('abandoned')
- none = 'none'¶
Apply no stemming
- porter = 'porter'¶
Apply Porter’s English stemmer
- weakporter = 'weakporter'¶
Apply a weak version of Porter’s English stemmer
- danish = 'danish'¶
Snowball Danish stemmer
- finnish = 'finnish'¶
Snowball Finnish stemmer
- german = 'german'¶
Snowball German stemmer
- hungarian = 'hungarian'¶
Snowball Hungarian stemmer
- norwegian = 'norwegian'¶
Snowball Norwegian stemmer
- portugese = 'portugese'¶
Snowball Portuguese stemmer
- swedish = 'swedish'¶
Snowball Swedish stemmer
- turkish = 'turkish'¶
Snowball Turkish stemmer
See also the org.terrier.terms package for a list of the available term pipeline objects provided by Terrier.
Similarly, the use of Terrier’s English stopword list can be disabled using the stopwords kwarg.
- class pyterrier.terrier.TerrierStopwords(value)[source]¶
This enum provides an API for the stopword configuration used during indexing with Terrier
- none = 'none'¶
No Stopwords
- terrier = 'terrier'¶
Apply Terrier’s standard stopword list
- custom = 'custom'¶
Apply PyTerrierCustomStopwordList.Indexing for indexing, and PyTerrierCustomStopwordList.Retrieval for retrieval
A custom stopword list can be set by setting the stopwords kwarg to a list of words:
indexer = pt.IterDictIndexer("./index", stopwords=['a', 'an', 'the'])
Languages and Tokenisation
Similarly, the choice of tokeniser can be controlled in the indexer constructor using the tokeniser kwarg. EnglishTokeniser is the default tokeniser. Other tokenisers are listed in the org.terrier.indexing.tokenisation package. For instance, it is common to use UTFTokeniser when indexing non-English text:
indexer = pt.IterDictIndexer("./index", stemmer=None, stopwords=None, tokeniser="UTFTokeniser")
- class pyterrier.terrier.TerrierTokeniser(value)[source]¶
This enum provides an API for the tokeniser configuration used during indexing with Terrier.
- whitespace = 'whitespace'¶
Tokenise on whitespace only
- english = 'english'¶
Terrier’s standard tokeniser, designed for English
- utf = 'utf'¶
A variant of Terrier’s standard tokeniser, similar to English, but with UTF support.
- twitter = 'twitter'¶
Like utf, but keeps hashtags etc
- identity = 'identity'¶
Performs no tokenisation - strings are kept as is.
Positions (aka blocks)
All indexer classes expose a blocks boolean constructor argument to allow position information to be recorded in the index. This defaults to False, i.e. positions are not recorded.
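For instance (a sketch; the path is illustrative), a positional index can be created as follows, permitting phrase and proximity query operators at retrieval time:
# record term positions in the index
indexer = pt.IterDictIndexer("./index_blocks", blocks=True)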
Fields
Fields refers to storing the frequency of a term’s occurrences in different parts of a document, e.g. title vs. body vs. anchor text.
IterDictIndexer can be configured to record fields by passing fields=True to the constructor. For instance, if we have two different fields in a document:
docs = [ {'docno' : 'd1', 'title': 'This is the title', 'text' : 'This is the main document'} ]
indexref = pt.IterDictIndexer("./index_fields", text_attrs=['text', 'title'], fields=True).index(docs)
index = pt.IndexFactory.of(indexref)
print(index.getCollectionStatistics().getNumberOfFields()) # will print 2
# make a BM25F retriever that places twice as much weight on the title as on the body
bm25 = pt.terrier.Retriever(index, wmodel='BM25F', controls={'w.0' : 1, 'w.1' : 2, 'c.0' : 0.75, 'c.1' : 0.5})
See the Terrier indexing documentation on fields for more information.
NB: Since PyTerrier 0.13, IterDictIndexer no longer records fields by default. This speeds up indexing and retrieval when field-based models such as BM25F are not required.
Changing the tags parsed by TREC Collection
Use the relevant properties listed in the Terrier indexing documentation.
MetaIndex configuration
Metadata refers to the arbitrary strings associated with each document recorded in a Terrier index. These can range from the “docno” attribute of each document, as used to support experimentation, to other attributes such as the URL of the documents, or even the raw text of the document. Indeed, storing the raw text of each document is a trick often used when applying additional re-rankers such as BERT (see pyterrier_bert for more information on integrating PyTerrier with BERT-based re-rankers). Indexers now expose meta and meta_tags constructor kwargs to make this easier.
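For instance (a sketch; the reserved lengths are illustrative), the raw text of each document can be stored alongside its docno:
# store the raw document text in the MetaIndex for later re-ranking
indexer = pt.IterDictIndexer("./index_with_text", meta={'docno': 20, 'text': 4096})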
Reverse MetaIndex configuration
On occasion, there is a need to look up documents in a Terrier index based on their metadata, e.g. “docno”. The meta_reverse constructor kwarg allows meta keys that support reverse lookup to be specified.
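For instance, a sketch of a reverse lookup (assuming docs is an iterable of dicts with a 'docno' attribute, as in the earlier examples):
indexer = pt.IterDictIndexer("./index_rev", meta={'docno': 20}, meta_reverse=['docno'])
indexref = indexer.index(docs)
index = pt.IndexFactory.of(indexref)
# resolve a docno back to Terrier's internal integer docid
docid = index.getMetaIndex().getDocument("docno", "d1")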
Pretokenised¶
Sometimes you want more fine-grained control over the tokenisation directly within PyTerrier. In this case, each document to be indexed can contain a dictionary of pre-tokenised terms and their counts. This works if the pretokenised flag is set to True at indexing time:
iter_indexer = pt.IterDictIndexer("./pretokindex", meta={'docno': 20}, threads=1, pretokenised=True)
indexref6 = iter_indexer.index([
    {'docno' : 'd1', 'toks' : {'a' : 1, '##2' : 2}},
    {'docno' : 'd2', 'toks' : {'a' : 2, '##2' : 1}}
])
This allows tokenisation using, for instance, the HuggingFace tokenizers:
iter_indexer = pt.IterDictIndexer("./pretokindex", meta={'docno': 20}, pretokenised=True)
from transformers import AutoTokenizer
from collections import Counter
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# This creates a new column called 'toks', where each row contains
# a dictionary of the BERT WordPiece tokens of the 'text' column.
# This simple example tokenises one row at a time; it could be
# made more efficient by using the batching support of the tokeniser.
token_row_apply = pt.apply.toks(lambda row: Counter(tok.tokenize(row['text'])))
index_pipe = token_row_apply >> iter_indexer
indexref = index_pipe.index([
    {'docno' : 'd1', 'text' : 'do goldfish grow?'},
    {'docno' : 'd2', 'text' : ''}
])
At retrieval time, WordPieces that contain special characters (e.g. ‘##w’, ‘[SEP]’) need to be encoded so as to avoid Terrier’s tokeniser. We use pt.rewrite.tokenise() to apply a tokeniser to the query, setting matchop to True, such that pt.terrier.Retriever.matchop() is called to ensure that rewritten query terms are properly encoded:
br = pt.terrier.Retriever(indexref)
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
query_toks = pt.rewrite.tokenise(tok.tokenize, matchop=True)
retr_pipe = query_toks >> br
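The resulting pipeline can then be used for retrieval as normal (the query text is illustrative):
results = retr_pipe.search("do goldfish grow")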