Terrier Indexing
----------------

Indexing Basics
==================================

.. related:: pyterrier.terrier.TerrierIndex.indexer
.. related:: pyterrier_doc2query.Doc2Query

Common indexing options in Terrier are exposed through :meth:`TerrierIndex.indexer <pyterrier.terrier.TerrierIndex.indexer>`.
Its parameters let you adjust various aspects of the indexing process, such as tokenisation, stemming,
position indexing, and more:

.. code-block:: python
    :caption: Creating an indexer from a TerrierIndex

    import pyterrier as pt
    my_index = pt.terrier.TerrierIndex("my_index.terrier")
    indexer = my_index.indexer(tokeniser='twitter', store_positions=True) # :footnote: See :meth:`TerrierIndex.indexer <pyterrier.terrier.TerrierIndex.indexer>` for full list of parameters.
    indexer.index([
        {"docno" : "tweet1", "text" : "This is a tweet! #pyterrier"},
        {"docno" : "tweet2", "text" : "Another tweet, with a link: https://example.com"}
    ])

Note that Terrier indexes do not support adding additional documents after the initial indexing process.

You do not need to load all documents into memory at once when indexing. Indexers support any "iterable",
including generators that yield one document at a time. Here is an example of indexing documents from a
generator function:

.. code-block:: python
    :caption: Indexing from a generator function

    def generate_docs():
        # enumerate() supplies a running line number to use as the docno
        for i, line in enumerate(open('my_humongous_collection.txt')):
            yield {"docno" : str(i), "text" : line.strip()}

    indexer.index(generate_docs())

The indexer can function as the final stage of an :ref:`indexing pipeline <indexing_pipelines>`. For example,
the following pipeline expands documents with :class:`~pyterrier_doc2query.Doc2Query` before indexing them:

.. schematic::
    :show_code:

    import pyterrier as pt
    from pyterrier_doc2query import Doc2Query
    index = pt.terrier.TerrierIndex("my_index.terrier")
    pipeline = Doc2Query(append=True) >> index.indexer()

Pretokenisation
======================================

.. related:: pyterrier.terrier.TerrierIndex.toks_indexer

Sometimes you want more fine-grained control over the tokenisation directly within PyTerrier. In this case,
each document to be indexed can contain a dictionary of pre-tokenised terms and their counts in the ``toks`` column:

.. code-block:: python
    :caption: Indexing pre-tokenised text

    my_index = pt.terrier.TerrierIndex('my_index.terrier')
    my_index.index([
        {'docno' : 'd1', 'toks' : {'a' : 1, '##2' : 2}},
        {'docno' : 'd2', 'toks' : {'a' : 2, '##2' : 1}}
    ])
    # or
    indexer = my_index.toks_indexer() # :footnote: Using :meth:`~pyterrier.terrier.TerrierIndex.toks_indexer` lets you configure settings about the indexing process.
    indexer.index([
        {'docno' : 'd1', 'toks' : {'a' : 1, '##2' : 2}},
        {'docno' : 'd2', 'toks' : {'a' : 2, '##2' : 1}}
    ])

.. note::
    When supplying pre-tokenised text during indexing, Terrier bypasses its tokeniser, stemmer, and stopword removal.
    You will need to apply the same tokenisation rules at retrieval time, for instance using
    :meth:`pt.rewrite.tokenise <pyterrier.rewrite.tokenise>`.

This allows tokenisation using, for instance, the `HuggingFace tokenizers `_::

    from transformers import AutoTokenizer
    from collections import Counter

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")

    # This creates a new column called 'toks', where each row contains
    # a dictionary of the BERT WordPiece tokens of the 'text' column.
    # This simple example tokenises one row at a time; it could be
    # made more efficient by using the batching support in the tokeniser.
    token_row_apply = pt.apply.toks(lambda row: Counter(tok.tokenize(row['text'])))

    my_index = pt.terrier.TerrierIndex('my_index.terrier')
    index_pipe = token_row_apply >> my_index
    index_pipe.index([
        {'docno' : 'd1', 'text' : 'do goldfish grow?'},
        {'docno' : 'd2', 'text' : ''}
    ])

At retrieval time, WordPieces that contain special characters (e.g. ``'##w'``, ``'[SEP]'``) need to be encoded so as to avoid Terrier's tokeniser.
We use ``pt.rewrite.tokenise()`` to apply a tokeniser to the query, setting ``matchop`` to True, such that
``pt.terrier.Retriever.matchop()`` is called to ensure that rewritten query terms are properly encoded::

    br = pt.terrier.Retriever(indexref)
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    query_toks = pt.rewrite.tokenise(tok.tokenize, matchop=True)
    retr_pipe = query_toks >> br

What's in a Terrier index?
===================================

.. related:: pyterrier.terrier.TerrierIndex.collection_statistics
.. related:: pyterrier.terrier.TerrierIndex.lexicon
.. related:: pyterrier.terrier.TerrierIndex.inverted_index
.. related:: pyterrier.terrier.TerrierIndex.document_index
.. related:: pyterrier.terrier.TerrierIndex.meta_index
.. related:: pyterrier.terrier.TerrierIndex.direct_index

A Terrier index contains several data structures. These structures provide low-level API access to the indexed data.
The data structures that can be present in a Terrier index are:

**Collection Statistics**
    :meth:`TerrierIndex.collection_statistics() <pyterrier.terrier.TerrierIndex.collection_statistics>` provides
    global statistics of the index, such as the number of documents, number of terms, etc.

**Lexicon**
    :meth:`TerrierIndex.lexicon() <pyterrier.terrier.TerrierIndex.lexicon>` provides an entry for each unique term
    in the index, which contains the corresponding statistics of each term (frequency etc), and a pointer to the
    inverted index posting list for that term.

**Inverted Index**
    :meth:`TerrierIndex.inverted_index() <pyterrier.terrier.TerrierIndex.inverted_index>` provides access to the
    posting list for each term, which records the documents that a given term appears in, and with what frequency
    for each document.

**Document Index**
    :meth:`TerrierIndex.document_index() <pyterrier.terrier.TerrierIndex.document_index>` provides access to the
    length of the document (and other field lengths).

**Meta Index**
    :meth:`TerrierIndex.meta_index() <pyterrier.terrier.TerrierIndex.meta_index>` provides access to document
    metadata, such as the ``docno``, and optionally the raw text and the URL of each document.
**Direct Index** (*Forward Index*)
    :meth:`TerrierIndex.direct_index() <pyterrier.terrier.TerrierIndex.direct_index>` provides a posting list for
    each document, detailing which terms occur in that document and with which frequency. The presence of the
    direct index depends on the ``IndexingType`` that has been applied: single-pass and some memory indices do not
    provide a direct index.

Advanced: IterDictIndexer
========================================

.. related:: pyterrier.terrier.IterDictIndexer

:class:`~pyterrier.terrier.IterDictIndexer` is a flexible Terrier indexer implementation. It is returned from
:meth:`TerrierIndex.indexer() <pyterrier.terrier.TerrierIndex.indexer>`, but can be constructed manually if
additional functionality is required.

**Examples using IterDictIndexer**

An iterdict can just be a list of dictionaries::

    docs = [ { 'docno' : 'doc1', 'text' : 'a b c' } ]
    iter_indexer = pt.IterDictIndexer("./index", meta={'docno': 20, 'text': 4096})
    indexref1 = iter_indexer.index(docs)

A dataframe can also be used, by virtue of its ``.to_dict()`` method::

    import pandas as pd

    df = pd.DataFrame([['doc1', 'a b c']], columns=['docno', 'text'])
    iter_indexer = pt.IterDictIndexer("./index")
    indexref2 = iter_indexer.index(df.to_dict(orient="records"))

However, the main power of using IterDictIndexer is for processing indefinite iterables, such as those returned by
generator functions. For example, the tsv file of the MSMARCO Passage Ranking corpus can be indexed as follows::

    dataset = pt.get_dataset("trec-deep-learning-passages")
    def msmarco_generate():
        with pt.io.autoopen(dataset.get_corpus()[0], 'rt') as corpusfile:
            for l in corpusfile:
                docno, passage = l.split("\t")
                yield {'docno' : docno, 'text' : passage}

    iter_indexer = pt.IterDictIndexer("./passage_index", meta={'docno': 20, 'text': 4096})
    indexref3 = iter_indexer.index(msmarco_generate())

IterDictIndexer can be used in connection with :ref:`indexing_pipelines`.

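The generator pattern can be tried out without downloading a corpus. The following is a minimal sketch, not part of PyTerrier: ``tsv_docs`` is a hypothetical helper that lazily turns tab-separated lines into the ``docno``/``text`` dictionaries that an indexer consumes. Splitting on the first tab only, stripping the trailing newline, and skipping blank lines are choices made for this sketch.

```python
import io
from typing import Dict, Iterable, Iterator


def tsv_docs(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Lazily yield one {'docno', 'text'} dict per tab-separated line (hypothetical helper)."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # split on the first tab only, so tabs inside the passage are preserved
        docno, _, text = line.partition("\t")
        yield {"docno": docno, "text": text}


# An in-memory stand-in for a large TSV corpus file
sample = io.StringIO("0\tfirst passage\n1\tsecond passage\twith a tab\n")
docs = list(tsv_docs(sample))
print(docs)
```

Because the helper is a generator, only one line is held in memory at a time; its output could be passed to ``iter_indexer.index(...)`` in the same way as ``msmarco_generate()``.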
Indexing JSONL files likewise takes only a few lines of Python::

    def iter_file(filename):
        import json
        with open(filename, 'rt') as file:
            for l in file:
                # assumes that each line contains 'docno', 'text' attributes
                # yields a dictionary for each json line
                yield json.loads(l)

    indexref4 = pt.IterDictIndexer("./index", meta={'docno': 20, 'text': 4096}).index(iter_file("/path/to/file.jsonl"))

NB: Use ``pt.io.autoopen()`` as a drop-in replacement for ``open()`` that supports files compressed by gzip etc.

**Indexing TREC-formatted files using IterDictIndexer**

If you have TREC-formatted files that you wish to use with an IterDictIndexer-like indexer,
:func:`~pyterrier.index.treccollection2textgen` can be used as a helper function to aid in parsing such files.

Example using Indexing Pipelines::

    files = pt.io.find_files("/path/to/Disk45")
    gen = pt.index.treccollection2textgen(files)
    indexer = pt.text.sliding() >> pt.IterDictIndexer("./index45")
    index = indexer.index(gen)

**Threading**

On UNIX-based systems, IterDictIndexer can also perform multi-threaded indexing::

    iter_indexer = pt.IterDictIndexer("./passage_index_8", meta={'docno': 20, 'text': 4096}, threads=8)
    indexref6 = iter_indexer.index(msmarco_generate())

Note that the resulting index ordering with multiple threads is non-deterministic; if you need deterministic
behavior you must index in single-threaded mode. Furthermore, indexing can only go as quickly as the document
iterator, so to take full advantage of multi-threaded indexing, you will need to keep the iterator function
light-weight. Many datasets provide a fast corpus iteration function (``get_corpus_iter()``); see :ref:`datasets`
for more information.

Advanced: Specialized Indexers
============================================================

.. related:: pyterrier.terrier.TRECCollectionIndexer
.. related:: pyterrier.terrier.FilesIndexer

In most cases, you will want to use :class:`~pyterrier.terrier.IterDictIndexer` (e.g., using
:meth:`TerrierIndex.indexer() <pyterrier.terrier.TerrierIndex.indexer>`). However, several specialized indexers
are available for specific use-cases.

- :class:`~pyterrier.terrier.TRECCollectionIndexer` lets you index TREC-formatted collections by passing in a list
  of file paths. For example:

  .. code-block:: python
      :caption: Indexing a TREC collection using TRECCollectionIndexer

      import pyterrier as pt
      # list of filenames to index
      files = pt.io.find_files("/path/to/WT2G/wt2g-corpus/")

      # build the index
      indexer = pt.TRECCollectionIndexer("./wt2g_index", verbose=True, blocks=False)
      indexref = indexer.index(files)

      # load the index, print the statistics
      index = pt.IndexFactory.of(indexref)
      print(index.getCollectionStatistics().toString())

- :class:`~pyterrier.terrier.FilesIndexer` lets you index a list of files in various formats such as pdf, docx,
  and txt.

.. warning::
    The specialized indexers in this section are not compatible with indexing pipelines.

Advanced: Indexing Configuration
===========================================

Our aim is to expose all conventional Terrier indexing configuration through PyTerrier, for instance as constructor
arguments to the Indexer classes. However, as Terrier is a legacy platform, some changes will take time to integrate
into Terrier. Moreover, the manner of the configuration needed varies a little between the Indexer classes. In the
following, we list common indexing configurations, and how to apply them when indexing using PyTerrier, noting any
differences between the Indexer classes.

**Choice of Indexer**

Terrier has three different types of indexer: classic, single-pass, and memory. The choice of indexer is exposed
using the ``type`` kwarg to the indexer class. The indexer type can be set using the ``IndexingType`` enum.

**Stemming configuration or stopwords**

The default Terrier indexing configuration is to apply an English stopword list, and Porter's stemmer.
You can configure this using the ``stemmer`` and ``stopwords`` kwargs for the various indexers::

    indexer = pt.IterDictIndexer("./index", stemmer='SpanishSnowballStemmer', stopwords=None)

See also the `org.terrier.terms `_ package for a list of the available term pipeline objects provided by Terrier.

As in the example above, Terrier's English stopword list can be disabled by setting ``stopwords=None``.

A custom stopword list can be used by setting the ``stopwords`` kwarg to a list of words::

    indexer = pt.IterDictIndexer("./index", stopwords=['a', 'an', 'the'])

**Languages and Tokenisation**

Similarly, the choice of tokeniser can be controlled in the indexer constructor using the ``tokeniser`` kwarg.
`EnglishTokeniser `_ is the default tokeniser. Other tokenisers are listed in the
`org.terrier.indexing.tokenisation `_ package. For instance, it is common to use ``UTFTokeniser`` when indexing
non-English text::

    indexer = pt.IterDictIndexer("./index", stemmer=None, stopwords=None, tokeniser="UTFTokeniser")

**Positions (aka blocks)**

All indexer classes expose a ``blocks`` boolean constructor argument that allows position information to be
recorded in the index. It defaults to False, i.e. positions are not recorded.

**Fields**

Fields refers to storing the frequency of a term's occurrence in different parts of a document, e.g. title vs. body
vs. anchor text. IterDictIndexer can be configured to record fields by passing the ``fields=True`` kwarg to the
constructor.
For instance, if we have two different fields to a document::

    docs = [ {'docno' : 'd1', 'title': 'This is the title', 'text' : 'This is the main document'} ]
    indexref = pt.IterDictIndexer("./index_fields", text_attrs=['text', 'title'], fields=True).index(docs)
    index = pt.IndexFactory.of(indexref)
    print(index.getCollectionStatistics().getNumberOfFields())
    # will print 2

    # make a BM25F retriever, places twice as much weight on the title as the main body
    bm25 = pt.terrier.Retriever(index, wmodel='BM25F', controls={'w.0' : 1, 'w.1' : 2, 'c.0' : 0.75, 'c.1' : 0.5})

See the Terrier `indexing documentation on fields `_ for more information.

NB: Since PyTerrier 0.13, IterDictIndexer no longer records fields by default. This speeds up indexing and retrieval
when field-based models such as BM25F are not required.

**Changing the tags parsed by TREC Collection**

Use the relevant properties listed in the Terrier `indexing documentation `_.

**MetaIndex configuration**

Metadata refers to the arbitrary strings associated with each document recorded in a Terrier index. These can range
from the ``"docno"`` attribute of each document, as used to support experimentation, to other attributes such as the
URL of the documents, or even the raw text of the document. Indeed, storing the raw text of each document is a trick
often used when applying additional re-rankers such as BERT (see `pyterrier_bert `_ for more information on
integrating PyTerrier with BERT-based re-rankers). Indexers expose ``meta`` and ``meta_tags`` constructor kwargs to
make this easier.

**Reverse MetaIndex configuration**

On occasion, there is a need to look up documents in a Terrier index based on their metadata, e.g. ``"docno"``. The
``meta_reverse`` constructor kwarg allows meta keys that support reverse lookup to be specified.
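The examples in this document pass ``meta`` as a dictionary such as ``{'docno': 20, 'text': 4096}``. Assuming that each value is the maximum number of characters reserved for that metadata key (an assumption of this sketch that should be checked against the indexer documentation), a pre-check like the following can flag documents whose metadata would not fit before a long indexing run starts. ``find_meta_overflows`` is a hypothetical helper written in plain Python, not a PyTerrier function.

```python
from typing import Dict, Iterable, Iterator, Tuple


def find_meta_overflows(
    docs: Iterable[Dict[str, str]], meta: Dict[str, int]
) -> Iterator[Tuple[str, str, int]]:
    """Yield (docno, key, length) for each metadata value longer than its configured limit.

    Hypothetical helper; assumes limits are counted in characters.
    """
    for doc in docs:
        for key, max_len in meta.items():
            value = doc.get(key, "")
            if len(value) > max_len:
                yield (doc.get("docno", ""), key, len(value))


# Same shape as the meta kwarg used in the examples of this document
meta = {"docno": 20, "text": 4096}
docs = [
    {"docno": "d1", "text": "a b c"},
    {"docno": "d2", "text": "x" * 5000},  # longer than the 4096 reserved for 'text'
]
overflows = list(find_meta_overflows(docs, meta))
print(overflows)
```

If the check reports overflows, one option is to raise the corresponding value in ``meta`` before building the index.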