Building in PyTerrier Support for Indexing and Retrieval Backends¶

Aim: To provide guidance on how to make indexing and retrieval backends available through PyTerrier.

Motivations¶

The PyTerrier ecosystem allows you to use your indexer and retriever with state-of-the-art plugins.

For instance, to index using doc2query (with pyterrier_doc2query) on a collection of your choice:

import pyterrier_doc2query
doc2query = pyterrier_doc2query.Doc2Query(append=True)
pipeline = doc2query >> MyIndexer("./my_index") # "./my_index" is a placeholder index path
pipeline.index(corpus)

Or re-rank the results of your system using monoT5:

from pyterrier_t5 import MonoT5ReRanker, DuoT5ReRanker
monoT5 = MonoT5ReRanker() # loads castorini/monot5-base-msmarco by default
duoT5 = DuoT5ReRanker() # loads castorini/duot5-base-msmarco by default

# "./my_index" is a placeholder index path
mono_pipeline = MyIndex("./my_index").bm25() >> pt.text.get_text(dataset, "text") >> monoT5
# apply a rank cutoff of 50 from monoT5, since duoT5 is costly to run over long result lists
duo_pipeline = mono_pipeline % 50 >> duoT5

And evaluate these pipelines using pt.Experiment:

from pyterrier.measures import MRR

pt.Experiment(
  [mono_pipeline % 50, duo_pipeline],
  dataset.get_topics(),
  dataset.get_qrels(),
  [MRR@10, "mrt"]
)

Quickstart¶

PyTerrier has two formats for expressing the data inputs and outputs of Transformers. The first is Pandas dataframes, while the second is an iterable of dicts. In both cases, the expected columns are as defined in the PyTerrier Data Model. In the following, we mainly focus on the DataFrame format, but the iter-dict format is equally supported.

For retrieval, at the very least you need to define a function that takes a dataframe with two columns ['qid', 'query'], and returns a dataframe with the following columns: ['qid', 'docno', 'score']. This can be made into a pt.Transformer instance using a pt.apply wrapper, or by directly extending pt.Transformer and naming your function as transform(). The types of the qid and docno columns are strings.
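
To make the retrieval contract concrete, the following is a minimal sketch in the iter-dict format that uses only the standard library. The toy corpus, the overlap_retrieve name and the term-overlap scoring are assumptions made for illustration and are not part of PyTerrier; a real backend would call its own search routine here and then be wrapped as a pt.Transformer as described above.

```python
from typing import Dict, Iterable, List

# Hypothetical toy corpus: docno (string) -> text
CORPUS = {"d1": "hello there", "d2": "nice to meet you"}

def overlap_retrieve(queries: Iterable[Dict[str, str]]) -> List[Dict[str, object]]:
    """Score each document by the number of query terms it contains."""
    results = []
    for q in queries:
        q_terms = q["query"].lower().split()
        scored = []
        for docno, text in CORPUS.items():
            doc_terms = set(text.lower().split())
            score = float(sum(1 for t in q_terms if t in doc_terms))
            if score > 0:
                scored.append((docno, score))
        # highest score first; ties broken by docno for determinism
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        for docno, score in scored:
            # qid and docno stay as strings, score is a float
            results.append({"qid": str(q["qid"]), "docno": docno, "score": score})
    return results

res = overlap_retrieve([{"qid": "q1", "query": "hello there"}])
```

Each returned dict carries exactly the three required columns; documents with no matching terms are simply omitted from the results.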

For indexing, you need to make a class that inherits from pt.Indexer with an index() function that consumes an iterator of dicts, where each dict contains information for one document, for instance:

[
  { 'docno' : 'd1', 'text' : 'Hello there'},
  { 'docno' : 'd2', 'text' : 'Nice to meet you'}
]
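
Because index() consumes an iterator, documents can be streamed rather than loaded into memory all at once. A minimal sketch, assuming a hypothetical tab-separated input with one docno and text pair per line (the iter_tsv_docs helper is illustrative and not a PyTerrier function):

```python
import io
from typing import Dict, Iterator, TextIO

def iter_tsv_docs(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one {'docno', 'text'} dict per non-empty line."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        docno, text = line.split("\t", 1)
        yield {"docno": str(docno), "text": text}

# io.StringIO stands in for an open file
sample = io.StringIO("d1\tHello there\nd2\tNice to meet you\n")
docs = list(iter_tsv_docs(sample))
```

A generator like this could be passed directly to an indexer's index() method without materialising the list first.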

If your indexer is for learned sparse retrieval, the documents are typically pre-tokenised, as follows:

[
  { 'docno' : 'd1', 'toks' : {'hello' : 1, 'there' : 1 }},
  { 'docno' : 'd2', 'toks' : {'nice' : 1, 'to' : 1, 'meet' : 1, 'you' : 1 }}
]
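
The toks dictionaries map each token to its weight. As a rough illustration only, the sketch below derives them from raw text using whitespace tokenisation and term counts; a learned sparse model would instead produce its own tokens and weights. The to_toks helper is hypothetical.

```python
from collections import Counter
from typing import Dict

def to_toks(doc: Dict[str, str]) -> Dict[str, object]:
    """Replace 'text' with a 'toks' dict of token -> count."""
    counts = Counter(doc["text"].lower().split())
    return {"docno": doc["docno"], "toks": dict(counts)}

pretok = to_toks({"docno": "d1", "text": "Hello there"})
```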

For such “pre-tokenised” settings, the query dataframe at retrieval time is expected to have qid and query_toks columns, where query_toks has the same format as the toks column used for indexing (although float query weights are typically supported here):

[
  { 'qid' : 'q1', 'query' : 'hello hello there', 'query_toks' : {'hello' : 2.0, 'there' : 1.0 }},
]
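
One common way for such a backend to score a document is the dot product of the query and document token weights. The sketch below assumes that scoring rule and a hypothetical dot_score helper; it is not taken from any particular plugin.

```python
from typing import Dict

def dot_score(query_toks: Dict[str, float], doc_toks: Dict[str, float]) -> float:
    """Sum of query weight * document weight over shared tokens."""
    return sum(w * doc_toks[t] for t, w in query_toks.items() if t in doc_toks)

query = {"qid": "q1", "query_toks": {"hello": 2.0, "there": 1.0}}
doc = {"docno": "d1", "toks": {"hello": 1, "there": 1}}
score = dot_score(query["query_toks"], doc["toks"])
```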

Examples of learned sparse integrations are available at pyt_splade.

Note that PyTerrier assumes docnos are strings. If you internally use an integer-based scheme, your indexer and retrieval classes should record an id->docno mapping file. Many of our PyTerrier plugins use the npids package for this.
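
The idea behind such a mapping can be sketched with the standard library alone. This is a plain-text stand-in written for illustration (one docno per line, with the line position acting as the internal id), not the npids API; the helper names are hypothetical.

```python
import os
import tempfile
from typing import Iterable, List

def write_docnos(path: str, docnos: Iterable[str]) -> None:
    """Write one docno per line; the line index is the internal integer id."""
    with open(path, "w", encoding="utf-8") as f:
        for docno in docnos:
            f.write(docno + "\n")

def read_docnos(path: str) -> List[str]:
    """Load the mapping back so that mapping[internal_id] == docno."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "docnos.txt")
    write_docnos(path, ["d1", "d2"])
    mapping = read_docnos(path)
```

At retrieval time, the backend's integer ids would be translated through the mapping before the results are returned, so that the docno column always contains strings.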

Example implementation¶

import pyterrier as pt
from collections.abc import Iterable
import pandas as pd

class MyIndexer(pt.Indexer):

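  def __init__(self, indexpath : str):
    # remember where the index should be written
    self.indexpath = indexpath

  def index(self, iterdict : Iterable[dict]):
    """
      Consume the documents in the iterator, assuming that each dict has keys
      docno (string) and text (string)
    """
    for doc in iterdict:
      pass # add doc['docno'] and doc['text'] to your index here
    return MyIndex(self.indexpath) # return your Index class here.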

  def index_inputs(self) -> list[list[str]]: # Optional: this helps with inspection/schematics
    return [['docno', 'text']]


class MyIndex:
  """
  The index class is used as a factory to allow easy access to different retriever implementations
  """

  def __init__(self, indexpath : str):
    # open your index, initialise etc
    pass

  def bm25(self) -> pt.Transformer:
    def _retr_fn(single_query_df : pd.DataFrame) -> pd.DataFrame:
      # check it has both qid and query columns - using pt.validate for easier inspection
      pt.validate.query_frame(single_query_df, extra_columns=["query"])

      qid = single_query_df.iloc[0]["qid"]
      query = single_query_df.iloc[0]["query"]
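      # populate a results dataframe with columns ['qid', 'docno', 'score'];
      # the single row below is a placeholder for your retriever's output
      results = pd.DataFrame([[qid, "d1", 1.0]], columns=["qid", "docno", "score"])
      return pt.model.add_ranks(results) # adds rank column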

    return pt.apply.by_query(_retr_fn)


# NB: You can merge these two classes into a single one. PyTerrier DR and PyTerrier PISA both use this scheme.

Optionally, your MyIndex class can extend pt.Artifact; this allows your index to be easily shared as an Artifact on Hugging Face, Zenodo, etc.

Other Examples:¶

PyTerrier ColBERT: ColBERTFactory is the “MyIndex” class; ColBERTIndexer is the indexer.
PyTerrier DR: Index classes are both indexers and retrievers (and also pt.Artifact), e.g. NumpyIndex.
PyTerrier PISA: PisaIndex is the pt.Indexer and the index factory (and also pt.Artifact).