PyTerrier Transformers

PyTerrier’s retrieval architecture is based on three concepts:

  • dataframes with pre-defined types (each with a minimum set of known attributes), as detailed in the data model.

  • the transformation of those dataframes by standard information retrieval operations, defined as transformers.

  • the composition of transformers, supported by the operators defined on transformers.

In essence, a PyTerrier transformer is a class with a transform() method, which takes a dataframe as input, changes it, and returns it.

Input    Output    Cardinality    Example            Concrete Transformer Example
Q        Q         1 to 1         Query rewriting    pt.rewrite.SDM()
Q        Q x D     1 to N         Retrieval          pt.BatchRetrieve()
Q x D    Q         N to 1         Query expansion    pt.rewrite.RM3()
Q x D    Q x D     1 to 1         Re-ranking         pt.apply.doc_score()
Q x D    Q x Df    1 to 1         Feature scoring    pt.FeaturesBatchRetrieve()
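
The cardinalities above can be illustrated with a small dependency-free sketch. It uses plain lists of dicts as a stand-in for dataframes, and the classes ToyRewriter and ToyRetriever, the scoring rule, and the tiny corpus are all hypothetical; none of them are part of PyTerrier.

```python
# Stand-in sketch: rows are dicts rather than dataframe rows.
# ToyRewriter and ToyRetriever are hypothetical, not PyTerrier classes.

class ToyRewriter:
    """Q -> Q (1 to 1): exactly one output row per input query."""

    def transform(self, queries):
        return [{**q, "query": q["query"].lower()} for q in queries]


class ToyRetriever:
    """Q -> Q x D (1 to N): several result rows per input query."""

    def __init__(self, corpus):
        self.corpus = corpus  # maps docno -> text

    def transform(self, queries):
        results = []
        for q in queries:
            terms = q["query"].split()
            for docno, text in self.corpus.items():
                # Toy score: number of query-term occurrences in the document.
                score = sum(text.split().count(t) for t in terms)
                if score > 0:
                    results.append({"qid": q["qid"], "query": q["query"],
                                    "docno": docno, "score": score})
        return results


corpus = {
    "d1": "chemical reactions",
    "d2": "chemical bonds and chemical energy",
    "d3": "history",
}
queries = [{"qid": "1", "query": "Chemical"}]

rewritten = ToyRewriter().transform(queries)         # still one row per query
results = ToyRetriever(corpus).transform(rewritten)  # one row per matching document
```

The rewriter keeps the number of rows unchanged, while the retriever fans each query out into one row per matching document.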

Optimisation

Some operators applied to transformers can be optimised by the underlying search engine - for instance, by cutting a ranking earlier. So while the following two pipelines are semantically equivalent, the latter might be more efficient:

pipe1 = pt.BatchRetrieve(index, wmodel="BM25") % 10
pipe2 = pipe1.compile()
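
As a rough illustration of why an early cutoff can be cheaper, the sketch below compares ranking every candidate and then truncating against selecting only the top k directly. This is plain Python, not how Terrier or compile() is implemented; the function names and scores are invented for the example.

```python
import heapq

# Made-up retrieval scores for five documents.
scores = {"d1": 2.5, "d2": 0.3, "d3": 4.1, "d4": 1.7, "d5": 3.3}


def rank_then_cut(scores, k):
    # Unoptimised: fully sort all candidates, then keep the first k.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


def cut_while_ranking(scores, k):
    # Optimised: only the k best candidates are tracked while scanning.
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])


top_a = rank_then_cut(scores, 2)
top_b = cut_while_ranking(scores, 2)
```

Both functions return the same top-2 list, but the second never sorts the full candidate set, which is the kind of saving a retrieval engine can make when it knows the cutoff in advance.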

Fitting

When fit() is called on a pipeline, all estimators (transformers that also have a fit() method, as specified by EstimatorBase) within the pipeline are fitted, in turn. This allows one (or more) stages of learning to be integrated into a retrieval pipeline. See Learning to Rank for examples.

When fit() is called on a composed pipeline (i.e. one created using the >> operator), it calls fit() on any estimators within that pipeline.

Transformer base classes

TransformerBase

This class is the base class for all transformers.

class pyterrier.transformer.TransformerBase
name = 'TransformerBase'

Base class for all transformers. Implements the various operators >> + * | & as well as search() for executing a single query and compile() for rewriting complex pipelines into simpler ones.

transform(topics_or_res)

Abstract method for all transformations. Typically takes as input a Pandas DataFrame, and also returns one.

transform_iter(input: Iterable[dict]) → pandas.core.frame.DataFrame

Method that processes an iter-dict by instantiating it as a dataframe and calling transform(). Returns the DataFrame returned by transform(). Used in the implementation of index() on a composed pipeline.

transform_gen(input: pandas.core.frame.DataFrame, batch_size=1, output_topics=False) → Iterator[pandas.core.frame.DataFrame]

Method for executing a transformer pipeline on smaller batches of queries. The input dataframe is grouped into batches of batch_size queries, and a generator is returned, such that transform() is only executed for one batch at a time.

Parameters
  • input (DataFrame) – a dataframe to process

  • batch_size (int) – how many input instances to execute in each batch. Defaults to 1.
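
The batching behaviour described above can be sketched as a generator. This is a simplified stand-in that works on lists of dicts rather than dataframes, and it is not the library's implementation; batched_transform and toy_transform are hypothetical names.

```python
from itertools import islice


def batched_transform(transform, rows, batch_size=1):
    """Yield transform(batch) for successive groups of batch_size queries."""
    # Distinct query ids, in order of first appearance.
    qids = list(dict.fromkeys(row["qid"] for row in rows))
    qid_iter = iter(qids)
    while True:
        batch_qids = set(islice(qid_iter, batch_size))
        if not batch_qids:
            return
        batch = [row for row in rows if row["qid"] in batch_qids]
        # transform() only ever sees one batch at a time.
        yield transform(batch)


rows = [{"qid": str(i), "query": f"query {i}"} for i in range(5)]
seen_sizes = []


def toy_transform(batch):
    seen_sizes.append(len(batch))
    return [{**row, "query": row["query"].upper()} for row in batch]


outputs = list(batched_transform(toy_transform, rows, batch_size=2))
```

With five queries and batch_size=2, the toy transform is invoked three times, on batches of two, two and one queries. Because a generator is returned, no batch is processed until the caller iterates over it.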

search(query: str, qid: str = '1', sort=True) → pandas.core.frame.DataFrame

Method for executing a transformer (pipeline) for a single query. Returns a dataframe with the results for the specified query. This is a utility method; most callers are expected to use the transform() method, passing a dataframe.

Parameters
  • query (str) – String form of the query to run

  • qid (str) – the query id to associate with this request. Defaults to '1'.

  • sort (bool) – ensures the results are sorted by rank, with the best-ranked results first (defaults to True)

Example:

import pandas as pd
import pyterrier as pt

bm25 = pt.BatchRetrieve(index, wmodel="BM25")
res = bm25.search("example query")

# is equivalent to
queryDf = pd.DataFrame([["1", "example query"]], columns=["qid", "query"])
res = bm25.transform(queryDf)

compile()

Rewrites this pipeline by applying the MatchPy rules in rewrite_rules. Pipeline optimisation is discussed in the ICTIR 2020 paper on PyTerrier.

parallel(N: int, backend='joblib')

Returns a parallelised version of this transformer. The underlying transformer must be “picklable”.

Parameters
  • N (int) – how many processes/machines to parallelise this transformer over.

  • backend (str) – which multiprocessing backend to use. Only two backends are supported, ‘joblib’ and ‘ray’. Defaults to ‘joblib’.

get_parameter(name: str)

Gets the current value of a particular key of the transformer’s configuration state. By default, this examines the attributes of the transformer object, using hasattr() and getattr().

set_parameter(name: str, value)

Adjusts this transformer’s configuration state by setting the value of a specific parameter. By default, this examines the attributes of the transformer object, using hasattr() and setattr().
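
The default attribute-based behaviour of these two methods can be sketched as follows. ToyTransformer is a hypothetical stand-in, and raising ValueError for an unknown parameter name is an assumption of this sketch rather than documented behaviour.

```python
class ToyTransformer:
    """Hypothetical stand-in mimicking attribute-based configuration."""

    def __init__(self, k=10):
        self.k = k  # an example configuration attribute

    def get_parameter(self, name):
        # Look the parameter up as a plain attribute of the object.
        if hasattr(self, name):
            return getattr(self, name)
        raise ValueError(f"Unknown parameter {name!r}")  # assumed error type

    def set_parameter(self, name, value):
        # Only existing attributes may be changed.
        if hasattr(self, name):
            setattr(self, name, value)
        else:
            raise ValueError(f"Unknown parameter {name!r}")  # assumed error type


t = ToyTransformer()
t.set_parameter("k", 100)
current_k = t.get_parameter("k")
```

Because configuration is read and written by name, parameter values can be varied programmatically without knowing the concrete class of the transformer.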

Moreover, by extending TransformerBase, all transformer implementations gain the necessary “dunder” methods (e.g. __rshift__()) to support the transformer operators (>>, + etc).

EstimatorBase

This class exposes a fit() method that can be used for transformers that can be trained.

class pyterrier.transformer.EstimatorBase

This is a base class for things that can be fitted.

fit(topics_or_res_tr, qrels_tr, topics_or_res_va, qrels_va)

Method for training the transformer.

Parameters
  • topics_or_res_tr (DataFrame) – training topics (usually with documents)

  • qrels_tr (DataFrame) – training qrels

  • topics_or_res_va (DataFrame) – validation topics (usually with documents)

  • qrels_va (DataFrame) – validation qrels

ComposedPipeline implements fit(), which applies the intermediate transformers to the specified training (and validation) topics, and passes the output to the fit() method of the final transformer.
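
A minimal sketch of this behaviour is shown below, using lists of dicts in place of dataframes. Upper, ToyEstimator and ToyPipeline are hypothetical classes written for illustration; they are not the library's ComposedPipeline.

```python
class Upper:
    """Plain transformer: has transform() but no fit()."""

    def transform(self, rows):
        return [{**r, "query": r["query"].upper()} for r in rows]


class ToyEstimator:
    """Estimator: records the training queries it was fitted on."""

    def __init__(self):
        self.trained_on = None

    def fit(self, rows_tr, qrels_tr, rows_va, qrels_va):
        self.trained_on = [r["query"] for r in rows_tr]

    def transform(self, rows):
        return rows


class ToyPipeline:
    def __init__(self, *stages):
        self.stages = stages

    def fit(self, rows_tr, qrels_tr, rows_va, qrels_va):
        # Apply every stage except the last to the training and validation
        # input, then fit the final stage on the transformed output.
        for stage in self.stages[:-1]:
            rows_tr = stage.transform(rows_tr)
            rows_va = stage.transform(rows_va)
        self.stages[-1].fit(rows_tr, qrels_tr, rows_va, qrels_va)


ltr = ToyEstimator()
pipe = ToyPipeline(Upper(), ltr)
pipe.fit([{"qid": "1", "query": "train"}], [], [{"qid": "2", "query": "valid"}], [])
```

After fitting, the estimator has seen the output of the earlier stage (the upper-cased query) rather than the raw input, which is what allows a learned stage to be trained on, for example, retrieved and feature-scored documents.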

Internal transformers

A significant number of transformers are defined in pyterrier.transformer to implement operators etc. You are not expected to use these directly, but they are documented for completeness.

Symbol    Name                Implementing transformer
>>        compose/then        ComposedPipeline
|         set-union           SetUnionTransformer
&         set-intersection    SetIntersectionTransformer
+         linear              CombSumTransformer
*         scalar-product      ScalarProductTransformer
%         rank-cutoff         RankCutoffTransformer
**        feature-union       FeatureUnionPipeline
^         concatenate         ConcatenateTransformer
~         cache               ChestCacheTransformer
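
To show how an operator symbol can map onto an implementing class, the sketch below overloads >> and % on a small base class. Base, Composed, RankCutoff, Identity and AddOne are hypothetical names for illustration, rows are plain dicts, and starting ranks at 0 is an assumption of the sketch.

```python
class Base:
    """Hypothetical base class providing two of the operators."""

    def __rshift__(self, other):
        # a >> b builds a composed pipeline
        return Composed(self, other)

    def __mod__(self, cutoff):
        # a % k builds a rank-cutoff wrapper
        return RankCutoff(self, cutoff)


class Composed(Base):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def transform(self, rows):
        return self.second.transform(self.first.transform(rows))


class RankCutoff(Base):
    def __init__(self, inner, cutoff):
        self.inner, self.cutoff = inner, cutoff

    def transform(self, rows):
        # Keep rows whose rank is below the cutoff (ranks start at 0 here).
        return [r for r in self.inner.transform(rows) if r["rank"] < self.cutoff]


class Identity(Base):
    def transform(self, rows):
        return rows


class AddOne(Base):
    def transform(self, rows):
        return [{**r, "score": r["score"] + 1} for r in rows]


pipe = (Identity() >> AddOne()) % 2
rows = [{"docno": f"d{i}", "rank": i, "score": 10 - i} for i in range(4)]
out = pipe.transform(rows)
```

Note that in Python % binds more tightly than >>, so the parentheses are needed for the cutoff to apply to the whole composed pipeline rather than only to its last stage.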

Indexing Pipelines

Transformers can be chained to create indexing pipelines. The last element in the chain is assumed to be an indexer like IterDictIndexer - it should implement an index() method like IterDictIndexerBase. For instance:

docs = [ {"docno" : "1", "text" : "a" } ]
# "./index" is a placeholder for the location where the index is written
indexer = pt.text.sliding() >> pt.IterDictIndexer("./index")
indexer.index(docs)

This is implemented by several methods:

  • The last stage of the pipeline should have an index() method that accepts an iterable of dictionaries

  • ComposedPipeline has a special index() method that breaks the input iterable into chunks (the size of chunks can be altered by a batch_size kwarg) and passes those through the intermediate pipeline stages (i.e. all but the last).

  • In the intermediate pipeline stages, the transform_iter() method is called - by default this instantiates a DataFrame from batch_size records, which is passed to transform().

  • The resulting records are passed to index() of the last pipeline stage.
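
The flow described in the list above can be sketched without any dependencies. SplitText, ListIndexer, chunked and index_pipeline are hypothetical stand-ins; a real pipeline would use PyTerrier transformers and an indexer such as IterDictIndexer.

```python
from itertools import islice


class SplitText:
    """Intermediate stage: emits one record per word of each document."""

    def transform_iter(self, records):
        return [
            {"docno": r["docno"] + "_" + str(i), "text": word}
            for r in records
            for i, word in enumerate(r["text"].split())
        ]


class ListIndexer:
    """Final stage: 'indexes' records by collecting them in a list."""

    def __init__(self):
        self.indexed = []

    def index(self, records):
        self.indexed.extend(records)
        return self.indexed


def chunked(iterable, batch_size):
    # Break any iterable into lists of at most batch_size items.
    it = iter(iterable)
    while chunk := list(islice(it, batch_size)):
        yield chunk


def index_pipeline(stages, docs, batch_size=2):
    *intermediate, indexer = stages

    def processed():
        for chunk in chunked(docs, batch_size):
            # Pass each chunk through all stages but the last.
            for stage in intermediate:
                chunk = stage.transform_iter(chunk)
            yield from chunk

    # The last stage consumes the transformed records lazily.
    return indexer.index(processed())


docs = [
    {"docno": "1", "text": "a b"},
    {"docno": "2", "text": "c"},
    {"docno": "3", "text": "d e f"},
]
result = index_pipeline([SplitText(), ListIndexer()], docs, batch_size=2)
```

Because the documents are processed chunk by chunk through a generator, the full collection never needs to be held in memory by the intermediate stages.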

Writing your own transformer

The first step in writing your own transformer is to consider the type of change being applied. Several common transformations are supported through the functions in the pyterrier.apply module; see the pyterrier.apply - Custom Transformers documentation.

However, if your transformer has state, such as an expensive model to be loaded at startup time, you may want to extend TransformerBase directly.

Here are some hints for writing Transformers:
  • Unless it is an indexer, your transformer should implement a transform() method.

  • If your approach ranks results, use pt.model.add_ranks() to add the rank column.

  • If your approach can be trained, your transformer should extend EstimatorBase, and implement the fit() method.

  • If your approach is an indexer, your transformer should extend IterDictIndexerBase and implement the index() method.
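
The shape of a stateful re-ranking transformer can be sketched as below. To stay dependency-free, the sketch works on lists of dicts and does not extend TransformerBase; in real code the class would extend TransformerBase, operate on a DataFrame, and call pt.model.add_ranks() instead of the local add_ranks() helper. LengthPenaltyReranker, its penalty parameter, and ranks starting at 0 are assumptions made for illustration.

```python
def add_ranks(rows):
    """Local stand-in: rank rows per query by descending score, from 0."""
    by_qid = {}
    for r in rows:
        by_qid.setdefault(r["qid"], []).append(r)
    ranked = []
    for qid_rows in by_qid.values():
        qid_rows.sort(key=lambda r: r["score"], reverse=True)
        ranked.extend({**r, "rank": i} for i, r in enumerate(qid_rows))
    return ranked


class LengthPenaltyReranker:
    """Hypothetical stateful re-ranker (Q x D -> Q x D)."""

    def __init__(self, penalty=0.1):
        # Stand-in for expensive state loaded once at startup, e.g. a model.
        self.penalty = penalty

    def transform(self, rows):
        # Rescore each document, then recompute the rank column.
        rescored = [
            {**r, "score": r["score"] - self.penalty * len(r["text"].split())}
            for r in rows
        ]
        return add_ranks(rescored)


rows = [
    {"qid": "1", "docno": "d1", "text": "a b c d e f g h i j", "score": 2.0},
    {"qid": "1", "docno": "d2", "text": "a b", "score": 1.5},
]
out = LengthPenaltyReranker().transform(rows)
```

Here the longer document d1 is penalised enough to drop below d2, so the ranks are recomputed after rescoring rather than carried over from the input.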