PyTerrier Transformers¶
PyTerrier’s retrieval architecture is based on three concepts:
dataframes with pre-defined types (each with a minimum set of known attributes), as detailed in the data model.
the transformation of those dataframes by standard information retrieval operations, defined as transformers.
the composition of transformers, supported by the operators defined on transformers.
In essence, a PyTerrier transformer is a class with a transform()
method, which takes as input a dataframe, and changes it,
before returning it.
| Input | Output | Cardinality | Example | Concrete Transformer Example |
|---|---|---|---|---|
| Q | Q | 1 to 1 | Query rewriting | pt.rewrite.SDM() |
| Q | Q x D | 1 to N | Retrieval | pt.BatchRetrieve() |
| Q x D | Q | N to 1 | Query expansion | pt.rewrite.RM3() |
| Q x D | Q x D | 1 to 1 | Re-ranking | pt.apply.doc_score() |
| Q x D | Q x Df | 1 to 1 | Feature scoring | pt.FeaturesBatchRetrieve() |
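To make the cardinalities in the table concrete, the following minimal sketch models a dataframe as a plain list of dictionaries instead of a Pandas DataFrame. The functions `toy_retrieve` and `toy_rerank`, the scoring rules and the two-document corpus are all hypothetical stand-ins for illustration; they are not PyTerrier APIs.

```python
# Toy model: each "dataframe" is a list of dict rows (an assumption made
# for illustration; PyTerrier itself works on Pandas DataFrames).
CORPUS = {"d1": "terrier retrieval platform", "d2": "python retrieval pipelines"}

def toy_retrieve(queries):
    """Q -> Q x D (1 to N): one output row per matching (query, document) pair."""
    results = []
    for q in queries:
        terms = q["query"].split()
        for docno, text in CORPUS.items():
            # Hypothetical score: number of query-term occurrences in the document.
            score = sum(text.split().count(t) for t in terms)
            if score > 0:
                results.append({"qid": q["qid"], "query": q["query"],
                                "docno": docno, "score": float(score)})
    return results

def toy_rerank(results):
    """Q x D -> Q x D (1 to 1): the same rows, each with a new score."""
    return [{**r, "score": r["score"] / len(CORPUS[r["docno"]].split())}
            for r in results]

queries = [{"qid": "q1", "query": "python retrieval"}]
retrieved = toy_retrieve(queries)  # one query row becomes several result rows
reranked = toy_rerank(retrieved)   # the number of rows is unchanged
```

The retrieval step grows one query row into several result rows, while the re-ranking step keeps the row count fixed and only changes the score column.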
Optimisation¶
Some operators applied to transformers can be optimised by the underlying search engine, for instance by cutting a ranking earlier. So while the following two pipelines are semantically equivalent, the latter might be more efficient:
pipe1 = pt.BatchRetrieve(index, wmodel="BM25") % 10
pipe2 = pipe1.compile()
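The idea behind this kind of optimisation can be illustrated with a toy model. The sketch below is not how compile() or Terrier perform the rewrite; `retrieve_all`, `rank_cutoff`, `retrieve_top_k` and the scores are hypothetical. It only shows why applying a cutoff after full retrieval and asking the engine for the top k directly give the same ranking, while the second needs less work.

```python
import heapq

# Hypothetical document scores for a single query.
SCORES = {"d1": 1.2, "d2": 3.4, "d3": 0.7, "d4": 2.9}

def retrieve_all():
    """Unoptimised first stage: score and fully sort every document."""
    return sorted(SCORES.items(), key=lambda kv: kv[1], reverse=True)

def rank_cutoff(ranking, k):
    """What a rank cutoff does to an existing ranking: keep the top k rows."""
    return ranking[:k]

def retrieve_top_k(k):
    """An engine that knows k up front only has to track the k best documents."""
    return heapq.nlargest(k, SCORES.items(), key=lambda kv: kv[1])

unoptimised = rank_cutoff(retrieve_all(), 2)
optimised = retrieve_top_k(2)
```

Both variables hold the same two documents in the same order; the difference is only in how much of the ranking had to be materialised.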
Fitting¶
When fit() is called on a pipeline, all estimators (transformers that also have a fit()
method, as specified by
Estimator) within the pipeline are fitted, in turn. This allows one (or more) stages of learning to be
integrated into a retrieval pipeline. See Learning to Rank for examples.
When calling fit() on a composed pipeline (i.e. one created using the >> operator), this will call fit() on any estimators within that pipeline.
Transformer base classes¶
Transformer¶
This class is the base class for all transformers.
- class pyterrier.Transformer[source]¶
- name = 'Transformer'¶
Base class for all transformers. Implements the various operators >> + * | & as well as search() for executing a single query and compile() for rewriting complex pipelines into simpler ones.
- static identity()[source]¶
Instantiates a transformer that returns exactly its input.
This can be useful for adding the candidate ranking score as a feature for learning-to-rank:
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
two_feat_pipe = bm25 >> pt.Transformer.identity() ** pt.BatchRetrieve(index, wmodel="PL2")
This will return a pipeline that produces a score column (BM25), but also has a features column containing BM25 and PL2 scores.
- Return type
Transformer
- static from_df(input, uniform=False)[source]¶
Instantiates a transformer from an input dataframe. Some rows from the input dataframe are returned in response to a query on the transform() method. Depending on the value of uniform, the dataframe passed as an argument to transform() can affect this selection. If uniform is True, input will be returned in its entirety each time. If uniform is False, only the rows from input that match the qid values of the argument dataframe are returned.
- Return type
Transformer
- transform(topics_or_res)[source]¶
Abstract method for all transformations. Typically takes as input a Pandas DataFrame, and also returns one.
- Return type
DataFrame
- transform_iter(input)[source]¶
Method that processes an iter-dict by instantiating it as a dataframe and calling transform(). Returns the DataFrame returned by transform(). This can be a handier version of transform() that avoids constructing a dataframe by hand. Also used in the implementation of index() on a composed pipeline.
- Return type
DataFrame
- transform_gen(input, batch_size=1, output_topics=False)[source]¶
Method for executing a transformer pipeline on smaller batches of queries. The input dataframe is grouped into batches of batch_size queries, and a generator returned, such that transform() is only executed for a smaller batch at a time.
- Return type
Iterator[DataFrame]
- Parameters
input (DataFrame) – a dataframe to process
batch_size (int) – how many input instances to execute in each batch. Defaults to 1.
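The batching behaviour described for transform_gen() can be sketched with plain Python. This is not PyTerrier's implementation: rows are modelled as a list of dictionaries, and the helper `batches_by_query` is hypothetical. It assumes, following the description above, that a batch is defined by a number of distinct queries (qid values) and not by a number of rows.

```python
from itertools import islice

def batches_by_query(rows, batch_size=1):
    """Yield lists of rows, each covering at most batch_size distinct qids."""
    # Distinct qids, in the order they first appear in the input.
    qids = iter(dict.fromkeys(r["qid"] for r in rows))
    while True:
        chunk = set(islice(qids, batch_size))
        if not chunk:
            return
        # All rows of the selected queries travel together in one batch.
        yield [r for r in rows if r["qid"] in chunk]

rows = [{"qid": "q1", "docno": "d1"}, {"qid": "q1", "docno": "d2"},
        {"qid": "q2", "docno": "d1"}, {"qid": "q3", "docno": "d3"}]
batches = list(batches_by_query(rows, batch_size=2))
```

Because a generator is returned, each batch is only produced when the caller asks for it, which keeps the amount of work in flight small.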
- search(query, qid='1', sort=True)[source]¶
Method for executing a transformer (pipeline) for a single query. Returns a dataframe with the results for the specified query. This is a utility method, and most uses are expected to use the transform() method passing a dataframe.
- Return type
DataFrame
- Parameters
query (str) – String form of the query to run
qid (str) – the query id to associate to this request. defaults to 1.
sort (bool) – ensures the results are sorted by descending rank (defaults to True)
Example:
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
res = bm25.search("example query")
# is equivalent to
queryDf = pd.DataFrame([["1", "example query"]], columns=["qid", "query"])
res = bm25.transform(queryDf)
- compile()[source]¶
Rewrites this pipeline by applying the MatchPy rules in rewrite_rules. Pipeline optimisation is discussed in the ICTIR 2020 paper on PyTerrier.
- Return type
Transformer
- parallel(N, backend='joblib')[source]¶
Returns a parallelised version of this transformer. The underlying transformer must be “picklable”.
- Return type
Transformer
- Parameters
N (int) – how many processes/machines to parallelise this transformer over.
backend (str) – which multiprocessing backend to use. Only two backends are supported, ‘joblib’ and ‘ray’. Defaults to ‘joblib’.
Default Method¶
You can invoke a transformer's transform() method simply by calling the transformer itself, i.e. its default method. If t is a transformer:
df_in = pt.new.queries(['test query'], qid=['q1'])
df_out = t.transform(df_in)
df_out = t(df_in)
The default method can also detect iterable dictionaries, and pass those directly to transform_iter() (which typically calls transform()). So the following expression is equivalent to the examples in the previous code block:
df_out = t([{'qid' : 'q1', 'query' : 'test query'}])
This can be more succinct than creating new dataframes for testing transformer implementations.
Operator Support¶
By extending Transformer, all transformer implementations gain the necessary "dunder" methods (e.g. __rshift__()) to support the transformer operators (>>, + etc). NB: This class used to be called pyterrier.transformer.TransformerBase.
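How dunder methods give rise to operators can be shown with a small stand-alone sketch. None of the classes below are PyTerrier classes: `ToyTransformer`, `ToyComposed`, `Lowercase` and `DropQuestionMarks` are hypothetical, and rows are plain lists of dictionaries. The sketch only illustrates the Python mechanism, where `a >> b` calls `a.__rshift__(b)` and calling an object invokes `__call__()`.

```python
class ToyTransformer:
    """Hypothetical stand-in for a transformer base class; rows are lists of dicts."""

    def transform(self, rows):
        raise NotImplementedError

    def __call__(self, rows):
        # Calling the object delegates to transform().
        return self.transform(rows)

    def __rshift__(self, other):
        # Invoked for `self >> other`: build a pipeline that runs self, then other.
        return ToyComposed(self, other)


class ToyComposed(ToyTransformer):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def transform(self, rows):
        return self.second.transform(self.first.transform(rows))


class Lowercase(ToyTransformer):
    def transform(self, rows):
        return [{**r, "query": r["query"].lower()} for r in rows]


class DropQuestionMarks(ToyTransformer):
    def transform(self, rows):
        return [{**r, "query": r["query"].replace("?", "")} for r in rows]


pipe = Lowercase() >> DropQuestionMarks()
out = pipe([{"qid": "q1", "query": "What is PyTerrier?"}])
```

Subclasses only implement transform(); the operator and call behaviour are inherited from the base class, which is the same division of labour the paragraph above describes.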
Estimator¶
This base class exposes a fit()
method that can be used for transformers that can be trained.
- class pyterrier.Estimator[source]¶
This is a base class for things that can be fitted.
- fit(topics_or_res_tr, qrels_tr, topics_or_res_va, qrels_va)[source]¶
Method for training the transformer.
- Parameters
topics_or_res_tr (DataFrame) – training topics (usually with documents)
qrels_tr (DataFrame) – training qrels
topics_or_res_va (DataFrame) – validation topics (usually with documents)
qrels_va (DataFrame) – validation qrels
The ComposedPipeline implements fit(), which applies the intermediate transformers to the specified training (and validation) topics, and passes the output to the fit() method of the final transformer.
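The behaviour described above can be sketched as follows. This is a simplified model and not the ComposedPipeline code: `Doubler`, `MeanThreshold` and `fit_composed` are hypothetical, rows are lists of dictionaries, validation data is left out, and the toy estimator ignores the qrels it is given.

```python
class Doubler:
    """A plain transformer stage: it has no fit() method."""
    def transform(self, rows):
        return [{**r, "score": r["score"] * 2} for r in rows]


class MeanThreshold:
    """A toy estimator: fit() learns a score threshold, transform() filters by it."""
    def fit(self, res_tr, qrels_tr, res_va=None, qrels_va=None):
        self.threshold = sum(r["score"] for r in res_tr) / len(res_tr)

    def transform(self, rows):
        return [r for r in rows if r["score"] >= self.threshold]


def fit_composed(stages, topics_tr, qrels_tr):
    """Run all but the last stage on the training input, then fit the last stage."""
    rows = topics_tr
    for stage in stages[:-1]:
        rows = stage.transform(rows)
    stages[-1].fit(rows, qrels_tr)


stages = [Doubler(), MeanThreshold()]
train = [{"qid": "q1", "docno": "d1", "score": 1.0},
         {"qid": "q1", "docno": "d2", "score": 3.0}]
fit_composed(stages, train, qrels_tr=[])
```

The estimator is trained on the doubled scores, that is, on what it will actually receive at retrieval time, and not on the raw input to the pipeline.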
Indexer¶
This base class exposes an index() method that can be used for transformers that create an index.
The ComposedPipeline also implements index(), which applies the intermediate transformers to the specified documents to be indexed, and passes the output to the index() method of the final transformer.
Internal transformers¶
A significant number of transformers are defined in pyterrier.ops to implement operators etc. You are not expected to use these directly, but they are documented for completeness.
| Symbol | Name | Implementing transformer |
|---|---|---|
| >> | compose/then | ComposedPipeline |
| \| | set-union | SetUnionTransformer |
| & | set-intersection | SetIntersectionTransformer |
| + | linear | CombSumTransformer |
| * | scalar-product | ScalarProductTransformer |
| % | rank-cutoff | RankCutoffTransformer |
| ** | feature-union | FeatureUnionPipeline |
| ^ | concatenate | ConcatenateTransformer |
| ~ | cache | ChestCacheTransformer |
Indexing Pipelines¶
Transformers can be chained to create indexing pipelines. The last element in the chain is assumed to be an indexer such as IterDictIndexer; it should implement an index() method, as pt.Indexer does. For instance:
docs = [ {"docno" : "1", "text" : "a" } ]
indexer = pt.text.sliding() >> pt.IterDictIndexer()
indexer.index(docs)
This is implemented by several methods:
- The last stage of the pipeline should have an index() method that accepts an iterable of dictionaries.
- ComposedPipeline has a special index() method that breaks the input iterable into chunks (the size of chunks can be altered by a batch_size kwarg) and passes those through the intermediate pipeline stages (i.e. all but the last).
- In the intermediate pipeline stages, the transform_iter() method is called; by default this instantiates a DataFrame on batch_size records, which is passed to transform().
- These are passed to index() of the last pipeline stage.
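The flow described above can be sketched in plain Python. This is a simplified model, not the ComposedPipeline implementation: `SplitText`, `ListIndexer` and `index_composed` are hypothetical, the passage identifier format is made up for the example, and the intermediate stage works directly on lists of dictionaries without building a DataFrame.

```python
from itertools import islice

class SplitText:
    """Intermediate stage: emits one row per whitespace-separated token."""
    def transform_iter(self, rows):
        return [{"docno": f'{r["docno"]}%p{i}', "text": tok}
                for r in rows for i, tok in enumerate(r["text"].split())]


class ListIndexer:
    """Final stage: 'indexes' documents by collecting the rows it receives."""
    def __init__(self):
        self.indexed = []

    def index(self, rows):
        self.indexed.extend(rows)


def index_composed(stages, docs, batch_size=100):
    """Chunk docs, run each chunk through all but the last stage, index the output."""
    def processed():
        it = iter(docs)
        while chunk := list(islice(it, batch_size)):
            for stage in stages[:-1]:
                chunk = stage.transform_iter(chunk)
            yield from chunk
    return stages[-1].index(processed())


indexer = ListIndexer()
docs = [{"docno": "1", "text": "a b"}, {"docno": "2", "text": "c"}]
index_composed([SplitText(), indexer], docs, batch_size=1)
```

The final stage sees a single stream of dictionaries, so it does not need to know that the documents were processed in chunks upstream.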
Writing your own transformer¶
The first step in writing your own transformer is to consider the type of change being applied. Several common transformations are supported through the functions in the pyterrier.apply module; see the pyterrier.apply - Custom Transformers documentation.
However, if your transformer has state, such as an expensive model to be loaded at startup time, you may want to
extend pt.Transformer
directly.
Here are some hints for writing Transformers:
- Except for an indexer, you should implement a transform() method.
- If your approach ranks results, use pt.model.add_ranks() to add the rank column. (pt.apply.doc_score will call add_ranks automatically).
- If your approach can be trained, your transformer should extend Estimator, and implement the fit() method.
- If your approach is an indexer, your transformer should extend Indexer and implement the index() method.
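As an illustration of the first two hints, the sketch below shows the shape of a stateful re-ranker in plain Python. It does not use PyTerrier: `LengthScorer` and `toy_add_ranks` are hypothetical, rows are lists of dictionaries, and the choice of 0 as the first rank is an assumption of this sketch. In real code the class would extend pt.Transformer, operate on a DataFrame, and call pt.model.add_ranks().

```python
from collections import defaultdict

FIRST_RANK = 0  # assumption for this sketch: ranks start at 0

def toy_add_ranks(rows):
    """Assign a per-query rank by descending score (stand-in for pt.model.add_ranks())."""
    by_qid = defaultdict(list)
    for r in rows:
        by_qid[r["qid"]].append(r)
    ranked = []
    for group in by_qid.values():
        group.sort(key=lambda r: r["score"], reverse=True)
        ranked.extend({**r, "rank": FIRST_RANK + i} for i, r in enumerate(group))
    return ranked


class LengthScorer:
    """Hypothetical stateful re-ranker: its state is built once, in __init__."""
    def __init__(self):
        # Stands in for an expensive model loaded at startup time.
        self.weights = {"short": 1.0, "long": 0.5}

    def transform(self, rows):
        scored = [{**r, "score": self.weights["short" if len(r["text"]) < 10 else "long"]}
                  for r in rows]
        return toy_add_ranks(scored)


rows = [{"qid": "q1", "docno": "d1", "text": "a very long document"},
        {"qid": "q1", "docno": "d2", "text": "short"}]
out = LengthScorer().transform(rows)
```

Keeping the model in the instance means it is loaded once, however many times transform() is called afterwards.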
Mocking Transformers from DataFrames¶
You can make a Transformer object from dataframes. For instance, a uniform transformer will always return the input dataframe any time transform() is called:
df = pt.new.ranked_documents([[1,2]])
uniformT = pt.Transformer.from_df(df, uniform=True)
# uniformT.transform() always returns df, regardless of arguments
You can also create a Transformer object from existing results, e.g. saved on disk using pt.io.write_results()
etc. The resulting “source transformer” will return all results by matching on the qid of the input:
res = pt.io.read_results("/path/to/baseline.res.gz")
# uniform=False: only the rows whose qid matches the input are returned
baselineT = pt.Transformer.from_df(res, uniform=False)
Q1 = pt.new.queries("test query", qid="Q1")
resQ1 = baselineT.transform(Q1)
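The difference between the two modes of from_df() can be modelled with a few lines of plain Python. The helper `toy_from_df` is hypothetical and works on lists of dictionaries, not DataFrames; it follows the behaviour documented for the uniform argument and is not the library's implementation.

```python
def toy_from_df(stored, uniform=False):
    """Return a transform function over `stored`, a list of dict rows."""
    def transform(rows):
        if uniform:
            # Uniform: ignore the input and return everything that was stored.
            return list(stored)
        # Non-uniform: return only the stored rows whose qid appears in the input.
        wanted = {r["qid"] for r in rows}
        return [s for s in stored if s["qid"] in wanted]
    return transform


stored = [{"qid": "Q1", "docno": "d1"}, {"qid": "Q2", "docno": "d7"}]
query = [{"qid": "Q1", "query": "test query"}]
uniform_out = toy_from_df(stored, uniform=True)(query)
matched_out = toy_from_df(stored, uniform=False)(query)
```

With uniform=True both stored rows come back even though only Q1 was asked for; with uniform=False only the Q1 row is returned, which is the behaviour wanted when replaying saved results.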