PyTerrier Transformers¶
PyTerrier’s retrieval architecture is based on three concepts:
dataframes with pre-defined types (each with a minimum set of known attributes), as detailed in the data model.
the transformation of those dataframes by standard information retrieval operations, defined as transformers.
the composition of transformers, supported by the operators defined on transformers.
In essence, a PyTerrier transformer is a class that implements one or both of two methods:
a transform() method, which takes a dataframe as input, changes it, and returns it,
and/or
a transform_iter() method, which takes an iterable of dictionaries (“iter-dict”) as input, changes it, and returns or yields it.
At least one of these methods must be implemented. If only one is implemented, the default implementation of the other converts the input type and calls the implemented method - for instance,
if a transformer’s transform_iter() method is called, but only transform() is implemented, the iter-dict is used to construct a
dataframe, transform() is called, and the resulting dataframe is converted back into an iter-dict.
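The fallback described above can be sketched without PyTerrier or pandas. This is only an illustration of the idea, not the library's implementation: MiniTransformer, to_table and to_records are hypothetical names, and a dict of columns stands in for a dataframe.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]
Table = Dict[str, List[Any]]  # column name -> values; a stand-in for a dataframe


def to_table(records: Iterable[Record]) -> Table:
    """Convert an iter-dict into the column-oriented stand-in."""
    rows = list(records)
    columns = list(rows[0].keys()) if rows else []
    return {c: [r[c] for r in rows] for c in columns}


def to_records(table: Table) -> List[Record]:
    """Convert the column-oriented stand-in back into a list of dicts."""
    columns = list(table.keys())
    n = len(table[columns[0]]) if columns else 0
    return [{c: table[c][i] for c in columns} for i in range(n)]


class MiniTransformer:
    """Hypothetical base class: each method defaults to calling the other."""

    def __init__(self) -> None:
        cls = type(self)
        # Refuse construction if neither method has been overridden.
        if (cls.transform is MiniTransformer.transform
                and cls.transform_iter is MiniTransformer.transform_iter):
            raise NotImplementedError("implement transform() or transform_iter()")

    def transform(self, table: Table) -> Table:
        return to_table(self.transform_iter(to_records(table)))

    def transform_iter(self, records: Iterable[Record]) -> Iterable[Record]:
        return to_records(self.transform(to_table(records)))


class Lowercase(MiniTransformer):
    # Only the iter-dict form is implemented; transform() comes for free.
    def transform_iter(self, records):
        for r in records:
            yield {**r, "query": r["query"].lower()}


t = Lowercase()
out_table = t.transform({"qid": ["q1"], "query": ["Test Query"]})
out_records = list(t.transform_iter([{"qid": "q1", "query": "Test Query"}]))
```

Both calls run the same logic; only the container type of the input and output differs.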
Depending on their expected input and output columns, transformers can be described as falling into different categories.
| Input | Output | Cardinality | Example | Concrete Transformer Example |
|---|---|---|---|---|
| D | D | 1 to 1 | Document expansion | |
| Q | Q | 1 to 1 | Query rewriting | pt.rewrite.SDM() |
| Q | Q x D | 1 to N | Retrieval | pt.terrier.Retriever() |
| Q x D | Q | N to 1 | Query expansion | pt.rewrite.RM3() |
| Q x D | Q x D | 1 to 1 | Re-ranking | pt.apply.doc_score() |
| Q x D | Q x Df | 1 to 1 | Feature scoring | pt.terrier.FeaturesRetriever() |
Hint
When writing transformers, it’s a good idea to validate the inputs to make sure they contain the values you expect. See Input Validation for more details.
Optimisation¶
Some operators applied to transformers can be optimised by the underlying search engine - for instance, cutting a ranking earlier. So while the following two pipelines are semantically equivalent, the latter might be more efficient:
pipe1 = pt.terrier.Retriever(index, wmodel="BM25") % 10
pipe2 = pipe1.compile()
Fitting¶
When fit() is called on a pipeline, all estimators (transformers that also have a fit() method, as specified by
Estimator) within the pipeline are fitted, in turn. This allows one (or more) stages of learning to be
integrated into a retrieval pipeline. See Learning to Rank for examples.
When calling fit() on a composed pipeline (i.e. one created using the >> operator), this will call fit() on any
estimators within that pipeline.
Transformer base classes¶
Transformer¶
This class is the base class for all transformers.
- class pyterrier.Transformer(*args, **kwargs)[source]¶
Base class for all transformers. Implements the various operators >> + * | & as well as search() for executing a single query and compile() for rewriting complex pipelines into simpler ones.
It is expected that either transform() or transform_iter() be implemented by any class extending this - this rule does not apply for indexers, which instead implement .index(). pt.apply helper functions can be used to easily construct Transformers around a single function.
- static identity()[source]¶
Instantiates a transformer that returns exactly its input.
This can be useful for adding the candidate ranking score as a feature for learning-to-rank:
bm25 = pt.terrier.Retriever(index, wmodel="BM25")
two_feat_pipe = bm25 >> pt.Transformer.identity() ** pt.terrier.Retriever(index, wmodel="PL2")
This will return a pipeline that produces a score column (BM25), but also has a features column containing BM25 and PL2 scores.
- Return type:
Transformer
- static from_df(input, uniform=False)[source]¶
Instantiates a transformer from an input dataframe. Some rows from the input dataframe are returned in response to a query on the transform() method. Depending on the value of uniform, the dataframe passed as an argument to transform() can affect this selection.
- Parameters:
input (DataFrame) – a dataframe to store and return, based on the setting of uniform.
uniform (bool) – If True, input will be returned in its entirety each time; otherwise, only the rows from input that match the qid values of the argument dataframe are returned.
- Return type:
Transformer
- transform(inp)[source]¶
Abstract method that runs the transformer over Pandas DataFrame objects. This or transform_iter() must be implemented by all Transformer objects.
Note
Either transform() or transform_iter() must be implemented for all transformers. If not, a runtime error will be raised when constructing the transformer.
When transform() is not implemented, the default implementation runs transform_iter() and converts the output to a DataFrame.
- Parameters:
inp (DataFrame) – The input to the transformer (e.g., queries, documents, results, etc.)
- Returns:
The output of the transformer (e.g., result of retrieval, re-writing, re-ranking, etc.)
- Return type:
pd.DataFrame
- transform_iter(inp)[source]¶
Abstract method that runs the transformer over iterable input (such as lists or generators), where each element is a dictionary record. This or transform() must be implemented by all Transformer objects.
This format can sometimes be easier to implement than transform(). Furthermore, it avoids constructing expensive DataFrame objects. It is also used in the invocation of index() on a composed pipeline.
Note
Either transform() or transform_iter() must be implemented for all transformers. If not, a runtime error will be raised when constructing the transformer.
When transform_iter() is not implemented, the default implementation runs transform() and converts the output to an iterable.
- Parameters:
inp (Iterable[Dict[str, Any]]) – The input to the transformer (e.g., queries, documents, results, etc.)
- Returns:
The output of the transformer (e.g., result of retrieval, re-writing, re-ranking, etc.)
- Return type:
Iterable[Dict]
- __call__(inp)[source]¶
Runs the transformer for the given input and returns its output as the same type as the input.
When inp is a DataFrame, invokes transform() and returns a DataFrame
When inp is a list, invokes transform_iter() and returns a list
Otherwise, invokes transform_iter() and returns a generic iterable (returning whatever type is returned from transform_iter().)
- Parameters:
inp (DataFrame | Iterable[Dict[str, Any]] | List[Dict[str, Any]]) – The input to the transformer (e.g., queries, documents, results, etc.)
- Returns:
The output of the transformer (e.g., result of retrieval, re-writing, re-ranking, etc.) as the same type as the input.
- Return type:
pd.DataFrame | Iterable[Dict] | List[Dict]
- transform_gen(input, batch_size=1, output_topics=False)[source]¶
Method for executing a transformer pipeline on smaller batches of queries. The input dataframe is grouped into batches of batch_size queries, and a generator is returned, such that transform() is only executed for one smaller batch at a time.
- Parameters:
input (DataFrame) – a dataframe to process
batch_size – how many input instances to execute in each batch. Defaults to 1.
- Return type:
Iterator[DataFrame] | Iterator[Tuple[DataFrame, DataFrame]]
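The batching idea can be illustrated with plain lists of dictionaries. This is a rough sketch only: batches_by_query and transform_in_batches are hypothetical helpers, they operate on records rather than dataframes, and they do not model the output_topics option.

```python
from typing import Any, Callable, Dict, Iterator, List

Record = Dict[str, Any]


def batches_by_query(records: List[Record], batch_size: int = 1) -> Iterator[List[Record]]:
    """Yield groups of records that cover at most batch_size distinct qids."""
    # dict.fromkeys keeps the first-seen order of the query ids
    qids = list(dict.fromkeys(r["qid"] for r in records))
    for start in range(0, len(qids), batch_size):
        wanted = set(qids[start:start + batch_size])
        yield [r for r in records if r["qid"] in wanted]


def transform_in_batches(transform: Callable[[List[Record]], List[Record]],
                         records: List[Record],
                         batch_size: int = 1) -> Iterator[List[Record]]:
    """Lazily apply transform to one batch of queries at a time."""
    for batch in batches_by_query(records, batch_size):
        yield transform(batch)


queries = [
    {"qid": "q1", "query": "a"},
    {"qid": "q2", "query": "b"},
    {"qid": "q3", "query": "c"},
]


def tag(batch: List[Record]) -> List[Record]:
    # Toy transformation: mark every record as processed
    return [{**r, "seen": True} for r in batch]


results = list(transform_in_batches(tag, queries, batch_size=2))
batch_sizes = [len(b) for b in results]
```

Because the function is a generator, no batch is processed until the caller asks for it, which keeps memory use bounded by the batch size.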
- search(query, qid='1', sort=True)[source]¶
Method for executing a transformer (pipeline) for a single query.
- Parameters:
query (str) – String form of the query to run
qid (str) – the query id to associate to this request. Defaults to 1.
sort (bool) – ensures the results are sorted by descending rank (defaults to True)
- Return type:
DataFrame
- Returns:
Returns a dataframe with the results for the specified query. This is a utility method, and most uses are expected to use the transform() method passing a dataframe.
Example:
bm25 = pt.terrier.Retriever(index, wmodel="BM25")
res = bm25.search("example query")
# is equivalent to
queryDf = pd.DataFrame([["1", "example query"]], columns=["qid", "query"])
res = bm25.transform(queryDf)
- compile()[source]¶
Returns an optimised transformer, if possible, to improve performance.
For instance, a pipeline of transformers can be optimised by fusing adjacent transformers.
- Return type:
Transformer
- Returns:
A new transformer that is equivalent to this transformer, but optimised.
- parallel(N, backend='joblib')[source]¶
Returns a parallelised version of this transformer. The underlying transformer must be “picklable”. For more information, see Parallelisation documentation.
- Parameters:
N (int) – how many processes/machines to parallelise this transformer over.
backend (str) – which multiprocessing backend to use. Only two backends are supported, ‘joblib’ and ‘ray’. Defaults to ‘joblib’.
- Return type:
Transformer
- get_parameter(name)[source]¶
Gets the current value of a particular key of the transformer’s configuration state. By default, this examines the attributes of the transformer object, using hasattr() and getattr().
- Parameters:
name (str) – name of parameter
- set_parameter(name, value)[source]¶
Adjusts this transformer’s configuration state, by setting the value for a specific parameter. By default, this examines the attributes of the transformer object, using hasattr() and setattr().
- Parameters:
name (str) – name of parameter
value – the value to set for the parameter
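As a rough illustration of attribute-based parameters, the following stand-alone sketch looks a name up with hasattr() before reading or writing it. ConfigurableSketch and ToyRetriever are hypothetical classes, and raising ValueError for an unknown name is an assumption of this sketch rather than documented behaviour.

```python
from typing import Any


class ConfigurableSketch:
    """Hypothetical mixin: parameters are simply attributes of the object."""

    def get_parameter(self, name: str) -> Any:
        if hasattr(self, name):
            return getattr(self, name)
        # Assumption: unknown names are reported as an error
        raise ValueError(f"Invalid parameter name {name!r}")

    def set_parameter(self, name: str, value: Any) -> None:
        if not hasattr(self, name):
            raise ValueError(f"Invalid parameter name {name!r}")
        setattr(self, name, value)


class ToyRetriever(ConfigurableSketch):
    def __init__(self, num_results: int = 1000) -> None:
        self.num_results = num_results


r = ToyRetriever()
before = r.get_parameter("num_results")
r.set_parameter("num_results", 10)
after = r.get_parameter("num_results")
```

A transformer whose configuration does not live in plain attributes could override both methods to expose it in the same way.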
Default Method¶
You can invoke a transformer’s transform() method simply by calling the default method. If t is a transformer:
df_in = pt.new.queries(['test query'], qid=['q1'])
df_out = t.transform(df_in)
df_out = t(df_in)
The default method will also detect iterable dictionaries, and pass those directly to transform_iter()
(which usually calls transform() if transform_iter() has not been implemented). So the following
expression is equivalent to the examples in the previous code block, except that df_out will contain an iter-dict:
df_out = t([{'qid' : 'q1', 'query' : 'test query'}])
This can be more succinct than creating new dataframes for testing transformer implementations.
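The type-based dispatch of the default method can be sketched as follows. This is an illustration under stated assumptions, not PyTerrier's code: Table is a hypothetical stand-in for a dataframe, and Upper is a toy transformer.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


class Table:
    """Minimal stand-in for a dataframe: it just wraps a list of row dicts."""

    def __init__(self, rows: Iterable[Record]) -> None:
        self.rows: List[Record] = list(rows)


class Upper:
    """Toy transformer that upper-cases the query text."""

    def transform(self, table: Table) -> Table:
        return Table(self.transform_iter(table.rows))

    def transform_iter(self, records: Iterable[Record]) -> Iterable[Record]:
        for r in records:
            yield {**r, "query": r["query"].upper()}

    def __call__(self, inp):
        if isinstance(inp, Table):
            return self.transform(inp)             # table in, table out
        if isinstance(inp, list):
            return list(self.transform_iter(inp))  # list in, list out
        return self.transform_iter(inp)            # any other iterable: pass through


t = Upper()
record = {"qid": "q1", "query": "test query"}
from_table = t(Table([record]))
from_list = t([record])
from_iterator = t(iter([record]))
```

The caller gets back the same kind of container it passed in, which is what makes the one-line list form convenient in tests.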
Operator Support¶
By extending Transformer, all transformer implementations gain the necessary “dunder” methods (e.g. __rshift__())
to support the transformer operators (>>, + etc). NB: This class used to be called pyterrier.transformer.TransformerBase.
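To show how dunder methods can give objects a pipeline syntax, here is a small self-contained sketch. Step and Compose are hypothetical classes written for this example, and the cutoff is simplified to a plain slice over a single query's results rather than a per-query cutoff.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


class Step:
    """Hypothetical pipeline stage wrapping a function over lists of records."""

    def __init__(self, fn: Callable[[List[Record]], List[Record]]) -> None:
        self.fn = fn

    def __call__(self, records: List[Record]) -> List[Record]:
        return self.fn(records)

    def __rshift__(self, right: "Step") -> "Step":
        # a >> b builds a new object that runs a, then b
        return Compose(self, right)

    def __mod__(self, k: int) -> "Step":
        # a % k keeps only the first k records produced by a
        return Compose(self, Step(lambda recs: recs[:k]))


class Compose(Step):
    def __init__(self, left: Step, right: Step) -> None:
        self.left, self.right = left, right

    def __call__(self, records: List[Record]) -> List[Record]:
        return self.right(self.left(records))


score_by_length = Step(lambda recs: [{**r, "score": len(r["text"])} for r in recs])
sort_by_score = Step(lambda recs: sorted(recs, key=lambda r: -r["score"]))

# % binds more tightly than >> in Python, hence the parentheses
pipe = (score_by_length >> sort_by_score) % 2

docs = [
    {"docno": "d1", "text": "a"},
    {"docno": "d2", "text": "abc"},
    {"docno": "d3", "text": "ab"},
]
top = pipe(docs)
top_docnos = [r["docno"] for r in top]
```

Each operator returns a new object instead of running anything, so a pipeline is just a tree of stages that executes when it is finally called.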
Estimator¶
This base class exposes a fit() method that can be used for transformers that can be trained.
- class pyterrier.Estimator(*args, **kwargs)[source]¶
This is a base class for things that can be fitted.
The Compose operator (>>) implements fit(), which applies the intermediate transformers on the specified training (and validation) topics, and places
the output into the fit() method of the final transformer.
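The behaviour described above, where the earlier stages prepare the training data for the final stage, might look like this in a stand-alone sketch. Pipeline, AddLength and LengthThreshold are hypothetical names, the argument names are illustrative, and relevance labels are simplified to a dict keyed by docno.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


class AddLength:
    """Intermediate stage: adds a 'length' feature to every record."""

    def transform(self, records: List[Record]) -> List[Record]:
        return [{**r, "length": len(r["text"])} for r in records]


class LengthThreshold:
    """Toy estimator: learns the smallest length seen among relevant documents."""

    def __init__(self) -> None:
        self.threshold: Optional[int] = None

    def fit(self, train, labels, valid=None, valid_labels=None):
        self.threshold = min(r["length"] for r in train if labels[r["docno"]] == 1)
        return self

    def transform(self, records: List[Record]) -> List[Record]:
        return [r for r in records if r["length"] >= self.threshold]


class Pipeline:
    def __init__(self, *stages) -> None:
        self.stages = list(stages)

    def transform(self, records: List[Record]) -> List[Record]:
        for stage in self.stages:
            records = stage.transform(records)
        return records

    def fit(self, train, labels, valid=None, valid_labels=None):
        # Run the intermediate stages so the final stage sees their output
        for stage in self.stages[:-1]:
            train = stage.transform(train)
            if valid is not None:
                valid = stage.transform(valid)
        self.stages[-1].fit(train, labels, valid, valid_labels)
        return self


train_docs = [
    {"docno": "d1", "text": "aa"},
    {"docno": "d2", "text": "aaaa"},
    {"docno": "d3", "text": "aaaaaa"},
]
train_labels = {"d1": 0, "d2": 1, "d3": 1}

pipeline = Pipeline(AddLength(), LengthThreshold()).fit(train_docs, train_labels)
learned = pipeline.stages[-1].threshold
kept = pipeline.transform([{"docno": "x", "text": "abc"}, {"docno": "y", "text": "abcde"}])
```

The estimator never sees raw documents: it is trained on exactly the representation it will receive at query time.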
Indexer¶
This base class exposes an index() method that can be used for transformers that create an index.
- class pyterrier.Indexer(*args, **kwargs)[source]¶
- index_inputs()[source]¶
Returns a list of column configurations that index() expects. This default implementation returns None, and should be overridden by subclasses to allow accurate inspections and schematic visualisations.
- Return type:
List[List[str]] | None
- index(iter, **kwargs)[source]¶
Takes an iterable of dictionaries (“iterdict”), and consumes them. The index method may return an instance of the index or retriever. This method is typically used to implement indexers that consume a corpus (or to consume the output of previous pipeline components that have transformed the documents being consumed).
- Parameters:
iter (Iterable[Dict[str, Any]]) – An iterable of dictionaries, each representing a document.
- Return type:
Any
The Compose operator (>>) also implements index(), which applies the intermediate transformers on the specified documents to be indexed, and places
the output into the index() method of the final transformer.
Internal transformers¶
A significant number of transformers are defined in pyterrier._ops to implement operators etc. You are not expected to use these directly, but they are listed for completeness.
| Symbol | Name | Implementing transformer |
|---|---|---|
| >> | compose/then | pt._ops.Compose |
| \| | set-union | pt._ops.SetUnion |
| & | set-intersection | pt._ops.SetIntersection |
| + | linear | pt._ops.CombSum |
| * | scalar-product | pt._ops.ScalarProduct |
| % | rank-cutoff | pt._ops.RankCutoff |
| ** | feature-union | pt._ops.FeatureUnion |
| ^ | concatenate | pt._ops.Concatenate |
Indexing Pipelines¶
Transformers can be chained to create indexing pipelines. The last element in the chain is assumed to be an indexer like
IterDictIndexer - it should implement an index() method like pt.Indexer. For instance:
docs = [ {"docno" : "1", "text" : "a" } ]
indexer = pt.text.sliding() >> pt.IterDictIndexer()
indexer.index(docs)
This is implemented by several methods:
The last stage of the pipeline should have an index() method that accepts an iterable of dictionaries.
Compose has a special index() method that breaks the input iterable into chunks (the size of chunks can be altered by a batch_size kwarg) and passes those through the intermediate pipeline stages (i.e. all but the last).
In the intermediate pipeline stages, the transform_iter() method is called - by default this instantiates a DataFrame on batch_size records, which is passed to transform().
These are passed to index() of the last pipeline stage.
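The chunk-and-forward flow listed above can be sketched in plain Python. All names here (chunked, index_pipeline, split_words, ListIndexer) are hypothetical, and the stages are simple functions over lists of dictionaries rather than real transformers.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def chunked(iterable: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Break an iterable into lists of at most size items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def index_pipeline(stages: List[Callable], indexer, docs: Iterable[Record], batch_size: int = 100):
    """Pass each chunk through every intermediate stage, then stream it to the indexer."""
    def transformed() -> Iterator[Record]:
        for batch in chunked(docs, batch_size):
            for stage in stages:
                batch = list(stage(batch))
            yield from batch
    return indexer.index(transformed())


def split_words(batch: List[Record]) -> Iterator[Record]:
    # Toy intermediate stage: one output document per word
    for doc in batch:
        for i, word in enumerate(doc["text"].split()):
            yield {"docno": f'{doc["docno"]}%p{i}', "text": word}


class ListIndexer:
    """Toy final stage: 'indexes' documents by collecting them in a list."""

    def __init__(self) -> None:
        self.docs: List[Record] = []

    def index(self, records: Iterable[Record]) -> "ListIndexer":
        self.docs.extend(records)
        return self


docs = [{"docno": "1", "text": "a b c"}, {"docno": "2", "text": "d e"}]
built = index_pipeline([split_words], ListIndexer(), docs, batch_size=1)
indexed_docnos = [d["docno"] for d in built.docs]
```

Only one chunk of documents is held in memory at a time, so the corpus can be arbitrarily large as long as the final indexer consumes its input lazily.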
Writing your own transformer¶
The first step to writing your own transformer is to consider the type of change being applied. Several common transformations are supported through the functions in the pt.apply module; see the pt.apply - Custom Transformers documentation.
However, if your transformer has state, such as an expensive model or index data structure to be loaded at startup time,
you may want to extend pt.Transformer directly.
Here are some hints for writing Transformers:
Except for an indexer, you should implement a transform() and/or transform_iter() method.
If your approach ranks results, use pt.model.add_ranks() to add the rank column. (pt.apply.doc_score will call add_ranks automatically).
If your approach can be trained, your transformer should extend Estimator, and implement the fit() method.
If your approach is an indexer, your transformer should extend Indexer and implement the index() method, as well as the index_inputs() method that describes the expected input columns.
Optimisation of Transformer Pipelines¶
When .compile() is called on a transformer or a pipeline of transformers, there is an opportunity to improve the efficiency of the pipeline,
while ensuring that the semantics remain unchanged. Implementors of a transformer wishing to support such optimisations have a number
of mechanisms open to them.
Firstly, the transformer’s .compile() method can be overridden to return a new transformer instance that may be more efficient.
Secondly, the transformer can be fused with adjacent transformers. For instance, a retriever may be fused with a rank-cutoff operator, such that the rank-cutoff is applied at retrieval rather than after. Fusion is controlled by protocol methods, which determine how the transformer can be fused.
- class pyterrier.transformer.SupportsFuseRankCutoff(*args, **kwargs)[source]¶
- fuse_rank_cutoff(k)[source]¶
Fuses this transformer with a following RankCutoff transformer.
This method should return a new transformer that applies the new rank cutoff value k.
Note that if the transformer currently applies a stricter rank cutoff than the one provided, it should not be relaxed. In this case, it is preferred to return self.
If the fusion is not possible, None should be returned.
- Parameters:
k (int) – The rank cutoff requested
- Return type:
Transformer | None
- class pyterrier.transformer.SupportsFuseLeft(*args, **kwargs)[source]¶
- fuse_left(left)[source]¶
Fuses this transformer with a transformer that immediately precedes this one in a composed (>>) pipeline.
The new transformer should have the same effect as performing the two transformers in sequence, i.e., pipeline_unfused and pipeline_fused in the following example should provide the same results for any input:
pipeline_unfused = left >> self
pipeline_fused = self.fuse_left(left)
A fused transformer should be more efficient than the unfused version. For instance, a retriever followed by a rank cutoff can be fused to perform the rank cutoff during retrieval.
- Parameters:
left (Transformer) – transformer to the left.
- Return type:
Transformer | None
- Returns:
A new transformer that is the result of merging this transformer with the left transformer, or None if a merge is not possible.
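The equivalence requirement can be checked on a deliberately simple case: two adjacent score multipliers collapse into one. Scale is a hypothetical class written for this sketch and is not part of PyTerrier.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


class Scale:
    """Toy transformer that multiplies every score by a constant factor."""

    def __init__(self, factor: float) -> None:
        self.factor = factor

    def transform(self, records: List[Record]) -> List[Record]:
        return [{**r, "score": r["score"] * self.factor} for r in records]

    def fuse_left(self, left: Any) -> Optional["Scale"]:
        if isinstance(left, Scale):
            # Two multiplications in sequence equal one by the product
            return Scale(left.factor * self.factor)
        return None  # unknown neighbour: fusion is not possible


records = [{"docno": "d1", "score": 1.0}, {"docno": "d2", "score": 2.5}]
left, right = Scale(2), Scale(3)

unfused_out = right.transform(left.transform(records))
fused = right.fuse_left(left)
fused_out = fused.transform(records)
not_fusable = right.fuse_left(object())
```

The fused version touches each record once instead of twice, while producing the same output as the two-stage pipeline.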
- class pyterrier.transformer.SupportsFuseRight(*args, **kwargs)[source]¶
- fuse_right(right)[source]¶
Fuses this transformer with a transformer that immediately follows this one in a composed (>>) pipeline.
The new transformer should have the same effect as performing the two transformers in sequence, i.e., pipeline_unfused and pipeline_fused in the following example should provide the same results for any input:
pipeline_unfused = self >> right
pipeline_fused = self.fuse_right(right)
A fused transformer should be more efficient than the unfused version. For instance, a retriever followed by a rank cutoff can be fused to perform the rank cutoff during retrieval.
- Parameters:
right (Transformer) – transformer to the right in a composed pipeline.
- Return type:
Transformer | None
- Returns:
A new transformer that is the result of merging this transformer with the right transformer, or None if a merge is not possible.
- class pyterrier.transformer.SupportsFuseFeatureUnion(*args, **kwargs)[source]¶
- fuse_feature_union(other, is_left)[source]¶
Fuses this transformer with another one that provides features.
This method should return a new transformer that is equivalent to performing self ** other, or None if the fusion is not possible.
- Parameters:
other (Transformer) – transformer to the left or right.
is_left (bool) – is True if self’s features are to the left of other’s. Otherwise, self’s features are to the right.
- Return type:
Transformer | None
Mocking Transformers from DataFrames¶
You can make a Transformer object from dataframes. For instance, a uniform transformer will always return the dataframe it was created from
any time transform() is called:
df = pt.new.ranked_documents([[1,2]])
uniformT = pt.Transformer.from_df(df, uniform=True)
# uniformT.transform() always returns df, regardless of arguments
You can also create a Transformer object from existing results, e.g. saved on disk using pt.io.write_results()
etc. The resulting “source transformer” will return all results by matching on the qid of the input:
res = pt.io.read_results("/path/to/baseline.res.gz")
baselineT = pt.Transformer.from_df(res, uniform=False)
Q1 = pt.new.queries("test query", qid="Q1")
resQ1 = baselineT.transform(Q1)
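The difference between the two modes can be sketched with lists of dictionaries in place of dataframes. make_source is a hypothetical helper, and the sketch simplifies the non-uniform case to returning the stored rows whose qid appears in the input.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def make_source(stored: List[Record], uniform: bool = False) -> Callable[[List[Record]], List[Record]]:
    """Build a function that replays stored results."""
    def transform(queries: List[Record]) -> List[Record]:
        if uniform:
            # Ignore the input entirely and return everything
            return list(stored)
        qids = {q["qid"] for q in queries}
        return [r for r in stored if r["qid"] in qids]
    return transform


stored_results = [
    {"qid": "Q1", "docno": "d1", "score": 2.0},
    {"qid": "Q1", "docno": "d2", "score": 1.0},
    {"qid": "Q2", "docno": "d3", "score": 5.0},
]
q1 = [{"qid": "Q1", "query": "test query"}]

matched = make_source(stored_results, uniform=False)(q1)
everything = make_source(stored_results, uniform=True)(q1)
```

The matching mode is the one suited to replaying a saved run, since each query only receives its own results.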