pyterrier.apply - Custom Transformers

PyTerrier pipelines are easily extensible through the use of apply functions. These are inspired by the Pandas apply() method, which applies a function to each row of a dataframe. In PyTerrier, the apply methods construct pipeline transformers that address common use cases by applying custom functions (including Python lambda functions) to transform their inputs.

The table below lists the main classes of transformation in the PyTerrier data model, as well as the appropriate apply method to use in each case. In general, if there is a one-to-one mapping between the input and the output, then the specific pt.apply methods should be used (i.e. query(), doc_score(), doc_features()). If the cardinality of the dataframe changes when the transformer is applied, then generic() or by_query() must be used.

In particular, through the use of pt.apply.doc_score(), any re-ranking method that can be expressed as a function of the text of the query and the text of the document can be used as a re-ranker in a PyTerrier pipeline.

Each apply method takes as input a function (e.g. a function name, or a lambda expression). The object passed to that function varies with the type of the input dataframe (queries or ranked documents), as does the type of value that the function should return.

Input  | Output  | Cardinality | Example          | Example apply            | Function Input type    | Function Return type
-------+---------+-------------+------------------+--------------------------+------------------------+----------------------
Q      | Q       | 1 to 1      | Query rewriting  | pt.apply.query()         | row of one query       | str
Q x D  | Q x D   | 1 to 1      | Re-ranking       | pt.apply.doc_score()     | row of one document    | float
Q x D  | Q x Df  | 1 to 1      | Feature scoring  | pt.apply.doc_features()  | row of one document    | numpy array
Q x D  | Q       | N to 1      | Query expansion  | pt.apply.generic()       | entire dataframe       | entire dataframe
       |         |             |                  | pt.apply.by_query()      | dataframe for 1 query  | dataframe for 1 query
Q      | Q x D   | 1 to N      | Retrieval        | pt.apply.generic()       | entire dataframe       | entire dataframe
       |         |             |                  | pt.apply.by_query()      | dataframe for 1 query  | dataframe for 1 query

In each case, the result from calling a pyterrier.apply method is another PyTerrier transformer (i.e. extends pt.Transformer), which can be used for experimentation or combined with other PyTerrier transformers through the standard PyTerrier operators.

If verbose=True is passed to any pyterrier apply method (except generic()), then a TQDM progress bar will be shown as the transformer is applied.
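For instance, a progress bar could be requested as follows (an illustrative sketch; _my_scorer stands in for any row-wise scoring function):

# display a TQDM progress bar while _my_scorer is applied to each document
scorer = pt.apply.doc_score(_my_scorer, verbose=True)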

Example

In the following, we create a document re-ranking transformer that increases the score of documents by 10% if their url attribute contains “https:”.

>>> df = pd.DataFrame([["q1", "d1", "https://www.example.com", 1.0, 1]], columns=["qid", "docno", "url", "score", "rank"])
>>> df
qid docno                      url  score  rank
0  q1    d1  https://www.example.com    1.0     1
>>>
>>> http_boost = pt.apply.doc_score(lambda row: row["score"] * 1.1 if "https:" in row["url"] else row["score"])
>>> http_boost(df)
qid docno                      url  score  rank
0  q1    d1  https://www.example.com    1.1     0
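Note that, as described under pt.apply.doc_score() below, ranks are recomputed automatically from the new scores, which is why the rank column changes from 1 to 0 in the output above.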

We can combine this pt.apply.doc_score() transformer into a re-ranking pipeline using the >> operator:

pipeline = bm25 >> http_boost
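As noted earlier, any re-ranking method expressible as a function of the query text and document text can be wrapped in the same way. The following is a minimal sketch (not a realistic re-ranker), assuming the retrieved dataframe carries the document text in a "text" column (e.g. obtained via metadata=["docno", "text"]) alongside the "query" column:

# a toy text-based re-ranker: the new score is the number of query terms
# that also appear in the document text
def _overlap(row):
    query_terms = set(row["query"].lower().split())
    doc_terms = set(row["text"].lower().split())
    return float(len(query_terms & doc_terms))

text_pipeline = pt.BatchRetrieve(index, metadata=["docno", "text"]) >> pt.apply.doc_score(_overlap)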

Further examples are shown for each apply method below.

Apply Methods

pyterrier.apply.query(fn, *args, **kwargs)[source]

Create a transformer that takes as input a query, and applies a supplied function to compute a new query formulation.

The supplied function is called once for each query, and must return a string containing the new query formulation. Each time it is called, the function is supplied with a Pandas Series representing the attributes of the query.

The previous query formulation is saved in the “query_0” column. If a later pipeline stage needs to be executed on the previous query formulation, a pt.rewrite.reset() transformer can be applied.

Return type:

Transformer

Parameters:
  • fn (Callable) – the function to apply to each row. It must return a string containing the new query formulation.

  • verbose (bool) – if set to True, a TQDM progress bar will be displayed

Examples:

# this will remove pre-defined stopwords from the query
stops=set(["and", "the"])

# a naive function to remove stopwords - takes as input a Pandas Series, and returns a string
def _remove_stops(q):
    terms = q["query"].split(" ")
    terms = [t for t in terms if t not in stops]
    return " ".join(terms)

# a query rewriting transformer that applies _remove_stops to each row of an input dataframe
p1 = pt.apply.query(_remove_stops) >> pt.BatchRetrieve(index, wmodel="DPH")

# an equivalent query rewriting transformer using an anonymous lambda function
p2 = pt.apply.query(
        lambda q: " ".join([t for t in q["query"].split(" ") if t not in stops])
    ) >> pt.BatchRetrieve(index, wmodel="DPH")

In both of the example pipelines above (p1 and p2), the exact topics are not known until the pipeline is invoked, e.g. by using p1.transform(topics) on a topics dataframe, or within a pt.Experiment(). When the pipeline is invoked, the specified function (_remove_stops in the case of p1) is called for each row of the input dataframe (becoming the q function argument).
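If a later stage should see the original query formulation (saved in the “query_0” column, as noted above), pt.rewrite.reset() can be appended. A sketch of such a pipeline, assuming the same _remove_stops function and index as above:

# rewrite the query, retrieve with the rewritten formulation,
# then restore the original query from the "query_0" column
p3 = (pt.apply.query(_remove_stops)
      >> pt.BatchRetrieve(index, wmodel="DPH")
      >> pt.rewrite.reset())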

pyterrier.apply.doc_score(fn, *args, batch_size=None, **kwargs)[source]

Create a transformer that takes as input a ranked documents dataframe, and applies a supplied function to compute a new score. Ranks are automatically computed. doc_score() can operate row-wise, or batch-wise, depending on whether batch_size is set.

The supplied function is called once for each document, and must return a float containing the new score for that document. Each time it is called, the function is supplied with a Pandas Series representing the attributes of the query and document.

Return type:

Transformer

Parameters:
  • fn (Callable) – the function to apply to each row

  • batch_size (int or None) – how many documents to operate on at once (batch-wise)

  • verbose (bool) – if set to True, a TQDM progress bar will be displayed

Example (Row-wise):

# this transformer will subtract 5 from the score of each document
p = pt.BatchRetrieve(index, wmodel="DPH") >>
    pt.apply.doc_score(lambda doc : doc["score"] -5)

doc_score() can also be used in a batch-wise manner, which is particularly useful for applying neural models. In this case, the scoring function receives a dataframe rather than a single row, and should return one score per row (for example, a Pandas Series of the same length):

def _doclen(df):
    # returns series of lengths
    return df.text.str.len()

pipe = pt.BatchRetrieve(index) >> pt.apply.doc_score(_doclen, batch_size=128)

pyterrier.apply.doc_features(fn, *args, **kwargs)[source]

Create a transformer that takes as input a ranked documents dataframe, and applies the supplied function to each document to compute feature scores.

The supplied function is called once for each document, and must return a 1D numpy array each time. Each time it is called, the function is supplied with a Pandas Series representing the attributes of the query and document.

Return type:

Transformer

Parameters:
  • fn (Callable) – the function to apply to each row

  • verbose (bool) – if set to True, a TQDM progress bar will be displayed

Example:

# this transformer will compute the number of characters and the number of words in each document retrieved
# using the contents of the document obtained from the MetaIndex

def _features(row):
    docid = row["docid"]
    content = index.getMetaIndex().getItem("text", docid)
    f1 = len(content)
    f2 = len(content.split(" "))
    return np.array([f1, f2])

p = pt.BatchRetrieve(index, wmodel="BM25") >>
    pt.apply.doc_features(_features )
pyterrier.apply.rename(columns, *args, errors='raise', **kwargs)[source]

Creates a transformer that renames columns in a dataframe.

Return type:

Transformer

Parameters:
  • columns (dict) – A dictionary mapping from old column name to new column name

  • errors (str) – maps to the df.rename() errors kwarg - defaults to ‘raise’, alternatively can be ‘ignore’

Example:

pipe = pt.BatchRetrieve(index, metadata=["docno", "body"]) >> pt.apply.rename({'body':'text'})
pyterrier.apply.generic(fn, *args, batch_size=None, **kwargs)[source]

Create a transformer that changes the input dataframe to another dataframe in an unspecified way.

The supplied function is called once with the entire result set as a dataframe (which may contain one or more queries). Each time, it should return a new dataframe. The returned dataframe should abide by the general PyTerrier Data Model, for instance updating the rank column if the scores are amended.

Return type:

Transformer

Parameters:
  • fn (Callable) – the function to apply to the dataframe

  • batch_size (int or None) – whether to apply fn to batches of rows, or to all rows that are received at once

  • verbose (bool) – Whether to display a progress bar over batches (only used if batch_size is set).

Example:

# this transformer will remove all documents at rank 2 or higher (ranks are 0-based)

# this pipeline would remove all but the first two documents from a result set
pipe = pt.BatchRetrieve(index) >> pt.apply.generic(lambda res: res[res["rank"] < 2])

pyterrier.apply.by_query(fn, *args, batch_size=None, **kwargs)[source]

As pt.apply.generic(), except that fn receives a dataframe for one query at a time, rather than all results at once. If batch_size is set, fn will receive no more than batch_size documents for any query. The verbose kwarg controls whether to display a progress bar over queries.

Return type:

Transformer
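No example appears above for by_query(); a minimal sketch, keeping only the 10 highest-scoring documents of each query (the cutoff of 10 is arbitrary), could look like this:

# fn receives the results of a single query at a time;
# here it keeps only that query's top 10 documents by score
pipe = pt.BatchRetrieve(index) >> pt.apply.by_query(
    lambda res: res.sort_values("score", ascending=False).head(10)
)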

Making New Columns and Dropping Columns

It's also possible to construct a transformer that makes a new column on a row-wise basis by directly naming the new column in pt.apply.

For instance, if the column you are creating is called rank_2, it might be created as follows:

pipe = pt.BatchRetrieve(index) >> pt.apply.rank_2(lambda row: row["rank"] * 2)

To create a transformer that drops a column, you can instead pass drop=True as a kwarg:

pipe = pt.BatchRetrieve(index, metadata=["docno", "text"] >> pt.text.scorer() >> pt.apply.text(drop=True)