Operators on Transformers

Part of the power of PyTerrier comes from the ease with which researchers can formulate complex retrieval pipelines. This is made possible by the operators available on PyTerrier's transformer objects. The following table summarises the available operators:

Operator   Meaning
--------   -------
>>         Then - chaining pipes
+          Linear combination of scores
*          Scalar factoring of scores
&          Document set intersection
|          Document set union
%          Apply rank cutoff
^          Concatenate run with another
**         Feature union
~          Cache transformer result

NB: These operators retain their default Python operator precedence, which may not be aligned with your expectations in a PyTerrier context (e.g. % binds more tightly than >>).

Then (>>)

Apply one transformation followed by another:

#rewrites topics to include #1 etc
sdm = pt.rewrite.SDM()
br = BatchRetrieve(index, "DPH")

res = br.transform(sdm.transform(topics))

We use >> as a shorthand for then (also called compose):

res = (sdm >> br).transform(topics)

Example:

Consider a topics dataframe as follows:

qid   query
q1    test query

Then the application of SDM() would produce:

qid   query
q1    test query #1(test query) #uw8(test query)

NB: In practice the query reformulation generated by SDM() is more complex, due to the presence of weights etc in the resulting query.

Then the final res dataframe would contain the results of applying BatchRetrieve on the rewritten queries, as follows:

qid   query                                        docno   score   rank
q1    test query #1(test query) #uw8(test query)   d10     4       0
q1    test query #1(test query) #uw8(test query)   d04     3.8     1

NB: Then can also be used for retrieval and re-ranking pipelines, such as:

pipeline = BatchRetrieve(index, "DPH") >> BatchRetrieve(index, "BM25")
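The chaining behaviour of >> can be sketched as plain function composition over lists of result rows. This is a simplified, hypothetical model (real PyTerrier transformers are objects operating on pandas DataFrames), but it shows the order of application:

```python
# A minimal sketch of the "then" (>>) operator as function composition.
# Each "transformer" is modelled as a function from a list of row dicts
# to a list of row dicts; real PyTerrier transformers operate on pandas
# DataFrames and are composed with the >> operator instead.

def then(first, second):
    """Compose two transformers: apply `first`, feed its output to `second`."""
    def composed(rows):
        return second(first(rows))
    return composed

# Hypothetical query rewriter, standing in for pt.rewrite.SDM()
def rewrite(topics):
    return [{"qid": t["qid"], "query": t["query"] + " #1(" + t["query"] + ")"}
            for t in topics]

# Hypothetical retriever, standing in for BatchRetrieve
def retrieve(topics):
    return [{"qid": t["qid"], "docno": "d10", "score": 4.0, "rank": 0}
            for t in topics]

pipeline = then(rewrite, retrieve)
res = pipeline([{"qid": "q1", "query": "test query"}])
```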

Linear Combine and Scalar Factor (+, *)

The linear combine (+) and scalar factor (*) operators allow the scores of different retrieval systems to be linearly combined (with weights).

Instead of the following Python:

br_DPH = BatchRetrieve(index, "DPH")
br_BM25 = BatchRetrieve(index, "BM25")

res1 = br_DPH.transform(topics)
res2 = br_BM25.transform(topics)
# outer merge, so that documents retrieved by only one system are kept,
# contributing a score of 0 on the missing side
res = res1.merge(res2, on=["qid", "docno"], how="outer").fillna(0)
res["score"] = 2 * res["score_x"] + res["score_y"]

We use the binary + and * operators. This is natural, as it is intuitive to combine weighted retrieval functions using + and *:

br_DPH = BatchRetrieve(index, "DPH")
br_BM25 = BatchRetrieve(index, "BM25")
res = (2* br_DPH + br_BM25).transform(topics)

If the DPH and BM25 transformers respectively return:

qid   docno   score   rank
q1    d10     2       0
q1    d12     1       1

and:

qid   docno   score   rank
q1    d10     4       0
q1    d01     3       1

then the application of the transformer represented by the expression (2* br_DPH + br_BM25) would be:

qid   docno   score   rank
q1    d10     8       0
q1    d01     3       1
q1    d12     2       2

NB: Any documents not present in one of the constituent rankings will contribute a score of 0 to the final score of that document.
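The combination above can be reproduced with a pandas sketch, illustrating the weighted outer merge that the + and * operators perform. This is an illustrative model of the semantics, not PyTerrier's actual implementation:

```python
import pandas as pd

# Results of the two hypothetical systems from the tables above
dph = pd.DataFrame([("q1", "d10", 2.0), ("q1", "d12", 1.0)],
                   columns=["qid", "docno", "score"])
bm25 = pd.DataFrame([("q1", "d10", 4.0), ("q1", "d01", 3.0)],
                    columns=["qid", "docno", "score"])

# Outer merge: documents missing from one ranking contribute a score of 0
merged = dph.merge(bm25, on=["qid", "docno"], how="outer").fillna(0)
merged["score"] = 2 * merged["score_x"] + merged["score_y"]

# Re-rank by the combined score, as (2 * br_DPH + br_BM25) would
merged = merged.sort_values("score", ascending=False).reset_index(drop=True)
merged["rank"] = merged.index
```

The resulting ranking matches the combined table above: d10 (8), d01 (3), d12 (2).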

Precedence and Associativity

The + and * operators retain their classical precedence among Python's operators. This means that the intended semantics of an expression of linear combinations and scalar factors are clear: * binds more tightly than +, so 2 * br_DPH + br_BM25 is interpreted as (2 * br_DPH) + br_BM25.

Set Intersection and Union (&, |)

The intersection (&) operator retains only documents that occur in both retrieval sets, while the union (|) operator retains documents that occur in either. Scores and ranks are not returned - hence, the resulting documents would normally be re-scored:

BM25_br = BatchRetrieve(index, "BM25")
PL2_br = BatchRetrieve(index, "PL2")

res_intersection = (BM25_br & PL2_br).transform(topics)
res_union = (BM25_br | PL2_br).transform(topics)

Examples:

If the BM25 and PL2 pipelines respectively return:

qid   docno   score   rank
q1    d10     4.3     0
q1    d12     4.1     1

and:

qid   docno   score   rank
q1    d10     4.3     0
q1    d01     3.9     1

then the application of the set intersection operator (&) would result in a ranking containing only documents that appear in the results of both transformers:

qid   docno
q1    d10

and the application of the set union operator (|) would return documents retrieved by either transformer:

qid   docno
q1    d10
q1    d12
q1    d01

Note that, as these are set operators, there are no ranks and scores returned in the output.
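The set semantics can be sketched with plain Python sets over (qid, docno) pairs. This is an illustrative model; real PyTerrier transformers return DataFrames:

```python
# A sketch of the set semantics of & and |, over (qid, docno) pairs.
# Only qid/docno survive the set operators, which is why the resulting
# documents usually need to be re-scored afterwards.

bm25_res = {("q1", "d10"), ("q1", "d12")}
pl2_res = {("q1", "d10"), ("q1", "d01")}

intersection = bm25_res & pl2_res   # documents retrieved by both
union = bm25_res | pl2_res          # documents retrieved by either
```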

Rank Cutoff (%)

The % operator is called rank cutoff, and limits the number of results for each query:

pipe1 = pt.BatchRetrieve(index, "BM25") % 2

Example:

If a retrieval pipeline returns:

qid   docno   score   rank
q1    d10     4.3     0
q1    d12     4.1     1
q1    d05     3.9     2
q1    d03     3.5     3
q1    d01     2.5     4

then the application of the rank cutoff operator would be:

qid   docno   score   rank
q1    d10     4.3     0
q1    d12     4.1     1
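A per-query rank cutoff is essentially a group-wise head over the ranking. The following pandas sketch reproduces the example; it is an illustrative model, not PyTerrier's implementation:

```python
import pandas as pd

# Hypothetical ranking from the example above, already rank-ordered
res = pd.DataFrame(
    [("q1", "d10", 4.3, 0), ("q1", "d12", 4.1, 1), ("q1", "d05", 3.9, 2),
     ("q1", "d03", 3.5, 3), ("q1", "d01", 2.5, 4)],
    columns=["qid", "docno", "score", "rank"])

# % 2 keeps only the top 2 results for each query
cut = res.groupby("qid").head(2)
```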

Concatenate (^)

Sometimes, we may only want to apply an expensive re-ranking process to a few top-ranked documents, and fill the rest of the ranking with the remaining documents (removing duplicates). We can do that using the concatenate operator. Concretely, in the example below, alldocs is our candidate set of, say, 1000 documents per query. We re-rank the top 3 documents for each query using ExpensiveReranker(), in a pipeline called topdocs. We then use the concatenate operator (^) to append the remaining documents from alldocs, with their scores and ranks adjusted so that they appear just after the documents obtained from the topdocs pipeline:

alldocs = BatchRetrieve(index, "BM25")
topdocs = alldocs % 3 >> ExpensiveReranker()
finaldocs = topdocs ^ alldocs

Example:

If alldocs returns:

qid   docno   score   rank
q1    d10     4.3     0
q1    d12     4.1     1
q1    d05     3.9     2
q1    d03     3.5     3
q1    d01     2.5     4

Then topdocs would re-score the top 3 ranked documents (d10, d12, d05). After applying ExpensiveReranker() to score and re-rank these 3 documents, topdocs could be as follows:

qid   docno   score   rank
q1    d05     1.0     0
q1    d10     0.9     1
q1    d12     0.8     2

Then finaldocs would be:

qid   docno   score     rank
q1    d05     1.0       0
q1    d10     0.9       1
q1    d12     0.8       2
q1    d03     0.7999    3
q1    d01     -0.2001   4

Note that the score of d03 is adjusted to appear just below the last-ranked document from topdocs: a small value (epsilon=0.0001) is used as the minimum difference between the lowest-ranked document from topdocs and the highest remaining document from alldocs. The relative ordering of documents from alldocs is unchanged and the gaps between their scores are maintained, so the difference between d03 and d01 is a score delta of 1 in both alldocs and finaldocs.
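The score adjustment described above can be sketched in plain Python for a single query. The EPSILON constant and helper logic here are illustrative; only the resulting scores follow the documented behaviour:

```python
# A sketch of the score adjustment performed by the ^ operator for one
# query. Documents already in `topdocs` keep their scores; the remaining
# documents from `alldocs` are appended with scores shifted so they sit
# just below the last topdocs score, preserving their relative gaps.

EPSILON = 0.0001

topdocs = [("d05", 1.0), ("d10", 0.9), ("d12", 0.8)]
alldocs = [("d10", 4.3), ("d12", 4.1), ("d05", 3.9), ("d03", 3.5), ("d01", 2.5)]

seen = {docno for docno, _ in topdocs}
remaining = [(d, s) for d, s in alldocs if d not in seen]

last_score = topdocs[-1][1]
first_remaining_score = remaining[0][1]
# shift so that the best remaining document scores last_score - EPSILON
offset = last_score - EPSILON - first_remaining_score

finaldocs = topdocs + [(d, round(s + offset, 4)) for d, s in remaining]
```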

Feature Union (**)

Here we take one system, e.g. DPH, to obtain an initial candidate set, then compute additional systems as features on those candidate documents.

The equivalent Python would have looked like:

sample_br = BatchRetrieve(index, "DPH")
BM25F_br = BatchRetrieve(index, "BM25F")
PL2F_br = BatchRetrieve(index, "PL2F")

sampleRes = sample_br.transform(topics)
# assumes sampleRes contains the queries
BM25F_res = BM25F_br.transform(sampleRes)
PL2F_res = PL2F_br.transform(sampleRes)

# merge (not join) to align on the qid and docno columns
final_res = BM25F_res.merge(PL2F_res, on=["qid", "docno"])
# stack the two scores into a single per-row features array
final_res["features"] = final_res.apply(
    lambda row: np.stack([row["score_x"], row["score_y"]]), axis=1)

Instead, we use ** to denote feature union:

sample_br = BatchRetrieve(index, "DPH")
BM25F_br = BatchRetrieve(index, "BM25F")
PL2F_br = BatchRetrieve(index, "PL2F")

# ** is the feature union operator. It requires a candidate document set as input
(BM25F_br ** PL2F_br).transform(sample_br.transform(topics))
# or, combined with the then operator, >>
(sample_br >> (BM25F_br ** PL2F_br)).transform(topics)

NB: Feature union expects the document sets returned by each side of the union to be identical, and will produce a warning if they are not. Documents not returned on one side obtain a score of 0 for that feature.

Example:

For example, consider that sample_br returns a ranking as follows:

qid   docno   score   rank
q1    d10     4.3     0

Further, for document d10, BM25F and PL2F return scores respectively of 4.9 and 13.0. The application of the feature union operator above would be a ranking with features as follows:

qid   docno   score   rank   features
q1    d10     4.3     0      [4.9, 13.0]

More examples of feature union can be found in the learning-to-rank documentation (Learning to Rank).
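The per-document feature stacking shown above can be sketched with pandas. This is an illustrative model of the ** semantics (the feature dictionaries stand in for the BM25F and PL2F transformers), not PyTerrier's implementation:

```python
import pandas as pd

# Candidate set from the hypothetical first-stage retriever
sample = pd.DataFrame([("q1", "d10", 4.3, 0)],
                      columns=["qid", "docno", "score", "rank"])

# Hypothetical scores computed by each feature system for the candidates
bm25f = {("q1", "d10"): 4.9}
pl2f = {("q1", "d10"): 13.0}

# Feature union keeps the candidate ranking and attaches one feature per
# system, in order, defaulting to 0 for documents a system did not return
sample["features"] = [
    [bm25f.get(key, 0.0), pl2f.get(key, 0.0)]
    for key in zip(sample["qid"], sample["docno"])
]
```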

Precedence and Associativity

Feature union is associative, so in the following examples, x1, x2 and x3 have identical semantics:

x1 = sample_br >> ( BM25F_br ** PL2F_br ** urllen)
x2 =  sample_br >> ( (BM25F_br ** PL2F_br) ** urllen)
x3 =  sample_br >> ( BM25F_br ** (PL2F_br ** urllen))

Pipelines x1, x2 and x3 all create identical document rankings with three features, in the precise order BM25F, PL2F, urllen.

Note that ** has higher operator precedence in Python than >>, so pipeline a below is parsed in the same way as c, which is almost always the desired outcome. Nevertheless, we recommend writing the parentheses explicitly, as in c, to make the intent clear. Pipeline b has different semantics, as its feature union applies to the output of the composed (sample_br >> BM25F_br) pipeline:

# a is parsed in the same way as c
a = sample_br >> BM25F_br ** PL2F_br
b = (sample_br >> BM25F_br) ** PL2F_br
c = sample_br >> ( BM25F_br ** PL2F_br)

Caching (~)

Some transformers are expensive to apply. For instance, we might find ourselves repeatedly running our BM25 baseline. We can ask PyTerrier to cache the outcome of a transformer for a given qid by using the unary ~ operator.

Consider the following example:

from pyterrier import BatchRetrieve, Experiment
firstpass = BatchRetrieve(index, "BM25")
reranker = ~firstpass >> BatchRetrieve(index, "BM25F")
Experiment([~firstpass, ~reranker], topics, qrels)

In this example, firstpass is cached when it is used in the Experiment evaluation, as well as when it is used within reranker. We also cache the outcome of the whole reranker pipeline, so that re-running the evaluation will be faster.

By default, PyTerrier caches results to ~/.pyterrier/transformer_cache/.
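The effect of ~ can be sketched as per-query memoisation. This in-memory model is illustrative only (all names here are hypothetical); PyTerrier's real cache persists results on disk:

```python
# A sketch of the effect of the unary ~ operator: memoise a transformer's
# output per qid, so that repeated runs reuse the cached result instead of
# recomputing it.

calls = []

def expensive_transform(qid, query):
    calls.append(qid)                 # track how often the real work runs
    return [(qid, "d10", 4.0)]        # hypothetical result rows

cache = {}

def cached_transform(qid, query):
    if qid not in cache:
        cache[qid] = expensive_transform(qid, query)
    return cache[qid]

first = cached_transform("q1", "test query")
second = cached_transform("q1", "test query")   # served from the cache
```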