Operators on Transformers¶
Part of the power of PyTerrier comes from the ease with which researchers can formulate complex retrieval pipelines. This is made possible by the operators available on PyTerrier's transformer objects. The following table summarises the available operators:
| Operator | Meaning |
|---|---|
| >> | Then - chaining pipes |
| + | Linear combination of scores |
| * | Scalar factoring of scores |
| & | Document Set Intersection |
| \| | Document Set Union |
| % | Apply rank cutoff |
| ^ | Concatenate the output of one transformer with another |
| ** | Feature Union |
| ~ | Cache transformer result |
NB: These operators retain their default Python operator precedence - this may not be aligned with your expectations in a PyTerrier context (e.g. >> binds more tightly than &, so a & b >> c is parsed as a & (b >> c)).
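For example, assuming br1, br2 and reranker are previously-constructed transformers (hypothetical names used only for illustration), explicit parentheses make the intended grouping clear:
# without parentheses, this is parsed as br1 & (br2 >> reranker),
# because >> binds more tightly than & in Python
pipe_a = br1 & br2 >> reranker
# explicit parentheses give the (perhaps intended) intersect-then-rerank pipeline
pipe_b = (br1 & br2) >> reranker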
Then (>>)¶
Apply one transformation followed by another:
# rewrites topics to include #1(...) and #uw8(...) query operators
sdm = pt.rewrite.SDM()
br = pt.terrier.Retriever(index, "DPH")
res = br.transform(sdm.transform(topics))
We use >> as a shorthand for then (also called compose):
res = (sdm >> br).transform(topics)
Example:
Consider a topics dataframe as follows:
| qid | query |
|---|---|
| q1 | test query |
Then the application of SDM() would produce:
| qid | query |
|---|---|
| q1 | test query #1(test query) #uw8(test query) |
NB: In practice the query reformulation generated by SDM() is more complex, due to the presence of weights etc in the resulting query.
Then the final res dataframe would contain the results of applying a Retriever on the rewritten queries, as follows:
| qid | query | docno | score | rank |
|---|---|---|---|---|
| q1 | test query #1(test query) #uw8(test query) | d10 | 4 | 0 |
| q1 | test query #1(test query) #uw8(test query) | d04 | 3.8 | 1 |
NB: Then can also be used for retrieval and re-ranking pipelines, such as:
pipeline = pt.terrier.Retriever(index, "DPH") >> pt.terrier.Retriever(index, "BM25")
Linear Combine and Scalar Factor (+, *)¶
The linear combine (+) and scalar factor (*) operators allow the scores of different retrieval systems to be linearly combined (with weights).
Instead of the following Python:
br_DPH = pt.terrier.Retriever(index, "DPH")
br_BM25 = pt.terrier.Retriever(index, "BM25")
res1 = br_DPH.transform(topics)
res2 = br_BM25.transform(topics)
res = res1.merge(res2, on=["qid", "docno"])
res["score"] = 2 * res["score_x"] + res["score_y"]
Instead, we use the binary + and * operators. This is natural, as it is intuitive to combine weighted retrieval functions using + and *:
br_DPH = pt.terrier.Retriever(index, "DPH")
br_BM25 = pt.terrier.Retriever(index, "BM25")
res = (2 * br_DPH + br_BM25).transform(topics)
If the DPH and BM25 transformers respectively return:
| qid | docno | score | rank |
|---|---|---|---|
| q1 | d10 | 2 | 0 |
| q1 | d12 | 1 | 1 |
| qid | docno | score | rank |
|---|---|---|---|
| q1 | d10 | 4 | 0 |
| q1 | d01 | 3 | 1 |
then the application of the transformer represented by the expression (2 * br_DPH + br_BM25) would be:
| qid | docno | score | rank |
|---|---|---|---|
| q1 | d10 | 8 | 0 |
| q1 | d01 | 3 | 1 |
| q1 | d12 | 2 | 2 |
NB: Any documents not present in one of the constituent rankings will contribute a score of 0 to the final score of that document.
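To make this behaviour concrete, here is a minimal pandas sketch (re-using the res1 and res2 dataframes from the snippet above) of how a manual combination with the same semantics could be written - an illustration, not the actual implementation:
# an outer merge keeps documents that appear in only one ranking;
# filling their missing score with 0 mirrors the behaviour of the + operator
merged = res1.merge(res2, on=["qid", "docno"], how="outer", suffixes=("_dph", "_bm25"))
merged[["score_dph", "score_bm25"]] = merged[["score_dph", "score_bm25"]].fillna(0)
merged["score"] = 2 * merged["score_dph"] + merged["score_bm25"]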
Precedence and Associativity
The + and * operators retain their classical precedence among Python's operators. This means that the intended semantics of an expression of linear combinations and scalar factors are clear - indeed, * binds more tightly than +, so 2 * br_DPH + br_BM25 is interpreted as (2 * br_DPH) + br_BM25.
Set Intersection and Union (&, |)¶
The intersection (&) and union (|) operators compute the set of documents that occur in both retrieval sets, or in either retrieval set, respectively. Scores and ranks are not returned - hence, the resulting documents would normally be re-scored:
BM25_br = pt.terrier.Retriever(index, "BM25")
PL2_br = pt.terrier.Retriever(index, "PL2")
res_intersection = (BM25_br & PL2_br).transform(topics)
res_union = (BM25_br | PL2_br).transform(topics)
Examples:
If the BM25 and PL2 pipelines respectively return:
| qid | docno | score | rank |
|---|---|---|---|
| q1 | d10 | 4.3 | 0 |
| q1 | d12 | 4.1 | 1 |
| qid | docno | score | rank |
|---|---|---|---|
| q1 | d10 | 4.3 | 0 |
| q1 | d01 | 3.9 | 1 |
then the application of the set intersection operator (&) would result in a ranking containing only the documents that appear in the results of both transformers:
| qid | docno |
|---|---|
| q1 | d10 |
and the application of the set union operator (|) would return documents retrieved by either transformer:
| qid | docno |
|---|---|
| q1 | d10 |
| q1 | d12 |
| q1 | d01 |
Note that, as these are set operators, there are no ranks and scores returned in the output.
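One common follow-up, sketched below, is to re-score the resulting document set by composing the set operator with a further retrieval transformer (here re-using BM25_br from above), which assigns scores and ranks to the selected documents:
# documents from the union carry no scores, so re-score them with BM25
rescored = ((BM25_br | PL2_br) >> BM25_br).transform(topics)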
Rank Cutoff (%)¶
The % operator is called rank cutoff, and limits the number of results for each query:
pipe1 = pt.terrier.Retriever(index, "BM25") % 2
Example:
If a retrieval pipeline returns:
| qid | docno | score | rank |
|---|---|---|---|
| q1 | d10 | 4.3 | 0 |
| q1 | d12 | 4.1 | 1 |
| q1 | d05 | 3.9 | 2 |
| q1 | d03 | 3.5 | 3 |
| q1 | d01 | 2.5 | 4 |
then the application of the rank cutoff operator would be:
| qid | docno | score | rank |
|---|---|---|---|
| q1 | d10 | 4.3 | 0 |
| q1 | d12 | 4.1 | 1 |
Concatenate (^)¶
Sometimes, we may only want to apply an expensive retrieval process to a few top-ranked documents, and fill up the rest of the ranking with the remaining documents (removing duplicates). We can do that using the concatenate operator. Concretely, in the example below, alldocs is our candidate set, of say 1000 documents per query. We re-rank the top 3 documents for each query using ExpensiveReranker(), in a pipeline called topdocs. We then use the concatenate operator (^) to append the remaining documents from alldocs, such that their scores and ranks are adjusted to appear just after the documents obtained from the topdocs pipeline:
alldocs = pt.terrier.Retriever(index, "BM25")
topdocs = alldocs % 3 >> ExpensiveReranker()
finaldocs = topdocs ^ alldocs
Example:
If alldocs returns:
| qid | docno | score | rank |
|---|---|---|---|
| q1 | d10 | 4.3 | 0 |
| q1 | d12 | 4.1 | 1 |
| q1 | d05 | 3.9 | 2 |
| q1 | d03 | 3.5 | 3 |
| q1 | d01 | 2.5 | 4 |
Then topdocs would compute scores for only the top 3 ranked documents (d10, d12, d05). After applying ExpensiveReranker() to score and re-rank these 3 documents, topdocs could be as follows:
| qid | docno | score | rank |
|---|---|---|---|
| q1 | d05 | 1.0 | 0 |
| q1 | d10 | 0.9 | 1 |
| q1 | d12 | 0.8 | 2 |
Then finaldocs would be:
| qid | docno | score | rank |
|---|---|---|---|
| q1 | d05 | 1.0 | 0 |
| q1 | d10 | 0.9 | 1 |
| q1 | d12 | 0.8 | 2 |
| q1 | d03 | 0.7999 | 3 |
| q1 | d01 | -0.2001 | 4 |
Note that the score of d03 is adjusted to appear just below the last-ranked document from topdocs - a small value (epsilon=0.0001) is used as the minimum difference between the lowest-ranked document from topdocs and the highest-scoring remaining document from alldocs. The relative ordering of documents from alldocs is unchanged, and the gaps between their scores are maintained, so the difference between d03 and d01 is a score delta of -1 in both alldocs and finaldocs.
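The adjusted scores in this example can be reproduced with the following arithmetic sketch (an illustration of the behaviour described above, not the actual implementation):
epsilon = 0.0001
last_topdocs_score = 0.8    # score of d12, the last document from topdocs
best_remaining_score = 3.5  # score of d03, the best remaining document in alldocs
# each remaining document is shifted so the best of them sits epsilon below d12
adjusted_d03 = last_topdocs_score - epsilon - (best_remaining_score - 3.5)  # 0.7999
adjusted_d01 = last_topdocs_score - epsilon - (best_remaining_score - 2.5)  # -0.2001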
Feature Union (**)¶
Here we take one system, e.g. DPH, to get an initial candidate set, then add more systems as features.
Written directly in Python, this would have looked like:
import numpy as np
sample_br = pt.terrier.Retriever(index, "DPH")
BM25F_br = pt.terrier.Retriever(index, "BM25F")
PL2F_br = pt.terrier.Retriever(index, "PL2F")
sampleRes = sample_br.transform(topics)
# assumes sampleRes contains the queries
BM25F_res = BM25F_br.transform(sampleRes)
PL2F_res = PL2F_br.transform(sampleRes)
final_res = BM25F_res.merge(PL2F_res, on=["qid", "docno"])
# combine the two scores into a single per-document feature array
final_res["features"] = final_res.apply(
    lambda row: np.array([row["score_x"], row["score_y"]]), axis=1)
Instead, we use ** to denote feature union:
sample_br = pt.terrier.Retriever(index, "DPH")
BM25F_br = pt.terrier.Retriever(index, "BM25F")
PL2F_br = pt.terrier.Retriever(index, "PL2F")
# ** is the feature union operator. It requires a candidate document set as input
(BM25F_br ** PL2F_br).transform(sample_br.transform(topics))
# or combined with the then operator, >>
(sample_br >> (BM25F_br ** PL2F_br)).transform(topics)
NB: Feature union expects the documents returned by each side of the union to be identical. It will produce a warning if they are not. Documents not returned by one side will obtain a score of 0 for that feature.
Example:
For example, consider that sample_br returns a ranking as follows:
| qid | docno | score | rank |
|---|---|---|---|
| q1 | d10 | 4.3 | 0 |
Further, for document d10, BM25F and PL2F return scores of 4.9 and 13.0, respectively. The application of the feature union operator above would produce a ranking with features as follows:
| qid | docno | score | rank | features |
|---|---|---|---|---|
| q1 | d10 | 4.3 | 0 | [4.9, 13.0] |
More examples of feature union can be found in the learning-to-rank documentation (Learning to Rank).
Precedence and Associativity
Feature union is associative, so in the following examples, x1, x2 and x3 have identical semantics:
x1 = sample_br >> ( BM25F_br ** PL2F_br ** urllen)
x2 = sample_br >> ( (BM25F_br ** PL2F_br) ** urllen)
x3 = sample_br >> ( BM25F_br ** (PL2F_br ** urllen))
Pipelines x1, x2 and x3 all create identical document rankings with three features, in the precise order BM25F, PL2F and urllen.
Note that ** has higher operator precedence in Python than >>, so a feature union on the right-hand side of >> does not strictly require parentheses; nevertheless, parentheses make the intended grouping explicit. In the example below, a and c have identical semantics - almost always the desired outcome - while b instead applies the feature union to the composed retrieval pipeline:
# a is parsed in the same way as c, which is usually the desired outcome
a = sample_br >> BM25F_br ** PL2F_br
b = (sample_br >> BM25F_br) ** PL2F_br
c = sample_br >> ( BM25F_br ** PL2F_br)
Caching Transformers¶
Some transformers are expensive to apply. When conducting experiments, you may find it useful to use pyterrier-caching to cache the results of a transformer.
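As listed in the operator table above, the ~ operator caches a transformer's results; a minimal sketch of its use is shown below, while pyterrier-caching offers finer-grained control:
# ~ wraps the retriever so that repeated invocations with the same
# topics can reuse previously computed results
cached_br = ~pt.terrier.Retriever(index, "BM25")
res = cached_br.transform(topics)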