Running Experiments

PyTerrier aims to make it easy to conduct an information retrieval experiment, namely, to run a transformer pipeline over a set of queries and evaluate the outcome using standard information retrieval evaluation metrics based on known relevant documents (obtained from a set of relevance assessments, also known as qrels). The evaluation metrics are calculated by the pytrec_eval library, a Python wrapper around the widely-used trec_eval evaluation tool.

The main way to achieve this is using pt.Experiment().

API

pyterrier.Experiment()

Allows easy comparison of multiple retrieval transformer pipelines using a common set of topics, and identical evaluation measures computed using the same qrels. In essence, each transformer is applied on the provided set of topics. Then the named trec_eval evaluation measures are computed (using pt.Utils.evaluate()) for each system.

Parameters
  • retr_systems (list) – A list of transformers to evaluate. If you already have the results for one (or more) of your systems, a results dataframe can also be used here. Results produced by the transformers must have “qid”, “docno”, “score”, “rank” columns.

  • topics – Either a path to a topics file or a pandas.DataFrame with columns=['qid', 'query']

  • qrels – Either a path to a qrels file or a pandas.DataFrame with columns=['qid', 'docno', 'label']

  • eval_metrics (list) – Which evaluation metrics to use. E.g. [‘map’]

  • names (list) – List of names for each retrieval system when presenting the results. Default=None. If None: Obtains the str() representation of each transformer as its name.

  • batch_size (int) – If not None, evaluation is conducted in batches of batch_size topics. Default=None, which evaluates all topics at once. Applying a batch_size is useful if you have large numbers of topics, and/or if your pipeline requires large amounts of temporary memory during a run.

  • filter_by_qrels (bool) – If True, will drop topics from the topics dataframe that have qids not appearing in the qrels dataframe.

  • filter_by_topics (bool) – If True, will drop entries from the qrels dataframe that have qids not appearing in the topics dataframe.

  • perquery (bool) – If True return each metric for each query, else return mean metrics across all queries. Default=False.

  • dataframe (bool) – If True return results as a dataframe, else as a dictionary of dictionaries. Default=True.

  • baseline (int) – If set to the index of an item of the retr_systems list, will calculate the number of queries improved, degraded and the statistical significance (paired t-test p value) for each measure. Default=None: If None, no additional columns will be added for each measure.

  • test (string) – Which significance testing approach to apply. Defaults to "t". An alternative is "wilcoxon", which is not typically used for IR experiments. A Callable can also be passed; it should follow the specification of scipy.stats.ttest_rel(), i.e. it expects two arrays of numbers and returns an array or tuple, of which the second value will be placed in the p-value column.

  • correction (string) – Whether any multiple testing correction should be applied. E.g. ‘bonferroni’, ‘holm’, ‘hs’ aka ‘holm-sidak’. Default is None. Additional columns are added denoting whether the null hypothesis can be rejected, and the corrected p value. See statsmodels.stats.multitest.multipletests() for more information about available testing correction.

  • correction_alpha (float) – What alpha value for multiple testing correction. Default is 0.05.

  • highlight (str) – If highlight=”bold”, highlights in bold the best measure value in each column; if highlight=”color” or “colour”, then the cell with the highest metric value will have a green background.

  • round (int) – How many decimal places to round each measure value to. This can also be a dictionary mapping measure name to number of decimal places. Default is None, which is no rounding.

  • verbose (bool) – If True, a tqdm progress bar is shown as systems (or systems*batches if batch_size is set) are executed. Default=False.

Returns

A DataFrame with one row for each retrieval system and one column for each evaluated metric.
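To make the description above concrete, the following is a minimal pure-Python sketch of what such an experiment does conceptually: each system is applied to every topic, each measure is computed per query against the qrels, and the per-query values are averaged. It is not PyTerrier's implementation; run_experiment, reciprocal_rank and the toy rankers are hypothetical names introduced only for illustration.

```python
from statistics import mean


def reciprocal_rank(ranking, judgments):
    """1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, docno in enumerate(ranking, start=1):
        if judgments.get(docno, 0) > 0:
            return 1.0 / rank
    return 0.0


def run_experiment(systems, topics, qrels, measures, names=None):
    """Hypothetical stand-in for the experiment loop, not PyTerrier's code.

    systems:  callables mapping a query string to a ranked list of docnos
    topics:   dict of qid -> query text
    qrels:    dict of qid -> {docno: label}
    measures: dict of measure name -> function(ranking, judgments)
    """
    names = names or [str(system) for system in systems]
    rows = []
    for name, system in zip(names, systems):
        row = {"name": name}
        for measure_name, measure in measures.items():
            # Score each judged topic, then average over topics.
            per_query = [measure(system(query), qrels[qid])
                         for qid, query in topics.items() if qid in qrels]
            row[measure_name] = mean(per_query)
        rows.append(row)
    return rows


topics = {"q1": "chemical reactions", "q2": "neural networks"}
qrels = {"q1": {"d2": 1}, "q2": {"d1": 1}}
systems = [lambda query: ["d1", "d2", "d3"],   # toy rankers that ignore the query
           lambda query: ["d3", "d2", "d1"]]

results = run_experiment(systems, topics, qrels,
                         {"recip_rank": reciprocal_rank}, names=["A", "B"])
print(results)
```

Each returned row holds a system name and one averaged value per measure, which mirrors the shape of the dataframe described above.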

Examples

Average Effectiveness

Getting average effectiveness over a set of topics:

import pyterrier as pt

dataset = pt.get_dataset("vaswani")
# the vaswani dataset provides an index, topics and qrels

# let's generate two BatchRetrieve (BR) transformers to compare
tfidf = pt.BatchRetrieve(dataset.get_index(), wmodel="TF_IDF")
bm25 = pt.BatchRetrieve(dataset.get_index(), wmodel="BM25")

pt.Experiment(
    [tfidf, bm25],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "recip_rank"]
)

The returned dataframe is as follows:

   name        map       recip_rank
0  BR(TF_IDF)  0.290905  0.699168
1  BR(BM25)    0.296517  0.725665

Each row represents one system. We can manually set the names of the systems, using the names= kwarg, as follows:

pt.Experiment(
    [tfidf, bm25],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "recip_rank"],
    names=["TF_IDF", "BM25"]
)

This produces dataframes that are more easily interpretable.

   name    map       recip_rank
0  TF_IDF  0.290905  0.699168
1  BM25    0.296517  0.725665

We can also reduce the number of decimal places reported using the round= kwarg, as follows:

pt.Experiment(
    [tfidf, bm25],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "recip_rank"],
    round={"map" : 4, "recip_rank" : 3},
    names=["TF_IDF", "BM25"]
)

The result is as follows:

   name    map     recip_rank
0  TF_IDF  0.2909  0.699
1  BM25    0.2965  0.726

Passing an integer value to round= (e.g. round=3) applies rounding to all evaluation measures.
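The two forms of round= can be illustrated with a small stand-alone sketch. The helper round_measures is hypothetical and only mimics the behaviour described above (an integer applies to every measure, a dictionary applies per measure); it is not part of PyTerrier.

```python
def round_measures(row, spec=None):
    """Round the measure values of one result row.

    spec may be None (no rounding), an int (applied to every measure),
    or a dict mapping measure name -> number of decimal places.
    """
    if spec is None:
        return dict(row)
    rounded = {}
    for key, value in row.items():
        if not isinstance(value, float):
            rounded[key] = value                  # e.g. the system name
        elif isinstance(spec, int):
            rounded[key] = round(value, spec)
        else:
            places = spec.get(key)
            rounded[key] = value if places is None else round(value, places)
    return rounded


row = {"name": "BM25", "map": 0.296517, "recip_rank": 0.725665}
per_measure = round_measures(row, {"map": 4, "recip_rank": 3})
uniform = round_measures(row, 3)
print(per_measure)
print(uniform)
```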

Significance Testing

We can perform significance testing by specifying the index of which transformer we consider to be our baseline, e.g. baseline=0:

pt.Experiment(
    [tfidf, bm25],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "recip_rank"],
    names=["TF_IDF", "BM25"],
    baseline=0
)

In this case, additional columns are returned for each measure, indicating the number of queries improved compared to the baseline, the number of queries degraded, as well as the paired t-test p-value for the difference between each row and the baseline row. NB: For the baseline, these values are NaN (not applicable).

   name    map       recip_rank  map +  map -  map p-value  recip_rank +  recip_rank -  recip_rank p-value
0  TF_IDF  0.290905  0.699168    nan    nan    nan          nan           nan           nan
1  BM25    0.296517  0.725665    46     45     0.237317     16            3             0.0258549

For this test collection, there is no significant difference between the TF_IDF and BM25 weighting models in terms of MAP, but there is a significant difference in terms of mean reciprocal rank (p<0.05). Indeed, while BM25 improves average precision for 46 queries over TF_IDF, it degrades it for 45; on the other hand, BM25 improves the rank of the first relevant document for 16 queries over TF_IDF and degrades it for only 3.
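The "+" and "-" columns and the paired test can be sketched from per-query scores as follows. This is an illustrative stand-alone computation, not PyTerrier's code: compare_to_baseline and the scores are hypothetical, and only the t statistic is computed here, since turning it into a p-value needs the t distribution (for example via scipy.stats.ttest_rel, which the test parameter above refers to).

```python
from math import sqrt
from statistics import mean, stdev


def compare_to_baseline(baseline_scores, system_scores):
    """Count improved/degraded queries and compute the paired t statistic.

    Both arguments map qid -> per-query measure value over the same qids.
    """
    qids = sorted(baseline_scores)
    diffs = [system_scores[qid] - baseline_scores[qid] for qid in qids]
    improved = sum(1 for d in diffs if d > 0)
    degraded = sum(1 for d in diffs if d < 0)
    # Paired t statistic: mean difference divided by its standard error.
    t_stat = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
    return improved, degraded, t_stat


# Hypothetical per-query values for four queries.
baseline = {"q1": 0.50, "q2": 0.25, "q3": 1.00, "q4": 0.20}
system = {"q1": 1.00, "q2": 0.50, "q3": 1.00, "q4": 0.10}

improved, degraded, t_stat = compare_to_baseline(baseline, system)
print(improved, degraded, t_stat)
```

Queries with identical scores (q3 here) count as neither improved nor degraded, which is why the two counts need not sum to the number of queries.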

Furthermore, modern experimental convention suggests that it is important to correct for multiple testing in the comparative evaluation of many IR systems. pt.Experiment() supports the multiple testing correction methods provided by the statsmodels package, such as Bonferroni:

pt.Experiment(
    [tfidf, bm25],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map"],
    names=["TF_IDF", "BM25"],
    baseline=0,
    correction='bonferroni'
)

This adds two further columns for each measure, denoting if the null hypothesis can be rejected (e.g. “map reject”), as well as the corrected p value (“map p-value corrected”), as shown below:

   name    map       map +  map -  map p-value  map reject  map p-value corrected
0  TF_IDF  0.290905  nan    nan    nan          False       nan
1  BM25    0.296517  46     45     0.237317     False       0.474634
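As a rough illustration of what a Bonferroni correction does, the sketch below multiplies each raw p-value by the number of tests (capped at 1) and rejects the null hypothesis when the raw p-value is at most alpha divided by the number of tests. The function and the p-values are hypothetical; PyTerrier delegates the real computation to statsmodels, and how it counts the number of tests is not described here.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction over a list of raw p-values."""
    m = len(p_values)
    corrected = [min(1.0, p * m) for p in p_values]
    # Comparing p against alpha / m is equivalent to comparing p * m against alpha.
    reject = [p <= alpha / m for p in p_values]
    return reject, corrected


# Hypothetical raw p-values from three comparisons against a baseline.
reject, corrected = bonferroni([0.01, 0.04, 0.30])
print(reject, corrected)
```

In the table above, the corrected value 0.474634 is exactly twice the raw p-value 0.237317, which is what a Bonferroni factor of 2 would give.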

The table below summarises the multiple testing correction methods supported:

    Aliases                                                  Correction Method
0   ['b', 'bonf', 'bonferroni']                              Bonferroni
1   ['s', 'sidak']                                           Sidak
2   ['h', 'holm']                                            Holm
3   ['hs', 'holm-sidak']                                     Holm-Sidak
4   ['sh', 'simes-hochberg']                                 Simes-Hochberg
5   ['ho', 'hommel']                                         Hommel
6   ['fdr_bh', 'fdr_i', 'fdr_p', 'fdri', 'fdrp']             FDR Benjamini-Hochberg
7   ['fdr_by', 'fdr_n', 'fdr_c', 'fdrn', 'fdrcorr']          FDR Benjamini-Yekutieli
8   ['fdr_tsbh', 'fdr_2sbh']                                 FDR 2-stage Benjamini-Hochberg
9   ['fdr_tsbky', 'fdr_2sbky', 'fdr_twostage']               FDR 2-stage Benjamini-Krieger-Yekutieli
10  ['fdr_gbs']                                              FDR adaptive Gavrilov-Benjamini-Sarkar

Any value in the Aliases column can be passed to Experiment’s correction= kwarg.

Per-query Effectiveness

Finally, if necessary, we can request per-query performances using the perquery=True kwarg:

pt.Experiment(
    [tfidf, bm25],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "recip_rank"],
    names=["TF_IDF", "BM25"],
    perquery=True
)

This provides a dataframe where each row is the performance of a given system for a given query on a particular evaluation measure.

     name  qid  measure     value
186  BM25  1    map         0.26794
187  BM25  1    recip_rank  1
204  BM25  10   map         0.115631
205  BM25  10   recip_rank  0.5
206  BM25  11   map         0.0776046

NB: For brevity, we only show the top 5 rows of the returned table.
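Per-query rows in this long format can be aggregated back into averages per system and measure. The sketch below does this in plain Python over the five rows shown above; mean_per_measure is a hypothetical helper, and the resulting averages cover only these rows, not the whole topic set.

```python
from collections import defaultdict
from statistics import mean

# The five rows shown above, as (name, qid, measure, value) tuples.
rows = [
    ("BM25", "1", "map", 0.26794),
    ("BM25", "1", "recip_rank", 1.0),
    ("BM25", "10", "map", 0.115631),
    ("BM25", "10", "recip_rank", 0.5),
    ("BM25", "11", "map", 0.0776046),
]


def mean_per_measure(rows):
    """Average the per-query values of each (system name, measure) pair."""
    grouped = defaultdict(list)
    for name, qid, measure, value in rows:
        grouped[(name, measure)].append(value)
    return {key: mean(values) for key, values in grouped.items()}


averages = mean_per_measure(rows)
print(averages)
```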

Missing Topics and/or Qrels

There is not always a one-to-one correspondence between the topic/query IDs (qids) that appear in the provided topics and qrels. Qids that appear in topics but not qrels can be due to incomplete judgments, such as in sparsely labeled datasets or shared tasks that choose to omit some topics (e.g., due to cost). Qids that appear in qrels but not in topics can happen when running a subset of topics for testing purposes (e.g., topics.head(5)).

The filter_by_qrels and filter_by_topics parameters control the behaviour of an experiment when topics and qrels do not perfectly overlap. When filter_by_qrels=True, topics are filtered down to only the ones that have qids in the qrels. Similarly, when filter_by_topics=True, qrels are filtered down to only the ones that have qids in the topics.

For example, consider topics that include qids A and B and qrels that include B and C. The results with each combination of settings are:

filter_by_topics  filter_by_qrels  Results consider  Notes
True (default)    False (default)  A,B               C is removed because it does not appear in the topics.
True (default)    True             B                 Acts as an intersection of the qids found in the qrels and topics.
False             False (default)  A,B,C             Acts as a union of the qids found in qrels and topics.
False             True             B,C               A is removed because it does not appear in the qrels.
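The four combinations above reduce to simple set operations. The following sketch reproduces the "Results consider" column for the A/B/C example; considered_qids is a hypothetical helper written for illustration and does not reflect how PyTerrier filters its dataframes internally.

```python
def considered_qids(topic_qids, qrel_qids, filter_by_topics=True, filter_by_qrels=False):
    """Return the set of qids that the results would consider."""
    topics, qrels = set(topic_qids), set(qrel_qids)
    # filter_by_qrels drops topics whose qid has no judgments.
    kept_topics = topics & qrels if filter_by_qrels else topics
    # filter_by_topics drops judgments whose qid is not among the topics.
    kept_qrels = qrels & topics if filter_by_topics else qrels
    return kept_topics | kept_qrels


topic_qids = ["A", "B"]
qrel_qids = ["B", "C"]
for by_topics in (True, False):
    for by_qrels in (False, True):
        qids = considered_qids(topic_qids, qrel_qids, by_topics, by_qrels)
        print(by_topics, by_qrels, sorted(qids))
```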

Note that, following IR evaluation conventions, topics that have no relevance judgments (A in the above example) do not contribute to relevance-based measures (e.g., map), but still contribute to efficiency measures (e.g., mrt). As such, aggregate relevance-based measures will not change based on the value of filter_by_qrels. When perquery=True, topics that have no relevance judgments (A) will give a value of NaN, indicating that they are not defined and should not contribute to the average.

The defaults (filter_by_topics=True and filter_by_qrels=False) were chosen because they likely reflect the intent of the user in most cases. In particular, they run all requested topics and evaluate on only those topics. However, you may want to change these settings in some circumstances, e.g.:

  • If you want to save time and avoid running topics that will not be evaluated, set filter_by_qrels=True. This can be particularly helpful for large collections with many missing judgments, such as MS MARCO.

  • If you want to evaluate across all topics from the qrels, set filter_by_topics=False.

Note that in all cases, if a requested topic that appears in the qrels returns no results, it will properly contribute a score of 0 for evaluation.

Available Evaluation Measures

All trec_eval evaluation measures are available. Commonly used measures, together with the names that must be used to request them, are:

  • Mean Average Precision (map).

  • Mean Reciprocal Rank (recip_rank).

  • Normalized Discounted Cumulative Gain (ndcg), or calculated at a given rank cutoff (e.g. ndcg_cut_5).

  • Number of queries (num_q) - not averaged.

  • Number of retrieved documents (num_ret) - not averaged.

  • Number of relevant documents (num_rel) - not averaged.

  • Number of relevant documents retrieved (num_rel_ret) - not averaged.

  • Interpolated recall precision curves (iprec_at_recall). This is a family of measures, so requesting iprec_at_recall will output measurements for IPrec@0.00, IPrec@0.10, etc.

  • Precision at rank cutoff (e.g. P_5).

  • Recall (recall), which will generate recall at different cutoffs, such as recall_5, etc.

  • Mean response time (mrt) will report the average number of milliseconds to conduct a query (this is calculated by pt.Experiment() directly, not pytrec_eval).

  • trec_eval measure families such as official, set and all_trec will be expanded. These result in many measures being returned. For instance, asking for official results in the following (very wide) output reporting the usual default metrics of trec_eval:

   name    P@5       P@10      P@15      P@20      P@30      P@100     P@200      P@500      P@1000     Rprec     Bpref     IPrec@0.0  IPrec@0.1  IPrec@0.2  IPrec@0.3  IPrec@0.4  IPrec@0.5  IPrec@0.6  IPrec@0.7  IPrec@0.8  IPrec@0.9  IPrec@1.0  AP        NumQ  NumRel  NumRet(rel=1)  NumRet  RR
0  TF_IDF  0.473118  0.35914   0.304659  0.273118  0.236201  0.126344  0.0785484  0.0379785  0.0208387  0.301125  0.934214  0.725414   0.640478   0.520824   0.406439   0.336222   0.267922   0.197803   0.151191   0.104208   0.0575086  0.0240411  0.290905  93    2083    1938           91930   0.699168
1  BM25    0.460215  0.352688  0.302509  0.269892  0.236918  0.126667  0.0779032  0.0379785  0.020828   0.303662  0.934607  0.751261   0.658588   0.523882   0.411899   0.345536   0.274788   0.207526   0.159466   0.105463   0.0579222  0.0238924  0.296517  93    2083    1937           91930   0.725665

See also a list of common TREC eval measures.

Evaluation Measures Objects

Using the ir_measures Python package, PyTerrier supports evaluation measure objects. These make it easier to express measure configurations such as rank cutoffs:

from pyterrier.measures import *
pt.Experiment(
    [tfidf, bm25],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=[AP, RR, nDCG@5],
)

NB: We have to use from pyterrier.measures import *, as from pt.measures import * won't work.

More specifically, let's consider the TREC Deep Learning track passage ranking task, which requires NDCG@10, NDCG@100 (using graded labels), as well as MRR@10 and MAP using binary labels (where relevant is grade 2 and above). The necessary incantation of pt.Experiment() looks like:

from pyterrier.measures import *
dataset = pt.get_dataset("trec-deep-learning-passages")
pt.Experiment(
    [tfidf, bm25],
    dataset.get_topics("test-2019"),
    dataset.get_qrels("test-2019"),
    eval_metrics=[RR(rel=2), nDCG@10, nDCG@100, AP(rel=2)],
)

The available evaluation measure objects are listed below.

pyterrier.measures.P(**kwargs)

Basic measure that computes the percentage of documents in the top cutoff results that are labeled as relevant. cutoff is a required parameter, and can be provided as P@cutoff.

@misc{rijsbergen:1979:ir,
  title={Information Retrieval.},
  author={Van Rijsbergen, Cornelis J},
  year={1979},
  publisher={USA: Butterworth-Heinemann}
}
pyterrier.measures.R(**kwargs)

Recall@k (R@k). The fraction of relevant documents for a query that have been retrieved by rank k.

NOTE: Some tasks define Recall@k as whether any relevant documents are found in the top k results. This software follows the TREC convention and refers to that measure as Success@k.
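As a hedged illustration of the P@k and R@k definitions above, the sketch below computes both for a single query. The function names, the rel parameter handling, and the example data are assumptions made for this sketch; values are returned as fractions in [0, 1], as in the tables earlier on this page, and precision divides by k even when fewer than k documents are retrieved.

```python
def precision_at_k(ranking, judgments, k, rel=1):
    """Fraction of the top-k results labeled with at least `rel`."""
    hits = sum(1 for docno in ranking[:k] if judgments.get(docno, 0) >= rel)
    return hits / k


def recall_at_k(ranking, judgments, k, rel=1):
    """Fraction of all relevant documents that appear in the top-k results."""
    total_relevant = sum(1 for label in judgments.values() if label >= rel)
    if total_relevant == 0:
        return 0.0
    hits = sum(1 for docno in ranking[:k] if judgments.get(docno, 0) >= rel)
    return hits / total_relevant


# Hypothetical ranking and judgments: d3 and d9 are unjudged, d5 is never retrieved.
ranking = ["d3", "d1", "d7", "d2", "d9"]
judgments = {"d1": 1, "d2": 1, "d5": 1, "d7": 0}

print(precision_at_k(ranking, judgments, 5), recall_at_k(ranking, judgments, 5))
```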

pyterrier.measures.AP(**kwargs)

The [Mean] Average Precision ([M]AP). The average precision of a single query is the mean of the precision scores at each relevant item returned in a search results list.

AP is typically used for adhoc ranking tasks where getting as many relevant items as possible is important. It is commonly referred to as MAP when the mean of AP is taken over the query set.

@article{Harman:1992:ESIR,
  author = {Donna Harman},
  title = {Evaluation Issues in Information Retrieval},
  journal = {Information Processing and Management},
  volume = {28},
  number = {4},
  pages = {439--440},
  year = {1992},
}
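A minimal sketch of average precision for one query follows, assuming the common trec_eval-style convention in which the sum of precisions at each retrieved relevant document is divided by the total number of relevant documents (so relevant documents that are never retrieved contribute zero). The function name and example data are hypothetical.

```python
def average_precision(ranking, judgments, rel=1):
    """Average precision of one ranked list against one query's judgments."""
    total_relevant = sum(1 for label in judgments.values() if label >= rel)
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docno in enumerate(ranking, start=1):
        if judgments.get(docno, 0) >= rel:
            hits += 1
            precision_sum += hits / rank   # precision at this relevant document
    return precision_sum / total_relevant


# Hypothetical data: relevant documents at ranks 2 and 4, one never retrieved.
ranking = ["d3", "d1", "d7", "d2", "d9"]
judgments = {"d1": 1, "d2": 1, "d5": 1, "d7": 0}
ap = average_precision(ranking, judgments)
print(ap)
```

MAP is then the mean of these per-query values over the query set.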
pyterrier.measures.RR(**kwargs)

The [Mean] Reciprocal Rank ([M]RR) is a precision-focused measure that scores based on the reciprocal of the rank of the highest-ranked relevant document. An optional cutoff can be provided to limit the depth explored. rel (default 1) controls which relevance level is considered relevant.

@article{kantor2000trec,
  title={The TREC-5 Confusion Track},
  author={Kantor, Paul and Voorhees, Ellen},
  journal={Information Retrieval},
  volume={2},
  number={2-3},
  pages={165--176},
  year={2000}
}
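The effect of the cutoff and rel parameters on reciprocal rank can be sketched for one query as below. This is an illustrative implementation under the definition given above, with hypothetical names and data, not the code used by ir_measures or pytrec_eval.

```python
def reciprocal_rank(ranking, judgments, rel=1, cutoff=None):
    """1/rank of the first document labeled at least `rel`, within `cutoff`."""
    considered = ranking if cutoff is None else ranking[:cutoff]
    for rank, docno in enumerate(considered, start=1):
        if judgments.get(docno, 0) >= rel:
            return 1.0 / rank
    return 0.0


# Hypothetical graded judgments: d1 is grade 1, d2 is grade 2, d4 is unjudged.
ranking = ["d4", "d1", "d2"]
judgments = {"d1": 1, "d2": 2}

rr_default = reciprocal_rank(ranking, judgments)            # first grade >= 1 is at rank 2
rr_strict = reciprocal_rank(ranking, judgments, rel=2)      # first grade >= 2 is at rank 3
rr_cut = reciprocal_rank(ranking, judgments, cutoff=1)      # nothing relevant in the top 1
print(rr_default, rr_strict, rr_cut)
```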
pyterrier.measures.nDCG(**kwargs)

The normalized Discounted Cumulative Gain (nDCG). It uses graded labels, rewarding systems that put the highest-graded documents at the top of the ranking. It is normalized with respect to the ideal DCG, i.e. the DCG of the documents ranked in descending order of graded label.

@article{Jarvelin:2002:CGE:582415.582418,
  author = {J{\"a}rvelin, Kalervo and Kek{\"a}l{\"a}inen, Jaana},
  title = {Cumulated Gain-based Evaluation of IR Techniques},
  journal = {ACM Trans. Inf. Syst.},
  volume = {20},
  number = {4},
  year = {2002},
  pages = {422--446},
  numpages = {25},
  url = {http://doi.acm.org/10.1145/582415.582418},
}
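The normalization described above can be sketched as follows for a single query. This sketch assumes a linear gain (the label itself) and a log2(rank + 1) discount; other formulations, such as an exponential gain, exist, and the function names and data here are hypothetical.

```python
from math import log2


def dcg(gains):
    """Discounted cumulative gain of a list of gains in rank order."""
    return sum(gain / log2(rank + 1) for rank, gain in enumerate(gains, start=1))


def ndcg(ranking, judgments, cutoff=None):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    gains = [judgments.get(docno, 0) for docno in ranking]
    ideal = sorted(judgments.values(), reverse=True)
    if cutoff is not None:
        gains, ideal = gains[:cutoff], ideal[:cutoff]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical graded judgments for one query.
judgments = {"d1": 1, "d2": 0, "d3": 2}
score = ndcg(["d1", "d2", "d3"], judgments)
perfect = ndcg(["d3", "d1", "d2"], judgments)
print(score, perfect)
```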
pyterrier.measures.ERR(**kwargs)

The Expected Reciprocal Rank (ERR) is a precision-focused measure. In essence, an extension of reciprocal rank that encapsulates both graded relevance and a more realistic cascade-based user model of how users browse a ranking.

pyterrier.measures.Success(**kwargs)

1 if a document with at least rel relevance is found in the first cutoff documents, else 0.

NOTE: Some refer to this measure as Recall@k. This software follows the TREC convention, where Recall@k is defined as the proportion of known relevant documents retrieved in the top k results.

pyterrier.measures.Judged(**kwargs)

Percentage of results in the top k (cutoff) results that have relevance judgments. Equivalent to P@k with a rel lower than any judgment.
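Success@k and Judged@k, as defined above, can be sketched for one query as follows. The function names and example data are hypothetical, and Judged@k is returned as a fraction of k.

```python
def success_at_k(ranking, judgments, k, rel=1):
    """1 if any of the top-k documents is labeled at least `rel`, else 0."""
    found = any(docno in judgments and judgments[docno] >= rel
                for docno in ranking[:k])
    return 1 if found else 0


def judged_at_k(ranking, judgments, k):
    """Fraction of the top-k documents that have any relevance judgment."""
    return sum(1 for docno in ranking[:k] if docno in judgments) / k


# Hypothetical data: d3 and d9 are unjudged, d7 is judged non-relevant.
ranking = ["d3", "d1", "d7", "d2", "d9"]
judgments = {"d1": 1, "d2": 1, "d5": 1, "d7": 0}

print(success_at_k(ranking, judgments, 1), success_at_k(ranking, judgments, 2))
print(judged_at_k(ranking, judgments, 5))
```

Note that d7 counts towards Judged@k even though it is judged non-relevant, which is why Judged@k behaves like P@k with a rel below every judgment.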

pyterrier.measures.NumQ(**kwargs)

The total number of queries.

pyterrier.measures.NumRet(**kwargs)

The number of results returned. When rel is provided, counts the number of documents returned with at least that relevance score (inclusive).

pyterrier.measures.NumRelRet(**kwargs)

The number of relevant results returned, i.e. NumRet counted with rel=1. When a different rel is provided, counts the number of documents returned with at least that relevance score (inclusive).

pyterrier.measures.NumRel(**kwargs)

The number of relevant documents the query has (independent of what the system retrieved).

pyterrier.measures.Rprec(**kwargs)

The precision at R, where R is the number of relevant documents for a given query. Has the cute property that it is also the recall at R.

@misc{Buckley2005RetrievalSE,
  title={Retrieval System Evaluation},
  author={Chris Buckley and Ellen M. Voorhees},
  annote={Chapter 3 in TREC: Experiment and Evaluation in Information Retrieval},
  howpublished={MIT Press},
  year={2005}
}
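The property mentioned above, that precision at R equals recall at R, follows because both divide the same number of hits by R. A small sketch with hypothetical names and data:

```python
def r_precision(ranking, judgments, rel=1):
    """Precision at rank R, where R is the number of relevant documents."""
    relevant = {docno for docno, label in judgments.items() if label >= rel}
    r = len(relevant)
    if r == 0:
        return 0.0
    hits = sum(1 for docno in ranking[:r] if docno in relevant)
    return hits / r


def recall_at(ranking, judgments, k, rel=1):
    """Recall at rank k, used here only to check the equivalence."""
    relevant = {docno for docno, label in judgments.items() if label >= rel}
    hits = sum(1 for docno in ranking[:k] if docno in relevant)
    return hits / len(relevant) if relevant else 0.0


# Hypothetical data with R = 3 relevant documents (d1, d2, d5).
ranking = ["d3", "d1", "d7", "d2", "d9"]
judgments = {"d1": 1, "d2": 1, "d5": 1, "d7": 0}
rprec = r_precision(ranking, judgments)
print(rprec, recall_at(ranking, judgments, 3))
```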
pyterrier.measures.Bpref(**kwargs)

Binary Preference (Bpref). This measure examines the relative ranks of judged relevant and non-relevant documents. Non-judged documents are not considered.

@inproceedings{Buckley2004RetrievalEW,
  title={Retrieval evaluation with incomplete information},
  author={Chris Buckley and Ellen M. Voorhees},
  booktitle={SIGIR},
  year={2004}
}
pyterrier.measures.infAP(**kwargs)

Inferred AP. AP implementation that accounts for pooled-but-unjudged documents by assuming that they are relevant at the same proportion as other judged documents. Essentially, it skips documents that were pooled-but-not-judged, and assumes that the remaining unjudged documents are non-relevant.

Pooled-but-unjudged documents are indicated by a score of -1, by convention. Note that not all qrels use this convention.