Running Experiments¶
PyTerrier aims to make it easy to conduct an information retrieval experiment, namely, to run a transformer pipeline over a set of queries and evaluate the outcome using standard information retrieval evaluation metrics based on known relevant documents (obtained from a set of relevance assessments, also known as qrels).
NB: For calculating evaluation metrics, we use the ir_measures library, which includes implementations of many standard metrics. By default, to calculate most measures, ir_measures uses our fork of the pytrec_eval library, which itself is a Python wrapper around the widely-used trec_eval evaluation tool.
The main way to achieve this is using pt.Experiment(). If you have an existing results dataframe, you can use pt.Evaluate().
API¶
- pyterrier.Experiment(retr_systems, topics, qrels, eval_metrics, names=None, perquery=False, dataframe=True, batch_size=None, filter_by_qrels=False, filter_by_topics=True, baseline=None, test='t', correction=None, correction_alpha=0.05, highlight=None, round=None, verbose=False, validate='warn', save_dir=None, save_mode='warn', save_format='trec', precompute_prefix=False, **kwargs)[source]¶
Allows easy comparison of multiple retrieval transformer pipelines using a common set of topics, and identical evaluation measures computed using the same qrels. In essence, each transformer is applied on the provided set of topics. Then the named evaluation measures are computed for each system.
- Parameters:
retr_systems (Sequence[Transformer|DataFrame]) – A list of transformers to evaluate. If you already have the results for one (or more) of your systems, a results dataframe can also be used here. Results produced by the transformers must have "qid", "docno", "score", "rank" columns.
topics (DataFrame) – Either a path to a topics file or a pandas.DataFrame with columns=['qid', 'query']
qrels (DataFrame) – Either a path to a qrels file or a pandas.DataFrame with columns=['qid', 'docno', 'label']
eval_metrics (Sequence[str|Measure]) – Which evaluation metrics to use. E.g. ['map']
names (Sequence[str] | None) – List of names for each retrieval system when presenting the results. Default=None. If None, the str() representation of each transformer is used as its name.
batch_size (int | None) – If not None, evaluation is conducted in batches of batch_size topics. Default=None, which evaluates all topics at once. Applying a batch_size is useful if you have large numbers of topics, and/or if your pipeline requires large amounts of temporary memory during a run.
filter_by_qrels (bool) – If True, will drop topics from the topics dataframe that have qids not appearing in the qrels dataframe.
filter_by_topics (bool) – If True, will drop entries from the qrels dataframe that have qids not appearing in the topics dataframe.
perquery (bool | Literal['both']) – If True, return each metric for each query; if False, return mean metrics across all queries. If 'both', return both averages and per-query results in a tuple. Default=False.
save_dir (str | None) – If set to the name of a directory, the results of each transformer will be saved in a TREC-formatted results file, whose filename is based on the system's name (as specified by the names kwarg). If the file exists and save_mode is set to "reuse", then the file will be used for evaluation rather than the transformer. Default is None, such that saving and loading from files is disabled.
save_mode (Literal['reuse', 'overwrite', 'error', 'warn']) – Defines how existing files are used when save_dir is set. If set to "reuse", then files will be preferred over transformers for evaluation. If set to "overwrite", existing files will be replaced. If set to "warn" or "error", the presence of any existing file will cause a warning or error, respectively. Default is "warn".
save_format (Literal['trec'] | ModuleType | Tuple[Callable[[IO], DataFrame], Callable[[DataFrame, IO], None]]) – How results are saved. Defaults to 'trec', which uses pt.io.read_results() and pt.io.write_results() for saving system outputs. If the TREC results format is insufficient, set save_format=pickle. Alternatively, a tuple of read and write functions can be specified, for instance, save_format=(pandas.read_csv, pandas.DataFrame.to_csv), or even save_format=(pandas.read_parquet, pandas.DataFrame.to_parquet).
dataframe (Literal[False] | Literal[True]) – If True, return results as a dataframe, else as a dictionary of dictionaries. Default=True.
baseline (int | None) – If set to the index of an item of the retr_systems list, will calculate the number of queries improved, degraded and the statistical significance (paired t-test p value) for each measure. Default=None: no additional columns will be added for each measure.
test (str | Callable[[Sequence[float|int|complex], Sequence[float|int|complex]], Tuple[Any, float|int|complex]]) – Which significance testing approach to apply. Defaults to "t". An alternative is "wilcoxon", which is not typically used for IR experiments. A Callable can also be passed: it should follow the specification of scipy.stats.ttest_rel(), i.e. it expects two arrays of numbers, and returns an array or tuple, of which the second value will be placed in the p-value column.
correction (str | None) – Whether any multiple testing correction should be applied. E.g. 'bonferroni', 'holm', 'hs' aka 'holm-sidak'. Default is None. Additional columns are added denoting whether the null hypothesis can be rejected, and the corrected p value. See statsmodels.stats.multitest.multipletests() for more information about the available testing corrections.
correction_alpha (float) – The alpha value to use for multiple testing correction. Default is 0.05.
highlight (str | None) – If highlight="bold", highlights in bold the best measure value in each column; if highlight="color" or "colour", then the cell with the highest metric value will have a green background.
round (int | Dict[str, int] | None) – How many decimal places to round each measure value to. This can also be a dictionary mapping measure name to number of decimal places. Default is None, which is no rounding.
precompute_prefix (bool) – If set to True, then pt.Experiment will look for a common prefix on all input pipelines, and execute that common prefix pipeline only once. This functionality assumes that the intermediate results of the common prefix can fit in memory. Set to False by default.
verbose (bool) – If True, a tqdm progress bar is shown as systems (or systems*batches if batch_size is set) are executed. Default=False.
validate (Literal['warn', 'error', 'ignore']) – If set to a value other than 'ignore', each transformer is validated against the topics dataframe, to ensure that it produces the expected output columns. pt.inspect.transformer_outputs() is used to determine the output columns. If 'warn', then transformers whose output columns don't match the columns required by the specified evaluation measures will produce warnings; if 'error', then an error is produced. If a transformer cannot be inspected, a warning is produced.
- Returns:
A Dataframe/dict with each retrieval system with each metric evaluated, or alternatively a tuple with averages and perquery results.
- pyterrier.Evaluate(res, qrels, metrics=['map', 'ndcg'], perquery=False)[source]¶
Evaluate a single result dataframe with the given qrels. This method may be used as an alternative to pt.Experiment() for getting only the evaluation measurements given a single set of existing results.
The PyTerrier way is to use pt.Experiment() to evaluate a set of transformers, but this method is useful if you have a set of results already, and want to evaluate them without having to create a transformer pipeline.
- Parameters:
res (DataFrame) – Either a dataframe with columns=['qid', 'docno', 'score'] or a dict {qid:{docno:score,},}
qrels (DataFrame) – Either a dataframe with columns=['qid', 'docno', 'label'] or a dict {qid:{docno:label,},}
metrics (Sequence[str|Measure]) – A list of strings specifying which evaluation metrics to use. Default=['map', 'ndcg']
perquery (bool) – If True, return each metric for each query, else return mean metrics. Default=False
- Return type:
Dict
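To illustrate the kind of computation that evaluating an existing set of results involves, the following is a minimal pure-Python sketch that computes (mean) average precision from the two dict formats described above. It is not the PyTerrier or ir_measures implementation; the function names `average_precision` and `evaluate_map`, and the toy data, are made up for illustration, and labels greater than 0 are assumed to mean "relevant".

```python
def average_precision(ranking, relevant):
    """AP for one query: mean of the precision at each relevant document retrieved."""
    hits, precision_sum = 0, 0.0
    for rank, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def evaluate_map(res, qrels):
    """Mean AP given res={qid: {docno: score}} and qrels={qid: {docno: label}}."""
    per_query = {}
    for qid, doc_scores in res.items():
        if qid not in qrels:
            continue  # queries without judgments are skipped in this sketch
        # rank documents by descending score
        ranking = sorted(doc_scores, key=doc_scores.get, reverse=True)
        relevant = {d for d, label in qrels[qid].items() if label > 0}
        per_query[qid] = average_precision(ranking, relevant)
    mean_ap = sum(per_query.values()) / len(per_query) if per_query else 0.0
    return mean_ap, per_query


# hypothetical toy data in the dict formats accepted by pt.Evaluate()
res = {"q1": {"d1": 3.2, "d2": 2.5, "d3": 1.1}, "q2": {"d4": 0.9, "d5": 0.4}}
qrels = {"q1": {"d1": 1, "d3": 1}, "q2": {"d5": 1}}
mean_ap, per_query = evaluate_map(res, qrels)
```

In practice, pt.Evaluate(res, qrels, metrics=['map']) should be preferred, as it relies on well-tested measure implementations.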
Examples¶
Average Effectiveness¶
Getting average effectiveness over a set of topics:
import pyterrier as pt
dataset = pt.get_dataset("vaswani")
# the vaswani dataset provides an index, topics and qrels
# let's generate two BRs to compare
tfidf = pt.terrier.Retriever(dataset.get_index(), wmodel="TF_IDF")
bm25 = pt.terrier.Retriever(dataset.get_index(), wmodel="BM25")
pt.Experiment(
[tfidf, bm25],
dataset.get_topics(),
dataset.get_qrels(),
eval_metrics=["map", "recip_rank"]
)
The returned dataframe is as follows:
| | name | map | recip_rank |
|---|---|---|---|
| 0 | TerrierRetr(TF_IDF) | 0.290905 | 0.699168 |
| 1 | TerrierRetr(BM25) | 0.296517 | 0.725665 |
Each row represents one system. We can manually set the names of the systems, using the names= kwarg, as follows:
pt.Experiment(
[tfidf, bm25],
dataset.get_topics(),
dataset.get_qrels(),
eval_metrics=["map", "recip_rank"],
names=["TF_IDF", "BM25"]
)
This produces dataframes that are more easily interpretable.
| | name | map | recip_rank |
|---|---|---|---|
| 0 | TF_IDF | 0.290905 | 0.699168 |
| 1 | BM25 | 0.296517 | 0.725665 |
We can also reduce the number of decimal places reported using the round= kwarg, as follows:
pt.Experiment(
[tfidf, bm25],
dataset.get_topics(),
dataset.get_qrels(),
eval_metrics=["map", "recip_rank"],
round={"map" : 4, "recip_rank" : 3},
names=["TF_IDF", "BM25"]
)
The result is as follows:
| | name | map | recip_rank |
|---|---|---|---|
| 0 | TF_IDF | 0.2909 | 0.699 |
| 1 | BM25 | 0.2965 | 0.726 |
Passing an integer value to round= (e.g. round=3) applies rounding to all evaluation measures.
Significance Testing¶
We can perform significance testing by specifying the index of which transformer we consider to be our baseline, e.g. baseline=0:
pt.Experiment(
[tfidf, bm25],
dataset.get_topics(),
dataset.get_qrels(),
eval_metrics=["map", "recip_rank"],
names=["TF_IDF", "BM25"],
baseline=0
)
In this case, additional columns are returned for each measure, indicating the number of queries improved compared to the baseline, the number of queries degraded, as well as the paired t-test p-value in the difference between each row and the baseline row. NB: For the baseline, these values are NaN (not applicable).
| | name | map | recip_rank | map + | map - | map p-value | recip_rank + | recip_rank - | recip_rank p-value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | TF_IDF | 0.290905 | 0.699168 | nan | nan | nan | nan | nan | nan |
| 1 | BM25 | 0.296517 | 0.725665 | 46 | 45 | 0.237317 | 16 | 3 | 0.0258549 |
For this test collection, between the TF_IDF and BM25 weighting models, there is no significant difference observed in terms of MAP, but there is a significant difference in terms of mean reciprocal rank (p<0.05). Indeed, while BM25 improves average precision for 46 queries over TF_IDF, it degrades it for 45; on the other hand, the rank of the first relevant document is improved for 16 queries by BM25 over TF_IDF.
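As a rough illustration of where the "+", "-" and p-value columns come from, the sketch below counts the queries improved and degraded relative to a baseline and computes the paired t statistic from per-query scores. It is a simplified stand-in rather than PyTerrier's code: the function name `compare_to_baseline` and the per-query scores are hypothetical, and the p-value itself (which a function such as scipy.stats.ttest_rel() derives from the t distribution) is not computed here.

```python
import math


def compare_to_baseline(baseline_scores, system_scores):
    """Return (#improved, #degraded, paired t statistic) for aligned per-query scores."""
    diffs = [s - b for b, s in zip(baseline_scores, system_scores)]
    improved = sum(d > 0 for d in diffs)
    degraded = sum(d < 0 for d in diffs)
    n = len(diffs)
    mean = sum(diffs) / n
    # sample variance of the per-query differences (assumed non-zero here)
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    t_stat = mean / math.sqrt(variance / n)
    return improved, degraded, t_stat


# hypothetical per-query scores, aligned by query
baseline = [0.20, 0.50, 0.10, 0.40]
system = [0.30, 0.45, 0.10, 0.60]
improved, degraded, t_stat = compare_to_baseline(baseline, system)
```

Note that a query with an unchanged score counts as neither improved nor degraded, which is why the two counts need not sum to the number of queries.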
Furthermore, modern experimental convention suggests that it is important to correct for multiple testing in the comparative evaluation of many IR systems. pt.Experiment provides support for the multiple testing correction methods offered by the statsmodels package, such as Bonferroni:
pt.Experiment(
[tfidf, bm25],
dataset.get_topics(),
dataset.get_qrels(),
eval_metrics=["map"],
names=["TF_IDF", "BM25"],
baseline=0,
correction='bonferroni'
)
This adds two further columns for each measure, denoting if the null hypothesis can be rejected (e.g. “map reject”), as well as the corrected p value (“map p-value corrected”), as shown below:
| | name | map | map + | map - | map p-value | map reject | map p-value corrected |
|---|---|---|---|---|---|---|---|
| 0 | TF_IDF | 0.290905 | nan | nan | nan | False | nan |
| 1 | BM25 | 0.296517 | 46 | 45 | 0.237317 | False | 0.237317 |
The table below summarises the multiple testing correction methods supported:
| | Aliases | Correction Method |
|---|---|---|
| 0 | ['b', 'bonf', 'bonferroni'] | Bonferroni |
| 1 | ['s', 'sidak'] | Sidak |
| 2 | ['h', 'holm'] | Holm |
| 3 | ['hs', 'holm-sidak'] | Holm-Sidak |
| 4 | ['sh', 'simes-hochberg'] | Simes-Hochberg |
| 5 | ['ho', 'hommel'] | Hommel |
| 6 | ['fdr_bh', 'fdr_i', 'fdr_p', 'fdri', 'fdrp'] | FDR Benjamini-Hochberg |
| 7 | ['fdr_by', 'fdr_n', 'fdr_c', 'fdrn', 'fdrcorr'] | FDR Benjamini-Yekutieli |
| 8 | ['fdr_tsbh', 'fdr_2sbh'] | FDR 2-stage Benjamini-Hochberg |
| 9 | ['fdr_tsbky', 'fdr_2sbky', 'fdr_twostage'] | FDR 2-stage Benjamini-Krieger-Yekutieli |
| 10 | ['fdr_gbs'] | FDR adaptive Gavrilov-Benjamini-Sarkar |
Any value in the Aliases column can be passed to Experiment’s correction= kwarg.
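To give some intuition for what two of these corrections do to the p-values, here is a small pure-Python sketch of the Bonferroni and Holm procedures. It is only an illustration (pt.Experiment relies on statsmodels for the corrections, as noted above); the function names and the three-value example are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Multiply each p-value by the number of tests (capped at 1)."""
    m = len(p_values)
    corrected = [min(1.0, p * m) for p in p_values]
    return corrected, [p <= alpha for p in corrected]


def holm(p_values, alpha=0.05):
    """Holm step-down: the i-th smallest p-value is multiplied by (m - i)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    corrected = [0.0] * m
    running_max = 0.0
    for step, i in enumerate(order):
        adjusted = min(1.0, (m - step) * p_values[i])
        running_max = max(running_max, adjusted)  # keep corrected values monotone
        corrected[i] = running_max
    return corrected, [p <= alpha for p in corrected]


# With a single comparison, as in the TF_IDF vs. BM25 table above,
# the correction leaves the p-value unchanged.
single_corrected, single_reject = bonferroni([0.237317])

# Hypothetical p-values for three systems compared against one baseline.
holm_corrected, holm_reject = holm([0.01, 0.04, 0.03])
```

Holm is never more conservative than Bonferroni, which is one reason it is often preferred when several systems are compared against the same baseline.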
Per-query Effectiveness¶
Finally, if necessary, we can request per-query performances using the perquery=True kwarg:
pt.Experiment(
[tfidf, bm25],
dataset.get_topics(),
dataset.get_qrels(),
eval_metrics=["map", "recip_rank"],
names=["TF_IDF", "BM25"],
perquery=True
)
This provides a dataframe where each row is the performance of a given system for a given query on a particular evaluation measure.
| | name | qid | measure | value |
|---|---|---|---|---|
| 186 | BM25 | 1 | map | 0.26794 |
| 187 | BM25 | 1 | recip_rank | 1 |
| 204 | BM25 | 10 | map | 0.115631 |
| 205 | BM25 | 10 | recip_rank | 0.5 |
| 206 | BM25 | 11 | map | 0.0776046 |
NB: For brevity, we only show the top 5 rows of the returned table.
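Per-query rows in this long format can be aggregated back into averages. The sketch below does so in plain Python for the first four rows shown above, skipping NaN values (which mark undefined measurements). The helper name `mean_per_system` is made up for illustration, and because only four rows are used, the resulting means are not the collection-level figures reported earlier.

```python
import math
from collections import defaultdict


def mean_per_system(rows):
    """Average the per-query values for each (system name, measure) pair."""
    values = defaultdict(list)
    for row in rows:
        if math.isnan(row["value"]):
            continue  # undefined measurements do not contribute to the mean
        values[(row["name"], row["measure"])].append(row["value"])
    return {key: sum(vals) / len(vals) for key, vals in values.items()}


# the first four rows of the per-query table above
rows = [
    {"name": "BM25", "qid": "1", "measure": "map", "value": 0.26794},
    {"name": "BM25", "qid": "1", "measure": "recip_rank", "value": 1.0},
    {"name": "BM25", "qid": "10", "measure": "map", "value": 0.115631},
    {"name": "BM25", "qid": "10", "measure": "recip_rank", "value": 0.5},
]
means = mean_per_system(rows)
```

If both views are needed, perquery='both' returns the averages and the per-query results together, avoiding a second run.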
Saving and Reusing Results¶
For some research tasks, it is considered good practice to save your results files when conducting experiments. This offers several advantages:
- It permits additional evaluation (e.g. more measures, more significance tests) without re-applying potentially slow transformer pipelines.
- It allows transformer results to be made available for other experiments, perhaps as a virtual data appendix in a paper.
Saving can be enabled by adding the save_dir as a kwarg to pt.Experiment:
pt.Experiment(
[tfidf, bm25],
dataset.get_topics(),
dataset.get_qrels(),
eval_metrics=["map", "recip_rank"],
names=["TF_IDF", "BM25"],
save_dir="./",
)
This will save two files, namely, TF_IDF.res.gz and BM25.res.gz to the current directory. If these files already exist and save_mode="reuse" is set, they will be "reused", i.e. loaded and evaluated rather than applying the tfidf and/or bm25 transformers; with the default save_mode="warn", the presence of existing files produces a warning. If experiments are being conducted on multiple different topic sets, care should be taken to ensure that previous results for a different topic set are not reused for evaluation.
If a transformer has been updated, outdated results files can be mistakenly used. To prevent this, set the save_mode
kwarg to "overwrite":
pt.Experiment(
[tfidf, bm25],
dataset.get_topics(),
dataset.get_qrels(),
eval_metrics=["map", "recip_rank"],
names=["TF_IDF", "BM25"],
save_dir="./",
save_mode="overwrite"
)
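One simple way to avoid accidentally reusing results from a different topic set, as cautioned above, is to give each dataset and topic set its own save directory. The following is a minimal sketch of that idea; the helper `results_dir` and the topic-set labels "all" and "head5" are hypothetical, and a temporary directory is used only so that the example runs anywhere.

```python
import os
import tempfile


def results_dir(base_dir, dataset_name, topic_set):
    """Build (and create) a results directory unique to one dataset/topic-set pair."""
    path = os.path.join(base_dir, dataset_name, topic_set)
    os.makedirs(path, exist_ok=True)
    return path


base = tempfile.mkdtemp()  # stand-in for a real results location
dir_all = results_dir(base, "vaswani", "all")
dir_head5 = results_dir(base, "vaswani", "head5")
```

The returned path can then be passed as save_dir= to pt.Experiment, so that TF_IDF.res.gz for one topic set never shadows the file of the same name for another.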
Missing Topics and/or Qrels¶
There is not always a one-to-one correspondence between the topic/query IDs (qids) that appear in
the provided topics and qrels. Qids that appear in topics but not qrels can be due to incomplete judgments,
such as in sparsely labeled datasets or shared tasks that choose to omit some topics (e.g., due to cost).
Qids that appear in qrels but not in topics can happen when running a subset of topics for testing purposes
(e.g., topics.head(5)).
The filter_by_qrels and filter_by_topics parameters control the behaviour of an experiment when topics and qrels
do not perfectly overlap. When filter_by_qrels=True, topics are filtered down to only the ones that have qids in the
qrels. Similarly, when filter_by_topics=True, qrels are filtered down to only the ones that have qids in the topics.
For example, consider topics that include qids A and B and qrels that include B and C. The results with
each combination of settings are:
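| filter_by_topics | filter_by_qrels | Results consider | Notes |
|---|---|---|---|
| True (default) | False (default) | A, B | |
| True (default) | True | B | Acts as an intersection of the qids found in the qrels and topics. |
| False | False (default) | A, B, C | Acts as a union of the qids found in qrels and topics. |
| False | True | B, C | |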
Note that, following IR evaluation conventions, topics that have no relevance judgments (A in the above example)
do not contribute to relevance-based measures (e.g., map), but still contribute to efficiency measures (e.g., mrt).
As such, aggregate relevance-based measures will not change based on the value of filter_by_qrels. When perquery=True,
topics that have no relevance judgments (A) will give a value of NaN, indicating that they are not defined
and should not contribute to the average.
The defaults (filter_by_topics=True and filter_by_qrels=False) were chosen because they likely reflect the intent
of the user in most cases. In particular, they run all topics requested and evaluate on only those topics. However, you
may want to change these settings in some circumstances. E.g.:
- If you want to save time and avoid running topics that will not be evaluated, set filter_by_qrels=True. This can be particularly helpful for large collections with many missing judgments, such as MS MARCO.
- If you want to evaluate across all topics from the qrels, set filter_by_topics=False.
Note that in all cases, if a requested topic that appears in the qrels returns no results, it will properly contribute a score of 0 for evaluation.
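The effect of the two filters on the A/B/C example can be sketched with plain set operations. This is only an illustration of the semantics described in this section, not the code pt.Experiment uses; the function name `qids_considered` is hypothetical.

```python
def qids_considered(topic_qids, qrel_qids, filter_by_topics=True, filter_by_qrels=False):
    """Return the qids that take part in evaluation under the two filter settings."""
    topics, qrels = set(topic_qids), set(qrel_qids)
    if filter_by_qrels:
        # drop topics that have no entries in the qrels
        topics = topics & set(qrel_qids)
    if filter_by_topics:
        # drop qrels entries for qids that are not among the topics
        qrels = qrels & set(topic_qids)
    return sorted(topics | qrels)


topic_qids = ["A", "B"]
qrel_qids = ["B", "C"]
defaults = qids_considered(topic_qids, qrel_qids)  # ['A', 'B']
intersection = qids_considered(topic_qids, qrel_qids, True, True)  # ['B']
union = qids_considered(topic_qids, qrel_qids, False, False)  # ['A', 'B', 'C']
```

With the defaults, topic A is still run but, having no judgments, does not affect relevance-based measures; it only affects measures such as mrt.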
Available Evaluation Measures¶
All trec_eval evaluation measures are available. Often used measures, including the name that must be used, are:
- Mean Average Precision (map).
- Mean Reciprocal Rank (recip_rank).
- Normalized Discounted Cumulative Gain (ndcg), or calculated at a given rank cutoff (e.g. ndcg_cut_5).
- Number of queries (num_q) - not averaged.
- Number of retrieved documents (num_ret) - not averaged.
- Number of relevant documents (num_rel) - not averaged.
- Number of relevant documents retrieved (num_rel_ret) - not averaged.
- Interpolated recall precision curves (iprec_at_recall). This is a family of measures, so requesting iprec_at_recall will output measurements for IPrec@0.00, IPrec@0.10, etc.
- Precision at rank cutoff (e.g. P_5).
- Recall (recall) will generate recall at different cutoffs, such as recall_5, etc.
- Mean response time (mrt) will report the average number of milliseconds to conduct a query (this is calculated by pt.Experiment() directly, not pytrec_eval).
- trec_eval measure families such as official, set and all_trec will be expanded. These result in many measures being returned. For instance, asking for official results in the following (very wide) output reporting the usual default metrics of trec_eval:
| | name | NumQ | NumRet | NumRel | NumRet(rel=1) | AP | Rprec | Bpref | RR | IPrec@0.00 | IPrec@0.10 | IPrec@0.20 | IPrec@0.30 | IPrec@0.40 | IPrec@0.50 | IPrec@0.60 | IPrec@0.70 | IPrec@0.80 | IPrec@0.90 | IPrec@1.00 | P@5 | P@10 | P@15 | P@20 | P@30 | P@100 | P@200 | P@500 | P@1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TF_IDF | 93 | 91930 | 2083 | 20.8387 | 0.290905 | 0.301125 | 0.934214 | 0.699168 | 0.725414 | 0.640478 | 0.520824 | 0.406439 | 0.336222 | 0.267922 | 0.197803 | 0.151191 | 0.104208 | 0.0575086 | 0.0240411 | 0.473118 | 0.35914 | 0.304659 | 0.273118 | 0.236201 | 0.126344 | 0.0785484 | 0.0379785 | 0.0208387 |
| 1 | BM25 | 93 | 91930 | 2083 | 20.828 | 0.296517 | 0.303662 | 0.934607 | 0.725665 | 0.751261 | 0.658588 | 0.523882 | 0.411899 | 0.345536 | 0.274788 | 0.207526 | 0.159466 | 0.105463 | 0.0579222 | 0.0238924 | 0.460215 | 0.352688 | 0.302509 | 0.269892 | 0.236918 | 0.126667 | 0.0779032 | 0.0379785 | 0.020828 |
See also a list of common TREC eval measures.
Evaluation Measures Objects¶
Using the ir_measures Python package, PyTerrier supports evaluation measure objects. These make it easier to express measure configurations such as rank cutoffs:
from pyterrier.measures import *
pt.Experiment(
[tfidf, bm25],
dataset.get_topics(),
dataset.get_qrels(),
eval_metrics=[AP, RR, nDCG@5],
)
NB: We have to use from pyterrier.measures import *, as from pt.measures import * won't work.
More specifically, let's consider the TREC Deep Learning track passage ranking task, which requires NDCG@10, NDCG@100 (using graded labels), as well as MRR@10 and MAP using binary labels (where relevant is grade 2 and above). The necessary incantation of pt.Experiment() looks like:
from pyterrier.measures import *
dataset = pt.get_dataset("trec-deep-learning-passages")
pt.Experiment(
[tfidf, bm25],
dataset.get_topics("test-2019"),
dataset.get_qrels("test-2019"),
eval_metrics=[RR(rel=2), nDCG@10, nDCG@100, AP(rel=2)],
)
The available evaluation measure objects are listed below.
- pyterrier.measures.P(**kwargs)¶
Basic measure that computes the percentage of documents in the top cutoff results that are labeled as relevant. cutoff is a required parameter, and can be provided as P@cutoff.
Citation
Rijsbergen. Information Retrieval. 1979.
@book{DBLP:books/bu/Rijsbergen79, author = {C. J. van Rijsbergen}, title = {Information Retrieval}, publisher = {Butterworth}, year = {1979}, isbn = {0-408-70929-4}, timestamp = {Thu, 03 Jan 2002 11:51:10 +0100}, biburl = {https://dblp.org/rec/books/bu/Rijsbergen79.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- pyterrier.measures.R(**kwargs)¶
Recall@k (R@k). The fraction of relevant documents for a query that have been retrieved by rank k.
NOTE: Some tasks define Recall@k as whether any relevant documents are found in the top k results. This software follows the TREC convention and refers to that measure as Success@k.
- pyterrier.measures.AP(**kwargs)¶
The [Mean] Average Precision ([M]AP). The average precision of a single query is the mean of the precision scores at each relevant item returned in a search results list.
AP is typically used for adhoc ranking tasks where getting as many relevant items as possible is important. It is commonly referred to as MAP, by taking the mean of AP over the query set.
Citation
Harman. Evaluation Issues in Information Retrieval. Inf. Process. Manag. 1992. [link]
@article{DBLP:journals/ipm/Harman92, author = {Donna Harman}, title = {Evaluation Issues in Information Retrieval}, journal = {Inf. Process. Manag.}, volume = {28}, number = {4}, pages = {439--440}, year = {1992}, url = {https://doi.org/10.1016/0306-4573(92)90001-G}, doi = {10.1016/0306-4573(92)90001-G}, timestamp = {Fri, 21 Feb 2020 13:11:30 +0100}, biburl = {https://dblp.org/rec/journals/ipm/Harman92.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- pyterrier.measures.RR(**kwargs)¶
The [Mean] Reciprocal Rank ([M]RR) is a precision-focused measure that scores based on the reciprocal of the rank of the highest-ranked relevant document. An optional cutoff can be provided to limit the depth explored. rel (default 1) controls which relevance level is considered relevant.
Citation
Kantor and Voorhees. The TREC-5 Confusion Track: Comparing Retrieval Methods for Scanned Text. Inf. Retr. 2000. [link]
@article{DBLP:journals/ir/KantorV00, author = {Paul B. Kantor and Ellen M. Voorhees}, title = {The {TREC-5} Confusion Track: Comparing Retrieval Methods for Scanned Text}, journal = {Inf. Retr.}, volume = {2}, number = {2/3}, pages = {165--176}, year = {2000}, url = {https://doi.org/10.1023/A:1009902609570}, doi = {10.1023/A:1009902609570}, timestamp = {Thu, 14 Oct 2021 09:13:06 +0200}, biburl = {https://dblp.org/rec/journals/ir/KantorV00.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- pyterrier.measures.nDCG(**kwargs)¶
The normalized Discounted Cumulative Gain (nDCG). Uses graded labels, rewarding systems that put the highest graded documents at the top of the ranking. It is normalized wrt. the ideal DCG, i.e. the DCG of documents ranked in descending order of graded label.
Citation
Järvelin and Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 2002. [link]
@article{DBLP:journals/tois/JarvelinK02, author = {Kalervo J{\"{a}}rvelin and Jaana Kek{\"{a}}l{\"{a}}inen}, title = {Cumulated gain-based evaluation of {IR} techniques}, journal = {{ACM} Trans. Inf. Syst.}, volume = {20}, number = {4}, pages = {422--446}, year = {2002}, url = {http://doi.acm.org/10.1145/582415.582418}, doi = {10.1145/582415.582418}, timestamp = {Fri, 09 Jun 2017 11:03:19 +0200}, biburl = {https://dblp.org/rec/journals/tois/JarvelinK02.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- pyterrier.measures.ERR(**kwargs)¶
The Expected Reciprocal Rank (ERR) is a precision-focused measure. In essence, it is an extension of reciprocal rank that encapsulates both graded relevance and a more realistic cascade-based user model of how users browse a ranking.
- pyterrier.measures.Success(**kwargs)¶
1 if a document with at least rel relevance is found in the first cutoff documents, else 0.
NOTE: Some refer to this measure as Recall@k. This software follows the TREC convention, where Recall@k is defined as the proportion of known relevant documents retrieved in the top k results.
- pyterrier.measures.Judged(**kwargs)¶
Percentage of results in the top k (cutoff) results that have relevance judgments. Equivalent to P@k with a rel lower than any judgment.
- pyterrier.measures.NumQ(**kwargs)¶
The total number of queries.
- pyterrier.measures.NumRet(**kwargs)¶
The number of results returned. When rel is provided, counts the number of documents returned with at least that relevance score (inclusive).
- pyterrier.measures.NumRelRet(**kwargs)¶
The number of results returned. When rel is provided, counts the number of documents returned with at least that relevance score (inclusive).
- pyterrier.measures.NumRel(**kwargs)¶
The number of relevant documents the query has (independent of what the system retrieved).
- pyterrier.measures.Rprec(**kwargs)¶
The precision at R, where R is the number of relevant documents for a given query. Has the cute property that it is also the recall at R.
Citation
Buckley and Voorhees. Retrieval System Evaluation. 2005. [link]
- pyterrier.measures.Bpref(**kwargs)¶
Binary Preference (Bpref). This measure examines the relative ranks of judged relevant and non-relevant documents. Non-judged documents are not considered.
Citation
Buckley and Voorhees. Retrieval evaluation with incomplete information. SIGIR 2004. [link]
@inproceedings{DBLP:conf/sigir/BuckleyV04, author = {Chris Buckley and Ellen M. Voorhees}, editor = {Mark Sanderson and Kalervo J{\"{a}}rvelin and James Allan and Peter Bruza}, title = {Retrieval evaluation with incomplete information}, booktitle = {{SIGIR} 2004: Proceedings of the 27th Annual International {ACM} {SIGIR} Conference on Research and Development in Information Retrieval, Sheffield, UK, July 25-29, 2004}, pages = {25--32}, publisher = {{ACM}}, year = {2004}, url = {https://doi.org/10.1145/1008992.1009000}, doi = {10.1145/1008992.1009000}, timestamp = {Thu, 14 Oct 2021 10:27:19 +0200}, biburl = {https://dblp.org/rec/conf/sigir/BuckleyV04.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- pyterrier.measures.infAP(**kwargs)¶
Inferred AP. AP implementation that accounts for pooled-but-unjudged documents by assuming that they are relevant at the same proportion as other judged documents. Essentially, skips documents that were pooled-but-not-judged, and assumes unjudged are non-relevant.
Pooled-but-unjudged documents are indicated by a score of -1, by convention. Note that not all qrels use this convention.
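To make the distinction between some of the measures above concrete (in particular R@k versus Success@k, and the effect of rel), here is a small pure-Python sketch for a single query. These helpers are illustrative only and are not the ir_measures implementations; the function names, the ranking and the labels are all hypothetical.

```python
def precision_at(ranking, labels, k, rel=1):
    """P@k: fraction of the top k documents with a label of at least rel."""
    return sum(labels.get(d, 0) >= rel for d in ranking[:k]) / k


def recall_at(ranking, labels, k, rel=1):
    """R@k: fraction of all relevant documents found in the top k."""
    n_rel = sum(label >= rel for label in labels.values())
    found = sum(labels.get(d, 0) >= rel for d in ranking[:k])
    return found / n_rel if n_rel else 0.0


def success_at(ranking, labels, k, rel=1):
    """Success@k: 1 if any document in the top k is relevant, else 0."""
    return int(any(labels.get(d, 0) >= rel for d in ranking[:k]))


def reciprocal_rank(ranking, labels, rel=1, cutoff=None):
    """RR: reciprocal of the rank of the first relevant document (0 if none)."""
    for rank, d in enumerate(ranking[:cutoff], start=1):
        if labels.get(d, 0) >= rel:
            return 1.0 / rank
    return 0.0


ranking = ["d7", "d3", "d1", "d2"]     # documents in rank order
labels = {"d3": 1, "d1": 2, "d5": 1}   # graded judgments for this query

p2 = precision_at(ranking, labels, 2)        # 0.5
r2 = recall_at(ranking, labels, 2)           # 1 of 3 relevant documents found
s2 = success_at(ranking, labels, 2)          # 1
rr_rel2 = reciprocal_rank(ranking, labels, rel=2)  # first grade-2 document is at rank 3
```

Raising rel to 2 changes the reciprocal rank from 1/2 to 1/3 here, which mirrors how RR(rel=2) and AP(rel=2) are used for the binary measures in the TREC Deep Learning example.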
Validation of Transformers¶
When formulating pipelines for a pt.Experiment(), it's possible to formulate invalid pipelines, e.g. a transformer that does not produce the expected columns, or a transformer that does not accept the input columns of the previous transformer.
To mitigate this, pt.Experiment() will validate the transformers in the pipeline, and issue a warning or raise an error (depending on the validate= setting) if the pipeline is invalid.
This validation is controlled by the validate= kwarg, which can take the following values:
- "warn" (default): If the pipeline is invalid, a warning is issued, but the experiment proceeds. Pipelines that do not validate will still run, but may produce unexpected results.
- "error": If the pipeline is invalid, an error is raised, and the experiment does not proceed. If a pipeline is not validated, the user is informed such that the experiment fails-fast.
- "ignore": No validation is performed, and the experiment proceeds. This is useful for pipelines that are known to be valid, but cannot be validated due to transformer objects that cannot be inspected to determine their input and output columns.
Validation uses pt.inspect.transformer_outputs() to determine the output columns of each transformer in the pipeline, and whether they match the expected input columns of the next transformer, and that the overall result of the pipeline has the expected columns for the evaluation measures requested.
Most transformers can be validated automatically, particularly if they respond correctly to an empty DataFrame input. Other transformers may require a transform_output method to be implemented, which returns the expected output columns of the transformer.
If a pipeline fails validation, the user is informed of the problem, and, if validate=”error” is set, the experiment does not proceed. On the other hand, if a pipeline cannot be validated (because a transformer cannot be inspected), a warning is issued, and the experiment proceeds.
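The behaviour of the three validate= settings can be summarised with a short sketch. This is not PyTerrier's validation code: the function `validate_outputs` is hypothetical, the required columns are taken from the retr_systems parameter description ("qid", "docno", "score", "rank") rather than derived from the requested measures, and passing None to stand for "output columns could not be determined" is an assumption made for the example.

```python
import warnings

# assumption: the columns listed for results in the retr_systems description
REQUIRED_COLUMNS = {"qid", "docno", "score", "rank"}


def validate_outputs(name, output_columns, validate="warn"):
    """Return True if the pipeline may be treated as valid, following validate=."""
    if validate == "ignore":
        return True  # no validation is performed
    if output_columns is None:
        # the transformer could not be inspected: warn and proceed
        warnings.warn(f"{name}: output columns could not be determined")
        return True
    missing = REQUIRED_COLUMNS - set(output_columns)
    if not missing:
        return True
    message = f"{name}: missing output columns {sorted(missing)}"
    if validate == "error":
        raise ValueError(message)  # fail fast before running the experiment
    warnings.warn(message)  # "warn": report the problem but carry on
    return False


ok = validate_outputs("bm25", ["qid", "query", "docno", "score", "rank"], validate="error")
```

Using validate="error" is a sensible choice for long-running experiments, since a misconfigured pipeline is then reported before any retrieval has taken place.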
Precomputation of Common Pipeline Prefixes¶
Often we wish to evaluate multiple pipelines that have exactly the same initial stages. pt.Experiment exposes a precompute_prefix kwarg, which will precompute the results of the common initial stages, and then use these results as input to the remainder of each pipeline.
Consider the following example:
from pyterrier_t5 import MonoT5ReRanker
bm25 = pt.terrier.Retriever.from_dataset('vaswani', 'terrier_stemmed_text', wmodel='BM25', num_results=100)
monoT5 = MonoT5ReRanker()
monoT5 = bm25 >> monoT5
pt.Experiment(
[bm25, monoT5],
pt.get_dataset('vaswani').get_topics(),
pt.get_dataset('vaswani').get_qrels(),
eval_metrics=['map'],
precompute_prefix=True
)
Normally, the BM25 retriever would be invoked twice during this experiment - once for each pipeline, resulting in a slower execution time compared to an imperative workflow (get BM25 results, evaluate, apply monoT5, evaluate). By setting precompute_prefix=True, pt.Experiment will execute the bm25 transformer only once on the input topics, and then reuse those results as input to monoT5.
NB: This is experimental functionality, but should initial usage be successful, it may be turned on by default in future versions of PyTerrier.
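The idea behind prefix precomputation can be sketched independently of PyTerrier by modelling each pipeline as a list of stage functions. The code below is a conceptual illustration under that assumption; `common_prefix`, `run_pipelines` and the toy stages are hypothetical and do not reflect how pt.Experiment is implemented internally.

```python
def common_prefix(pipelines):
    """Return the longest list of leading stages shared by all pipelines."""
    prefix = []
    for stages in zip(*pipelines):
        if all(stage is stages[0] for stage in stages):
            prefix.append(stages[0])
        else:
            break
    return prefix


def run_pipelines(pipelines, topics):
    """Run the shared prefix once, then the remainder of each pipeline."""
    prefix = common_prefix(pipelines)
    shared = topics
    for stage in prefix:
        shared = stage(shared)  # intermediate results are held in memory
    outputs = []
    for stages in pipelines:
        result = shared
        for stage in stages[len(prefix):]:
            result = stage(result)
        outputs.append(result)
    return outputs


calls = {"bm25": 0}


def bm25_stage(queries):
    calls["bm25"] += 1  # count invocations of the (expensive) first stage
    return [q + ":bm25" for q in queries]


def rerank_stage(results):
    return [r + ":monoT5" for r in results]


outputs = run_pipelines([[bm25_stage], [bm25_stage, rerank_stage]], ["q1", "q2"])
```

After this runs, calls["bm25"] is 1 rather than 2, at the cost of keeping the shared intermediate results in memory, which is the trade-off noted in the precompute_prefix parameter description.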