RAG Measures

PyTerrier-RAG provides several commonly used evaluation measures as ir_measures objects, which can be passed to pt.Experiment() and pt.Evaluate().

For analysis of the generated answers:
  • Answer length (in characters): pyterrier_rag.measures.AnswerLen

  • Answer zero length (number of questions with an empty answer): pyterrier_rag.measures.AnswerZeroLen

For comparison with gold-truth answers:
  • Exact match percentage: pyterrier_rag.measures.EM

  • F1: pyterrier_rag.measures.F1

  • ROUGE: pyterrier_rag.measures.ROUGE1P, pyterrier_rag.measures.ROUGE1R, pyterrier_rag.measures.ROUGE1F, etc., as implemented by the rouge-score library.
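
To make the exact-match and F1 measures listed above concrete, here is a minimal sketch of how they are commonly computed for question answering (SQuAD-style normalisation followed by token overlap). The function names normalize, exact_match and token_f1 are hypothetical, and the normalisation steps are an assumption; PyTerrier-RAG's own implementation may differ in its details.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalisation, not necessarily what PyTerrier-RAG does.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, gold_answers: list[str]) -> float:
    """1.0 if the normalised prediction equals any normalised gold answer."""
    pred = normalize(prediction)
    return float(any(pred == normalize(gold) for gold in gold_answers))


def token_f1(prediction: str, gold_answers: list[str]) -> float:
    """Best token-overlap F1 between the prediction and any gold answer."""
    pred = normalize(prediction)
    best = 0.0
    for gold in gold_answers:
        ref = normalize(gold)
        # Multiset intersection counts each shared token at most min(count) times
        overlap = sum((Counter(pred) & Counter(ref)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred)
        recall = overlap / len(ref)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

Under these assumptions, exact_match is all-or-nothing per question, while token_f1 gives partial credit when the generated answer shares only some tokens with a gold answer.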

Example:

pt.Experiment(
    [ragpipe1, ragpipe2],
    dataset.get_topics(),
    dataset.get_answers(),
    [pyterrier_rag.measures.EM, pyterrier_rag.measures.F1, pyterrier_rag.measures.ROUGE1F, pyterrier_rag.measures.AnswerLen]
)

Various ROUGE measures are available:
  • ROUGE-1 (precision, recall, f-measure)

  • ROUGE-2 (precision, recall, f-measure)

  • ROUGE-L (precision, recall, f-measure)
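
As an illustration of what the ROUGE-L variants measure, the following is a simplified sketch based on the longest common subsequence (LCS) of tokens. It assumes lowercased whitespace tokenisation; the rouge-score library applies its own tokenisation and optional stemming, so its scores can differ. The names lcs_length and rouge_l are hypothetical.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: str, candidate: str) -> dict[str, float]:
    """Simplified ROUGE-L precision, recall and f-measure (illustrative only)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return {"precision": 0.0, "recall": 0.0, "fmeasure": 0.0}
    lcs = lcs_length(ref, cand)
    precision = lcs / len(cand)  # share of candidate tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    fmeasure = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}
```

ROUGE-1 and ROUGE-2 follow the same precision/recall/f-measure pattern, but count overlapping unigrams and bigrams rather than using the LCS.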

For comparison with known-relevant documents:
  • BERTScore (measures similarity of answer with relevant documents): pyterrier_rag.measures.BERTScore

pyterrier_rag.measures.BERTScore(rel=3, submeasure='f1', agg='max')

Implements BERTScore, a semantic measure of equivalence. It expects a qrels dataframe with an additional text attribute, and compares the text of the relevant documents with the generated answers (qanswer). NB: BERTScore is a function that returns a measure, so it needs to be called, e.g. BERTScore(rel=3).

Parameters:
  • rel – Minimum label value for relevant qrels. Defaults to 3, which is the highest label in MSMARCO.

  • submeasure – One of ‘precision’, ‘recall’ and ‘f1’. Defaults to ‘f1’.

  • agg – How to combine (aggregate) when there are multiple relevant documents. Valid options are ‘max’ or ‘avg’. Defaults to ‘max’.

Returns:

An ir_measures measure object that can be used in pt.Evaluate() or pt.Experiment().

Example:

# transformer that adds the passage text to each qrel (looked up by docno)
text_loader = pt.text.get_text(pt.get_dataset('irds:msmarco-passage'), 'text')
# TREC DL 2019 topics and graded qrels (labels 0-3)
topics_qrels = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
pt.Experiment(
    [ragpipe1, ragpipe2],
    topics_qrels.get_topics(),
    text_loader(topics_qrels.get_qrels()),
    [pyterrier_rag.measures.BERTScore(rel=3)]
)
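
The roles of the rel and agg parameters can be sketched as follows: documents labelled below rel are ignored, and the per-document scores of the remaining documents are combined with either the maximum or the mean. This is an illustrative sketch only. The function aggregate_scores is hypothetical, the scores are made-up numbers rather than real BERTScore output, and returning 0.0 when no document is relevant is an assumption.

```python
def aggregate_scores(doc_scores: list[float], labels: list[int],
                     rel: int = 3, agg: str = "max") -> float:
    """Combine per-document similarity scores for one query.

    doc_scores[i] is the similarity between the answer and document i,
    labels[i] is that document's relevance label.
    """
    # Keep only documents whose label reaches the relevance threshold
    relevant = [s for s, label in zip(doc_scores, labels) if label >= rel]
    if not relevant:
        return 0.0  # assumption: no relevant documents yields a score of 0
    if agg == "max":
        return max(relevant)
    if agg == "avg":
        return sum(relevant) / len(relevant)
    raise ValueError(f"agg must be 'max' or 'avg', got {agg!r}")


# Three judged documents: two with label 3, one with label 1
scores, labels = [0.9, 0.5, 0.7], [3, 3, 1]
best = aggregate_scores(scores, labels, rel=3, agg="max")
mean = aggregate_scores(scores, labels, rel=3, agg="avg")
```

With agg='max', an answer is rewarded for closely matching any one relevant document; with agg='avg', it is scored against all relevant documents equally.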

Custom measures can be implemented with ir_measures using a function that takes two dataframes for a single query, the gold-truth answers (qrels) and the generated answers (res):

import ir_measures
# counts the number of words in the generated answer
# (split() without arguments gives 0 for an empty answer, unlike split(" "))
AnswerWords = ir_measures.define_byquery(
    lambda qrels, res: len(res.iloc[0]['qanswer'].split()),
    name='AnswerWords', support_cutoff=False)
pt.Experiment(
    [ragpipe1, ragpipe2],
    dataset.get_topics(),
    dataset.get_answers(),
    [AnswerWords]
)