RAG Measures
PyTerrier-RAG offers a number of commonly used evaluation measures as ir_measures objects that can be used via pt.Experiment() and pt.Evaluate().
- For analysis of the generated answers:
Answer length (in characters):
pyterrier_rag.measures.AnswerLen
Answer zero length (number of questions with an empty answer):
pyterrier_rag.measures.AnswerZeroLen
- For comparison with gold-truth answers:
Exact match percentage:
pyterrier_rag.measures.EM
F1:
pyterrier_rag.measures.F1
ROUGE:
pyterrier_rag.measures.ROUGE1P, pyterrier_rag.measures.ROUGE1R, pyterrier_rag.measures.ROUGE1F, etc., as implemented by the rouge-score library.
Example:
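import pyterrier as pt
import pyterrier_rag

# ragpipe1 and ragpipe2 (RAG pipelines) and dataset are assumed to be defined already
pt.Experiment(
    [ragpipe1, ragpipe2],
    dataset.get_topics(),
    dataset.get_answers(),
    [pyterrier_rag.measures.EM, pyterrier_rag.measures.F1, pyterrier_rag.measures.ROUGE1F, pyterrier_rag.measures.AnswerLen]
)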
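For intuition about what the exact match and F1 measures capture, the following is a minimal sketch in plain Python. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles) and scoring against the best-matching gold answer; the helper names are hypothetical, and the normalization PyTerrier-RAG actually applies may differ.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    This normalization is an assumption made for illustration only.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, gold_answers: list[str]) -> float:
    """Return 1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return float(any(pred == normalize(gold) for gold in gold_answers))


def token_f1(prediction: str, gold_answers: list[str]) -> float:
    """Return the best token-overlap F1 between the prediction and any gold answer."""
    pred = normalize(prediction)
    best = 0.0
    for gold in gold_answers:
        ref = normalize(gold)
        # Multiset intersection counts each shared token at most as often
        # as it occurs in both the prediction and the reference.
        overlap = sum((Counter(pred) & Counter(ref)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred)
        recall = overlap / len(ref)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

Exact match rewards only answers that are identical after normalization, while F1 gives partial credit for overlapping tokens, which is why the two are usually reported together.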
- Various ROUGE measures are available:
ROUGE-1 (precision, recall, f-measure)
ROUGE-2 (precision, recall, f-measure)
ROUGE-L (precision, recall, f-measure)
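To illustrate how the precision, recall and f-measure variants relate, here is a simplified sketch of ROUGE-L based on the longest common subsequence (LCS) of tokens. It assumes lowercased whitespace tokenization; the rouge-score library uses its own tokenizer and optional stemming, so its values can differ from this sketch. The function names are hypothetical.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: str, prediction: str) -> dict[str, float]:
    """Simplified ROUGE-L precision, recall and f-measure for one answer."""
    ref = reference.lower().split()
    pred = prediction.lower().split()
    lcs = lcs_length(ref, pred)
    precision = lcs / len(pred) if pred else 0.0
    recall = lcs / len(ref) if ref else 0.0
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total > 0 else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}
```

Precision divides the LCS length by the length of the generated answer, recall divides it by the length of the reference, and the f-measure is their harmonic mean.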
- For comparison with known-relevant documents:
BERTScore (measures similarity of answer with relevant documents):
pyterrier_rag.measures.BERTScore
- pyterrier_rag.measures.BERTScore(rel=3, submeasure='f1', agg='max')
Implements BERTScore, a semantic measure of equivalence. It takes a qrels dataframe with an additional text attribute and compares the text of the relevant documents with the generated answers (the qanswer column). NB: BERTScore is a function that returns a measure, so it must be called, e.g. BERTScore(rel=3).
- Parameters:
rel – Minimum label value for relevant qrels. Defaults to 3, which is the highest label in MSMARCO.
submeasure – One of ‘precision’, ‘recall’ and ‘f1’. Defaults to ‘f1’.
agg – How to combine (aggregate) when there are multiple relevant documents. Valid options are ‘max’ or ‘avg’. Defaults to ‘max’.
- Returns:
An ir_measures measure object that can be used in pt.Evaluate() or pt.Experiment()
Example:
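import pyterrier as pt
import pyterrier_rag

# ragpipe1 and ragpipe2 are RAG pipelines assumed to be defined already
# text_loader adds the passage text to each qrels row
text_loader = pt.text.get_text(pt.get_dataset('irds:msmarco-passage'), 'text')
topics_qrels = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
pt.Experiment(
    [ragpipe1, ragpipe2],
    topics_qrels.get_topics('test-2019'),
    text_loader(topics_qrels.get_qrels()),
    [pyterrier_rag.measures.BERTScore(rel=3)]
)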
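The effect of the rel and agg parameters can be sketched without running BERTScore itself. The example below is an illustration only: toy_similarity is a hypothetical word-overlap stand-in for the real BERTScore model, the function names are invented here, and returning 0.0 when a query has no relevant documents is an assumption rather than documented behaviour.

```python
from statistics import mean
from typing import Callable


def toy_similarity(answer: str, document: str) -> float:
    """Hypothetical stand-in for BERTScore: Jaccard overlap of word sets."""
    a, d = set(answer.lower().split()), set(document.lower().split())
    return len(a & d) / len(a | d) if a | d else 0.0


def aggregate_over_relevant(
    answer: str,
    judged_docs: list[tuple[int, str]],
    similarity: Callable[[str, str], float] = toy_similarity,
    rel: int = 3,
    agg: str = "max",
) -> float:
    """Score an answer against the documents whose label is at least rel."""
    relevant = [text for label, text in judged_docs if label >= rel]
    if not relevant:
        return 0.0  # assumption: no relevant documents gives a score of zero
    scores = [similarity(answer, text) for text in relevant]
    if agg == "max":
        return max(scores)
    if agg == "avg":
        return mean(scores)
    raise ValueError("agg must be 'max' or 'avg'")


docs = [
    (3, "paris is the capital of france"),
    (3, "berlin is in germany"),
    (1, "paris paris"),  # label below rel=3, so it is ignored
]
best = aggregate_over_relevant("paris is the capital", docs, agg="max")
average = aggregate_over_relevant("paris is the capital", docs, agg="avg")
```

With 'max', an answer is rewarded for matching any one relevant document; with 'avg', it is penalized when it matches only some of them.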
Custom measures can be defined with ir_measures.define_byquery(), which wraps a function that receives, for each query, a dataframe of gold-truth answers (qrels) and a dataframe of generated answers (res):
import ir_measures

# measure that counts the number of space-separated words in the answer
AnswerWords = ir_measures.define_byquery(
    lambda qrels, res: len(res.iloc[0]['qanswer'].split(" ")),
    name='AnswerWords', support_cutoff=False)

pt.Experiment(
    [ragpipe1, ragpipe2],
    dataset.get_topics(),
    dataset.get_answers(),
    [AnswerWords]
)
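One detail of the AnswerWords example above is worth noting: splitting on a single space counts an empty answer as one word and counts repeated spaces as extra words. A possible variant, sketched below, moves the counting into a plain function that can be checked on its own before it is wrapped as a measure. The names count_answer_words, answer_words_byquery, make_answer_words_measure and AnswerWordsWS are hypothetical; only ir_measures.define_byquery and the qanswer column come from the example above.

```python
def count_answer_words(answer: str) -> int:
    """Count whitespace-separated words; an empty answer has zero words."""
    return len(answer.split())


def answer_words_byquery(qrels, res):
    """Per-query function in the same (qrels, res) form as the example above."""
    return count_answer_words(res.iloc[0]["qanswer"])


def make_answer_words_measure():
    """Wrap the per-query function as a measure (requires ir_measures)."""
    import ir_measures  # imported lazily so the helpers above work without it

    return ir_measures.define_byquery(
        answer_words_byquery, name="AnswerWordsWS", support_cutoff=False
    )
```

The object returned by make_answer_words_measure() can then be passed in the measures list of pt.Experiment() in the same way as AnswerWords.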