RAG Measures
============
PyTerrier-RAG provides several commonly used evaluation measures as ``ir_measures`` objects, which can be passed to ``pt.Experiment()`` and ``pt.Evaluate()``.

For analysis of the generated answers:

- Answer length (in characters): ``pyterrier_rag.measures.AnswerLen``
- Answer zero length (number of questions with an empty answer): ``pyterrier_rag.measures.AnswerZeroLen``
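
As a rough illustration of what these two measures report, the sketch below computes per-answer character lengths and counts empty answers over a small, made-up list. The ``answers`` list and the helper names are hypothetical and are not part of PyTerrier-RAG; the library's own implementation may differ in detail.

```python
# Hypothetical generated answers, one per question (illustrative data only).
answers = ["Paris", "", "The Eiffel Tower is in Paris."]


def answer_len(answer: str) -> int:
    """Length of the answer in characters."""
    return len(answer)


def is_zero_len(answer: str) -> bool:
    """True when the answer is the empty string."""
    return len(answer) == 0


# Per-question lengths, their mean, and the number of empty answers.
lengths = [answer_len(a) for a in answers]
mean_len = sum(lengths) / len(lengths)
num_empty = sum(is_zero_len(a) for a in answers)
```

A high count of empty answers usually points at a generation or prompt-formatting problem rather than a retrieval one, which is why it is worth reporting alongside the answer-quality measures.
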

For comparison with gold-truth answers:

- Exact match percentage: ``pyterrier_rag.measures.EM``
- F1: ``pyterrier_rag.measures.F1``
- ROUGE: ``pyterrier_rag.measures.ROUGE1P``, ``pyterrier_rag.measures.ROUGE1R``, ``pyterrier_rag.measures.ROUGE1F`` and so on, as implemented by the ``rouge-score`` library.
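
To make the first two measures concrete, here is a minimal sketch of exact match and token-level F1 in the style commonly used for question answering evaluation. It assumes SQuAD-style normalisation (lowercasing, removing punctuation and English articles); whether PyTerrier-RAG normalises answers in exactly this way is an assumption, and all function names below are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """1.0 if the normalised prediction equals any normalised gold answer."""
    pred = normalize(prediction)
    return float(any(pred == normalize(g) for g in gold_answers))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall against one gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction, gold_answers):
    """Score against every gold answer and keep the best."""
    return max(token_f1(prediction, g) for g in gold_answers)
```

Exact match rewards only answers that are identical after normalisation, while F1 gives partial credit to longer answers that contain the gold answer, so the two are usually reported together.
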

Example::

    # ragpipe1, ragpipe2 and dataset are assumed to have been defined already
    pt.Experiment(
        [ragpipe1, ragpipe2],
        dataset.get_topics(),
        dataset.get_answers(),
        [pyterrier_rag.measures.EM, pyterrier_rag.measures.F1, pyterrier_rag.measures.ROUGE1F, pyterrier_rag.measures.AnswerLen]
    )

Various ROUGE measures are available:

- ROUGE-1 (precision, recall, f-measure)
- ROUGE-2 (precision, recall, f-measure)
- ROUGE-L (precision, recall, f-measure)
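
The difference between these variants can be sketched in plain Python: ROUGE-1 and ROUGE-2 count overlapping unigrams and bigrams, while ROUGE-L uses the longest common subsequence. This is a simplified illustration that assumes lowercased whitespace tokenisation and no stemming, so its numbers may differ from those of the ``rouge-score`` library; the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(matches, pred_total, ref_total):
    """Precision, recall and f-measure from a match count."""
    p = matches / pred_total if pred_total else 0.0
    r = matches / ref_total if ref_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def rouge_n(prediction, reference, n):
    """ROUGE-N: clipped n-gram overlap between prediction and reference."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    matches = sum((pred & ref).values())
    return prf(matches, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """ROUGE-L: longest common subsequence relative to each text's length."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    return prf(lcs_length(pred, ref), len(pred), len(ref))
```

Precision penalises answers that add material not found in the gold answer, recall penalises answers that omit it, and the f-measure balances the two.
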

For comparison with known-relevant documents:

- BERTScore (measures the similarity of the answer to the relevant documents): ``pyterrier_rag.measures.BERTScore``

.. autofunction:: pyterrier_rag.measures.BERTScore()

Example::

    # load the passage text, and the TREC DL 2019 topics and qrels
    text_loader = pt.text.get_text(pt.get_dataset('irds:msmarco-passage'), 'text')
    topics_qrels = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
    pt.Experiment(
        [ragpipe1, ragpipe2],
        topics_qrels.get_topics(),
        text_loader(topics_qrels.get_qrels()),
        [pyterrier_rag.measures.BERTScore(rel=3)]
    )
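
For intuition about what a BERTScore-style comparison computes, the sketch below shows the greedy token matching at its core: each answer token is matched to its most similar passage token (precision), and each passage token to its most similar answer token (recall). The toy 2-dimensional vectors stand in for contextual embeddings from a language model, the function names are hypothetical, and how PyTerrier-RAG aggregates scores over several relevant documents is not covered here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_score(cand_vecs, ref_vecs):
    """Greedy-matching precision, recall and F1 over token embeddings."""
    # Precision: how well each candidate token is covered by the reference.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: how well each reference token is covered by the candidate.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy embeddings (illustrative only): two answer tokens, one passage token.
answer_vecs = [[1.0, 0.0], [0.0, 1.0]]
passage_vecs = [[1.0, 0.0]]
p, r, f = greedy_match_score(answer_vecs, passage_vecs)
```

Because the matching is done in embedding space, paraphrases can score well even when they share no surface tokens with the passage, unlike the ROUGE family.
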

Using ``ir_measures``, custom measures can easily be implemented as functions that take, for each query, a dataframe of gold-truth answers (``qrels``) and a dataframe of generated answers (``res``)::

    import ir_measures

    # measure that counts the number of words in the answer
    AnswerWords = ir_measures.define_byquery(
        lambda qrels, res: len(res.iloc[0]['qanswer'].split(" ")),
        name='AnswerWords', support_cutoff=False)

    pt.Experiment(
        [ragpipe1, ragpipe2],
        dataset.get_topics(),
        dataset.get_answers(),
        [AnswerWords]
    )