Diversity
Search Result Diversification
pyterrier-dr provides one diversification algorithm, MmrScorer (Maximal Marginal Relevance). The transformer works over input dataframes that contain the dense vectors of the documents and the query. You can also use a FlexIndex's mmr() method to first load vectors from the index and then apply MMR.
- class pyterrier_dr.MmrScorer(*, Lambda=0.5, norm_rel=False, norm_sim=False, drop_doc_vec=True, verbose=False)
An MMR (Maximal Marginal Relevance) scorer (i.e., re-ranker).
The MMR scorer re-orders documents by balancing relevance (from the initial scores) and diversity (based on the similarity of the document vectors).
Citation
Carbonell and Goldstein. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. SIGIR 1998. https://doi.org/10.1145/290941.291025
@inproceedings{DBLP:conf/sigir/CarbonellG98, author = {Jaime G. Carbonell and Jade Goldstein}, editor = {W. Bruce Croft and Alistair Moffat and C. J. van Rijsbergen and Ross Wilkinson and Justin Zobel}, title = {The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries}, booktitle = {{SIGIR} '98: Proceedings of the 21st Annual International {ACM} {SIGIR} Conference on Research and Development in Information Retrieval, August 24-28 1998, Melbourne, Australia}, pages = {335--336}, publisher = {{ACM}}, year = {1998}, url = {https://doi.org/10.1145/290941.291025}, doi = {10.1145/290941.291025}, timestamp = {Wed, 14 Nov 2018 10:58:11 +0100}, biburl = {https://dblp.org/rec/conf/sigir/CarbonellG98.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- Parameters:
Lambda – The balance parameter between relevance and diversity (default: 0.5)
norm_rel – Whether to normalize relevance scores to [0, 1] (default: False)
norm_sim – Whether to normalize similarity scores to [0, 1] (default: False)
drop_doc_vec – Whether to drop the ‘doc_vec’ column after re-ranking (default: True)
verbose – Whether to display verbose output (e.g., progress bars) (default: False)
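The greedy selection behind MMR can be sketched in a few lines of NumPy. This is a minimal illustration, not the MmrScorer implementation: the function name mmr_rerank, the use of cosine similarity between document vectors, and the convention that lam weights relevance (with 1 - lam weighting the redundancy penalty) are all assumptions made for the example. It also assumes every document vector is non-zero.

```python
import numpy as np

def mmr_rerank(rel, doc_vecs, lam=0.5):
    """Greedy MMR sketch: return document indices in re-ranked order.

    rel      -- initial relevance scores, one per document
    doc_vecs -- dense document vectors, shape (n_docs, dim), all non-zero
    lam      -- assumed trade-off: 1.0 = relevance only, 0.0 = diversity only
    """
    rel = np.asarray(rel, dtype=float)
    vecs = np.asarray(doc_vecs, dtype=float)
    # Cosine similarity between every pair of documents (assumed similarity).
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim = unit @ unit.T

    remaining = list(range(len(rel)))
    selected = []
    while remaining:
        if selected:
            # Redundancy: highest similarity to any already-selected document.
            penalty = sim[np.ix_(remaining, selected)].max(axis=1)
        else:
            # Nothing selected yet, so the first pick is purely by relevance.
            penalty = np.zeros(len(remaining))
        scores = lam * rel[remaining] - (1 - lam) * penalty
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Document 1 is a near-duplicate of document 0; document 2 is orthogonal to both.
order = mmr_rerank([1.0, 0.9, 0.5], [[1, 0], [1, 0.01], [0, 1]], lam=0.5)
print(order)  # [0, 2, 1]
```

In this sketch, lam=1.0 removes the penalty and returns the original relevance order, while smaller values push near-duplicates of already-selected documents further down the list.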
Diversity Evaluation
pyterrier-dr provides one diversity evaluation measure, ILS() (Intra-List Similarity), which can be used to evaluate the diversity of search results based on the dense vectors of a FlexIndex. This measure can be used alongside PyTerrier’s built-in evaluation measures in a pyterrier.Experiment():
import pyterrier as pt
from pyterrier.measures import nDCG, R
from pyterrier_dr import FlexIndex, TasB
from pyterrier_pisa import PisaIndex

# Test collection, dense index, lexical baseline, and query encoder
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
index = FlexIndex.from_hf('macavaney/msmarco-passage.tasb.flex')
bm25 = PisaIndex.from_hf('macavaney/msmarco-passage.pisa').bm25()
model = TasB.dot()

pt.Experiment(
    [
        bm25,
        model >> index.retriever(),
        model >> index.retriever() >> index.mmr(),
    ],
    dataset.get_topics(),
    dataset.get_qrels(),
    [nDCG@10, R(rel=2)@1000, index.ILS@10, index.ILS@1000],
    # Row labels matching the table below
    names=['BM25', 'TasB', 'TasB w/ MMR'],
)
# name         nDCG@10  R(rel=2)@1000  ILS@10  ILS@1000
# BM25           0.498          0.755   0.852     0.754
# TasB           0.716          0.841   0.889     0.775
# TasB w/ MMR    0.714          0.841   0.888     0.775
- pyterrier_dr.ILS(index, *, name=None, verbose=False)
Create an ILS (Intra-List Similarity) measure calculated using the vectors in the provided index.
Higher scores indicate lower diversity in the results.
This measure supports the @k convention for applying a top-k cutoff before scoring.
- Parameters:
index (FlexIndex) – The index to use for loading document vectors.
name (str, optional) – The name of the measure (default: “ILS”).
verbose (bool, optional) – Whether to display a progress bar.
- Returns:
An ILS measure object.
- Return type:
ir_measures.Measure
Citation
Ziegler et al. Improving recommendation lists through topic diversification. WWW 2005. https://doi.org/10.1145/1060745.1060754
@inproceedings{DBLP:conf/www/ZieglerMKL05, author = {Cai{-}Nicolas Ziegler and Sean M. McNee and Joseph A. Konstan and Georg Lausen}, editor = {Allan Ellis and Tatsuya Hagino}, title = {Improving recommendation lists through topic diversification}, booktitle = {Proceedings of the 14th international conference on World Wide Web, {WWW} 2005, Chiba, Japan, May 10-14, 2005}, pages = {22--32}, publisher = {{ACM}}, year = {2005}, url = {https://doi.org/10.1145/1060745.1060754}, doi = {10.1145/1060745.1060754}, timestamp = {Fri, 25 Dec 2020 01:14:58 +0100}, biburl = {https://dblp.org/rec/conf/www/ZieglerMKL05.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
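As a rough illustration of what an intra-list similarity score measures, the sketch below averages the pairwise similarity over all distinct document pairs in one ranked list. It is not the library's implementation: the function name intra_list_similarity, the choice of cosine similarity, and the value returned for lists with fewer than two documents are assumptions made for the example. It also assumes every document vector is non-zero.

```python
import numpy as np

def intra_list_similarity(doc_vecs):
    """Mean pairwise cosine similarity of the documents in one result list."""
    vecs = np.asarray(doc_vecs, dtype=float)
    if len(vecs) < 2:
        # Assumption: a list with no document pairs is scored as 0.0.
        return 0.0
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim = unit @ unit.T
    # Indices above the diagonal: each unordered pair once, no self-pairs.
    rows, cols = np.triu_indices(len(vecs), k=1)
    return float(sim[rows, cols].mean())

# Two identical documents plus one orthogonal document:
# pair similarities are 1, 0 and 0, so the mean is 1/3.
ranked_vecs = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(intra_list_similarity(ranked_vecs))

# A top-k cutoff (the @k convention) amounts to slicing the ranked list first.
print(intra_list_similarity(ranked_vecs[:2]))
```

A list of near-identical documents scores close to 1 under this sketch, which is why higher values indicate lower diversity.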
- pyterrier_dr.ils(results, index=None, *, verbose=False)
Calculate the ILS (Intra-List Similarity) of a set of results.
Higher scores indicate lower diversity in the results.
- Parameters:
results – The result frame to calculate ILS for.
index – The index to use for loading document vectors. Required if results does not have a doc_vec column.
verbose – Whether to display a progress bar.
- Returns:
An iterable of (qid, ILS) pairs.
- Return type:
Iterable[Tuple[str,float]]
Citation
Ziegler et al. Improving recommendation lists through topic diversification. WWW 2005. https://doi.org/10.1145/1060745.1060754