Diversity using Dense Vectors
Dense vectors can be used both to diversify search results and to measure how diverse a set of results is.
pyterrier-dr provides functionality for both of these use cases.
Search Result Diversification
Maximal Marginal Relevance (MMR) is a technique to diversify search results by balancing relevance and novelty.
It is available using pyterrier_dr.MmrScorer, which uses document vector similarities to measure novelty,
and the value in the score input column to measure relevance.
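To make the relevance/novelty trade-off concrete, here is a minimal NumPy sketch of greedy MMR selection. It is an illustration of the general technique, not the code used by pyterrier_dr.MmrScorer: the function name `mmr_rerank`, the weight `lam`, and the choice of cosine similarity are assumptions made for this example, and document vectors are assumed to be non-zero.

```python
import numpy as np


def mmr_rerank(scores, doc_vecs, lam=0.5):
    """Greedy MMR: return document indices in diversified order.

    scores   -- one relevance score per document (stand-in for the `score` column)
    doc_vecs -- one vector per document (stand-in for the `doc_vec` column)
    lam      -- hypothetical trade-off weight: 1.0 ranks by relevance only,
                0.0 ranks by novelty only
    """
    scores = np.asarray(scores, dtype=float)
    vecs = np.asarray(doc_vecs, dtype=float)

    # Cosine similarity between every pair of documents (assumed similarity function).
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim = unit @ unit.T

    selected = []
    remaining = list(range(len(scores)))
    while remaining:
        if selected:
            # Redundancy of each candidate: its highest similarity to any selected document.
            redundancy = sim[np.ix_(remaining, selected)].max(axis=1)
        else:
            # Nothing selected yet, so the first pick is purely by relevance.
            redundancy = np.zeros(len(remaining))
        mmr = lam * scores[remaining] - (1.0 - lam) * redundancy
        best = remaining[int(np.argmax(mmr))]
        selected.append(best)
        remaining.remove(best)
    return selected


# Documents 0 and 1 are identical, so the less relevant but novel document 2
# is promoted above document 1.
print(mmr_rerank([1.0, 0.9, 0.5], [[1, 0], [1, 0], [0, 1]]))  # [0, 2, 1]
```

With `lam=1.0` the same input keeps its original relevance order, which is a quick way to check that the novelty term is what drives the reordering.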
The transformer requires a doc_vec column to be present in the input data frame. You
will therefore usually want to load vectors from a FlexIndex first using vec_loader(),
then apply MmrScorer. FlexIndex.mmr() is a shorthand that performs both
of these steps. Alternatively, you can place an encoder earlier in the pipeline to compute document vectors on the fly.
The example below applies BM25 retrieval over a sparse index, then applies search result diversification to the results using MMR.
sparse_index.bm25() >> dense_index.mmr()
Citation
Carbonell and Goldstein. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. SIGIR 1998. https://doi.org/10.1145/290941.291025
@inproceedings{DBLP:conf/sigir/CarbonellG98,
  author    = {Jaime G. Carbonell and
               Jade Goldstein},
  editor    = {W. Bruce Croft and
               Alistair Moffat and
               C. J. van Rijsbergen and
               Ross Wilkinson and
               Justin Zobel},
  title     = {The Use of MMR, Diversity-Based Reranking for Reordering Documents
               and Producing Summaries},
  booktitle = {{SIGIR} '98: Proceedings of the 21st Annual International {ACM} {SIGIR}
               Conference on Research and Development in Information Retrieval, August
               24-28 1998, Melbourne, Australia},
  pages     = {335--336},
  publisher = {{ACM}},
  year      = {1998},
  url       = {https://doi.org/10.1145/290941.291025},
  doi       = {10.1145/290941.291025},
  timestamp = {Wed, 14 Nov 2018 10:58:11 +0100},
  biburl    = {https://dblp.org/rec/conf/sigir/CarbonellG98.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
Diversity Evaluation
Intra-List Similarity (ILS) is a diversity evaluation measure that quantifies how similar the documents in a ranked list are to one another; lower values indicate a more diverse list.
It is available using pyterrier_dr.ILS() or FlexIndex.ILS.
This measure can be used alongside PyTerrier’s built-in evaluation measures in a pt.Experiment.
import pyterrier as pt
from pyterrier.measures import nDCG, R
from pyterrier_dr import FlexIndex, TasB
from pyterrier_pisa import PisaIndex

dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
index = FlexIndex.from_hf('macavaney/msmarco-passage.tasb.flex')
bm25 = PisaIndex.from_hf('macavaney/msmarco-passage.pisa').bm25()
model = TasB.dot()

pt.Experiment(
    [
        bm25,
        model >> index.retriever(),
        model >> index.retriever() >> index.mmr(),
    ],
    dataset.get_topics(),
    dataset.get_qrels(),
    [nDCG@10, R(rel=2)@1000, index.ILS@10, index.ILS@1000]
)
# name          nDCG@10  R(rel=2)@1000  ILS@10  ILS@1000
# BM25            0.498          0.755   0.852     0.754
# TasB            0.716          0.841   0.889     0.775
# TasB w/ MMR     0.714          0.841   0.888     0.775
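For intuition about what the ILS columns report, the following is a rough sketch of one common formulation of the measure: the mean pairwise similarity over the documents of a single ranked list. It is not pyterrier_dr's implementation. The function name `intra_list_similarity`, the use of cosine similarity, and the value returned for lists with fewer than two documents are all assumptions made for this example.

```python
import numpy as np


def intra_list_similarity(doc_vecs):
    """Mean pairwise cosine similarity over the documents of one ranked list."""
    vecs = np.asarray(doc_vecs, dtype=float)
    n = len(vecs)
    if n < 2:
        # Assumption: with fewer than two documents there are no pairs to compare.
        return 0.0

    # Normalise rows so that the dot product equals cosine similarity
    # (vectors are assumed to be non-zero).
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim = unit @ unit.T

    # Average over the n*(n-1)/2 distinct pairs (upper triangle, diagonal excluded).
    upper = np.triu_indices(n, k=1)
    return float(sim[upper].mean())


# A cutoff such as ILS@10 would correspond to scoring only the top 10 vectors
# of a ranking, e.g. intra_list_similarity(ranked_vecs[:10]), where ranked_vecs
# is a hypothetical array of document vectors in rank order.
redundant = [[1, 0], [1, 0], [1, 0]]
diverse = [[1, 0], [0, 1]]
print(intra_list_similarity(redundant))  # 1.0
print(intra_list_similarity(diverse))    # 0.0
```

Under this formulation a list of near-duplicates scores close to 1 and a list of unrelated documents scores close to 0, which is why a lower value is read as a more diverse ranking.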
Citation
Ziegler et al. Improving recommendation lists through topic diversification. WWW 2005. https://doi.org/10.1145/1060745.1060754
@inproceedings{DBLP:conf/www/ZieglerMKL05,
  author    = {Cai{-}Nicolas Ziegler and
               Sean M. McNee and
               Joseph A. Konstan and
               Georg Lausen},
  editor    = {Allan Ellis and
               Tatsuya Hagino},
  title     = {Improving recommendation lists through topic diversification},
  booktitle = {Proceedings of the 14th international conference on World Wide Web,
               {WWW} 2005, Chiba, Japan, May 10-14, 2005},
  pages     = {22--32},
  publisher = {{ACM}},
  year      = {2005},
  url       = {https://doi.org/10.1145/1060745.1060754},
  doi       = {10.1145/1060745.1060754},
  timestamp = {Fri, 25 Dec 2020 01:14:58 +0100},
  biburl    = {https://dblp.org/rec/conf/www/ZieglerMKL05.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}