Diversity using Dense Vectors

Dense vectors can be used both to diversify search results and to measure their diversity. pyterrier-dr provides functionality for both use cases.

Search Result Diversification

Maximal Marginal Relevance (MMR) is a technique to diversify search results by balancing relevance and novelty. It is available using pyterrier_dr.MmrScorer, which uses document vector similarities to measure novelty, and the value in the score input column to measure relevance.
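To make the relevance/novelty trade-off concrete, the following is a minimal sketch of greedy MMR selection in pure Python. It is not the MmrScorer implementation: the function names (cosine, mmr_rerank), the lam parameter, the use of cosine similarity, and the toy scores and vectors are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_rerank(scores, vectors, lam=0.5):
    """Return document positions in greedy MMR order.

    Hypothetical helper: `lam` weights relevance against redundancy, where
    redundancy is the maximum similarity to any already-selected document.
    """
    remaining = list(range(len(scores)))
    selected = []
    while remaining:
        def marginal(i):
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected),
                default=0.0,  # nothing selected yet, so nothing to be redundant with
            )
            return lam * scores[i] - (1 - lam) * redundancy
        best = max(remaining, key=marginal)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy input: documents 0 and 1 are near-duplicates, document 2 is different.
scores = [1.0, 0.9, 0.8]
vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
order = mmr_rerank(scores, vectors, lam=0.5)
print(order)
```

In this toy input, the less relevant but novel document 2 is promoted above the near-duplicate document 1. Setting lam=1.0 ignores novelty and reproduces the original relevance ordering.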

MmrScorer requires a doc_vec column in the input data frame. You will therefore usually load vectors from a FlexIndex first using vec_loader(), then apply MmrScorer. FlexIndex.mmr() is shorthand for these two steps. Alternatively, you can place an encoder earlier in the pipeline to compute document vectors on the fly.

The example below applies BM25 retrieval over a sparse index, then applies search result diversification to the results using MMR.

sparse_index.bm25() >> dense_index.mmr()

Citation

Carbonell and Goldstein. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. SIGIR 1998.
@inproceedings{DBLP:conf/sigir/CarbonellG98,
  author       = {Jaime G. Carbonell and
                  Jade Goldstein},
  editor       = {W. Bruce Croft and
                  Alistair Moffat and
                  C. J. van Rijsbergen and
                  Ross Wilkinson and
                  Justin Zobel},
  title        = {The Use of MMR, Diversity-Based Reranking for Reordering Documents
                  and Producing Summaries},
  booktitle    = {{SIGIR} '98: Proceedings of the 21st Annual International {ACM} {SIGIR}
                  Conference on Research and Development in Information Retrieval, August
                  24-28 1998, Melbourne, Australia},
  pages        = {335--336},
  publisher    = {{ACM}},
  year         = {1998},
  url          = {https://doi.org/10.1145/290941.291025},
  doi          = {10.1145/290941.291025},
  timestamp    = {Wed, 14 Nov 2018 10:58:11 +0100},
  biburl       = {https://dblp.org/rec/conf/sigir/CarbonellG98.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Diversity Evaluation

Intra-List Similarity (ILS) is a diversity evaluation measure that quantifies the similarity between documents in a ranked list; lower values indicate a more diverse list. It is available using pyterrier_dr.ILS() or FlexIndex.ILS.
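As a rough illustration of what the measure computes, the sketch below averages the pairwise cosine similarity over the top k documents of a ranked list. The function names, the choice of cosine similarity, the averaging over unordered pairs, and the handling of lists with fewer than two documents are assumptions; the pyterrier_dr implementation may differ in these details.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def intra_list_similarity(vectors, k=None):
    """Mean pairwise cosine similarity of the top-k document vectors.

    Hypothetical helper: `vectors` is assumed to be in rank order, and a list
    with fewer than two documents is assigned 0.0 by convention here.
    """
    top = vectors[:k] if k is not None else vectors
    pairs = list(combinations(top, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy ranking: the first two documents are identical, the third is orthogonal.
ranked = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
ils_at_2 = intra_list_similarity(ranked, k=2)
ils_at_3 = intra_list_similarity(ranked, k=3)
print(ils_at_2, ils_at_3)
```

The cutoff plays the same role as in ILS@10 or ILS@1000: only documents above the cutoff contribute to the average.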

This measure can be used alongside PyTerrier’s built-in evaluation measures in a pt.Experiment.

Compare the relevance and ILS of lexical and dense retrieval with a PyTerrier Experiment
import pyterrier as pt
from pyterrier.measures import nDCG, R
from pyterrier_dr import FlexIndex, TasB
from pyterrier_pisa import PisaIndex

dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
index = FlexIndex.from_hf('macavaney/msmarco-passage.tasb.flex')
bm25 = PisaIndex.from_hf('macavaney/msmarco-passage.pisa').bm25()
model = TasB.dot()

pt.Experiment(
    [
        bm25,
        model >> index.retriever(),
        model >> index.retriever() >> index.mmr(),
    ],
    dataset.get_topics(),
    dataset.get_qrels(),
    [nDCG@10, R(rel=2)@1000, index.ILS@10, index.ILS@1000]
)
#        name   nDCG@10  R(rel=2)@1000    ILS@10  ILS@1000
# BM25            0.498          0.755     0.852     0.754
# TasB            0.716          0.841     0.889     0.775
# TasB w/ MMR     0.714          0.841     0.888     0.775

Citation

Ziegler et al. Improving recommendation lists through topic diversification. WWW 2005.
@inproceedings{DBLP:conf/www/ZieglerMKL05,
  author       = {Cai{-}Nicolas Ziegler and
                  Sean M. McNee and
                  Joseph A. Konstan and
                  Georg Lausen},
  editor       = {Allan Ellis and
                  Tatsuya Hagino},
  title        = {Improving recommendation lists through topic diversification},
  booktitle    = {Proceedings of the 14th international conference on World Wide Web,
                  {WWW} 2005, Chiba, Japan, May 10-14, 2005},
  pages        = {22--32},
  publisher    = {{ACM}},
  year         = {2005},
  url          = {https://doi.org/10.1145/1060745.1060754},
  doi          = {10.1145/1060745.1060754},
  timestamp    = {Fri, 25 Dec 2020 01:14:58 +0100},
  biburl       = {https://dblp.org/rec/conf/www/ZieglerMKL05.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}