Diversity

Search Result Diversification

pyterrier-dr provides one diversification algorithm, MmrScorer (Maximal Marginal Relevance). This transformer operates on input data frames that contain dense vectors for both the query and the documents. Alternatively, FlexIndex.mmr() first loads the vectors from an index and then applies MMR.

class pyterrier_dr.MmrScorer(*, Lambda=0.5, norm_rel=False, norm_sim=False, drop_doc_vec=True, verbose=False)

An MMR (Maximal Marginal Relevance) scorer (i.e., re-ranker).

The MMR scorer re-orders documents by balancing relevance (from the initial scores) and diversity (based on the similarity of the document vectors).
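The greedy selection behind MMR can be sketched without any dependencies. This is a minimal illustration, not the MmrScorer implementation: it assumes Carbonell and Goldstein's marginal score (lam times relevance, minus (1 - lam) times the maximum similarity to any already-selected document) and uses a plain dot product as the similarity. The names mmr_rerank, dot, and lam are hypothetical.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mmr_rerank(
    scores: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    lam: float = 0.5,
) -> List[int]:
    """Return document indices in greedy MMR order.

    Assumption: marginal score = lam * relevance
    - (1 - lam) * max similarity to any already-selected document.
    """
    remaining = list(range(len(scores)))
    selected: List[int] = []

    def marginal(i: int) -> float:
        # Redundancy is 0 for the first pick, since nothing is selected yet.
        redundancy = max(
            (dot(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
        )
        return lam * scores[i] - (1.0 - lam) * redundancy

    while remaining:
        # max() keeps the first of any tied candidates, i.e. the earlier-ranked one.
        best = max(remaining, key=marginal)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy input: documents 0 and 1 are near-duplicates; document 2 covers another topic.
scores = [0.9, 0.85, 0.5]
doc_vecs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

relevance_only = mmr_rerank(scores, doc_vecs, lam=1.0)  # [0, 1, 2]
balanced = mmr_rerank(scores, doc_vecs, lam=0.5)        # [0, 2, 1]
```

With lam=1.0 the original relevance order is preserved. With lam=0.5 the duplicate of the top document is pushed below the less relevant but novel document.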

Citation

Carbonell and Goldstein. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. SIGIR 1998.
@inproceedings{DBLP:conf/sigir/CarbonellG98,
  author       = {Jaime G. Carbonell and
                  Jade Goldstein},
  editor       = {W. Bruce Croft and
                  Alistair Moffat and
                  C. J. van Rijsbergen and
                  Ross Wilkinson and
                  Justin Zobel},
  title        = {The Use of MMR, Diversity-Based Reranking for Reordering Documents
                  and Producing Summaries},
  booktitle    = {{SIGIR} '98: Proceedings of the 21st Annual International {ACM} {SIGIR}
                  Conference on Research and Development in Information Retrieval, August
                  24-28 1998, Melbourne, Australia},
  pages        = {335--336},
  publisher    = {{ACM}},
  year         = {1998},
  url          = {https://doi.org/10.1145/290941.291025},
  doi          = {10.1145/290941.291025},
  timestamp    = {Wed, 14 Nov 2018 10:58:11 +0100},
  biburl       = {https://dblp.org/rec/conf/sigir/CarbonellG98.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
Parameters:
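  • Lambda – The balance parameter between relevance and diversity; in the original MMR formulation, values closer to 1 favor relevance and values closer to 0 favor diversity (default: 0.5)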

  • norm_rel – Whether to normalize relevance scores to [0, 1] (default: False)

  • norm_sim – Whether to normalize similarity scores to [0, 1] (default: False)

  • drop_doc_vec – Whether to drop the ‘doc_vec’ column after re-ranking (default: True)

  • verbose – Whether to display verbose output (e.g., progress bars) (default: False)
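Normalization matters because Lambda trades off two quantities that may live on very different scales. The source does not say how norm_rel and norm_sim rescale values, so the sketch below assumes plain min-max scaling, which is one common way to map scores to [0, 1]. The function name min_max and the input values are hypothetical.

```python
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal: avoid dividing by zero; 0.0 is an arbitrary choice.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical inputs: relevance scores and similarities on different scales.
relevance = [12.0, 9.0, 6.0]
similarity = [110.0, 100.0, 90.0]

norm_relevance = min_max(relevance)    # [1.0, 0.5, 0.0]
norm_similarity = min_max(similarity)  # [1.0, 0.5, 0.0]
```

After rescaling, both quantities span the same range, so a Lambda of 0.5 weights them comparably.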

Diversity Evaluation

pyterrier-dr provides one diversity evaluation measure, ILS() (Intra-List Similarity), which can be used to evaluate the diversity of search results based on the dense vectors of a FlexIndex.

This measure can be used alongside PyTerrier’s built-in evaluation measures in a pyterrier.Experiment().

Compare the relevance and ILS of lexical and dense retrieval with a PyTerrier Experiment
import pyterrier as pt
from pyterrier.measures import nDCG, R
from pyterrier_dr import FlexIndex, TasB
from pyterrier_pisa import PisaIndex

# Test collection: TREC DL 2019 passage queries that have relevance judgments
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
# Pre-built dense (TAS-B) and lexical (PISA) indexes
index = FlexIndex.from_hf('macavaney/msmarco-passage.tasb.flex')
bm25 = PisaIndex.from_hf('macavaney/msmarco-passage.pisa').bm25()
model = TasB.dot()

pt.Experiment(
    [
        bm25,                                       # lexical baseline
        model >> index.retriever(),                 # dense retrieval
        model >> index.retriever() >> index.mmr(),  # dense retrieval, re-ranked with MMR
    ],
    dataset.get_topics(),
    dataset.get_qrels(),
    [nDCG@10, R(rel=2)@1000, index.ILS@10, index.ILS@1000],
    names=['BM25', 'TasB', 'TasB w/ MMR'],  # row labels used in the output below
)
#        name   nDCG@10  R(rel=2)@1000    ILS@10  ILS@1000
# BM25            0.498          0.755     0.852     0.754
# TasB            0.716          0.841     0.889     0.775
# TasB w/ MMR     0.714          0.841     0.888     0.775
pyterrier_dr.ILS(index, *, name=None, verbose=False)

Create an ILS (Intra-List Similarity) measure calculated using the vectors in the provided index.

Higher scores indicate lower diversity in the results.

This measure supports the @k convention for applying a top-k cutoff before scoring.

Parameters:
  • index (FlexIndex) – The index to use for loading document vectors.

  • name (str, optional) – The name of the measure (default: “ILS”).

  • verbose (bool, optional) – Whether to display a progress bar.

Returns:

An ILS measure object.

Return type:

ir_measures.Measure
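To make the measure concrete, here is a minimal sketch of an intra-list similarity with a top-k cutoff. It assumes ILS is the mean cosine similarity over all unordered pairs of document vectors in the list, which is one common reading of Ziegler et al.; the exact similarity and normalization used by pyterrier-dr may differ. The names ils_at_k and cosine are hypothetical, and the vectors are assumed to be non-zero.

```python
import math
from itertools import combinations
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ils_at_k(doc_vecs: Sequence[Sequence[float]], k: Optional[int] = None) -> float:
    """Mean pairwise cosine similarity over the top-k vectors (all if k is None)."""
    top = doc_vecs[:k] if k is not None else doc_vecs
    pairs = list(combinations(top, 2))
    if not pairs:
        # Fewer than two documents: 0.0 is a convention chosen for this sketch.
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Vectors in rank order: the top two results are identical, the third is orthogonal.
ranked_vecs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

ils_2 = ils_at_k(ranked_vecs, k=2)  # only the two identical documents are compared
ils_3 = ils_at_k(ranked_vecs, k=3)  # the orthogonal document lowers the average
```

The cutoff changes the value because it changes which pairs are averaged: the top two alone are maximally similar, while adding the third document brings in two dissimilar pairs.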

Citation

Ziegler et al. Improving recommendation lists through topic diversification. WWW 2005.
@inproceedings{DBLP:conf/www/ZieglerMKL05,
  author       = {Cai{-}Nicolas Ziegler and
                  Sean M. McNee and
                  Joseph A. Konstan and
                  Georg Lausen},
  editor       = {Allan Ellis and
                  Tatsuya Hagino},
  title        = {Improving recommendation lists through topic diversification},
  booktitle    = {Proceedings of the 14th international conference on World Wide Web,
                  {WWW} 2005, Chiba, Japan, May 10-14, 2005},
  pages        = {22--32},
  publisher    = {{ACM}},
  year         = {2005},
  url          = {https://doi.org/10.1145/1060745.1060754},
  doi          = {10.1145/1060745.1060754},
  timestamp    = {Fri, 25 Dec 2020 01:14:58 +0100},
  biburl       = {https://dblp.org/rec/conf/www/ZieglerMKL05.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
pyterrier_dr.ils(results, index=None, *, verbose=False)

Calculate the ILS (Intra-List Similarity) of a set of results.

Higher scores indicate lower diversity in the results.

Parameters:
  • results – The result frame to calculate ILS for.

  • index – The index to use for loading document vectors. Required if results does not have a doc_vec column.

  • verbose – Whether to display a progress bar.

Returns:

An iterable of (qid, ILS) pairs.

Return type:

Iterable[Tuple[str, float]]
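The shape of the return value can be illustrated by grouping result rows by query and yielding one pair per query. This is a sketch under assumptions, not the library's code: rows are simplified to hypothetical (qid, doc_vec) tuples in rank order rather than a full result frame, and similarity is taken to be the mean pairwise dot product.

```python
from itertools import combinations
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple

Row = Tuple[str, Sequence[float]]  # hypothetical row: (qid, doc_vec)


def mean_pairwise_dot(vecs: Sequence[Sequence[float]]) -> float:
    """Average dot product over all unordered pairs of vectors."""
    pairs = list(combinations(vecs, 2))
    if not pairs:
        return 0.0  # convention for lists with fewer than two documents
    return sum(sum(x * y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)


def ils_per_query(rows: Iterable[Row]) -> Iterator[Tuple[str, float]]:
    """Yield one (qid, ILS) pair per query, in order of first appearance."""
    by_qid: Dict[str, List[Sequence[float]]] = {}
    for qid, vec in rows:
        by_qid.setdefault(qid, []).append(vec)
    for qid, vecs in by_qid.items():
        yield qid, mean_pairwise_dot(vecs)


rows = [
    ("q1", [1.0, 0.0]), ("q1", [1.0, 0.0]),  # q1: two identical documents
    ("q2", [1.0, 0.0]), ("q2", [0.0, 1.0]),  # q2: two orthogonal documents
]
per_query = list(ils_per_query(rows))  # [("q1", 1.0), ("q2", 0.0)]
```

The query with duplicate results scores 1.0 and the query with orthogonal results scores 0.0, consistent with higher scores indicating lower diversity.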

Citation

Ziegler et al. Improving recommendation lists through topic diversification. WWW 2005.