Diversity using Dense Vectors

Dense vectors can be used both to diversify search results and to measure their diversity. pyterrier-dr provides functionality for both use cases.

Search Result Diversification

Maximal Marginal Relevance (MMR) is a technique to diversify search results by balancing relevance and novelty. It is available using pyterrier_dr.MmrScorer, which uses document vector similarities to measure novelty, and the value in the score input column to measure relevance.
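The greedy selection behind MMR can be sketched in a few lines of plain Python. This is a minimal illustration, not the MmrScorer implementation: the function names (cosine, mmr_rerank), the lam parameter, and the choice of cosine similarity are assumptions made for the example, and no score or similarity normalisation is applied.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_rerank(scores, doc_vecs, lam=0.5):
    """Return document indices in greedy MMR order (hypothetical helper).

    scores:   relevance score per document (higher is better)
    doc_vecs: one dense vector per document
    lam:      trade-off; 1.0 is pure relevance, 0.0 is pure novelty
    """
    remaining = list(range(len(scores)))
    selected = []
    while remaining:
        def mmr_value(i):
            # Redundancy: highest similarity to any already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * scores[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_value)
        selected.append(best)
        remaining.remove(best)
    return selected

# Documents 0 and 1 are near-duplicates; document 2 is different but less relevant.
scores = [1.0, 0.9, 0.8]
doc_vecs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
order = mmr_rerank(scores, doc_vecs, lam=0.5)
print(order)  # [0, 2, 1]
```

With lam=0.5 the duplicate of the top document is pushed below the novel one; with lam=1.0 the original relevance order is kept.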

The transformer requires a doc_vec column to be present in the input data frame. You will therefore usually want to load vectors from a FlexIndex first using vec_loader(), then apply MmrScorer. FlexIndex.mmr() is a shorthand that performs both of these steps. Alternatively, you can place an encoder earlier in the pipeline to compute document vectors on the fly.

The example below applies BM25 retrieval over a sparse index, then applies search result diversification to the results using MMR.

sparse_index.bm25() >> dense_index.mmr()

Pipeline schematic (stages and the columns present at each point):

Input columns

qid	str	Query ID
query	str	Query text

Stage 1: BM25 (pt.terrier.retriever.Retriever)

num_results	1000
metadata	['docno']
wmodel	BM25
bm25.k_1	1.2
bm25.b	0.75
termpipelines	Stopwords,PorterStemmer

Columns after BM25

qid	str	Query ID
query	str	Query text
docid	int	Internal document ID: integer ID of the document in a specific index
docno	str	External document ID: string ID of the document in the collection
rank	int	Rank of the document for the query (lower is better)
score	float	Ranking score of the document for the query (higher is better)

Stage 2: VecLoader (pyterrier_dr.flex.np_retr.NumpyVectorLoader)

flex_index	FlexIndex('/home/docs/.pyterrier/artifacts/426c662fb720c2576539eb3dad459d53be9556557fc4cc4f48556f5b581f70bb')

Columns after VecLoader: as after BM25, plus

doc_vec	np.array	Dense document vector

Stage 3: MMR (pyterrier_dr._mmr.MmrScorer)

Lambda	0.5
norm_rel	False
norm_sim	False
drop_doc_vec	True
verbose	False

Columns listed after MMR: qid, query, docid, docno, rank, score, doc_vec

Citation

Carbonell and Goldstein. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. SIGIR 1998. https://doi.org/10.1145/290941.291025

@inproceedings{DBLP:conf/sigir/CarbonellG98,
  author       = {Jaime G. Carbonell and
                  Jade Goldstein},
  editor       = {W. Bruce Croft and
                  Alistair Moffat and
                  C. J. van Rijsbergen and
                  Ross Wilkinson and
                  Justin Zobel},
  title        = {The Use of MMR, Diversity-Based Reranking for Reordering Documents
                  and Producing Summaries},
  booktitle    = {{SIGIR} '98: Proceedings of the 21st Annual International {ACM} {SIGIR}
                  Conference on Research and Development in Information Retrieval, August
                  24-28 1998, Melbourne, Australia},
  pages        = {335--336},
  publisher    = {{ACM}},
  year         = {1998},
  url          = {https://doi.org/10.1145/290941.291025},
  doi          = {10.1145/290941.291025},
  timestamp    = {Wed, 14 Nov 2018 10:58:11 +0100},
  biburl       = {https://dblp.org/rec/conf/sigir/CarbonellG98.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Diversity Evaluation

Intra-List Similarity (ILS) is a diversity evaluation measure that quantifies how similar the documents in a ranked list are to one another; lower values indicate a more diverse list. It is available as pyterrier_dr.ILS() or FlexIndex.ILS.
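As a rough illustration of what such a measure computes, the sketch below averages the pairwise cosine similarity of the top-k document vectors. It is written under stated assumptions and is not the pyterrier_dr implementation: the helper names (cosine, intra_list_similarity), the use of cosine similarity, and the convention of returning 0.0 for lists with fewer than two documents are all choices made for this example.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def intra_list_similarity(doc_vecs, k=None):
    """Mean pairwise cosine similarity of the top-k vectors (hypothetical helper).

    doc_vecs: document vectors in rank order
    k:        rank cutoff; None uses the whole list
    """
    top = doc_vecs[:k] if k is not None else doc_vecs
    pairs = list(combinations(top, 2))
    if not pairs:
        # Fewer than two documents: no pairs to compare (convention chosen here).
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

redundant = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # three identical documents
diverse = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # three different directions

ils_redundant = intra_list_similarity(redundant)
ils_diverse = intra_list_similarity(diverse)
ils_diverse_at_2 = intra_list_similarity(diverse, k=2)
```

The list of identical vectors scores 1.0, while the mixed list scores lower, which matches the reading that a lower ILS means a more diverse ranking.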

This measure can be used alongside PyTerrier’s built-in evaluation measures in a pt.Experiment.

Compare the relevance and ILS of lexical and dense retrieval with a PyTerrier Experiment

import pyterrier as pt
from pyterrier.measures import nDCG, R
from pyterrier_dr import FlexIndex, TasB
from pyterrier_pisa import PisaIndex

dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
index = FlexIndex.from_hf('macavaney/msmarco-passage.tasb.flex')
bm25 = PisaIndex.from_hf('macavaney/msmarco-passage.pisa').bm25()
model = TasB.dot()

pt.Experiment(
    [
        bm25,
        model >> index.retriever(),
        model >> index.retriever() >> index.mmr(),
    ],
    dataset.get_topics(),
    dataset.get_qrels(),
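    [nDCG@10, R(rel=2)@1000, index.ILS@10, index.ILS@1000],
    # Name the systems so the result table matches the rows shown below.
    names=['BM25', 'TasB', 'TasB w/ MMR'],
)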
#        name   nDCG@10  R(rel=2)@1000    ILS@10  ILS@1000
# BM25            0.498          0.755     0.852     0.754
# TasB            0.716          0.841     0.889     0.775
# TasB w/ MMR     0.714          0.841     0.888     0.775

Citation

Ziegler et al. Improving recommendation lists through topic diversification. WWW 2005. https://doi.org/10.1145/1060745.1060754

@inproceedings{DBLP:conf/www/ZieglerMKL05,
  author       = {Cai{-}Nicolas Ziegler and
                  Sean M. McNee and
                  Joseph A. Konstan and
                  Georg Lausen},
  editor       = {Allan Ellis and
                  Tatsuya Hagino},
  title        = {Improving recommendation lists through topic diversification},
  booktitle    = {Proceedings of the 14th international conference on World Wide Web,
                  {WWW} 2005, Chiba, Japan, May 10-14, 2005},
  pages        = {22--32},
  publisher    = {{ACM}},
  year         = {2005},
  url          = {https://doi.org/10.1145/1060745.1060754},
  doi          = {10.1145/1060745.1060754},
  timestamp    = {Fri, 25 Dec 2020 01:14:58 +0100},
  biburl       = {https://dblp.org/rec/conf/www/ZieglerMKL05.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}