Terrier Query Rewriting & Expansion

Query rewriting refers to changing the formulation of the query in order to improve the effectiveness of the search ranking. PyTerrier supplies a number of query rewriting transformers designed to work with Retriever.

Firstly, we differentiate between two forms of query rewriting:

Q -> Q: this rewrites the query, for instance by adding or removing query terms. Examples might be a WordNet- or Word2Vec-based QE. The input dataframes contain only [“qid”, “query”] columns. The output dataframes contain [“qid”, “query”, “query_0”] columns, where “query” contains the reformulated query and “query_0” contains the previous formulation of the query. An example is the sequential dependence model, discussed below.
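The column contract of a Q -> Q rewriter can be illustrated without PyTerrier. The following is a minimal sketch in plain Python, where a list of dicts stands in for the dataframe; the function name rewrite_queries and the added term are hypothetical and are not part of PyTerrier:

```python
def rewrite_queries(rows, rewrite_fn):
    """Apply rewrite_fn to each query, keeping the previous formulation in query_0."""
    out = []
    for row in rows:
        new_row = dict(row)
        new_row["query_0"] = row["query"]            # stash the previous formulation
        new_row["query"] = rewrite_fn(row["query"])  # store the reformulated query
        out.append(new_row)
    return out


# Hypothetical input frame with only qid and query columns.
topics = [{"qid": "1", "query": "chemical reactions"}]

# A toy rewriter that appends one made-up expansion term.
rewritten = rewrite_queries(topics, lambda q: q + " chemistry")
print(rewritten)
```

Each output row carries qid, query and query_0, which matches the output columns described above.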

R -> Q: this class of transformers rewrites a query by making use of an associated set of documents, in a formulation typically referred to as pseudo-relevance feedback. Similarly, the output dataframes contain [“qid”, “query”, “query_0”] columns. This is typically used in a pipeline such as Retriever >> Rewriter >> Retriever, as shown below. Examples include RM3, Bo1 and KL QE, discussed below.

Stages of a Retriever >> RM3 >> Retriever pipeline and the columns each one outputs (both Retriever stages use wmodel DPH, num_results 1000, metadata ['docno'] and termpipelines Stopwords,PorterStemmer):

output of	actual columns
Input	qid, query
Retriever (DPH)	qid, query, docid, docno, rank, score
RM3	qid, query_0, query
Retriever (DPH)	qid, query_0, query, docid, docno, rank, score

If needed, the previous formulation of the query can be restored using pt.rewrite.reset(), discussed below.

Sequential Dependence

sdm() provides the sequential dependence model of Metzler and Croft, designed to boost the scores of documents where the query terms occur in close proximity. Application of this transformer rewrites each input query such that:

pairs of adjacent query terms are added as #1 and #uw8 complex query terms, with a low weight.
the full query is added as #uw12 complex query term, with a low weight.
all terms are weighted by a proximity model, either Dirichlet LM or pBiL2.

For example, the query pyterrier IR platform would become pyterrier IR platform #1(pyterrier IR) #1(IR platform) #uw8(pyterrier IR) #uw8(IR platform) #uw12(pyterrier IR platform). NB: this rewritten query is simplified. In practice, we also (a) set the weight of the proximity terms to be low using a #combine() operator and (b) set a proximity term weighting model.
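The simplified rewrite above can be reproduced with a few lines of plain Python. This is only a sketch of the string transformation as described here: the function name is hypothetical, and it omits the #combine() weights and the proximity weighting model that are applied in practice.

```python
def simplified_sdm_query(query: str) -> str:
    """Build the simplified SDM rewrite: original terms, then #1, #uw8 and #uw12 operators."""
    terms = query.split()
    # Pairs of adjacent query terms, e.g. "pyterrier IR" and "IR platform".
    pairs = [f"{a} {b}" for a, b in zip(terms, terms[1:])]
    parts = [" ".join(terms)]
    parts += [f"#1({p})" for p in pairs]      # exact adjacent phrase
    parts += [f"#uw8({p})" for p in pairs]    # pair within an unordered window of 8
    if len(terms) > 1:
        # The full query within an unordered window of 12.
        parts.append(f"#uw12({' '.join(terms)})")
    return " ".join(parts)


sdm_example = simplified_sdm_query("pyterrier IR platform")
print(sdm_example)
```

For the three-term query this prints the same string as the example given above. A single-term query is returned unchanged, since it has no adjacent pairs.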

This transformer is only compatible with Retriever, as Terrier supports the #1 and #uwN complex query term operators. The Terrier index must have blocks (positional information) recorded.

Example:

pipeline = index.sdm() >> index.dph()

Tip

The SDM query transformation does not technically depend on the index. TerrierIndex.sdm() is provided, however, so that it can first check that the index has the positional information necessary to perform SDM. This helps avoid errors that would otherwise only surface once the pipeline is executed.

Citation

Metzler and Croft. A Markov random field model for term dependencies. SIGIR 2005. [link]

@inproceedings{DBLP:conf/sigir/MetzlerC05,
  author       = {Donald Metzler and
                  W. Bruce Croft},
  editor       = {Ricardo A. Baeza{-}Yates and
                  Nivio Ziviani and
                  Gary Marchionini and
                  Alistair Moffat and
                  John Tait},
  title        = {A Markov random field model for term dependencies},
  booktitle    = {{SIGIR} 2005: Proceedings of the 28th Annual International {ACM} {SIGIR}
                  Conference on Research and Development in Information Retrieval, Salvador,
                  Brazil, August 15-19, 2005},
  pages        = {472--479},
  publisher    = {{ACM}},
  year         = {2005},
  url          = {https://doi.org/10.1145/1076034.1076115},
  doi          = {10.1145/1076034.1076115},
  timestamp    = {Tue, 06 Nov 2018 11:07:23 +0100},
  biburl       = {https://dblp.org/rec/conf/sigir/MetzlerC05.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Citation

Peng et al. Incorporating term dependency in the dfr framework. SIGIR 2007. [link]

@inproceedings{DBLP:conf/sigir/PengMHPO07,
  author       = {Jie Peng and
                  Craig Macdonald and
                  Ben He and
                  Vassilis Plachouras and
                  Iadh Ounis},
  editor       = {Wessel Kraaij and
                  Arjen P. de Vries and
                  Charles L. A. Clarke and
                  Norbert Fuhr and
                  Noriko Kando},
  title        = {Incorporating term dependency in the dfr framework},
  booktitle    = {{SIGIR} 2007: Proceedings of the 30th Annual International {ACM} {SIGIR}
                  Conference on Research and Development in Information Retrieval, Amsterdam,
                  The Netherlands, July 23-27, 2007},
  pages        = {843--844},
  publisher    = {{ACM}},
  year         = {2007},
  url          = {https://doi.org/10.1145/1277741.1277937},
  doi          = {10.1145/1277741.1277937},
  timestamp    = {Tue, 26 Nov 2024 07:42:48 +0100},
  biburl       = {https://dblp.org/rec/conf/sigir/PengMHPO07.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Bo1QueryExpansion

This class applies the Bo1 Divergence from Randomness query expansion model to rewrite the query based on the occurrences of terms in the feedback documents provided for each query. It takes in a dataframe with columns [“qid”, “query”, “docno”, “score”, “rank”] and returns a dataframe with [“qid”, “query”, “query_0”].
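To give a feel for how feedback-term occurrences drive the expansion, here is a minimal sketch of the Bo1 term weight as it is commonly stated, w(t) = tf_x * log2((1 + P_n) / P_n) + log2(1 + P_n) with P_n = F / N. The function names and all statistics below are made up for illustration; this is not Terrier's implementation, which also handles details such as normalisation and query term reweighting.

```python
import math


def bo1_weight(tf_x: float, coll_freq: float, num_docs: int) -> float:
    """Bo1 weight of one term.

    tf_x: frequency of the term in the feedback documents
    coll_freq: frequency of the term in the whole collection (F)
    num_docs: number of documents in the collection (N)
    """
    p_n = coll_freq / num_docs
    return tf_x * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)


def top_expansion_terms(feedback_tf, coll_freqs, num_docs, fb_terms=10):
    """Rank candidate terms by Bo1 weight and keep the fb_terms best."""
    scored = {t: bo1_weight(tf, coll_freqs[t], num_docs) for t, tf in feedback_tf.items()}
    return sorted(scored, key=scored.get, reverse=True)[:fb_terms]


# Made-up statistics: "the" is frequent in the feedback documents but also
# everywhere in the collection, so it receives a low weight.
feedback_tf = {"acid": 4, "the": 9, "catalyst": 3}
coll_freqs = {"acid": 50, "the": 90000, "catalyst": 20}
expansion = top_expansion_terms(feedback_tf, coll_freqs, num_docs=10000, fb_terms=2)
print(expansion)
```

Terms that are frequent in the feedback documents but rare in the collection score highest, which is why a common word is not selected here despite its high feedback frequency.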

class pyterrier.rewrite.Bo1QueryExpansion(*args, **kwargs)

Applies the Bo1 query expansion model from the Divergence from Randomness Framework, as provided by Terrier. It must be followed by a terrier.Retriever() transformer. The original query is saved in the “query_0” column, which can be restored using pt.rewrite.reset().

Instance Attributes:

fb_terms(int): number of feedback terms. Defaults to 10
fb_docs(int): number of feedback documents. Defaults to 3

Citation

Amati and Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 2002. [link]

@article{DBLP:journals/tois/AmatiR02,
  author       = {Gianni Amati and
                  C. J. van Rijsbergen},
  title        = {Probabilistic models of information retrieval based on measuring the
                  divergence from randomness},
  journal      = {{ACM} Trans. Inf. Syst.},
  volume       = {20},
  number       = {4},
  pages        = {357--389},
  year         = {2002},
  url          = {http://doi.acm.org/10.1145/582415.582416},
  doi          = {10.1145/582415.582416},
  timestamp    = {Tue, 01 Jun 2021 09:58:08 +0200},
  biburl       = {https://dblp.org/rec/journals/tois/AmatiR02.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

index_like – the Terrier index to use.
fb_terms (int) – number of terms to add to the query. Terrier’s default setting is 10 expansion terms.
fb_docs (int) – number of feedback documents to consider. Terrier’s default setting is 3 feedback documents.

Example:

bo1 = pt.rewrite.Bo1QueryExpansion(index)
dph = pt.terrier.Retriever(index, wmodel="DPH")
pipelineQE = dph >> bo1 >> dph

View the expansion terms:

pipelineDisplay = dph >> bo1
pipelineDisplay.search("chemical reactions")
# will return a dataframe with ['qid', 'query', 'query_0'] columns
# the reformulated query can be found in the 'query' column,
# while the original query is in the 'query_0' column

Alternative Formulations

Note that it is also possible to configure Retriever to perform QE directly using controls, which will result in identical retrieval effectiveness:

pipelineQE = pt.terrier.Retriever(index, wmodel="DPH", controls={"qemodel" : "Bo1", "qe" : "on"})

However, using pt.rewrite.Bo1QueryExpansion is preferable as:

the semantics of retrieve >> rewrite >> retrieve are clearly visible.
the complex control configuration of Terrier need not be learned.
the rewritten query is visible outside, and not hidden inside Terrier.

Citation

Amati. Probability models for information retrieval based on divergence from randomness. 2003. [link]

@phdthesis{DBLP:phd/ethos/Amati03,
  author       = {Giambattista Amati},
  title        = {Probability models for information retrieval based on divergence from
                  randomness},
  school       = {University of Glasgow, {UK}},
  year         = {2003},
  url          = {http://theses.gla.ac.uk/1570/},
  timestamp    = {Tue, 05 Apr 2022 10:59:13 +0200},
  biburl       = {https://dblp.org/rec/phd/ethos/Amati03.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

KLQueryExpansion

Similar to Bo1, this class deploys a Divergence from Randomness query expansion model based on Kullback-Leibler divergence.

class pyterrier.rewrite.KLQueryExpansion(*args, **kwargs)

Applies the KL query expansion model from the Divergence from Randomness Framework, as provided by Terrier. This transformer must be followed by a terrier.Retriever() transformer. The original query is saved in the “query_0” column, which can be restored using pt.rewrite.reset().

Instance Attributes:

fb_terms(int): number of feedback terms. Defaults to 10
fb_docs(int): number of feedback documents. Defaults to 3

Parameters:

index_like – the Terrier index to use
fb_terms (int) – number of terms to add to the query. Terrier’s default setting is 10 expansion terms.
fb_docs (int) – number of feedback documents to consider. Terrier’s default setting is 3 feedback documents.

Citation

Amati. Probability models for information retrieval based on divergence from randomness. 2003. [link]

@phdthesis{DBLP:phd/ethos/Amati03,
  author       = {Giambattista Amati},
  title        = {Probability models for information retrieval based on divergence from
                  randomness},
  school       = {University of Glasgow, {UK}},
  year         = {2003},
  url          = {http://theses.gla.ac.uk/1570/},
  timestamp    = {Tue, 05 Apr 2022 10:59:13 +0200},
  biburl       = {https://dblp.org/rec/phd/ethos/Amati03.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

RM3

class pyterrier.rewrite.RM3(*args, **kwargs)

Performs query expansion using RM3 relevance models.

This transformer must be followed by a terrier.Retriever() transformer. The original query is saved in the “query_0” column, which can be restored using pt.rewrite.reset().

Instance Attributes:

fb_terms(int): number of feedback terms. Defaults to 10
fb_docs(int): number of feedback documents. Defaults to 3
fb_lambda(float): lambda in RM3, i.e. importance of relevance model viz feedback model. Defaults to 0.6.
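The role of fb_terms and fb_lambda can be sketched as a linear interpolation between the original query model and a feedback term distribution. The code below is a toy illustration in plain Python, not Terrier's RM3 implementation. The function name and probabilities are hypothetical, and the direction of the interpolation (fb_lambda weighting the original query) is an assumption that should be checked against Terrier itself.

```python
def rm3_interpolate(query_terms, feedback_model, fb_terms=10, fb_lambda=0.6):
    """Interpolate a query model with the top feedback terms."""
    # Maximum-likelihood model of the original query.
    q_model = {t: query_terms.count(t) / len(query_terms) for t in set(query_terms)}
    # Keep the fb_terms most probable feedback terms and renormalise them.
    top = sorted(feedback_model.items(), key=lambda kv: kv[1], reverse=True)[:fb_terms]
    total = sum(p for _, p in top)
    fb_model = {t: p / total for t, p in top}
    # ASSUMPTION: fb_lambda is the weight of the original query model,
    # and (1 - fb_lambda) the weight of the feedback model.
    vocab = set(q_model) | set(fb_model)
    return {
        t: fb_lambda * q_model.get(t, 0.0) + (1 - fb_lambda) * fb_model.get(t, 0.0)
        for t in vocab
    }


# Made-up feedback distribution estimated from the top-ranked documents.
weights = rm3_interpolate(
    ["chemical", "reactions"],
    {"chemical": 0.5, "acid": 0.3, "the": 0.2},
    fb_terms=2,
    fb_lambda=0.6,
)
print(sorted(weights.items(), key=lambda kv: kv[1], reverse=True))
```

Under this assumption, the interpolated weights still sum to one, and original query terms that also occur in the feedback documents end up with the highest weight.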

Example:

bm25 = pt.terrier.Retriever(index, wmodel="BM25")
rm3_pipe = bm25 >> pt.rewrite.RM3(index) >> bm25
pt.Experiment([bm25, rm3_pipe],
            dataset.get_topics(),
            dataset.get_qrels(),
            ["map"]
            )

Citation

Jaleel et al. UMass at TREC 2004: Novelty and HARD. TREC 2004. [link]

@inproceedings{DBLP:conf/trec/JaleelACDLLSW04,
  author       = {Nasreen Abdul Jaleel and
                  James Allan and
                  W. Bruce Croft and
                  Fernando Diaz and
                  Leah S. Larkey and
                  Xiaoyan Li and
                  Mark D. Smucker and
                  Courtney Wade},
  editor       = {Ellen M. Voorhees and
                  Lori P. Buckland},
  title        = {UMass at {TREC} 2004: Novelty and {HARD}},
  booktitle    = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004,
                  Gaithersburg, Maryland, USA, November 16-19, 2004},
  series       = {{NIST} Special Publication},
  volume       = {500-261},
  publisher    = {National Institute of Standards and Technology {(NIST)}},
  year         = {2004},
  url          = {http://trec.nist.gov/pubs/trec13/papers/umass.novelty.hard.pdf},
  timestamp    = {Wed, 07 Jul 2021 16:44:22 +0200},
  biburl       = {https://dblp.org/rec/conf/trec/JaleelACDLLSW04.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

index_like – the Terrier index to use
fb_terms (int) – number of terms to add to the query. Terrier’s default setting is 10 expansion terms.
fb_docs (int) – number of feedback documents to consider. Terrier’s default setting is 3 feedback documents.
fb_lambda (float) – lambda in RM3, i.e. importance of relevance model viz feedback model. Defaults to 0.6.

Resetting the Query Formulation

Any query rewriting operation, including the apply transformer pt.apply.query(), returns a dataframe that holds the input formulation of the query in the query_0 column and the new reformulation in the query column. The previous query formulation can be restored by including a reset() transformer in the pipeline.
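Conceptually, resetting moves the stashed formulation back into the query column. The sketch below shows this on a list of dicts standing in for the dataframe. The function name reset_query is hypothetical, and it only handles a single stashed formulation, which is a simplification of whatever reset() does internally.

```python
def reset_query(rows):
    """Restore the previous query formulation from query_0 and drop the stash."""
    out = []
    for row in rows:
        restored = {k: v for k, v in row.items() if k != "query_0"}
        restored["query"] = row["query_0"]
        out.append(restored)
    return out


# Hypothetical ranked result after a rewrite: query holds the expanded form.
results = [
    {"qid": "1", "query": "chemical reactions acid", "query_0": "chemical reactions",
     "docno": "d7", "score": 2.5},
]
reset_rows = reset_query(results)
print(reset_rows)
```

The document columns (docno, score) are untouched, so a later re-ranker sees the ranking produced by the expanded query but scores it against the original query text.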

This is useful if, for instance, you want to use a PRF pipeline to retrieve more relevant documents, but then want to revert to the original query formulation for a final ranking step such as MonoT5. For example:

pipeline = index.dph() >> index.rm3() >> index.dph() >> pt.rewrite.reset() >> pt.get_dataset('irds:vaswani').text_loader() >> monoT5

Stages of this pipeline and the columns each one outputs (both Retriever stages use wmodel DPH, num_results 1000 and metadata ['docno']):

output of	actual columns
Input	qid, query
Retriever (DPH)	qid, query, docid, docno, rank, score
RM3	qid, query_0, query
Retriever (DPH)	qid, query_0, query, docid, docno, rank, score
ResetQuery	qid, query, docid, docno, rank, score
TextLoader (IRDSDataset('vaswani'), fields ['text'])	qid, query, docid, docno, rank, score, text
MonoT5	qid, query, docid, docno, text, score, rank

Tokenising the Query

Sometimes your query can include symbols that aren’t compatible with how your retriever parses the query. In this case, a custom tokeniser can be applied as part of the retrieval pipeline using pt.terrier.rewrite.tokenise.
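To illustrate the kind of clean-up involved, the following sketch removes symbols from a query string using only the standard library. It is a hypothetical stand-in (the function name and the regular expression are assumptions) and does not reproduce the tokenisers that pt.terrier.rewrite.tokenise applies.

```python
import re


def strip_symbols(query: str) -> str:
    """Lowercase the query and keep only runs of ASCII letters and digits."""
    return " ".join(re.findall(r"[a-z0-9]+", query.lower()))


cleaned = strip_symbols("Chemical reactions: acids/bases?")
print(cleaned)
```

Characters such as ":" , "/" and "?" are dropped, leaving whitespace-separated terms that a query parser can accept.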

Advanced: Combining Query Formulations

In some cases, you may want to combine multiple query formulations into a single query. This can be achieved using linear(), which allows you to linearly combine multiple query columns into a single query column.

Advanced: Stashing the Documents

Very rarely, you might want to apply a query rewriting function as a re-ranker, but your rewriting function uses a different document ranking. In this case, you can use stash_results() to stash the retrieved documents for each query, so they can be recovered and re-ranked later using your rewritten query formulation. reset_results() can then be used later to restore the stashed documents.

Example: Query Expansion as a re-ranker

Citation

Diaz. Condensed List Relevance Models. ICTIR 2015. [link]

@inproceedings{DBLP:conf/ictir/Diaz15,
  author       = {Fernando Diaz},
  editor       = {James Allan and
                  W. Bruce Croft and
                  Arjen P. de Vries and
                  Chengxiang Zhai},
  title        = {Condensed List Relevance Models},
  booktitle    = {Proceedings of the 2015 International Conference on The Theory of
                  Information Retrieval, {ICTIR} 2015, Northampton, Massachusetts, USA,
                  September 27-30, 2015},
  pages        = {313--316},
  publisher    = {{ACM}},
  year         = {2015},
  url          = {https://doi.org/10.1145/2808194.2809491},
  doi          = {10.1145/2808194.2809491},
  timestamp    = {Mon, 03 Mar 2025 21:11:38 +0100},
  biburl       = {https://dblp.org/rec/conf/ictir/Diaz15.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Some papers advocate for the use of query expansion (PRF) as a re-ranker. This can be attained in PyTerrier through use of stash_results() and reset_results():

# index: the corpus you are ranking
pipeline = (
    index.dph()
    >> pt.terrier.rewrite.stash_results(clear=False)
    >> index.rm3()
    >> pt.terrier.rewrite.reset_results()
    >> index.dph()
)

Summary of dataframe types:

output of	dataframe contents	actual columns
dph	R	qid, query, docno, score
stash_results	R + “stashed_results_0”	qid, query, docno, score, stashed_results_0
RM3	Q + “stashed_results_0”	qid, query, query_0, stashed_results_0
reset_results	R	qid, query, docno, score, query_0
dph	R	qid, query, docno, score, query_0

Indeed, as we need RM3 to have the initial ranking of documents as input, we use clear=False as the kwarg to stash_results().
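The effect of the clear argument and of restoring the stash can be traced with a small simulation. This is a sketch in plain Python with lists of dicts in place of dataframes. The two functions only mimic the column behaviour summarised in the table above and are not PyTerrier's implementation, and the example rows are made up.

```python
def stash_results(rows, clear=True):
    """Save each query's ranked documents under 'stashed_results_0'."""
    stash = {}
    for row in rows:
        doc = {"docno": row["docno"], "score": row["score"]}
        stash.setdefault(row["qid"], []).append(doc)
    if not clear:
        # Keep the ranking (R) and attach the stash to every row.
        return [dict(row, stashed_results_0=stash[row["qid"]]) for row in rows]
    # Clear the ranking columns: one query-only (Q) row per qid.
    queries = {}
    for row in rows:
        queries.setdefault(row["qid"], {"qid": row["qid"], "query": row["query"],
                                        "stashed_results_0": stash[row["qid"]]})
    return list(queries.values())


def reset_results(rows):
    """Replace the current rows with the stashed documents, keeping query columns."""
    restored, seen = [], set()
    for row in rows:
        if row["qid"] in seen:
            continue
        seen.add(row["qid"])
        query_cols = {k: v for k, v in row.items()
                      if k not in ("docno", "score", "stashed_results_0")}
        restored += [dict(query_cols, **doc) for doc in row["stashed_results_0"]]
    return restored


ranking = [
    {"qid": "1", "query": "chemical reactions", "docno": "d7", "score": 2.5},
    {"qid": "1", "query": "chemical reactions", "docno": "d3", "score": 1.0},
]
stashed = stash_results(ranking, clear=True)
# Stand-in for a rewriter: stash the old query and set a made-up new one.
rewritten = [dict(r, query_0=r["query"], query="chemical reactions acid") for r in stashed]
restored = reset_results(rewritten)
print(restored)
```

After reset_results the rows again carry docno and score from the first ranking, together with the rewritten query and query_0, ready to be re-scored by the final retriever.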

Example: Collection Enrichment as a re-ranker:

# index: the corpus you are ranking
# wiki_index: index of Wikipedia, used for enrichment

pipeline = (
    index.dph()
    >> pt.terrier.rewrite.stash_results()
    >> wiki_index.dph()
    >> wiki_index.rm3()
    >> pt.terrier.rewrite.reset_results()
    >> index.dph()
)

In general, collection enrichment describes conducting a PRF query expansion process on an external corpus (often Wikipedia), before applying the reformulated query to the main corpus. Collection enrichment can be used to improve a first-pass retrieval (wiki_index.dph() >> wiki_index.rm3() >> main_index.dph()). By contrast, the particular example shown above applies collection enrichment as a re-ranker.

Summary of dataframe types:

output of	dataframe contents	actual columns
dph	R	qid, query, docno, score
stash_results	Q + “stashed_results_0”	qid, query, stashed_results_0
Retriever	R + “stashed_results_0”	qid, query, docno, score, stashed_results_0
RM3	Q + “stashed_results_0”	qid, query, query_0, stashed_results_0
reset_results	R	qid, query, docno, score, query_0
dph	R	qid, query, docno, score, query_0

In this example, we have a Retriever instance executed on the wiki_index before RM3, so we clear the document ranking columns when using stash_results().