Query Rewriting & Expansion¶
Query rewriting refers to changing the formulation of the query in order to improve the effectiveness of the search ranking. PyTerrier supplies a number of query rewriting transformers designed to work with Retriever.
Firstly, we differentiate between two forms of query rewriting:
Q -> Q: this rewrites the query, for instance by adding or removing query terms. Examples might be a WordNet- or Word2Vec-based QE. The input dataframes contain only [“qid”, “query”] columns. The output dataframes contain [“qid”, “query”, “query_0”] columns, where “query” contains the reformulated query, and “query_0” contains the previous formulation of the query.
R -> Q: this class of transformers rewrites a query by making use of an associated set of documents. This is typically exemplified by pseudo-relevance feedback. Similarly, the output dataframes contain [“qid”, “query”, “query_0”] columns.
The previous formulation of the query can be restored using pt.rewrite.reset(), discussed below.
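For illustration, a minimal Q -> Q rewrite can be expressed with the apply transformer, pt.apply.query() (a sketch; the lambda and the query text are arbitrary):
import pyterrier as pt

# a trivial Q -> Q rewrite: append a term to each query;
# the input formulation is automatically preserved in "query_0"
add_term = pt.apply.query(lambda row: row.query + " retrieval")
add_term.search("neural ranking")
# returns one row with query="neural ranking retrieval", query_0="neural ranking"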
SequentialDependence¶
This class implements Metzler and Croft’s sequential dependence model, designed to boost the scores of documents where the query terms occur in close proximity. Application of this transformer rewrites each input query such that:
pairs of adjacent query terms are added as #1 and #uw8 complex query terms, with a low weight.
the full query is added as #uw12 complex query term, with a low weight.
all terms are weighted by a proximity model, either Dirichlet LM or pBiL2.
For example, the query pyterrier IR platform would become pyterrier IR platform #1(pyterrier IR) #1(IR platform) #uw8(pyterrier IR) #uw8(IR platform) #uw12(pyterrier IR platform). NB: We have actually simplified the rewritten query here - in practice, we also (a) set the weight of the proximity terms to be low using a #combine() operator and (b) set a proximity term weighting model.
This transformer is only compatible with Retriever, as Terrier supports the #1 and #uwN complex query term operators. The Terrier index must have blocks (positional information) recorded in the index.
- class pyterrier.rewrite.SequentialDependence(*args, **kwargs)¶
Implements the sequential dependence model, which Terrier supports using its Indri/Galago compatible matchop query language. The rewritten query is derived using the Terrier class DependenceModelPreProcess.
This transformer changes the query. It must be followed by a Terrier Retrieve() transformer. The original query is saved in the “query_0” column, which can be restored using pt.rewrite.reset().
Example:
sdm = pt.rewrite.SequentialDependence()
dph = pt.terrier.Retriever(index, wmodel="DPH")
pipeline = sdm >> dph
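The composed pipeline can then be applied like any other transformer (a sketch, assuming index is a Terrier index with blocks recorded):
# rewrites each query with proximity operators, then retrieves with DPH
results = pipeline.search("pyterrier IR platform")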
Citation
Metzler and Croft. A Markov random field model for term dependencies. SIGIR 2005. [link]
@inproceedings{DBLP:conf/sigir/MetzlerC05,
  author    = {Donald Metzler and W. Bruce Croft},
  editor    = {Ricardo A. Baeza{-}Yates and Nivio Ziviani and Gary Marchionini and Alistair Moffat and John Tait},
  title     = {A Markov random field model for term dependencies},
  booktitle = {{SIGIR} 2005: Proceedings of the 28th Annual International {ACM} {SIGIR} Conference on Research and Development in Information Retrieval, Salvador, Brazil, August 15-19, 2005},
  pages     = {472--479},
  publisher = {{ACM}},
  year      = {2005},
  url       = {https://doi.org/10.1145/1076034.1076115},
  doi       = {10.1145/1076034.1076115}
}
Citation
Peng et al. Incorporating term dependency in the dfr framework. SIGIR 2007. [link]
@inproceedings{DBLP:conf/sigir/PengMHPO07,
  author    = {Jie Peng and Craig Macdonald and Ben He and Vassilis Plachouras and Iadh Ounis},
  editor    = {Wessel Kraaij and Arjen P. de Vries and Charles L. A. Clarke and Norbert Fuhr and Noriko Kando},
  title     = {Incorporating term dependency in the dfr framework},
  booktitle = {{SIGIR} 2007: Proceedings of the 30th Annual International {ACM} {SIGIR} Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, July 23-27, 2007},
  pages     = {843--844},
  publisher = {{ACM}},
  year      = {2007},
  url       = {https://doi.org/10.1145/1277741.1277937},
  doi       = {10.1145/1277741.1277937}
}
Bo1QueryExpansion¶
This class applies the Bo1 Divergence from Randomness query expansion model to rewrite the query based on the occurrences of terms in the feedback documents provided for each query. In this way, it takes in a dataframe with columns [“qid”, “query”, “docno”, “score”, “rank”] and returns a dataframe with [“qid”, “query”, “query_0”] columns.
- class pyterrier.rewrite.Bo1QueryExpansion(*args, **kwargs)¶
Applies the Bo1 query expansion model from the Divergence from Randomness Framework, as provided by Terrier. It must be followed by a Terrier Retrieve() transformer. The original query is saved in the “query_0” column, which can be restored using pt.rewrite.reset().
- Instance Attributes:
fb_terms (int): number of feedback terms. Defaults to 10.
fb_docs (int): number of feedback documents. Defaults to 3.
- Parameters:
index_like – the Terrier index to use.
fb_terms (int) – number of terms to add to the query. Terrier’s default setting is 10 expansion terms.
fb_docs (int) – number of feedback documents to consider. Terrier’s default setting is 3 feedback documents.
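Both settings can be overridden at construction time - for instance, a sketch using larger values than Terrier's defaults:
# expand with 20 terms drawn from the top 10 feedback documents
bo1 = pt.rewrite.Bo1QueryExpansion(index, fb_terms=20, fb_docs=10)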
Example:
bo1 = pt.rewrite.Bo1QueryExpansion(index)
dph = pt.terrier.Retriever(index, wmodel="DPH")
pipelineQE = dph >> bo1 >> dph
View the expansion terms:
pipelineDisplay = dph >> bo1
pipelineDisplay.search("chemical reactions")
# will return a dataframe with ['qid', 'query', 'query_0'] columns
# the reformulated query can be found in the 'query' column,
# while the original query is in the 'query_0' column
Alternative Formulations
Note that it is also possible to configure Retriever to perform QE directly using controls, which will result in identical retrieval effectiveness:
pipelineQE = pt.terrier.Retriever(index, wmodel="DPH", controls={"qemodel" : "Bo1", "qe" : "on"})
However, using pt.rewrite.Bo1QueryExpansion is preferable as:
the semantics of retrieve >> rewrite >> retrieve are clearly visible.
the complex control configuration of Terrier need not be learned.
the rewritten query is visible outside, and not hidden inside Terrier.
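For instance, the equivalence of the two formulations can be checked empirically with pt.Experiment() (a sketch, assuming a dataset object providing topics and qrels):
pipelineQE = dph >> pt.rewrite.Bo1QueryExpansion(index) >> dph
pipelineControls = pt.terrier.Retriever(index, wmodel="DPH", controls={"qemodel" : "Bo1", "qe" : "on"})
# both systems should report the same effectiveness
pt.Experiment(
    [pipelineQE, pipelineControls],
    dataset.get_topics(),
    dataset.get_qrels(),
    ["map"],
    names=["QE as transformer", "QE via controls"])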
Citation
Amati. Probability models for information retrieval based on divergence from randomness. 2003. [link]
@phdthesis{DBLP:phd/ethos/Amati03,
  author = {Giambattista Amati},
  title  = {Probability models for information retrieval based on divergence from randomness},
  school = {University of Glasgow, {UK}},
  year   = {2003},
  url    = {http://theses.gla.ac.uk/1570/}
}
KLQueryExpansion¶
Similar to Bo1, this class deploys a Divergence from Randomness query expansion model based on Kullback-Leibler divergence.
- class pyterrier.rewrite.KLQueryExpansion(*args, **kwargs)¶
Applies the KL query expansion model from the Divergence from Randomness Framework, as provided by Terrier. This transformer must be followed by a Terrier Retrieve() transformer. The original query is saved in the “query_0” column, which can be restored using pt.rewrite.reset().
- Instance Attributes:
fb_terms (int): number of feedback terms. Defaults to 10.
fb_docs (int): number of feedback documents. Defaults to 3.
- Parameters:
index_like – the Terrier index to use
fb_terms (int) – number of terms to add to the query. Terrier’s default setting is 10 expansion terms.
fb_docs (int) – number of feedback documents to consider. Terrier’s default setting is 3 feedback documents.
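Usage mirrors that of Bo1QueryExpansion - a sketch, assuming index is an existing Terrier index:
kl = pt.rewrite.KLQueryExpansion(index)
dph = pt.terrier.Retriever(index, wmodel="DPH")
pipelineKL = dph >> kl >> dph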
Citation
Amati. Probability models for information retrieval based on divergence from randomness. 2003. [link]
@phdthesis{DBLP:phd/ethos/Amati03,
  author = {Giambattista Amati},
  title  = {Probability models for information retrieval based on divergence from randomness},
  school = {University of Glasgow, {UK}},
  year   = {2003},
  url    = {http://theses.gla.ac.uk/1570/}
}
RM3¶
- class pyterrier.rewrite.RM3(*args, **kwargs)¶
Performs query expansion using RM3 relevance models.
This transformer must be followed by a Terrier Retrieve() transformer. The original query is saved in the “query_0” column, which can be restored using pt.rewrite.reset().
- Instance Attributes:
fb_terms (int): number of feedback terms. Defaults to 10.
fb_docs (int): number of feedback documents. Defaults to 3.
fb_lambda (float): lambda in RM3, i.e. the importance of the relevance model versus the feedback model. Defaults to 0.6.
Example:
bm25 = pt.terrier.Retriever(index, wmodel="BM25")
rm3_pipe = bm25 >> pt.rewrite.RM3(index) >> bm25
pt.Experiment([bm25, rm3_pipe], dataset.get_topics(), dataset.get_qrels(), ["map"])
- Parameters:
index_like – the Terrier index to use
fb_terms (int) – number of terms to add to the query. Terrier’s default setting is 10 expansion terms.
fb_docs (int) – number of feedback documents to consider. Terrier’s default setting is 3 feedback documents.
fb_lambda (float) – lambda in RM3, i.e. the importance of the relevance model versus the feedback model. Defaults to 0.6.
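These can likewise be set at construction time - for instance, a sketch adjusting the lambda interpolation:
# fb_lambda adjusts the balance between the relevance model and the feedback model
rm3 = pt.rewrite.RM3(index, fb_terms=10, fb_docs=3, fb_lambda=0.5)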
Citation
Jaleel et al. UMass at TREC 2004: Novelty and HARD. TREC 2004. [link]
@inproceedings{DBLP:conf/trec/JaleelACDLLSW04,
  author    = {Nasreen Abdul Jaleel and James Allan and W. Bruce Croft and Fernando Diaz and Leah S. Larkey and Xiaoyan Li and Mark D. Smucker and Courtney Wade},
  editor    = {Ellen M. Voorhees and Lori P. Buckland},
  title     = {UMass at {TREC} 2004: Novelty and {HARD}},
  booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
  series    = {{NIST} Special Publication},
  volume    = {500-261},
  publisher = {National Institute of Standards and Technology {(NIST)}},
  year      = {2004},
  url       = {http://trec.nist.gov/pubs/trec13/papers/umass.novelty.hard.pdf}
}
Combining Query Formulations¶
- pyterrier.rewrite.linear(weightCurrent, weightPrevious, format='terrierql', **kwargs)¶
Applies a linear combination of the current and previous query formulations. The implementation is tied to the underlying query language used by the retrieval/re-ranker transformers. Two of Terrier’s query language formats are supported via the format kwarg, namely “terrierql” and “matchopql”. Their exact respective formats are detailed in the Terrier documentation.
- Return type:
Transformer
- Parameters:
weightCurrent (float) – weight to apply to the current query formulation.
weightPrevious (float) – weight to apply to the previous query formulation.
format (str) – which query language to use to rewrite the queries, one of “terrierql” or “matchopql”.
Example:
pipeTQL = pt.apply.query(lambda row: "az") >> pt.rewrite.linear(0.75, 0.25, format="terrierql")
pipeMQL = pt.apply.query(lambda row: "az") >> pt.rewrite.linear(0.75, 0.25, format="matchopql")
pipeTQL.search("a")
pipeMQL.search("a")
Example outputs of pipeTQL and pipeMQL corresponding to the query “a” above:
Terrier QL output: “(az)^0.750000 (a)^0.250000”
MatchOp QL output: “#combine:0:0.750000:1:0.250000(#combine(az) #combine(a))”
Resetting the Query Formulation¶
The application of any query rewriting operation, including the apply transformer pt.apply.query(), will return a dataframe that includes the input formulation of the query in the query_0 column, and the new reformulation in the query column. The previous formulation of the query can be restored by including a reset transformer in the pipeline.
- pyterrier.rewrite.reset()¶
Undoes a previous query rewriting operation. This results in the query formulation stored in the “query_0” attribute being moved to the “query” attribute, and, if present, “query_1” being moved to “query_0”, and so on. This transformation is useful if you have rewritten the query for the purposes of one retrieval stage, but wish a subsequent transformer to be applied to the original formulation.
Internally, this function applies pt.model.pop_queries().
Example:
firststage = pt.rewrite.SDM() >> pt.terrier.Retriever(index, wmodel="DPH")
secondstage = pyterrier_bert.cedr.CEDRPipeline()
fullranker = firststage >> pt.rewrite.reset() >> secondstage
- Return type:
Transformer
Tokenising the Query¶
Sometimes your query can include symbols that aren’t compatible with how your retriever parses the query. In this case, a custom tokeniser can be applied as part of the retrieval pipeline.
- pyterrier.rewrite.tokenise(tokeniser='english', matchop=False)¶
Applies tokenisation to the query. Queries obtained from pt.get_dataset().get_topics() are normally already tokenised.
- Return type:
Transformer
- Parameters:
tokeniser (Union[str,TerrierTokeniser,FunctionType]) – Defines what tokeniser should be used - either a Java tokeniser name in Terrier, a TerrierTokeniser instance, or a function that takes a str as input and returns a list of str.
matchop (bool) – Whether query terms should be wrapped in matchops, to ensure they can be parsed by a Terrier Retriever transformer.
Example - use default tokeniser:
pipe = pt.rewrite.tokenise() >> pt.terrier.Retriever()
pipe.search("Question with 'capitals' and other stuff?")
Example - roll your own tokeniser:
poortokenisation = pt.rewrite.tokenise(lambda query: query.split(" ")) >> pt.terrier.Retriever()
Example - for non-English languages, tokenise on standard UTF non-alphanumeric characters:
utftokenised = pt.rewrite.tokenise(pt.TerrierTokeniser.utf) >> pt.terrier.Retriever()
utftokenised = pt.rewrite.tokenise("utf") >> pt.terrier.Retriever()
Example - tokenising queries using a HuggingFace tokenizer:
from transformers import AutoTokenizer

# this assumes the index was created in a pretokenised manner
br = pt.terrier.Retriever(indexref)
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
query_toks = pt.rewrite.tokenise(tok.tokenize, matchop=True)
retr_pipe = query_toks >> br
Stashing the Documents¶
Sometimes you want to apply a query rewriting function as a re-ranker, but your rewriting function uses a different document ranking. In this case, you can use pt.rewrite.stash_results() to stash the retrieved documents for each query, so they can be recovered and re-ranked later using your rewritten query formulation.
- pyterrier.rewrite.stash_results(clear=True)¶
Stashes (saves) the current retrieved documents for each query into the column “stashed_results_0”. This means that they can be restored later using pt.rewrite.reset_results(), thereby converting a dataframe of retrieved documents into one of queries.
- Parameters:
clear (bool) – whether to drop the document and retrieved document related columns. Defaults to True.
- Return type:
Transformer
- pyterrier.rewrite.reset_results()¶
Applies a transformer that undoes a pt.rewrite.stash_results() transformer, thereby restoring the ranked documents.
- Return type:
Transformer
Example: Query Expansion as a re-ranker
Some papers advocate for the use of query expansion (PRF) as a re-ranker. This can be attained in PyTerrier through use of stash_results() and reset_results():
# index: the corpus you are ranking
dph = pt.terrier.Retriever(index)
pipe = (dph
    >> pt.rewrite.stash_results(clear=False)
    >> pt.rewrite.RM3(index)
    >> pt.rewrite.reset_results()
    >> dph)
Summary of dataframe types:
| output of | dataframe contents | actual columns |
|---|---|---|
| dph | R | qid, query, docno, score |
| stash_results | R + “stashed_results_0” | qid, query, docno, score, stashed_results_0 |
| RM3 | Q + “stashed_results_0” | qid, query, query_0, stashed_results_0 |
| reset_results | R | qid, query, docno, score, query_0 |
| dph | R | qid, query, docno, score, query_0 |
Indeed, as we need RM3 to have the initial ranking of documents as input, we use clear=False as the kwarg to stash_results().
Example: Collection Enrichment as a re-ranker
# index: the corpus you are ranking
# wiki_index: index of Wikipedia, used for enrichment
dph = pt.terrier.Retriever(index)
pipe = (dph
    >> pt.rewrite.stash_results()
    >> pt.terrier.Retriever(wiki_index)
    >> pt.rewrite.RM3(wiki_index)
    >> pt.rewrite.reset_results()
    >> dph)
In general, collection enrichment describes conducting a PRF query expansion process on an external corpus (often Wikipedia), before applying the reformulated query to the main corpus. Collection enrichment can be used for improving a first pass retrieval (pt.terrier.Retriever(wiki_index) >> pt.rewrite.RM3(wiki_index) >> pt.terrier.Retriever(main_index)). Instead, the particular example shown above applies collection enrichment as a re-ranker.
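The first-pass variant can also be written as a pipeline in the same fashion (a sketch, with main_index denoting the index of the main corpus):
# expand the query on Wikipedia, then retrieve from the main corpus
enrich_firstpass = (pt.terrier.Retriever(wiki_index)
    >> pt.rewrite.RM3(wiki_index)
    >> pt.terrier.Retriever(main_index))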
Summary of dataframe types:
| output of | dataframe contents | actual columns |
|---|---|---|
| dph | R | qid, query, docno, score |
| stash_results | Q + “stashed_results_0” | qid, query, stashed_results_0 |
| Retriever | R + “stashed_results_0” | qid, query, docno, score, stashed_results_0 |
| RM3 | Q + “stashed_results_0” | qid, query, query_0, stashed_results_0 |
| reset_results | R | qid, query, docno, score, query_0 |
| dph | R | qid, query, docno, score, query_0 |
In this example, we have a Retriever instance executed on the wiki_index before RM3, so we clear the document ranking columns when using stash_results().