# Examples of Retrieval Pipelines ## Query Rewriting ### Sequential Dependence Model ```python pipe = pt.rewrite.SDM() >> pt.terrier.Retriever(indexref, wmodel="BM25") ``` Note that the SDM() rewriter has a number of constructor parameters: - `remove_stopwords` - defines if stopwords should be removed from the query - `prox_model` - change the proximity model. For true language modelling, you should set `prox_model` to "org.terrier.matching.models.Dirichlet_LM" ### Divergence from Randomness Query Expansion A simple QE transformer can be achieved using ```python qe = pt.terrier.Retriever(indexref, wmodel="BM25", controls={"qe" : "on"}) ``` As this is pseudo-relevance feedback in nature, it identifies a set of documents, extracts informative term in the top-ranked documents, and re-exectutes the query. However, more control can be achieved by using the QueryExpansion transformer separately, as thus: ```python qe = (pt.terrier.Retriever(indexref, wmodel="BM25") >> pt.rewrite.QueryExpansion(indexref) >> pt.terrier.Retriever(indexref, wmodel="BM25") ) ``` The QueryExpansion() object has the following constructor parameters: - `index_like` - which index you are using to obtain the contents of the documents. This should match the preceeding Retriever. - `fb_docs` - number of feedback documents to examine - `fb_terms` - number of feedback terms to add to the query Note that different indexes can be used to achieve query expansion using an external collection (sometimes called collection enrichment or external feedback). For example, to expand queries using Wikipedia as an external resource, in order to get higher quality query re-weighted queries, would look like this: ```python pipe = (pt.terrier.Retriever(wikipedia_index, wmodel="BM25") >> pt.rewrite.QueryExpansion(wikipedia_index) >> pt.terrier.Retriever(local_index, wmodel="BM25") ) ``` ### RM3 Query Expansion We also provide RM3 query expansion. ```python pipe = (pt.terrier.Retriever(indexref, wmodel="BM25") >> pt.rewrite.RM3(indexref) >> pt.terrier.Retriever(indexref, wmodel="BM25") ) ``` ## Combining Rankings Sometimes we have good retrieval approaches and we wish to combine these in a unsupervised manner. We can do that using the linear combination operator: ```python bm25 = pt.terrier.Retriever(indexref, wmodel="BM25") dph = pt.terrier.Retriever(indexref, wmodel="DPH") linear = bm25 + dph ``` Of course, some weighting can help: ```python bm25 = pt.terrier.Retriever(indexref, wmodel="BM25") dph = pt.terrier.Retriever(indexref, wmodel="DPH") linear = bm25 + 2* dph ``` However, if the score distributions are not similar, finding a good weight can be tricky. Normalisation of retrieval scores can be advantagous in this case. PyTerrier-Alpha provide PerQueryMaxMinScore() to make the normalisation easy. ```python import pyterrier_alpha as pta bm25 = pt.terrier.Retriever(indexref, wmodel="BM25") >> pta.fusion.PerQueryMaxMinScore() dph = pt.terrier.Retriever(indexref, wmodel="DPH" >> pta.fusion.PerQueryMaxMinScore() linear = 0.75 * bm25 + 0.25 * dph ``` ## Learning to Rank Having shown some of the main formulations, lets show how to build different formulations into a LTR model. - Some authors report that it is useful to take a union of different retrieval mechanisms in order to build a good candidate set. We use the set-union operator here to combine the rankings of BM25 and DPH weighting models. - We then score each of the retrieved documents ```python bm25_cands = pt.terrier.Retriever(indexref, wmodel="BM25") dph_cands = pt.terrier.Retriever(indexref, wmodel="DPH") all_cands = bm25_cands | dph_cands all_features = all_cands >> ( pt.terrier.Retriever(indexref, wmodel="BM25F") ** pt.rewrite.SDM() >> pt.terrier.Retriever(indexref, wmodel="BM25") ) import xgboost as xgb params = {'objective': 'rank:ndcg', 'learning_rate': 0.1, 'gamma': 1.0, 'min_child_weight': 0.1, 'max_depth': 6, 'verbose': 2, 'random_state': 42 } lambdamart = pt.ltr.apply_learned_model(xgb.sklearn.XGBRanker(**params), form='ltr') final_pipe = all_features >> lambdamart final_pipe.fit(tr_topics, tr_qrels, va_topics, va_qrels) ```