Adaptive Retrieval

pyterrier-adaptive provides PyTerrier functionality to support Adaptive Retrieval.

Adaptive Retrieval is a family of techniques that help overcome the recall limitation of re-ranking approaches by identifying relevant documents that were missed by earlier stages.

API Documentation

class pyterrier_adaptive.GAR(scorer, corpus_graph, num_results=1000, batch_size=None, backfill=True, enabled=True, verbose=False)[source]

A Transformer that implements Graph-based Adaptive Re-ranking (GAR).

Required input columns: ['qid', 'query', 'docno', 'score', 'rank']

Output columns: ['qid', 'query', 'docno', 'score', 'rank', 'iteration']

Note

The iteration column defines the batch number that first identified the document in the results. Due to the alternating nature of the algorithm, even=initial retrieval, odd=corpus graph, and -1=backfilled.

Citation

MacAvaney et al. Adaptive Re-Ranking with a Corpus Graph. CIKM 2022. [link]
@inproceedings{DBLP:conf/cikm/MacAvaneyTM22,
  author       = {Sean MacAvaney and
                  Nicola Tonellotto and
                  Craig Macdonald},
  editor       = {Mohammad Al Hasan and
                  Li Xiong},
  title        = {Adaptive Re-Ranking with a Corpus Graph},
  booktitle    = {Proceedings of the 31st {ACM} International Conference on Information
                  {\&} Knowledge Management, Atlanta, GA, USA, October 17-21, 2022},
  pages        = {1491--1500},
  publisher    = {{ACM}},
  year         = {2022},
  url          = {https://doi.org/10.1145/3511808.3557231},
  doi          = {10.1145/3511808.3557231},
  timestamp    = {Wed, 19 Oct 2022 17:09:02 +0200},
  biburl       = {https://dblp.org/rec/conf/cikm/MacAvaneyTM22.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
Parameters:
  • scorer (Transformer) – A transformer that scores query-document pairs. It will only be provided with [‘qid, ‘query’, ‘docno’, ‘score’].

  • corpus_graph (CorpusGraph) – A graph of the corpus, enabling quick lookups of nearest neighbours

  • num_results (int) – The maximum number of documents to score (called “budget” and $c$ in the paper)

  • batch_size (int) – The number of documents to score at once (called $b$ in the paper). If not provided, will attempt to use the batch size from the scorer

  • backfill (bool) – If True, always include all documents from the initial stage, even if they were not re-scored

  • enabled (bool) – If False, perform re-ranking without using the corpus graph

  • verbose (bool) – If True, print progress information

transform(df)[source]

Applies Graph-based Adaptive Re-ranking to the provided dataframe. Essentially, Algorithm 1 from the paper.

Return type:

DataFrame

class pyterrier_adaptive.CorpusGraph(path)[source]

Initialize the artifact at the provided URL.

class pyterrier_adaptive.NpTopKCorpusGraph(path, k=None)[source]
to_limit_k(k)[source]

Creates a version of the graph with a maximum of k edges from each node. k must be less than the number of edges per node from the original graph.

Return type:

NpTopKCorpusGraph

class pyterrier_adaptive.Laff(model='macavaney/laff', *, device=None, batch_size=128, max_length=512, verbose=False)[source]

A transformer that computes a learned affinity score between two document texts using a transformer model.

Citation

Rathee et al. Quam: Adaptive Retrieval through Query Affinity Modelling. arXiv 2024. [link]
@article{DBLP:journals/corr/abs-2410-20286,
  author       = {Mandeep Rathee and
                  Sean MacAvaney and
                  Avishek Anand},
  title        = {Quam: Adaptive Retrieval through Query Affinity Modelling},
  journal      = {CoRR},
  volume       = {abs/2410.20286},
  year         = {2024},
  url          = {https://doi.org/10.48550/arXiv.2410.20286},
  doi          = {10.48550/ARXIV.2410.20286},
  eprinttype    = {arXiv},
  eprint       = {2410.20286},
  timestamp    = {Thu, 28 Nov 2024 21:32:46 +0100},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2410-20286.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
Parameters:
  • model – the name of the transformer model to use.

  • device – the device to use for the transformer model.

  • batch_size – the batch size to use for processing.

  • max_length – the maximum length of the input text.

  • verbose – whether to display progress bars.

compute_affinity(texts_left, texts_right)[source]

Compute the affinity scores between two lists of texts.

Return type:

List[float]

Parameters:
  • texts_left – the left-hand side texts.

  • texts_right – the right-hand side texts.

If either the left or right text is a string (or length-1 list), it is projected to the length of the other input (akin to numpy or torch projection).

A higher affinity score indicates the documents are more similar to one another.

Compute the Learned Affinity (LAFF) score between documents.
>>> from pyterrier_adaptive import Laff
>>> model = Laff()
>>> model.compute_affinity('the cat sat on the mat', ['cats like to sit in the sun', 'dogs like to play fetch'])
[5.46875, -3.140625]
Returns:

A list of affinity scores.

transform(inp)[source]

Compute the affinity scores between two columns of texts in the input.

Expects columns text (for the left-side text) and other_text (for the right-side text).

Results are sorted by the left-side text and the affinity score.

Return type:

DataFrame

wrap_graph(graph, text_loader)[source]

Wrap a corpus graph with the LAFF transformer for on-the-fly LAFF scomre computation.

Return type:

OnTheFlyLaffGraph

Parameters:
  • graph – the input corpus graph.

  • text_loader – a transformer that loads the text for a given document.

Returns:

A corpus graph that computes LAFF scores on-the-fly.

apply_to_graph(graph, text_loader, out_path=None, *, verbose=None)[source]

Apply the LAFF transformer to a corpus graph to construct a new one.

Return type:

NpTopKCorpusGraph

Parameters:
  • graph – the input corpus graph.

  • text_loader – a transformer that loads the text for a given document.

  • out_path – the path to save the output corpus graph. If not provided, the input graph’s path is used with a ‘.laff’ extension.

  • verbose – whether to display progress bars.

Bibliography

For more information on adaptive retrieval, see:

Citation

MacAvaney et al. Adaptive Re-Ranking with a Corpus Graph. CIKM 2022. [link]
@inproceedings{DBLP:conf/cikm/MacAvaneyTM22,
  author       = {Sean MacAvaney and
                  Nicola Tonellotto and
                  Craig Macdonald},
  editor       = {Mohammad Al Hasan and
                  Li Xiong},
  title        = {Adaptive Re-Ranking with a Corpus Graph},
  booktitle    = {Proceedings of the 31st {ACM} International Conference on Information
                  {\&} Knowledge Management, Atlanta, GA, USA, October 17-21, 2022},
  pages        = {1491--1500},
  publisher    = {{ACM}},
  year         = {2022},
  url          = {https://doi.org/10.1145/3511808.3557231},
  doi          = {10.1145/3511808.3557231},
  timestamp    = {Wed, 19 Oct 2022 17:09:02 +0200},
  biburl       = {https://dblp.org/rec/conf/cikm/MacAvaneyTM22.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Citation

MacAvaney et al. Adaptive Re-Ranking as an Information-Seeking Agent. CIKM Workshops 2022. [link]
@inproceedings{DBLP:conf/cikm/MacAvaneyTM22a,
  author       = {Sean MacAvaney and
                  Nicola Tonellotto and
                  Craig Macdonald},
  editor       = {Georgios Drakopoulos and
                  Eleanna Kafeza},
  title        = {Adaptive Re-Ranking as an Information-Seeking Agent},
  booktitle    = {Proceedings of the {CIKM} 2022 Workshops co-located with 31st {ACM}
                  International Conference on Information and Knowledge Management {(CIKM}
                  2022), Atlanta, USA, October 17-21, 2022},
  series       = {{CEUR} Workshop Proceedings},
  volume       = {3318},
  publisher    = {CEUR-WS.org},
  year         = {2022},
  url          = {https://ceur-ws.org/Vol-3318/paper9.pdf},
  timestamp    = {Fri, 10 Mar 2023 16:22:32 +0100},
  biburl       = {https://dblp.org/rec/conf/cikm/MacAvaneyTM22a.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Citation

Kulkarni et al. Lexically-Accelerated Dense Retrieval. SIGIR 2023. [link]
@inproceedings{DBLP:conf/sigir/KulkarniMGF23,
  author       = {Hrishikesh Kulkarni and
                  Sean MacAvaney and
                  Nazli Goharian and
                  Ophir Frieder},
  editor       = {Hsin{-}Hsi Chen and
                  Wei{-}Jou (Edward) Duh and
                  Hen{-}Hsen Huang and
                  Makoto P. Kato and
                  Josiane Mothe and
                  Barbara Poblete},
  title        = {Lexically-Accelerated Dense Retrieval},
  booktitle    = {Proceedings of the 46th International {ACM} {SIGIR} Conference on
                  Research and Development in Information Retrieval, {SIGIR} 2023, Taipei,
                  Taiwan, July 23-27, 2023},
  pages        = {152--162},
  publisher    = {{ACM}},
  year         = {2023},
  url          = {https://doi.org/10.1145/3539618.3591715},
  doi          = {10.1145/3539618.3591715},
  timestamp    = {Fri, 21 Jul 2023 22:25:19 +0200},
  biburl       = {https://dblp.org/rec/conf/sigir/KulkarniMGF23.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Citation

Frayling et al. Effective Adhoc Retrieval Through Traversal of a Query-Document Graph. ECIR (3) 2024. [link]
@inproceedings{DBLP:conf/ecir/FraylingMMO24,
  author       = {Erlend Frayling and
                  Sean MacAvaney and
                  Craig Macdonald and
                  Iadh Ounis},
  editor       = {Nazli Goharian and
                  Nicola Tonellotto and
                  Yulan He and
                  Aldo Lipani and
                  Graham McDonald and
                  Craig Macdonald and
                  Iadh Ounis},
  title        = {Effective Adhoc Retrieval Through Traversal of a Query-Document Graph},
  booktitle    = {Advances in Information Retrieval - 46th European Conference on Information
                  Retrieval, {ECIR} 2024, Glasgow, UK, March 24-28, 2024, Proceedings,
                  Part {III}},
  series       = {Lecture Notes in Computer Science},
  volume       = {14610},
  pages        = {89--104},
  publisher    = {Springer},
  year         = {2024},
  url          = {https://doi.org/10.1007/978-3-031-56063-7\_6},
  doi          = {10.1007/978-3-031-56063-7\_6},
  timestamp    = {Fri, 29 Mar 2024 23:01:35 +0100},
  biburl       = {https://dblp.org/rec/conf/ecir/FraylingMMO24.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Citation

MacAvaney and Tonellotto. A Reproducibility Study of PLAID. SIGIR 2024. [link]
@inproceedings{DBLP:conf/sigir/MacAvaneyT24,
  author       = {Sean MacAvaney and
                  Nicola Tonellotto},
  editor       = {Grace Hui Yang and
                  Hongning Wang and
                  Sam Han and
                  Claudia Hauff and
                  Guido Zuccon and
                  Yi Zhang},
  title        = {A Reproducibility Study of {PLAID}},
  booktitle    = {Proceedings of the 47th International {ACM} {SIGIR} Conference on
                  Research and Development in Information Retrieval, {SIGIR} 2024, Washington
                  DC, USA, July 14-18, 2024},
  pages        = {1411--1419},
  publisher    = {{ACM}},
  year         = {2024},
  url          = {https://doi.org/10.1145/3626772.3657856},
  doi          = {10.1145/3626772.3657856},
  timestamp    = {Sun, 06 Oct 2024 21:14:16 +0200},
  biburl       = {https://dblp.org/rec/conf/sigir/MacAvaneyT24.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Citation

Rathee et al. Quam: Adaptive Retrieval through Query Affinity Modelling. arXiv 2024. [link]
@article{DBLP:journals/corr/abs-2410-20286,
  author       = {Mandeep Rathee and
                  Sean MacAvaney and
                  Avishek Anand},
  title        = {Quam: Adaptive Retrieval through Query Affinity Modelling},
  journal      = {CoRR},
  volume       = {abs/2410.20286},
  year         = {2024},
  url          = {https://doi.org/10.48550/arXiv.2410.20286},
  doi          = {10.48550/ARXIV.2410.20286},
  eprinttype    = {arXiv},
  eprint       = {2410.20286},
  timestamp    = {Thu, 28 Nov 2024 21:32:46 +0100},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2410-20286.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}