Adaptive Retrieval

pyterrier-adaptive provides PyTerrier functionality to support Adaptive Retrieval.

Adaptive Retrieval is a family of techniques that help overcome the recall limitation of re-ranking approaches by identifying relevant documents that were missed by earlier stages.

API Documentation

class pyterrier_adaptive.GAR(scorer, corpus_graph, num_results=1000, batch_size=None, backfill=True, enabled=True, verbose=False)[source]

A Transformer that implements Graph-based Adaptive Re-ranking (GAR).

Required input columns: ['qid', 'query', 'docno', 'score', 'rank']

Output columns: ['qid', 'query', 'docno', 'score', 'rank', 'iteration']

Note

The iteration column defines the batch number that first identified the document in the results. Due to the alternating nature of the algorithm, even=initial retrieval, odd=corpus graph, and -1=backfilled.

Citation

MacAvaney et al. Adaptive Re-Ranking with a Corpus Graph. CIKM 2022. [link]
@inproceedings{DBLP:conf/cikm/MacAvaneyTM22,
  author       = {Sean MacAvaney and
                  Nicola Tonellotto and
                  Craig Macdonald},
  editor       = {Mohammad Al Hasan and
                  Li Xiong},
  title        = {Adaptive Re-Ranking with a Corpus Graph},
  booktitle    = {Proceedings of the 31st {ACM} International Conference on Information
                  {\&} Knowledge Management, Atlanta, GA, USA, October 17-21, 2022},
  pages        = {1491--1500},
  publisher    = {{ACM}},
  year         = {2022},
  url          = {https://doi.org/10.1145/3511808.3557231},
  doi          = {10.1145/3511808.3557231},
  timestamp    = {Wed, 19 Oct 2022 17:09:02 +0200},
  biburl       = {https://dblp.org/rec/conf/cikm/MacAvaneyTM22.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
Parameters:
  • scorer (Transformer) – A transformer that scores query-document pairs. It will only be provided with [‘qid, ‘query’, ‘docno’, ‘score’].

  • corpus_graph (CorpusGraph) – A graph of the corpus, enabling quick lookups of nearest neighbours

  • num_results (int) – The maximum number of documents to score (called “budget” and $c$ in the paper)

  • batch_size (int) – The number of documents to score at once (called $b$ in the paper). If not provided, will attempt to use the batch size from the scorer

  • backfill (bool) – If True, always include all documents from the initial stage, even if they were not re-scored

  • enabled (bool) – If False, perform re-ranking without using the corpus graph

  • verbose (bool) – If True, print progress information

transform(df)[source]

Applies Graph-based Adaptive Re-ranking to the provided dataframe. Essentially, Algorithm 1 from the paper.

Return type:

DataFrame

Parameters:

df (DataFrame)

class pyterrier_adaptive.CorpusGraph(path)[source]

This abstract class represents a corpus graph, which is a graph where each node represents a document and edges represent a relationship between those documents (e.g., the most similar documents).

abstractmethod neighbours(docid, weights=False)[source]

Abstract method. Implementations should return the nearest neighbors for the provided docid.

docid can either be an integer or string. When an integer, it is treated as an internal docid. When a string, it is treated as an external id (i.e., docno) and the corresponding internal id are looked up.

Returns either either a numpy array of other itnernal docids or a list of external docids, depending on the input.

Return type:

array | List[str] | Tuple[array, array] | Tuple[List[str], array]

Parameters:
  • docid (int | str) – The internal or external id of the document.

  • weights (bool) – If True, also return the weights of the edges.

class pyterrier_adaptive.NpTopKCorpusGraph(path, k=None)[source]

A top-k neighbor corpus graph.

This class represents a corpus graph where each document is connected to its k nearest neighbors.

Parameters:
  • path – The path to the directory containing the corpus graph.

  • k – The number of neighbors to keep for each document. If None, the original k is kept.

property edges_data: array

Returns the edges data as a numpy array.

property weights_data: array

Returns the edge weight data as a numpy array.

to_limit_k(k)[source]

Creates a version of the graph with a maximum of k edges from each node. k must be less than the number of edges per node from the original graph.

Return type:

NpTopKCorpusGraph

Parameters:

k (int)

class pyterrier_adaptive.Laff(model='macavaney/laff', *, device=None, batch_size=128, max_length=512, verbose=False)[source]

A transformer that computes a learned affinity score between two document texts using a transformer model.

Citation

Rathee et al. Quam: Adaptive Retrieval through Query Affinity Modelling. arXiv 2024. [link]
@article{DBLP:journals/corr/abs-2410-20286,
  author       = {Mandeep Rathee and
                  Sean MacAvaney and
                  Avishek Anand},
  title        = {Quam: Adaptive Retrieval through Query Affinity Modelling},
  journal      = {CoRR},
  volume       = {abs/2410.20286},
  year         = {2024},
  url          = {https://doi.org/10.48550/arXiv.2410.20286},
  doi          = {10.48550/ARXIV.2410.20286},
  eprinttype    = {arXiv},
  eprint       = {2410.20286},
  timestamp    = {Thu, 28 Nov 2024 21:32:46 +0100},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2410-20286.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
Parameters:
  • model (str) – the name of the transformer model to use.

  • device (str | device | None) – the device to use for the transformer model.

  • batch_size (int) – the batch size to use for processing.

  • max_length (int) – the maximum length of the input text.

  • verbose (bool) – whether to display progress bars.

compute_affinity(texts_left, texts_right)[source]

Compute the affinity scores between two lists of texts.

Return type:

List[float]

Parameters:
  • texts_left (str | List[str]) – the left-hand side texts.

  • texts_right (str | List[str]) – the right-hand side texts.

If either the left or right text is a string (or length-1 list), it is projected to the length of the other input (akin to numpy or torch projection).

A higher affinity score indicates the documents are more similar to one another.

Compute the Learned Affinity (LAFF) score between documents.
>>> from pyterrier_adaptive import Laff
>>> model = Laff()
>>> model.compute_affinity('the cat sat on the mat', ['cats like to sit in the sun', 'dogs like to play fetch'])
[5.46875, -3.140625]
Returns:

A list of affinity scores.

Parameters:
  • texts_left (str | List[str])

  • texts_right (str | List[str])

Return type:

List[float]

transform(inp)[source]

Compute the affinity scores between two columns of texts in the input.

Expects columns text (for the left-side text) and other_text (for the right-side text).

Results are sorted by the left-side text and the affinity score.

Return type:

DataFrame

Parameters:

inp (DataFrame)

wrap_graph(graph, text_loader)[source]

Wrap a corpus graph with the LAFF transformer for on-the-fly LAFF scomre computation.

Return type:

OnTheFlyLaffGraph

Parameters:
  • graph (CorpusGraph) – the input corpus graph.

  • text_loader (Transformer) – a transformer that loads the text for a given document.

Returns:

A corpus graph that computes LAFF scores on-the-fly.

apply_to_graph(graph, text_loader, out_path=None, *, verbose=None)[source]

Apply the LAFF transformer to a corpus graph to construct a new one.

Return type:

NpTopKCorpusGraph

Parameters:
  • graph (NpTopKCorpusGraph) – the input corpus graph.

  • text_loader (Transformer) – a transformer that loads the text for a given document.

  • out_path (str | None) – the path to save the output corpus graph. If not provided, the input graph’s path is used with a ‘.laff’ extension.

  • verbose (bool | None) – whether to display progress bars.

Bibliography

For more information on adaptive retrieval, see:

Citation

MacAvaney et al. Adaptive Re-Ranking with a Corpus Graph. CIKM 2022. [link]
@inproceedings{DBLP:conf/cikm/MacAvaneyTM22,
  author       = {Sean MacAvaney and
                  Nicola Tonellotto and
                  Craig Macdonald},
  editor       = {Mohammad Al Hasan and
                  Li Xiong},
  title        = {Adaptive Re-Ranking with a Corpus Graph},
  booktitle    = {Proceedings of the 31st {ACM} International Conference on Information
                  {\&} Knowledge Management, Atlanta, GA, USA, October 17-21, 2022},
  pages        = {1491--1500},
  publisher    = {{ACM}},
  year         = {2022},
  url          = {https://doi.org/10.1145/3511808.3557231},
  doi          = {10.1145/3511808.3557231},
  timestamp    = {Wed, 19 Oct 2022 17:09:02 +0200},
  biburl       = {https://dblp.org/rec/conf/cikm/MacAvaneyTM22.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Citation

MacAvaney et al. Adaptive Re-Ranking as an Information-Seeking Agent. CIKM Workshops 2022. [link]
@inproceedings{DBLP:conf/cikm/MacAvaneyTM22a,
  author       = {Sean MacAvaney and
                  Nicola Tonellotto and
                  Craig Macdonald},
  editor       = {Georgios Drakopoulos and
                  Eleanna Kafeza},
  title        = {Adaptive Re-Ranking as an Information-Seeking Agent},
  booktitle    = {Proceedings of the {CIKM} 2022 Workshops co-located with 31st {ACM}
                  International Conference on Information and Knowledge Management {(CIKM}
                  2022), Atlanta, USA, October 17-21, 2022},
  series       = {{CEUR} Workshop Proceedings},
  volume       = {3318},
  publisher    = {CEUR-WS.org},
  year         = {2022},
  url          = {https://ceur-ws.org/Vol-3318/paper9.pdf},
  timestamp    = {Fri, 10 Mar 2023 16:22:32 +0100},
  biburl       = {https://dblp.org/rec/conf/cikm/MacAvaneyTM22a.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Citation

Kulkarni et al. Lexically-Accelerated Dense Retrieval. SIGIR 2023. [link]
@inproceedings{DBLP:conf/sigir/KulkarniMGF23,
  author       = {Hrishikesh Kulkarni and
                  Sean MacAvaney and
                  Nazli Goharian and
                  Ophir Frieder},
  editor       = {Hsin{-}Hsi Chen and
                  Wei{-}Jou (Edward) Duh and
                  Hen{-}Hsen Huang and
                  Makoto P. Kato and
                  Josiane Mothe and
                  Barbara Poblete},
  title        = {Lexically-Accelerated Dense Retrieval},
  booktitle    = {Proceedings of the 46th International {ACM} {SIGIR} Conference on
                  Research and Development in Information Retrieval, {SIGIR} 2023, Taipei,
                  Taiwan, July 23-27, 2023},
  pages        = {152--162},
  publisher    = {{ACM}},
  year         = {2023},
  url          = {https://doi.org/10.1145/3539618.3591715},
  doi          = {10.1145/3539618.3591715},
  timestamp    = {Fri, 21 Jul 2023 22:25:19 +0200},
  biburl       = {https://dblp.org/rec/conf/sigir/KulkarniMGF23.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Citation

Frayling et al. Effective Adhoc Retrieval Through Traversal of a Query-Document Graph. ECIR (3) 2024. [link]
@inproceedings{DBLP:conf/ecir/FraylingMMO24,
  author       = {Erlend Frayling and
                  Sean MacAvaney and
                  Craig Macdonald and
                  Iadh Ounis},
  editor       = {Nazli Goharian and
                  Nicola Tonellotto and
                  Yulan He and
                  Aldo Lipani and
                  Graham McDonald and
                  Craig Macdonald and
                  Iadh Ounis},
  title        = {Effective Adhoc Retrieval Through Traversal of a Query-Document Graph},
  booktitle    = {Advances in Information Retrieval - 46th European Conference on Information
                  Retrieval, {ECIR} 2024, Glasgow, UK, March 24-28, 2024, Proceedings,
                  Part {III}},
  series       = {Lecture Notes in Computer Science},
  volume       = {14610},
  pages        = {89--104},
  publisher    = {Springer},
  year         = {2024},
  url          = {https://doi.org/10.1007/978-3-031-56063-7\_6},
  doi          = {10.1007/978-3-031-56063-7\_6},
  timestamp    = {Fri, 29 Mar 2024 23:01:35 +0100},
  biburl       = {https://dblp.org/rec/conf/ecir/FraylingMMO24.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Citation

MacAvaney and Tonellotto. A Reproducibility Study of PLAID. SIGIR 2024. [link]
@inproceedings{DBLP:conf/sigir/MacAvaneyT24,
  author       = {Sean MacAvaney and
                  Nicola Tonellotto},
  editor       = {Grace Hui Yang and
                  Hongning Wang and
                  Sam Han and
                  Claudia Hauff and
                  Guido Zuccon and
                  Yi Zhang},
  title        = {A Reproducibility Study of {PLAID}},
  booktitle    = {Proceedings of the 47th International {ACM} {SIGIR} Conference on
                  Research and Development in Information Retrieval, {SIGIR} 2024, Washington
                  DC, USA, July 14-18, 2024},
  pages        = {1411--1419},
  publisher    = {{ACM}},
  year         = {2024},
  url          = {https://doi.org/10.1145/3626772.3657856},
  doi          = {10.1145/3626772.3657856},
  timestamp    = {Sun, 06 Oct 2024 21:14:16 +0200},
  biburl       = {https://dblp.org/rec/conf/sigir/MacAvaneyT24.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Citation

Rathee et al. Quam: Adaptive Retrieval through Query Affinity Modelling. arXiv 2024. [link]
@article{DBLP:journals/corr/abs-2410-20286,
  author       = {Mandeep Rathee and
                  Sean MacAvaney and
                  Avishek Anand},
  title        = {Quam: Adaptive Retrieval through Query Affinity Modelling},
  journal      = {CoRR},
  volume       = {abs/2410.20286},
  year         = {2024},
  url          = {https://doi.org/10.48550/arXiv.2410.20286},
  doi          = {10.48550/ARXIV.2410.20286},
  eprinttype    = {arXiv},
  eprint       = {2410.20286},
  timestamp    = {Thu, 28 Nov 2024 21:32:46 +0100},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2410-20286.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}