Dense Retrieval API Reference¶

Encoding¶

class pyterrier_dr.BiEncoder(*args, **kwargs)[source]¶

Represents a single-vector dense bi-encoder.

A BiEncoder encodes the text of a query or document into a dense vector.

This class functions as a transformer factory:

Query encoding using query_encoder()
Document encoding using doc_encoder()
Text scoring (re-ranking) using text_scorer()

It can also be used as a transformer directly. It infers which transformer to use based on columns present in the input frame.
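
As a rough illustration of column-based dispatch, the sketch below picks a mode from the column names of an input frame. The rules are assumptions made for this example (the `query` column name, the default `text` field, and the precedence order are not stated above), and `infer_mode` is a hypothetical helper, so the actual logic in BiEncoder may differ.

```python
def infer_mode(columns, text_field="text"):
    """Pick an encoding mode from the column names of an input frame.

    Assumed rules (hypothetical, for illustration only):
      - query and document text present -> score query/document pairs
      - only query present              -> encode queries
      - only document text present      -> encode documents
    """
    cols = set(columns)
    if {"query", text_field} <= cols:
        return "text_scorer"
    if "query" in cols:
        return "query_encoder"
    if text_field in cols:
        return "doc_encoder"
    raise ValueError(f"cannot infer a mode from columns {sorted(cols)}")


print(infer_mode(["qid", "query"]))                   # query_encoder
print(infer_mode(["docno", "text"]))                  # doc_encoder
print(infer_mode(["qid", "query", "docno", "text"]))  # text_scorer
```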

Note that in most cases, you will want to use a BiEncoder as part of a pipeline with a FlexIndex to perform dense indexing and retrieval.

Parameters:

batch_size – The default batch size to use for query/document encoding
text_field – The field in the input dataframe that contains the document text
verbose – Whether to show progress bars

query_encoder(verbose=None, batch_size=None)[source]¶

Query encoding

Return type:: Transformer

doc_encoder(verbose=None, batch_size=None)[source]¶

Doc encoding

Return type:: Transformer

text_scorer(verbose=None, batch_size=None, sim_fn=None)[source]¶

Text Scoring (re-ranking)

Return type:: Transformer

property sim_fn: SimFn¶: The similarity function to use between embeddings for this model

abstractmethod encode_queries_torch(texts, batch_size=None)[source]¶

Abstract method to encode a list of query texts into dense vectors.

This function is called by encode_queries() (and so, indirectly, by the transformer returned by query_encoder()). Because it may also be used for training, it should not detach gradients; encode_queries() applies torch.no_grad().

Parameters:

texts (List[str]) – A list of query texts
batch_size (int | None) – The batch size to use for encoding

Returns:

A tensor of shape (n_queries, n_dims)

Return type:

torch.Tensor

encode_queries(texts, batch_size=None)[source]¶

Default method to encode a list of query texts into dense vectors and return as numpy array.

This function is used by the transformer returned by query_encoder().

The default implementation of this method calls encode_queries_torch().

Parameters:

texts (List[str]) – A list of query texts
batch_size (int | None) – The batch size to use for encoding

Returns:

A numpy array of shape (n_queries, n_dims)

Return type:

np.array

abstractmethod encode_docs(texts, batch_size=None)[source]¶

Abstract method to encode a list of document texts into dense vectors.

This function is used by the transformer returned by doc_encoder().

Parameters:

texts (List[str]) – A list of document texts
batch_size (int | None) – The batch size to use for encoding

Returns:

A numpy array of shape (n_docs, n_dims)

Return type:

np.array

class pyterrier_dr.SBertBiEncoder(*args, **kwargs)[source]¶

encode_queries_torch(texts, batch_size=None, prompt=None, normalize_embeddings=False, **kwargs)[source]¶

Encode queries while enabling gradients (for training).

Return type:

Tensor

Parameters:

texts (list[str])
batch_size (int | None)
prompt (str | None)
normalize_embeddings (bool)
kwargs (Any)

Indexing and Retrieval¶

class pyterrier_dr.FlexIndex(path, *, sim_fn=SimFn.dot, verbose=True)[source]¶

Represents a FLexible EXecution (FLEX) Index, which is a dense index format.

FLEX allows for a variety of retrieval implementations (NumPy, FAISS, etc.) and algorithms (exhaustive, HNSW, etc.) to be tested. In most cases, the same vector storage can be used across implementations and algorithms, saving considerably on disk space.

Parameters:

path (str) – The path to the index directory
sim_fn (SimFn | str) – The similarity function to use
verbose (bool) – Whether to display verbose output (e.g., progress bars)

Indexing¶

index(inp)[source]¶

Index the given input data stream to a new index at this location.

Each record in inp is expected to be a dictionary containing at least two keys: docno (a unique document identifier) and doc_vec (a dense vector representation of the document).

Typically this method will be used in a pipeline of operations, where the input data is first transformed by a document encoder to add the doc_vec values before it is indexed. For example:

Index documents into a FlexIndex using a TasB encoder.¶

from pyterrier_dr import TasB, FlexIndex
encoder = TasB.dot()
index = FlexIndex('my_index')
pipeline = encoder >> index
pipeline.index([
    {'docno': 'doc1', 'text': 'hello'},
    {'docno': 'doc2', 'text': 'world'},
])

Parameters:: inp (Iterable[Dict]) – An iterable of dictionaries to index.
Returns:: A reference back to this index (self).
Return type:: pyterrier_alpha.Artifact
Raises:: RuntimeError – If the index is already built.

indexer(*, mode=IndexingMode.create)[source]¶

Return an indexer for this index with the specified options.

This transformer gives more fine-grained control over the indexing process, allowing you to specify whether to create a new index or overwrite an existing one.

Similar to index(), this method will typically be used in a pipeline of operations, where the input data is first transformed by a document encoder to add the doc_vec values before it is indexed. For example:

Overwrite a FlexIndex using a TasB encoder.¶

from pyterrier_dr import TasB, FlexIndex
encoder = TasB.dot()
index = FlexIndex('my_index')
pipeline = encoder >> index.indexer(mode='overwrite')
pipeline.index([
    {'docno': 'doc1', 'text': 'hello'},
    {'docno': 'doc2', 'text': 'world'},
])

Parameters:: mode (IndexingMode | str) – The indexing mode to use (create or overwrite).
Returns:: A new indexer instance.
Return type:: Indexer
Return type:: FlexIndexer

Retrieval¶

retriever(*, num_results=1000)¶: Returns a transformer that performs basic exact retrieval over indexed vectors using a brute force search. An alias to np_retriever().

np_retriever(*, num_results=1000, batch_size=None, drop_query_vec=False, mask=None)¶

Return a retriever that uses numpy to perform a brute force search over the index.

The returned transformer expects a DataFrame with columns qid and query_vec. It outputs a result frame containing the retrieved results.

Parameters:

num_results (int) – The number of results to return per query.
batch_size (int | None) – The number of documents to score in each batch.
drop_query_vec (bool) – Whether to drop the query vector from the output.
mask (ndarray | None) – Optional binary array (0 or 1) of length equal to the number of documents. Documents with mask value 0 have their scores zeroed out during retrieval.

Returns:

A retriever that uses numpy to perform a brute force search.

Return type:

Transformer
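
The brute-force search described above can be sketched in plain numpy. This is a minimal illustration, not the library's implementation: `brute_force_search` is a hypothetical helper, dot-product similarity is assumed, and the full score matrix is kept in memory for simplicity.

```python
import numpy as np


def brute_force_search(query_vecs, doc_vecs, num_results=1000, batch_size=4096):
    """Exhaustive dot-product search, scoring documents in batches.

    Returns (doc_ids, scores), each of shape (n_queries, k), best first.
    """
    n_docs = doc_vecs.shape[0]
    k = min(num_results, n_docs)
    scores = np.empty((query_vecs.shape[0], n_docs), dtype=np.float32)
    for start in range(0, n_docs, batch_size):
        stop = min(start + batch_size, n_docs)
        # Score one batch of documents against every query.
        scores[:, start:stop] = query_vecs @ doc_vecs[start:stop].T
    # Sort each row by descending score and keep the top k.
    order = np.argsort(-scores, axis=1, kind="stable")[:, :k]
    return order, np.take_along_axis(scores, order, axis=1)


docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], dtype=np.float32)
queries = np.array([[1.0, 0.2]], dtype=np.float32)
ids, top_scores = brute_force_search(queries, docs, num_results=2)
print(ids.tolist())  # [[2, 0]]
```

Every query is compared against every document, so the cost grows linearly with the size of the index; the approximate retrievers below trade some accuracy for speed.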

torch_retriever(*, num_results=1000, device=None, fp16=False, qbatch=64, drop_query_vec=False, mask=None)¶

Return a retriever that uses pytorch to perform brute-force retrieval over the indexed vectors.

The returned pyterrier.Transformer expects a DataFrame with columns qid, query_vec.

Caution

This method loads the entire index into memory on the provided device. If the index is too large to fit in memory, consider using a different retriever that does not fully load the index into memory, like np_retriever().

Parameters:

num_results (int) – The number of results to return per query.
device (str | None) – The device to use for scoring. If not provided, the default device is used (cuda if available, otherwise cpu).
fp16 (bool) – Whether to use half precision (fp16) for scoring.
qbatch (int) – The number of queries to score in each batch.
drop_query_vec (bool) – Whether to drop the query vector from the output.
mask (ndarray | None) – Optional binary array (0 or 1) of length equal to the number of documents. Documents with mask value 0 are excluded from retrieval entirely (internally converted to a document id subset for an in-memory gather, rather than zeroing scores as np_retriever() does).

Returns:

A transformer that retrieves using pytorch.

Return type:

Transformer
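
The two mask behaviours described for np_retriever() and torch_retriever() can be contrasted with a small numpy sketch. It only illustrates the documented difference (zeroing scores versus gathering a subset) under the assumption of dot-product scoring; it is not how either retriever is actually implemented.

```python
import numpy as np

doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], dtype=np.float32)
query = np.array([1.0, 0.2], dtype=np.float32)
mask = np.array([1, 0, 1])  # document 1 is masked out

# Strategy A (as described for np_retriever): zero the scores of masked documents.
# In this sketch a masked document keeps a score of 0 and stays in the ranking.
zeroed = (doc_vecs @ query) * mask
ranking_a = np.argsort(-zeroed, kind="stable")

# Strategy B (as described for torch_retriever): convert the mask to a subset of
# document ids, gather only those vectors, and score the subset.
subset_ids = np.flatnonzero(mask)
subset_scores = doc_vecs[subset_ids] @ query
ranking_b = subset_ids[np.argsort(-subset_scores, kind="stable")]

print(ranking_a.tolist())  # [2, 0, 1]
print(ranking_b.tolist())  # [2, 0]
```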

faiss_flat_retriever(*, gpu=False, qbatch=64, drop_query_vec=False)¶

Returns a retriever that uses FAISS to perform brute-force search over the indexed vectors.

Parameters:

gpu – Whether to load the index onto GPU for scoring
qbatch – The batch size during search
drop_query_vec – Whether to drop the query vector from the output

Returns:

A retriever that uses FAISS to perform brute-force search over the indexed vectors

Return type:

Transformer

Note

This transformer requires the faiss package to be installed.

Citation

Douze et al. The Faiss library. arXiv 2024. [link]

@article{DBLP:journals/corr/abs-2401-08281,
  author       = {Matthijs Douze and
                  Alexandr Guzhva and
                  Chengqi Deng and
                  Jeff Johnson and
                  Gergely Szilvasy and
                  Pierre{-}Emmanuel Mazar{\'{e}} and
                  Maria Lomeli and
                  Lucas Hosseini and
                  Herv{\'{e}} J{\'{e}}gou},
  title        = {The Faiss library},
  journal      = {CoRR},
  volume       = {abs/2401.08281},
  year         = {2024},
  url          = {https://doi.org/10.48550/arXiv.2401.08281},
  doi          = {10.48550/ARXIV.2401.08281},
  eprinttype    = {arXiv},
  eprint       = {2401.08281},
  timestamp    = {Thu, 01 Feb 2024 15:35:36 +0100},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2401-08281.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

faiss_hnsw_retriever(neighbours=32, *, num_results=1000, ef_construction=40, ef_search=16, cache=True, search_bounded_queue=True, qbatch=64, drop_query_vec=False)¶

Returns a retriever that uses FAISS over a HNSW index.

Creates the HNSW graph structure if it does not already exist. When cache=True (default), this graph structure is cached to disk for subsequent use.

Parameters:

neighbours (int) – The number of neighbours of the constructed neighborhood graph
num_results (int) – The number of results to return per query
ef_construction (int) – The number of neighbours to consider during construction
ef_search (int) – The number of neighbours to consider during search
cache (bool) – Whether to cache the index to disk
search_bounded_queue (bool) – Whether to use a bounded queue during search
qbatch (int) – The batch size during search
drop_query_vec (bool) – Whether to drop the query vector from the output

Returns:

A retriever that uses FAISS over a HNSW index

Return type:

Transformer

Note

This transformer requires the faiss package to be installed.

Citation

Douze et al. The Faiss library. arXiv 2024. [link]

faiss_ivf_retriever(*, num_results=1000, train_sample=None, n_list=None, cache=True, n_probe=1, drop_query_vec=False)¶

Returns a retriever that uses FAISS over an IVF index.

If the IVF structure does not already exist, it is created. When cache=True (default), the structure is also cached to disk.

Parameters:

num_results (int) – The number of results to return per query
train_sample (int | None) – The number of training samples to use for training the index. If not provided, a default value is used (approximately the square root of the number of documents).
n_list (int | None) – The number of posting lists to use for the index. If not provided, a default value is used (approximately train_sample/39).
cache (bool) – Whether to cache the index to disk.
n_probe (int) – The number of posting lists to probe during search. The higher the value, the better the approximation will be, but the longer it will take.
drop_query_vec (bool) – Whether to drop the query vector from the output.

Returns:

A retriever that uses FAISS over an IVF index

Return type:

Transformer
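
To make the roles of n_list and n_probe concrete, the following toy inverted-file (IVF) search is written in plain numpy. It is a simplified sketch rather than what FAISS does: the helper names are hypothetical, centroids are sampled documents instead of trained k-means centroids, and dot-product similarity is assumed.

```python
import numpy as np


def build_ivf(doc_vecs, n_list, rng):
    """Assign every document to its closest centroid (one posting list each).

    Simplification: centroids are sampled documents rather than k-means output.
    """
    centroids = doc_vecs[rng.choice(len(doc_vecs), size=n_list, replace=False)]
    assignment = np.argmax(doc_vecs @ centroids.T, axis=1)
    lists = [np.flatnonzero(assignment == i) for i in range(n_list)]
    return centroids, lists


def ivf_search(query, doc_vecs, centroids, lists, n_probe=1, num_results=10):
    """Score only the documents in the n_probe posting lists closest to the query."""
    probe = np.argsort(-(centroids @ query), kind="stable")[:n_probe]
    candidates = np.concatenate([lists[i] for i in probe])
    scores = doc_vecs[candidates] @ query
    order = np.argsort(-scores, kind="stable")[:num_results]
    return candidates[order], scores[order]


rng = np.random.default_rng(0)
docs = rng.normal(size=(200, 8)).astype(np.float32)
query = rng.normal(size=8).astype(np.float32)
centroids, lists = build_ivf(docs, n_list=4, rng=rng)

# Probing one list scores only part of the collection (approximate search).
approx_ids, _ = ivf_search(query, docs, centroids, lists, n_probe=1)
# Probing every list scores all documents, which matches exhaustive search.
exact_ids, _ = ivf_search(query, docs, centroids, lists, n_probe=4)
```

Raising n_probe moves the result closer to the exhaustive ranking at the cost of scoring more documents.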

Note

This transformer requires the faiss package to be installed.

Citation

Douze et al. The Faiss library. arXiv 2024. [link]

flatnav_retriever(k=32, *, ef_search=100, num_initializations=100, ef_construction=100, threads=16, num_results=1000, cache=True, qbatch=64, drop_query_vec=False, verbose=False)¶

Returns a retriever that searches over a flatnav index.

Return type:

Transformer

Parameters:

k (int) – the maximum number of edges per document in the index
ef_search (int) – the size of the list during searches. Higher values are slower but more accurate.
num_initializations (int) – the number of random initializations to use during search.
ef_construction (int) – the size of the list during graph construction. Higher values are slower but more accurate.
threads (int) – the number of threads to use
num_results (int) – the number of results to return per query
cache (bool) – whether to cache the index to disk
qbatch (int) – the number of queries to search at once
drop_query_vec (bool) – whether to drop the query_vec column after retrieval
verbose (bool) – whether to show progress bars

Added in version 0.4.0.

Changed in version 0.4.1: fixed bug with num_initializations

Note

This transformer requires the flatnav package to be installed. Instructions are available in the flatnav repository.

Citation

Munyampirwa et al. Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs". arXiv 2024. [link]

scann_retriever(*, n_leaves=None, leaves_to_search=1, num_results=1000, train_sample=None, drop_query_vec=False)¶

Returns a retriever over a ScaNN (Scalable Nearest Neighbors) index.

Parameters:

n_leaves (int, optional) – Number of leaves in the ScaNN index. Defaults to approximately sqrt(doc_count).
leaves_to_search (int, optional) – Number of leaves to search. Defaults to 1. The higher the value, the more accurate the search.
num_results (int, optional) – Number of results to return. Defaults to 1000.
train_sample (int, optional) – Number of training samples. Defaults to n_leaves*39.
drop_query_vec (bool, optional) – Whether to drop the query vector from the output.

Returns:

A transformer that retrieves using ScaNN.

Return type:

Transformer
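
The documented defaults for n_leaves and train_sample amount to a short calculation, sketched below. `scann_defaults` is a hypothetical helper, and rounding up with ceil is an assumption, since the parameter description only says "approximately".

```python
import math


def scann_defaults(doc_count, n_leaves=None, train_sample=None):
    """Fill in unset values following the defaults documented above.

    Rounding up with ceil is an assumption; the text only says "approximately".
    """
    if n_leaves is None:
        n_leaves = math.ceil(math.sqrt(doc_count))
    if train_sample is None:
        train_sample = n_leaves * 39
    return n_leaves, train_sample


print(scann_defaults(1_000_000))  # (1000, 39000)
```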

Note

This method requires the scann package. Install it via pip install scann.

Citation

Guo et al. Accelerating Large-Scale Inference with Anisotropic Vector Quantization. ICML 2020. [link]

@inproceedings{DBLP:conf/icml/GuoSLGSCK20,
  author       = {Ruiqi Guo and
                  Philip Sun and
                  Erik Lindgren and
                  Quan Geng and
                  David Simcha and
                  Felix Chern and
                  Sanjiv Kumar},
  title        = {Accelerating Large-Scale Inference with Anisotropic Vector Quantization},
  booktitle    = {Proceedings of the 37th International Conference on Machine Learning,
                  {ICML} 2020, 13-18 July 2020, Virtual Event},
  series       = {Proceedings of Machine Learning Research},
  volume       = {119},
  pages        = {3887--3896},
  publisher    = {{PMLR}},
  year         = {2020},
  url          = {http://proceedings.mlr.press/v119/guo20h.html},
  timestamp    = {Tue, 15 Dec 2020 17:40:18 +0100},
  biburl       = {https://dblp.org/rec/conf/icml/GuoSLGSCK20.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

voyager_retriever(neighbours=12, *, num_results=1000, ef_construction=200, random_seed=1, storage_data_type='float32', query_ef=10, drop_query_vec=False)¶

Returns a retriever that uses HNSW to search over a Voyager index.

Parameters:

neighbours (int, optional) – Number of neighbours to search. Defaults to 12.
num_results (int, optional) – Number of results to return per query. Defaults to 1000.
ef_construction (int, optional) – Expansion factor for graph construction. Defaults to 200.
random_seed (int, optional) – Random seed. Defaults to 1.
storage_data_type (str, optional) – Storage data type. One of ‘float32’, ‘float8’, ‘e4m3’. Defaults to ‘float32’.
query_ef (int, optional) – Expansion factor during querying. Defaults to 10.
drop_query_vec (bool, optional) – Drop the query vector from the output. Defaults to False.

Returns:

A retriever that uses HNSW to search over a Voyager index.

Return type:

Transformer

Note

This method requires the voyager package. Install it via pip install voyager.

Re-Ranking¶

scorer()¶: An alias to np_scorer().

np_scorer(*, num_results=None)¶

Return a scorer that uses numpy to score (re-rank) results using indexed vectors.

The returned transformer expects a DataFrame with columns qid, query_vec and docno. (If an internal docid column is provided, this will be used to speed up vector lookups.)

This method uses memory-mapping to avoid loading the entire index into memory at once.

Parameters:

num_results (int | None) – The number of results to return per query. If not provided, all results from the original frame are returned.
mask – Optional binary array (0 or 1) of length equal to the number of documents. Documents with mask value 0 have their scores zeroed out during retrieval.

Returns:

A transformer that scores query vectors with numpy.

Return type:

Transformer
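
The idea of scoring candidates from memory-mapped vectors can be sketched with numpy alone. The file name, the raw float32 layout, and the `score_candidates` helper are assumptions made for this example; they are not the FlexIndex storage format or API.

```python
import os
import tempfile

import numpy as np

# Write a small vector file to disk to stand in for the index's vector storage.
# (The file name and raw float32 layout are assumptions for this example.)
n_docs, n_dims = 5, 4
vecs = np.arange(n_docs * n_dims, dtype=np.float32).reshape(n_docs, n_dims)
path = os.path.join(tempfile.mkdtemp(), "vecs.f4")
vecs.tofile(path)

# Memory-map the file: rows are read from disk only when they are accessed.
mapped = np.memmap(path, dtype=np.float32, mode="r", shape=(n_docs, n_dims))


def score_candidates(query_vec, docids, num_results=None):
    """Re-rank candidate docids for one query by dot product with query_vec."""
    doc_vecs = np.asarray(mapped[docids])  # gathers only the requested rows
    scores = doc_vecs @ query_vec
    order = np.argsort(-scores, kind="stable")
    if num_results is not None:
        order = order[:num_results]
    return docids[order], scores[order]


query_vec = np.array([1.0, 0.0, 0.0, 0.0], dtype=np.float32)
reranked, scores = score_candidates(query_vec, np.array([0, 3, 1]))
print(reranked.tolist())  # [3, 1, 0]
```

Only the rows for the candidate documents are touched, which is why re-ranking a short list stays cheap even when the full set of vectors is larger than memory.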

torch_scorer(*, num_results=None, device=None, fp16=False)¶

Return a scorer that uses pytorch to score (re-rank) results using indexed vectors.

The returned pyterrier.Transformer expects a DataFrame with columns qid, query_vec and docno. (If an internal docid column is provided, this will be used to speed up vector lookups.)

Caution

This method loads the entire index into memory on the provided device. If the index is too large to fit in memory, consider using a different scorer that does not fully load the index into memory, like np_scorer().

Parameters:

num_results (int | None) – The number of results to return per query. If not provided, all results from the original frame are returned.
device (str | None) – The device to use for scoring. If not provided, the default device is used (cuda if available, otherwise cpu).
fp16 (bool) – Whether to use half precision (fp16) for scoring.

Returns:

A transformer that scores query vectors with pytorch.

Return type:

Transformer

gar(k=16, *, batch_size=128, num_results=1000)¶

Returns a retriever that uses a corpus graph to search over a FlexIndex.

Parameters:

k (int) – Number of neighbours in the corpus graph. Defaults to 16.
batch_size (int) – Batch size for retrieval. Defaults to 128.
num_results (int) – Number of results per query to return. Defaults to 1000.

Returns:

A retriever that uses a corpus graph to search over a FlexIndex.

Return type:

Transformer

Citation

MacAvaney et al. Adaptive Re-Ranking with a Corpus Graph. CIKM 2022. [link]

@inproceedings{DBLP:conf/cikm/MacAvaneyTM22,
  author       = {Sean MacAvaney and
                  Nicola Tonellotto and
                  Craig Macdonald},
  editor       = {Mohammad Al Hasan and
                  Li Xiong},
  title        = {Adaptive Re-Ranking with a Corpus Graph},
  booktitle    = {Proceedings of the 31st {ACM} International Conference on Information
                  {\&} Knowledge Management, Atlanta, GA, USA, October 17-21, 2022},
  pages        = {1491--1500},
  publisher    = {{ACM}},
  year         = {2022},
  url          = {https://doi.org/10.1145/3511808.3557231},
  doi          = {10.1145/3511808.3557231},
  timestamp    = {Wed, 19 Oct 2022 17:09:02 +0200},
  biburl       = {https://dblp.org/rec/conf/cikm/MacAvaneyTM22.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

ladr_proactive(k=16, *, hops=1, num_results=1000, dense_scorer=None, drop_query_vec=False, budget=False)¶

Returns a proactive LADR (Lexically-Accelerated Dense Retrieval) transformer.

Parameters:

k (int) – The number of neighbours in the corpus graph.
hops (int) – The number of hops to consider. Defaults to 1.
num_results (int) – The number of results to return per query.
dense_scorer (Transformer, optional) – The dense scorer to use. Defaults to np_scorer().
drop_query_vec (bool) – Whether to drop the query vector from the output.
budget (bool or int) – The maximum number of vectors to score. If False, no maximum is applied. If True, the budget is set to num_results. If an integer, this value is used as the budget.

Returns:

A proactive LADR transformer.

Return type:

Transformer
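
The general shape of the proactive strategy, as we understand it from the parameters above, is: start from seed documents (for example, lexical retrieval results), expand them through the corpus graph for a fixed number of hops, then score all candidates densely. The sketch below assumes dot-product scoring and ignores budget; `proactive_ladr_sketch` is a hypothetical helper, not the library's code.

```python
import numpy as np


def proactive_ladr_sketch(query_vec, seed_ids, doc_vecs, neighbours, hops=1, num_results=1000):
    """Expand seed documents through a corpus graph, then score densely.

    neighbours[d] holds the ids of the k nearest neighbours of document d.
    """
    candidates = set(seed_ids)
    frontier = set(seed_ids)
    for _ in range(hops):
        # Add the graph neighbours of every document reached in the previous hop.
        reached = {int(n) for d in frontier for n in neighbours[d]}
        frontier = reached - candidates
        candidates |= frontier
    cand = np.array(sorted(candidates))
    scores = doc_vecs[cand] @ query_vec
    order = np.argsort(-scores, kind="stable")[:num_results]
    return cand[order], scores[order]


doc_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.3]])
neighbours = np.array([[1], [3], [0], [1]])  # k = 1 neighbour per document
query_vec = np.array([1.0, 0.5])

# Seed with document 2 only (e.g. the top lexical result).
ids, scores = proactive_ladr_sketch(query_vec, [2], doc_vecs, neighbours, hops=1, num_results=2)
print(ids.tolist())  # [0, 2]
```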

Citation

Kulkarni et al. Lexically-Accelerated Dense Retrieval. SIGIR 2023. [link]

@inproceedings{DBLP:conf/sigir/KulkarniMGF23,
  author       = {Hrishikesh Kulkarni and
                  Sean MacAvaney and
                  Nazli Goharian and
                  Ophir Frieder},
  editor       = {Hsin{-}Hsi Chen and
                  Wei{-}Jou (Edward) Duh and
                  Hen{-}Hsen Huang and
                  Makoto P. Kato and
                  Josiane Mothe and
                  Barbara Poblete},
  title        = {Lexically-Accelerated Dense Retrieval},
  booktitle    = {Proceedings of the 46th International {ACM} {SIGIR} Conference on
                  Research and Development in Information Retrieval, {SIGIR} 2023, Taipei,
                  Taiwan, July 23-27, 2023},
  pages        = {152--162},
  publisher    = {{ACM}},
  year         = {2023},
  url          = {https://doi.org/10.1145/3539618.3591715},
  doi          = {10.1145/3539618.3591715},
  timestamp    = {Fri, 21 Jul 2023 22:25:19 +0200},
  biburl       = {https://dblp.org/rec/conf/sigir/KulkarniMGF23.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

ladr_adaptive(k=16, *, depth=100, num_results=1000, dense_scorer=None, max_hops=None, drop_query_vec=False, budget=False)¶

Returns an adaptive LADR (Lexically-Accelerated Dense Retrieval) transformer.

Parameters:

k (int) – The number of neighbours in the corpus graph.
depth (int) – The depth of the ranked list to consider for convergence.
num_results (int) – The number of results to return per query.
dense_scorer (Transformer, optional) – The dense scorer to use. Defaults to np_scorer().
max_hops (int, optional) – The maximum number of hops to consider. Defaults to None (no limit).
drop_query_vec (bool) – Whether to drop the query vector from the output.
budget (bool or int) – The maximum number of vectors to score. If False, no maximum is applied. If True, the budget is set to num_results. If an integer, this value is used as the budget.

Returns:

An adaptive LADR transformer.

Return type:

Transformer
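
The adaptive variant can be sketched as a loop that keeps expanding the neighbours of the current top depth documents until nothing new is found or max_hops is reached. The convergence test used here (no unscored neighbours remain among the top documents) is our reading of the depth parameter, dot-product scoring is assumed, budget is ignored, and `adaptive_ladr_sketch` is a hypothetical helper.

```python
import numpy as np


def adaptive_ladr_sketch(query_vec, seed_ids, doc_vecs, neighbours,
                         depth=100, num_results=1000, max_hops=None):
    """Grow the candidate pool from the current top `depth` documents until no
    unscored neighbours remain (convergence) or max_hops is reached."""
    scored = {}  # docid -> dense score

    def score(ids):
        for d in ids:
            scored.setdefault(d, float(doc_vecs[d] @ query_vec))

    def top(n):
        # Best first; ties broken by docid for determinism.
        return sorted(scored, key=lambda d: (-scored[d], d))[:n]

    score(seed_ids)
    hops = 0
    while max_hops is None or hops < max_hops:
        frontier = {int(n) for d in top(depth) for n in neighbours[d]} - set(scored)
        if not frontier:
            break  # the top `depth` documents have no unscored neighbours left
        score(frontier)
        hops += 1
    ranked = top(num_results)
    return ranked, [scored[d] for d in ranked]


doc_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.3]])
neighbours = np.array([[1], [3], [0], [1]])  # k = 1 neighbour per document
query_vec = np.array([1.0, 0.5])

ranked, scores = adaptive_ladr_sketch(query_vec, [2], doc_vecs, neighbours, depth=1, num_results=2)
print(ranked)  # [0, 1]
```

Unlike the proactive variant, the number of hops is not fixed in advance: exploration continues only while the top of the ranking keeps surfacing unexplored neighbours.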

Citation

Kulkarni et al. Lexically-Accelerated Dense Retrieval. SIGIR 2023. [link]

mmr(*, Lambda=0.5, norm_rel=False, norm_sim=False, drop_doc_vec=True, verbose=False)¶

Returns an MMR (Maximal Marginal Relevance) scorer (i.e., re-ranker) over this index.

The method first loads vectors from the index and then applies MmrScorer to re-rank the results. See MmrScorer for more details on MMR.

Return type:

Transformer

Parameters:

Lambda (float) – The balance parameter between relevance and diversity (default: 0.5)
norm_rel (bool) – Whether to normalize relevance scores to [0, 1] (default: False)
norm_sim (bool) – Whether to normalize similarity scores to [0, 1] (default: False)
drop_doc_vec (bool) – Whether to drop the ‘doc_vec’ column after re-ranking (default: True)
verbose (bool) – Whether to display verbose output (e.g., progress bars) (default: False)
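
The effect of Lambda can be seen in a minimal greedy MMR sketch. This follows the textbook formulation (relevance minus the maximum similarity to already-selected documents) with dot-product similarity and no score normalisation; `mmr_rerank` is a hypothetical helper, and MmrScorer may differ in its details.

```python
import numpy as np


def mmr_rerank(rel_scores, doc_vecs, Lambda=0.5):
    """Greedy MMR: repeatedly pick the document maximising
    Lambda * relevance - (1 - Lambda) * (max similarity to already-picked docs)."""
    rel = np.asarray(rel_scores, dtype=float)
    sim = doc_vecs @ doc_vecs.T  # pairwise dot-product similarity
    selected = []
    remaining = list(range(len(rel)))
    while remaining:
        def marginal(d):
            redundancy = max((sim[d, s] for s in selected), default=0.0)
            return Lambda * rel[d] - (1 - Lambda) * redundancy
        best = max(remaining, key=marginal)
        selected.append(best)
        remaining.remove(best)
    return selected


doc_vecs = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
rel_scores = [0.9, 0.8, 0.7]
print(mmr_rerank(rel_scores, doc_vecs, Lambda=0.5))  # [0, 2, 1]
print(mmr_rerank(rel_scores, doc_vecs, Lambda=1.0))  # [0, 1, 2]
```

With Lambda=0.5 the near-duplicate document 1 is pushed below the dissimilar document 2; with Lambda=1.0 the ranking reduces to pure relevance order.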

Citation

Carbonell and Goldstein. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. SIGIR 1998. [link]

@inproceedings{DBLP:conf/sigir/CarbonellG98,
  author       = {Jaime G. Carbonell and
                  Jade Goldstein},
  editor       = {W. Bruce Croft and
                  Alistair Moffat and
                  C. J. van Rijsbergen and
                  Ross Wilkinson and
                  Justin Zobel},
  title        = {The Use of MMR, Diversity-Based Reranking for Reordering Documents
                  and Producing Summaries},
  booktitle    = {{SIGIR} '98: Proceedings of the 21st Annual International {ACM} {SIGIR}
                  Conference on Research and Development in Information Retrieval, August
                  24-28 1998, Melbourne, Australia},
  pages        = {335--336},
  publisher    = {{ACM}},
  year         = {1998},
  url          = {https://doi.org/10.1145/290941.291025},
  doi          = {10.1145/290941.291025},
  timestamp    = {Wed, 14 Nov 2018 10:58:11 +0100},
  biburl       = {https://dblp.org/rec/conf/sigir/CarbonellG98.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

See also

Diversity using Dense Vectors » Search Result Diversification

Evaluation¶

property ILS: Measure¶: Return an ILS (Intra-List Similarity) measure for this index. See pyterrier_dr.ILS() for more details.
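
As a rough intuition for what an intra-list similarity score captures, the sketch below averages the pairwise similarity of the documents in one result list. Using cosine similarity and averaging over distinct pairs are assumptions for this example; see pyterrier_dr.ILS() for the definition the library actually uses.

```python
import numpy as np


def intra_list_similarity(doc_vecs):
    """Mean cosine similarity over all distinct pairs of documents in a list."""
    vecs = np.asarray(doc_vecs, dtype=float)
    if len(vecs) < 2:
        return 0.0
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim = unit @ unit.T
    upper = np.triu_indices(len(vecs), k=1)  # each unordered pair once
    return float(sim[upper].mean())


# Two identical documents and one orthogonal document: one of three pairs matches.
ils = intra_list_similarity([[1, 0], [1, 0], [0, 1]])
print(round(ils, 3))  # 0.333
```

Lower values indicate a more diverse result list.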

See also

Diversity using Dense Vectors » Diversity Evaluation

Index Data Access¶

built()[source]¶

Check if the index has been built.

Returns:: True if the index has been built, otherwise False.
Return type:: bool

vec_loader()¶

Return a transformer that loads indexed vectors.

The returned transformer expects a DataFrame with a docno column. It outputs a frame that includes a doc_vec column, which contains the indexed vectors. For example:

Load vectors from a FlexIndex¶

import pandas as pd
from pyterrier_dr import FlexIndex

index = FlexIndex.from_hf('macavaney/msmarco-passage.tasb.flex')
loader = index.vec_loader()
loader(pd.DataFrame([
    {"docno": "5"},
    {"docno": "100"},
    {"docno": "74356"},
]))
# docno                                            doc_vec
#     5  [-0.09343405, 0.12045559, -0.25184962, 0.15029...
#   100  [-0.11527929, 0.63400555, -0.0877756, -0.26490...
# 74356  [0.15367049, 0.16049547, -0.012261144, -0.2588...

Returns:: A transformer that loads indexed vectors.
Return type:: Transformer

get_corpus_iter(start_idx=None, stop_idx=None, verbose=True)[source]¶

Iterate over the documents in the index.

Return type:

Iterable[Dict]

Parameters:

start_idx – The index of the first document to return (or None to start at the first document).
stop_idx – The index of the last document to return (or None to end on the last document).
verbose – Whether to display a progress bar.

Yields:

Dict[str,Any] – A dictionary with keys docno and doc_vec.

np_vecs()¶

Return the indexed vectors.

Returns:: The indexed vectors as a memory-mapped numpy array.
Return type:: numpy.ndarray

torch_vecs(*, device=None, fp16=False)¶

Return the indexed vectors as a pytorch tensor.

Caution

This method loads the entire index into memory on the provided device. If the index is too large to fit in memory, consider using a different method that does not fully load the index into memory, like np_vecs() or get_corpus_iter().

Parameters:

device (str | None) – The device to use for the tensor. If not provided, the default device is used (cuda if available, otherwise cpu).
fp16 (bool) – Whether to use half precision (fp16) for the tensor.

Returns:

The indexed vectors as a torch tensor.

Return type:

torch.Tensor

docnos()[source]¶

Return the document identifier (docno) lookup data structure.

Returns:: The document number lookup.
Return type:: npids.Lookup

corpus_graph(k=16, *, batch_size=8192)¶

Return the corpus graph (neighborhood graph) for the index.

The corpus graph is a directed graph where each node represents a document and each edge represents a connection between two documents. The graph is built by computing the cosine similarity between each pair of documents and storing the k-nearest neighbors for each document.

If the corpus graph has not been built yet, it will be built using the given k and batch size.

Parameters:

k (int) – The number of neighbors to store for each document.
batch_size (int) – The number of vectors to process in each batch.

Returns:

The corpus graph for the index.

Return type:

pyterrier_adaptive.CorpusGraph
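
The construction described above (cosine similarity between documents, keeping the k nearest neighbours of each) can be sketched in numpy with batching. `build_knn_graph` is a hypothetical helper that returns a plain array rather than a CorpusGraph, and excluding each document from its own neighbour list is an assumption.

```python
import numpy as np


def build_knn_graph(doc_vecs, k=16, batch_size=8192):
    """Return an (n_docs, k) array of each document's k most similar documents.

    Requires k < n_docs.
    """
    vecs = np.asarray(doc_vecs, dtype=np.float32)
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # cosine via dot
    n_docs = len(unit)
    graph = np.empty((n_docs, k), dtype=np.int64)
    for start in range(0, n_docs, batch_size):
        stop = min(start + batch_size, n_docs)
        sims = unit[start:stop] @ unit.T
        # Exclude each document from its own neighbour list.
        sims[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        graph[start:stop] = np.argsort(-sims, axis=1, kind="stable")[:, :k]
    return graph


docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(build_knn_graph(docs, k=1, batch_size=2).tolist())  # [[1], [0], [3], [2]]
```

Batching bounds the size of the similarity matrix held in memory at any one time to batch_size rows.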

faiss_hnsw_graph(neighbours=32, *, ef_construction=40)¶

Returns the (approximate) HNSW graph structure created by the HNSW index.

If the graph structure does not already exist, it is created and cached to disk.

Parameters:

neighbours (int) – The number of neighbours of the constructed neighborhood graph
ef_construction (int) – The number of neighbours to consider during construction

Returns:

The HNSW graph structure

Return type:

pyterrier_adaptive.CorpusGraph

Note

This function requires the faiss package to be installed.

Citation

Douze et al. The Faiss library. arXiv 2024. [link]

Pseudo-Relevance Feedback

class pyterrier_dr.AveragePrf(*, k=3)

Performs Average PRF (as described by Li et al.) by averaging the query_vec column with the doc_vec column of the top k documents.

Parameters:

k (int) – The number of pseudo-relevance feedback documents

Expected Input Columns: ['qid', 'query_vec', 'docno', 'doc_vec']

Output Columns: ['qid', 'query_vec'] (Any other query columns from the input are also included in the output.)

Example:

prf_pipe = model >> index >> index.vec_loader() >> pyterrier_dr.AveragePrf() >> index
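The averaging step can be sketched in NumPy as follows. This is an illustration under assumptions, not the class's actual code: the helper name average_prf is hypothetical, doc_vecs is assumed to be ordered by rank, and the query vector is assumed to count as one item in the average alongside the top k document vectors.

```python
import numpy as np

def average_prf(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Average the query vector with the vectors of the top-k ranked documents."""
    # doc_vecs is assumed to be sorted by rank, best first.
    stacked = np.vstack([query_vec[None, :], doc_vecs[:k]])
    return stacked.mean(axis=0)

query_vec = np.array([1.0, 0.0])
doc_vecs = np.array([[0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [5.0, 5.0]])
# Only the first three document vectors contribute; the fourth is ignored.
new_query_vec = average_prf(query_vec, doc_vecs, k=3)
```

The resulting vector replaces query_vec, which is why the pipeline above ends with a second retrieval against the index.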

Citation

Li et al. Pseudo Relevance Feedback with Deep Language Models and Dense Retrievers: Successes and Pitfalls. ACM Trans. Inf. Syst. 2023.

@article{DBLP:journals/tois/0009MZKZ23,
  author       = {Hang Li and
                  Ahmed Mourad and
                  Shengyao Zhuang and
                  Bevan Koopman and
                  Guido Zuccon},
  title        = {Pseudo Relevance Feedback with Deep Language Models and Dense Retrievers:
                  Successes and Pitfalls},
  journal      = {{ACM} Trans. Inf. Syst.},
  volume       = {41},
  number       = {3},
  pages        = {62:1--62:40},
  year         = {2023},
  url          = {https://doi.org/10.1145/3570724},
  doi          = {10.1145/3570724},
  timestamp    = {Fri, 21 Jul 2023 22:26:51 +0200},
  biburl       = {https://dblp.org/rec/journals/tois/0009MZKZ23.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

transform(inp)

Performs Average PRF on the input dataframe.

Return type: DataFrame
Parameters: inp (DataFrame)

See also

Dense Pseudo-Relevance Feedback

class pyterrier_dr.VectorPrf(*, alpha=1, beta=0.2, k=3)

Performs a Rocchio-esque PRF by linearly combining the query_vec column with the doc_vec column of the top k documents.

Parameters:

alpha (float) – The weight of the original query_vec
beta (float) – The weight of the feedback doc_vec
k (int) – The number of pseudo-relevance feedback documents

Expected Input Columns: ['qid', 'query_vec', 'docno', 'doc_vec']

Output Columns: ['qid', 'query_vec'] (Any other query columns from the input are also included in the output.)

Example:

prf_pipe = model >> index >> index.vec_loader() >> pyterrier_dr.VectorPrf() >> index
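A Rocchio-style linear combination can be sketched in NumPy as follows. This is a sketch under assumptions rather than the class's actual code: the helper name vector_prf is hypothetical, doc_vecs is assumed to be ordered by rank, and beta is assumed to weight the mean of the top k document vectors.

```python
import numpy as np

def vector_prf(query_vec: np.ndarray, doc_vecs: np.ndarray,
               alpha: float = 1.0, beta: float = 0.2, k: int = 3) -> np.ndarray:
    """Combine the query vector with the centroid of the top-k document vectors."""
    # Assumption: the feedback signal is the mean of the top-k document vectors.
    feedback = doc_vecs[:k].mean(axis=0)
    return alpha * query_vec + beta * feedback

query_vec = np.array([1.0, 0.0])
doc_vecs = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [9.0, 9.0]])
# The centroid of the first three documents is [1.0, 1.0].
new_query_vec = vector_prf(query_vec, doc_vecs, alpha=1.0, beta=0.2, k=3)
```

With alpha=1 and a small beta, the new query stays close to the original and is nudged toward the feedback documents.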

Citation

Li et al. Pseudo Relevance Feedback with Deep Language Models and Dense Retrievers: Successes and Pitfalls. ACM Trans. Inf. Syst. 2023.

transform(inp)

Performs Vector PRF on the input dataframe.

Return type: DataFrame
Parameters: inp (DataFrame)

See also

Dense Pseudo-Relevance Feedback

Diversity

class pyterrier_dr.MmrScorer(*, Lambda=0.5, norm_rel=False, norm_sim=False, drop_doc_vec=True, verbose=False)

Parameters:

Lambda (float) – The balance parameter between relevance and diversity (default: 0.5)
norm_rel (bool) – Whether to normalize relevance scores to [0, 1] (default: False)
norm_sim (bool) – Whether to normalize similarity scores to [0, 1] (default: False)
drop_doc_vec (bool) – Whether to drop the ‘doc_vec’ column after re-ranking (default: True)
verbose (bool) – Whether to display verbose output (e.g., progress bars) (default: False)
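The role of Lambda can be illustrated with a generic Maximal Marginal Relevance (MMR) re-ranking sketch in NumPy. This shows the standard greedy formulation, not necessarily what MmrScorer does internally: the function name mmr_rerank is hypothetical, and using cosine similarity between document vectors is an assumption.

```python
import numpy as np

def mmr_rerank(rel: np.ndarray, doc_vecs: np.ndarray, lam: float = 0.5) -> list:
    """Greedily order documents, trading relevance against redundancy."""
    normed = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = normed @ normed.T  # pairwise cosine similarity
    remaining = list(range(len(rel)))
    selected = []
    while remaining:
        if selected:
            # Redundancy: highest similarity to any already-selected document.
            redundancy = sim[remaining][:, selected].max(axis=1)
        else:
            redundancy = np.zeros(len(remaining))
        mmr = lam * rel[remaining] - (1 - lam) * redundancy
        best = remaining[int(np.argmax(mmr))]
        selected.append(best)
        remaining.remove(best)
    return selected

rel = np.array([1.0, 0.9, 0.5])
# Documents 0 and 1 are identical in vector space; document 2 is orthogonal.
doc_vecs = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
order = mmr_rerank(rel, doc_vecs, lam=0.5)
```

In this toy input the less relevant but dissimilar document 2 is promoted above the near-duplicate document 1. Setting lam closer to 1 favours relevance, while values closer to 0 favour diversity.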

See also

Diversity using Dense Vectors » Search Result Diversification

pyterrier_dr.ILS(index, *, name=None, verbose=False)

Create an ILS (Intra-List Similarity) measure calculated using the vectors in the provided index.

Higher scores indicate lower diversity in the results.

This measure supports the @k convention for applying a top-k cutoff before scoring.

Parameters:

index (FlexIndex) – The index to use for loading document vectors.
name (str, optional) – The name of the measure (default: “ILS”).
verbose (bool, optional) – Whether to display a progress bar.

Returns:

An ILS measure object.

Return type:

ir_measures.Measure

Citation

Ziegler et al. Improving recommendation lists through topic diversification. WWW 2005.

@inproceedings{DBLP:conf/www/ZieglerMKL05,
  author       = {Cai{-}Nicolas Ziegler and
                  Sean M. McNee and
                  Joseph A. Konstan and
                  Georg Lausen},
  editor       = {Allan Ellis and
                  Tatsuya Hagino},
  title        = {Improving recommendation lists through topic diversification},
  booktitle    = {Proceedings of the 14th international conference on World Wide Web,
                  {WWW} 2005, Chiba, Japan, May 10-14, 2005},
  pages        = {22--32},
  publisher    = {{ACM}},
  year         = {2005},
  url          = {https://doi.org/10.1145/1060745.1060754},
  doi          = {10.1145/1060745.1060754},
  timestamp    = {Fri, 25 Dec 2020 01:14:58 +0100},
  biburl       = {https://dblp.org/rec/conf/www/ZieglerMKL05.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

See also

Diversity using Dense Vectors » Diversity Evaluation

pyterrier_dr.ils(results, index=None, *, verbose=False)

Calculate the ILS (Intra-List Similarity) of a set of results.

Higher scores indicate lower diversity in the results.

Parameters:

results (DataFrame) – The result frame to calculate ILS for.
index (FlexIndex | None) – The index to use for loading document vectors. Required if results does not have a doc_vec column.
verbose (bool) – Whether to display a progress bar.

Returns:

An iterable of (qid, ILS) pairs.

Return type:

Iterable[Tuple[str,float]]
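To make the measure concrete, the following NumPy sketch computes an intra-list similarity for one result list. It is an illustration under assumptions: the function name intra_list_similarity is hypothetical, and the score is assumed to be the mean cosine similarity over all distinct document pairs, which may differ from the library's exact normalisation.

```python
import numpy as np

def intra_list_similarity(doc_vecs: np.ndarray) -> float:
    """Mean cosine similarity over all distinct pairs of documents in a list."""
    normed = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = normed @ normed.T
    # Indices of the strict upper triangle: each unordered pair exactly once.
    i, j = np.triu_indices(len(doc_vecs), k=1)
    return float(sim[i, j].mean())

similar = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
diverse = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
ils_similar = intra_list_similarity(similar)
ils_diverse = intra_list_similarity(diverse)
```

The list of identical vectors receives the higher score, matching the statement above that higher scores indicate lower diversity.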

Citation

Ziegler et al. Improving recommendation lists through topic diversification. WWW 2005.

See also

Diversity using Dense Vectors » Diversity Evaluation

Joint Product Quantization

Deprecated

Warning

The following classes are deprecated and will be removed in future releases.

class pyterrier_dr.DocnoFile(path)

Represents a document ID lookup file.

Deprecated since version 0.6.0: This class has been replaced with npids.Lookup.

class pyterrier_dr.NilIndex(*args, **kwargs)

This class is an indexer that does nothing. It is meant to be used for testing.

Deprecated since version 0.6.0.

class pyterrier_dr.NumpyIndex(*args, **kwargs)

This class implements a disk-based dense vector index using numpy memory maps.

Deprecated since version 0.6.0: This class has been replaced with pyterrier_dr.FlexIndex.

class pyterrier_dr.MemIndex(*args, **kwargs)

This class implements an in-memory dense vector index using numpy arrays.

Deprecated since version 0.6.0: This class has been replaced with pyterrier_dr.FlexIndex.

class pyterrier_dr.FaissFlat(*args, **kwargs)

This class implements a disk-based dense vector index using Faiss Flat indexes.

Deprecated since version 0.6.0: This class has been replaced with pyterrier_dr.FlexIndex.

class pyterrier_dr.FaissHnsw(*args, **kwargs)

This class implements a disk-based dense vector index using Faiss HNSW for approximate nearest neighbor retrieval.

Deprecated since version 0.6.0: This class has been replaced with pyterrier_dr.FlexIndex.

class pyterrier_dr.TorchIndex(*args, **kwargs)

This class implements a disk-based dense vector index using PyTorch for GPU-accelerated retrieval.

Deprecated since version 0.6.0: This class has been replaced with pyterrier_dr.FlexIndex.
