Dense Retrieval API Reference¶
Encoding¶
- class pyterrier_dr.BiEncoder(*args, **kwargs)[source]¶
Represents a single-vector dense bi-encoder.
A BiEncoder encodes the text of a query or document into a dense vector.
This class functions as a transformer factory:
- Query encoding using query_encoder()
- Document encoding using doc_encoder()
- Text scoring (re-ranking) using text_scorer()
It can also be used as a transformer directly. It infers which transformer to use based on the columns present in the input frame.
Note that in most cases, you will want to use a BiEncoder as part of a pipeline with a FlexIndex to perform dense indexing and retrieval.
- Parameters:
batch_size – The default batch size to use for query/document encoding
text_field – The field in the input dataframe that contains the document text
verbose – Whether to show progress bars
- text_scorer(verbose=None, batch_size=None, sim_fn=None)[source]¶
Text Scoring (re-ranking)
- Return type:
- property sim_fn: SimFn¶
The similarity function to use between embeddings for this model
- abstract encode_queries(texts, batch_size=None)[source]¶
Abstract method to encode a list of query texts into dense vectors.
This function is used by the transformer returned by query_encoder().
- Parameters:
texts (List[str]) – A list of query texts
batch_size (int | None) – The batch size to use for encoding
- Returns:
A numpy array of shape (n_queries, n_dims)
- Return type:
np.array
- abstract encode_docs(texts, batch_size=None)[source]¶
Abstract method to encode a list of document texts into dense vectors.
This function is used by the transformer returned by doc_encoder().
- Parameters:
texts (List[str]) – A list of document texts
batch_size (int | None) – The batch size to use for encoding
- Returns:
A numpy array of shape (n_docs, n_dims)
- Return type:
np.array
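The encoding contract above (a list of texts in, one fixed-size vector per text out) can be illustrated with a toy stand-in. This is a minimal sketch, not a BiEncoder subclass: ToyHashEncoder and its hashing scheme are hypothetical, and it returns nested lists rather than a numpy array so that it runs without dependencies.

```python
import math
import zlib
from typing import List


class ToyHashEncoder:
    """Hypothetical stand-in mirroring the encode_queries/encode_docs contract."""

    def __init__(self, n_dims: int = 16):
        self.n_dims = n_dims

    def _encode(self, texts: List[str]) -> List[List[float]]:
        vecs = []
        for text in texts:
            vec = [0.0] * self.n_dims
            for token in text.lower().split():
                # crc32 is stable across runs, unlike the built-in hash().
                vec[zlib.crc32(token.encode("utf-8")) % self.n_dims] += 1.0
            norm = math.sqrt(sum(v * v for v in vec)) or 1.0
            vecs.append([v / norm for v in vec])
        return vecs  # shape: (n_texts, n_dims)

    def encode_queries(self, texts: List[str], batch_size=None):
        return self._encode(texts)  # batch_size is ignored in this toy

    def encode_docs(self, texts: List[str], batch_size=None):
        return self._encode(texts)


encoder = ToyHashEncoder(n_dims=16)
query_vecs = encoder.encode_queries(["dense retrieval"])
doc_vecs = encoder.encode_docs(["dense retrieval with vectors", "cooking pasta at home"])
# Dot product between the query vector and each document vector.
scores = [sum(q * d for q, d in zip(query_vecs[0], doc)) for doc in doc_vecs]
```

A real encoder would replace the hashing step with a neural model; the point here is only the input and output shapes.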
Indexing and Retrieval¶
- class pyterrier_dr.FlexIndex(path, *, sim_fn=SimFn.dot, verbose=True)[source]¶
Represents a FLexible EXecution (FLEX) Index, which is a dense index format.
FLEX allows for a variety of retrieval implementations (NumPy, FAISS, etc.) and algorithms (exhaustive, HNSW, etc.) to be tested. In most cases, the same vector storage can be used across implementations and algorithms, saving considerably on disk space.
- Parameters:
path (str) – The path to the index directory
sim_fn (SimFn | str) – The similarity function to use
verbose (bool) – Whether to display verbose output (e.g., progress bars)
Indexing¶
- index(inp)[source]¶
Index the given input data stream to a new index at this location.
Each record in inp is expected to be a dictionary containing at least two keys: docno (a unique document identifier) and doc_vec (a dense vector representation of the document).
Typically this method will be used in a pipeline of operations, where the input data is first transformed by a document encoder to add the doc_vec values before it is indexed. For example:

    from pyterrier_dr import TasB, FlexIndex
    encoder = TasB.dot()
    index = FlexIndex('my_index')
    pipeline = encoder >> index
    pipeline.index([
        {'docno': 'doc1', 'text': 'hello'},
        {'docno': 'doc2', 'text': 'world'},
    ])
- Parameters:
inp (Iterable[Dict]) – An iterable of dictionaries to index.
- Returns:
A reference back to this index (self).
- Return type:
pyterrier_alpha.Artifact
- Raises:
RuntimeError – If the index is already built.
- indexer(*, mode=IndexingMode.create)[source]¶
Return an indexer for this index with the specified options.
This transformer gives more fine-grained control over the indexing process, allowing you to specify whether to create a new index or overwrite an existing one.
Similar to index(), this method will typically be used in a pipeline of operations, where the input data is first transformed by a document encoder to add the doc_vec values before it is indexed. For example:

    from pyterrier_dr import TasB, FlexIndex
    encoder = TasB.dot()
    index = FlexIndex('my_index')
    pipeline = encoder >> index.indexer(mode='overwrite')
    pipeline.index([
        {'docno': 'doc1', 'text': 'hello'},
        {'docno': 'doc2', 'text': 'world'},
    ])
- Parameters:
mode (IndexingMode | str) – The indexing mode to use (create or overwrite).
- Returns:
A new indexer instance.
- Return type:
FlexIndexer
Retrieval¶
- retriever(*, num_results=1000)¶
Returns a transformer that performs basic exact retrieval over indexed vectors using a brute force search. An alias to
np_retriever().
- np_retriever(*, num_results=1000, batch_size=None, drop_query_vec=False)¶
Return a retriever that uses numpy to perform a brute force search over the index.
The returned transformer expects a DataFrame with columns qid and query_vec. It outputs a result frame containing the retrieved results.
- Parameters:
num_results (int) – The number of results to return per query.
batch_size (int | None) – The number of documents to score in each batch.
drop_query_vec (bool) – Whether to drop the query vector from the output.
- Returns:
A retriever that uses numpy to perform a brute force search.
- Return type:
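Brute-force search scores the query against every indexed vector and keeps the best num_results. The sketch below shows the idea in plain Python with a dot-product similarity; brute_force_search and toy_index are hypothetical names, and the library's own implementation (numpy, batched over documents) is not reproduced here.

```python
import heapq
from typing import Dict, List, Tuple


def brute_force_search(
    query_vec: List[float],
    doc_vecs: Dict[str, List[float]],
    num_results: int = 1000,
) -> List[Tuple[str, float]]:
    """Score every document against the query and keep the top results."""
    scored = (
        (sum(q * d for q, d in zip(query_vec, vec)), docno)
        for docno, vec in doc_vecs.items()
    )
    # nlargest avoids sorting the full score list when num_results is small.
    top = heapq.nlargest(num_results, scored)
    return [(docno, score) for score, docno in top]


toy_index = {"d1": [1.0, 0.0], "d2": [0.6, 0.8], "d3": [0.0, 1.0]}
results = brute_force_search([1.0, 0.0], toy_index, num_results=2)
print(results)  # [('d1', 1.0), ('d2', 0.6)]
```

Because every vector is scored, the result is exact; the cost grows linearly with the number of documents.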
- torch_retriever(*, num_results=1000, device=None, fp16=False, qbatch=64, drop_query_vec=False)¶
Return a retriever that uses pytorch to perform brute-force retrieval over the indexed vectors.
The returned pyterrier.Transformer expects a DataFrame with columns qid and query_vec.
Caution
This method loads the entire index into memory on the provided device. If the index is too large to fit in memory, consider using a different retriever that does not fully load the index into memory, like np_retriever().
- Parameters:
num_results (int) – The number of results to return per query.
device (str | None) – The device to use for scoring. If not provided, the default device is used (cuda if available, otherwise cpu).
fp16 (bool) – Whether to use half precision (fp16) for scoring.
qbatch (int) – The number of queries to score in each batch.
drop_query_vec (bool) – Whether to drop the query vector from the output.
- Returns:
A transformer that retrieves using pytorch.
- Return type:
- faiss_flat_retriever(*, gpu=False, qbatch=64, drop_query_vec=False)¶
Returns a retriever that uses FAISS to perform brute-force search over the indexed vectors.
- Parameters:
gpu – Whether to load the index onto GPU for scoring
qbatch – The batch size during search
drop_query_vec – Whether to drop the query vector from the output
- Returns:
A retriever that uses FAISS to perform brute-force search over the indexed vectors
- Return type:
Note
This transformer requires the faiss package to be installed.
Citation
Douze et al. The Faiss library. arXiv 2024. [link]
@article{DBLP:journals/corr/abs-2401-08281, author = {Matthijs Douze and Alexandr Guzhva and Chengqi Deng and Jeff Johnson and Gergely Szilvasy and Pierre{-}Emmanuel Mazar{\'{e}} and Maria Lomeli and Lucas Hosseini and Herv{\'{e}} J{\'{e}}gou}, title = {The Faiss library}, journal = {CoRR}, volume = {abs/2401.08281}, year = {2024}, url = {https://doi.org/10.48550/arXiv.2401.08281}, doi = {10.48550/ARXIV.2401.08281}, eprinttype = {arXiv}, eprint = {2401.08281}, timestamp = {Thu, 01 Feb 2024 15:35:36 +0100}, biburl = {https://dblp.org/rec/journals/corr/abs-2401-08281.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- faiss_hnsw_retriever(neighbours=32, *, num_results=1000, ef_construction=40, ef_search=16, cache=True, search_bounded_queue=True, qbatch=64, drop_query_vec=False)¶
Returns a retriever that uses FAISS over a HNSW index.
Creates the HNSW graph structure if it does not already exist. When cache=True (default), this graph structure is cached to disk for subsequent use.
- Parameters:
neighbours (int) – The number of neighbours of the constructed neighborhood graph
num_results (int) – The number of results to return per query
ef_construction (int) – The number of neighbours to consider during construction
ef_search (int) – The number of neighbours to consider during search
cache (bool) – Whether to cache the index to disk
search_bounded_queue (bool) – Whether to use a bounded queue during search
qbatch (int) – The batch size during search
drop_query_vec (bool) – Whether to drop the query vector from the output
- Returns:
A retriever that uses FAISS over a HNSW index
- Return type:
Note
This transformer requires the faiss package to be installed.
Citation
Douze et al. The Faiss library. arXiv 2024. [link]
@article{DBLP:journals/corr/abs-2401-08281, author = {Matthijs Douze and Alexandr Guzhva and Chengqi Deng and Jeff Johnson and Gergely Szilvasy and Pierre{-}Emmanuel Mazar{\'{e}} and Maria Lomeli and Lucas Hosseini and Herv{\'{e}} J{\'{e}}gou}, title = {The Faiss library}, journal = {CoRR}, volume = {abs/2401.08281}, year = {2024}, url = {https://doi.org/10.48550/arXiv.2401.08281}, doi = {10.48550/ARXIV.2401.08281}, eprinttype = {arXiv}, eprint = {2401.08281}, timestamp = {Thu, 01 Feb 2024 15:35:36 +0100}, biburl = {https://dblp.org/rec/journals/corr/abs-2401-08281.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- faiss_ivf_retriever(*, num_results=1000, train_sample=None, n_list=None, cache=True, n_probe=1, drop_query_vec=False)¶
Returns a retriever that uses FAISS over an IVF index.
If the IVF structure does not already exist, it is created and cached to disk (when cache=True, the default).
- Parameters:
num_results (int) – The number of results to return per query
train_sample (int | None) – The number of training samples to use for training the index. If not provided, a default value is used (approximately the square root of the number of documents).
n_list (int | None) – The number of posting lists to use for the index. If not provided, a default value is used (approximately train_sample/39).
cache (bool) – Whether to cache the index to disk.
n_probe (int) – The number of posting lists to probe during search. The higher the value, the better the approximation will be, but the longer it will take.
drop_query_vec (bool) – Whether to drop the query vector from the output.
- Returns:
A retriever that uses FAISS over an IVF index
- Return type:
Note
This transformer requires the faiss package to be installed.
Citation
Douze et al. The Faiss library. arXiv 2024. [link]
@article{DBLP:journals/corr/abs-2401-08281, author = {Matthijs Douze and Alexandr Guzhva and Chengqi Deng and Jeff Johnson and Gergely Szilvasy and Pierre{-}Emmanuel Mazar{\'{e}} and Maria Lomeli and Lucas Hosseini and Herv{\'{e}} J{\'{e}}gou}, title = {The Faiss library}, journal = {CoRR}, volume = {abs/2401.08281}, year = {2024}, url = {https://doi.org/10.48550/arXiv.2401.08281}, doi = {10.48550/ARXIV.2401.08281}, eprinttype = {arXiv}, eprint = {2401.08281}, timestamp = {Thu, 01 Feb 2024 15:35:36 +0100}, biburl = {https://dblp.org/rec/journals/corr/abs-2401-08281.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
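The n_probe trade-off described above can be seen in a small sketch of inverted-file (IVF) search: documents are assigned to their nearest centroid, and a query only scores the documents in the n_probe closest lists. The centroids here are fixed by hand rather than trained, and build_ivf and ivf_search are hypothetical helpers, not FAISS calls.

```python
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def build_ivf(doc_vecs: Dict[str, List[float]], centroids: List[List[float]]) -> List[List[str]]:
    """Assign each document to the posting list of its most similar centroid."""
    lists: List[List[str]] = [[] for _ in centroids]
    for docno, vec in doc_vecs.items():
        nearest = max(range(len(centroids)), key=lambda i: dot(vec, centroids[i]))
        lists[nearest].append(docno)
    return lists


def ivf_search(query_vec, doc_vecs, centroids, lists, n_probe=1, num_results=10) -> List[Tuple[str, float]]:
    """Score only the documents in the n_probe lists closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: dot(query_vec, centroids[i]), reverse=True)
    candidates = [docno for i in order[:n_probe] for docno in lists[i]]
    scored = sorted(((dot(query_vec, doc_vecs[d]), d) for d in candidates), reverse=True)
    return [(d, s) for s, d in scored[:num_results]]


centroids = [[1.0, 0.0], [0.0, 1.0]]  # hand-picked; a real index trains these
docs = {"d1": [1.0, 0.0], "d2": [0.6, 0.5], "d3": [0.0, 1.0], "d4": [0.5, 0.7]}
lists = build_ivf(docs, centroids)
query = [0.8, 0.6]
approx = ivf_search(query, docs, centroids, lists, n_probe=1, num_results=1)
exhaustive = ivf_search(query, docs, centroids, lists, n_probe=2, num_results=1)
```

With n_probe=1 the search only sees the first list and returns d1, while probing both lists finds d4, the true best match. Raising n_probe recovers accuracy at the cost of scoring more vectors.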
Returns a retriever that searches over a flatnav index.
- Return type:
- Parameters:
k (int) – the maximum number of edges per document in the index
ef_search (int) – the size of the list during searches. Higher values are slower but more accurate.
num_initializations (int) – the number of random initializations to use during search.
ef_construction (int) – the size of the list during graph construction. Higher values are slower but more accurate.
threads (int) – the number of threads to use
num_results (int) – the number of results to return per query
cache (bool) – whether to cache the index to disk
qbatch (int) – the number of queries to search at once
drop_query_vec (bool) – whether to drop the query_vec column after retrieval
verbose (bool) – whether to show progress bars
Added in version 0.4.0.
Changed in version 0.4.1: fixed bug with num_initializations
Note
This transformer requires the flatnav package to be installed. Instructions are available in the flatnav repository.
Citation
Munyampirwa et al. Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs". arXiv 2024. [link]
- scann_retriever(*, n_leaves=None, leaves_to_search=1, num_results=1000, train_sample=None, drop_query_vec=False)¶
Returns a retriever over a ScaNN (Scalable Nearest Neighbors) index.
- Parameters:
n_leaves (int, optional) – Number of leaves in the ScaNN index. Defaults to approximately sqrt(doc_count).
leaves_to_search (int, optional) – Number of leaves to search. Defaults to 1. The higher the value, the more accurate the search.
num_results (int, optional) – Number of results to return. Defaults to 1000.
train_sample (int, optional) – Number of training samples. Defaults to n_leaves*39.
drop_query_vec (bool, optional) – Whether to drop the query vector from the output.
- Returns:
A transformer that retrieves using ScaNN.
- Return type:
Note
This method requires the scann package. Install it via pip install scann.
Citation
Guo et al. Accelerating Large-Scale Inference with Anisotropic Vector Quantization. ICML 2020. [link]
@inproceedings{DBLP:conf/icml/GuoSLGSCK20, author = {Ruiqi Guo and Philip Sun and Erik Lindgren and Quan Geng and David Simcha and Felix Chern and Sanjiv Kumar}, title = {Accelerating Large-Scale Inference with Anisotropic Vector Quantization}, booktitle = {Proceedings of the 37th International Conference on Machine Learning, {ICML} 2020, 13-18 July 2020, Virtual Event}, series = {Proceedings of Machine Learning Research}, volume = {119}, pages = {3887--3896}, publisher = {{PMLR}}, year = {2020}, url = {http://proceedings.mlr.press/v119/guo20h.html}, timestamp = {Tue, 15 Dec 2020 17:40:18 +0100}, biburl = {https://dblp.org/rec/conf/icml/GuoSLGSCK20.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- voyager_retriever(neighbours=12, *, num_results=1000, ef_construction=200, random_seed=1, storage_data_type='float32', query_ef=10, drop_query_vec=False)¶
Returns a retriever that uses HNSW to search over a Voyager index.
- Return type:
- Parameters:
neighbours (int, optional) – Number of neighbours to search. Defaults to 12.
num_results (int, optional) – Number of results to return per query. Defaults to 1000.
ef_construction (int, optional) – Expansion factor for graph construction. Defaults to 200.
random_seed (int, optional) – Random seed. Defaults to 1.
storage_data_type (str, optional) – Storage data type. One of ‘float32’, ‘float8’, ‘e4m3’. Defaults to ‘float32’.
query_ef (int, optional) – Expansion factor during querying. Defaults to 10.
drop_query_vec (bool, optional) – Drop the query vector from the output. Defaults to False.
- Returns:
A retriever that uses HNSW to search over a Voyager index.
- Return type:
Note
This method requires the voyager package. Install it via pip install voyager.
Re-Ranking¶
- scorer()¶
An alias to
np_scorer().
- np_scorer(*, num_results=None)¶
Return a scorer that uses numpy to score (re-rank) results using indexed vectors.
The returned transformer expects a DataFrame with columns qid, query_vec and docno. (If an internal docid column is provided, this will be used to speed up vector lookups.)
This method uses memory-mapping to avoid loading the entire index into memory at once.
- Parameters:
num_results (int | None) – The number of results to return per query. If not provided, all results from the original frame are returned.
- Returns:
A transformer that scores query vectors with numpy.
- Return type:
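Unlike a retriever, a scorer only looks up and scores the documents already present in the input frame. A minimal sketch of that behaviour, assuming dot-product similarity; rerank, toy_vectors and first_stage are hypothetical names used for illustration only.

```python
from typing import Dict, List, Optional, Tuple


def rerank(
    candidate_docnos: List[str],
    query_vec: List[float],
    doc_vecs: Dict[str, List[float]],
    num_results: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Score only the given candidates and sort them by descending score."""
    scored = [
        (docno, sum(q * d for q, d in zip(query_vec, doc_vecs[docno])))
        for docno in candidate_docnos
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # With num_results=None, every candidate is kept (only the order changes).
    return scored if num_results is None else scored[:num_results]


toy_vectors = {"d1": [1.0, 0.0], "d2": [0.6, 0.8], "d3": [0.0, 1.0]}
first_stage = ["d3", "d1"]  # e.g. the output of an earlier retrieval stage
reranked = rerank(first_stage, [1.0, 0.0], toy_vectors)
print(reranked)  # [('d1', 1.0), ('d3', 0.0)]
```

Note that d2 is never scored because it was not among the candidates, which is what keeps re-ranking cheap relative to full retrieval.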
- torch_scorer(*, num_results=None, device=None, fp16=False)¶
Return a scorer that uses pytorch to score (re-rank) results using indexed vectors.
The returned pyterrier.Transformer expects a DataFrame with columns qid, query_vec and docno. (If an internal docid column is provided, this will be used to speed up vector lookups.)
Caution
This method loads the entire index into memory on the provided device. If the index is too large to fit in memory, consider using a different scorer that does not fully load the index into memory, like np_scorer().
- Parameters:
num_results (int | None) – The number of results to return per query. If not provided, all results from the original frame are returned.
device (str | None) – The device to use for scoring. If not provided, the default device is used (cuda if available, otherwise cpu).
fp16 (bool) – Whether to use half precision (fp16) for scoring.
- Returns:
A transformer that scores query vectors with pytorch.
- Return type:
- gar(k=16, *, batch_size=128, num_results=1000)¶
Returns a retriever that uses a corpus graph to search over a FlexIndex.
- Return type:
- Parameters:
k (int) – Number of neighbours in the corpus graph. Defaults to 16.
batch_size (int) – Batch size for retrieval. Defaults to 128.
num_results (int) – Number of results per query to return. Defaults to 1000.
- Returns:
A retriever that uses a corpus graph to search over a FlexIndex.
- Return type:
Citation
MacAvaney et al. Adaptive Re-Ranking with a Corpus Graph. CIKM 2022. [link]
@inproceedings{DBLP:conf/cikm/MacAvaneyTM22, author = {Sean MacAvaney and Nicola Tonellotto and Craig Macdonald}, editor = {Mohammad Al Hasan and Li Xiong}, title = {Adaptive Re-Ranking with a Corpus Graph}, booktitle = {Proceedings of the 31st {ACM} International Conference on Information {\&} Knowledge Management, Atlanta, GA, USA, October 17-21, 2022}, pages = {1491--1500}, publisher = {{ACM}}, year = {2022}, url = {https://doi.org/10.1145/3511808.3557231}, doi = {10.1145/3511808.3557231}, timestamp = {Wed, 19 Oct 2022 17:09:02 +0200}, biburl = {https://dblp.org/rec/conf/cikm/MacAvaneyTM22.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- ladr_proactive(k=16, *, hops=1, num_results=1000, dense_scorer=None, drop_query_vec=False, budget=False)¶
Returns a proactive LADR (Lexically-Accelerated Dense Retrieval) transformer.
- Return type:
- Parameters:
k (int) – The number of neighbours in the corpus graph.
hops (int) – The number of hops to consider. Defaults to 1.
num_results (int) – The number of results to return per query.
dense_scorer (Transformer, optional) – The dense scorer to use. Defaults to np_scorer().
drop_query_vec (bool) – Whether to drop the query vector from the output.
budget (bool or int) – The maximum number of vectors to score. If False, no maximum is applied. If True, the budget is set to num_results. If an integer, this value is used as the budget.
- Returns:
A proactive LADR transformer.
- Return type:
Citation
Kulkarni et al. Lexically-Accelerated Dense Retrieval. SIGIR 2023. [link]
@inproceedings{DBLP:conf/sigir/KulkarniMGF23, author = {Hrishikesh Kulkarni and Sean MacAvaney and Nazli Goharian and Ophir Frieder}, editor = {Hsin{-}Hsi Chen and Wei{-}Jou (Edward) Duh and Hen{-}Hsen Huang and Makoto P. Kato and Josiane Mothe and Barbara Poblete}, title = {Lexically-Accelerated Dense Retrieval}, booktitle = {Proceedings of the 46th International {ACM} {SIGIR} Conference on Research and Development in Information Retrieval, {SIGIR} 2023, Taipei, Taiwan, July 23-27, 2023}, pages = {152--162}, publisher = {{ACM}}, year = {2023}, url = {https://doi.org/10.1145/3539618.3591715}, doi = {10.1145/3539618.3591715}, timestamp = {Fri, 21 Jul 2023 22:25:19 +0200}, biburl = {https://dblp.org/rec/conf/sigir/KulkarniMGF23.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
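The proactive strategy can be sketched as: start from seed documents supplied by an earlier (typically lexical) stage, add their corpus-graph neighbours for a fixed number of hops, then score the whole candidate set with the dense model. The code below is an illustrative sketch under those assumptions; ladr_proactive_sketch and the toy graph are hypothetical, and the budget option is not modelled.

```python
from typing import Dict, List, Tuple


def ladr_proactive_sketch(
    seed_docnos: List[str],
    query_vec: List[float],
    doc_vecs: Dict[str, List[float]],
    graph: Dict[str, List[str]],
    hops: int = 1,
    num_results: int = 1000,
) -> List[Tuple[str, float]]:
    """Expand the seeds through the graph, then densely score all candidates."""
    candidates = set(seed_docnos)
    frontier = set(seed_docnos)
    for _ in range(hops):
        # Neighbours of the current frontier that have not been seen yet.
        frontier = {n for d in frontier for n in graph.get(d, [])} - candidates
        candidates |= frontier
    scored = sorted(
        ((sum(q * v for q, v in zip(query_vec, doc_vecs[d])), d) for d in candidates),
        reverse=True,
    )
    return [(d, s) for s, d in scored[:num_results]]


vectors = {"d1": [1.0, 0.0], "d2": [0.6, 0.8], "d3": [0.0, 1.0], "d4": [0.8, 0.6]}
graph = {"d1": ["d4"], "d2": ["d4"], "d3": ["d2"], "d4": ["d1"]}
one_hop = ladr_proactive_sketch(["d3"], [1.0, 0.0], vectors, graph, hops=1)
two_hops = ladr_proactive_sketch(["d3"], [1.0, 0.0], vectors, graph, hops=2)
```

In this toy setup one hop reaches d2 and two hops also reach d4, which scores higher. More hops widen the candidate pool but increase the number of vectors scored.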
- ladr_adaptive(k=16, *, depth=100, num_results=1000, dense_scorer=None, max_hops=None, drop_query_vec=False, budget=False)¶
Returns an adaptive LADR (Lexically-Accelerated Dense Retrieval) transformer.
- Return type:
- Parameters:
k (int) – The number of neighbours in the corpus graph.
depth (int) – The depth of the ranked list to consider for convergence.
num_results (int) – The number of results to return per query.
dense_scorer (Transformer, optional) – The dense scorer to use. Defaults to np_scorer().
max_hops (int, optional) – The maximum number of hops to consider. Defaults to None (no limit).
drop_query_vec (bool) – Whether to drop the query vector from the output.
budget (bool or int) – The maximum number of vectors to score. If False, no maximum is applied. If True, the budget is set to num_results. If an integer, this value is used as the budget.
- Returns:
An adaptive LADR transformer.
- Return type:
Citation
Kulkarni et al. Lexically-Accelerated Dense Retrieval. SIGIR 2023. [link]
@inproceedings{DBLP:conf/sigir/KulkarniMGF23, author = {Hrishikesh Kulkarni and Sean MacAvaney and Nazli Goharian and Ophir Frieder}, editor = {Hsin{-}Hsi Chen and Wei{-}Jou (Edward) Duh and Hen{-}Hsen Huang and Makoto P. Kato and Josiane Mothe and Barbara Poblete}, title = {Lexically-Accelerated Dense Retrieval}, booktitle = {Proceedings of the 46th International {ACM} {SIGIR} Conference on Research and Development in Information Retrieval, {SIGIR} 2023, Taipei, Taiwan, July 23-27, 2023}, pages = {152--162}, publisher = {{ACM}}, year = {2023}, url = {https://doi.org/10.1145/3539618.3591715}, doi = {10.1145/3539618.3591715}, timestamp = {Fri, 21 Jul 2023 22:25:19 +0200}, biburl = {https://dblp.org/rec/conf/sigir/KulkarniMGF23.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- mmr(*, Lambda=0.5, norm_rel=False, norm_sim=False, drop_doc_vec=True, verbose=False)¶
Returns an MMR (Maximal Marginal Relevance) scorer (i.e., re-ranker) over this index.
The method first loads vectors from the index and then applies MmrScorer to re-rank the results. See MmrScorer for more details on MMR.
- Return type:
- Parameters:
Lambda (float) – The balance parameter between relevance and diversity (default: 0.5)
norm_rel (bool) – Whether to normalize relevance scores to [0, 1] (default: False)
norm_sim (bool) – Whether to normalize similarity scores to [0, 1] (default: False)
drop_doc_vec (bool) – Whether to drop the ‘doc_vec’ column after re-ranking (default: True)
verbose (bool) – Whether to display verbose output (e.g., progress bars) (default: False)
Citation
Carbonell and Goldstein. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. SIGIR 1998. [link]
@inproceedings{DBLP:conf/sigir/CarbonellG98, author = {Jaime G. Carbonell and Jade Goldstein}, editor = {W. Bruce Croft and Alistair Moffat and C. J. van Rijsbergen and Ross Wilkinson and Justin Zobel}, title = {The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries}, booktitle = {{SIGIR} '98: Proceedings of the 21st Annual International {ACM} {SIGIR} Conference on Research and Development in Information Retrieval, August 24-28 1998, Melbourne, Australia}, pages = {335--336}, publisher = {{ACM}}, year = {1998}, url = {https://doi.org/10.1145/290941.291025}, doi = {10.1145/290941.291025}, timestamp = {Wed, 14 Nov 2018 10:58:11 +0100}, biburl = {https://dblp.org/rec/conf/sigir/CarbonellG98.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
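For readers unfamiliar with MMR, the greedy selection from Carbonell and Goldstein can be sketched as follows: at each step, pick the document that maximises lam * relevance minus (1 - lam) * its highest similarity to anything already selected. This sketch assumes Lambda weights relevance and uses dot-product similarity without normalisation; mmr_rerank is a hypothetical helper, not MmrScorer itself.

```python
from typing import Dict, List


def mmr_rerank(rel: Dict[str, float], doc_vecs: Dict[str, List[float]], lam: float = 0.5) -> List[str]:
    """Greedy MMR: trade relevance against similarity to already-selected docs."""
    remaining = list(rel)
    selected: List[str] = []

    def mmr_score(docno: str) -> float:
        max_sim = max(
            (sum(a * b for a, b in zip(doc_vecs[docno], doc_vecs[s])) for s in selected),
            default=0.0,  # nothing selected yet, so no redundancy penalty
        )
        return lam * rel[docno] - (1.0 - lam) * max_sim

    while remaining:
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


relevance = {"a": 0.9, "b": 0.85, "c": 0.5}
vectors = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}  # b duplicates a
diverse = mmr_rerank(relevance, vectors, lam=0.5)
by_relevance = mmr_rerank(relevance, vectors, lam=1.0)
print(diverse)       # ['a', 'c', 'b']
print(by_relevance)  # ['a', 'b', 'c']
```

With lam=0.5 the near-duplicate b is pushed below c; with lam=1.0 the ordering falls back to pure relevance.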
Evaluation¶
- property ILS: Measure¶
Return an ILS (Intra-List Similarity) measure for this index. See pyterrier_dr.ILS() for more details.
Index Data Access¶
- built()[source]¶
Check if the index has been built.
- Return type:
bool
- Returns:
True if the index has been built, otherwise False.
- vec_loader()¶
Return a transformer that loads indexed vectors.
The returned transformer expects a DataFrame with the column docno. It outputs a frame that includes a column doc_vec, which contains the indexed vectors. For example:

Load vectors from a FlexIndex

    import pandas as pd
    from pyterrier_dr import FlexIndex
    index = FlexIndex.from_hf('macavaney/msmarco-passage.tasb.flex')
    loader = index.vec_loader()
    loader(pd.DataFrame([
        {"docno": "5"},
        {"docno": "100"},
        {"docno": "74356"},
    ]))
    # docno  doc_vec
    # 5      [-0.09343405, 0.12045559, -0.25184962, 0.15029...
    # 100    [-0.11527929, 0.63400555, -0.0877756, -0.26490...
    # 74356  [0.15367049, 0.16049547, -0.012261144, -0.2588...
- Returns:
A transformer that loads indexed vectors.
- Return type:
- get_corpus_iter(start_idx=None, stop_idx=None, verbose=True)[source]¶
Iterate over the documents in the index.
- Return type:
Iterable[Dict]
- Parameters:
start_idx – The index of the first document to return (or None to start at the first document).
stop_idx – The index of the last document to return (or None to end on the last document).
verbose – Whether to display a progress bar.
- Yields:
Dict[str, Any] – A dictionary with keys docno and doc_vec.
- np_vecs()¶
Return the indexed vectors.
- Returns:
The indexed vectors as a memory-mapped numpy array.
- Return type:
numpy.ndarray
- torch_vecs(*, device=None, fp16=False)¶
Return the indexed vectors as a pytorch tensor.
Caution
This method loads the entire index into memory on the provided device. If the index is too large to fit in memory, consider using a different method that does not fully load the index into memory, like np_vecs() or get_corpus_iter().
- Parameters:
device (str | None) – The device to use for the tensor. If not provided, the default device is used (cuda if available, otherwise cpu).
fp16 (bool) – Whether to use half precision (fp16) for the tensor.
- Returns:
The indexed vectors as a torch tensor.
- Return type:
torch.Tensor
- docnos()[source]¶
Return the document identifier (docno) lookup data structure.
- Returns:
The document number lookup.
- Return type:
npids.Lookup
- corpus_graph(k=16, *, batch_size=8192)¶
Return the corpus graph (neighborhood graph) for the index.
The corpus graph is a directed graph where each node represents a document and each edge represents a connection between two documents. The graph is built by computing the cosine similarity between each pair of documents and storing the k-nearest neighbors for each document.
If the corpus graph has not been built yet, it will be built using the given k and batch size.
- Parameters:
k (int) – The number of neighbors to store for each document.
batch_size (int) – The number of vectors to process in each batch.
- Returns:
The corpus graph for the index.
- Return type:
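The construction described above (cosine similarity between every pair, keep the k nearest neighbours per document) can be sketched in a few lines. This is an unbatched illustration only: build_corpus_graph is a hypothetical helper, and excluding each document from its own neighbour list is an assumption of this sketch.

```python
import heapq
import math
from typing import Dict, List


def build_corpus_graph(doc_vecs: Dict[str, List[float]], k: int = 16) -> Dict[str, List[str]]:
    """Map each docno to its k most cosine-similar other documents."""
    unit = {}
    for docno, vec in doc_vecs.items():
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        unit[docno] = [v / norm for v in vec]  # unit length: dot == cosine
    graph = {}
    for docno, vec in unit.items():
        sims = [
            (sum(a * b for a, b in zip(vec, other_vec)), other)
            for other, other_vec in unit.items()
            if other != docno  # assumption: no self-edges
        ]
        graph[docno] = [other for _, other in heapq.nlargest(k, sims)]
    return graph


vectors = {"a": [1.0, 0.0], "b": [2.0, 0.2], "c": [0.0, 3.0], "d": [0.3, 1.0]}
toy_graph = build_corpus_graph(vectors, k=1)
print(toy_graph)  # {'a': ['b'], 'b': ['a'], 'c': ['d'], 'd': ['c']}
```

Edges are directed: a document appearing in another's neighbour list does not guarantee the reverse, although it happens to hold in this toy example. The all-pairs comparison is quadratic in the number of documents, which is why the real method processes vectors in batches.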
- faiss_hnsw_graph(neighbours=32, *, ef_construction=40)¶
Returns the (approximate) HNSW graph structure created by the HNSW index.
If the graph structure does not already exist, it is created and cached to disk.
- Parameters:
neighbours (int) – The number of neighbours of the constructed neighborhood graph
ef_construction (int) – The number of neighbours to consider during construction
- Returns:
The HNSW graph structure
- Return type:
Note
This function requires the faiss package to be installed.
Citation
Douze et al. The Faiss library. arXiv 2024. [link]
@article{DBLP:journals/corr/abs-2401-08281, author = {Matthijs Douze and Alexandr Guzhva and Chengqi Deng and Jeff Johnson and Gergely Szilvasy and Pierre{-}Emmanuel Mazar{\'{e}} and Maria Lomeli and Lucas Hosseini and Herv{\'{e}} J{\'{e}}gou}, title = {The Faiss library}, journal = {CoRR}, volume = {abs/2401.08281}, year = {2024}, url = {https://doi.org/10.48550/arXiv.2401.08281}, doi = {10.48550/ARXIV.2401.08281}, eprinttype = {arXiv}, eprint = {2401.08281}, timestamp = {Thu, 01 Feb 2024 15:35:36 +0100}, biburl = {https://dblp.org/rec/journals/corr/abs-2401-08281.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
Pseudo-Relevance Feedback¶
- class pyterrier_dr.AveragePrf(*, k=3)[source]¶
Performs Average PRF (as described by Li et al.) by averaging the query_vec column with the doc_vec column of the top k documents.
- Parameters:
k – number of pseudo-relevant feedback documents
Expected Input Columns: ['qid', 'query_vec', 'docno', 'doc_vec']
Output Columns: ['qid', 'query_vec'] (Any other query columns from the input are also included in the output.)
Example:

    prf_pipe = model >> index >> index.vec_loader() >> pyterrier_dr.AveragePrf() >> index
Citation
Li et al. Pseudo Relevance Feedback with Deep Language Models and Dense Retrievers: Successes and Pitfalls. ACM Trans. Inf. Syst. 2023. [link]
@article{DBLP:journals/tois/0009MZKZ23, author = {Hang Li and Ahmed Mourad and Shengyao Zhuang and Bevan Koopman and Guido Zuccon}, title = {Pseudo Relevance Feedback with Deep Language Models and Dense Retrievers: Successes and Pitfalls}, journal = {{ACM} Trans. Inf. Syst.}, volume = {41}, number = {3}, pages = {62:1--62:40}, year = {2023}, url = {https://doi.org/10.1145/3570724}, doi = {10.1145/3570724}, timestamp = {Fri, 21 Jul 2023 22:26:51 +0200}, biburl = {https://dblp.org/rec/journals/tois/0009MZKZ23.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
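The averaging step itself is simple: the new query vector is the element-wise mean of the original query vector and the vectors of the top k documents. A minimal sketch, with average_prf as a hypothetical helper operating on plain lists rather than a DataFrame:

```python
from typing import List


def average_prf(query_vec: List[float], ranked_doc_vecs: List[List[float]], k: int = 3) -> List[float]:
    """Mean of the query vector and the top-k feedback document vectors."""
    vecs = [query_vec] + ranked_doc_vecs[:k]
    return [sum(dims) / len(vecs) for dims in zip(*vecs)]


query = [1.0, 0.0]
ranked_docs = [[0.0, 2.0], [2.0, 1.0], [9.0, 9.0]]  # already sorted by rank
new_query = average_prf(query, ranked_docs, k=2)
print(new_query)  # [1.0, 1.0]
```

Only the top two documents contribute here; the third is ignored because k=2. The query counts as one vote among k+1 vectors, so larger k moves the result further from the original query.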
- class pyterrier_dr.VectorPrf(*, alpha=1, beta=0.2, k=3)[source]¶
Performs a Rocchio-esque PRF by linearly combining the query_vec column with the doc_vec column of the top k documents.
- Parameters:
alpha – weight of original query_vec
beta – weight of doc_vec
k – number of pseudo-relevant feedback documents
Expected Input Columns: ['qid', 'query_vec', 'docno', 'doc_vec']
Output Columns: ['qid', 'query_vec'] (Any other query columns from the input are also included in the output.)
Example:

    prf_pipe = model >> index >> index.vec_loader() >> pyterrier_dr.VectorPrf() >> index
Citation
Li et al. Pseudo Relevance Feedback with Deep Language Models and Dense Retrievers: Successes and Pitfalls. ACM Trans. Inf. Syst. 2023. [link]
@article{DBLP:journals/tois/0009MZKZ23, author = {Hang Li and Ahmed Mourad and Shengyao Zhuang and Bevan Koopman and Guido Zuccon}, title = {Pseudo Relevance Feedback with Deep Language Models and Dense Retrievers: Successes and Pitfalls}, journal = {{ACM} Trans. Inf. Syst.}, volume = {41}, number = {3}, pages = {62:1--62:40}, year = {2023}, url = {https://doi.org/10.1145/3570724}, doi = {10.1145/3570724}, timestamp = {Fri, 21 Jul 2023 22:26:51 +0200}, biburl = {https://dblp.org/rec/journals/tois/0009MZKZ23.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
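A Rocchio-style combination weights the original query by alpha and the feedback documents by beta. The sketch below assumes the top k document vectors are combined by their mean before weighting; summing them instead is an equally plausible reading of the description, so treat this as one interpretation. vector_prf is a hypothetical helper.

```python
from typing import List


def vector_prf(
    query_vec: List[float],
    ranked_doc_vecs: List[List[float]],
    alpha: float = 1.0,
    beta: float = 0.2,
    k: int = 3,
) -> List[float]:
    """alpha * query + beta * mean(top-k docs); the mean is an assumption."""
    top = ranked_doc_vecs[:k]
    if not top:
        return [alpha * q for q in query_vec]
    centroid = [sum(dims) / len(top) for dims in zip(*top)]
    return [alpha * q + beta * c for q, c in zip(query_vec, centroid)]


query = [1.0, 0.0]
ranked_docs = [[0.0, 3.0], [0.0, 5.0], [7.0, 7.0]]
new_query = vector_prf(query, ranked_docs, alpha=1.0, beta=0.5, k=2)
print(new_query)  # [1.0, 2.0]
```

Setting beta to 0 leaves the query direction unchanged, while a larger beta pulls the query towards the feedback documents.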
Diversity¶
- class pyterrier_dr.MmrScorer(*, Lambda=0.5, norm_rel=False, norm_sim=False, drop_doc_vec=True, verbose=False)[source]¶
- Parameters:
Lambda (float) – The balance parameter between relevance and diversity (default: 0.5)
norm_rel (bool) – Whether to normalize relevance scores to [0, 1] (default: False)
norm_sim (bool) – Whether to normalize similarity scores to [0, 1] (default: False)
drop_doc_vec (bool) – Whether to drop the ‘doc_vec’ column after re-ranking (default: True)
verbose (bool) – Whether to display verbose output (e.g., progress bars) (default: False)
- pyterrier_dr.ILS(index, *, name=None, verbose=False)[source]¶
Create an ILS (Intra-List Similarity) measure calculated using the vectors in the provided index.
Higher scores indicate lower diversity in the results.
This measure supports the @k convention for applying a top-k cutoff before scoring.
- Parameters:
index (FlexIndex) – The index to use for loading document vectors.
name (str, optional) – The name of the measure (default: “ILS”).
verbose (bool, optional) – Whether to display a progress bar.
- Returns:
An ILS measure object.
- Return type:
ir_measures.Measure
Citation
Ziegler et al. Improving recommendation lists through topic diversification. WWW 2005. [link]
@inproceedings{DBLP:conf/www/ZieglerMKL05, author = {Cai{-}Nicolas Ziegler and Sean M. McNee and Joseph A. Konstan and Georg Lausen}, editor = {Allan Ellis and Tatsuya Hagino}, title = {Improving recommendation lists through topic diversification}, booktitle = {Proceedings of the 14th international conference on World Wide Web, {WWW} 2005, Chiba, Japan, May 10-14, 2005}, pages = {22--32}, publisher = {{ACM}}, year = {2005}, url = {https://doi.org/10.1145/1060745.1060754}, doi = {10.1145/1060745.1060754}, timestamp = {Fri, 25 Dec 2020 01:14:58 +0100}, biburl = {https://dblp.org/rec/conf/www/ZieglerMKL05.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- pyterrier_dr.ils(results, index=None, *, verbose=False)[source]¶
Calculate the ILS (Intra-List Similarity) of a set of results.
Higher scores indicate lower diversity in the results.
- Parameters:
results (DataFrame) – The result frame to calculate ILS for.
index (FlexIndex | None) – The index to use for loading document vectors. Required if results does not have a doc_vec column.
verbose (bool) – Whether to display a progress bar.
- Returns:
An iterable of (qid, ILS) pairs.
- Return type:
Iterable[Tuple[str,float]]
Citation
Ziegler et al. Improving recommendation lists through topic diversification. WWW 2005. [link]
@inproceedings{DBLP:conf/www/ZieglerMKL05, author = {Cai{-}Nicolas Ziegler and Sean M. McNee and Joseph A. Konstan and Georg Lausen}, editor = {Allan Ellis and Tatsuya Hagino}, title = {Improving recommendation lists through topic diversification}, booktitle = {Proceedings of the 14th international conference on World Wide Web, {WWW} 2005, Chiba, Japan, May 10-14, 2005}, pages = {22--32}, publisher = {{ACM}}, year = {2005}, url = {https://doi.org/10.1145/1060745.1060754}, doi = {10.1145/1060745.1060754}, timestamp = {Fri, 25 Dec 2020 01:14:58 +0100}, biburl = {https://dblp.org/rec/conf/www/ZieglerMKL05.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
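To make "higher scores indicate lower diversity" concrete, here is one common formulation of intra-list similarity: the mean cosine similarity over all pairs of documents in a result list. The exact normalisation used by the library is not stated above, so this is a sketch under that assumption; ils_sketch is a hypothetical helper, and returning 0.0 for lists with fewer than two documents is a choice made here.

```python
import math
from itertools import combinations
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def ils_sketch(doc_vecs: List[List[float]]) -> float:
    """Mean pairwise cosine similarity of one result list."""
    pairs = list(combinations(doc_vecs, 2))
    if not pairs:
        return 0.0  # assumption: undefined cases score 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


redundant = ils_sketch([[1.0, 0.0], [2.0, 0.0]])            # same direction
diverse = ils_sketch([[1.0, 0.0], [0.0, 1.0]])              # orthogonal
mixed = ils_sketch([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])    # one similar pair of three
```

The redundant list scores 1.0, the orthogonal list scores 0.0, and the mixed list falls in between at one third.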
Deprecated¶
Warning
The following classes are deprecated and will be removed in future releases.
- class pyterrier_dr.DocnoFile(path)[source]¶
Represents a document ID lookup file.
Deprecated since version 0.6.0: This class was replaced with
npids.Lookup
- class pyterrier_dr.NilIndex(*args, **kwargs)[source]¶
This class is an indexer that does nothing. It is meant to be used for testing.
Deprecated since version 0.6.0.
- class pyterrier_dr.NumpyIndex(*args, **kwargs)[source]¶
This class implements a disk-based dense vector index using numpy memory maps.
Deprecated since version 0.6.0: This class has been replaced with
pyterrier_dr.FlexIndex.
- class pyterrier_dr.MemIndex(*args, **kwargs)[source]¶
This class implements an in-memory dense vector index using numpy arrays.
Deprecated since version 0.6.0: This class has been replaced with
pyterrier_dr.FlexIndex.
- class pyterrier_dr.FaissFlat(*args, **kwargs)[source]¶
This class implements a disk-based dense vector index using Faiss Flat indexes.
Deprecated since version 0.6.0: This class has been replaced with
pyterrier_dr.FlexIndex.
- class pyterrier_dr.FaissHnsw(*args, **kwargs)[source]¶
This class implements a disk-based dense vector index using Faiss HNSW for approximate nearest neighbor retrieval.
Deprecated since version 0.6.0: This class has been replaced with
pyterrier_dr.FlexIndex.
- class pyterrier_dr.TorchIndex(*args, **kwargs)[source]¶
This class implements a disk-based dense vector index using PyTorch for GPU-accelerated retrieval.
Deprecated since version 0.6.0: This class has been replaced with
pyterrier_dr.FlexIndex.