Indexing & Retrieval

This page covers the indexing and retrieval functionality provided by pyterrier_dr.

FlexIndex is the main class for indexing and retrieval. It provides a flexible way to index documents as dense vectors and to retrieve over them.

API Documentation

class pyterrier_dr.FlexIndex(path, *, sim_fn=SimFn.dot, verbose=True)[source]

Bases: Artifact, Indexer

Represents a FLexible EXecution (FLEX) Index, which is a dense index format.

FLEX allows for a variety of retrieval implementations (NumPy, FAISS, etc.) and algorithms (exhaustive, HNSW, etc.) to be tested. In most cases, the same vector storage can be used across implementations and algorithms, saving considerably on disk space.

Parameters:
  • path – The path to the index directory

  • sim_fn – The similarity function to use

  • verbose – Whether to display verbose output (e.g., progress bars)

Indexing

Basic indexing functionality is provided through index(). For more advanced options, use indexer().

index(inp)[source]

Index the given input data stream to a new index at this location.

Each record in inp is expected to be a dictionary containing at least two keys: docno (a unique document identifier) and doc_vec (a dense vector representation of the document).

Typically this method will be used in a pipeline of operations, where the input data is first transformed by a document encoder to add the doc_vec values before it is indexed. For example:

Index documents into a FlexIndex using a TasB encoder.
from pyterrier_dr import TasB, FlexIndex
encoder = TasB.dot()
index = FlexIndex('my_index')
pipeline = encoder >> index
pipeline.index([
    {'docno': 'doc1', 'text': 'hello'},
    {'docno': 'doc2', 'text': 'world'},
])
Parameters:

inp – An iterable of dictionaries to index.

Returns:

A reference back to this index (self).

Return type:

pyterrier_alpha.Artifact

Raises:

RuntimeError – If the index is already built.

indexer(*, mode=IndexingMode.create)[source]

Return an indexer for this index with the specified options.

This transformer gives more fine-grained control over the indexing process, allowing you to specify whether to create a new index or overwrite an existing one.

Similar to index(), this method will typically be used in a pipeline of operations, where the input data is first transformed by a document encoder to add the doc_vec values before it is indexed. For example:

Overwrite a FlexIndex using a TasB encoder.
from pyterrier_dr import TasB, FlexIndex
encoder = TasB.dot()
index = FlexIndex('my_index')
pipeline = encoder >> index.indexer(mode='overwrite')
pipeline.index([
    {'docno': 'doc1', 'text': 'hello'},
    {'docno': 'doc2', 'text': 'world'},
])
Parameters:

mode – The indexing mode to use (create or overwrite).

Returns:

A new indexer instance.

Return type:

Indexer

Retrieval

FlexIndex provides a variety of retriever backends. Each one expects qid and query_vec columns as input and outputs a result frame. If you do not have a preference for a particular backend, you can use retriever() (an alias to np_retriever()), which performs exact retrieval using a brute-force search over all vectors.
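
For example, a minimal retrieval pipeline might look like the following (a sketch; the index path and the TasB query encoder are assumptions):

from pyterrier_dr import TasB, FlexIndex
encoder = TasB.dot()
index = FlexIndex('my_index')
pipeline = encoder >> index.retriever(num_results=1000)
pipeline.search('chemical reactions')  # encodes the query, then performs a brute-force search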

retriever(*, num_results=1000)

Returns a transformer that performs basic exact retrieval over indexed vectors using a brute force search. An alias to np_retriever().

np_retriever(*, num_results=1000, batch_size=None, drop_query_vec=False)

Return a retriever that uses numpy to perform a brute force search over the index.

The returned transformer expects a DataFrame with columns qid and query_vec. It outputs a result frame containing the retrieved results.

Parameters:
  • num_results – The number of results to return per query.

  • batch_size – The number of documents to score in each batch.

  • drop_query_vec – Whether to drop the query vector from the output.

Returns:

A retriever that uses numpy to perform a brute force search.

Return type:

Transformer

torch_retriever(*, num_results=1000, device=None, fp16=False, qbatch=64, drop_query_vec=False)

Return a retriever that uses pytorch to perform brute-force retrieval over the indexed vectors.

The returned pyterrier.Transformer expects a DataFrame with columns qid, query_vec.

Caution

This method loads the entire index into memory on the provided device. If the index is too large to fit in memory, consider using a different retriever that does not fully load the index into memory, like np_retriever().

Parameters:
  • num_results – The number of results to return per query.

  • device – The device to use for scoring. If not provided, the default device is used (cuda if available, otherwise cpu).

  • fp16 – Whether to use half precision (fp16) for scoring.

  • qbatch – The number of queries to score in each batch.

  • drop_query_vec – Whether to drop the query vector from the output.

Returns:

A transformer that retrieves using pytorch.

Return type:

Transformer
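
For example, a sketch that scores on GPU with half precision (the device and index path are assumptions):

from pyterrier_dr import TasB, FlexIndex
index = FlexIndex('my_index')
# the whole index is loaded onto the device; see the caution above
pipeline = TasB.dot() >> index.torch_retriever(device='cuda', fp16=True)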

faiss_flat_retriever(*, gpu=False, qbatch=64, drop_query_vec=False)

Returns a retriever that uses FAISS to perform brute-force search over the indexed vectors.

Parameters:
  • gpu – Whether to load the index onto GPU for scoring

  • qbatch – The batch size during search

  • drop_query_vec – Whether to drop the query vector from the output

Returns:

A retriever that uses FAISS to perform brute-force search over the indexed vectors

Return type:

Transformer

Note

This transformer requires the faiss package to be installed.

Citation

Douze et al. The Faiss library. arXiv 2024. [link]
@article{DBLP:journals/corr/abs-2401-08281,
  author       = {Matthijs Douze and
                  Alexandr Guzhva and
                  Chengqi Deng and
                  Jeff Johnson and
                  Gergely Szilvasy and
                  Pierre{-}Emmanuel Mazar{\'{e}} and
                  Maria Lomeli and
                  Lucas Hosseini and
                  Herv{\'{e}} J{\'{e}}gou},
  title        = {The Faiss library},
  journal      = {CoRR},
  volume       = {abs/2401.08281},
  year         = {2024},
  url          = {https://doi.org/10.48550/arXiv.2401.08281},
  doi          = {10.48550/ARXIV.2401.08281},
  eprinttype    = {arXiv},
  eprint       = {2401.08281},
  timestamp    = {Thu, 01 Feb 2024 15:35:36 +0100},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2401-08281.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
faiss_hnsw_retriever(neighbours=32, *, num_results=1000, ef_construction=40, ef_search=16, cache=True, search_bounded_queue=True, qbatch=64, drop_query_vec=False)

Returns a retriever that uses FAISS over a HNSW index.

Creates the HNSW graph structure if it does not already exist. When cache=True (the default), this graph structure is cached to disk for subsequent use.

Parameters:
  • neighbours – The number of neighbours of the constructed neighborhood graph

  • num_results – The number of results to return per query

  • ef_construction – The number of neighbours to consider during construction

  • ef_search – The number of neighbours to consider during search

  • cache – Whether to cache the index to disk

  • search_bounded_queue – Whether to use a bounded queue during search

  • qbatch – The batch size during search

  • drop_query_vec – Whether to drop the query vector from the output

Returns:

A retriever that uses FAISS over a HNSW index

Return type:

Transformer
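
For example, a sketch that trades accuracy for speed via ef_search (parameter values are illustrative):

from pyterrier_dr import TasB, FlexIndex
index = FlexIndex('my_index')
# higher ef_search considers more neighbours during search: slower but more accurate
pipeline = TasB.dot() >> index.faiss_hnsw_retriever(neighbours=32, ef_search=64)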

Note

This transformer requires the faiss package to be installed.

Citation

Douze et al. The Faiss library. arXiv 2024. [link]
@article{DBLP:journals/corr/abs-2401-08281,
  author       = {Matthijs Douze and
                  Alexandr Guzhva and
                  Chengqi Deng and
                  Jeff Johnson and
                  Gergely Szilvasy and
                  Pierre{-}Emmanuel Mazar{\'{e}} and
                  Maria Lomeli and
                  Lucas Hosseini and
                  Herv{\'{e}} J{\'{e}}gou},
  title        = {The Faiss library},
  journal      = {CoRR},
  volume       = {abs/2401.08281},
  year         = {2024},
  url          = {https://doi.org/10.48550/arXiv.2401.08281},
  doi          = {10.48550/ARXIV.2401.08281},
  eprinttype    = {arXiv},
  eprint       = {2401.08281},
  timestamp    = {Thu, 01 Feb 2024 15:35:36 +0100},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2401-08281.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
faiss_ivf_retriever(*, num_results=1000, train_sample=None, n_list=None, cache=True, n_probe=1, drop_query_vec=False)

Returns a retriever that uses FAISS over an IVF index.

If the IVF structure does not already exist, it is created and, when cache=True (the default), cached to disk.

Parameters:
  • num_results – The number of results to return per query

  • train_sample – The number of training samples to use for training the index. If not provided, a default value is used (approximately the square root of the number of documents).

  • n_list – The number of posting lists to use for the index. If not provided, a default value is used (approximately train_sample/39).

  • cache – Whether to cache the index to disk.

  • n_probe – The number of posting lists to probe during search. The higher the value, the better the approximation will be, but the longer it will take.

  • drop_query_vec – Whether to drop the query vector from the output.

Returns:

A retriever that uses FAISS over an IVF index

Return type:

Transformer
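
For example, a sketch that probes more posting lists for a better approximation (n_probe is illustrative):

from pyterrier_dr import TasB, FlexIndex
index = FlexIndex('my_index')
pipeline = TasB.dot() >> index.faiss_ivf_retriever(n_probe=16)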

Note

This transformer requires the faiss package to be installed.

Citation

Douze et al. The Faiss library. arXiv 2024. [link]
@article{DBLP:journals/corr/abs-2401-08281,
  author       = {Matthijs Douze and
                  Alexandr Guzhva and
                  Chengqi Deng and
                  Jeff Johnson and
                  Gergely Szilvasy and
                  Pierre{-}Emmanuel Mazar{\'{e}} and
                  Maria Lomeli and
                  Lucas Hosseini and
                  Herv{\'{e}} J{\'{e}}gou},
  title        = {The Faiss library},
  journal      = {CoRR},
  volume       = {abs/2401.08281},
  year         = {2024},
  url          = {https://doi.org/10.48550/arXiv.2401.08281},
  doi          = {10.48550/ARXIV.2401.08281},
  eprinttype    = {arXiv},
  eprint       = {2401.08281},
  timestamp    = {Thu, 01 Feb 2024 15:35:36 +0100},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2401-08281.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
flatnav_retriever(k=32, *, ef_search=100, num_initializations=100, ef_construction=100, threads=16, num_results=1000, cache=True, qbatch=64, drop_query_vec=False, verbose=False)

Returns a retriever that searches over a flatnav index.

Return type:

Transformer

Parameters:
  • k (int) – the maximum number of edges per document in the index

  • ef_search (int) – the size of the list during searches. Higher values are slower but more accurate.

  • num_initializations (int) – the number of random initializations to use during search.

  • ef_construction (int) – the size of the list during graph construction. Higher values are slower but more accurate.

  • threads (int) – the number of threads to use

  • num_results (int) – the number of results to return per query

  • cache (bool) – whether to cache the index to disk

  • qbatch (int) – the number of queries to search at once

  • drop_query_vec (bool) – whether to drop the query_vec column after retrieval

  • verbose (bool) – whether to show progress bars

Added in version 0.4.0.

Changed in version 0.4.1: fixed bug with num_initializations
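
For example, a minimal sketch (parameter values are illustrative):

from pyterrier_dr import TasB, FlexIndex
index = FlexIndex('my_index')
pipeline = TasB.dot() >> index.flatnav_retriever(k=32, ef_search=200)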

Note

This transformer requires the flatnav package to be installed. Instructions are available in the flatnav repository.

Citation

Munyampirwa et al. Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs". arXiv 2024. [link]

scann_retriever(*, n_leaves=None, leaves_to_search=1, num_results=1000, train_sample=None, drop_query_vec=False)

Returns a retriever over a ScaNN (Scalable Nearest Neighbors) index.

Parameters:
  • n_leaves (int, optional) – Number of leaves in the ScaNN index. Defaults to approximately sqrt(doc_count).

  • leaves_to_search (int, optional) – Number of leaves to search. Defaults to 1. The higher the value, the more accurate the search.

  • num_results (int, optional) – Number of results to return. Defaults to 1000.

  • train_sample (int, optional) – Number of training samples. Defaults to n_leaves*39.

  • drop_query_vec (bool, optional) – Whether to drop the query vector from the output.

Returns:

A transformer that retrieves using ScaNN.

Return type:

Transformer
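
For example, a sketch that searches more leaves for a more accurate (but slower) search (leaves_to_search is illustrative):

from pyterrier_dr import TasB, FlexIndex
index = FlexIndex('my_index')
pipeline = TasB.dot() >> index.scann_retriever(leaves_to_search=4)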

Note

This method requires the scann package. Install it via pip install scann.

Citation

Guo et al. Accelerating Large-Scale Inference with Anisotropic Vector Quantization. ICML 2020. [link]
@inproceedings{DBLP:conf/icml/GuoSLGSCK20,
  author       = {Ruiqi Guo and
                  Philip Sun and
                  Erik Lindgren and
                  Quan Geng and
                  David Simcha and
                  Felix Chern and
                  Sanjiv Kumar},
  title        = {Accelerating Large-Scale Inference with Anisotropic Vector Quantization},
  booktitle    = {Proceedings of the 37th International Conference on Machine Learning,
                  {ICML} 2020, 13-18 July 2020, Virtual Event},
  series       = {Proceedings of Machine Learning Research},
  volume       = {119},
  pages        = {3887--3896},
  publisher    = {{PMLR}},
  year         = {2020},
  url          = {http://proceedings.mlr.press/v119/guo20h.html},
  timestamp    = {Tue, 15 Dec 2020 17:40:18 +0100},
  biburl       = {https://dblp.org/rec/conf/icml/GuoSLGSCK20.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
voyager_retriever(neighbours=12, *, num_results=1000, ef_construction=200, random_seed=1, storage_data_type='float32', query_ef=10, drop_query_vec=False)

Returns a retriever that uses HNSW to search over a Voyager index.

Parameters:
  • neighbours (int, optional) – Number of neighbours to search. Defaults to 12.

  • num_results (int, optional) – Number of results to return per query. Defaults to 1000.

  • ef_construction (int, optional) – Expansion factor for graph construction. Defaults to 200.

  • random_seed (int, optional) – Random seed. Defaults to 1.

  • storage_data_type (str, optional) – Storage data type. One of ‘float32’, ‘float8’, ‘e4m3’. Defaults to ‘float32’.

  • query_ef (int, optional) – Expansion factor during querying. Defaults to 10.

  • drop_query_vec (bool, optional) – Drop the query vector from the output. Defaults to False.

Returns:

A retriever that uses HNSW to search over a Voyager index.

Return type:

Transformer
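
For example, a minimal sketch (query_ef is illustrative):

from pyterrier_dr import TasB, FlexIndex
index = FlexIndex('my_index')
pipeline = TasB.dot() >> index.voyager_retriever(neighbours=12, query_ef=50)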

Note

This method requires the voyager package. Install it via pip install voyager.

Re-Ranking

Results can be re-ranked using the indexed vectors via scorer(), as in the example below. (np_scorer() and torch_scorer() are available as specific implementations, if needed.)

gar(), ladr_proactive(), and ladr_adaptive() are adaptive re-ranking approaches that pull in other documents from the corpus that may be relevant.
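
For example, a sketch of re-ranking BM25 results using indexed vectors (the BM25 first stage, dataset, and encoder are assumptions):

import pyterrier as pt
from pyterrier_dr import TasB, FlexIndex
index = FlexIndex('my_index')
bm25 = pt.BatchRetrieve.from_dataset('vaswani', 'terrier_stemmed', wmodel='BM25')
# BM25 supplies candidates, the encoder adds query_vec, and scorer() re-ranks by dense similarity
pipeline = bm25 >> TasB.dot() >> index.scorer()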

scorer()

An alias to np_scorer().

np_scorer(*, num_results=None)

Return a scorer that uses numpy to score (re-rank) results using indexed vectors.

The returned transformer expects a DataFrame with columns qid, query_vec and docno. (If an internal docid column is provided, this will be used to speed up vector lookups.)

This method uses memory-mapping to avoid loading the entire index into memory at once.

Parameters:

num_results – The number of results to return per query. If not provided, all results from the original frame are returned.

Returns:

A transformer that scores query vectors with numpy.

Return type:

Transformer

torch_scorer(*, num_results=None, device=None, fp16=False)

Return a scorer that uses pytorch to score (re-rank) results using indexed vectors.

The returned pyterrier.Transformer expects a DataFrame with columns qid, query_vec and docno. (If an internal docid column is provided, this will be used to speed up vector lookups.)

Caution

This method loads the entire index into memory on the provided device. If the index is too large to fit in memory, consider using a different scorer that does not fully load the index into memory, like np_scorer().

Parameters:
  • num_results – The number of results to return per query. If not provided, all results from the original frame are returned.

  • device – The device to use for scoring. If not provided, the default device is used (cuda if available, otherwise cpu).

  • fp16 – Whether to use half precision (fp16) for scoring.

Returns:

A transformer that scores query vectors with pytorch.

Return type:

Transformer

gar(k=16, *, batch_size=128, num_results=1000)

Returns a retriever that uses a corpus graph to search over a FlexIndex.

Parameters:
  • k (int) – Number of neighbours in the corpus graph. Defaults to 16.

  • batch_size (int) – Batch size for retrieval. Defaults to 128.

  • num_results (int) – Number of results per query to return. Defaults to 1000.

Returns:

A retriever that uses a corpus graph to search over a FlexIndex.

Return type:

Transformer
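
For example, a sketch of adaptive re-ranking over a BM25 first stage (the first stage and encoder are assumptions; see corpus_graph() for how the required graph is built):

import pyterrier as pt
from pyterrier_dr import TasB, FlexIndex
index = FlexIndex('my_index')
bm25 = pt.BatchRetrieve.from_dataset('vaswani', 'terrier_stemmed', wmodel='BM25')
pipeline = bm25 >> TasB.dot() >> index.gar(k=16)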

Citation

MacAvaney et al. Adaptive Re-Ranking with a Corpus Graph. CIKM 2022. [link]
@inproceedings{DBLP:conf/cikm/MacAvaneyTM22,
  author       = {Sean MacAvaney and
                  Nicola Tonellotto and
                  Craig Macdonald},
  editor       = {Mohammad Al Hasan and
                  Li Xiong},
  title        = {Adaptive Re-Ranking with a Corpus Graph},
  booktitle    = {Proceedings of the 31st {ACM} International Conference on Information
                  {\&} Knowledge Management, Atlanta, GA, USA, October 17-21, 2022},
  pages        = {1491--1500},
  publisher    = {{ACM}},
  year         = {2022},
  url          = {https://doi.org/10.1145/3511808.3557231},
  doi          = {10.1145/3511808.3557231},
  timestamp    = {Wed, 19 Oct 2022 17:09:02 +0200},
  biburl       = {https://dblp.org/rec/conf/cikm/MacAvaneyTM22.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
ladr_proactive(k=16, *, hops=1, num_results=1000, dense_scorer=None, drop_query_vec=False, budget=False)

Returns a proactive LADR (Lexically-Accelerated Dense Retrieval) transformer.

Parameters:
  • k (int) – The number of neighbours in the corpus graph.

  • hops (int) – The number of hops to consider. Defaults to 1.

  • num_results (int) – The number of results to return per query.

  • dense_scorer (Transformer, optional) – The dense scorer to use. Defaults to np_scorer().

  • drop_query_vec (bool) – Whether to drop the query vector from the output.

  • budget (bool or int) – The maximum number of vectors to score. If False, no maximum is applied. If True, the budget is set to num_results. If an integer, this value is used as the budget.

Returns:

A proactive LADR transformer.

Return type:

Transformer
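
For example, a sketch with a bounded scoring budget (the first stage, encoder, and budget value are assumptions):

import pyterrier as pt
from pyterrier_dr import TasB, FlexIndex
index = FlexIndex('my_index')
bm25 = pt.BatchRetrieve.from_dataset('vaswani', 'terrier_stemmed', wmodel='BM25')
pipeline = bm25 >> TasB.dot() >> index.ladr_proactive(k=16, budget=5000)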

Citation

Kulkarni et al. Lexically-Accelerated Dense Retrieval. SIGIR 2023. [link]
@inproceedings{DBLP:conf/sigir/KulkarniMGF23,
  author       = {Hrishikesh Kulkarni and
                  Sean MacAvaney and
                  Nazli Goharian and
                  Ophir Frieder},
  editor       = {Hsin{-}Hsi Chen and
                  Wei{-}Jou (Edward) Duh and
                  Hen{-}Hsen Huang and
                  Makoto P. Kato and
                  Josiane Mothe and
                  Barbara Poblete},
  title        = {Lexically-Accelerated Dense Retrieval},
  booktitle    = {Proceedings of the 46th International {ACM} {SIGIR} Conference on
                  Research and Development in Information Retrieval, {SIGIR} 2023, Taipei,
                  Taiwan, July 23-27, 2023},
  pages        = {152--162},
  publisher    = {{ACM}},
  year         = {2023},
  url          = {https://doi.org/10.1145/3539618.3591715},
  doi          = {10.1145/3539618.3591715},
  timestamp    = {Fri, 21 Jul 2023 22:25:19 +0200},
  biburl       = {https://dblp.org/rec/conf/sigir/KulkarniMGF23.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
ladr_adaptive(k=16, *, depth=100, num_results=1000, dense_scorer=None, max_hops=None, drop_query_vec=False, budget=False)

Returns an adaptive LADR (Lexically-Accelerated Dense Retrieval) transformer.

Parameters:
  • k (int) – The number of neighbours in the corpus graph.

  • depth (int) – The depth of the ranked list to consider for convergence.

  • num_results (int) – The number of results to return per query.

  • dense_scorer (Transformer, optional) – The dense scorer to use. Defaults to np_scorer().

  • max_hops (int, optional) – The maximum number of hops to consider. Defaults to None (no limit).

  • drop_query_vec (bool) – Whether to drop the query vector from the output.

  • budget (bool or int) – The maximum number of vectors to score. If False, no maximum is applied. If True, the budget is set to num_results. If an integer, this value is used as the budget.

Returns:

An adaptive LADR transformer.

Return type:

Transformer
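
For example, a sketch that caps graph exploration with max_hops (parameter values are illustrative):

import pyterrier as pt
from pyterrier_dr import TasB, FlexIndex
index = FlexIndex('my_index')
bm25 = pt.BatchRetrieve.from_dataset('vaswani', 'terrier_stemmed', wmodel='BM25')
pipeline = bm25 >> TasB.dot() >> index.ladr_adaptive(k=16, depth=100, max_hops=2)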

Citation

Kulkarni et al. Lexically-Accelerated Dense Retrieval. SIGIR 2023. [link]
@inproceedings{DBLP:conf/sigir/KulkarniMGF23,
  author       = {Hrishikesh Kulkarni and
                  Sean MacAvaney and
                  Nazli Goharian and
                  Ophir Frieder},
  editor       = {Hsin{-}Hsi Chen and
                  Wei{-}Jou (Edward) Duh and
                  Hen{-}Hsen Huang and
                  Makoto P. Kato and
                  Josiane Mothe and
                  Barbara Poblete},
  title        = {Lexically-Accelerated Dense Retrieval},
  booktitle    = {Proceedings of the 46th International {ACM} {SIGIR} Conference on
                  Research and Development in Information Retrieval, {SIGIR} 2023, Taipei,
                  Taiwan, July 23-27, 2023},
  pages        = {152--162},
  publisher    = {{ACM}},
  year         = {2023},
  url          = {https://doi.org/10.1145/3539618.3591715},
  doi          = {10.1145/3539618.3591715},
  timestamp    = {Fri, 21 Jul 2023 22:25:19 +0200},
  biburl       = {https://dblp.org/rec/conf/sigir/KulkarniMGF23.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
mmr(*, Lambda=0.5, norm_rel=False, norm_sim=False, drop_doc_vec=True, verbose=False)

Returns an MMR (Maximal Marginal Relevance) scorer (i.e., re-ranker) over this index.

The method first loads vectors from the index and then applies MmrScorer to re-rank the results. See MmrScorer for more details on MMR.

Return type:

Transformer

Parameters:
  • Lambda – The balance parameter between relevance and diversity (default: 0.5)

  • norm_rel – Whether to normalize relevance scores to [0, 1] (default: False)

  • norm_sim – Whether to normalize similarity scores to [0, 1] (default: False)

  • drop_doc_vec – Whether to drop the ‘doc_vec’ column after re-ranking (default: True)

  • verbose – Whether to display verbose output (e.g., progress bars) (default: False)
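
For example, a sketch that diversifies retrieved results with MMR (Lambda is illustrative):

from pyterrier_dr import TasB, FlexIndex
index = FlexIndex('my_index')
# retrieve, then re-rank to balance relevance against redundancy
pipeline = TasB.dot() >> index.retriever() >> index.mmr(Lambda=0.5)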

Citation

Carbonell and Goldstein. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. SIGIR 1998. [link]
@inproceedings{DBLP:conf/sigir/CarbonellG98,
  author       = {Jaime G. Carbonell and
                  Jade Goldstein},
  editor       = {W. Bruce Croft and
                  Alistair Moffat and
                  C. J. van Rijsbergen and
                  Ross Wilkinson and
                  Justin Zobel},
  title        = {The Use of MMR, Diversity-Based Reranking for Reordering Documents
                  and Producing Summaries},
  booktitle    = {{SIGIR} '98: Proceedings of the 21st Annual International {ACM} {SIGIR}
                  Conference on Research and Development in Information Retrieval, August
                  24-28 1998, Melbourne, Australia},
  pages        = {335--336},
  publisher    = {{ACM}},
  year         = {1998},
  url          = {https://doi.org/10.1145/290941.291025},
  doi          = {10.1145/290941.291025},
  timestamp    = {Wed, 14 Nov 2018 10:58:11 +0100},
  biburl       = {https://dblp.org/rec/conf/sigir/CarbonellG98.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Evaluation

property ILS: Measure

Return an ILS (Intra-List Similarity) measure for this index. See pyterrier_dr.ILS() for more details.
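
For example, a sketch that reports ILS alongside a relevance measure (the dataset and pipeline are assumptions):

import pyterrier as pt
from pyterrier.measures import nDCG
from pyterrier_dr import TasB, FlexIndex
index = FlexIndex('my_index')
pipeline = TasB.dot() >> index.retriever()
dataset = pt.get_dataset('irds:vaswani')
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [nDCG@10, index.ILS],
)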

Index Data Access

These methods are for low-level index data access.

vec_loader()

Return a transformer that loads indexed vectors.

The returned transformer expects a DataFrame with a docno column. It outputs a frame that includes a column doc_vec, which contains the indexed vectors. For example:

Load vectors from a FlexIndex
import pandas as pd
from pyterrier_dr import FlexIndex
index = FlexIndex.from_hf('macavaney/msmarco-passage.tasb.flex')
loader = index.vec_loader()
loader(pd.DataFrame([
    {"docno": "5"},
    {"docno": "100"},
    {"docno": "74356"},
]))
# docno                                            doc_vec
#     5  [-0.09343405, 0.12045559, -0.25184962, 0.15029...
#   100  [-0.11527929, 0.63400555, -0.0877756, -0.26490...
# 74356  [0.15367049, 0.16049547, -0.012261144, -0.2588...
Returns:

A transformer that loads indexed vectors.

Return type:

Transformer

get_corpus_iter(start_idx=None, stop_idx=None, verbose=True)[source]

Iterate over the documents in the index.

Return type:

Iterable[Dict]

Parameters:
  • start_idx – The index of the first document to return (or None to start at the first document).

  • stop_idx – The index of the last document to return (or None to end on the last document).

  • verbose – Whether to display a progress bar.

Yields:

Dict[str,Any] – A dictionary with keys docno and doc_vec.
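
For example, a sketch that streams the first few indexed vectors:

from pyterrier_dr import FlexIndex
index = FlexIndex('my_index')
for record in index.get_corpus_iter(start_idx=0, stop_idx=3, verbose=False):
    print(record['docno'], record['doc_vec'][:4])  # docno and the first few dimensions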

np_vecs()

Return the indexed vectors.

Returns:

The indexed vectors as a memory-mapped numpy array.

Return type:

numpy.ndarray

torch_vecs(*, device=None, fp16=False)

Return the indexed vectors as a pytorch tensor.

Caution

This method loads the entire index into memory on the provided device. If the index is too large to fit in memory, consider using a different method that does not fully load the index into memory, like np_vecs() or get_corpus_iter().

Parameters:
  • device – The device to use for the tensor. If not provided, the default device is used (cuda if available, otherwise cpu).

  • fp16 – Whether to use half precision (fp16) for the tensor.

Returns:

The indexed vectors as a torch tensor.

Return type:

torch.Tensor

docnos()[source]

Return the document identifier (docno) lookup data structure.

Return type:

Lookup

Returns:

The document number lookup.

Return type:

npids.Lookup

corpus_graph(k=16, *, batch_size=8192)

Return the corpus graph (neighborhood graph) for the index.

The corpus graph is a directed graph where each node represents a document and each edge represents a connection between two documents. The graph is built by computing the cosine similarity between each pair of documents and storing the k-nearest neighbors for each document.

If the corpus graph has not been built yet, it will be built using the given k and batch size.

Parameters:
  • k – The number of neighbors to store for each document.

  • batch_size – The number of vectors to process in each batch.

Returns:

The corpus graph for the index.

Return type:

pyterrier_adaptive.CorpusGraph
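
For example, a sketch that builds (or loads, if already cached) the corpus graph:

from pyterrier_dr import FlexIndex
index = FlexIndex('my_index')
graph = index.corpus_graph(k=16)  # a pyterrier_adaptive.CorpusGraph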

faiss_hnsw_graph(neighbours=32, *, ef_construction=40)

Returns the (approximate) HNSW graph structure created by the HNSW index.

If the graph structure does not already exist, it is created and cached to disk.

Parameters:
  • neighbours – The number of neighbours of the constructed neighborhood graph

  • ef_construction – The number of neighbours to consider during construction

Returns:

The HNSW graph structure

Return type:

pyterrier_adaptive.CorpusGraph

Note

This function requires the faiss package to be installed.

Citation

Douze et al. The Faiss library. arXiv 2024. [link]
@article{DBLP:journals/corr/abs-2401-08281,
  author       = {Matthijs Douze and
                  Alexandr Guzhva and
                  Chengqi Deng and
                  Jeff Johnson and
                  Gergely Szilvasy and
                  Pierre{-}Emmanuel Mazar{\'{e}} and
                  Maria Lomeli and
                  Lucas Hosseini and
                  Herv{\'{e}} J{\'{e}}gou},
  title        = {The Faiss library},
  journal      = {CoRR},
  volume       = {abs/2401.08281},
  year         = {2024},
  url          = {https://doi.org/10.48550/arXiv.2401.08281},
  doi          = {10.48550/ARXIV.2401.08281},
  eprinttype    = {arXiv},
  eprint       = {2401.08281},
  timestamp    = {Thu, 01 Feb 2024 15:35:36 +0100},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2401-08281.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Extras

built()[source]

Check if the index has been built.

Returns:

True if the index has been built, otherwise False.

Return type:

bool

classmethod from_hf(repo)

Loads the index from HuggingFace Hub.

Parameters:

repo – The repository name to download from.

Returns:

A FlexIndex object.
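
For example, load a pre-built index from HuggingFace Hub:

from pyterrier_dr import FlexIndex
index = FlexIndex.from_hf('macavaney/msmarco-passage.tasb.flex')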

to_hf(repo)

Uploads the index to HuggingFace Hub.

Parameters:

repo – The repository name to upload to.