Indexing & Retrieval¶
This page covers the indexing and retrieval functionality provided by pyterrier_dr. FlexIndex provides a flexible way to index and retrieve documents using dense vectors, and is the main class for indexing and retrieval.
API Documentation¶
- class pyterrier_dr.FlexIndex(path, *, sim_fn=SimFn.dot, verbose=True)[source]¶
-
Represents a FLexible EXecution (FLEX) Index, which is a dense index format.
FLEX allows for a variety of retrieval implementations (NumPy, FAISS, etc.) and algorithms (exhaustive, HNSW, etc.) to be tested. In most cases, the same vector storage can be used across implementations and algorithms, saving considerably on disk space.
- Parameters:
path – The path to the index directory
sim_fn – The similarity function to use
verbose – Whether to display verbose output (e.g., progress bars)
Indexing¶
Basic indexing functionality is provided through index(). For more advanced options, use indexer().
- index(inp)[source]¶
Index the given input data stream to a new index at this location.
Each record in inp is expected to be a dictionary containing at least two keys: docno (a unique document identifier) and doc_vec (a dense vector representation of the document).
Typically this method will be used in a pipeline of operations, where the input data is first transformed by a document encoder to add the doc_vec values before it is indexed. For example:
    from pyterrier_dr import TasB, FlexIndex
    encoder = TasB.dot()
    index = FlexIndex('my_index')
    pipeline = encoder >> index
    pipeline.index([
        {'docno': 'doc1', 'text': 'hello'},
        {'docno': 'doc2', 'text': 'world'},
    ])
- Parameters:
inp – An iterable of dictionaries to index.
- Returns:
A reference back to this index (self).
- Return type:
Artifact
- Raises:
RuntimeError – If the index is already built.
- indexer(*, mode=IndexingMode.create)[source]¶
Return an indexer for this index with the specified options.
This transformer gives more fine-grained control over the indexing process, allowing you to specify whether to create a new index or overwrite an existing one.
Similar to index(), this method will typically be used in a pipeline of operations, where the input data is first transformed by a document encoder to add the doc_vec values before it is indexed. For example:
    from pyterrier_dr import TasB, FlexIndex
    encoder = TasB.dot()
    index = FlexIndex('my_index')
    pipeline = encoder >> index.indexer(mode='overwrite')
    pipeline.index([
        {'docno': 'doc1', 'text': 'hello'},
        {'docno': 'doc2', 'text': 'world'},
    ])
- Parameters:
mode – The indexing mode to use (create or overwrite).
- Returns:
A new indexer instance.
- Return type:
FlexIndexer
Retrieval¶
FlexIndex provides a variety of retriever backends. Each one expects qid and query_vec columns as input, and outputs a result frame. If you have no preference for a particular backend, you can use retriever() (an alias to np_retriever()), which performs exact retrieval using a brute-force search over all vectors.
- retriever(*, num_results=1000)¶
Returns a transformer that performs basic exact retrieval over indexed vectors using a brute force search. An alias to np_retriever().
- np_retriever(*, num_results=1000, batch_size=None, drop_query_vec=False)¶
Return a retriever that uses numpy to perform a brute force search over the index.
The returned transformer expects a DataFrame with columns qid and query_vec. It outputs a result frame containing the retrieved results.
- Parameters:
num_results – The number of results to return per query.
batch_size – The number of documents to score in each batch.
drop_query_vec – Whether to drop the query vector from the output.
- Returns:
A retriever that uses numpy to perform a brute force search.
- Return type:
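The brute-force search described above can be illustrated with plain NumPy. This is a minimal sketch of the idea, not the library's implementation: it assumes dot-product similarity, and the function name brute_force_search and the toy vectors are hypothetical.

```python
import numpy as np

def brute_force_search(doc_vecs, docnos, query_vec, num_results=1000):
    """Score every document against one query by dot product and keep the best."""
    scores = doc_vecs @ query_vec                   # one score per indexed document
    k = min(num_results, len(scores))
    top = np.argsort(-scores, kind="stable")[:k]    # highest score first
    return [(docnos[i], float(scores[i])) for i in top]

# Toy data (hypothetical): three 2-dimensional document vectors.
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
docnos = ["doc1", "doc2", "doc3"]
results = brute_force_search(doc_vecs, docnos, np.array([1.0, 0.2]), num_results=2)
```

Because every vector is scored, the result is exact; the cost grows linearly with the number of indexed documents.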
- torch_retriever(*, num_results=1000, device=None, fp16=False, qbatch=64, drop_query_vec=False)¶
Return a retriever that uses pytorch to perform brute-force retrieval over the indexed vectors.
The returned pyterrier.Transformer expects a DataFrame with columns qid and query_vec.
Caution
This method loads the entire index into memory on the provided device. If the index is too large to fit in memory, consider using a different retriever that does not fully load the index into memory, like np_retriever().
- Parameters:
num_results – The number of results to return per query.
device – The device to use for scoring. If not provided, the default device is used (cuda if available, otherwise cpu).
fp16 – Whether to use half precision (fp16) for scoring.
qbatch – The number of queries to score in each batch.
drop_query_vec – Whether to drop the query vector from the output.
- Returns:
A transformer that retrieves using pytorch.
- Return type:
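To judge whether an index will fit on a device, a rough estimate of its size can help. The sketch below assumes vectors occupy 4 bytes per value at full precision and 2 bytes with fp16; the helper name index_memory_bytes and the example corpus size and dimensionality are hypothetical, and real usage will include some overhead beyond the raw vectors.

```python
def index_memory_bytes(num_docs, dim, fp16=False):
    """Approximate bytes needed to hold num_docs vectors of dimension dim."""
    bytes_per_value = 2 if fp16 else 4   # assumption: float16 vs float32 storage
    return num_docs * dim * bytes_per_value

# Hypothetical corpus: one million documents with 768-dimensional vectors.
full = index_memory_bytes(1_000_000, 768)
half = index_memory_bytes(1_000_000, 768, fp16=True)
print(f"{full / 2**30:.2f} GiB at full precision, {half / 2**30:.2f} GiB with fp16")
```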
- faiss_flat_retriever(*, gpu=False, qbatch=64, drop_query_vec=False)¶
Returns a retriever that uses FAISS to perform brute-force search over the indexed vectors.
- Parameters:
gpu – Whether to load the index onto GPU for scoring
qbatch – The batch size during search
drop_query_vec – Whether to drop the query vector from the output
- Returns:
A retriever that uses FAISS to perform brute-force search over the indexed vectors
- Return type:
Note
This transformer requires the faiss package to be installed.
Citation
Douze et al. The Faiss library. arXiv 2024.
@article{DBLP:journals/corr/abs-2401-08281, author = {Matthijs Douze and Alexandr Guzhva and Chengqi Deng and Jeff Johnson and Gergely Szilvasy and Pierre{-}Emmanuel Mazar{\'{e}} and Maria Lomeli and Lucas Hosseini and Herv{\'{e}} J{\'{e}}gou}, title = {The Faiss library}, journal = {CoRR}, volume = {abs/2401.08281}, year = {2024}, url = {https://doi.org/10.48550/arXiv.2401.08281}, doi = {10.48550/ARXIV.2401.08281}, eprinttype = {arXiv}, eprint = {2401.08281}, timestamp = {Thu, 01 Feb 2024 15:35:36 +0100}, biburl = {https://dblp.org/rec/journals/corr/abs-2401-08281.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- faiss_hnsw_retriever(neighbours=32, *, num_results=1000, ef_construction=40, ef_search=16, cache=True, search_bounded_queue=True, qbatch=64, drop_query_vec=False)¶
Returns a retriever that uses FAISS over a HNSW index.
Creates the HNSW graph structure if it does not already exist. When cache=True (default), this graph structure is cached to disk for subsequent use.
- Parameters:
neighbours – The number of neighbours of the constructed neighborhood graph
num_results – The number of results to return per query
ef_construction – The number of neighbours to consider during construction
ef_search – The number of neighbours to consider during search
cache – Whether to cache the index to disk
search_bounded_queue – Whether to use a bounded queue during search
qbatch – The batch size during search
drop_query_vec – Whether to drop the query vector from the output
- Returns:
A retriever that uses FAISS over a HNSW index
- Return type:
Note
This transformer requires the faiss package to be installed.
Citation
Douze et al. The Faiss library. arXiv 2024.
- faiss_ivf_retriever(*, num_results=1000, train_sample=None, n_list=None, cache=True, n_probe=1, drop_query_vec=False)¶
Returns a retriever that uses FAISS over an IVF index.
If the IVF structure does not already exist, it is created and, when cache=True (the default), cached to disk.
- Parameters:
num_results – The number of results to return per query
train_sample – The number of training samples to use for training the index. If not provided, a default value is used (approximately the square root of the number of documents).
n_list – The number of posting lists to use for the index. If not provided, a default value is used (approximately train_sample/39).
cache – Whether to cache the index to disk.
n_probe – The number of posting lists to probe during search. The higher the value, the better the approximation will be, but the longer it will take.
drop_query_vec – Whether to drop the query vector from the output.
- Returns:
A retriever that uses FAISS over an IVF index
- Return type:
Note
This transformer requires the faiss package to be installed.
Citation
Douze et al. The Faiss library. arXiv 2024.
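The effect of n_probe can be seen in a toy inverted-file search written in NumPy. This is only a sketch of the general IVF idea, not what FAISS does internally: the centroids here are randomly chosen documents rather than trained with k-means, similarity is assumed to be the dot product, and the names build_lists and ivf_search are hypothetical.

```python
import numpy as np

def build_lists(doc_vecs, centroids):
    """Assign each document to the posting list of its best-matching centroid."""
    assignment = np.argmax(doc_vecs @ centroids.T, axis=1)
    return [np.flatnonzero(assignment == c) for c in range(len(centroids))]

def ivf_search(doc_vecs, centroids, lists, query_vec, n_probe=1, num_results=10):
    """Score only documents in the n_probe lists whose centroids best match the query."""
    probe = np.argsort(-(centroids @ query_vec))[:n_probe]
    candidates = np.concatenate([lists[c] for c in probe])
    scores = doc_vecs[candidates] @ query_vec
    order = np.argsort(-scores)[:num_results]
    return candidates[order]

rng = np.random.default_rng(0)
doc_vecs = rng.standard_normal((200, 8))
# Assumption: pick 4 random documents as centroids instead of training them.
centroids = doc_vecs[rng.choice(200, size=4, replace=False)]
lists = build_lists(doc_vecs, centroids)
query = rng.standard_normal(8)

approx = ivf_search(doc_vecs, centroids, lists, query, n_probe=1)  # fast, may miss results
exact = ivf_search(doc_vecs, centroids, lists, query, n_probe=4)   # probes every list
```

Probing every list visits all documents, so the result matches an exhaustive search; smaller n_probe values score fewer documents at the risk of missing some of the true top results.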
Returns a retriever that searches over a flatnav index.
- Return type:
- Parameters:
k (int) – the maximum number of edges per document in the index
ef_search (int) – the size of the list during searches. Higher values are slower but more accurate.
num_initializations (int) – the number of random initializations to use during search.
ef_construction (int) – the size of the list during graph construction. Higher values are slower but more accurate.
threads (int) – the number of threads to use
num_results (int) – the number of results to return per query
cache (bool) – whether to cache the index to disk
qbatch (int) – the number of queries to search at once
drop_query_vec (bool) – whether to drop the query_vec column after retrieval
verbose (bool) – whether to show progress bars
Added in version 0.4.0.
Changed in version 0.4.1: fixed bug with num_initializations
Note
This transformer requires the flatnav package to be installed. Instructions are available in the flatnav repository.
Citation
Munyampirwa et al. Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs". arXiv 2024.
- scann_retriever(*, n_leaves=None, leaves_to_search=1, num_results=1000, train_sample=None, drop_query_vec=False)¶
Returns a retriever over a ScaNN (Scalable Nearest Neighbors) index.
- Parameters:
n_leaves (int, optional) – Number of leaves in the ScaNN index. Defaults to approximately sqrt(doc_count).
leaves_to_search (int, optional) – Number of leaves to search. Defaults to 1. The higher the value, the more accurate the search.
num_results (int, optional) – Number of results to return. Defaults to 1000.
train_sample (int, optional) – Number of training samples. Defaults to n_leaves*39.
drop_query_vec (bool, optional) – Whether to drop the query vector from the output.
- Returns:
A transformer that retrieves using ScaNN.
- Return type:
Note
This method requires the scann package. Install it via pip install scann.
Citation
Guo et al. Accelerating Large-Scale Inference with Anisotropic Vector Quantization. ICML 2020.
@inproceedings{DBLP:conf/icml/GuoSLGSCK20, author = {Ruiqi Guo and Philip Sun and Erik Lindgren and Quan Geng and David Simcha and Felix Chern and Sanjiv Kumar}, title = {Accelerating Large-Scale Inference with Anisotropic Vector Quantization}, booktitle = {Proceedings of the 37th International Conference on Machine Learning, {ICML} 2020, 13-18 July 2020, Virtual Event}, series = {Proceedings of Machine Learning Research}, volume = {119}, pages = {3887--3896}, publisher = {{PMLR}}, year = {2020}, url = {http://proceedings.mlr.press/v119/guo20h.html}, timestamp = {Tue, 15 Dec 2020 17:40:18 +0100}, biburl = {https://dblp.org/rec/conf/icml/GuoSLGSCK20.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
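The documented defaults for n_leaves and train_sample can be worked through for a given corpus size. The sketch below is arithmetic only; the helper name scann_defaults is hypothetical, and rounding up with ceil is an assumption, since the documentation only says "approximately sqrt(doc_count)".

```python
import math

def scann_defaults(doc_count):
    """Derive the documented default n_leaves and train_sample for a corpus size."""
    n_leaves = math.ceil(math.sqrt(doc_count))  # assumption: round up
    train_sample = n_leaves * 39                # documented default: n_leaves*39
    return n_leaves, train_sample

print(scann_defaults(1_000_000))  # (1000, 39000)
```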
- voyager_retriever(neighbours=12, *, num_results=1000, ef_construction=200, random_seed=1, storage_data_type='float32', query_ef=10, drop_query_vec=False)¶
Returns a retriever that uses HNSW to search over a Voyager index.
- Return type:
- Parameters:
neighbours (int, optional) – Number of neighbours to search. Defaults to 12.
num_results (int, optional) – Number of results to return per query. Defaults to 1000.
ef_construction (int, optional) – Expansion factor for graph construction. Defaults to 200.
random_seed (int, optional) – Random seed. Defaults to 1.
storage_data_type (str, optional) – Storage data type. One of ‘float32’, ‘float8’, ‘e4m3’. Defaults to ‘float32’.
query_ef (int, optional) – Expansion factor during querying. Defaults to 10.
drop_query_vec (bool, optional) – Drop the query vector from the output. Defaults to False.
- Returns:
A retriever that uses HNSW to search over a Voyager index.
- Return type:
Note
This method requires the voyager package. Install it via pip install voyager.
Re-Ranking¶
Results can be re-ranked with indexed vectors using scorer(). (np_scorer() and torch_scorer() are available as specific implementations, if needed.)
gar(), ladr_proactive(), and ladr_adaptive() are adaptive re-ranking approaches that pull in other documents from the corpus that may be relevant.
- scorer()¶
An alias to np_scorer().
- np_scorer(*, num_results=None)¶
Return a scorer that uses numpy to score (re-rank) results using indexed vectors.
The returned transformer expects a DataFrame with columns qid, query_vec and docno. (If an internal docid column is provided, this will be used to speed up vector lookups.)
This method uses memory-mapping to avoid loading the entire index into memory at once.
- Parameters:
num_results – The number of results to return per query. If not provided, all results from the original frame are returned.
- Returns:
A transformer that scores query vectors with numpy.
- Return type:
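Re-ranking differs from retrieval in that only the candidate documents are scored. A minimal NumPy sketch of that idea follows; it assumes dot-product similarity and that candidates are identified by row position in the vector matrix, and the function name rerank is hypothetical.

```python
import numpy as np

def rerank(doc_vecs, docids, query_vec, num_results=None):
    """Re-score candidate documents (by row index) and sort by descending score."""
    docids = np.asarray(docids)
    scores = doc_vecs[docids] @ query_vec          # only candidate rows are read
    order = np.argsort(-scores, kind="stable")
    if num_results is not None:
        order = order[:num_results]                # optionally truncate the list
    return docids[order], scores[order]

# Toy data (hypothetical): four indexed vectors, three candidates for one query.
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
ids, scores = rerank(doc_vecs, [3, 1, 2], np.array([0.0, 1.0]))
```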
- torch_scorer(*, num_results=None, device=None, fp16=False)¶
Return a scorer that uses pytorch to score (re-rank) results using indexed vectors.
The returned pyterrier.Transformer expects a DataFrame with columns qid, query_vec and docno. (If an internal docid column is provided, this will be used to speed up vector lookups.)
Caution
This method loads the entire index into memory on the provided device. If the index is too large to fit in memory, consider using a different scorer that does not fully load the index into memory, like np_scorer().
- Parameters:
num_results – The number of results to return per query. If not provided, all results from the original frame are returned.
device – The device to use for scoring. If not provided, the default device is used (cuda if available, otherwise cpu).
fp16 – Whether to use half precision (fp16) for scoring.
- Returns:
A transformer that scores query vectors with pytorch.
- Return type:
- gar(k=16, *, batch_size=128, num_results=1000)¶
Returns a retriever that uses a corpus graph to search over a FlexIndex.
- Return type:
- Parameters:
k (int) – Number of neighbours in the corpus graph. Defaults to 16.
batch_size (int) – Batch size for retrieval. Defaults to 128.
num_results (int) – Number of results per query to return. Defaults to 1000.
- Returns:
A retriever that uses a corpus graph to search over a FlexIndex.
- Return type:
Citation
MacAvaney et al. Adaptive Re-Ranking with a Corpus Graph. CIKM 2022.
@inproceedings{DBLP:conf/cikm/MacAvaneyTM22, author = {Sean MacAvaney and Nicola Tonellotto and Craig Macdonald}, editor = {Mohammad Al Hasan and Li Xiong}, title = {Adaptive Re-Ranking with a Corpus Graph}, booktitle = {Proceedings of the 31st {ACM} International Conference on Information {\&} Knowledge Management, Atlanta, GA, USA, October 17-21, 2022}, pages = {1491--1500}, publisher = {{ACM}}, year = {2022}, url = {https://doi.org/10.1145/3511808.3557231}, doi = {10.1145/3511808.3557231}, timestamp = {Wed, 19 Oct 2022 17:09:02 +0200}, biburl = {https://dblp.org/rec/conf/cikm/MacAvaneyTM22.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- ladr_proactive(k=16, *, hops=1, num_results=1000, dense_scorer=None, drop_query_vec=False, budget=False)¶
Returns a proactive LADR (Lexically-Accelerated Dense Retrieval) transformer.
- Return type:
- Parameters:
k (int) – The number of neighbours in the corpus graph.
hops (int) – The number of hops to consider. Defaults to 1.
num_results (int) – The number of results to return per query.
dense_scorer (Transformer, optional) – The dense scorer to use. Defaults to np_scorer().
drop_query_vec (bool) – Whether to drop the query vector from the output.
budget (bool or int) – The maximum number of vectors to score. If False, no maximum is applied. If True, the budget is set to num_results. If an integer, this value is used as the budget.
- Returns:
A proactive LADR transformer.
- Return type:
Citation
Kulkarni et al. Lexically-Accelerated Dense Retrieval. SIGIR 2023.
@inproceedings{DBLP:conf/sigir/KulkarniMGF23, author = {Hrishikesh Kulkarni and Sean MacAvaney and Nazli Goharian and Ophir Frieder}, editor = {Hsin{-}Hsi Chen and Wei{-}Jou (Edward) Duh and Hen{-}Hsen Huang and Makoto P. Kato and Josiane Mothe and Barbara Poblete}, title = {Lexically-Accelerated Dense Retrieval}, booktitle = {Proceedings of the 46th International {ACM} {SIGIR} Conference on Research and Development in Information Retrieval, {SIGIR} 2023, Taipei, Taiwan, July 23-27, 2023}, pages = {152--162}, publisher = {{ACM}}, year = {2023}, url = {https://doi.org/10.1145/3539618.3591715}, doi = {10.1145/3539618.3591715}, timestamp = {Fri, 21 Jul 2023 22:25:19 +0200}, biburl = {https://dblp.org/rec/conf/sigir/KulkarniMGF23.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
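The three accepted forms of the budget parameter can be summarised in a few lines. This sketch only mirrors the documented behaviour; the helper name resolve_budget is hypothetical and is not part of the library. Note that the bool checks use identity, because Python treats True and False as the integers 1 and 0.

```python
def resolve_budget(budget, num_results):
    """Map the documented budget options to a maximum vector count (None = no limit)."""
    if budget is False:
        return None           # no maximum is applied
    if budget is True:
        return num_results    # budget equals the number of results requested
    return int(budget)        # an explicit integer budget

print(resolve_budget(False, 1000), resolve_budget(True, 1000), resolve_budget(5000, 1000))
```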
- ladr_adaptive(k=16, *, depth=100, num_results=1000, dense_scorer=None, max_hops=None, drop_query_vec=False, budget=False)¶
Returns an adaptive LADR (Lexically-Accelerated Dense Retrieval) transformer.
- Return type:
- Parameters:
k (int) – The number of neighbours in the corpus graph.
depth (int) – The depth of the ranked list to consider for convergence.
num_results (int) – The number of results to return per query.
dense_scorer (Transformer, optional) – The dense scorer to use. Defaults to np_scorer().
max_hops (int, optional) – The maximum number of hops to consider. Defaults to None (no limit).
drop_query_vec (bool) – Whether to drop the query vector from the output.
budget (bool or int) – The maximum number of vectors to score. If False, no maximum is applied. If True, the budget is set to num_results. If an integer, this value is used as the budget.
- Returns:
An adaptive LADR transformer.
- Return type:
Citation
Kulkarni et al. Lexically-Accelerated Dense Retrieval. SIGIR 2023.
- mmr(*, Lambda=0.5, norm_rel=False, norm_sim=False, drop_doc_vec=True, verbose=False)¶
Returns an MMR (Maximal Marginal Relevance) scorer (i.e., re-ranker) over this index.
The method first loads vectors from the index and then applies MmrScorer to re-rank the results. See MmrScorer for more details on MMR.
- Return type:
- Parameters:
Lambda – The balance parameter between relevance and diversity (default: 0.5)
norm_rel – Whether to normalize relevance scores to [0, 1] (default: False)
norm_sim – Whether to normalize similarity scores to [0, 1] (default: False)
drop_doc_vec – Whether to drop the ‘doc_vec’ column after re-ranking (default: True)
verbose – Whether to display verbose output (e.g., progress bars) (default: False)
Citation
Carbonell and Goldstein. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. SIGIR 1998.
@inproceedings{DBLP:conf/sigir/CarbonellG98, author = {Jaime G. Carbonell and Jade Goldstein}, editor = {W. Bruce Croft and Alistair Moffat and C. J. van Rijsbergen and Ross Wilkinson and Justin Zobel}, title = {The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries}, booktitle = {{SIGIR} '98: Proceedings of the 21st Annual International {ACM} {SIGIR} Conference on Research and Development in Information Retrieval, August 24-28 1998, Melbourne, Australia}, pages = {335--336}, publisher = {{ACM}}, year = {1998}, url = {https://doi.org/10.1145/290941.291025}, doi = {10.1145/290941.291025}, timestamp = {Wed, 14 Nov 2018 10:58:11 +0100}, biburl = {https://dblp.org/rec/conf/sigir/CarbonellG98.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
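How the balance parameter trades relevance against diversity can be seen in a small greedy MMR loop. This sketch follows the standard formulation from Carbonell and Goldstein and may differ in detail from MmrScorer; the function name mmr_order, the parameter name lam, and the toy scores are hypothetical.

```python
import numpy as np

def mmr_order(rel, sim, lam=0.5):
    """Greedy MMR: pick the document maximising
    lam * relevance - (1 - lam) * (max similarity to already-selected documents)."""
    selected = []
    remaining = list(range(len(rel)))
    while remaining:
        def mmr_score(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * rel[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (hypothetical): documents 0 and 1 are near-duplicates.
rel = [0.9, 0.85, 0.3]
sim = np.array([[1.0, 0.95, 0.1],
                [0.95, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
order = mmr_order(rel, sim, lam=0.5)
```

With lam=0.5 the near-duplicate document 1 is pushed below the less relevant but dissimilar document 2; with lam=1.0 the ordering reduces to relevance alone.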
Evaluation¶
- property ILS: Measure¶
Return an ILS (Intra-List Similarity) measure for this index. See pyterrier_dr.ILS() for more details.
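As a rough illustration of what an intra-list similarity measures, the sketch below averages the pairwise cosine similarity of the vectors in one result list. This is an assumed, commonly used definition and may not match pyterrier_dr.ILS() exactly; the function name intra_list_similarity and the choice to return 0.0 for lists with fewer than two documents are hypothetical.

```python
from itertools import combinations

import numpy as np

def intra_list_similarity(vecs):
    """Mean pairwise cosine similarity of the vectors in a result list."""
    vecs = np.asarray(vecs, dtype=float)
    if len(vecs) < 2:
        return 0.0  # assumption: no pairs means no measurable redundancy
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    pairs = combinations(range(len(unit)), 2)
    return float(np.mean([unit[i] @ unit[j] for i, j in pairs]))

diverse = intra_list_similarity([[1.0, 0.0], [0.0, 1.0]])    # orthogonal vectors
redundant = intra_list_similarity([[1.0, 0.0], [2.0, 0.0]])  # same direction
```

Lower values indicate a more diverse result list under this definition.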
Index Data Access¶
These methods are for low-level index data access.
- vec_loader()¶
Return a transformer that loads indexed vectors.
The returned transformer expects a DataFrame with a docno column. It outputs a frame that includes a column doc_vec, which contains the indexed vectors. For example:
    import pandas as pd
    from pyterrier_dr import FlexIndex
    index = FlexIndex.from_hf('macavaney/msmarco-passage.tasb.flex')
    loader = index.vec_loader()
    loader(pd.DataFrame([
        {"docno": "5"},
        {"docno": "100"},
        {"docno": "74356"},
    ]))
    # docno doc_vec
    # 5     [-0.09343405, 0.12045559, -0.25184962, 0.15029...
    # 100   [-0.11527929, 0.63400555, -0.0877756, -0.26490...
    # 74356 [0.15367049, 0.16049547, -0.012261144, -0.2588...
- Returns:
A transformer that loads indexed vectors.
- Return type:
- get_corpus_iter(start_idx=None, stop_idx=None, verbose=True)[source]¶
Iterate over the documents in the index.
- Return type:
Iterable[Dict]
- Parameters:
start_idx – The index of the first document to return (or None to start at the first document).
stop_idx – The index of the last document to return (or None to end on the last document).
verbose – Whether to display a progress bar.
- Yields:
Dict[str, Any] – A dictionary with keys docno and doc_vec.
- np_vecs()¶
Return the indexed vectors.
- Returns:
The indexed vectors as a memory-mapped numpy array.
- Return type:
numpy.ndarray
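Memory-mapping lets an array on disk be indexed without first reading the whole file. The following sketch shows the mechanism with np.memmap on a temporary file; the file name, raw float32 layout, and array shape are hypothetical and do not describe the actual FLEX storage format.

```python
import os
import tempfile

import numpy as np

num_docs, dim = 1000, 16
path = os.path.join(tempfile.mkdtemp(), "vecs.f4")  # hypothetical file name

# Write some random vectors to disk as raw float32 values.
rng = np.random.default_rng(0)
rng.standard_normal((num_docs, dim)).astype(np.float32).tofile(path)

# Map the file read-only; data is paged in only when rows are accessed.
vecs = np.memmap(path, dtype=np.float32, mode="r", shape=(num_docs, dim))
subset = np.asarray(vecs[[5, 100, 743]])  # copies just these three rows into memory
```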
- torch_vecs(*, device=None, fp16=False)¶
Return the indexed vectors as a pytorch tensor.
Caution
This method loads the entire index into memory on the provided device. If the index is too large to fit in memory, consider using a different method that does not fully load the index into memory, like np_vecs() or get_corpus_iter().
- Parameters:
device – The device to use for the tensor. If not provided, the default device is used (cuda if available, otherwise cpu).
fp16 – Whether to use half precision (fp16) for the tensor.
- Returns:
The indexed vectors as a torch tensor.
- Return type:
torch.Tensor
- docnos()[source]¶
Return the document identifier (docno) lookup data structure.
- Returns:
The document number lookup.
- Return type:
npids.Lookup
- corpus_graph(k=16, *, batch_size=8192)¶
Return the corpus graph (neighborhood graph) for the index.
The corpus graph is a directed graph where each node represents a document and each edge represents a connection between two documents. The graph is built by computing the cosine similarity between each pair of documents and storing the k-nearest neighbors for each document.
If the corpus graph has not been built yet, it will be built using the given k and batch size.
- Parameters:
k – The number of neighbors to store for each document.
batch_size – The number of vectors to process in each batch.
- Returns:
The corpus graph for the index.
- Return type:
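The construction described above (cosine similarity between documents, keeping the k nearest neighbors, processed in batches) can be sketched in NumPy as follows. This is an illustrative version only: the function name knn_graph is hypothetical, excluding each document from its own neighbor list is an assumption, and the quadratic cost makes it suitable only for small collections.

```python
import numpy as np

def knn_graph(doc_vecs, k=16, batch_size=8192):
    """Return an (n, k) array holding the k most cosine-similar documents per document."""
    unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    n = len(unit)
    neighbours = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        sims = unit[start:stop] @ unit.T   # (batch, n) cosine similarities
        # Assumption: a document is not its own neighbor, so mask the diagonal.
        sims[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        neighbours[start:stop] = np.argsort(-sims, axis=1)[:, :k]
    return neighbours

rng = np.random.default_rng(0)
graph = knn_graph(rng.standard_normal((100, 8)), k=4, batch_size=32)
```

Batching bounds the size of the similarity matrix held in memory at any one time to batch_size rows.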
- faiss_hnsw_graph(neighbours=32, *, ef_construction=40)¶
Returns the (approximate) HNSW graph structure created by the HNSW index.
If the graph structure does not already exist, it is created and cached to disk.
- Parameters:
neighbours – The number of neighbours of the constructed neighborhood graph
ef_construction – The number of neighbours to consider during construction
- Returns:
The HNSW graph structure
- Return type:
Note
This function requires the faiss package to be installed.
Citation
Douze et al. The Faiss library. arXiv 2024.
Extras¶
- built()[source]¶
Check if the index has been built.
- Returns:
True if the index has been built, otherwise False.
- Return type:
bool
- classmethod from_hf(repo)¶
Loads the index from HuggingFace Hub.
- Parameters:
repo – The repository name to download from.
- Returns:
A FlexIndex object.
- to_hf(repo)¶
Uploads the index to HuggingFace Hub.
- Parameters:
repo – The repository name to upload to.