Dense Retrieval Overview
pyterrier-dr lets you construct single-vector dense indexing and retrieval pipelines. These pipelines retrieve documents by semantic matching rather than the lexical matching used by traditional retrieval methods like BM25.
These processes involve two main components: Dense Models and Dense Indexes. These components are typically combined to create Indexing Pipelines and Retrieval Pipelines.
Dense Models
Dense Models are transformers that encode text (queries and documents) into dense vectors. This package provides various pretrained dense models, as well as the ability to load models from Sentence Transformers and HuggingFace.
from pyterrier_dr import SBertBiEncoder
model = SBertBiEncoder('sentence-transformers/all-MiniLM-L6-v2') # [1]
[1] You can replace this with any Sentence Transformer model name or path.
A dense model can perform several operations:
Encode queries into dense vectors using its query_encoder().
Encode documents into dense vectors using its doc_encoder().
Re-rank results by encoding queries and documents and computing their similarity scores using its text_scorer().
See also
More information about dense encoders is available on the Dense Encoding page.
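The three operations listed above can be illustrated with a toy sketch in plain Python. This is not how pyterrier-dr or any transformer model works internally: `toy_encode`, `score`, and `toy_text_scorer` are hypothetical names, and the trigram-hashing encoder is a stand-in that captures no semantics. It only shows the data flow of a bi-encoder: text becomes a fixed-length vector, and re-ranking compares a query vector with document vectors.

```python
import math
import zlib

DIM = 16  # toy dimensionality (an assumption); real models use hundreds of dimensions


def toy_encode(text, dim=DIM):
    """Hypothetical stand-in for a transformer encoder.

    Hashes character trigrams into a fixed-length, unit-normalised vector.
    """
    vec = [0.0] * dim
    padded = f"  {text.lower()} "
    for i in range(len(padded) - 2):
        bucket = zlib.crc32(padded[i:i + 3].encode("utf-8")) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def score(query_vec, doc_vec):
    """Dot product, which equals cosine similarity for unit-length vectors."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))


def toy_text_scorer(query, docs):
    """Re-rank (docno, text) pairs for one query, highest score first."""
    query_vec = toy_encode(query)
    scored = [(docno, score(query_vec, toy_encode(text))) for docno, text in docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = toy_text_scorer("scottie dog", [
    ("doc1", "A dog is a domesticated carnivorous mammal."),
    ("doc2", "Dogs are known for their loyalty and companionship."),
])
```

Queries and documents pass through the same encoder here. A real dense model may encode them differently, which is one reason the package exposes separate query_encoder() and doc_encoder() methods.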
Dense Indexes
Dense Indexes are data structures and algorithms to index and retrieve documents using dense vectors. This package provides
FlexIndex, which stores document vectors on disk and provides various retrieval backends.
from pyterrier_dr import FlexIndex
index = FlexIndex('path/to/index.flex') # [1]
[1] You can specify any path where you want the index to be stored. By convention, we use the .flex extension, but this is not required.
A dense index can perform several operations:
Index documents using its indexer().
Retrieve documents using methods like retriever(), faiss_hnsw_retriever(), and more.
Re-rank results using its stored vectors using methods like scorer(), ladr_adaptive(), and more.
See also
More information about dense indexes is available on the Dense Indexing & Retrieval page.
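To make the three index operations concrete, here is a minimal in-memory sketch. `ToyVectorIndex` and its methods are hypothetical and are not FlexIndex or its API: FlexIndex stores vectors on disk and offers several retrieval backends, while this toy keeps vectors in a dictionary and performs only exact (brute-force) dot-product search.

```python
import heapq


class ToyVectorIndex:
    """Hypothetical in-memory stand-in for a dense index (not FlexIndex)."""

    def __init__(self):
        self.vectors = {}  # docno -> document vector

    def index(self, docs):
        """Store (docno, vector) pairs."""
        for docno, vec in docs:
            self.vectors[docno] = list(vec)

    def retrieve(self, query_vec, k=10):
        """Exact retrieval: score every stored vector and keep the top k."""
        scored = ((self._dot(query_vec, vec), docno)
                  for docno, vec in self.vectors.items())
        return [(docno, s) for s, docno in heapq.nlargest(k, scored)]

    def score(self, query_vec, docnos):
        """Re-rank the given candidates using only the stored vectors."""
        scored = [(docno, self._dot(query_vec, self.vectors[docno]))
                  for docno in docnos]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)

    @staticmethod
    def _dot(a, b):
        return sum(x * y for x, y in zip(a, b))


index = ToyVectorIndex()
index.index([("doc1", [1.0, 0.0]), ("doc2", [0.0, 1.0]), ("doc3", [0.6, 0.8])])
top = index.retrieve([1.0, 0.0], k=2)                # search the whole index
reranked = index.score([0.0, 1.0], ["doc1", "doc3"])  # score given candidates only
```

Retrieval scans every stored vector, while re-ranking touches only the candidates it is given. Exact scanning grows linearly with collection size, which is the cost that approximate backends such as FAISS HNSW aim to reduce.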
Pipelines
In most cases, you will want to combine dense models and dense indexes into pipelines.
Indexing Pipelines encode documents into dense vectors and store them in an index.
model.doc_encoder() >> index.indexer()
Retrieval Pipelines encode queries into dense vectors and retrieve documents from an index.
model.query_encoder() >> index.retriever()
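The `>>` operator in these pipelines chains stages so that the output of the left stage becomes the input of the right one. The sketch below shows the general idea with Python operator overloading. `Stage`, `add_vectors`, `save`, and the `doc_vec` field are hypothetical names for illustration, not PyTerrier's actual classes, and the "encoder" simply uses the text length as a one-dimensional vector.

```python
class Stage:
    """Hypothetical minimal pipeline stage wrapping a function."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, rows):
        return self.fn(rows)

    def __rshift__(self, other):
        # a >> b builds a new stage that runs a, then feeds its output to b.
        return Stage(lambda rows: other(self(rows)))


store = {}  # toy "index": docno -> vector


def add_vectors(rows):
    # Toy encoder (an assumption): text length as a 1-dimensional vector.
    return [dict(row, doc_vec=[float(len(row["text"]))]) for row in rows]


def save(rows):
    # Toy indexer: remember each document's vector, then pass rows through.
    store.update({row["docno"]: row["doc_vec"] for row in rows})
    return rows


pipeline = Stage(add_vectors) >> Stage(save)
result = pipeline([
    {"docno": "d1", "text": "dog"},
    {"docno": "d2", "text": "terrier"},
])
```

As a simplification, this toy pipeline is invoked by calling it directly, whereas the indexing pipeline in the complete example is run through its index() method.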
Putting it all Together
Here’s an example of a complete dense retrieval pipeline that indexes documents and retrieves them using dense vectors.
import pyterrier_dr
index = pyterrier_dr.FlexIndex('my_index.flex')
model = pyterrier_dr.SBertBiEncoder('sentence-transformers/all-MiniLM-L6-v2')
indexer = model.doc_encoder() >> index.indexer() # [1]
indexer.index([ # [2]
{'docno': 'doc1', 'text': 'A dog is a domesticated carnivorous mammal.'},
{'docno': 'doc2', 'text': 'Dogs are known for their loyalty and companionship.'},
{'docno': 'doc3', 'text': 'The domestic dog is a subspecies of the gray wolf.'},
{'docno': 'doc4', 'text': 'Scottish Terriers are dogs that are known for their independent nature and distinctive appearance.'},
])
retriever = model.query_encoder() >> index.retriever() # [3]
results = retriever.search('scottie dog')
# qid query docno score rank
# 1 scottie dog doc4 0.428387 0 # [4]
# 1 scottie dog doc1 0.390875 1
# 1 scottie dog doc2 0.353309 2
# 1 scottie dog doc3 0.331017 3
[1] This is an indexing pipeline that encodes documents using the model’s document encoder and indexes them into the flex index.
[2] In this example, we only index four documents. In most cases, you’ll index much larger collections.
[3] We construct a retrieval pipeline using the default exact retriever. Other retrievers like FAISS HNSW can also be used.
[4] The top result is doc4, which is the most relevant document about Scottish Terriers.