Dense Retrieval Overview¶

pyterrier-dr lets you construct single-vector dense indexing and retrieval pipelines. These methods retrieve documents by semantic matching rather than the lexical matching used by traditional retrieval methods like BM25.

These processes involve two main components: Dense Models and Dense Indexes. These components are typically combined to create Indexing Pipelines and Retrieval Pipelines.

Dense Models¶

Dense Models are transformers that encode text (queries and documents) into dense vectors. This package provides various pretrained dense models, as well as the ability to load models from Sentence Transformers and HuggingFace.

Loading a dense model¶

from pyterrier_dr import SBertBiEncoder
model = SBertBiEncoder('sentence-transformers/all-MiniLM-L6-v2') # [1]

[1] You can replace this with any Sentence Transformer model name or path.

A dense model can perform several operations:

Encode queries into dense vectors using its query_encoder().
Encode documents into dense vectors using its doc_encoder().
Re-rank results by encoding queries and documents and computing their similarity scores using its text_scorer().
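To make the encode-then-compare pattern concrete, here is a minimal sketch in plain Python. It does not use pyterrier-dr: toy_encode and similarity are hypothetical stand-ins, and the hashing encoder captures no semantics, unlike a trained transformer model. It only shows that queries and documents are encoded independently into vectors of the same size and then compared. The dot product used here is one common similarity function and is an assumption, not necessarily what a given model uses.

```python
import math
import zlib
from typing import List

DIM = 16  # hypothetical vector size, chosen only for illustration


def toy_encode(text: str) -> List[float]:
    """Hash lowercase tokens into DIM buckets and normalise to unit length."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def similarity(query_vec: List[float], doc_vec: List[float]) -> float:
    """Dot product of two equal-length vectors (assumed similarity function)."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))


# Queries and documents are encoded separately, then compared.
query_vec = toy_encode("gray wolf")
doc_vec = toy_encode("the gray wolf")
score = similarity(query_vec, doc_vec)
```

Because the two sides are encoded independently, document vectors can be computed once ahead of time and reused for every query, which is what makes storing them in an index worthwhile.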

Dense Indexes¶

Dense Indexes are data structures and algorithms to index and retrieve documents using dense vectors. This package provides FlexIndex, which stores document vectors on disk and provides various retrieval backends.

Creating a dense index¶

from pyterrier_dr import FlexIndex
index = FlexIndex('path/to/index.flex') # [1]

[1] You can specify any path where you want the index to be stored. By convention, we use the .flex extension, but this is not required.

A dense index can perform several operations:

Index documents using its indexer().
Retrieve documents using methods like retriever(), faiss_hnsw_retriever(), and more.
Re-rank results with its stored vectors using methods like scorer(), ladr_adaptive(), and more.
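The difference between retrieving and re-ranking can be sketched with a small in-memory class. ToyVectorIndex and its methods are hypothetical and are not part of pyterrier-dr (FlexIndex stores vectors on disk and offers several backends). The sketch assumes dot-product scoring: retrieval scores every stored vector, while re-ranking scores only the candidates it is given.

```python
from typing import Dict, List, Tuple


class ToyVectorIndex:
    """Hypothetical in-memory stand-in for a dense index."""

    def __init__(self) -> None:
        self.vectors: Dict[str, List[float]] = {}

    def add(self, docno: str, doc_vec: List[float]) -> None:
        """Indexing: store the document vector under its docno."""
        self.vectors[docno] = doc_vec

    def _score(self, query_vec: List[float], docno: str) -> float:
        return sum(q * d for q, d in zip(query_vec, self.vectors[docno]))

    def retrieve(self, query_vec: List[float], k: int = 10) -> List[Tuple[str, float]]:
        """Exact retrieval: score every stored vector and keep the top k."""
        scored = [(docno, self._score(query_vec, docno)) for docno in self.vectors]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

    def rerank(self, query_vec: List[float], candidates: List[str]) -> List[Tuple[str, float]]:
        """Re-ranking: score only the candidate docnos passed in."""
        scored = [(docno, self._score(query_vec, docno)) for docno in candidates]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)


index = ToyVectorIndex()
index.add("doc1", [1.0, 0.0])
index.add("doc2", [0.6, 0.8])
index.add("doc3", [0.0, 1.0])

query_vec = [0.0, 1.0]
top = index.retrieve(query_vec, k=2)                  # doc3 first, then doc2
reranked = index.rerank(query_vec, ["doc1", "doc2"])  # doc2 first, then doc1
```

Re-ranking is cheaper than retrieval when the candidate list is short, since the cost depends on the number of candidates rather than the size of the collection.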

Pipelines¶

In most cases, you will want to combine dense models and dense indexes into pipelines.

Indexing Pipelines encode documents into dense vectors and store them in an index.

model.doc_encoder() >> index.indexer()

Input columns:

docno	str	(External Document ID) String ID of document in collection
text	str	Document text

Columns after model.doc_encoder():

docno	str	(External Document ID) String ID of document in collection
text	str	Document text
doc_vec	np.array	Dense document vector

index.indexer() parameters:

index	FlexIndex('/home/docs/.pyterrier/artifacts/426c662fb720c2576539eb3dad459d53be9556557fc4cc4f48556f5b581f70bb')
mode	IndexingMode.create
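The way >> chains stages, and the way each stage adds columns to the rows flowing through it, can be sketched in a few lines. Stage, add_doc_vec and store are hypothetical names used only for illustration. This is not how PyTerrier implements its operators, and the stand-in "encoder" simply records character and word counts.

```python
class Stage:
    """Hypothetical minimal pipeline stage wrapping a function over row dicts."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, rows):
        return self.fn(rows)

    def __rshift__(self, other):
        # a >> b builds a new stage that runs a, then feeds its output to b
        return Stage(lambda rows: other(self(rows)))


def add_doc_vec(rows):
    # Stand-in encoder: the "vector" is [character count, word count]
    return [
        {**row, "doc_vec": [float(len(row["text"])), float(len(row["text"].split()))]}
        for row in rows
    ]


stored = {}


def store(rows):
    # Stand-in indexer: keep each vector under its docno
    for row in rows:
        stored[row["docno"]] = row["doc_vec"]
    return rows


indexing_pipeline = Stage(add_doc_vec) >> Stage(store)
indexing_pipeline([{"docno": "doc1", "text": "Document text"}])
# stored is now {'doc1': [13.0, 2.0]}
```

The first stage takes rows with docno and text and adds doc_vec. The second stage consumes doc_vec, which is why the encoder must come before the indexer in the pipeline.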

Retrieval Pipelines encode queries into dense vectors and retrieve documents from an index.

model.query_encoder() >> index.retriever()

Input columns:

qid	str	(Query ID) ID of query in frame
query	str	Query text

pyterrier_dr.biencoder.BiQueryEncoder parameters:

bi_encoder_model	SBertBiEncoder('sentence-transformers/paraphrase-albert-small-v2')
verbose	False
batch_size	32

Columns after BiQueryEncoder:

qid	str	(Query ID) ID of query in frame
query	str	Query text
query_vec	np.array	Dense query vector

pyterrier_dr.flex.np_retr.NumpyRetriever parameters:

flex_index	FlexIndex('/home/docs/.pyterrier/artifacts/426c662fb720c2576539eb3dad459d53be9556557fc4cc4f48556f5b581f70bb')
num_results	1000
batch_size	4096
drop_query_vec	False

Output columns (after NumpyRetriever):

qid	str	(Query ID) ID of query in frame
query	str	Query text
query_vec	np.array	Dense query vector
docno	str	(External Document ID) String ID of document in collection
docid	int	(Internal Document ID) Integer ID of document in a specific index
score	float	Ranking score of document to query (higher=better)
rank	int	Ranking order of document to query (lower=better)

Putting it all Together¶

Here’s an example of a complete dense retrieval pipeline that indexes documents and retrieves them using dense vectors.

Dense Indexing and Retrieval¶

import pyterrier_dr
index = pyterrier_dr.FlexIndex('my_index.flex')
model = pyterrier_dr.SBertBiEncoder('sentence-transformers/all-MiniLM-L6-v2')
indexer = model.doc_encoder() >> index.indexer() # [1]
indexer.index([ # [2]
    {'docno': 'doc1', 'text': 'A dog is a domesticated carnivorous mammal.'},
    {'docno': 'doc2', 'text': 'Dogs are known for their loyalty and companionship.'},
    {'docno': 'doc3', 'text': 'The domestic dog is a subspecies of the gray wolf.'},
    {'docno': 'doc4', 'text': 'Scottish Terriers are dogs that are known for their independent nature and distinctive appearance.'},
])

retriever = model.query_encoder() >> index.retriever() # [3]
results = retriever.search('scottie dog')
# qid        query  docno     score  rank
#   1  scottie dog   doc4  0.428387     0 # [4]
#   1  scottie dog   doc1  0.390875     1
#   1  scottie dog   doc2  0.353309     2
#   1  scottie dog   doc3  0.331017     3

[1] This is an indexing pipeline that encodes documents using the model’s document encoder and indexes them into the flex index.
[2] In this example, we only index four documents. In most cases, you’ll index much larger collections.
[3] We construct a retrieval pipeline using the default exact retriever. Other retrievers like FAISS HNSW can also be used.
[4] The top result is doc4, the document about Scottish Terriers and the most relevant to the query.
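This ranking illustrates semantic matching: doc4 shares no exact token with the query "scottie dog". The sketch below counts exact token overlap between the query and each of the four documents. The tokens helper is hypothetical and deliberately simple. Real lexical systems such as BM25 typically add stemming and term weighting, so this is a simplified illustration rather than a BM25 comparison.

```python
import re

docs = {
    "doc1": "A dog is a domesticated carnivorous mammal.",
    "doc2": "Dogs are known for their loyalty and companionship.",
    "doc3": "The domestic dog is a subspecies of the gray wolf.",
    "doc4": "Scottish Terriers are dogs that are known for their "
            "independent nature and distinctive appearance.",
}


def tokens(text):
    """Lowercase and split into alphabetic tokens; no stemming or stopword removal."""
    return set(re.findall(r"[a-z]+", text.lower()))


query_terms = tokens("scottie dog")
overlap = {docno: len(query_terms & tokens(text)) for docno, text in docs.items()}
print(overlap)
# {'doc1': 1, 'doc2': 0, 'doc3': 1, 'doc4': 0}
```

Under exact matching, doc1 and doc3 would outrank doc4, because "scottie" never appears in any document and doc4 contains "dogs" rather than "dog". The dense retriever in the example above still places doc4 first.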