
Sentence Transformers

With pyterrier_dr, it's easy to use Sentence Transformer (formerly called SentenceBERT) models, e.g. from HuggingFace, for dense retrieval.

The base class is SBertBiEncoder('huggingface/path').
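As a minimal sketch of how the base class might be used, the helper below loads a model by its HuggingFace path and encodes two toy queries. It assumes pyterrier_dr and pandas are installed. The model path and the helper name are illustrative choices that do not come from this documentation, and the "query_vec" output column is an assumption to check against your installed version.

```python
# Toy queries in the usual PyTerrier shape: one row per query, with qid and query.
EXAMPLE_QUERIES = [
    {"qid": "q1", "query": "what is dense retrieval"},
    {"qid": "q2", "query": "sentence embeddings for search"},
]


def encode_example_queries(model_path="sentence-transformers/all-MiniLM-L6-v2"):
    """Encode EXAMPLE_QUERIES with a Sentence Transformers model (illustrative helper)."""
    # Imported lazily so that defining this helper does not require the packages.
    import pandas as pd
    from pyterrier_dr import SBertBiEncoder

    # Any Sentence Transformers model path from HuggingFace can be given here;
    # this particular path is an assumption made for the example.
    model = SBertBiEncoder(model_path)

    # query_encoder() returns a transformer. Applying it to a frame of queries is
    # expected to add a "query_vec" column with one dense vector per row (assumption).
    return model.query_encoder()(pd.DataFrame(EXAMPLE_QUERIES))
```

Calling encode_example_queries() downloads the model on first use, so it is best run once in an interactive session.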

Pretrained Encoders

These classes are convenience aliases to popular dense encoding models.

class pyterrier_dr.Ance

Dense encoder for ANCE (Approximate nearest neighbor Negative Contrastive Learning).

See BiEncoder for usage information.


Xiong et al. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. ICLR 2021.
  author       = {Lee Xiong and
                  Chenyan Xiong and
                  Ye Li and
                  Kwok{-}Fung Tang and
                  Jialin Liu and
                  Paul N. Bennett and
                  Junaid Ahmed and
                  Arnold Overwijk},
  title        = {Approximate Nearest Neighbor Negative Contrastive Learning for Dense
                  Text Retrieval},
  booktitle    = {9th International Conference on Learning Representations, {ICLR} 2021,
                  Virtual Event, Austria, May 3-7, 2021},
  publisher    = {OpenReview.net},
  year         = {2021},
  url          = {https://openreview.net/forum?id=zeFrfgyZln},
  timestamp    = {Wed, 23 Jun 2021 17:36:39 +0200},
  biburl       = {https://dblp.org/rec/conf/iclr/XiongXLTLBAO21.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}

(default) Model: sentence-transformers/msmarco-roberta-base-ance-firstp

class pyterrier_dr.BGEM3
class pyterrier_dr.CDE(model_name='jxm/cde-small-v1', cache=None, batch_size=32, text_field='text', verbose=False, device=None)
class pyterrier_dr.E5

Dense encoder for E5 (EmbEddings from bidirEctional Encoder rEpresentations).

See BiEncoder for usage information.


Wang et al. Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv 2022.
  author       = {Liang Wang and
                  Nan Yang and
                  Xiaolong Huang and
                  Binxing Jiao and
                  Linjun Yang and
                  Daxin Jiang and
                  Rangan Majumder and
                  Furu Wei},
  title        = {Text Embeddings by Weakly-Supervised Contrastive Pre-training},
  journal      = {CoRR},
  volume       = {abs/2212.03533},
  year         = {2022},
  url          = {https://doi.org/10.48550/arXiv.2212.03533},
  doi          = {10.48550/ARXIV.2212.03533},
  eprinttype    = {arXiv},
  eprint       = {2212.03533},
  timestamp    = {Sat, 27 Jul 2024 13:40:52 +0200},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2212-03533.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}

(default) Model: intfloat/e5-base-v2

Model: intfloat/e5-small-v2

Model: intfloat/e5-large-v2

class pyterrier_dr.GTR

Dense encoder for GTR (Generalizable T5-based dense Retrievers).

See BiEncoder for usage information.


Ni et al. Large Dual Encoders Are Generalizable Retrievers. EMNLP 2022.
  author       = {Jianmo Ni and
                  Chen Qu and
                  Jing Lu and
                  Zhuyun Dai and
                  Gustavo Hern{\'{a}}ndez {\'{A}}brego and
                  Ji Ma and
                  Vincent Y. Zhao and
                  Yi Luan and
                  Keith B. Hall and
                  Ming{-}Wei Chang and
                  Yinfei Yang},
  editor       = {Yoav Goldberg and
                  Zornitsa Kozareva and
                  Yue Zhang},
  title        = {Large Dual Encoders Are Generalizable Retrievers},
  booktitle    = {Proceedings of the 2022 Conference on Empirical Methods in Natural
                  Language Processing, {EMNLP} 2022, Abu Dhabi, United Arab Emirates,
                  December 7-11, 2022},
  pages        = {9844--9855},
  publisher    = {Association for Computational Linguistics},
  year         = {2022},
  url          = {https://doi.org/10.18653/v1/2022.emnlp-main.669},
  doi          = {10.18653/V1/2022.EMNLP-MAIN.669},
  timestamp    = {Thu, 10 Aug 2023 12:35:29 +0200},
  biburl       = {https://dblp.org/rec/conf/emnlp/Ni0LDAMZLHCY22.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}

(default) Model: sentence-transformers/gtr-t5-base

Model: sentence-transformers/gtr-t5-large

Model: sentence-transformers/gtr-t5-xl

Model: sentence-transformers/gtr-t5-xxl

class pyterrier_dr.Query2Query

Dense query encoder model for query similarity.

Note that this encoder only provides a query_encoder() (no document encoder or scorer).


Bathwal and Samdani. State-of-the-art Query2Query Similarity. 2022.

(default) Model: neeva/query2query

class pyterrier_dr.RetroMAE

Dense encoder for RetroMAE (Masked Auto-Encoder).

See BiEncoder for usage information.


Xiao et al. RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder. EMNLP 2022.
  author       = {Shitao Xiao and
                  Zheng Liu and
                  Yingxia Shao and
                  Zhao Cao},
  editor       = {Yoav Goldberg and
                  Zornitsa Kozareva and
                  Yue Zhang},
  title        = {RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked
                  Auto-Encoder},
  booktitle    = {Proceedings of the 2022 Conference on Empirical Methods in Natural
                  Language Processing, {EMNLP} 2022, Abu Dhabi, United Arab Emirates,
                  December 7-11, 2022},
  pages        = {538--548},
  publisher    = {Association for Computational Linguistics},
  year         = {2022},
  url          = {https://doi.org/10.18653/v1/2022.emnlp-main.35},
  doi          = {10.18653/V1/2022.EMNLP-MAIN.35},
  timestamp    = {Mon, 24 Jun 2024 20:34:52 +0200},
  biburl       = {https://dblp.org/rec/conf/emnlp/XiaoLSC22.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}

(default) Model: Shitao/RetroMAE_MSMARCO_finetune

Model: Shitao/RetroMAE_MSMARCO_distill

Model: Shitao/RetroMAE_BEIR

class pyterrier_dr.TasB

Dense encoder for TAS-B (Topic Aware Sampling, Balanced).

See BiEncoder for usage information.


Hofstätter et al. Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. SIGIR 2021.
  author       = {Sebastian Hofst{\"{a}}tter and
                  Sheng{-}Chieh Lin and
                  Jheng{-}Hong Yang and
                  Jimmy Lin and
                  Allan Hanbury},
  editor       = {Fernando Diaz and
                  Chirag Shah and
                  Torsten Suel and
                  Pablo Castells and
                  Rosie Jones and
                  Tetsuya Sakai},
  title        = {Efficiently Teaching an Effective Dense Retriever with Balanced Topic
                  Aware Sampling},
  booktitle    = {{SIGIR} '21: The 44th International {ACM} {SIGIR} Conference on Research
                  and Development in Information Retrieval, Virtual Event, Canada, July
                  11-15, 2021},
  pages        = {113--122},
  publisher    = {{ACM}},
  year         = {2021},
  url          = {https://doi.org/10.1145/3404835.3462891},
  doi          = {10.1145/3404835.3462891},
  timestamp    = {Thu, 15 Jul 2021 15:30:48 +0200},
  biburl       = {https://dblp.org/rec/conf/sigir/HofstatterLYLH21.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}

(default) Model: sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco

class pyterrier_dr.TctColBert

Dense encoder for TCT-ColBERT (Tightly-Coupled Teachers over ColBERT).

See BiEncoder for usage information.


Lin et al. Distilling Dense Representations for Ranking using Tightly-Coupled Teachers. arXiv 2020.
  author       = {Sheng{-}Chieh Lin and
                  Jheng{-}Hong Yang and
                  Jimmy Lin},
  title        = {Distilling Dense Representations for Ranking using Tightly-Coupled
                  Teachers},
  journal      = {CoRR},
  volume       = {abs/2010.11386},
  year         = {2020},
  url          = {https://arxiv.org/abs/2010.11386},
  eprinttype    = {arXiv},
  eprint       = {2010.11386},
  timestamp    = {Mon, 26 Oct 2020 15:39:44 +0100},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2010-11386.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}

(default) Model: castorini/tct_colbert-msmarco

Model: castorini/tct_colbert-v2-hn-msmarco

Model: castorini/tct_colbert-v2-hnp-msmarco

API Documentation

class pyterrier_dr.BiEncoder(*args, **kwargs)

Represents a single-vector dense bi-encoder.

A BiEncoder encodes the text of a query or document into a dense vector.

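This class functions as a transformer factory:

  • Query encoding, using query_encoder()

  • Document encoding, using doc_encoder()

  • Text scoring (re-ranking), using text_scorer()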

It can also be used as a transformer directly. It infers which transformer to use based on columns present in the input frame.
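The column-based inference can be pictured with a small stand-alone sketch. This is not the library's code: the function name infer_mode, the column names it checks ("query" and the configured text field), and the order of the checks are all assumptions made for illustration.

```python
def infer_mode(columns, text_field="text"):
    """Guess which transformer applies, given the column names of an input frame.

    Hypothetical helper: the precedence used here is an assumption, not the
    library's documented behaviour.
    """
    cols = set(columns)
    if {"query", text_field} <= cols:
        return "text_scorer"    # query and document text present: score pairs
    if "query" in cols:
        return "query_encoder"  # only queries present: encode queries
    if text_field in cols:
        return "doc_encoder"    # only document text present: encode documents
    raise ValueError(f"cannot infer a transformer from columns {sorted(cols)}")


print(infer_mode(["qid", "query"]))                    # query_encoder
print(infer_mode(["docno", "text"]))                   # doc_encoder
print(infer_mode(["qid", "query", "docno", "text"]))   # text_scorer
```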

Note that in most cases, you will want to use a BiEncoder as part of a pipeline with a FlexIndex to perform dense indexing and retrieval.

Parameters:

  • batch_size – The default batch size to use for query/document encoding

  • text_field – The field in the input dataframe that contains the document text

  • verbose – Whether to show progress bars
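A hedged sketch of the indexing and retrieval pipelines mentioned above, using the three documented parameters. It assumes pyterrier_dr is installed and that FlexIndex exposes an np_retriever() method in your version. The index path, model path, helper name, and example documents are illustrative and do not come from this documentation.

```python
# Toy documents in the usual PyTerrier shape: docno plus the configured text field.
EXAMPLE_DOCS = [
    {"docno": "d1", "text": "Dense retrieval encodes text into vectors."},
    {"docno": "d2", "text": "Bi-encoders embed queries and documents separately."},
]


def build_pipelines(index_path="example_index.flex",
                    model_path="sentence-transformers/all-MiniLM-L6-v2"):
    """Return (indexing, retrieval) pipelines built around one bi-encoder (sketch)."""
    # Imported lazily so that defining this helper does not require the package.
    from pyterrier_dr import FlexIndex, SBertBiEncoder

    model = SBertBiEncoder(
        model_path,
        batch_size=32,      # default batch size for query/document encoding
        text_field="text",  # field of the input frame that holds the document text
        verbose=True,       # show progress bars
    )
    index = FlexIndex(index_path)

    indexing = model >> index                   # encode documents, then store vectors
    retrieval = model >> index.np_retriever()   # encode queries, then search (assumed API)
    return indexing, retrieval
```

With the pipelines built, indexing.index(EXAMPLE_DOCS) would populate the index and retrieval.search("dense retrieval") would query it, following the usual PyTerrier conventions.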

query_encoder(verbose=None, batch_size=None)

Returns a transformer that encodes query text into dense vectors.


doc_encoder(verbose=None, batch_size=None)

Returns a transformer that encodes document text into dense vectors.


text_scorer(verbose=None, batch_size=None, sim_fn=None)

Returns a transformer for text scoring (re-ranking).


property sim_fn: SimFn

The similarity function to use between embeddings for this model
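To make the idea of a similarity function between embeddings concrete, here is a small NumPy sketch of two common choices, dot product and cosine similarity. The function names are hypothetical, and which functions SimFn actually offers is not stated in this section.

```python
import numpy as np


def dot_sim(query_vecs, doc_vecs):
    """Dot-product similarity: result has shape (n_queries, n_docs)."""
    return query_vecs @ doc_vecs.T


def cos_sim(query_vecs, doc_vecs):
    """Cosine similarity: the dot product of L2-normalised vectors."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return q @ d.T


queries = np.array([[1.0, 0.0]])
docs = np.array([[2.0, 0.0], [0.0, 3.0]])
print(dot_sim(queries, docs))  # [[2. 0.]]
print(cos_sim(queries, docs))  # [[1. 0.]]
```

The two functions rank documents differently whenever vector lengths vary, which is why a model's embeddings should be compared with the similarity function it was trained for.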

abstract encode_queries(texts, batch_size=None)

Abstract method to encode a list of query texts into dense vectors.

This function is used by the transformer returned by query_encoder().

Parameters:

  • texts – A list of query texts

  • batch_size – The batch size to use for encoding

Returns:

A numpy array of shape (n_queries, n_dims)


abstract encode_docs(texts, batch_size=None)

Abstract method to encode a list of document texts into dense vectors.

This function is used by the transformer returned by doc_encoder().

Parameters:

  • texts – A list of document texts

  • batch_size – The batch size to use for encoding

Returns:

A numpy array of shape (n_docs, n_dims)
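The input/output contract of the two abstract methods can be illustrated with a toy stand-in. The class below is hypothetical: it does not subclass BiEncoder and uses hashed bag-of-words counts instead of a neural model. It only shows the expected shapes, a list of texts in and an array of shape (n_texts, n_dims) out, processed in batches.

```python
import hashlib

import numpy as np


class ToyHashEncoder:
    """Hypothetical stand-in that mimics encode_queries() and encode_docs()."""

    def __init__(self, n_dims=8, batch_size=2):
        self.n_dims = n_dims
        self.batch_size = batch_size

    def _encode(self, texts, batch_size=None):
        batch_size = batch_size or self.batch_size  # fall back to the default
        out = np.zeros((len(texts), self.n_dims), dtype=np.float32)
        # Walk over the texts one batch at a time, as a real encoder would.
        for start in range(0, len(texts), batch_size):
            batch = texts[start:start + batch_size]
            for row, text in enumerate(batch, start=start):
                for token in text.lower().split():
                    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
                    out[row, int(digest, 16) % self.n_dims] += 1.0
        return out

    def encode_queries(self, texts, batch_size=None):
        return self._encode(texts, batch_size)

    def encode_docs(self, texts, batch_size=None):
        return self._encode(texts, batch_size)


encoder = ToyHashEncoder()
print(encoder.encode_queries(["dense retrieval", "bi encoder"]).shape)  # (2, 8)
print(encoder.encode_docs(["a", "b", "c"], batch_size=1).shape)         # (3, 8)
```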
