Encoding

Sentence Transformers

With pyterrier_dr, it's easy to use Sentence Transformers (formerly SentenceBERT) models, e.g. from Hugging Face, for dense retrieval.

The base class is SBertBiEncoder('huggingface/path').
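
A minimal sketch of wrapping a Sentence Transformers checkpoint as a bi-encoder (the checkpoint name below is only an illustrative assumption; substitute any model you like):

import pyterrier_dr

# Wrap a Sentence Transformers checkpoint as a single-vector bi-encoder
model = pyterrier_dr.SBertBiEncoder('sentence-transformers/all-MiniLM-L6-v2')

q_enc = model.query_encoder()   # transformer that encodes queries
d_enc = model.doc_encoder()     # transformer that encodes documents
scorer = model.text_scorer()    # transformer that scores (re-ranks) query-text pairs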

Pretrained Encoders

These classes are convenience aliases to popular dense encoding models.

class pyterrier_dr.Ance[source]

Dense encoder for ANCE (Approximate nearest neighbor Negative Contrastive Learning).

See BiEncoder for usage information.

Citation

Xiong et al. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. ICLR 2021. [link]
@inproceedings{DBLP:conf/iclr/XiongXLTLBAO21,
  author       = {Lee Xiong and
                  Chenyan Xiong and
                  Ye Li and
                  Kwok{-}Fung Tang and
                  Jialin Liu and
                  Paul N. Bennett and
                  Junaid Ahmed and
                  Arnold Overwijk},
  title        = {Approximate Nearest Neighbor Negative Contrastive Learning for Dense
                  Text Retrieval},
  booktitle    = {9th International Conference on Learning Representations, {ICLR} 2021,
                  Virtual Event, Austria, May 3-7, 2021},
  publisher    = {OpenReview.net},
  year         = {2021},
  url          = {https://openreview.net/forum?id=zeFrfgyZln},
  timestamp    = {Wed, 23 Jun 2021 17:36:39 +0200},
  biburl       = {https://dblp.org/rec/conf/iclr/XiongXLTLBAO21.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
firstp()

(default) Model: sentence-transformers/msmarco-roberta-base-ance-firstp [link]
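
As a hedged example (the same pattern applies to the other pretrained encoders below), the ANCE checkpoint can be loaded via its alias and used with any of the BiEncoder transformers:

import pandas as pd
from pyterrier_dr import Ance

model = Ance.firstp()          # loads sentence-transformers/msmarco-roberta-base-ance-firstp
scorer = model.text_scorer()   # re-ranks (query, text) pairs
res = scorer(pd.DataFrame([
    {'qid': '1', 'query': 'who wrote hamlet',
     'docno': 'd1', 'text': 'Hamlet is a tragedy written by William Shakespeare.'},
]))
# res gains a relevance score for each query-document pair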

class pyterrier_dr.BGEM3[source]

class pyterrier_dr.CDE(model_name='jxm/cde-small-v1', cache=None, batch_size=32, text_field='text', verbose=False, device=None)[source]

class pyterrier_dr.E5[source]

Dense encoder for E5 (EmbEddings from bidirEctional Encoder rEpresentations).

See BiEncoder for usage information.

Citation

Wang et al. Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv 2022. [link]
@article{DBLP:journals/corr/abs-2212-03533,
  author       = {Liang Wang and
                  Nan Yang and
                  Xiaolong Huang and
                  Binxing Jiao and
                  Linjun Yang and
                  Daxin Jiang and
                  Rangan Majumder and
                  Furu Wei},
  title        = {Text Embeddings by Weakly-Supervised Contrastive Pre-training},
  journal      = {CoRR},
  volume       = {abs/2212.03533},
  year         = {2022},
  url          = {https://doi.org/10.48550/arXiv.2212.03533},
  doi          = {10.48550/ARXIV.2212.03533},
  eprinttype    = {arXiv},
  eprint       = {2212.03533},
  timestamp    = {Sat, 27 Jul 2024 13:40:52 +0200},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2212-03533.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
base()

(default) Model: intfloat/e5-base-v2 [link]

small()

Model: intfloat/e5-small-v2 [link]

large()

Model: intfloat/e5-large-v2 [link]

class pyterrier_dr.GTR[source]

Dense encoder for GTR (Generalizable T5-based dense Retrievers).

See BiEncoder for usage information.

Citation

Ni et al. Large Dual Encoders Are Generalizable Retrievers. EMNLP 2022. [link]
@inproceedings{DBLP:conf/emnlp/Ni0LDAMZLHCY22,
  author       = {Jianmo Ni and
                  Chen Qu and
                  Jing Lu and
                  Zhuyun Dai and
                  Gustavo Hern{\'{a}}ndez {\'{A}}brego and
                  Ji Ma and
                  Vincent Y. Zhao and
                  Yi Luan and
                  Keith B. Hall and
                  Ming{-}Wei Chang and
                  Yinfei Yang},
  editor       = {Yoav Goldberg and
                  Zornitsa Kozareva and
                  Yue Zhang},
  title        = {Large Dual Encoders Are Generalizable Retrievers},
  booktitle    = {Proceedings of the 2022 Conference on Empirical Methods in Natural
                  Language Processing, {EMNLP} 2022, Abu Dhabi, United Arab Emirates,
                  December 7-11, 2022},
  pages        = {9844--9855},
  publisher    = {Association for Computational Linguistics},
  year         = {2022},
  url          = {https://doi.org/10.18653/v1/2022.emnlp-main.669},
  doi          = {10.18653/V1/2022.EMNLP-MAIN.669},
  timestamp    = {Thu, 10 Aug 2023 12:35:29 +0200},
  biburl       = {https://dblp.org/rec/conf/emnlp/Ni0LDAMZLHCY22.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
base()

(default) Model: sentence-transformers/gtr-t5-base [link]

large()

Model: sentence-transformers/gtr-t5-large [link]

xl()

Model: sentence-transformers/gtr-t5-xl [link]

xxl()

Model: sentence-transformers/gtr-t5-xxl [link]

class pyterrier_dr.Query2Query[source]

Dense query encoder model for query similarity.

Note that this encoder only provides a query_encoder() (no document encoder or scorer).

Citation

Bathwal and Samdani. State-of-the-art Query2Query Similarity. 2022. [link]

base()

(default) Model: neeva/query2query [link]
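
A small sketch of query-similarity encoding with this model (query_encoder() is the only transformer it provides):

import pandas as pd
from pyterrier_dr import Query2Query

enc = Query2Query.base().query_encoder()
res = enc(pd.DataFrame([
    {'qid': '1', 'query': 'best pizza in new york'},
    {'qid': '2', 'query': 'top rated nyc pizzerias'},
]))
# res gains a query_vec column; compare the vectors to measure query similarity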

class pyterrier_dr.RetroMAE[source]

Dense encoder for RetroMAE (Masked Auto-Encoder).

See BiEncoder for usage information.

Citation

Xiao et al. RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder. EMNLP 2022. [link]
@inproceedings{DBLP:conf/emnlp/XiaoLSC22,
  author       = {Shitao Xiao and
                  Zheng Liu and
                  Yingxia Shao and
                  Zhao Cao},
  editor       = {Yoav Goldberg and
                  Zornitsa Kozareva and
                  Yue Zhang},
  title        = {RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked
                  Auto-Encoder},
  booktitle    = {Proceedings of the 2022 Conference on Empirical Methods in Natural
                  Language Processing, {EMNLP} 2022, Abu Dhabi, United Arab Emirates,
                  December 7-11, 2022},
  pages        = {538--548},
  publisher    = {Association for Computational Linguistics},
  year         = {2022},
  url          = {https://doi.org/10.18653/v1/2022.emnlp-main.35},
  doi          = {10.18653/V1/2022.EMNLP-MAIN.35},
  timestamp    = {Mon, 24 Jun 2024 20:34:52 +0200},
  biburl       = {https://dblp.org/rec/conf/emnlp/XiaoLSC22.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
msmarco_finetune()

(default) Model: Shitao/RetroMAE_MSMARCO_finetune [link]

msmarco_distill()

Model: Shitao/RetroMAE_MSMARCO_distill [link]

wiki_bookscorpus_beir()

Model: Shitao/RetroMAE_BEIR [link]

class pyterrier_dr.TasB[source]

Dense encoder for TAS-B (Topic Aware Sampling, Balanced).

See BiEncoder for usage information.

Citation

Hofstätter et al. Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. SIGIR 2021. [link]
@inproceedings{DBLP:conf/sigir/HofstatterLYLH21,
  author       = {Sebastian Hofst{\"{a}}tter and
                  Sheng{-}Chieh Lin and
                  Jheng{-}Hong Yang and
                  Jimmy Lin and
                  Allan Hanbury},
  editor       = {Fernando Diaz and
                  Chirag Shah and
                  Torsten Suel and
                  Pablo Castells and
                  Rosie Jones and
                  Tetsuya Sakai},
  title        = {Efficiently Teaching an Effective Dense Retriever with Balanced Topic
                  Aware Sampling},
  booktitle    = {{SIGIR} '21: The 44th International {ACM} {SIGIR} Conference on Research
                  and Development in Information Retrieval, Virtual Event, Canada, July
                  11-15, 2021},
  pages        = {113--122},
  publisher    = {{ACM}},
  year         = {2021},
  url          = {https://doi.org/10.1145/3404835.3462891},
  doi          = {10.1145/3404835.3462891},
  timestamp    = {Thu, 15 Jul 2021 15:30:48 +0200},
  biburl       = {https://dblp.org/rec/conf/sigir/HofstatterLYLH21.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
dot()

(default) Model: sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco [link]

class pyterrier_dr.TctColBert[source]

Dense encoder for TCT-ColBERT (Tightly-Coupled Teachers over ColBERT).

See BiEncoder for usage information.

Citation

Lin et al. Distilling Dense Representations for Ranking using Tightly-Coupled Teachers. arXiv 2020. [link]
@article{DBLP:journals/corr/abs-2010-11386,
  author       = {Sheng{-}Chieh Lin and
                  Jheng{-}Hong Yang and
                  Jimmy Lin},
  title        = {Distilling Dense Representations for Ranking using Tightly-Coupled
                  Teachers},
  journal      = {CoRR},
  volume       = {abs/2010.11386},
  year         = {2020},
  url          = {https://arxiv.org/abs/2010.11386},
  eprinttype    = {arXiv},
  eprint       = {2010.11386},
  timestamp    = {Mon, 26 Oct 2020 15:39:44 +0100},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2010-11386.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
base()

(default) Model: castorini/tct_colbert-msmarco [link]

hn()

Model: castorini/tct_colbert-v2-hn-msmarco [link]

hnp()

Model: castorini/tct_colbert-v2-hnp-msmarco [link]

API Documentation

class pyterrier_dr.BiEncoder(*args, **kwargs)[source]

Represents a single-vector dense bi-encoder.

A BiEncoder encodes the text of a query or document into a dense vector.

This class functions as a transformer factory: query_encoder(), doc_encoder(), and text_scorer() each build a transformer for the corresponding task.

It can also be used as a transformer directly; it infers which transformer to apply based on the columns present in the input frame.

Note that in most cases, you will want to use a BiEncoder as part of a pipeline with a FlexIndex to perform dense indexing and retrieval.

Parameters:
  • batch_size – The default batch size to use for query/document encoding

  • text_field – The field in the input dataframe that contains the document text

  • verbose – Whether to show progress bars
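
A minimal sketch of the factory/transformer behaviour (TasB.dot() stands in for any BiEncoder subclass; the query_vec column name follows pyterrier_dr's conventions):

import pandas as pd
from pyterrier_dr import TasB

model = TasB.dot()

# A frame with a query column triggers query encoding (adds a query_vec column)
queries = pd.DataFrame([{'qid': '1', 'query': 'chemical reactions'}])
print(model(queries).columns)

# A frame with query and document text triggers scoring (adds a score column)
pairs = pd.DataFrame([{'qid': '1', 'query': 'chemical reactions',
                       'docno': 'd1', 'text': 'Combustion is a chemical reaction.'}])
print(model(pairs).columns)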

query_encoder(verbose=None, batch_size=None)[source]

Creates a transformer that encodes queries into dense vectors.

Return type:

Transformer
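
For example (assuming TasB.dot() as the encoder):

import pandas as pd
from pyterrier_dr import TasB

q_enc = TasB.dot().query_encoder(batch_size=16)
res = q_enc(pd.DataFrame([{'qid': '1', 'query': 'dense retrieval'}]))
# res now carries a query_vec column with one dense vector per query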

doc_encoder(verbose=None, batch_size=None)[source]

Creates a transformer that encodes documents into dense vectors.

Return type:

Transformer
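
For example (again using TasB.dot() as a stand-in):

import pandas as pd
from pyterrier_dr import TasB

d_enc = TasB.dot().doc_encoder()
res = d_enc(pd.DataFrame([{'docno': 'd1', 'text': 'Dense retrieval uses learned embeddings.'}]))
# res now carries a doc_vec column with one dense vector per document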

text_scorer(verbose=None, batch_size=None, sim_fn=None)[source]

Creates a transformer that scores (re-ranks) query-document text pairs.

Return type:

Transformer
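
For example, re-ranking candidate texts for a query (TasB.dot() as a stand-in):

import pandas as pd
from pyterrier_dr import TasB

scorer = TasB.dot().text_scorer()
res = scorer(pd.DataFrame([
    {'qid': '1', 'query': 'dense retrieval', 'docno': 'd1', 'text': 'Dense retrieval uses learned embeddings.'},
    {'qid': '1', 'query': 'dense retrieval', 'docno': 'd2', 'text': 'Pizza dough needs time to rise.'},
]))
# res gains a score column; documents can then be re-ranked by that score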

property sim_fn: SimFn

The similarity function to use between embeddings for this model.

abstract encode_queries(texts, batch_size=None)[source]

Abstract method to encode a list of query texts into dense vectors.

This function is used by the transformer returned by query_encoder().

Parameters:
  • texts – A list of query texts

  • batch_size – The batch size to use for encoding

Returns:

A numpy array of shape (n_queries, n_dims)

Return type:

np.array
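
For example (the shape shown assumes a 768-dimensional model such as TAS-B; this is an illustration, not part of the API):

from pyterrier_dr import TasB

vecs = TasB.dot().encode_queries(['what is dense retrieval?', 'how do bi-encoders work?'])
print(vecs.shape)   # e.g. (2, 768): one row per query, one column per embedding dimension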

abstract encode_docs(texts, batch_size=None)[source]

Abstract method to encode a list of document texts into dense vectors.

This function is used by the transformer returned by doc_encoder().

Parameters:
  • texts – A list of document texts

  • batch_size – The batch size to use for encoding

Returns:

A numpy array of shape (n_docs, n_dims)

Return type:

np.array
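
For example (same stand-in model as above):

from pyterrier_dr import TasB

vecs = TasB.dot().encode_docs([
    'Dense retrieval encodes documents into vectors.',
    'Sparse retrieval relies on term matching.',
])
print(vecs.shape)   # one row per document, one column per embedding dimension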