Encoding¶
Sentence Transformers¶
With pyterrier_dr, it's easy to use Sentence Transformers (formerly called SentenceBERT) models, e.g. from the Hugging Face Hub, for dense retrieval.
The base class is SBertBiEncoder('huggingface/path').
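For example, a Sentence Transformers checkpoint can be loaded by its Hugging Face name. A minimal sketch (the checkpoint name below is illustrative; substitute the model you need):

```python
import pyterrier_dr

# Load a Sentence Transformers model from the Hugging Face Hub.
# The checkpoint name is illustrative; any Sentence Transformers model
# should work the same way.
model = pyterrier_dr.SBertBiEncoder('sentence-transformers/all-MiniLM-L6-v2')
```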
Pretrained Encoders¶
These classes are convenience aliases to popular dense encoding models.
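A sketch of the common pattern for these aliases (the no-argument constructors are an assumption; they are expected to load each model's default checkpoint):

```python
import pyterrier_dr

# Each alias is a BiEncoder, so the same factory methods apply to all of them.
# Assumption: calling the constructor with no arguments loads the model's
# default checkpoint from the Hugging Face Hub.
model = pyterrier_dr.Ance()   # or E5(), GTR(), RetroMAE(), TasB(), TctColBert()
query_encoder = model.query_encoder()
doc_encoder = model.doc_encoder()
```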
- class pyterrier_dr.Ance[source]¶
Dense encoder for ANCE (Approximate nearest neighbor Negative Contrastive Learning).
See BiEncoder for usage information.
Citation
Xiong et al. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. ICLR 2021. [link]
@inproceedings{DBLP:conf/iclr/XiongXLTLBAO21, author = {Lee Xiong and Chenyan Xiong and Ye Li and Kwok{-}Fung Tang and Jialin Liu and Paul N. Bennett and Junaid Ahmed and Arnold Overwijk}, title = {Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval}, booktitle = {9th International Conference on Learning Representations, {ICLR} 2021, Virtual Event, Austria, May 3-7, 2021}, publisher = {OpenReview.net}, year = {2021}, url = {https://openreview.net/forum?id=zeFrfgyZln}, timestamp = {Wed, 23 Jun 2021 17:36:39 +0200}, biburl = {https://dblp.org/rec/conf/iclr/XiongXLTLBAO21.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- class pyterrier_dr.CDE(model_name='jxm/cde-small-v1', cache=None, batch_size=32, text_field='text', verbose=False, device=None)[source]¶
Dense encoder for CDE (Contextual Document Embeddings).
- class pyterrier_dr.E5[source]¶
Dense encoder for E5 (EmbEddings from bidirEctional Encoder rEpresentations).
See BiEncoder for usage information.
Citation
Wang et al. Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv 2022. [link]
@article{DBLP:journals/corr/abs-2212-03533, author = {Liang Wang and Nan Yang and Xiaolong Huang and Binxing Jiao and Linjun Yang and Daxin Jiang and Rangan Majumder and Furu Wei}, title = {Text Embeddings by Weakly-Supervised Contrastive Pre-training}, journal = {CoRR}, volume = {abs/2212.03533}, year = {2022}, url = {https://doi.org/10.48550/arXiv.2212.03533}, doi = {10.48550/ARXIV.2212.03533}, eprinttype = {arXiv}, eprint = {2212.03533}, timestamp = {Sat, 27 Jul 2024 13:40:52 +0200}, biburl = {https://dblp.org/rec/journals/corr/abs-2212-03533.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- class pyterrier_dr.GTR[source]¶
Dense encoder for GTR (Generalizable T5-based dense Retrievers).
See BiEncoder for usage information.
Citation
Ni et al. Large Dual Encoders Are Generalizable Retrievers. EMNLP 2022. [link]
@inproceedings{DBLP:conf/emnlp/Ni0LDAMZLHCY22, author = {Jianmo Ni and Chen Qu and Jing Lu and Zhuyun Dai and Gustavo Hern{\'{a}}ndez {\'{A}}brego and Ji Ma and Vincent Y. Zhao and Yi Luan and Keith B. Hall and Ming{-}Wei Chang and Yinfei Yang}, editor = {Yoav Goldberg and Zornitsa Kozareva and Yue Zhang}, title = {Large Dual Encoders Are Generalizable Retrievers}, booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, {EMNLP} 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022}, pages = {9844--9855}, publisher = {Association for Computational Linguistics}, year = {2022}, url = {https://doi.org/10.18653/v1/2022.emnlp-main.669}, doi = {10.18653/V1/2022.EMNLP-MAIN.669}, timestamp = {Thu, 10 Aug 2023 12:35:29 +0200}, biburl = {https://dblp.org/rec/conf/emnlp/Ni0LDAMZLHCY22.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- class pyterrier_dr.Query2Query[source]¶
Dense query encoder model for query similarity.
Note that this encoder only provides a query_encoder() (no document encoder or scorer); see the sketch after this entry.
Citation
Bathwal and Samdani. State-of-the-art Query2Query Similarity. 2022. [link]
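A minimal sketch of using the query encoder (assuming the default constructor loads the standard checkpoint):

```python
import pandas as pd
import pyterrier_dr

# Query2Query only provides a query encoder, so it is applied to a topics
# frame rather than to documents.
model = pyterrier_dr.Query2Query()
topics = pd.DataFrame([{'qid': 'q1', 'query': 'symptoms of the flu'}])
encoded = model.query_encoder()(topics)  # adds a query vector column (assumed behaviour)
```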
- class pyterrier_dr.RetroMAE[source]¶
Dense encoder for RetroMAE (Masked Auto-Encoder).
See BiEncoder for usage information.
Citation
Xiao et al. RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder. EMNLP 2022. [link]
@inproceedings{DBLP:conf/emnlp/XiaoLSC22, author = {Shitao Xiao and Zheng Liu and Yingxia Shao and Zhao Cao}, editor = {Yoav Goldberg and Zornitsa Kozareva and Yue Zhang}, title = {RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder}, booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, {EMNLP} 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022}, pages = {538--548}, publisher = {Association for Computational Linguistics}, year = {2022}, url = {https://doi.org/10.18653/v1/2022.emnlp-main.35}, doi = {10.18653/V1/2022.EMNLP-MAIN.35}, timestamp = {Mon, 24 Jun 2024 20:34:52 +0200}, biburl = {https://dblp.org/rec/conf/emnlp/XiaoLSC22.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- class pyterrier_dr.TasB[source]¶
Dense encoder for TAS-B (Topic Aware Sampling, Balanced).
See BiEncoder for usage information.
Citation
Hofstätter et al. Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. SIGIR 2021. [link]
@inproceedings{DBLP:conf/sigir/HofstatterLYLH21, author = {Sebastian Hofst{\"{a}}tter and Sheng{-}Chieh Lin and Jheng{-}Hong Yang and Jimmy Lin and Allan Hanbury}, editor = {Fernando Diaz and Chirag Shah and Torsten Suel and Pablo Castells and Rosie Jones and Tetsuya Sakai}, title = {Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling}, booktitle = {{SIGIR} '21: The 44th International {ACM} {SIGIR} Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021}, pages = {113--122}, publisher = {{ACM}}, year = {2021}, url = {https://doi.org/10.1145/3404835.3462891}, doi = {10.1145/3404835.3462891}, timestamp = {Thu, 15 Jul 2021 15:30:48 +0200}, biburl = {https://dblp.org/rec/conf/sigir/HofstatterLYLH21.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- class pyterrier_dr.TctColBert[source]¶
Dense encoder for TCT-ColBERT (Tightly-Coupled Teachers over ColBERT).
See BiEncoder for usage information.
Citation
Lin et al. Distilling Dense Representations for Ranking using Tightly-Coupled Teachers. arXiv 2020. [link]
@article{DBLP:journals/corr/abs-2010-11386, author = {Sheng{-}Chieh Lin and Jheng{-}Hong Yang and Jimmy Lin}, title = {Distilling Dense Representations for Ranking using Tightly-Coupled Teachers}, journal = {CoRR}, volume = {abs/2010.11386}, year = {2020}, url = {https://arxiv.org/abs/2010.11386}, eprinttype = {arXiv}, eprint = {2010.11386}, timestamp = {Mon, 26 Oct 2020 15:39:44 +0100}, biburl = {https://dblp.org/rec/journals/corr/abs-2010-11386.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
API Documentation¶
- class pyterrier_dr.BiEncoder(*args, **kwargs)[source]¶
Represents a single-vector dense bi-encoder.
A BiEncoder encodes the text of a query or document into a dense vector.
This class functions as a transformer factory:
- Query encoding using query_encoder()
- Document encoding using doc_encoder()
- Text scoring (re-ranking) using text_scorer()
It can also be used as a transformer directly: it infers which transformer to apply based on the columns present in the input frame (see the sketch after the parameter list below).
Note that in most cases, you will want to use a BiEncoder as part of a pipeline with a FlexIndex to perform dense indexing and retrieval.
- Parameters:
batch_size – The default batch size to use for query/document encoding
text_field – The field in the input dataframe that contains the document text
verbose – Whether to show progress bars
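A minimal sketch of the factory usage described above. Assumptions: the default TasB() constructor loads its standard checkpoint, and the returned transformers attach vector columns to the input frames:

```python
import pandas as pd
import pyterrier_dr

model = pyterrier_dr.TasB()  # any BiEncoder subclass works the same way

# Query encoding: attaches a dense vector to each query
topics = pd.DataFrame([{'qid': 'q1', 'query': 'what is dense retrieval'}])
encoded_topics = model.query_encoder()(topics)

# Document encoding: attaches a dense vector to each document
docs = pd.DataFrame([{'docno': 'd1', 'text': 'Dense retrieval encodes text into vectors.'}])
encoded_docs = model.doc_encoder()(docs)

# Because BiEncoder infers the transformer from the input columns,
# model(topics) and model(docs) behave like the two calls above.
```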
- text_scorer(verbose=None, batch_size=None, sim_fn=None)[source]¶
Text Scoring (re-ranking)
- Return type:
Transformer
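A sketch of re-ranking with the returned transformer (the model choice is illustrative, and the scoring behaviour is as assumed above):

```python
import pandas as pd
import pyterrier_dr

model = pyterrier_dr.TctColBert()  # illustrative choice of BiEncoder
scorer = model.text_scorer()

# The input needs both the query and the document text; the scorer assigns
# a similarity-based score to each (query, document) pair.
run = pd.DataFrame([
    {'qid': 'q1', 'query': 'dense retrieval', 'docno': 'd1',
     'text': 'Dense retrieval encodes text into vectors.'},
    {'qid': 'q1', 'query': 'dense retrieval', 'docno': 'd2',
     'text': 'The quick brown fox jumps over the lazy dog.'},
])
scored = scorer(run)
```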
- property sim_fn: SimFn¶
The similarity function to use between embeddings for this model.
- abstract encode_queries(texts, batch_size=None)[source]¶
Abstract method to encode a list of query texts into dense vectors.
This function is used by the transformer returned by query_encoder().
- Parameters:
texts – A list of query texts
batch_size – The batch size to use for encoding
- Returns:
A numpy array of shape (n_queries, n_dims)
- Return type:
np.array
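The method can also be called directly on a list of texts. A sketch (the model choice and its no-argument constructor are assumptions):

```python
import pyterrier_dr

model = pyterrier_dr.TasB()
vecs = model.encode_queries(['dense retrieval', 'passage ranking'])
print(vecs.shape)  # (2, n_dims), e.g. (2, 768) for a BERT-base encoder
```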
- abstract encode_docs(texts, batch_size=None)[source]¶
Abstract method to encode a list of document texts into dense vectors.
This function is used by the transformer returned by doc_encoder().
- Parameters:
texts – A list of document texts
batch_size – The batch size to use for encoding
- Returns:
A numpy array of shape (n_docs, n_dims)
- Return type:
np.array
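Document texts can be encoded the same way. A sketch (same assumptions as above):

```python
import pyterrier_dr

model = pyterrier_dr.TasB()
vecs = model.encode_docs(['Dense retrieval encodes text into vectors.'])
print(vecs.shape)  # (1, n_dims)
```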