SuiteEval Suites

BEIR

BEIR is a heterogeneous benchmark for zero-shot evaluation of information retrieval (IR) models across diverse tasks.

Citation

Thakur et al. BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. arXiv 2021. https://arxiv.org/abs/2104.08663
@article{DBLP:journals/corr/abs-2104-08663,
  author       = {Nandan Thakur and
                  Nils Reimers and
                  Andreas R{\"{u}}ckl{\'{e}} and
                  Abhishek Srivastava and
                  Iryna Gurevych},
  title        = {{BEIR:} {A} Heterogenous Benchmark for Zero-shot Evaluation of Information
                  Retrieval Models},
  journal      = {CoRR},
  volume       = {abs/2104.08663},
  year         = {2021},
  url          = {https://arxiv.org/abs/2104.08663},
  eprinttype    = {arXiv},
  eprint       = {2104.08663},
  timestamp    = {Thu, 14 Oct 2021 09:14:46 +0200},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2104-08663.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Usage

from suiteeval.suite import BEIR
# `pipelines` is assumed to be defined beforehand; it holds the pipelines to evaluate.
results = BEIR(pipelines)

NanoBEIR

NanoBEIR is a compact subset of BEIR for faster iteration.

Usage

from suiteeval.suite import NanoBEIR
results = NanoBEIR(pipelines)

LoTTE

LoTTE (Long-Tail Topic-stratified Evaluation) is a set of test collections focused on out-of-domain evaluation.

Citation

Santhanam et al. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. NAACL-HLT 2022. https://doi.org/10.18653/v1/2022.naacl-main.272
@inproceedings{DBLP:conf/naacl/SanthanamKSPZ22,
  author       = {Keshav Santhanam and
                  Omar Khattab and
                  Jon Saad{-}Falcon and
                  Christopher Potts and
                  Matei Zaharia},
  editor       = {Marine Carpuat and
                  Marie{-}Catherine de Marneffe and
                  Iv{\'{a}}n Vladimir Meza Ru{\'{\i}}z},
  title        = {ColBERTv2: Effective and Efficient Retrieval via Lightweight Late
                  Interaction},
  booktitle    = {Proceedings of the 2022 Conference of the North American Chapter of
                  the Association for Computational Linguistics: Human Language Technologies,
                  {NAACL} 2022, Seattle, WA, United States, July 10-15, 2022},
  pages        = {3715--3734},
  publisher    = {Association for Computational Linguistics},
  year         = {2022},
  url          = {https://doi.org/10.18653/v1/2022.naacl-main.272},
  doi          = {10.18653/V1/2022.NAACL-MAIN.272},
  timestamp    = {Mon, 01 Aug 2022 16:28:04 +0200},
  biburl       = {https://dblp.org/rec/conf/naacl/SanthanamKSPZ22.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Usage

from suiteeval.suite import Lotte
results = Lotte(pipelines)

BRIGHT

BRIGHT comprises 12 diverse datasets spanning biology, economics, robotics, math, code, and more. Queries may be long StackExchange posts, math questions, or code questions.

Citation

Su et al. BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval. ICLR 2025. https://openreview.net/forum?id=ykuc5q381b
@inproceedings{DBLP:conf/iclr/SuYXSMWLSST0YA025,
  author       = {Hongjin Su and
                  Howard Yen and
                  Mengzhou Xia and
                  Weijia Shi and
                  Niklas Muennighoff and
                  Han{-}yu Wang and
                  Haisu Liu and
                  Quan Shi and
                  Zachary S. Siegel and
                  Michael Tang and
                  Ruoxi Sun and
                  Jinsung Yoon and
                  Sercan {\"{O}}. Arik and
                  Danqi Chen and
                  Tao Yu},
  title        = {{BRIGHT:} {A} Realistic and Challenging Benchmark for Reasoning-Intensive
                  Retrieval},
  booktitle    = {The Thirteenth International Conference on Learning Representations,
                  {ICLR} 2025, Singapore, April 24-28, 2025},
  publisher    = {OpenReview.net},
  year         = {2025},
  url          = {https://openreview.net/forum?id=ykuc5q381b},
  timestamp    = {Thu, 15 May 2025 17:19:05 +0200},
  biburl       = {https://dblp.org/rec/conf/iclr/SuYXSMWLSST0YA025.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Usage

from suiteeval.suite import BRIGHT
results = BRIGHT(pipelines)

MS MARCO (Document & Passage)

MS MARCO is a large-scale dataset for training and evaluating information retrieval models. These suites contain TREC Deep Learning queries and relevance judgments for both the document and the passage retrieval tasks.

Citation

Craswell et al. MS MARCO: Benchmarking Ranking Models in the Large-Data Regime. SIGIR 2021. https://doi.org/10.1145/3404835.3462804
@inproceedings{DBLP:conf/sigir/CraswellMYCL21,
  author       = {Nick Craswell and
                  Bhaskar Mitra and
                  Emine Yilmaz and
                  Daniel Campos and
                  Jimmy Lin},
  editor       = {Fernando Diaz and
                  Chirag Shah and
                  Torsten Suel and
                  Pablo Castells and
                  Rosie Jones and
                  Tetsuya Sakai},
  title        = {{MS} {MARCO:} Benchmarking Ranking Models in the Large-Data Regime},
  booktitle    = {{SIGIR} '21: The 44th International {ACM} {SIGIR} Conference on Research
                  and Development in Information Retrieval, Virtual Event, Canada, July
                  11-15, 2021},
  pages        = {1566--1576},
  publisher    = {{ACM}},
  year         = {2021},
  url          = {https://doi.org/10.1145/3404835.3462804},
  doi          = {10.1145/3404835.3462804},
  timestamp    = {Sun, 02 Nov 2025 21:27:20 +0100},
  biburl       = {https://dblp.org/rec/conf/sigir/CraswellMYCL21.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Usage

from suiteeval.suite import MSMARCODocument, MSMARCOPassage
doc_results = MSMARCODocument(pipelines)
pas_results = MSMARCOPassage(pipelines)
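
Running several suites together

Every usage snippet on this page follows the same pattern: a suite is called with the pipelines and returns results. The sketch below shows one way to run several suites in a single pass and keep the results keyed by suite name. It assumes only that calling convention. The helper `run_suites` and the stand-in suites `fake_beir` and `fake_lotte` are hypothetical and not part of suiteeval; the stand-ins exist so the sketch runs without the library installed, and their return values say nothing about what the real suites return.

```python
from typing import Any, Callable, Dict


def run_suites(suites: Dict[str, Callable[[Any], Any]], pipelines: Any) -> Dict[str, Any]:
    """Call each suite with the same pipelines and collect results by suite name."""
    results: Dict[str, Any] = {}
    for name, suite in suites.items():
        # Assumption: each suite is callable as suite(pipelines), as in the usage snippets.
        results[name] = suite(pipelines)
    return results


# Hypothetical stand-ins for real suites; they only echo what they were given.
def fake_beir(pipelines):
    return {"suite": "BEIR", "n_pipelines": len(pipelines)}


def fake_lotte(pipelines):
    return {"suite": "LoTTE", "n_pipelines": len(pipelines)}


demo_results = run_suites({"BEIR": fake_beir, "LoTTE": fake_lotte}, ["bm25", "dense"])
```

With suiteeval installed, the stand-ins would presumably be replaced by the imported suites, for example `{"BEIR": BEIR, "LoTTE": Lotte}`, provided they behave as the usage snippets suggest.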