Terrier Quick Start Tutorial¶

Terrier is an open-source search engine that allows for efficient indexing and retrieval of documents.

In this tutorial, you will:

Index a small collection of web text using Terrier
Retrieve over that text using BM25
Build a simple retrieval pipeline

To complete this tutorial, you first need to:

Have PyTerrier installed.

Tutorial¶

In this tutorial, you will index and retrieve from ANTIQUE [1], which is a collection of around 400,000 web documents from question-answering forums. ANTIQUE is built into PyTerrier’s Dataset API, so it is downloaded for you automatically when you first need it.

Optional: Exploring the ANTIQUE Collection

You can check out what the collection looks like by loading it into a DataFrame:

import pandas as pd
import pyterrier as pt
dataset = pt.get_dataset('irds:antique')
corpus = pd.DataFrame(dataset.get_corpus_iter()) # [1]

This loads the entire ANTIQUE corpus into a Pandas DataFrame for exploration. Most collections will be too large to load into memory like this, but ANTIQUE is small enough to do so.

The resulting corpus dataframe will look something like this:

	docno	text
0	2020338_0	A small group of politicians believed strongly …
1	2020338_1	Because there is a lot of oil in Iraq.
2	2020338_2	It is tempting to say that the US invaded Iraq …
3	2020338_3	I think Yuval is pretty spot on. It’s a proving…
4	2874684_0	Call an area apiarist. They should be able to …
…	…	…

Go ahead and play around with it to get a feel for the data! You can try answering the following questions:

How many documents are in the collection?
Are there any documents that are particularly long or short?
Can you find any interesting patterns or themes in the text?

We will start by building an index of the collection. This constructs data structures that allow for efficient retrieval of documents based on their content.

To index a collection using Terrier, you first need to create a TerrierIndex object. Since Terrier indexes are stored on disk, you need to provide a path where the index will be stored when constructing it. To add documents to the index, you can call index(), passing in the corpus.

Creating a Terrier index¶

import pyterrier as pt
my_index = pt.terrier.TerrierIndex('my_index.terrier') # [1]
dataset = pt.get_dataset('irds:antique')
my_index.index(dataset.get_corpus_iter())

You can specify any path you like here. We typically use the .terrier extension to indicate that it is a Terrier index, but this isn’t required.

This step may take a minute or two to download the dataset and index it, but once it is done, you will have a Terrier index stored at the specified location.

Once indexing is complete, we can retrieve documents. Terrier has a variety of ways to retrieve documents, but we will use the popular BM25 retrieval model. To retrieve documents using BM25, we can use the bm25() method of the TerrierIndex object. This method returns a retriever object that can be used to perform retrieval.

Retrieving documents using BM25¶

bm25_retriever = my_index.bm25() # [1]
results = bm25_retriever.search("capital of Germany") # [2]

This creates a BM25 transformer object that can be used to perform retrieval over my_index.
This performs retrieval for the given query and returns the results as a DataFrame.

You should get results that look like this:

	qid	docid	docno	rank	score	query
0	1	218864	846016_7	0	22.357888	capital of Germany
1	1	42629	4034012_0	1	21.672244	capital of Germany
2	1	347695	58580_10	2	17.453893	capital of Germany
3	1	92087	4255880_12	3	16.887855	capital of Germany
…	…	…	…	…	…	…

We can see that retrieval worked, returning documents for our query. However, we do not see the contents of the documents, only their unique identifier (docno). We can build a simple pipeline to load the document text so we can see what was retrieved.

Building a retrieval pipeline that loads document text¶

retrieval_pipeline = my_index.bm25() >> dataset.text_loader() # [1]
results = retrieval_pipeline.search("capital of Germany")

Here, we build a pipeline that first retrieves documents using BM25, then loads the document text using the dataset’s text loader.

Now, when we run the retrieval pipeline, we get results that include the document text:

	qid	docid	docno	rank	score	query	text
0	1	218864	846016_7	0	22.357888	capital of Germany	Why can’t you just be glad that Hamburg isn’t …
1	1	42629	4034012_0	1	21.672244	capital of Germany	Berlin is the Capital of Germany.. . It as als…
2	1	347695	58580_10	2	17.453893	capital of Germany	I go to school in the U.S. and they don’t real…
3	1	92087	4255880_12	3	16.887855	capital of Germany	American - Capitol Amber (Madison, Wisconsin)….
…	…	…	…	…	…	…	…

Although not all the results are relevant, we can see that we have the answer to our question (Berlin is the Capital of Germany) in row 1.

References¶

[1]

Citation

Hashemi et al. ANTIQUE: A Non-factoid Question Answering Benchmark. ECIR (2) 2020. [link]

@inproceedings{DBLP:conf/ecir/HashemiAZC20,
  author       = {Helia Hashemi and
                  Mohammad Aliannejadi and
                  Hamed Zamani and
                  W. Bruce Croft},
  editor       = {Joemon M. Jose and
                  Emine Yilmaz and
                  Jo{\~{a}}o Magalh{\~{a}}es and
                  Pablo Castells and
                  Nicola Ferro and
                  M{\'{a}}rio J. Silva and
                  Fl{\'{a}}vio Martins},
  title        = {{ANTIQUE:} {A} Non-factoid Question Answering Benchmark},
  booktitle    = {Advances in Information Retrieval - 42nd European Conference on {IR}
                  Research, {ECIR} 2020, Lisbon, Portugal, April 14-17, 2020, Proceedings,
                  Part {II}},
  series       = {Lecture Notes in Computer Science},
  volume       = {12036},
  pages        = {166--173},
  publisher    = {Springer},
  year         = {2020},
  url          = {https://doi.org/10.1007/978-3-030-45442-5\_21},
  doi          = {10.1007/978-3-030-45442-5\_21},
  timestamp    = {Sun, 02 Nov 2025 21:27:37 +0100},
  biburl       = {https://dblp.org/rec/conf/ecir/HashemiAZC20.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}