Terrier Quick Start Tutorial

Terrier is an open-source search engine that allows for efficient indexing and retrieval of documents.

In this tutorial, you will:

  • Index a small collection of web text using Terrier

  • Retrieve over that text using BM25

  • Build a simple retrieval pipeline

To complete this tutorial, you first need to:

Tutorial

In this tutorial, you will index and retrieve from ANTIQUE [1], which is a collection of around 400,000 web documents from question-answering forums. ANTIQUE is built into PyTerrier’s Dataset API, so it is downloaded for you automatically when you first need it.

We will start by building an index of the collection. This constructs data structures that allow for efficient retrieval of documents based on their content.
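To make the idea of an index more concrete, the following is a minimal sketch of an inverted index, the kind of data structure that lets a search engine look up which documents contain a term without scanning every document. It is a toy illustration in plain Python, not Terrier's actual on-disk format; the tokenizer, the function names, and the three example documents are all hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to {docno: term frequency} for the documents containing it."""
    index = defaultdict(dict)
    for doc in docs:
        for term in tokenize(doc["text"]):
            postings = index[term]
            postings[doc["docno"]] = postings.get(doc["docno"], 0) + 1
    return dict(index)


# Hypothetical toy documents, not taken from ANTIQUE.
toy_docs = [
    {"docno": "d1", "text": "Berlin is the capital of Germany"},
    {"docno": "d2", "text": "Madison is the capital of Wisconsin"},
    {"docno": "d3", "text": "Hamburg is a city in Germany"},
]

toy_index = build_inverted_index(toy_docs)

# Looking up a term returns only the documents that contain it.
print(sorted(toy_index["germany"]))
```

A real index also stores statistics such as document lengths and uses compressed structures on disk, which is why Terrier needs a path to write to.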

To index a collection using Terrier, you first need to create a TerrierIndex object. Since Terrier indexes are stored on disk, you need to provide a path where the index will be stored when constructing it. To add documents to the index, you can call index(), passing in the corpus.

Creating a Terrier index
import pyterrier as pt
my_index = pt.terrier.TerrierIndex('my_index.terrier') # [1]
dataset = pt.get_dataset('irds:antique')
my_index.index(dataset.get_corpus_iter())
  1. You can specify any path you like here. We typically use the .terrier extension to indicate that it is a Terrier index, but this isn’t required.

This step may take a minute or two to download the dataset and index it, but once it is done, you will have a Terrier index stored at the specified location.

Once indexing is complete, we can retrieve documents. Terrier has a variety of ways to retrieve documents, but we will use the popular BM25 retrieval model. To retrieve documents using BM25, we can use the bm25() method of the TerrierIndex object. This method returns a retriever object that can be used to perform retrieval.
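For intuition about what a BM25 retriever computes, here is a minimal sketch of a textbook BM25 scoring function in plain Python. It is an assumption-laden illustration: the parameter values (k1=1.2, b=0.75), the IDF variant, the whitespace tokenizer, and the toy documents are choices made for this example, and Terrier's own BM25 implementation and defaults may differ.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = {docno: text.lower().split() for docno, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized.values() for term in set(toks))

    scores = {}
    for docno, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[docno] = score
    return scores


# Hypothetical toy documents, not taken from ANTIQUE.
toy_docs = {
    "d1": "Berlin is the capital of Germany",
    "d2": "Madison is the capital of Wisconsin",
    "d3": "Hamburg is a city in Germany",
}

scores = bm25_scores("capital of Germany", toy_docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy collection, the document that matches all three query terms ranks first. A real engine computes the same kind of score from the index rather than by rescanning every document.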

Retrieving documents using BM25
bm25_retriever = my_index.bm25() # [1]
results = bm25_retriever.search("capital of Germany") # [2]
  1. This creates a BM25 transformer object that can be used to perform retrieval over my_index.

  2. This performs retrieval for the given query and returns the results as a DataFrame.

You should get results that look like this:

  qid   docid       docno  rank      score               query
0   1  218864    846016_7     0  22.357888  capital of Germany
1   1   42629   4034012_0     1  21.672244  capital of Germany
2   1  347695    58580_10     2  17.453893  capital of Germany
3   1   92087  4255880_12     3  16.887855  capital of Germany

We can see that retrieval worked, returning documents for our query. However, we do not see the contents of the documents, only their unique identifier (docno). We can build a simple pipeline to load the document text so we can see what was retrieved.

Building a retrieval pipeline that loads document text
retrieval_pipeline = my_index.bm25() >> dataset.text_loader() # [1]
results = retrieval_pipeline.search("capital of Germany")
  1. Here, we build a pipeline that first retrieves documents using BM25, then loads the document text using the dataset’s text loader.

Now, when we run the retrieval pipeline, we get results that include the document text:

  qid   docid       docno  rank      score               query  text
0   1  218864    846016_7     0  22.357888  capital of Germany  Why can’t you just be glad that Hamburg isn’t …
1   1   42629   4034012_0     1  21.672244  capital of Germany  Berlin is the Capital of Germany.. . It as als…
2   1  347695    58580_10     2  17.453893  capital of Germany  I go to school in the U.S. and they don’t real…
3   1   92087  4255880_12     3  16.887855  capital of Germany  American - Capitol Amber (Madison, Wisconsin)….

Although not all the results are relevant, we can see that we have the answer to our question (Berlin is the Capital of Germany) in row 1.
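The idea behind chaining stages with >> can be sketched in plain Python by overloading the right-shift operator. This is only a conceptual illustration: the Stage class, toy_retrieve, toy_load_text, and the toy data are hypothetical names invented for this example, rows are plain dictionaries rather than DataFrames, and PyTerrier's own transformers are implemented differently.

```python
class Stage:
    """A toy pipeline stage wrapping a function from rows to rows."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, rows):
        return self.fn(rows)

    def __rshift__(self, other):
        # a >> b builds a new stage that runs a, then feeds its output to b.
        return Stage(lambda rows: other(self(rows)))


# Hypothetical document store standing in for a dataset's text.
TOY_TEXT = {
    "d1": "Berlin is the capital of Germany",
    "d2": "Madison is the capital of Wisconsin",
}


def toy_retrieve(rows):
    """Pretend retriever: attaches a fixed ranking to each query row."""
    return [
        {**row, "docno": docno, "rank": rank}
        for row in rows
        for rank, docno in enumerate(["d1", "d2"])
    ]


def toy_load_text(rows):
    """Add a text field to each result row by looking up its docno."""
    return [{**row, "text": TOY_TEXT[row["docno"]]} for row in rows]


pipeline = Stage(toy_retrieve) >> Stage(toy_load_text)
results = pipeline([{"qid": "1", "query": "capital of Germany"}])
for row in results:
    print(row["rank"], row["docno"], row["text"])
```

Each stage only needs to agree on the shape of the rows it receives and returns, which is what lets retrieval and text loading be combined without either stage knowing about the other.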

References