Extending PyTerrier with New Datasets¶
Note
This guide is for adding new datasets to PyTerrier, allowing them to be easily used by others.
If you simply want to run PyTerrier with your own data, you can build Pandas DataFrames compatible
with the PyTerrier Data Model - for example, using pt.io.read_topics()
to read from a file.
If you want to use existing built-in datasets, you can find them on this page.
Hint
If you’re adding a typical “Cranfield-style” open source dataset (queries/docs/qrels), consider contributing to ir-datasets instead – they will be imported into PyTerrier automatically.
If you want to make a whole set of new datasets to PyTerrier through the Datasets API (an advanced use case), then this involves:
Creating a new
Dataset
class, which contains the logic for processing your data into the PyTerrier Data Model.Creating a new
DatasetProvider
class, which provides access to your datasets.Registering the new dataset provider with PyTerrier using an entry point.
Each of these steps is detailed below.
Dataset
Classes¶
Your Dataset
class should inherit from pyterrier.datasets.Dataset
. Most commonly, the class will implement
the following methods:
get_topics()
: Returns the topics for the dataset.get_qrels()
: Returns the qrels for the dataset.get_corpus_iter()
: Returns an iterator over the corpus for the dataset.
You can also provide other methods to provide data based on your specific needs. For instance, the implementation in
PyTerrier_rag <https://github.com/terrierteam/pyterrier_rag/blob/main/pyterrier_rag/_datasets.py> includes a get_answers()
method to provide reference answers.
For example:
Dataset
Implementation¶from typing import Iterable, Dict, Any
import pyterrier as pt
import pandas as pd
class MyDataset(pt.datasets.Dataset):
def get_topics(self, variant=None) -> pd.DataFrame:
# Logic to load and return topics
return pd.DataFrame(read_topics('my_topics'))
def get_qrels(self, variant=None) -> pd.DataFrame:
# Logic to load and return qrels
return pd.DataFrame(read_qrels('my_qrels'))
def get_corpus_iter(self, verbose=True) -> Iterable[Dict[str, Any]]:
# Logic to load and return corpus iterator
for line in open('my_file'):
docno, text = parse_line(line)
yield {'docno': docno, 'text': text}
DatasetProvider
Classes¶
Your DatasetProvider
class provides access to the datasets you want to include in your package.
It should inherit from pt.datasets.DatasetProvider
and implement the following methods:
get_dataset(name)
: Returns a specific dataset by name.list_dataset_names()
: Returns a list of the names (IDs) of all datasets provided by this object.
For example:
DatasetProvider
Implementation¶import pyterrier as pt
class MyDatasetProvider(pt.datasets.DatasetProvider):
def get_dataset(self, name):
if name == "my_dataset":
return MyDataset()
else:
raise ValueError(f"Dataset {name} not found")
def list_dataset_names(self):
return ["my_dataset"]
Registering your DatasetProvider
¶
You can register your DatasetProvider
with PyTerrier using an entry point in your package’s setup.py
file or
pyproject.toml
file. This allows PyTerrier to discover your datasets when your package is installed.
The entry point should provide a prefix that identifies your dataset provider. When a user requests a dataset with a
name that starts with this prefix, PyTerrier will use your DatasetProvider
to load the dataset. For example, if you
register your provider with the prefix my_prefix
, if a user requests the dataset pt.get_dataset("my_prefix:my_dataset")
,
PyTerrier will load your MyDatasetProvider
class and invoke its get_dataset("my_dataset")
method.
If you are using a setup.py
file, you can add the following entry point as follows:
setup.py
¶from setuptools import setup
setup(
... # <-- the rest of your configuration
entry_points={
"pyterrier.dataset_provider": [ # <-- PyTerrier looks for this entry point
"my_prefix = my_package.MyDatasetProvider" # <-- when a dataset looks like 'my_prefix:{name}', it will load MyDatasetProvider
]
},
)
If you are using pyproject.toml
, you can add the entry point as follows:
pyproject.toml
¶... # <-- the rest of your configuration
[project.entry-points."pyterrier.dataset_provider"] # <-- PyTerrier looks for this entry point
"my_prefix" = "my_package.MyDatasetProvider" # <-- when a dataset looks like 'my_prefix:{name}', it will load MyDatasetProvider