.. _datasets:

Importing Datasets
-----------------------
The datasets module allows easy access to existing standard test collections, particularly those from `TREC `_.
Each defined dataset can download and provide easy access to:

- files containing the documents of the corpus
- topics (queries), as a dataframe, ready for retrieval
- relevance assessments (also known as labels or qrels), as a dataframe, ready for evaluation
- ready-made Terrier indices, where appropriate


.. autofunction:: pyterrier.datasets.list_datasets

.. autofunction:: pyterrier.datasets.find_datasets

.. autofunction:: pyterrier.datasets.get_dataset

.. autoclass:: pyterrier.datasets.Dataset
    :members:

Examples
========

Many of the PyTerrier unit tests are based on the `Vaswani NPL test collection `_, a corpus of ~11,000 scientific abstracts.
PyTerrier provides a ready-made index on the `Terrier Data Repository `_. This allows experiments to be easily conducted::

    import pyterrier as pt

    dataset = pt.get_dataset("vaswani")
    bm25 = pt.terrier.Retriever.from_dataset(dataset, "terrier_stemmed", wmodel="BM25")
    dph = pt.terrier.Retriever.from_dataset(dataset, "terrier_stemmed", wmodel="DPH")
    pt.Experiment(
        [bm25, dph],
        dataset.get_topics(),
        dataset.get_qrels(),
        eval_metrics=["map"]
    )


Indexing the `MSMARCO document corpus `_ and then retrieving from it can be achieved as follows::

    dataset = pt.get_dataset("trec-deep-learning-docs")
    indexer = pt.TRECCollectionIndexer("./index")
    # this downloads the file msmarco-docs.trec.gz
    indexref = indexer.index(dataset.get_corpus())
    index = pt.IndexFactory.of(indexref)

    DPH_br = pt.terrier.Retriever(index, wmodel="DPH") % 100
    BM25_br = pt.terrier.Retriever(index, wmodel="BM25") % 100
    # this runs an experiment to obtain results on the TREC 2019 Deep Learning track queries and qrels
    pt.Experiment(
        [DPH_br, BM25_br],
        dataset.get_topics("test"),
        dataset.get_qrels("test"),
        eval_metrics=["recip_rank", "ndcg_cut_10", "map"])

For more details on use of MSMARCO, see `our MSMARCO leaderboard submission notebooks `_.

You can also index datasets that include a corpus using ``IterDictIndexer`` and ``get_corpus_iter()``::

    dataset = pt.datasets.get_dataset('irds:cord19/trec-covid')
    indexer = pt.index.IterDictIndexer('./cord19-index')
    indexref = indexer.index(dataset.get_corpus_iter(), fields=('title', 'abstract'))
    index = pt.IndexFactory.of(indexref)

    DPH_br = pt.terrier.Retriever(index, wmodel="DPH") % 100
    BM25_br = pt.terrier.Retriever(index, wmodel="BM25") % 100
    # this runs an experiment to obtain results on the TREC COVID queries and qrels
    pt.Experiment(
        [DPH_br, BM25_br],
        dataset.get_topics('title'),
        dataset.get_qrels(),
        eval_metrics=["P.5", "P.10", "ndcg_cut.10", "map"])

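
``get_corpus_iter()`` yields one dictionary per document, and ``IterDictIndexer`` accepts any iterable of that shape. The sketch below shows what such an iterable could look like for your own data. The ``iter_docs`` helper and its two sample records are hypothetical, the field names simply mirror the ``title`` and ``abstract`` fields used above, and each dictionary is assumed to carry a ``docno`` key identifying the document.

```python
from typing import Dict, Iterator

# Hypothetical in-memory records standing in for a real corpus.
SAMPLE_RECORDS = [
    ("d1", "First title", "First abstract text."),
    ("d2", "Second title", "Second abstract text."),
]


def iter_docs() -> Iterator[Dict[str, str]]:
    """Yield one dict per document: a docno plus the text fields to index."""
    for docno, title, abstract in SAMPLE_RECORDS:
        yield {"docno": docno, "title": title, "abstract": abstract}


# Materialise the iterator only to inspect it; an indexer would consume it lazily.
docs = list(iter_docs())
```

An iterator like this could be passed to ``indexer.index()`` in place of ``dataset.get_corpus_iter()``.
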
Available Datasets
==================

.. note::
    If you want to run PyTerrier with your own data, you can build Pandas DataFrames compatible
    with the :doc:`PyTerrier Data Model `.

    If you want to add new datasets to PyTerrier, you can use :doc:`this guide `.

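
As a rough illustration of using your own data, topics and qrels can be written as small DataFrames. This is a minimal sketch assuming the column names of the PyTerrier data model (``qid`` and ``query`` for topics; ``qid``, ``docno`` and ``label`` for qrels); the queries and document identifiers are made up.

```python
import pandas as pd

# Topics: one row per query, keyed by a query id.
topics = pd.DataFrame(
    [["q1", "chemical reactions"], ["q2", "magnetic fields"]],
    columns=["qid", "query"],
)

# Qrels: one row per judged (query, document) pair, with a relevance label.
qrels = pd.DataFrame(
    [["q1", "d10", 1], ["q1", "d11", 0], ["q2", "d20", 1]],
    columns=["qid", "docno", "label"],
)
```

DataFrames of this shape could then be passed to ``pt.Experiment`` in place of ``dataset.get_topics()`` and ``dataset.get_qrels()``.
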
The table below lists the provided datasets, detailing the attributes available for each dataset.
In each column, ``True`` designates the presence of a single artefact of that type, while a list denotes the available variants.
Datasets with the ``irds:`` prefix are from the `ir_datasets package `_; further
documentation on these datasets can be found `here `_.

.. include:: ./_includes/datasets-list-inc.rst