SuitEval API Reference¶
suiteeval.suite¶
Abstract suite interface.
- class suiteeval.suite.base.Suite(*args, **kwargs)[source]¶
Bases: ABC
Abstract base class for a set of related evaluations across one or more datasets.
Subclasses (or classes created via SuiteMeta.register()) must populate:
- _datasets¶
Either a dict[str, str] mapping display name → IRDS dataset ID, or a list[str] of IRDS dataset IDs.
- _dataset_ids¶
Normalized mapping of display name → IRDS dataset ID (filled in by registration helpers).
- _metadata¶
Optional per-dataset or global metadata.
- _measures¶
A list of ir_measures.Measure objects, or a mapping from dataset name to such a list. When not provided, defaults are derived from metadata or IRDS documentation, ultimately falling back to [nDCG@10].
- _query_field¶
Optional topic field name to use when fetching topics.
Notes
Instances are singletons per subclass (enforced by SuiteMeta).
- static parse_measures(measures)[source]¶
Convert a list of measure strings or ir_measures.Measure objects into a flat list[Measure].
- Parameters:
measures (list[str | Measure]) – A sequence containing measure strings (e.g., "nDCG@10") and/or ir_measures.Measure instances.
- Returns:
Parsed measure objects.
- Return type:
list[Measure]
- Raises:
ValueError – If a string entry cannot be parsed by either ir_measures.parse_measure() or ir_measures.parse_trec_measure(), or if an entry has an invalid type.
- coerce_measures(metadata)[source]¶
Populate self._measures by aggregating available sources in priority order:
1. Global metadata['official_measures'] if present.
2. Per-dataset metadata[name]['official_measures'] if present.
3. IRDS documentation official_measures for each dataset (when available).
If no measures are discovered, default to [nDCG@10].
- Parameters:
metadata (dict[str, Any]) – The suite metadata dictionary as configured at construction time.
- Returns:
None
- Return type:
None
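The priority order described for coerce_measures() can be illustrated with a small standalone sketch. It is not the library's implementation: measures are plain strings rather than ir_measures.Measure objects, the function name collect_measures and the irds_docs dictionary are hypothetical, and it assumes that "aggregating" means merging all sources with duplicates removed rather than stopping at the first source that has an entry.

```python
from typing import Any, Dict, List

DEFAULT_MEASURES = ["nDCG@10"]  # documented fallback


def collect_measures(
    metadata: Dict[str, Any],
    dataset_names: List[str],
    irds_docs: Dict[str, Dict[str, Any]],  # hypothetical stand-in for IRDS documentation
) -> List[str]:
    """Merge measures from all sources in priority order, dropping duplicates."""
    found: List[str] = []

    def extend(candidates) -> None:
        # Keep first-seen order so higher-priority sources come first.
        for measure in candidates or []:
            if measure not in found:
                found.append(measure)

    # 1. Global official measures.
    extend(metadata.get("official_measures"))
    # 2. Per-dataset official measures.
    for name in dataset_names:
        per_dataset = metadata.get(name)
        if isinstance(per_dataset, dict):
            extend(per_dataset.get("official_measures"))
    # 3. Measures listed in the (stand-in) IRDS documentation.
    for name in dataset_names:
        extend(irds_docs.get(name, {}).get("official_measures"))

    # Nothing discovered anywhere: use the documented default.
    return found or list(DEFAULT_MEASURES)
```

Because the merge keeps first-seen order, a measure declared globally stays ahead of one that appears only in a dataset's documentation.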
- coerce_pipelines_sequential(context, pipeline_generators)[source]¶
Yield pipelines lazily, one at a time, without materializing the full set.
Use this when you want to minimize memory/VRAM footprint and you do not require joint analysis across all systems at once (e.g., significance testing).
- Parameters:
context (DatasetContext) – The shared DatasetContext for the current corpus group.
pipeline_generators (Callable[[DatasetContext], Any] | Sequence) – Callable or sequence of callables that produce either:
* a single pyterrier.Transformer,
* a sequence of transformers,
* a tuple (pipelines, name_or_names) where names may be a single label applied to all pipelines or a sequence aligned with pipelines.
- Yields:
tuple[Transformer, Optional[str]] – The pipeline and an optional display name.
- Raises:
ValueError – If a generator yields an invalid structure.
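The three accepted return shapes and the lazy, one-at-a-time behaviour of coerce_pipelines_sequential() can be sketched as follows. This is an illustration under assumptions, not the library's code: Transformer is a hypothetical stand-in for pyterrier.Transformer, iter_pipelines is an invented name, and the rule used to tell a (pipelines, names) tuple apart from a two-element sequence of transformers is a guess.

```python
from typing import Any, Iterator, Optional, Tuple


class Transformer:
    """Hypothetical stand-in for pyterrier.Transformer so the sketch runs without PyTerrier."""

    def __init__(self, label: str) -> None:
        self.label = label


def iter_pipelines(context: Any, generators: Any) -> Iterator[Tuple[Transformer, Optional[str]]]:
    """Yield (pipeline, optional name) pairs, calling each generator only when reached."""
    if callable(generators):
        generators = [generators]
    for generate in generators:
        produced = generate(context)
        names = None
        # Assumed rule: a 2-tuple whose second item is not a transformer
        # is the (pipelines, name_or_names) form.
        if (isinstance(produced, tuple) and len(produced) == 2
                and not isinstance(produced[1], Transformer)):
            produced, names = produced
        if isinstance(produced, Transformer):
            pipelines = [produced]
        elif isinstance(produced, (list, tuple)):
            pipelines = list(produced)
        else:
            raise ValueError(f"Unsupported generator output: {type(produced).__name__}")
        if not all(isinstance(p, Transformer) for p in pipelines):
            raise ValueError("Every pipeline must be a Transformer")
        if names is None or isinstance(names, str):
            # A single label (or no label) applies to every pipeline.
            names = [names] * len(pipelines)
        elif len(names) != len(pipelines):
            raise ValueError("Names must align one-to-one with pipelines")
        yield from zip(pipelines, names)
```

Since iter_pipelines is a generator function, the second callable is not invoked until the first one's pipelines have been consumed, which is what keeps only one system in memory at a time.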
- coerce_pipelines_grouped(context, pipeline_generators)[source]¶
Materialize all pipelines (and optional names) into lists.
Use this when downstream evaluation requires access to the full set of systems simultaneously (e.g., significance tests).
- Parameters:
context (DatasetContext) – The shared DatasetContext for the current corpus group.
pipeline_generators (Callable[[DatasetContext], Any] | Sequence) – Callable or sequence of callables following the same conventions as in coerce_pipelines_sequential().
- Returns:
A list of pipelines and, if provided, a list of corresponding names. If no names were supplied, returns None for the second element.
- Return type:
tuple[list[Transformer], Optional[list[str]]]
- Raises:
ValueError – If the generators produce no pipelines or an invalid structure.
- compute_overall_mean(results, eval_metrics=None)[source]¶
Append overall (geometric mean) rows across datasets for each system name.
This first aggregates per-dataset means over repeated runs, then computes the geometric mean across datasets for each metric and appends rows with dataset == "Overall".
- Parameters:
results (DataFrame) – DataFrame with at least ["dataset", "name"] and metric columns.
eval_metrics (Sequence[Any] | None) – Optional sequence of metrics to consider. If not provided, auto-detects all numeric metric columns in the results.
- Returns:
The input results with additional Overall rows appended.
- Return type:
pandas.DataFrame
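The two-step aggregation behind compute_overall_mean() (arithmetic mean within each dataset, then geometric mean across datasets) can be shown with a standard-library sketch. The real method works on a pandas DataFrame; here rows are plain dictionaries, and the function name overall_rows is hypothetical.

```python
from collections import defaultdict
from statistics import geometric_mean, mean
from typing import Dict, List


def overall_rows(rows: List[Dict], metric: str) -> List[Dict]:
    """Return one "Overall" row per system for a single metric column."""
    # Step 1: collect repeated runs per (system, dataset) pair.
    runs = defaultdict(list)
    for row in rows:
        runs[(row["name"], row["dataset"])].append(row[metric])

    # Step 2: average the repeated runs, grouping the per-dataset means by system.
    per_system = defaultdict(list)
    for (name, _dataset), values in runs.items():
        per_system[name].append(mean(values))

    # Step 3: geometric mean across datasets (scores are assumed to be positive).
    return [
        {"dataset": "Overall", "name": name, metric: geometric_mean(means)}
        for name, means in per_system.items()
    ]
```

For a system scoring 0.2 on one dataset and 0.8 on another, the geometric mean is sqrt(0.2 × 0.8) = 0.4, lower than the arithmetic mean of 0.5, so a weak dataset pulls the overall figure down more strongly.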
- get_measures(dataset)[source]¶
Resolve the measures applicable to a given dataset name.
- Parameters:
dataset (str) – Dataset display name as used in this suite.
- Returns:
The list configured for this dataset (or the suite-wide list if a single list is maintained). Falls back to defaults when the dataset is unknown.
- Return type:
list[Measure]
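The lookup that get_measures() describes, where _measures may be either a single list or a per-dataset mapping, might look like the sketch below. The function name resolve_measures is hypothetical, measures are plain strings, and using [nDCG@10] as the default for an unknown dataset is an assumption carried over from the documented class-level fallback.

```python
from typing import Dict, List, Sequence, Union


def resolve_measures(
    measures: Union[List[str], Dict[str, List[str]]],
    dataset: str,
    default: Sequence[str] = ("nDCG@10",),  # assumed default
) -> List[str]:
    """Return the measures for one dataset from a list or a per-dataset mapping."""
    if isinstance(measures, dict):
        # Per-dataset mapping: unknown datasets fall back to the default.
        return list(measures.get(dataset, default))
    # Suite-wide list: the same measures apply to every dataset.
    return list(measures) if measures else list(default)
```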
- property datasets: Generator[Tuple[str, Dataset], None, None]¶
Iterate over declared datasets yielding display name and PyTerrier dataset.
- Yields:
tuple[str, pyterrier.datasets.Dataset] – Pairs of (name, dataset object).
- Raises:
ValueError – If _datasets has an invalid type.
- __call__(ranking_generators, eval_metrics=None, subset=None, compute_overall=True, **experiment_kwargs)[source]¶
Run the experiment(s) for each dataset in the suite and return a results table.
If a baseline is provided in experiment_kwargs, all pipelines are materialized together (grouped mode) to enable tests that require joint access (e.g., significance). Otherwise, pipelines are streamed one-by-one to reduce memory usage (sequential mode).
- Parameters:
ranking_generators (Callable[[DatasetContext], Any] | Sequence[Callable[[DatasetContext], Any]]) – Callable or sequence of callables producing pipelines per DatasetContext (same conventions as in coerce_pipelines_sequential()).
eval_metrics (Sequence[Any] | None) – Optional explicit metrics to evaluate; defaults to the suite’s configuration for each dataset.
subset (str | None) – Optional dataset display name to restrict evaluation to a single member.
compute_overall (bool)
**experiment_kwargs (dict[str, Any]) – Additional keyword arguments forwarded to pyterrier.Experiment(). If save_dir is provided, it is suffixed per dataset. If index_dir is provided, it is suffixed per corpus for index storage.
- Returns:
The concatenated experiment results. When perquery is not set, an additional Overall row is appended per system with geometric-mean aggregation across datasets.
- Return type:
pandas.DataFrame
Notes
This method reuses a single index per corpus group and cleans up GPU memory between pipeline evaluations.
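How __call__() chooses between grouped and sequential mode, and how save_dir could be made dataset-specific, is sketched below. The function name plan_run is hypothetical, and the exact suffix format is an assumption: the reference only says the directory is "suffixed per dataset", so a subdirectory named after the dataset is used here for illustration.

```python
import os
from typing import Any, Dict, Tuple


def plan_run(dataset_name: str, experiment_kwargs: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Pick the evaluation mode and derive per-dataset keyword arguments."""
    kwargs = dict(experiment_kwargs)  # copy so the caller's dict is not mutated

    # A baseline needs all systems at once; compare against None so that
    # a baseline index of 0 still selects grouped mode.
    mode = "grouped" if kwargs.get("baseline") is not None else "sequential"

    # Assumed suffixing scheme: one subdirectory per dataset.
    if kwargs.get("save_dir"):
        kwargs["save_dir"] = os.path.join(kwargs["save_dir"], dataset_name)
    return mode, kwargs
```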
suiteeval.context¶
- class suiteeval.context.DatasetContext(dataset, path=None)[source]¶
Bases: object
Holds both a PyTerrier Dataset and a filesystem path (for indexes, caches, etc.).
- Parameters:
dataset (Dataset) – The pyterrier Dataset instance (must have _irds_id).
path (str | None) – Optional filesystem path to use; if omitted, a temp dir will be created for you.
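The constructor behaviour described for DatasetContext (require _irds_id, create a temporary directory when path is omitted) can be approximated with the following sketch. ContextSketch and FakeDataset are hypothetical stand-ins, and raising ValueError for a dataset without _irds_id is an assumption: the reference only states that the attribute must be present.

```python
import tempfile
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ContextSketch:
    """Hypothetical stand-in for DatasetContext: a dataset plus a working directory."""

    dataset: Any
    path: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumed validation: the dataset must carry an IRDS identifier.
        if not hasattr(self.dataset, "_irds_id"):
            raise ValueError("dataset must have an _irds_id attribute")
        # No path supplied: create a fresh temporary directory.
        if self.path is None:
            self.path = tempfile.mkdtemp(prefix="suiteeval-")


class FakeDataset:
    """Hypothetical dataset object exposing only the required attribute."""

    _irds_id = "example/dataset"
```

Note that a directory created with tempfile.mkdtemp() is not removed automatically, so a caller following this pattern would need to clean it up when finished.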