SuitEval API Reference

suiteeval.suite

Abstract suite interface.

class suiteeval.suite.base.Suite(*args, **kwargs)

Bases: ABC

Abstract base class for a set of related evaluations across one or more datasets.

Subclasses (or classes created via SuiteMeta.register()) must populate:

_datasets: Either a dict[str, str] mapping display name → IRDS dataset ID, or a list[str] of IRDS dataset IDs.

_dataset_ids: Normalized mapping of display name → IRDS dataset ID (filled in by registration helpers).

_metadata: Optional per-dataset or global metadata.

_measures: A list of ir_measures.Measure or a mapping from dataset name to such a list. When not provided, defaults are derived from metadata or IRDS documentation, ultimately falling back to [nDCG@10].

_query_field: Optional topic field name to use when fetching topics.

Notes

Instances are singletons per subclass (enforced by SuiteMeta).
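
The per-subclass singleton behaviour can be pictured with a small metaclass. This is a minimal sketch under stated assumptions, not the SuiteMeta implementation: SingletonMeta, BaseSuite, SuiteA and SuiteB are hypothetical names, and the real SuiteMeta also provides registration via register(), which is not modelled here.

```python
from abc import ABC, ABCMeta


class SingletonMeta(ABCMeta):
    """Hypothetical stand-in for SuiteMeta: caches one instance per class."""

    _instances = {}

    def __call__(cls, *args, **kwargs):
        # Construct the instance only on the first call for this exact class.
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class BaseSuite(ABC, metaclass=SingletonMeta):
    """Hypothetical base class; not suiteeval.suite.base.Suite."""


class SuiteA(BaseSuite):
    pass


class SuiteB(BaseSuite):
    pass
```

In this sketch, SuiteA() always returns the same object, while SuiteA() and SuiteB() remain distinct instances because the cache is keyed by class.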

static parse_measures(measures)

Convert a list of measure strings or ir_measures.Measure objects into a flat list[Measure].

Parameters: measures (list[str | Measure]) – A sequence containing measure strings (e.g., "nDCG@10") and/or ir_measures.Measure instances.
Returns: Parsed measure objects.
Return type: list[Measure]
Raises: ValueError – If a string entry cannot be parsed by either ir_measures.parse_measure() or ir_measures.parse_trec_measure(), or if an entry has an invalid type.

coerce_measures(metadata)

Populate self._measures by aggregating available sources in priority order:

Global metadata['official_measures'] if present.
Per-dataset metadata[name]['official_measures'] if present.
IRDS documentation official_measures for each dataset (when available).

If no measures are discovered, default to [nDCG@10].

Parameters: metadata (dict[str, Any]) – The suite metadata dictionary as configured at construction time.
Returns: None
Return type: None

coerce_pipelines_sequential(context, pipeline_generators)

Yield pipelines lazily, one at a time, without materializing the full set.

Use this when you want to minimize memory/VRAM footprint and you do not require joint analysis across all systems at once (e.g., significance testing).

Parameters:

context (DatasetContext) – The shared DatasetContext for the current corpus group.
pipeline_generators (Callable[[DatasetContext], Any] | Sequence) – Callable or sequence of callables, each producing one of:
* a single pyterrier.Transformer,
* a sequence of transformers,
* a tuple (pipelines, name_or_names), where names may be a single label applied to all pipelines or a sequence aligned with pipelines.

Yields:

tuple[Transformer, Optional[str]] – The pipeline and an optional display name.

Raises:

ValueError – If a generator yields an invalid structure.
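
The accepted return shapes can be normalized into (pipeline, name) pairs roughly as follows. This is a sketch under stated assumptions, not the library's code: iter_pipelines is a hypothetical name, plain Python objects stand in for pyterrier.Transformer instances, and the rule used to recognise a (pipelines, name_or_names) tuple (a 2-tuple whose second item is a string or a list/tuple of strings) is an assumption.

```python
from typing import Any, Callable, Iterator, Optional, Sequence, Tuple, Union


def iter_pipelines(
    context: Any,
    pipeline_generators: Union[Callable[[Any], Any], Sequence[Callable[[Any], Any]]],
) -> Iterator[Tuple[Any, Optional[str]]]:
    """Lazily yield (pipeline, name) pairs from one or more generator callables."""
    if callable(pipeline_generators):
        pipeline_generators = [pipeline_generators]
    for generate in pipeline_generators:
        # Each callable runs only when iteration reaches it.
        produced = generate(context)
        names = None
        # Assumed convention: a 2-tuple whose second item is a str or a
        # list/tuple of str is read as (pipelines, name_or_names).
        if (
            isinstance(produced, tuple)
            and len(produced) == 2
            and (
                isinstance(produced[1], str)
                or (
                    isinstance(produced[1], (list, tuple))
                    and all(isinstance(n, str) for n in produced[1])
                )
            )
        ):
            produced, names = produced
        pipelines = list(produced) if isinstance(produced, (list, tuple)) else [produced]
        if isinstance(names, str):
            names = [names] * len(pipelines)  # one label applied to all pipelines
        elif names is not None and len(names) != len(pipelines):
            raise ValueError("names must align one-to-one with pipelines")
        for i, pipeline in enumerate(pipelines):
            yield pipeline, (names[i] if names is not None else None)
```

Wrapping the same iterator in list(...) would give the grouped behaviour, at the cost of holding every pipeline in memory at once.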

coerce_pipelines_grouped(context, pipeline_generators)

Materialize all pipelines (and optional names) into lists.

Use this when downstream evaluation requires access to the full set of systems simultaneously (e.g., significance tests).

Parameters:

context (DatasetContext) – The shared DatasetContext for the current corpus group.
pipeline_generators (Callable[[DatasetContext], Any] | Sequence) – Callable or sequence of callables following the same conventions as in coerce_pipelines_sequential().

Returns:

A list of pipelines and, if provided, a list of corresponding names. If no names were supplied, returns None for the second element.

Return type:

tuple[list[Transformer], Optional[list[str]]]

Raises:

ValueError – If the generators produce no pipelines or an invalid structure.

compute_overall_mean(results, eval_metrics=None)

Append overall (geometric mean) rows across datasets for each system name.

This first aggregates per-dataset means over repeated runs, then computes the geometric mean across datasets for each metric and appends rows with dataset == "Overall".

Parameters:

results (DataFrame) – DataFrame with at least ["dataset", "name"] and metric columns.
eval_metrics (Sequence[Any] | None) – Optional sequence of metrics to consider. If omitted, all numeric metric columns in results are detected automatically.

Returns:

The input results with additional Overall rows appended.

Return type:

pandas.DataFrame
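
The two-stage aggregation described above can be sketched in plain Python. This is an illustration only: the method itself operates on a pandas DataFrame, while this sketch uses a list of dicts, handles a single metric column, and assumes strictly positive scores. The name overall_rows is hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean


def overall_rows(rows, metric):
    """Return one 'Overall' row per system for a single metric column."""
    # Stage 1: average repeated runs within each (name, dataset) pair.
    runs = defaultdict(list)
    for row in rows:
        runs[(row["name"], row["dataset"])].append(row[metric])
    per_dataset = defaultdict(list)
    for (name, _dataset), values in runs.items():
        per_dataset[name].append(mean(values))
    # Stage 2: geometric mean of the per-dataset means for each system.
    return [
        {
            "dataset": "Overall",
            "name": name,
            metric: math.exp(sum(math.log(v) for v in means) / len(means)),
        }
        for name, means in per_dataset.items()
    ]
```

For example, a system scoring 0.2 and 0.6 on two runs of one dataset and 0.1 on another has per-dataset means of 0.4 and 0.1, so its Overall value is sqrt(0.4 × 0.1) = 0.2.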

get_measures(dataset)

Resolve the measures applicable to a given dataset name.

Parameters:

dataset (str) – Dataset display name as used in this suite.

Returns:

The list configured for this dataset (or the suite-wide list if a single list is maintained). Falls back to defaults when the dataset is unknown.

Return type:

list[Measure]
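
The lookup rule can be sketched as below. This is a simplified illustration, not the library's code: resolve_measures and DEFAULT_MEASURES are hypothetical names, and measures are represented as plain strings rather than ir_measures.Measure objects.

```python
from typing import Dict, List, Union

DEFAULT_MEASURES = ["nDCG@10"]  # measure names as strings, for illustration only


def resolve_measures(
    measures: Union[List[str], Dict[str, List[str]], None],
    dataset: str,
) -> List[str]:
    """Pick the measures for one dataset from a list or a per-dataset mapping."""
    if isinstance(measures, dict):
        # Per-dataset mapping: unknown names fall back to the defaults.
        return list(measures.get(dataset, DEFAULT_MEASURES))
    if measures:
        # A single suite-wide list applies to every dataset.
        return list(measures)
    return list(DEFAULT_MEASURES)
```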

property datasets: Generator[Tuple[str, Dataset], None, None]

Iterate over declared datasets yielding display name and PyTerrier dataset.

Yields: tuple[str, pyterrier.datasets.Dataset] – Pairs of (name, dataset object).
Raises: ValueError – If _datasets has an invalid type.

__call__(ranking_generators, eval_metrics=None, subset=None, compute_overall=True, **experiment_kwargs)

Run the experiment(s) for each dataset in the suite and return a results table.

If a baseline is provided in experiment_kwargs, all pipelines are materialized together (grouped mode) to enable tests that require joint access (e.g., significance). Otherwise, pipelines are streamed one-by-one to reduce memory usage (sequential mode).

Parameters:

ranking_generators (Callable[[DatasetContext], Any] | Sequence[Callable[[DatasetContext], Any]]) – Callable or sequence of callables producing pipelines per DatasetContext (same conventions as in coerce_pipelines_sequential()).
eval_metrics (Sequence[Any]) – Optional explicit metrics to evaluate; defaults to the suite’s configuration for each dataset.
subset (str | None) – Optional dataset display name to restrict evaluation to a single member.
**experiment_kwargs (dict[str, Any]) – Additional keyword arguments forwarded to pyterrier.Experiment(). If save_dir is provided, it is suffixed per dataset. If index_dir is provided, it is suffixed per corpus for index storage.
compute_overall (bool) – Whether to append the aggregate Overall rows (see compute_overall_mean()).

Returns:

The concatenated experiment results. When perquery is not set, an additional Overall row is appended per system with geometric-mean aggregation across datasets.

Return type:

pandas.DataFrame

Notes

This method reuses a single index per corpus group and cleans up GPU memory between pipeline evaluations.
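
The choice between grouped and sequential mode, and the per-dataset save_dir handling, can be pictured as follows. This is a sketch under stated assumptions: plan_experiment is a hypothetical helper, and the suffixing scheme (one subdirectory per dataset via os.path.join) is assumed, since the reference only says that save_dir is suffixed per dataset.

```python
import os
from typing import Any, Dict


def plan_experiment(dataset_name: str, experiment_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Decide the evaluation mode and derive per-dataset keyword arguments."""
    kwargs = dict(experiment_kwargs)  # do not mutate the caller's dict
    # A baseline implies significance testing, which needs every system at once.
    mode = "grouped" if kwargs.get("baseline") is not None else "sequential"
    if kwargs.get("save_dir"):
        # Assumed suffixing scheme: one subdirectory per dataset.
        kwargs["save_dir"] = os.path.join(kwargs["save_dir"], dataset_name)
    return {"mode": mode, "kwargs": kwargs}
```

The baseline check compares against None rather than relying on truthiness, so a baseline index of 0 still selects grouped mode.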

suiteeval.context

class suiteeval.context.DatasetContext(dataset, path=None)

Bases: object

Holds both a PyTerrier Dataset and a filesystem path (for indexes, caches, etc.).

Parameters:

dataset (Dataset) – The PyTerrier Dataset instance (must have an _irds_id attribute).
path (str | None) – Optional filesystem path to use; if omitted, a temp dir will be created for you.
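
The path fallback can be illustrated with a small stand-in class. This is a minimal sketch, assuming the temporary directory is created with tempfile.mkdtemp(); ContextSketch is a hypothetical name and does not reproduce the rest of DatasetContext.

```python
import tempfile
from typing import Any, Optional


class ContextSketch:
    """Hypothetical stand-in for DatasetContext; not the library class."""

    def __init__(self, dataset: Any, path: Optional[str] = None):
        self.dataset = dataset
        # Fall back to a fresh temporary directory when no path is given.
        self.path = path if path is not None else tempfile.mkdtemp()
```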

text_loader(fields='*')

Return an IRDSTextLoader instance for retrieving document texts.

Parameters: fields (List[str] | str | Literal['*']) – Fields to load; can be a list of field names, a single field name, or '*' for all fields.
Returns: An IRDSTextLoader instance.

get_corpus_iter(**iter_kwargs)

Return an iterator over the corpus documents.

Parameters: **iter_kwargs – Keyword arguments passed to get_corpus_iter.

suiteeval.utility

suiteeval.utility.geometric_mean(sequence)

Compute the geometric mean of a sequence of numbers.

Parameters: sequence (Sequence[float]) – A sequence of numbers.
Returns: The geometric mean of the sequence.
Return type: float
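
For reference, the geometric mean of n positive numbers is the n-th root of their product. The sketch below computes it in log space; it is an illustration of the formula rather than the library's implementation. The name geometric_mean_sketch is hypothetical, and the handling of empty or non-positive input is an assumption.

```python
import math
from typing import Sequence


def geometric_mean_sketch(values: Sequence[float]) -> float:
    """Geometric mean of strictly positive numbers, computed in log space."""
    if not values:
        raise ValueError("geometric mean of an empty sequence is undefined")
    if any(v <= 0 for v in values):
        raise ValueError("this sketch only handles strictly positive values")
    # exp(mean(log x)) avoids overflow from multiplying many values together.
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

For the inputs 1 and 4 this gives sqrt(1 × 4) = 2, whereas the arithmetic mean would be 2.5.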

suiteeval.utility.harmonic_mean(sequence)

Compute the harmonic mean of a sequence of numbers.

Parameters: sequence (Sequence[float]) – A sequence of numbers.
Returns: The harmonic mean of the sequence.
Return type: float
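
The harmonic mean of n positive numbers is n divided by the sum of their reciprocals. The sketch below illustrates the formula only; harmonic_mean_sketch is a hypothetical name, and how the library treats empty or non-positive input is not stated here, so the checks are an assumption.

```python
from typing import Sequence


def harmonic_mean_sketch(values: Sequence[float]) -> float:
    """Harmonic mean of strictly positive numbers: n / sum(1 / x)."""
    if not values:
        raise ValueError("harmonic mean of an empty sequence is undefined")
    if any(v <= 0 for v in values):
        raise ValueError("this sketch only handles strictly positive values")
    return len(values) / sum(1.0 / v for v in values)
```

For the inputs 1 and 4 this gives 2 / (1 + 0.25) = 1.6, which is lower than the geometric mean of the same pair because the harmonic mean weights small values more heavily.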