SuiteEval API Reference

suiteeval.suite

Abstract suite interface.

class suiteeval.suite.base.Suite(*args, **kwargs)[source]

Bases: ABC

Abstract base class for a set of related evaluations across one or more datasets.

Subclasses (or classes created via SuiteMeta.register()) must populate:

_datasets

Either a dict[str, str] mapping display name → IRDS dataset ID, or a list[str] of IRDS dataset IDs.

_dataset_ids

Normalized mapping of display name → IRDS dataset ID (filled in by registration helpers).

_metadata

Optional per-dataset or global metadata.

_measures

A list of ir_measures.Measure or a mapping from dataset name to such a list. When not provided, defaults are derived from metadata or IRDS documentation, ultimately falling back to [nDCG@10].

_query_field

Optional topic field name to use when fetching topics.

Notes

Instances are singletons per subclass (enforced by SuiteMeta).

static parse_measures(measures)[source]

Convert a list of measure strings or ir_measures.Measure objects into a flat list[Measure].

Parameters:

measures (list[str | Measure]) – A sequence containing measure strings (e.g., "nDCG@10") and/or ir_measures.Measure instances.

Returns:

Parsed measure objects.

Return type:

list[Measure]

Raises:

ValueError – If a string entry cannot be parsed by either ir_measures.parse_measure() or ir_measures.parse_trec_measure(), or if an entry has an invalid type.

coerce_measures(metadata)[source]

Populate self._measures by aggregating available sources in priority order:

  1. Global metadata['official_measures'] if present.

  2. Per-dataset metadata[name]['official_measures'] if present.

  3. IRDS documentation official_measures for each dataset (when available).

If no measures are discovered, default to [nDCG@10].

Parameters:

metadata (dict[str, Any]) – The suite metadata dictionary as configured at construction time.

Returns:

None

Return type:

None
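
The priority order described for coerce_measures() can be illustrated with a small standalone sketch. It is not the library's implementation: the function name resolve_measures, the irds_docs argument (standing in for measures read from IRDS documentation), and the choice to merge all sources while dropping duplicates are assumptions made for illustration. Measures are kept as plain strings so the example has no dependencies.

```python
from typing import Any

DEFAULT_MEASURES = ["nDCG@10"]  # the fallback named in the documentation


def resolve_measures(
    metadata: dict[str, Any],
    dataset_names: list[str],
    irds_docs: dict[str, list[str]],
) -> list[str]:
    """Hypothetical helper: collect measure strings in priority order."""
    found: list[str] = []

    def add(measures):
        # Keep first-seen order and skip duplicates (an assumption).
        for measure in measures or []:
            if measure not in found:
                found.append(measure)

    # 1. Global metadata['official_measures'].
    add(metadata.get("official_measures"))

    # 2. Per-dataset metadata[name]['official_measures'].
    for name in dataset_names:
        entry = metadata.get(name)
        if isinstance(entry, dict):
            add(entry.get("official_measures"))

    # 3. Measures listed in IRDS documentation (passed in here as a dict).
    for name in dataset_names:
        add(irds_docs.get(name))

    # Nothing discovered: fall back to the default.
    return found or list(DEFAULT_MEASURES)
```

With empty metadata and no documented measures, the sketch returns the default list; otherwise earlier sources appear first in the result.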

coerce_pipelines_sequential(context, pipeline_generators)[source]

Yield pipelines lazily, one at a time, without materializing the full set.

Use this when you want to minimize memory/VRAM footprint and you do not require joint analysis across all systems at once (e.g., significance testing).

Parameters:
  • context (DatasetContext) – The shared DatasetContext for the current corpus group.

  • pipeline_generators (Callable[[DatasetContext], Any] | Sequence) – Callable or sequence of callables, each producing one of:

      • a single pyterrier.Transformer;

      • a sequence of transformers;

      • a tuple (pipelines, name_or_names), where name_or_names may be a single label applied to all pipelines or a sequence aligned with pipelines.

Yields:

tuple[Transformer, Optional[str]] – The pipeline and an optional display name.

Raises:

ValueError – If a generator yields an invalid structure.
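
As a rough illustration of the three accepted return shapes, the sketch below normalizes generator output into (pipeline, name) pairs and calls each generator only when the consumer reaches it. The names iter_pipelines and _looks_like_names are hypothetical, and the rule used to tell a (pipelines, name_or_names) tuple apart from a plain two-element sequence of transformers is an assumption; the library may decide this differently.

```python
from typing import Any, Callable, Iterator, Optional, Sequence, Union


def _looks_like_names(value: Any) -> bool:
    # Assumption: names are a str or a list/tuple containing only str.
    if isinstance(value, str):
        return True
    return isinstance(value, (list, tuple)) and all(isinstance(v, str) for v in value)


def iter_pipelines(
    context: Any,
    generators: Union[Callable[[Any], Any], Sequence[Callable[[Any], Any]]],
) -> Iterator[tuple[Any, Optional[str]]]:
    """Hypothetical normalizer: lazily yield (pipeline, optional name)."""
    if callable(generators):
        generators = [generators]
    for make in generators:
        produced = make(context)  # invoked only when iteration gets here
        names = None
        # Shape 3: (pipelines, name_or_names).
        if (
            isinstance(produced, tuple)
            and len(produced) == 2
            and _looks_like_names(produced[1])
        ):
            produced, names = produced
        # Shape 2 is a sequence of pipelines; shape 1 is a single pipeline.
        pipelines = list(produced) if isinstance(produced, (list, tuple)) else [produced]
        if isinstance(names, str):
            names = [names] * len(pipelines)  # one label shared by all
        if names is not None and len(names) != len(pipelines):
            raise ValueError("names must align one-to-one with pipelines")
        for i, pipeline in enumerate(pipelines):
            yield pipeline, (names[i] if names is not None else None)
```

Wrapping the same iterator in list() gives the materialized form that a grouped evaluation would need, at the cost of holding every pipeline in memory at once.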

coerce_pipelines_grouped(context, pipeline_generators)[source]

Materialize all pipelines (and optional names) into lists.

Use this when downstream evaluation requires access to the full set of systems simultaneously (e.g., significance tests).

Returns:

A list of pipelines and, if provided, a list of corresponding names. If no names were supplied, returns None for the second element.

Return type:

tuple[list[Transformer], Optional[list[str]]]

Raises:

ValueError – If the generators produce no pipelines or an invalid structure.

compute_overall_mean(results, eval_metrics=None)[source]

Append overall (geometric mean) rows across datasets for each system name.

This first aggregates per-dataset means over repeated runs, then computes the geometric mean across datasets for each metric and appends rows with dataset == "Overall".

Parameters:
  • results (DataFrame) – DataFrame with at least ["dataset", "name"] and metric columns.

  • eval_metrics (Sequence[Any] | None) – Optional sequence of metrics to consider. If not provided, auto-detects all numeric metric columns in the results.

Returns:

The input results with additional Overall rows appended.

Return type:

pandas.DataFrame
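
The two-step aggregation described for compute_overall_mean() (mean over repeated runs per dataset, then geometric mean across datasets) can be sketched without pandas. This is only an illustration under stated assumptions: rows are plain dicts instead of DataFrame rows, append_overall is a hypothetical name, and treating any non-positive per-dataset mean as an overall score of 0.0 is a choice made here, not documented behavior.

```python
import math
from collections import defaultdict


def append_overall(rows: list[dict], metrics: list[str]) -> list[dict]:
    """Hypothetical helper: return rows plus one 'Overall' row per system."""
    # Step 1: group repeated runs by (system name, dataset).
    runs = defaultdict(list)
    for row in rows:
        runs[(row["name"], row["dataset"])].append(row)

    # Per-dataset mean of each metric, collected per system.
    per_system = defaultdict(list)
    for (name, _dataset), group in runs.items():
        per_system[name].append(
            {m: sum(r[m] for r in group) / len(group) for m in metrics}
        )

    # Step 2: geometric mean across datasets for each metric.
    overall = []
    for name, dataset_means in per_system.items():
        out = {"dataset": "Overall", "name": name}
        for m in metrics:
            values = [d[m] for d in dataset_means]
            if min(values) <= 0:
                out[m] = 0.0  # assumption: avoid log of non-positive values
            else:
                out[m] = math.exp(sum(math.log(v) for v in values) / len(values))
        overall.append(out)

    # The input list is left unmodified; a new list is returned.
    return rows + overall
```

A system scoring 0.25 on one dataset and 1.0 on another receives an Overall value of about 0.5, lower than the arithmetic mean of 0.625, which is the usual reason for preferring a geometric mean when a weak dataset should not be masked by a strong one.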

get_measures(dataset)[source]

Resolve the measures applicable to a given dataset name.

Parameters:

dataset (str) – Dataset display name as used in this suite.

Returns:

The list configured for this dataset (or the suite-wide list if a single list is maintained). Falls back to defaults when the dataset is unknown.

Return type:

list[Measure]
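
The lookup rule for get_measures() can be pictured with a short sketch. The function name measures_for and the use of strings in place of ir_measures.Measure objects are assumptions for illustration only.

```python
from typing import Sequence, Union


def measures_for(
    measures: Union[dict[str, list[str]], list[str], None],
    dataset: str,
    default: Sequence[str] = ("nDCG@10",),
) -> list[str]:
    """Hypothetical lookup mirroring the documented rule."""
    if isinstance(measures, dict):
        # Per-dataset mapping: unknown names fall back to the defaults.
        return list(measures.get(dataset, default))
    # A single suite-wide list applies to every dataset.
    return list(measures) if measures else list(default)
```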

property datasets: Generator[Tuple[str, Dataset], None, None]

Iterate over declared datasets yielding display name and PyTerrier dataset.

Yields:

tuple[str, pyterrier.datasets.Dataset] – Pairs of (name, dataset object).

Raises:

ValueError – If _datasets has an invalid type.
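
The two accepted shapes of _datasets, and the ValueError for anything else, can be sketched as a normalization step that produces a display name → dataset ID mapping. The helper name normalize_datasets is hypothetical, and using the dataset ID itself as the display name for list input is an assumption; the registration helpers may derive names differently.

```python
from typing import Union


def normalize_datasets(datasets: Union[dict[str, str], list[str]]) -> dict[str, str]:
    """Hypothetical helper: return a display name -> IRDS dataset ID mapping."""
    if isinstance(datasets, dict):
        return dict(datasets)  # already name -> ID
    if isinstance(datasets, list):
        # Assumption: with no display names given, reuse the ID as the name.
        return {dataset_id: dataset_id for dataset_id in datasets}
    raise ValueError(
        f"_datasets must be a dict or a list, got {type(datasets).__name__}"
    )
```

For example, normalize_datasets(["beir/nfcorpus/test"]) maps the ID to itself, while a dict such as {"NFCorpus": "beir/nfcorpus/test"} is passed through as a copy.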

__call__(ranking_generators, eval_metrics=None, subset=None, compute_overall=True, **experiment_kwargs)[source]

Run the experiment(s) for each dataset in the suite and return a results table.

If a baseline is provided in experiment_kwargs, all pipelines are materialized together (grouped mode) to enable tests that require joint access (e.g., significance). Otherwise, pipelines are streamed one-by-one to reduce memory usage (sequential mode).

Parameters:
  • ranking_generators (Callable[[DatasetContext], Any] | Sequence[Callable[[DatasetContext], Any]]) – Callable or sequence of callables producing pipelines per DatasetContext (same conventions as in coerce_pipelines_sequential()).

  • eval_metrics (Sequence[Any] | None) – Optional explicit metrics to evaluate; defaults to the suite’s configuration for each dataset.

  • subset (str | None) – Optional dataset display name to restrict evaluation to a single member.

  • compute_overall (bool) – Controls whether the Overall rows described under Returns are appended. Defaults to True.

  • **experiment_kwargs (dict[str, Any]) – Additional keyword arguments forwarded to pyterrier.Experiment(). If save_dir is provided, it is suffixed per dataset. If index_dir is provided, it is suffixed per corpus for index storage.

Returns:

The concatenated experiment results. When perquery is not set, an additional Overall row is appended per system with geometric-mean aggregation across datasets.

Return type:

pandas.DataFrame

Notes

This method reuses a single index per corpus group and cleans up GPU memory between pipeline evaluations.

suiteeval.context

class suiteeval.context.DatasetContext(dataset, path=None)[source]

Bases: object

Holds both a PyTerrier Dataset and a filesystem path (for indexes, caches, etc.).

Parameters:
  • dataset (Dataset) – The pyterrier Dataset instance (must have _irds_id).

  • path (str | None) – Optional filesystem path to use; if omitted, a temporary directory is created.

text_loader(fields='*')[source]

Returns an IRDSTextLoader instance for retrieving document texts.

Parameters:

fields (List[str] | str | Literal['*']) – Fields to load; can be a list of field names, a single field name, or '*' for all fields.

Returns:

An IRDSTextLoader instance.

get_corpus_iter(**iter_kwargs)[source]

Returns an iterator over the corpus documents.

Parameters:

**iter_kwargs – Keyword arguments passed to the underlying dataset's get_corpus_iter().

suiteeval.utility

suiteeval.utility.geometric_mean(sequence)[source]

Compute the geometric mean of a sequence of numbers.

Parameters:

sequence (Sequence[float]) – A sequence of numbers.

Returns:

The geometric mean of the sequence.

Return type:

float
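
The source does not show how geometric_mean() is implemented, so the following is only a minimal sketch of the standard definition, the n-th root of the product of n values, computed through logarithms for numerical stability. The function name geometric_mean_sketch and the handling of empty, zero, and negative input are assumptions.

```python
import math
from typing import Sequence


def geometric_mean_sketch(values: Sequence[float]) -> float:
    """Illustrative geometric mean: exp(mean(log(x)))."""
    if len(values) == 0:
        raise ValueError("geometric mean of an empty sequence is undefined")
    if any(v < 0 for v in values):
        raise ValueError("geometric mean requires non-negative values")
    if any(v == 0 for v in values):
        return 0.0  # a single zero makes the whole product zero
    # Summing logs avoids overflow/underflow from multiplying many values.
    return math.exp(math.fsum(math.log(v) for v in values) / len(values))
```

For the inputs [1, 4] the result is approximately 2.0, compared with an arithmetic mean of 2.5.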

suiteeval.utility.harmonic_mean(sequence)[source]

Compute the harmonic mean of a sequence of numbers.

Parameters:

sequence (Sequence[float]) – A sequence of numbers.

Returns:

The harmonic mean of the sequence.

Return type:

float
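
As with the geometric mean, the implementation of harmonic_mean() is not shown in the source. The sketch below follows the standard definition, n divided by the sum of reciprocals. The function name harmonic_mean_sketch and the treatment of empty, zero, and negative input are assumptions.

```python
import math
from typing import Sequence


def harmonic_mean_sketch(values: Sequence[float]) -> float:
    """Illustrative harmonic mean: n / sum(1 / x)."""
    if len(values) == 0:
        raise ValueError("harmonic mean of an empty sequence is undefined")
    if any(v < 0 for v in values):
        raise ValueError("harmonic mean requires non-negative values")
    if any(v == 0 for v in values):
        return 0.0  # convention: a zero entry drives the mean to zero
    return len(values) / math.fsum(1.0 / v for v in values)
```

For the inputs [1, 4] the result is approximately 1.6, below both the geometric mean (2.0) and the arithmetic mean (2.5) of the same values.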