Tuning Transformer Pipelines

Many approaches have parameters that require tuning. PyTerrier helps to achieve this by providing grid evaluation functionality that can tune one or more parameters using a particular evaluation measure. There are three functions that help to achieve this:

  • pt.GridScan() exhaustively evaluates all possible parameter settings and computes evaluation measures.

  • pt.GridSearch() applies GridScan, and determines the most effective parameter setting for a given evaluation measure.

  • pt.KFoldGridSearch() applies GridSearch on different folds, in order to determine the most effective parameter setting for a given evaluation measure on the training topics for each fold. The results on the test topics are returned.

All of these functions are designed to have an API very similar to pt.Experiment().

Pre-requisites

GridScan makes several assumptions:
  • the parameters that you wish to tune are available as instance attributes within the transformers, or that the transformer responds suitably to set_parameter().

  • changing the relevant parameters has an impact upon subsequent calls to transform().

Note that transformers implemented using pt.apply functions cannot satisfy the first requirement, as any parameters are captured naturally within the closure, and not as instance attributes of the transformer.
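
To illustrate, here is a minimal sketch contrasting the two cases; the MyReranker class and its boost attribute are hypothetical names used only for illustration:

import pandas as pd
import pyterrier as pt

class MyReranker(pt.Transformer):
    def __init__(self, boost=1.0):
        self.boost = boost   # instance attribute, so GridScan can vary it

    def transform(self, res: pd.DataFrame) -> pd.DataFrame:
        res = res.copy()
        res["score"] = res["score"] * self.boost   # the attribute affects transform()
        return res

# tunable: e.g. pt.GridScan(reranker, {reranker : {'boost' : [0.5, 1.0, 2.0]}}, topics, qrels)
reranker = MyReranker()

# NOT tunable: boost is trapped in the lambda's closure, invisible to GridScan
boost = 2.0
apply_reranker = pt.apply.doc_score(lambda row: row["score"] * boost)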

Parameter Scanning and Searching API

pyterrier.GridScan(pipeline, params, topics, qrels, metrics=['map'], jobs=1, backend='joblib', verbose=False, batch_size=None, dataframe=True)[source]

GridScan applies a set of named parameters on a given pipeline and evaluates the outcome. The topics and qrels must be specified. The trec_eval measure names can be optionally specified. The transformers being tuned, and their respective parameters, are named in the params dictionary. Each parameter being varied must be changeable using the set_parameter() method; this includes instance variables, as well as controls in the case of BatchRetrieve.

Return type:

Union[DataFrame, List[Tuple[List[Tuple[Transformer, str, Union[str, float, int]]], Dict[str, float]]]]

Parameters:
  • pipeline (Transformer) – a transformer or pipeline

  • params (dict) – a two-level dictionary, mapping transformer to param name to a list of values

  • topics (DataFrame) – topics to tune upon

  • qrels (DataFrame) – qrels to tune upon

  • metrics (List[str]) – name of the metrics to report for each setting. Defaults to [“map”].

  • batch_size (int) – If not None, evaluation is conducted in batches of batch_size topics. Applying a batch_size is useful if you have large numbers of topics, and/or if your pipeline requires large amounts of temporary memory during a run. Defaults to None, which evaluates all topics at once.

  • jobs (int) – Number of parallel jobs to run. Default is 1, which means sequentially.

  • backend (str) – Parallelisation backend to use. Defaults to “joblib”.

  • verbose (bool) – whether to display progress bars or not

  • dataframe (bool) – whether to return the results as a dataframe (True) or a list (False). Defaults to True.

Returns:

A dataframe showing the effectiveness of all evaluated settings, if dataframe=True; a list of settings and resulting evaluation measures, if dataframe=False.

Raises:

ValueError – if a specified transformer does not have such a parameter

Example:

# graph how PL2's c parameter affects MAP
pl2 = pt.BatchRetrieve(index, wmodel="PL2", controls={'c' : 1})
rtr = pt.GridScan(
    pl2,
    {pl2 : {'c' : [0.1, 1, 5, 10, 20, 100]}},
    topics,
    qrels,
    ["map"]
)
import matplotlib.pyplot as plt
plt.plot(rtr["tran_0_c"], rtr["map"])
plt.xlabel("PL2's c value")
plt.ylabel("MAP")
plt.show()

pyterrier.GridSearch(pipeline, params, topics, qrels, metric='map', jobs=1, backend='joblib', verbose=False, batch_size=None, return_type='opt_pipeline')[source]

GridSearch is essentially an argmax GridScan(), i.e. it returns an instance of the pipeline with the best parameter setting among params, as determined using the specified topics, qrels and evaluation metric.

Return type:

Union[Transformer, Tuple[float, List[Tuple[Transformer, str, Union[str, float, int]]]]]

Parameters:
  • pipeline (Transformer) – a transformer or pipeline to tune

  • params (dict) – a two-level dictionary, mapping transformer to param name to a list of values

  • topics (DataFrame) – topics to tune upon

  • qrels (DataFrame) – qrels to tune upon

  • metric (str) – name of the metric on which to determine the most effective setting. Defaults to “map”.

  • batch_size (int) – If not None, evaluation is conducted in batches of batch_size topics. Applying a batch_size is useful if you have large numbers of topics, and/or if your pipeline requires large amounts of temporary memory during a run. Defaults to None, which evaluates all topics at once.

  • jobs (int) – Number of parallel jobs to run. Default is 1, which means sequentially.

  • backend (str) – Parallelisation backend to use. Defaults to “joblib”.

  • verbose (bool) – whether to display progress bars or not

  • return_type (str) – whether to return the transformer with the optimal parameter setting applied ('opt_pipeline', the default), or the best metric value together with the corresponding transformers and settings ('best_setting').
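
For instance, a hedged sketch of requesting the best setting rather than a tuned pipeline, reusing the BM25 transformer constructed in the examples below:

# returns the best metric value and the winning (transformer, parameter, value)
# settings, rather than a ready-to-use pipeline
best_score, best_setting = pt.GridSearch(
    BM25,
    {BM25 : {"bm25.b" : [0.25, 0.5, 0.75]}},
    train_topics,
    train_qrels,
    "map",
    return_type="best_setting")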

Examples

Tuning BM25

When using BatchRetrieve, the b parameter of the BM25 weighting model can be controlled using the “bm25.b” control. We must give this control an initial value when constructing the BatchRetrieve instance. Thereafter, the GridSearch parameter dictionary can be constructed by referring to the instance of the transformer that has that parameter:

BM25 = pt.BatchRetrieve(index, wmodel="BM25", controls={"bm25.b" : 0.75})
pt.GridSearch(
    BM25,
    {BM25 : {"c" : [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 ]}}
    train_topics,
    train_qrels,
    "map")

Terrier’s BM25 also responds to controls named “bm25.k_1” and “bm25.k_3”, such that all three controls can be tuned concurrently:

BM25 = pt.BatchRetrieve(index, wmodel="BM25", controls={"bm25.b" : 0.75, "bm25.k_1": 0.75, "bm25.k_3": 0.75})
pt.GridSearch(
    BM25,
    {BM25: {"bm25.b"  : [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 ],
            "bm25.k_1": [0.3, 0.6, 0.9, 1.2, 1.4, 1.6, 2],
            "bm25.k_3": [0.5, 2, 4, 6, 8, 10, 12, 14, 20]
    }},
    train_topics,
    train_qrels,
    "map")

Tuning BM25 and RM3

The query expansion transformers in pt.rewrite have parameters controlling the number of feedback documents and expansion terms, namely:

  • fb_terms – the number of terms to add to the query.

  • fb_docs – the size of the pseudo-relevant set.

A full tuning of BM25 and RM3 can be achieved thus:

bm25_for_qe = pt.BatchRetrieve(index, wmodel="BM25", controls={"bm25.b" : 0.75})
rm3 = pt.rewrite.RM3(index, fb_terms=10, fb_docs=3)
pipe_qe = bm25_for_qe >> rm3 >> bm25_for_qe

param_map = {
        bm25_for_qe : { "bm25.b" : [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 ]},
        rm3 : {
            "fb_terms" : list(range(1, 12, 3)), # makes a list of 1,3,6,7,12
            "fb_docs" : list(range(2, 30, 6))   # etc.
        }
}
pipe_qe = pt.GridSearch(pipe_qe, param_map, train_topics, train_qrels)
pt.Experiment([pipe_qe], test_topics, test_qrels, ["map"])

Tuning BM25F

BM25F and PL2F are field-based weighting models which apply per-field normalisation. These have at least two parameters for each field: one controlling the term frequency vs. length normalisation of that field, and one controlling the importance of the per-field normalised term frequency. The general form of BM25F and PL2F is as follows:

\[score(d,Q) = \text{weight}(tfn)\]

where \(tfn\) is defined as the weighted sum of the normalised term frequencies across the fields:

\[tfn = \sum_f w_f \cdot \text{norm}(tf_f, l_f, c_f)\]

In the above, \(tf_f\) and \(l_f\) are respectively the term frequency in field \(f\) and the length of that field. \(w_f\) and \(c_f\) are respectively the field weights and normalisation parameter for that field.
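
For intuition, a common instantiation of \(\text{norm}\) is BM25-style length normalisation (a sketch, assuming Terrier's per-field “normalisation B”, where \(\text{avg}\,l_f\) denotes the average length of field \(f\)):

\[\text{norm}(tf_f, l_f, c_f) = \frac{tf_f}{(1 - c_f) + c_f \cdot \frac{l_f}{\text{avg}\,l_f}}\]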

In Terrier, for both the BM25F and PL2F weighting models, the relevant configuration controls are a 'c.' control for each field (controlling normalisation) and a 'w.' control for each field (controlling the weight). Fields are numbered, starting from 0.

The following is an example of scanning the parameters of BM25F for an index with two fields:

# check your index has exactly 2 fields
assert 2 == index.getCollectionStatistics().getNumberOfFields()

# instantiate BatchRetrieve for BM25F
bm25f = pt.BatchRetrieve(
    index,
    wmodel = 'BM25F',
    controls = {'w.0' : 1, 'w.1' : 1, 'c.0' : 0.4, 'c.1' : 0.4}
)

# now attempt all parameter values
import numpy as np
pt.GridScan(
    bm25f,
    # you can name more parameters here and their values to try
    {bm25f : {
        'w.0' : np.arange(0, 1.1, 0.1), # np.arange gives values evenly spaced over an interval
        'w.1' : np.arange(0, 1.1, 0.1),
        'c.0' : np.arange(0, 1.1, 0.1),
        'c.1' : np.arange(0, 1.1, 0.1),
    }},
    topics,
    qrels,
    ['map']
)
# GridScan returns a table of MAP values for all attempted parameter settings
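
The returned dataframe can then be inspected with ordinary pandas operations to locate the best setting. A sketch, assuming the result is assigned to a variable and contains a 'map' column, as in the PL2 example earlier:

# assign the GridScan output, then pick the row with the highest MAP
res = pt.GridScan(bm25f, {bm25f : {'w.0' : np.arange(0, 1.1, 0.1)}}, topics, qrels, ['map'])
print(res.loc[res['map'].idxmax()])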

Using Multiple Folds

pyterrier.KFoldGridSearch(pipeline, params, topics_list, qrels, metric='map', jobs=1, backend='joblib', verbose=False, batch_size=None)[source]

Applies a GridSearch using different folds. It returns the results of the tuned transformer pipeline on the test topics. The number of topics dataframes passed to topics_list defines the number of folds. For each fold, all but one of the dataframes are used for training, and the remaining one for testing.

The state of the transformers in the pipeline is restored after the KFoldGridSearch has been executed.

Return type:

Tuple[DataFrame, Tuple[float, List[Tuple[Transformer, str, Union[str, float, int]]]]]

Parameters:
  • pipeline (Transformer) – a transformer or pipeline to tune

  • params (dict) – a two-level dictionary, mapping transformer to param name to a list of values

  • topics_list (List[DataFrame]) – a list of topics dataframes to tune upon

  • qrels (DataFrame or List[DataFrame]) – qrels to tune upon. A single dataframe, or a list for each fold.

  • metric (str) – name of the metric on which to determine the most effective setting. Defaults to “map”.

  • batch_size (int) – If not None, evaluation is conducted in batches of batch_size topics. Applying a batch_size is useful if you have large numbers of topics, and/or if your pipeline requires large amounts of temporary memory during a run. Defaults to None, which evaluates all topics at once.

  • jobs (int) – Number of parallel jobs to run. Default is 1, which means sequentially.

  • backend (str) – Parallelisation backend to use. Defaults to “joblib”.

  • verbose (bool) – whether to display progress bars or not

Returns:

A tuple containing, firstly, the results of the pipeline on the test topics after tuning, and secondly, a list of the best parameter settings for each fold.

Consider tuning PL2 where folds of queries are pre-determined:

pl2 = pt.BatchRetrieve(index, wmodel="PL2", controls={'c' : 1})
tuned_pl2, _ = pt.KFoldGridSearch(
    pl2,
    {pl2 : {'c' : [0.1, 1, 5, 10, 20, 100]}},
    [topicsf1, topicsf2],
    qrels,
    ["map"]
)
pt.Experiment([pl2, tuned_pl2], all_topics, qrels, ["map"])

As two folds are defined, PL2 is first tuned on topicsf1 and tested on topicsf2, then tuned on topicsf2 and tested on topicsf1. The results dataframe of PL2 after tuning of the c parameter is returned by the KFoldGridSearch, and can be used directly in a pt.Experiment().

Parallelisation

GridScan, GridSearch and KFoldGridSearch can all be accelerated using parallelisation to conduct evaluations of different parameter settings in parallel. All three accept jobs and backend kwargs, which define the number of parallel processes to use and the parallelisation backend. For instance:

pt.GridSearch(pipe_qe, param_map, train_topics, train_qrels, jobs=10)

This incantation will fork 10 Python processes to run the different settings in parallel. Each process will load a new instance of any large data structures, such as Terrier indices, so your machine must have sufficient memory to load 10 instances of the index.

The Ray backend offers parallelisation across multiple machines. For more information, see Parallelisation.
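
For example, a sketch of switching to the Ray backend (assuming the ray package is installed and, for multi-machine use, a Ray cluster is reachable):

import ray
ray.init()   # or ray.init(address="auto") to join an existing cluster
pt.GridSearch(pipe_qe, param_map, train_topics, train_qrels, jobs=10, backend='ray')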