DataFrame Builder

DataFrameBuilder provides a simple way to progressively build a DataFrame in a Transformer.

Usage

A common pattern in Transformer implementations is to build up an intermediate representation of the output DataFrame by hand. This works, but it tends to be clunky, as shown below:

Building a DataFrame without DataFrameBuilder

import numpy as np
import pandas as pd
import pyterrier as pt

class MyTransformer(pt.Transformer):
    def transform(self, inp: pd.DataFrame):
        result = {
            'qid': [],
            'query': [],
            'docno': [],
            'score': [],
        }
        for qid, query in zip(inp['qid'], inp['query']):
            docnos, scores = self.some_function(qid, query)
            result['qid'].append([qid] * len(docnos))
            result['query'].append([query] * len(docnos))
            result['docno'].append(docnos)
            result['score'].append(scores)
        result = pd.DataFrame({
            'qid': np.concatenate(result['qid']),
            'query': np.concatenate(result['query']),
            'docno': np.concatenate(result['docno']),
            'score': np.concatenate(result['score']),
        })
        return result

DataFrameBuilder simplifies this process by removing much of the boilerplate. It accepts both scalars and lists as values (a scalar is repeated to the length of the batch) and ensures that all columns end up with the same length. The above example can be rewritten with pta.DataFrameBuilder as follows:

Building a DataFrame using DataFrameBuilder

import pandas as pd
import pyterrier as pt
import pyterrier_alpha as pta

class MyTransformer(pt.Transformer):
    def transform(self, inp: pd.DataFrame):
        result = pta.DataFrameBuilder(['qid', 'query', 'docno', 'score'])
        for qid, query in zip(inp['qid'], inp['query']):
            docnos, scores = self.some_function(qid, query)
            result.extend({
                'qid': qid, # automatically repeats to the length of this batch
                'query': query, # ditto
                'docno': docnos,
                'score': scores,
            })
        return result.to_df()
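
The broadcasting rule used by extend can be illustrated with a small pure-Python model. This is only a sketch of the behaviour described above, not the code pyterrier_alpha actually runs: broadcast_batch is a hypothetical helper, and it treats only lists and tuples as per-row values, whereas the real builder also accepts other sequence types.

```python
def broadcast_batch(values):
    """Expand scalar values to the batch length (hypothetical helper, not library code)."""
    # In this sketch only lists and tuples count as per-row values;
    # strings, numbers, and everything else are treated as scalars.
    lengths = {len(v) for v in values.values() if isinstance(v, (list, tuple))}
    if len(lengths) > 1:
        raise ValueError(f"mismatched column lengths: {sorted(lengths)}")
    n = lengths.pop() if lengths else 1  # all scalars: a single row
    return {
        key: list(v) if isinstance(v, (list, tuple)) else [v] * n
        for key, v in values.items()
    }


batch = broadcast_batch({
    'qid': 'q1',                       # scalar, repeated for every row
    'query': 'chemical reactions',     # scalar, repeated for every row
    'docno': ['d1', 'd2', 'd3'],
    'score': [3.2, 2.9, 1.4],
})
print(batch['qid'])
```

Every column in batch has three entries, so the batch can be appended to the accumulated columns without any manual repetition of qid or query.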

You’ll often want to extend the set of columns passed to a transformer, rather than replacing them. For instance, in the previous example, perhaps inp includes a my_special_data field added by another transformer that should be passed along to the following step. If you pass the original input frame to to_df, the function will try to merge the newly built rows with the corresponding rows of the original frame. The columns from the original frame will appear before any new columns.

Merging the input frame’s data with DataFrameBuilder

import pandas as pd
import pyterrier as pt
import pyterrier_alpha as pta

class MyTransformer(pt.Transformer):
    def transform(self, inp: pd.DataFrame):
        result = pta.DataFrameBuilder(['docno', 'score'])
        for qid, query in zip(inp['qid'], inp['query']):
            docnos, scores = self.some_function(qid, query)
            result.extend({
                'docno': docnos,
                'score': scores,
            })
        return result.to_df(inp)

Note

The merging functionality assumes that extend is called once per row in the original frame, in the same order as the original frame.

If this is not the case, you can manually provide an _index field each time you call extend, where _index is the integer index of the corresponding row in the original frame.
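
To show what the _index field is for, the following pure-Python model joins each batch onto the input row it names. It is a sketch under stated assumptions rather than the library's implementation: merge_batches is a hypothetical helper, the frames are modelled as lists of dicts instead of pandas objects, and the row order of the real to_df output is not specified here.

```python
def merge_batches(input_rows, batches):
    """Join each batch onto the input row named by its '_index' (hypothetical helper)."""
    merged = []
    for batch in batches:
        source = input_rows[batch['_index']]
        new_cols = {k: v for k, v in batch.items() if k != '_index'}
        n = len(next(iter(new_cols.values())))  # number of rows in this batch
        for j in range(n):
            row = dict(source)  # columns of the original frame come first
            row.update({k: v[j] for k, v in new_cols.items()})
            merged.append(row)
    return merged


inp = [
    {'qid': 'q1', 'query': 'apple', 'my_special_data': 'A'},
    {'qid': 'q2', 'query': 'banana', 'my_special_data': 'B'},
]
# Batches arrive out of order, so each one names its input row explicitly.
batches = [
    {'_index': 1, 'docno': ['d7', 'd8'], 'score': [2.0, 1.5]},
    {'_index': 0, 'docno': ['d3'], 'score': [4.1]},
]
rows = merge_batches(inp, batches)
for row in rows:
    print(row)
```

Without the explicit _index values, a positional pairing would attach the first batch to q1 and the second to q2, which is wrong for this input.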

API Documentation

class pyterrier_alpha.DataFrameBuilder(columns)

Utility to build a DataFrame from a sequence of dictionaries.

Added in version 0.1.0.

The dictionaries must have the same keys, and the values must be either scalars, or lists of the same length.

Create a DataFrameBuilder with the given columns.

Parameters: columns (List[str]) – the columns of the resulting DataFrame, required to be present in each call to extend().

extend(values)

Add a dictionary of values to the DataFrameBuilder.

Changed in version 0.4.1: Allow all fields to be scalars (assumes length of 1).

Changed in version 0.7.0: Automatically infer the _index field.

Changed in version 0.9.2: Allow broadcasting of input lists with the length of 1. This allows support for inputs like arrays, which are not meant to be treated as lists themselves.

Parameters: values (Dict[str, Any]) – a dictionary of values to add to the DataFrameBuilder. The keys must be the same as the columns provided to the constructor, and the values must be either scalars or lists (all of the same length).
Return type: None

to_df(merge_on_index=None)

Convert the DataFrameBuilder to a DataFrame.

Changed in version 0.1.1: Added merge_on_index argument.

Changed in version 0.1.1: Columns from merge_on_index come first.

Changed in version 0.9.3: Fixed a bug with columns whose values are numpy arrays.

Parameters: merge_on_index (DataFrame | None) – an optional DataFrame (typically the transformer's input frame) to merge with the built rows by row index; its columns come first in the result.
Returns: A DataFrame with the values added to the DataFrameBuilder.
Return type: DataFrame