Dataframe Builder
DataFrameBuilder provides a simple way to progressively build a DataFrame in a Transformer.
Usage
A common pattern in Transformer implementations is to build up an intermediate representation of the output DataFrame by hand. This can be clunky, as shown below:
import numpy as np
import pandas as pd
import pyterrier as pt

class MyTransformer(pt.Transformer):
    def transform(self, inp: pd.DataFrame):
        # One list of per-query chunks for each output column
        result = {
            'qid': [],
            'query': [],
            'docno': [],
            'score': [],
        }
        for qid, query in zip(inp['qid'], inp['query']):
            docnos, scores = self.some_function(qid, query)
            # Repeat the query fields by hand so every column has the same length
            result['qid'].append([qid] * len(docnos))
            result['query'].append([query] * len(docnos))
            result['docno'].append(docnos)
            result['score'].append(scores)
        # Flatten the per-query chunks into single columns
        result = pd.DataFrame({
            'qid': np.concatenate(result['qid']),
            'query': np.concatenate(result['query']),
            'docno': np.concatenate(result['docno']),
            'score': np.concatenate(result['score']),
        })
        return result
DataFrameBuilder simplifies the process of building a DataFrame by removing much of the boilerplate. It also automatically handles various types and ensures that all columns end up with the same length. The above example can be rewritten with pta.DataFrameBuilder as follows:
import pandas as pd
import pyterrier as pt
import pyterrier_alpha as pta

class MyTransformer(pt.Transformer):
    def transform(self, inp: pd.DataFrame):
        result = pta.DataFrameBuilder(['qid', 'query', 'docno', 'score'])
        for qid, query in zip(inp['qid'], inp['query']):
            docnos, scores = self.some_function(qid, query)
            result.extend({
                'qid': qid, # automatically repeats to the length of this batch
                'query': query, # ditto
                'docno': docnos,
                'score': scores,
            })
        return result.to_df()
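The repetition of scalar values noted in the comments above can be pictured with plain pandas. The following is a minimal sketch of the idea, not pyterrier_alpha's implementation; `broadcast_batch` is a hypothetical helper introduced only for illustration, and it treats only `list` and `tuple` values as columns.

```python
import pandas as pd


def broadcast_batch(values: dict) -> pd.DataFrame:
    """Hypothetical helper: repeat scalar values to match the list-valued columns."""
    # Collect the distinct lengths of the list-valued entries in this batch.
    lengths = {len(v) for v in values.values() if isinstance(v, (list, tuple))}
    if len(lengths) > 1:
        raise ValueError(f"list-valued columns disagree on length: {sorted(lengths)}")
    # If every value is a scalar, treat the batch as a single row.
    n = lengths.pop() if lengths else 1
    return pd.DataFrame({
        key: list(v) if isinstance(v, (list, tuple)) else [v] * n
        for key, v in values.items()
    })


batch = broadcast_batch({
    "qid": "q1",                     # scalar: repeated for each row
    "query": "chemical reactions",   # scalar: repeated for each row
    "docno": ["d1", "d2", "d3"],
    "score": [3.2, 2.1, 0.5],
})
print(batch)
```

Here the two scalar fields are expanded to three rows so that they line up with `docno` and `score`; mismatched list lengths raise an error instead of producing a ragged frame.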
You'll often want to extend the set of columns passed to a transformer, rather than replacing them. For instance, in the previous example, perhaps inp includes a my_special_data field, added by another transformer, that should be passed along to the following step. If you pass the original input frame to to_df, the function will try to merge it with the built frame. The columns from the original frame will appear before any new columns.
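import pandas as pd
import pyterrier as pt
import pyterrier_alpha as pta

class MyTransformer(pt.Transformer):
    def transform(self, inp: pd.DataFrame):
        # Only the new columns are listed; the rest come from inp
        result = pta.DataFrameBuilder(['docno', 'score'])
        for qid, query in zip(inp['qid'], inp['query']):
            docnos, scores = self.some_function(qid, query)
            result.extend({
                'docno': docnos,
                'score': scores,
            })
        # Merge the new columns with the columns of the original frame
        return result.to_df(inp)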
Note
The merging functionality assumes that extend is called once per row in the original frame, in the same order as the original frame. If this is not the case, you can manually provide an _index field each time you call extend, where _index is the integer index of the row in the original frame.
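To make the role of `_index` concrete, the sketch below reproduces the idea with plain pandas. It is an illustration under the assumption stated in the note (that `_index` is the positional row number in the original frame), not the library's merge code; the frames and their values are made up for the example.

```python
import pandas as pd

# Original input frame, carrying an extra column from an earlier step.
inp = pd.DataFrame({
    "qid": ["q1", "q2"],
    "query": ["first query", "second query"],
    "my_special_data": ["a", "b"],
})

# Rows produced out of order: each one records the integer position of the
# input row it belongs to in an `_index` column.
built = pd.DataFrame({
    "_index": [1, 1, 0],
    "docno": ["d7", "d9", "d3"],
    "score": [1.5, 0.4, 2.0],
})

# Look up the matching input row for every built row by position, then place
# the input columns before the new columns.
merged = pd.concat(
    [
        inp.iloc[built["_index"].to_numpy()].reset_index(drop=True),
        built.drop(columns="_index"),
    ],
    axis=1,
)
print(merged)
```

The first two result rows pick up the fields of the second input row and the third picks up the first, which is the mapping that the default once-per-row, in-order assumption would otherwise get wrong.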
API Documentation
- class pyterrier_alpha.DataFrameBuilder(columns)
Utility to build a DataFrame from a sequence of dictionaries.
Added in version 0.1.0.
The dictionaries must have the same keys, and the values must be either scalars, or lists of the same length.
Create a DataFrameBuilder with the given columns.
- Parameters:
columns – the columns of the resulting DataFrame, required to be present in each call to extend().
- extend(values)
Add a dictionary of values to the DataFrameBuilder.
Changed in version 0.4.1: Allow all fields to be scalars (assumes length of 1).
Changed in version 0.7.0: Automatically infer the _index field.
Changed in version 0.9.2: Allow broadcasting of input lists with the length of 1. This allows support for inputs like arrays, which are not meant to be treated as lists themselves.
- Parameters:
values – a dictionary of values to add to the DataFrameBuilder. The keys must be the same as the columns provided to the constructor, and the values must be either scalars, or lists (all of the same length).
- Return type:
None
- to_df(merge_on_index=None)
Convert the DataFrameBuilder to a DataFrame.
Changed in version 0.1.1: Added merge_on_index argument.
Changed in version 0.1.1: Columns from merge_on_index come first.
Changed in version 0.9.3: Fixed bug with columns that have values of numpy arrays.
- Parameters:
merge_on_index – an optional DataFrame to merge the resulting DataFrame on.
- Returns:
A DataFrame with the values added to the DataFrameBuilder.
- Return type:
DataFrame
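One reading of the 0.9.2 note on extend() is that wrapping a value such as a NumPy array in a length-1 list marks it as a single cell to repeat, rather than as a column of per-row values. The sketch below illustrates that reading in plain Python; `expand_value` is a hypothetical helper, and the actual rules inside pyterrier_alpha may differ.

```python
import numpy as np


def expand_value(value, n: int) -> list:
    """Hypothetical helper: expand one column's value to n rows."""
    if isinstance(value, list):
        if len(value) == n:
            return value          # already one entry per row
        if len(value) == 1:
            return value * n      # length-1 list: repeat its single entry
        raise ValueError(f"expected a list of length 1 or {n}, got {len(value)}")
    return [value] * n            # scalar: repeat


# A query vector that should appear once in every result row.
query_vec = np.array([0.1, 0.2, 0.3, 0.4])
column = expand_value([query_vec], 3)
print(len(column), column[0].shape)
```

Each of the three cells holds the whole four-element vector, whereas treating the bare array as a list would have suggested a column of four separate floats.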