pt.new - Creating new dataframes

This module provides useful utility methods for creating example dataframes for queries and ranked documents.

pyterrier.new.empty_Q()

Returns an empty dataframe with columns [“qid”, “query”].

Return type:

DataFrame

pyterrier.new.queries(queries, qid=None, **others)

Creates a new queries dataframe. Will return a dataframe with the columns [“qid”, “query”]. Any further lists in others will also be added.

Return type:

DataFrame

Parameters:
  • queries (str | Sequence[str]) – The search queries. Either a single string, for one query, or a sequence of strings (e.g. a list).

  • qid (str | Iterable[str] | None) – Corresponding query ids. Either a single string, for one query, or a sequence of strings (e.g. a list). Must have the same length as queries.

  • others – Additional attributes to add to the queries dataframe, passed as keyword arguments whose values are lists (e.g. categories=["catA", "catB"]).

Examples:

# create a dataframe with one query, qid "1"
one_query = pt.new.queries("what the noise was was the question")

# create a dataframe with one query, qid "5"
one_query = pt.new.queries("what the noise was was the question", "5")

# create a dataframe with two queries
two_queries = pt.new.queries(["query text A", "query text B"], ["1", "2"])

# create a dataframe with two queries and an additional "categories" column
two_queries = pt.new.queries(["query text A", "query text B"], ["1", "2"], categories=["catA", "catB"])
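
The argument handling described above can be illustrated without PyTerrier installed. The following is a minimal sketch, not the library's implementation: sketch_queries is a hypothetical stand-in that returns a list of row dictionaries instead of a DataFrame, and it assumes that omitted qids default to "1", "2", and so on in input order.

```python
from typing import Dict, List, Sequence, Union


def sketch_queries(
    queries: Union[str, Sequence[str]],
    qid: Union[str, Sequence[str], None] = None,
    **others: Sequence,
) -> List[Dict[str, object]]:
    """Hypothetical stand-in for pt.new.queries; returns one dict per query."""
    # A bare string is a single query, not a sequence of characters.
    if isinstance(queries, str):
        queries = [queries]
    if qid is None:
        # Assumption: default qids are "1", "2", ... in input order.
        qid = [str(i + 1) for i in range(len(queries))]
    elif isinstance(qid, str):
        qid = [qid]
    if len(qid) != len(queries):
        raise ValueError("qid must have the same length as queries")
    rows = []
    for i, (q, text) in enumerate(zip(qid, queries)):
        row = {"qid": q, "query": text}
        # Each extra keyword argument contributes one value per query.
        for name, column in others.items():
            row[name] = column[i]
        rows.append(row)
    return rows
```

Calling sketch_queries(["query text A", "query text B"], ["1", "2"], categories=["catA", "catB"]) yields two rows, each with qid, query, and categories keys.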

pyterrier.new.empty_R()

Returns an empty dataframe with columns [“qid”, “query”, “docno”, “rank”, “score”].

Return type:

DataFrame

pyterrier.new.ranked_documents(scores, qid=None, docno=None, **others)

Creates a new ranked documents dataframe. Will return a dataframe with the columns [“qid”, “docno”, “score”, “rank”]. Any further lists in others will also be added.

Return type:

DataFrame

Parameters:
  • scores (Sequence[Sequence[float]]) – The scores of the retrieved documents. Must be a list of lists.

  • qid (Sequence[str] | None) – Corresponding query ids. Must have the same length as the first dimension of scores. If omitted, qids are computed as strings starting from “1”.

  • docno (Sequence[Sequence[str]] | None) – Corresponding docnos. Must have same length as the first dimension of scores and each 2nd dimension must be the same as the number of documents retrieved. If omitted, docnos are computed as strings starting from “d1” for each query.

  • others – Additional attributes to add to the ranked documents dataframe, passed as keyword arguments.

Examples:

# one query, one document
R1 = pt.new.ranked_documents([[1]])

# one query, two documents
R2 = pt.new.ranked_documents([[1, 2]])

# two queries, one document each
R3 = pt.new.ranked_documents([[1], [2]])

# one query, one document, qid specified
R4 = pt.new.ranked_documents([[1]], qid=["q100"])

# one query, one document, qid and docno specified
R5 = pt.new.ranked_documents([[1]], qid=["q100"], docno=[["d20"]])
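
To make the shape requirements concrete, here is a minimal pure-Python sketch of how nested scores might be flattened into one row per (query, document) pair. It is an illustration under stated assumptions, not the library's code: sketch_ranked_rows is a hypothetical name, it returns a list of dictionaries rather than a DataFrame, and it leaves out the rank column because the ranking convention is not described in this section.

```python
from typing import Dict, List, Optional, Sequence


def sketch_ranked_rows(
    scores: Sequence[Sequence[float]],
    qid: Optional[Sequence[str]] = None,
    docno: Optional[Sequence[Sequence[str]]] = None,
) -> List[Dict[str, object]]:
    """Hypothetical helper: flatten list-of-lists scores into row dicts."""
    if qid is None:
        # Default qids are strings starting from "1".
        qid = [str(i + 1) for i in range(len(scores))]
    if docno is None:
        # Default docnos restart at "d1" for each query.
        docno = [[f"d{j + 1}" for j in range(len(per_query))] for per_query in scores]
    if len(qid) != len(scores) or len(docno) != len(scores):
        raise ValueError("qid and docno must match the first dimension of scores")
    rows = []
    for q, query_docnos, query_scores in zip(qid, docno, scores):
        if len(query_docnos) != len(query_scores):
            raise ValueError("each docno list must match its list of scores")
        for d, s in zip(query_docnos, query_scores):
            rows.append({"qid": q, "docno": d, "score": s})
    return rows
```

For example, sketch_ranked_rows([[1], [2]]) produces one row for qid "1" and one for qid "2", both with docno "d1".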

DataFrameBuilder

DataFrameBuilder provides a simple way to progressively build a DataFrame in a Transformer.

A common pattern in Transformer implementations is to build up an intermediate representation of the output DataFrame by hand, but this can be clunky, as shown below:

Building a DataFrame without DataFrameBuilder
import numpy as np
import pandas as pd
import pyterrier as pt

class MyTransformer(pt.Transformer):
    def transform(self, inp: pd.DataFrame):
        # one list of per-query batches for each output column
        result = {
            'qid': [],
            'query': [],
            'docno': [],
            'score': [],
        }
        for qid, query in zip(inp['qid'], inp['query']):
            docnos, scores = self.some_function(qid, query)
            # repeat the scalar values so every column has the same length
            result['qid'].append([qid] * len(docnos))
            result['query'].append([query] * len(docnos))
            result['docno'].append(docnos)
            result['score'].append(scores)
        # flatten the per-query batches into single columns
        result = pd.DataFrame({
            'qid': np.concatenate(result['qid']),
            'query': np.concatenate(result['query']),
            'docno': np.concatenate(result['docno']),
            'score': np.concatenate(result['score']),
        })
        return result

DataFrameBuilder simplifies the process of building a DataFrame by removing much of the boilerplate. It also automatically handles scalar and list values and ensures that all columns end up with the same length. The above example can be rewritten with pt.new.DataFrameBuilder as follows:

Building a DataFrame using DataFrameBuilder
class MyTransformer(pt.Transformer):
    def transform(self, inp: pd.DataFrame):
        result = pt.new.DataFrameBuilder(['qid', 'query', 'docno', 'score'])
        for qid, query in zip(inp['qid'], inp['query']):
            docnos, scores = self.some_function(qid, query)
            result.extend({
                'qid': qid, # automatically repeats to the length of this batch
                'query': query, # ditto
                'docno': docnos,
                'score': scores,
            })
        return result.to_df()

You’ll often want to extend the set of columns passed to a transformer, rather than replacing them. For instance, in the previous example, perhaps inp includes a my_special_data field added by another transformer that should be passed along to the following step. If you pass the original input frame to to_df, it will merge the input frame's columns with the newly built ones. The columns from the input frame will appear before any new columns.

Merging the input frame’s data with DataFrameBuilder
class MyTransformer(pt.Transformer):
    def transform(self, inp: pd.DataFrame):
        result = pt.new.DataFrameBuilder(['docno', 'score'])
        for qid, query in zip(inp['qid'], inp['query']):
            docnos, scores = self.some_function(qid, query)
            result.extend({
                'docno': docnos,
                'score': scores,
            })
        return result.to_df(inp)

Note

The merging functionality assumes that extend is called once per row in the original frame, in the same order as the original frame.

If this is not the case, you can manually provide an _index field each time you call extend, where _index is the integer index of the row in the original frame.

Alternatively, if the results already include a qid column, you can use merge_on_qid instead of merge_on_index. This performs a standard join on the qid column and does not require _index to be tracked at all.

Merging by qid with DataFrameBuilder
class MyTransformer(pt.Transformer):
    def transform(self, inp: pd.DataFrame):
        result = pt.new.DataFrameBuilder(['qid', 'docno', 'score'])
        for row in inp.itertuples():
            docnos, scores = self.some_function(row.qid, row.query)
            result.extend({
                'qid': row.qid,  # used as the join key
                'docno': docnos,
                'score': scores,
            })
        return result.to_df(merge_on_qid=inp)

Note

merge_on_index and merge_on_qid are mutually exclusive; pass only one.

merge_on_index is generally faster because it uses integer positional alignment rather than a full column join, and it does not require the result rows to carry a qid value. Use merge_on_qid when result rows already carry qid (e.g. results come back in arbitrary order from a concurrent backend) or when tracking _index is inconvenient.
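
The difference between the two merge strategies can be sketched in plain Python. This is only an illustration of the semantics described above, not how DataFrameBuilder is implemented: both function names are hypothetical, rows are modelled as dictionaries, and the sketch assumes the _index bookkeeping key is dropped from the merged output.

```python
from typing import Dict, List


def merge_on_index_sketch(inp_rows: List[Dict], result_rows: List[Dict]) -> List[Dict]:
    """Positional merge: each result row names its input row by integer position."""
    merged = []
    for row in result_rows:
        new_columns = {k: v for k, v in row.items() if k != "_index"}
        # Input columns are listed first, so they come first in the output.
        merged.append({**inp_rows[row["_index"]], **new_columns})
    return merged


def merge_on_qid_sketch(inp_rows: List[Dict], result_rows: List[Dict]) -> List[Dict]:
    """Key-based merge: look up each result row's input row by its qid value."""
    by_qid = {row["qid"]: row for row in inp_rows}
    return [{**by_qid[row["qid"]], **row} for row in result_rows]
```

With the positional version, result rows need no qid but must carry a correct _index. With the qid version, result rows can arrive in any order as long as each one carries its qid.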

class pyterrier.new.DataFrameBuilder(columns)

Utility to build a DataFrame from a sequence of dictionaries.

The dictionaries must have the same keys, and the values must be either scalars, or lists of the same length.

When iterating sequentially, you can omit _index and let the builder track it automatically. Each call to extend() increments an internal counter that maps to the corresponding row in the input DataFrame passed to to_df():

Example (auto _index, simpler):

builder = pt.new.DataFrameBuilder(['docno', 'score'])
for row in queries.itertuples():
    docnos, scores = retrieve(row.query)
    builder.extend({'docno': docnos, 'score': scores})
df = builder.to_df(merge_on_index=queries)

When results may arrive out of order (e.g. with a thread pool), pass _index explicitly so that each result batch is aligned to the correct input row. This is also the pattern to use when the integer index of the input row is already available at no extra cost:

Example (explicit _index):

builder = pt.new.DataFrameBuilder(['docno', 'score'])
for row in queries.itertuples():
    docnos, scores = retrieve(row.query)
    builder.extend({'_index': row.Index, 'docno': docnos, 'score': scores})
df = builder.to_df(merge_on_index=queries)

Both patterns produce identical results when iterating sequentially; the auto-index form is slightly simpler to write, while the explicit form is required for concurrent/out-of-order processing.
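
The out-of-order case can be sketched with the standard library's thread pool. This is a hedged illustration rather than a recommended implementation: fake_retrieve and collect_batches are hypothetical names, and the batches are gathered into a plain list so the example runs without PyTerrier. In real code, each dictionary would instead be passed to builder.extend() from the main thread.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, List, Sequence, Tuple


def fake_retrieve(query: str) -> Tuple[List[str], List[float]]:
    """Hypothetical retrieval function returning docnos and scores for a query."""
    return [f"{query}-d1", f"{query}-d2"], [2.0, 1.0]


def collect_batches(queries: Sequence[str], max_workers: int = 4) -> List[Dict]:
    """Run retrieval concurrently and tag each batch with its input position."""
    batches = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which input position each future belongs to.
        positions = {pool.submit(fake_retrieve, q): i for i, q in enumerate(queries)}
        # as_completed yields futures in completion order, not submission order.
        for future in as_completed(positions):
            docnos, scores = future.result()
            batches.append({"_index": positions[future], "docno": docnos, "score": scores})
    return batches
```

Because every batch carries its own _index, the order in which the futures complete does not affect which input row each batch is aligned to.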

Create a DataFrameBuilder with the given columns.

Parameters:

columns (List[str]) – the columns of the resulting DataFrame, required to be present in each call to extend().

extend(values)

Add a dictionary of values to the DataFrameBuilder.

Return type:

None

Parameters:

values (Dict[str, Any]) –

a dictionary of values to add to the DataFrameBuilder. The keys must be the same as the columns provided to the constructor, and the values must be either scalars, or lists (all of the same length). Strings are always treated as scalars (not sequences of characters). An empty dict is a no-op.

The optional _index key identifies which row of the input DataFrame (passed later to to_df()) this batch of results belongs to. It must be an integer position. Note that row.Index from itertuples() is the index label, so it equals the position only when the input frame has a default integer index (0, 1, 2, ...). When _index is omitted, an internal counter is used and incremented automatically on every call, which is correct for sequential iteration but not for out-of-order (concurrent) processing.
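
The value-handling rules for extend() can be illustrated with a small pure-Python sketch. It is not the library's implementation: sketch_broadcast is a hypothetical helper, it only recognises lists and tuples as sequences, and it assumes that a batch made up entirely of scalars describes a single row.

```python
from typing import Any, Dict, List


def sketch_broadcast(values: Dict[str, Any]) -> Dict[str, List[Any]]:
    """Hypothetical helper: expand one extend()-style batch into equal-length columns."""
    if not values:
        # An empty dict adds nothing.
        return {}

    def is_scalar(value: Any) -> bool:
        # Strings count as scalars, never as sequences of characters.
        return isinstance(value, str) or not isinstance(value, (list, tuple))

    lengths = {len(v) for v in values.values() if not is_scalar(v)}
    if len(lengths) > 1:
        raise ValueError("all list values must have the same length")
    # Assumption: a batch with only scalar values describes one row.
    size = lengths.pop() if lengths else 1
    return {
        key: [value] * size if is_scalar(value) else list(value)
        for key, value in values.items()
    }
```

For instance, a batch with a scalar qid and two docnos expands so that the qid is repeated twice, matching the "automatically repeats" behaviour noted in the transformer example.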

to_df(merge_on_index=None, merge_on_qid=None)

Convert the DataFrameBuilder to a DataFrame.

Return type:

DataFrame

Parameters:
  • merge_on_index (DataFrame | None) – an optional DataFrame to merge with the built results by row position, using the _index value of each result row. Columns from merge_on_index come first in the result. The built results must have an _index column.

  • merge_on_qid (DataFrame | None) – an optional DataFrame to merge with the built results by joining on the qid column. The qid values in both frames are assumed to match.

Returns:

A DataFrame with the values added to the DataFrameBuilder.