LLM Backends¶

PyTerrier RAG supports a variety of LLM backends for generating responses. This functionality is facilitated by the Backend interface, which currently has three implementations: HuggingFaceBackend, VllmBackend, and OpenAIBackend. This architecture also allows different components to share the same backend, which is particularly useful for multi-stage RAG pipelines.

Basics¶

Start by creating an instance of a backend. For example, using the OpenAIBackend:

Create an instance of a OpenAIBackend¶

import pyterrier_rag as ptr
backend = ptr.OpenAIBackend('gpt-4o-mini', api_key="your_openai_api_key") # or loaded from OPENAI_API_KEY environment variable, if available

The backend can be used to generate responses to prompts. For example, using the generate method:

Generate a response using the backend¶

backend.generate(["What is the capital of France?"])
# Outputs: [BackendOutput(text='The capital of France is Paris.', logprobs=None)]

Backends also function as PyTerrier Transformers. By default, they take input from the prompt column and output to the qanswer column:

Generate a response using the backend¶

inp = pd.DataFrame([
    {'prompt': 'What is the capital of France?'},
    {'prompt': 'What is the capital of Germany?'},
])
backend(inp)
#                             prompt                            qanswer
# 0   What is the capital of France?    The capital of France is Paris.
# 1  What is the capital of Germany?  The capital of Germany is Berlin.

Usually you won’t use a backend directly though – they are instead typically used by other components, such as Prompts and Frameworks.

To set the global default backend, you can call pyterrier_rag.default_backend.set(). Note that this must be called before any components that use the default backend are used (i.e., if you are using default_backend, we recommended setting it at the top of your script/notebook).

Set the default backend¶

import pyterrier_rag as ptr
ptr.default_backend.set(ptr.OpenAIBackend('gpt-4o-mini'))
ptr.default_backend.generate(['What is the capital of France?']) # -> uses the OpenAIBackend from above

The default backend is automatically loaded via the PYTERRIER_RAG_DEFAULT_BACKEND (using pyterrier_rag.Backend.from_dsn()) if it is set when PyTerrier RAG is first loaded.

Readers from Backends¶

Create a reader from a backend¶

system_message = """You are an expert Q&A system that is trusted around the world.
    Always answer the query using the provided context information,
    and not prior knowledge.
    Some rules to follow:
    1. Never directly reference the given context in your answer
    2. Avoid statements like 'Based on the context, ...' or
    'The context information ...' or anything along those lines."""
prompt_text = """Context information is below.
            ---------------------
            {{ qcontext }}
            ---------------------
            Given the context information and not prior knowledge, answer the query.
            Query: {{ query }}
            "Answer: """

template = get_conversation_template("meta-llama-3.1-sp")
prompt = PromptTransformer(
    conversation_template=template,
    system_message=system_message,
    instruction=prompt_text,
    api_type="openai"
)

generation_args={
    "temperature": 0.1,
    "max_tokens": 128,
}

# this could equally be a real OpenAI model, or a HuggingFace model, or a vLLM model, etc.
llama = OpenAIBackend(model_name,
                    api_key="xxx",
                    generation_args=generation_args,
                    base_url="http://yyyy:8000/v1",)

llama_reader = Reader(llama, prompt=prompt)
bm25_llama = bm25_ret % 5 >> Concatenator() >> llama_reader

See _pyterrier_rag.readers for more information on how to use the Reader class with Backends.

Token Probabilities¶

Some components need the log probabilities of the generated tokens (and alternative tokens). This is included as part of the BackendOutput object when using return_logprobs=True in generate() or by using logprobs_generator(). For example:

Include log probabilities for response¶

backend.generate(["What is the capital of France?"], return_logprobs=True)
# [BackendOutput(text='The capital of France is Paris.', logprobs=[
#     {'The': -0.04, 'That': -0.31, ...},
#     ...,
#     {'Paris': -0.01, 'Berlin': -2.12, ...},
#     ...,
# ])]

inp = pd.DataFrame([
    {'prompt': 'What is the capital of France?'},
    {'prompt': 'What is the capital of Germany?'},
])
generator = backend.logprobs_generator()
generator(inp)
#                             prompt                            qanswer                           qanswer_logprobs
# 0   What is the capital of France?    The capital of France is Paris.  [{'The': -0.04, 'That': -0.31, ...}, ...]
# 1  What is the capital of Germany?  The capital of Germany is Berlin.  [{'The': -0.02, 'That': -0.29, ...}, ...]

This feature is typically most useful when a you have a single-token response. You can force the backend to generate a single token using max_new_tokens=1 and a suitable prompt:

Force a single token response¶

backend.generate(["What is the capital of France? Answer in a single word only."], max_new_tokens=1, return_logprobs=True)
# [BackendOutput(text='Paris', logprobs=[{'Paris': -0.01, 'Berlin': -2.12, ...}])]

inp = pd.DataFrame([
    {'prompt': 'What is the capital of France? Answer in a single word only.'},
    {'prompt': 'What is the capital of Germany? Answer in a single word only.'},
])
generator = backend.logprobs_generator(max_new_tokens=1)
generator(inp)
#                             prompt   qanswer                           qanswer_logprobs
# 0   What is the capital of France?     Paris   [{'Paris': -0.01, 'Berlin': -2.12, ...}]
# 1  What is the capital of Germany?    Berlin   [{'Berlin': -0.02, 'Paris': -2.29, ...}]

Reasoning¶

Some models output reasoning steps (contained within a <think> tag) before the final answer. If you want to extract these reasoning steps, you can use the ReasoningExtractor transformer in your pipeline.

Extract reasoning steps from a response¶

from pyterrier_rag import OpenAIBackend, ReasoningExtractor

# An example of a model that outputs reasoning steps in <think> tags:
backend = OpenAIBackend('deepseek-llama-3-8b-instruct', api_key="your_api_key", base_url="http://localhost:8000/v1")

pipeline = backend >> ReasoningExtractor() # extract reasoning after running the backend
inp = pd.DataFrame([
    {'prompt': 'What is the capital of France?'},
    {'prompt': 'What is the capital of Germany?'},
])
reasoning_extractor(inp)
#                             prompt  qanswer                                                       reasoning
# 0   What is the capital of France?    Paris    Ok, let me think about this. The capital of France is Paris.
# 1  What is the capital of Germany?   Berlin  Ok, let me think about this. The capital of Germany is Berlin.

API Documentation¶

General¶

class pyterrier_rag.Backend(model_id, *, max_input_length=512, max_new_tokens=32, verbose=False)[source]¶

Abstract base class for model-backed Transformers in PyTerrier.

Subclasses must implement the raw generation logic (generate) and the high-level generate method. Supports optional logprob extraction.

Parameters:

max_input_length (int) – Maximum token length for each input prompt.
max_new_tokens (int) – Maximum number of tokens to generate.
verbose (bool) – Flag to enable detailed logging.
device (Union[str, torch.device]) – Device for model execution.
model_id (str)

The following class attributes are available:

model_id¶

Model name or checkpoint path.

Type:: str

supports_logprobs¶

Indicates support for including the logprobs of generated tokens.

Type:: bool

supports_message_input¶

Indicates support for message (chat)-formatted (List[dict]) inputs to generate, in addition to str inputs.

Type:: bool

abstractmethod generate(inps, *, return_logprobs=False, max_new_tokens=None, stop_sequences=None)[source]¶

Generate text from input prompts.

Return type:

List[BackendOutput]

Parameters:

inps (Union[List[str], List[List[dict]]]) – Input prompts as strings or dictionaries. When strings, represent the prompts directly. When a list of dictionaries, represents a sequence of messages (if backend.supports_message_input==True).
return_logprobs (bool) – Whether to return logprobs of generated tokens along with text. (Only available if backend.supports_logprobs==True.)
max_new_tokens (Optional[int]) – Override for max tokens to generate.
stop_sequences (Optional[List[str]]) – List of tokens at which to stop generation. If None, generation is unconstrained.

Returns:

An output for each inp, each containing the generated text and optionally logprobs.

Return type:

List[BackendOutput]

text_generator(*, input_field='prompt', output_field='qanswer', batch_size=4, max_new_tokens=None, stop_sequences=None, num_responses=1)[source]¶

Create a text generator transformer using this backend.

Return type:

Transformer

Parameters:

input_field (str) – Name of the field containing input prompts.
output_field (str) – Name of the field to store generated text.
batch_size (int) – Number of prompts to process in each batch.
max_new_tokens (Optional[int]) – Override for max tokens to generate. If None, uses the backend’s max_new_tokens.
stop_sequences (Optional[List[str]]) – List of tokens at which to stop generation. If None, generation is unconstrained.
num_responses (int) – Number of responses to generate for each prompt.

logprobs_generator(*, input_field='prompt', output_field='qanswer', logprobs_field='qanswer_logprobs', batch_size=4, max_new_tokens=None, stop_sequences=None, num_responses=1)[source]¶

Create a text generator transformer that also returns the logprobs of each token using this backend.

Return type:

Transformer

Parameters:

input_field (str) – Name of the field containing input prompts.
output_field (str) – Name of the field to store generated text.
logprobs_field (str) – Name of the field to store logprobs.
batch_size (int) – Number of prompts to process in each batch.
max_new_tokens (Optional[int]) – Override for max tokens to generate. If None, uses the backend’s max_new_tokens.
stop_sequences (Optional[List[str]]) – List of tokens at which to stop generation. If None, generation is unconstrained.
num_responses (int) – Number of responses to generate for each prompt.

static from_dsn(dsn)[source]¶

Create a Backend instance from a DSN (Data Source Name) string.

The DSN format is: <provider>:<model_id> [key1=value1 key2=value2 ...].

Examples: "openai:gpt-3.5-turbo", "openai:meta-llama/Llama-4-Scout-17B-16E-Instruct base_url=http://localhost:8080/v1", "vllm:meta-llama/Llama-4-Scout-17B-16E-Instruct", ands "huggingface:meta-llama/Llama-4-Scout-17B-16E-Instruct".

See each backend implementation from_params method for their supported keys.

Return type:: Backend
Parameters:: dsn (str) – The DSN string to parse.
Returns:: An instance of the appropriate Backend subclass based on the provider.
Return type:: Backend
Raises:: ValueError – If the DSN format is invalid or the provider is unknown.

class pyterrier_rag.backend.TextGenerator(backend, *, input_field='prompt', output_field='qanswer', logprobs_field=None, batch_size=4, max_new_tokens=None, stop_sequences=None, num_responses=1)[source]¶

Transformer that generates text from the specified backend.

Parameters:

backend (Backend) – The backend to use for text generation.
input_field (str) – Name of the field containing input prompts.
output_field (str) – Name of the field to store generated text.
logprobs_field (Optional[str]) – Name of the field to store generated logprobs. If None, logprobs are not returned.
batch_size (int) – Number of prompts to process in each batch.
max_new_tokens (Optional[int]) – Override for max tokens to generate. If None, uses the backend’s max_new_tokens.
num_responses (int) – Number of responses to generate for each prompt.
stop_sequences (Optional[List[str]]) – List of tokens at which to stop generation. If None, generation is unconstrained.

class pyterrier_rag.backend.BackendOutput(text=None, logprobs=None)[source]¶

Parameters:

text (str)
logprobs (List[Dict[str, float]] | None)

Implementations¶

class pyterrier_rag.HuggingFaceBackend(model_id, *, model_args={}, generation_args=None, max_input_length=None, max_new_tokens=32, logprobs_topk=20, verbose=False, device=None)[source]¶

Backend implementation using a HuggingFace Transformer model. This backend assumes the class can be opened using AutoModelForCausalLM. If your class needs AutoModelForSeq2SeqLM, then use Seq2SeqLMBackend.

Citation

Wolf et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv 2019. [link]

@article{DBLP:journals/corr/abs-1910-03771,
  author       = {Thomas Wolf and
                  Lysandre Debut and
                  Victor Sanh and
                  Julien Chaumond and
                  Clement Delangue and
                  Anthony Moi and
                  Pierric Cistac and
                  Tim Rault and
                  R{\'{e}}mi Louf and
                  Morgan Funtowicz and
                  Jamie Brew},
  title        = {HuggingFace's Transformers: State-of-the-art Natural Language Processing},
  journal      = {CoRR},
  volume       = {abs/1910.03771},
  year         = {2019},
  url          = {http://arxiv.org/abs/1910.03771},
  eprinttype    = {arXiv},
  eprint       = {1910.03771},
  timestamp    = {Tue, 02 Jun 2020 12:49:01 +0200},
  biburl       = {https://dblp.org/rec/journals/corr/abs-1910-03771.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

model_id (str) – Identifier or path of the pretrained model.
model_args (dict) – Arguments passed to from_pretrained for model instantiation.
generation_args (dict) – Parameters controlling text generation.
max_input_length (int) – Maximum token length for inputs (defaults to model config).
max_new_tokens (int) – Maximum number of tokens to generate per input.
verbose (bool) – Flag to enable verbose logging.
logprobs_topk (int)
device (str | device)

static from_params(params)[source]¶

Create a HuggingFaceBackend instance from parameters.

Supported params:

model_id (str): Identifier or path of the HuggingFace model.

max_input_length (int): Maximum tokens per input prompt.

max_new_tokens (int): Tokens to generate per prompt.

logprobs_topk (int): Number of top logprobs to return.

verbose (bool): Enable verbose output.

Returns:: An instance of HuggingFaceBackend.
Return type:: HuggingFaceBackend
Return type:: HuggingFaceBackend
Parameters:: params (Dict[str, str])

class pyterrier_rag.VLLMBackend(model_id, *, model_args={}, generation_args=None, max_input_length=512, max_new_tokens=32, logprobs_topk=20, verbose=False)[source]¶

Backend implementation using the vLLM library for text generation.

Citation

Kwon et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023. [link]

@inproceedings{DBLP:conf/sosp/KwonLZ0ZY0ZS23,
  author       = {Woosuk Kwon and
                  Zhuohan Li and
                  Siyuan Zhuang and
                  Ying Sheng and
                  Lianmin Zheng and
                  Cody Hao Yu and
                  Joseph Gonzalez and
                  Hao Zhang and
                  Ion Stoica},
  editor       = {Jason Flinn and
                  Margo I. Seltzer and
                  Peter Druschel and
                  Antoine Kaufmann and
                  Jonathan Mace},
  title        = {Efficient Memory Management for Large Language Model Serving with
                  PagedAttention},
  booktitle    = {Proceedings of the 29th Symposium on Operating Systems Principles,
                  {SOSP} 2023, Koblenz, Germany, October 23-26, 2023},
  pages        = {611--626},
  publisher    = {{ACM}},
  year         = {2023},
  url          = {https://doi.org/10.1145/3600006.3613165},
  doi          = {10.1145/3600006.3613165},
  timestamp    = {Tue, 11 Feb 2025 11:42:30 +0100},
  biburl       = {https://dblp.org/rec/conf/sosp/KwonLZ0ZY0ZS23.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

model_id (str) – Identifier or path of the vLLM model.
model_args (dict, optional) – Keyword arguments for LLM instantiation.
generation_args (dict, optional) – Parameters for sampling (e.g., max_tokens, temperature).
max_input_length (int) – Maximum tokens per input prompt (inherited).
max_new_tokens (int) – Tokens to generate per prompt (inherited).
verbose (bool) – Enable verbose output.
logprobs_topk (int)

Raises:

ImportError – If the vllm library is unavailable.

static from_params(params)[source]¶

Create a VLLMBackend instance from parameters.

Supported params:

model_id (str): Identifier or path of the vLLM model.

max_input_length (int): Maximum tokens per input prompt.

max_new_tokens (int): Tokens to generate per prompt.

logprobs_topk (int): Number of top logprobs to return.

verbose (bool): Enable verbose output.

Returns:: An instance of VLLMBackend.
Return type:: VLLMBackend
Return type:: VLLMBackend
Parameters:: params (Dict[str, str])

class pyterrier_rag.OpenAIBackend(model_id, *, api_key=None, generation_args=None, max_input_length=512, max_new_tokens=32, max_retries=10, api='chat/completions', base_url=None, timeout=30.0, logprobs_topk=20, parallel=4, verbose=False)[source]¶

Backend using an OpenAI API-compatible endpoint.

Parameters:

model_id (str) – OpenAI model identifier.
api_key (str, optional) – API key or set via OPENAI_API_KEY env var.
generation_args (dict, optional) – Params for ChatCompletion.create.
max_input_length (int) – Max prompt tokens.
max_new_tokens (int) – Max tokens to generate.
max_retries (int) – Retry attempts for API errors.
api (str) – Which API endpoint to use.
base_url (str) – Base API URL
timeout (float) – Timeout for API calls
parallel (int) – Number of parallel requests to issue to the API.
verbose (bool) – Enable verbose logging.
logprobs_topk (int)

static from_params(params)[source]¶

Create an OpenAIBackend instance from the provided parameters.

Supported params:

model_id: str, the OpenAI model identifier (required)

api_key: str, API key for OpenAI (default: None, uses OPENAI_API_KEY env var). If value starts with $, loads the value from the provided environment variable.

max_retries: int, number of retries for API errors (default: 10)

base_url: str, base URL for the OpenAI API (default: None)

timeout: float, timeout for API calls in seconds (default: 30.0)

logprobs_topk: int, number of top log probabilities to return (default: 20)

parallel: int, number of parallel requests to issue to the API (default: 4)

verbose: bool, enable verbose logging (default: False)

Returns:
OpenAIBackend: An instance of OpenAIBackend.

Return type:: OpenAIBackend
Parameters:: params (Dict[str, str])