LLM Backends¶
PyTerrier RAG supports a variety of LLM backends for generating responses. This functionality is facilitated by the
Backend interface, which currently has three implementations: HuggingFaceBackend,
VLLMBackend, and OpenAIBackend. This architecture
also allows different components to share the same backend, which is particularly useful for multi-stage RAG pipelines.
Basics¶
Start by creating an instance of a backend. For example, using the OpenAIBackend:
import pyterrier_rag as ptr
backend = ptr.OpenAIBackend('gpt-4o-mini', api_key="your_openai_api_key") # or loaded from OPENAI_API_KEY environment variable, if available
The backend can be used to generate responses to prompts. For example, using the generate method:
backend.generate(["What is the capital of France?"])
# Outputs: [BackendOutput(text='The capital of France is Paris.', logprobs=None)]
Backends also function as PyTerrier Transformers. By default, they take input from the prompt column and output
to the qanswer column:
import pandas as pd
inp = pd.DataFrame([
{'prompt': 'What is the capital of France?'},
{'prompt': 'What is the capital of Germany?'},
])
backend(inp)
# prompt qanswer
# 0 What is the capital of France? The capital of France is Paris.
# 1 What is the capital of Germany? The capital of Germany is Berlin.
Usually, though, you won’t use a backend directly; backends are typically used by other components, such as Prompts and Frameworks.
To set the global default backend, you can call pyterrier_rag.default_backend.set(). Note that this must be
called before any components that use the default backend are used (i.e., if you are using default_backend, we
recommend setting it at the top of your script/notebook).
import pyterrier_rag as ptr
ptr.default_backend.set(ptr.OpenAIBackend('gpt-4o-mini'))
ptr.default_backend.generate(['What is the capital of France?']) # -> uses the OpenAIBackend from above
The default backend is automatically loaded from the PYTERRIER_RAG_DEFAULT_BACKEND environment variable (using pyterrier_rag.Backend.from_dsn())
if it is set when PyTerrier RAG is first loaded.
Readers from Backends¶
system_message = """You are an expert Q&A system that is trusted around the world.
Always answer the query using the provided context information,
and not prior knowledge.
Some rules to follow:
1. Never directly reference the given context in your answer
2. Avoid statements like 'Based on the context, ...' or
'The context information ...' or anything along those lines."""
prompt_text = """Context information is below.
---------------------
{{ qcontext }}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {{ query }}
Answer: """
template = get_conversation_template("meta-llama-3.1-sp")
prompt = PromptTransformer(
conversation_template=template,
system_message=system_message,
instruction=prompt_text,
api_type="openai"
)
generation_args={
"temperature": 0.1,
"max_tokens": 128,
}
# this could equally be a real OpenAI model, or a HuggingFace model, or a vLLM model, etc.
llama = OpenAIBackend(model_name,
api_key="xxx",
generation_args=generation_args,
base_url="http://yyyy:8000/v1",)
llama_reader = Reader(llama, prompt=prompt)
bm25_llama = bm25_ret % 5 >> Concatenator() >> llama_reader
See pyterrier_rag.readers for more information on how to use the Reader class with Backends.
Token Probabilities¶
Some components need the log probabilities of the generated tokens (and alternative tokens). This is included
as part of the BackendOutput object when using return_logprobs=True in generate()
or by using logprobs_generator(). For example:
backend.generate(["What is the capital of France?"], return_logprobs=True)
# [BackendOutput(text='The capital of France is Paris.', logprobs=[
# {'The': -0.04, 'That': -0.31, ...},
# ...,
# {'Paris': -0.01, 'Berlin': -2.12, ...},
# ...,
# ])]
inp = pd.DataFrame([
{'prompt': 'What is the capital of France?'},
{'prompt': 'What is the capital of Germany?'},
])
generator = backend.logprobs_generator()
generator(inp)
# prompt qanswer qanswer_logprobs
# 0 What is the capital of France? The capital of France is Paris. [{'The': -0.04, 'That': -0.31, ...}, ...]
# 1 What is the capital of Germany? The capital of Germany is Berlin. [{'The': -0.02, 'That': -0.29, ...}, ...]
This feature is typically most useful when you have a single-token response. You can force the backend to generate
a single token using max_new_tokens=1 and a suitable prompt:
backend.generate(["What is the capital of France? Answer in a single word only."], max_new_tokens=1, return_logprobs=True)
# [BackendOutput(text='Paris', logprobs=[{'Paris': -0.01, 'Berlin': -2.12, ...}])]
inp = pd.DataFrame([
{'prompt': 'What is the capital of France? Answer in a single word only.'},
{'prompt': 'What is the capital of Germany? Answer in a single word only.'},
])
generator = backend.logprobs_generator(max_new_tokens=1)
generator(inp)
# prompt qanswer qanswer_logprobs
# 0 What is the capital of France? Paris [{'Paris': -0.01, 'Berlin': -2.12, ...}]
# 1 What is the capital of Germany? Berlin [{'Berlin': -0.02, 'Paris': -2.29, ...}]
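Log probabilities can be converted back into probabilities with exp. The following is a minimal sketch, assuming the list-of-dicts layout shown above (one dict per generated token, mapping candidate tokens to logprobs) and using made-up values. The helper name candidate_probs is hypothetical and not part of PyTerrier RAG. It renormalises over the requested candidates only, so the result is relative to those alternatives rather than to the full vocabulary.

```python
import math

def candidate_probs(token_logprobs, candidates):
    """Relative probability of each candidate token at one generation step.

    token_logprobs: dict mapping token -> log probability for a single step.
    candidates: tokens of interest; tokens absent from the dict get probability 0.
    """
    raw = {
        c: math.exp(token_logprobs[c]) if c in token_logprobs else 0.0
        for c in candidates
    }
    total = sum(raw.values())
    if total == 0.0:
        # None of the candidates appeared among the returned alternatives.
        return {c: 0.0 for c in candidates}
    return {c: p / total for c, p in raw.items()}

# Illustrative values in the same shape as the logprobs of a one-token answer.
logprobs = [{'Paris': -0.01, 'Berlin': -2.12}]
probs = candidate_probs(logprobs[0], ['Paris', 'Berlin'])
print(probs)
```

Because only the top alternatives are returned (see logprobs_topk), a candidate that is missing from the dict is treated here as having probability 0, which is an approximation.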
Reasoning¶
Some models output reasoning steps (contained within a <think> tag) before the final answer. If you want to
extract these reasoning steps, you can use the ReasoningExtractor transformer in your pipeline.
from pyterrier_rag import OpenAIBackend, ReasoningExtractor
# An example of a model that outputs reasoning steps in <think> tags:
backend = OpenAIBackend('deepseek-llama-3-8b-instruct', api_key="your_api_key", base_url="http://localhost:8000/v1")
pipeline = backend >> ReasoningExtractor() # extract reasoning after running the backend
inp = pd.DataFrame([
{'prompt': 'What is the capital of France?'},
{'prompt': 'What is the capital of Germany?'},
])
pipeline(inp)
# prompt qanswer reasoning
# 0 What is the capital of France? Paris Ok, let me think about this. The capital of France is Paris.
# 1 What is the capital of Germany? Berlin Ok, let me think about this. The capital of Germany is Berlin.
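To make the idea concrete, here is a minimal sketch of separating a <think> block from the final answer using a regular expression. It is an illustration only and not the ReasoningExtractor implementation; the function name split_reasoning and the sample text are assumptions.

```python
import re

# Non-greedy match so that only the first <think>...</think> block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text):
    """Return (answer, reasoning) for text that may contain a <think> block."""
    match = THINK_RE.search(text)
    if match is None:
        return text.strip(), None
    reasoning = match.group(1).strip()
    # The answer is whatever remains once the think block is removed.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return answer, reasoning

raw = "<think>Ok, let me think about this. The capital of France is Paris.</think>Paris"
answer, reasoning = split_reasoning(raw)
print(answer)
print(reasoning)
```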
API Documentation¶
General¶
- class pyterrier_rag.Backend(model_id, *, max_input_length=512, max_new_tokens=32, verbose=False)[source]¶
Abstract base class for model-backed Transformers in PyTerrier.
Subclasses must implement the raw generation logic (generate) and the high-level generate method. Supports optional logprob extraction.
- Parameters:
max_input_length (int) – Maximum token length for each input prompt.
max_new_tokens (int) – Maximum number of tokens to generate.
verbose (bool) – Flag to enable detailed logging.
device (Union[str, torch.device]) – Device for model execution.
model_id (str)
The following class attributes are available:
- model_id¶
Model name or checkpoint path.
- Type:
str
- supports_logprobs¶
Indicates support for including the logprobs of generated tokens.
- Type:
bool
- supports_message_input¶
Indicates support for message (chat)-formatted (List[dict]) inputs to generate, in addition to str inputs.
- Type:
bool
- abstract generate(inps, *, return_logprobs=False, max_new_tokens=None, stop_sequences=None)[source]¶
Generate text from input prompts.
- Return type:
List[BackendOutput]
- Parameters:
inps (Union[List[str], List[List[dict]]]) – Input prompts as strings or dictionaries. When strings, represent the prompts directly. When a list of dictionaries, represents a sequence of messages (if backend.supports_message_input==True).
return_logprobs (bool) – Whether to return logprobs of generated tokens along with text. (Only available if backend.supports_logprobs==True.)
max_new_tokens (Optional[int]) – Override for max tokens to generate.
stop_sequences (Optional[List[str]]) – List of tokens at which to stop generation. If None, generation is unconstrained.
- Returns:
An output for each inp, each containing the generated text and optionally logprobs.
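The effect of stop_sequences can be pictured as cutting the generated text at the earliest stop sequence. The sketch below is a conceptual illustration with a hypothetical helper (truncate_at_stop). It assumes the stop sequence itself is excluded from the output, which may differ between backends, and in practice the underlying library usually stops generation itself rather than trimming afterwards.

```python
def truncate_at_stop(text, stop_sequences=None):
    """Cut text at the earliest occurrence of any stop sequence (sequence excluded)."""
    if not stop_sequences:
        # None or an empty list: generation is unconstrained.
        return text
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # ignore empty strings, which would match at position 0
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

print(truncate_at_stop("Paris.\nQuestion: next", ["\nQuestion:", "###"]))
```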
- text_generator(*, input_field='prompt', output_field='qanswer', batch_size=4, max_new_tokens=None, stop_sequences=None, num_responses=1)[source]¶
Create a text generator transformer using this backend.
- Return type:
- Parameters:
input_field (str) – Name of the field containing input prompts.
output_field (str) – Name of the field to store generated text.
batch_size (int) – Number of prompts to process in each batch.
max_new_tokens (Optional[int]) – Override for max tokens to generate. If None, uses the backend’s max_new_tokens.
stop_sequences (Optional[List[str]]) – List of tokens at which to stop generation. If None, generation is unconstrained.
num_responses (int) – Number of responses to generate for each prompt.
- logprobs_generator(*, input_field='prompt', output_field='qanswer', logprobs_field='qanswer_logprobs', batch_size=4, max_new_tokens=None, stop_sequences=None, num_responses=1)[source]¶
Create a text generator transformer that also returns the logprobs of each token using this backend.
- Return type:
- Parameters:
input_field (str) – Name of the field containing input prompts.
output_field (str) – Name of the field to store generated text.
logprobs_field (str) – Name of the field to store logprobs.
batch_size (int) – Number of prompts to process in each batch.
max_new_tokens (Optional[int]) – Override for max tokens to generate. If None, uses the backend’s max_new_tokens.
stop_sequences (Optional[List[str]]) – List of tokens at which to stop generation. If None, generation is unconstrained.
num_responses (int) – Number of responses to generate for each prompt.
- static from_dsn(dsn)[source]¶
Create a Backend instance from a DSN (Data Source Name) string.
The DSN format is:
<provider>:<model_id> [key1=value1 key2=value2 ...].
Examples: "openai:gpt-3.5-turbo", "openai:meta-llama/Llama-4-Scout-17B-16E-Instruct base_url=http://localhost:8080/v1", "vllm:meta-llama/Llama-4-Scout-17B-16E-Instruct", and "huggingface:meta-llama/Llama-4-Scout-17B-16E-Instruct".
See each backend implementation's from_params method for its supported keys.
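As an illustration of the DSN format, the following sketch splits a DSN string into provider, model identifier, and key/value parameters. It is not the library's parser: the function name parse_dsn and its error handling are assumptions, and values are kept as strings to mirror the Dict[str, str] accepted by from_params.

```python
def parse_dsn(dsn):
    """Split '<provider>:<model_id> [key=value ...]' into its parts."""
    # Whitespace separates the head from the optional key=value pairs.
    head, *pairs = dsn.split()
    # Split on the first ':' only, so model ids and URLs may contain ':' or '/'.
    provider, sep, model_id = head.partition(":")
    if not sep or not provider or not model_id:
        raise ValueError(f"expected '<provider>:<model_id>', got {head!r}")
    params = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected 'key=value', got {pair!r}")
        params[key] = value
    return provider, model_id, params

parsed = parse_dsn(
    "openai:meta-llama/Llama-4-Scout-17B-16E-Instruct base_url=http://localhost:8080/v1"
)
print(parsed)
```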
- class pyterrier_rag.backend.TextGenerator(backend, *, input_field='prompt', output_field='qanswer', logprobs_field=None, batch_size=4, max_new_tokens=None, stop_sequences=None, num_responses=1)[source]¶
Transformer that generates text from the specified backend.
- Parameters:
backend (Backend) – The backend to use for text generation.
input_field (str) – Name of the field containing input prompts.
output_field (str) – Name of the field to store generated text.
logprobs_field (Optional[str]) – Name of the field to store generated logprobs. If None, logprobs are not returned.
batch_size (int) – Number of prompts to process in each batch.
max_new_tokens (Optional[int]) – Override for max tokens to generate. If None, uses the backend’s max_new_tokens.
num_responses (int) – Number of responses to generate for each prompt.
stop_sequences (Optional[List[str]]) – List of tokens at which to stop generation. If None, generation is unconstrained.
Implementations¶
- class pyterrier_rag.HuggingFaceBackend(model_id, *, model_args={}, generation_args=None, max_input_length=None, max_new_tokens=32, logprobs_topk=20, verbose=False, device=None)[source]¶
Backend implementation using a HuggingFace Transformers model. This backend assumes the model can be loaded using AutoModelForCausalLM. If your model needs AutoModelForSeq2SeqLM, use Seq2SeqLMBackend instead.
Citation
Wolf et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv 2019. [link]
@article{DBLP:journals/corr/abs-1910-03771,
  author = {Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and R{\'{e}}mi Louf and Morgan Funtowicz and Jamie Brew},
  title = {HuggingFace's Transformers: State-of-the-art Natural Language Processing},
  journal = {CoRR},
  volume = {abs/1910.03771},
  year = {2019},
  url = {http://arxiv.org/abs/1910.03771},
  eprinttype = {arXiv},
  eprint = {1910.03771},
  timestamp = {Tue, 02 Jun 2020 12:49:01 +0200},
  biburl = {https://dblp.org/rec/journals/corr/abs-1910-03771.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
- Parameters:
model_id (str) – Identifier or path of the pretrained model.
model_args (dict) – Arguments passed to from_pretrained for model instantiation.
generation_args (dict) – Parameters controlling text generation.
max_input_length (int) – Maximum token length for inputs (defaults to model config).
max_new_tokens (int) – Maximum number of tokens to generate per input.
verbose (bool) – Flag to enable verbose logging.
logprobs_topk (int)
device (str | device)
- static from_params(params)[source]¶
Create a HuggingFaceBackend instance from parameters.
Supported params:
model_id (str): Identifier or path of the HuggingFace model.
max_input_length (int): Maximum tokens per input prompt.
max_new_tokens (int): Tokens to generate per prompt.
logprobs_topk (int): Number of top logprobs to return.
verbose (bool): Enable verbose output.
- Returns:
An instance of HuggingFaceBackend.
- Return type:
- Parameters:
params (Dict[str, str])
- class pyterrier_rag.VLLMBackend(model_id, *, model_args={}, generation_args=None, max_input_length=512, max_new_tokens=32, logprobs_topk=20, verbose=False)[source]¶
Backend implementation using the vLLM library for text generation.
Citation
Kwon et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023. [link]
@inproceedings{DBLP:conf/sosp/KwonLZ0ZY0ZS23,
  author = {Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph Gonzalez and Hao Zhang and Ion Stoica},
  editor = {Jason Flinn and Margo I. Seltzer and Peter Druschel and Antoine Kaufmann and Jonathan Mace},
  title = {Efficient Memory Management for Large Language Model Serving with PagedAttention},
  booktitle = {Proceedings of the 29th Symposium on Operating Systems Principles, {SOSP} 2023, Koblenz, Germany, October 23-26, 2023},
  pages = {611--626},
  publisher = {{ACM}},
  year = {2023},
  url = {https://doi.org/10.1145/3600006.3613165},
  doi = {10.1145/3600006.3613165},
  timestamp = {Tue, 11 Feb 2025 11:42:30 +0100},
  biburl = {https://dblp.org/rec/conf/sosp/KwonLZ0ZY0ZS23.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
- Parameters:
model_id (str) – Identifier or path of the vLLM model.
model_args (dict, optional) – Keyword arguments for LLM instantiation.
generation_args (dict, optional) – Parameters for sampling (e.g., max_tokens, temperature).
max_input_length (int) – Maximum tokens per input prompt (inherited).
max_new_tokens (int) – Tokens to generate per prompt (inherited).
verbose (bool) – Enable verbose output.
logprobs_topk (int)
- Raises:
ImportError – If the vllm library is unavailable.
- static from_params(params)[source]¶
Create a VLLMBackend instance from parameters.
Supported params:
model_id (str): Identifier or path of the vLLM model.
max_input_length (int): Maximum tokens per input prompt.
max_new_tokens (int): Tokens to generate per prompt.
logprobs_topk (int): Number of top logprobs to return.
verbose (bool): Enable verbose output.
- Returns:
An instance of VLLMBackend.
- Return type:
- Parameters:
params (Dict[str, str])
- class pyterrier_rag.OpenAIBackend(model_id, *, api_key=None, generation_args=None, max_input_length=512, max_new_tokens=32, max_retries=10, api='chat/completions', base_url=None, timeout=30.0, logprobs_topk=20, parallel=4, verbose=False)[source]¶
Backend using an OpenAI API-compatible endpoint.
- Parameters:
model_id (str) – OpenAI model identifier.
api_key (str, optional) – API key or set via OPENAI_API_KEY env var.
generation_args (dict, optional) – Params for ChatCompletion.create.
max_input_length (int) – Max prompt tokens.
max_new_tokens (int) – Max tokens to generate.
max_retries (int) – Retry attempts for API errors.
api (str) – Which API endpoint to use.
base_url (str) – Base API URL
timeout (float) – Timeout for API calls
parallel (int) – Number of parallel requests to issue to the API.
verbose (bool) – Enable verbose logging.
logprobs_topk (int)
- static from_params(params)[source]¶
Create an OpenAIBackend instance from the provided parameters.
Supported params:
model_id: str, the OpenAI model identifier (required)
api_key: str, API key for OpenAI (default: None, uses OPENAI_API_KEY env var). If value starts with $, loads the value from the provided environment variable.
max_retries: int, number of retries for API errors (default: 10)
base_url: str, base URL for the OpenAI API (default: None)
timeout: float, timeout for API calls in seconds (default: 30.0)
logprobs_topk: int, number of top log probabilities to return (default: 20)
parallel: int, number of parallel requests to issue to the API (default: 4)
verbose: bool, enable verbose logging (default: False)
- Returns:
OpenAIBackend: An instance of OpenAIBackend.
- Parameters:
params (Dict[str, str])
- Return type:
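The api_key rule described above (a value starting with $ names an environment variable, and a missing value falls back to OPENAI_API_KEY) can be sketched as follows. This is an illustration under those stated rules, not the library's code; the helper name resolve_api_key, the environ argument, and the sample keys are assumptions.

```python
import os

def resolve_api_key(value, environ=os.environ):
    """Resolve an api_key param: '$NAME' is looked up in the environment."""
    if value is None:
        # No key given: fall back to the conventional environment variable.
        return environ.get("OPENAI_API_KEY")
    if value.startswith("$"):
        # '$MY_VAR' means "read the key from the MY_VAR environment variable".
        return environ.get(value[1:])
    return value

# A stand-in environment so the example does not depend on real credentials.
env = {"OPENAI_API_KEY": "sk-default", "MY_PROXY_KEY": "sk-proxy"}
print(resolve_api_key("$MY_PROXY_KEY", env))
print(resolve_api_key(None, env))
print(resolve_api_key("sk-literal", env))
```

Referencing a variable name instead of the key itself keeps secrets out of DSN strings such as the one stored in PYTERRIER_RAG_DEFAULT_BACKEND.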