Cursory Notes from Hugging Face Chat Templates Documentation
- Generation prompts: appending the “prefix” / preamble, e.g. `<|im_start|>assistant` or similar, after the sequence of messages to induce / encourage generation behaviour from the model (i.e. by matching training conditions)
- “Pre-filling” a prompt is a useful technique for doing more classical, continuation-style generation: you do not supply the conditioning to begin a new message - i.e. the prefix/prompt for an assistant role - but instead supply an assistant message pre-filled with some tokens, then get the model to continue that message.
  - this should (I didn’t look up evidence) lead to better continuation behaviour from a chat-style finetuned model
  - this is achieved with the `continue_final_message` parameter in Hugging Face `transformers` (see the sketch below)
- “`add_generation_prompt` adds the tokens that start a new message, and `continue_final_message` removes any end-of-message tokens from the final message, [so] it does not make sense to use them together.”
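
A minimal sketch of the two options, assuming a recent `transformers` release that supports `continue_final_message` (the checkpoint name is only an example):

```python
from transformers import AutoTokenizer

# Any chat-finetuned model works here; this checkpoint is only an example.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [{"role": "user", "content": "Name a colour of the rainbow."}]

# Option 1: generation prompt - append the tokens that start a new assistant message.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Option 2: pre-fill - supply the start of an assistant message and have the model
# continue it; continue_final_message leaves the final message "open" (no
# end-of-message tokens are appended).
prefilled = messages + [{"role": "assistant", "content": "One colour of the rainbow is"}]
prefill_prompt = tokenizer.apply_chat_template(
    prefilled, tokenize=False, continue_final_message=True
)

print(prompt)
print(prefill_prompt)
```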
- “The only argument that `apply_chat_template` requires is `messages`”
- “Tool use” LLMs can choose to call functions as external tools before generating an answer. When passing tools to a tool-use model, you can simply pass a list of functions to the `tools` argument
  - Really nice example of tool use / calling in the docs (using `NousResearch/Hermes-2-Pro-Llama-3-8B`); see the sketch after this list
  - “Each function you pass to the `tools` argument of `apply_chat_template` is converted into a JSON schema. These schemas are then passed to the model chat template. In other words, tool-use models do not see your functions directly, and they never see the actual code inside them. What they care about is the function definitions and the arguments they need to pass to them”
  - Tool responses have a simple format: they are a message dict with the “tool” role, a “name” key giving the name of the called function, and a “content” key containing the result of the tool call. Here is a sample tool response, e.g. the JSON `{"role": "tool", "name": "multiply", "content": "30"}`
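
A hedged sketch of that round trip (the docs’ Hermes-2-Pro example shows the full flow; the tool-call format the model emits, and how you parse it, is model-specific and omitted here):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Hermes-2-Pro-Llama-3-8B")

def multiply(a: int, b: int) -> int:
    """
    Multiply two integers.

    Args:
        a: The first integer.
        b: The second integer.
    """
    # Type hints and the Args docstring are what the JSON schema is built from.
    return a * b

messages = [{"role": "user", "content": "What is 5 times 6?"}]

# The function is converted to a JSON schema behind the scenes: the model only
# sees the name, description and parameters, never the Python code itself.
prompt = tokenizer.apply_chat_template(
    messages, tools=[multiply], tokenize=False, add_generation_prompt=True
)

# After the model emits a tool call and you run multiply(5, 6) yourself, append the
# result as a "tool" message (a full exchange would also include the assistant
# message carrying the tool call) and re-apply the template for the next turn.
messages.append({"role": "tool", "name": "multiply", "content": "30"})
```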
- Why do some models have multiple templates?
  - Some models use different templates for different use cases.
    - For example, they might use one template for normal chat and another for tool-use, or retrieval-augmented generation. In these cases, `tokenizer.chat_template` is a dictionary. This can cause some confusion, and where possible, we recommend using a single template for all use-cases. You can use Jinja statements like `if tools is defined` and `{% macro %}` definitions to easily wrap multiple code paths in a single template.
  - When a tokenizer has multiple templates, `tokenizer.chat_template` will be a dict, where each key is the name of a template (see the sketch after this list).
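
A small sketch of handling both cases, assuming a named template can be selected via the `chat_template` argument (the checkpoint and the `"tool_use"` key are placeholders; the available names depend on the tokenizer):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint - whether chat_template is a dict depends on the model.
tokenizer = AutoTokenizer.from_pretrained("some-org/some-chat-model")
messages = [{"role": "user", "content": "Hi"}]

if isinstance(tokenizer.chat_template, dict):
    print("Named templates:", list(tokenizer.chat_template.keys()))
    # Select a named template explicitly, e.g. a hypothetical "tool_use" entry.
    text = tokenizer.apply_chat_template(
        messages,
        chat_template="tool_use",  # name is looked up in the dict
        tokenize=False,
        add_generation_prompt=True,
    )
else:
    # Single template: the usual call path.
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```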
- By default, Jinja will print any whitespace that comes before or after a block. This can be a problem for chat templates, which generally want to be very precise with whitespace
  - Adding `-` inside a tag (e.g. `{%- for ... %}`) will strip any whitespace that comes before the block. The second example in the docs looks innocent, but the newline and indentation may end up being included in the output, which is probably not what you want - this is a potential GOTCHA: see the examples in the article / docs and the sketch below
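
A toy illustration with plain `jinja2` (note that `transformers` compiles templates with its own Jinja environment settings, so the exact behaviour there can differ slightly):

```python
from jinja2 import Template

messages = [{"role": "user", "content": "Hi"}]

# The "-" inside the tags strips the surrounding whitespace from the output.
tight = Template(
    "{%- for m in messages -%}{{ m['role'] }}: {{ m['content'] }}{%- endfor -%}"
)

# Without "-", the newlines and indentation of the template source leak into the output.
loose = Template(
    """
    {% for m in messages %}
        {{ m['role'] }}: {{ m['content'] }}
    {% endfor %}
    """
)

print(repr(tight.render(messages=messages)))  # 'user: Hi'
print(repr(loose.render(messages=messages)))  # includes stray newlines and indentation
```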
- Special variables
  - Inside your template, you will have access to several special variables. The most important of these is `messages`, which contains the chat history as a list of message dicts. However, there are several others. Not every variable will be used in every template. The most common other variables are (a minimal template using them is sketched after this list):
    - `tools` contains a list of tools in JSON schema format. Will be `None` or undefined if no tools are passed.
    - `documents` contains a list of documents in the format `{"title": "Title", "contents": "Contents"}`, used for retrieval-augmented generation. Will be `None` or undefined if no documents are passed.
    - `add_generation_prompt` is a bool that is `True` if the user has requested a generation prompt, and `False` otherwise. If this is set, your template should add the header for an assistant message to the end of the conversation. If your model doesn’t have a specific header for assistant messages, you can ignore this flag.
    - Special tokens like `bos_token` and `eos_token`. These are extracted from `tokenizer.special_tokens_map`. The exact tokens available inside each template will differ depending on the parent tokenizer.
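
A minimal sketch of a template that touches most of these variables; the role/marker format is made up for illustration, and the checkpoint is only an example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # example checkpoint

# Uses messages, tools, add_generation_prompt, bos_token and eos_token; `documents`
# would be handled the same way as `tools` in a RAG template.
tokenizer.chat_template = (
    "{{- bos_token }}"
    "{%- if tools is defined and tools %}"
    "system: you may call these tools: {{ tools | tojson }}\n"
    "{%- endif %}"
    "{%- for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}{{ eos_token }}\n"
    "{%- endfor %}"
    "{%- if add_generation_prompt %}assistant:{%- endif %}"
)

print(tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    tokenize=False,
    add_generation_prompt=True,
))
```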
- Compatibility with non-Python Jinja
  - Write long templates to a file and read them back in from there to avoid errors, e.g. `tokenizer.chat_template = open("template.jinja").read()` (round-trip sketch below)
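
For example, a minimal round trip (the checkpoint is only an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # example checkpoint

# Write the template out with real newlines (much easier to edit and diff than a
# single escaped string), then read the edited file back in.
with open("template.jinja", "w") as f:
    f.write(tokenizer.chat_template)

tokenizer.chat_template = open("template.jinja").read()
```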
How do Structured Outputs work?
LLMs generate output text auto-regressively one token at a time through a sampling step that converts a probability distribution indicating the likelihood over all tokens into one selected token at each step. In the case of structured output generation, we modify this sampling step to only emit tokens consistent with the prescribed format. We briefly describe how.
For responses that need to adhere to a specific user-defined format as specified in the `response_format` parameter, we construct a finite state machine (FSM) that only accepts token sequences that are consistent with the format. We rely on an optimized version of the Outlines library for reliable parsing and FSM construction. Specifically, the FSM can be represented as a directed graph where each node represents the currently accepted partial generation, and each outgoing edge from a node represents all possible acceptable tokens from that state consistent with the user-provided format.
While there are several open source alternatives available that can help generate structured outputs, our testing showed they all degrade model performance. To circumvent this, we implemented a number of engineering optimizations and were able to construct these FSMs from JSON schemas efficiently and scalably, up to 80x faster than open source alternatives.
During the decoding phase, instead of directly sampling from the probability distribution emitted by the LLM, our sampling strategy uses the FSM to determine the space of valid tokens, and mutates the probability distribution by pinning the likelihood of all invalid tokens to zero. This ensures that the sampler only picks tokens that are accepted by the FSM, and consequently is guaranteed to adhere to the prescribed response format. Due to various system optimizations we implemented, these additional acceptance checks are done efficiently at almost zero overhead over the vanilla sampling strategy.
— Introducing structured outputs with JSON response format from Cohere
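
The Cohere post doesn't include code, but the idea can be sketched with a toy FSM and NumPy (hypothetical vocabulary and states; real systems build the FSM automatically from a JSON schema over the model's actual tokenizer, e.g. via Outlines):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and a hand-written FSM that only accepts the text `{"a": 1}`
# token by token. fsm maps state -> {allowed_token_id: next_state}.
vocab = ['{', '"a"', ':', ' ', '1', '}', 'hello', '<eos>']
fsm = {
    0: {0: 1},        # expect '{'
    1: {1: 2},        # expect '"a"'
    2: {2: 3},        # expect ':'
    3: {3: 4, 4: 5},  # optional space, or go straight to '1'
    4: {4: 5},        # expect '1'
    5: {5: 6},        # expect '}'
    6: {7: 7},        # expect '<eos>' (state 7 = done)
}

def constrained_sample(logits: np.ndarray, state: int) -> tuple[int, int]:
    """Pin the probability of every FSM-invalid token to zero, then sample."""
    allowed = fsm[state]
    masked = np.full_like(logits, -np.inf)   # everything forbidden by default
    idx = list(allowed)
    masked[idx] = logits[idx]                # keep only FSM-accepted tokens
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    token = int(rng.choice(len(logits), p=probs))
    return token, allowed[token]

state, out = 0, []
while state != 7:
    logits = rng.normal(size=len(vocab))     # stand-in for the LLM's next-token logits
    token, state = constrained_sample(logits, state)
    out.append(vocab[token])

print("".join(out[:-1]))  # always a format-valid string, e.g. {"a": 1}
```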