Cursory Notes from Hugging Face Chat Templates Documentation
- Generation prompts: appending the “prefix” / preamble, e.g. `<|im_start|>assistant` or similar, after the sequence of messages to induce / encourage generation behaviour from the model (i.e. by matching training conditions)
- “Pre-filling” a prompt is a useful technique for doing more classical, continuation-style generation: you do not supply the conditioning to begin a new message - i.e. the prefix/prompt for an assistant role - but instead supply an assistant message pre-filled with some tokens, then get the model to continue that message.
  - this should (I didn’t look up evidence) lead to better continuation behaviour from a chat-style finetuned model
  - this is achieved with the `continue_final_message` parameter in Hugging Face `transformers` (see the sketch below)
- “`add_generation_prompt` adds the tokens that start a new message, and `continue_final_message` removes any end-of-message tokens from the final message, [so] it does not make sense to use them together.”
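
A minimal sketch of the two options, assuming a recent `transformers` release that supports `continue_final_message` (the checkpoint name is only an example):

```python
from transformers import AutoTokenizer

# Any chat-finetuned model works here; this checkpoint is only an example.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [{"role": "user", "content": "Name a colour of the rainbow."}]

# Option 1: generation prompt - append the tokens that start a new assistant message.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Option 2: pre-fill - supply the start of an assistant message and have the model
# continue it; continue_final_message leaves the final message "open" (no
# end-of-message tokens are appended).
prefilled = messages + [{"role": "assistant", "content": "One colour of the rainbow is"}]
prefill_prompt = tokenizer.apply_chat_template(
    prefilled, tokenize=False, continue_final_message=True
)

print(prompt)
print(prefill_prompt)
```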
- “The only argument that `apply_chat_template` requires is `messages`”
- “Tool use” LLMs can choose to call functions as external tools before generating an answer. When passing tools to a tool-use model, you can simply pass a list of functions to the `tools` argument
  - Really nice example of tool use / calling in the docs (using `NousResearch/Hermes-2-Pro-Llama-3-8B`); see the sketch after this list
  - “Each function you pass to the `tools` argument of `apply_chat_template` is converted into a JSON schema. These schemas are then passed to the model chat template. In other words, tool-use models do not see your functions directly, and they never see the actual code inside them. What they care about is the function definitions and the arguments they need to pass to them”
  - Tool responses have a simple format: they are a message dict with the “tool” role, a “name” key giving the name of the called function, and a “content” key containing the result of the tool call. Here is a sample tool response, e.g. the JSON `{"role": "tool", "name": "multiply", "content": "30"}`
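
A hedged sketch of that round trip (the docs’ Hermes-2-Pro example shows the full flow; the tool-call format the model emits, and how you parse it, is model-specific and omitted here):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Hermes-2-Pro-Llama-3-8B")

def multiply(a: int, b: int) -> int:
    """
    Multiply two integers.

    Args:
        a: The first integer.
        b: The second integer.
    """
    # Type hints and the Args docstring are what the JSON schema is built from.
    return a * b

messages = [{"role": "user", "content": "What is 5 times 6?"}]

# The function is converted to a JSON schema behind the scenes: the model only
# sees the name, description and parameters, never the Python code itself.
prompt = tokenizer.apply_chat_template(
    messages, tools=[multiply], tokenize=False, add_generation_prompt=True
)

# After the model emits a tool call and you run multiply(5, 6) yourself, append the
# result as a "tool" message (a full exchange would also include the assistant
# message carrying the tool call) and re-apply the template for the next turn.
messages.append({"role": "tool", "name": "multiply", "content": "30"})
```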
- Why do some models have multiple templates?
  - Some models use different templates for different use cases.
    - For example, they might use one template for normal chat and another for tool-use, or retrieval-augmented generation. In these cases, `tokenizer.chat_template` is a dictionary. This can cause some confusion, and where possible, we recommend using a single template for all use-cases. You can use Jinja statements like `if tools is defined` and `{% macro %}` definitions to easily wrap multiple code paths in a single template.
  - When a tokenizer has multiple templates, `tokenizer.chat_template` will be a dict, where each key is the name of a template (see the sketch after this list).
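
A small sketch of handling both cases, assuming a named template can be selected via the `chat_template` argument (the checkpoint and the `"tool_use"` key are placeholders; the available names depend on the tokenizer):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint - whether chat_template is a dict depends on the model.
tokenizer = AutoTokenizer.from_pretrained("some-org/some-chat-model")
messages = [{"role": "user", "content": "Hi"}]

if isinstance(tokenizer.chat_template, dict):
    print("Named templates:", list(tokenizer.chat_template.keys()))
    # Select a named template explicitly, e.g. a hypothetical "tool_use" entry.
    text = tokenizer.apply_chat_template(
        messages,
        chat_template="tool_use",  # name is looked up in the dict
        tokenize=False,
        add_generation_prompt=True,
    )
else:
    # Single template: the usual call path.
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```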
- By default, Jinja will print any whitespace that comes before or after a block. This can be a problem for chat templates, which generally want to be very precise with whitespace
  - Adding `-` inside a tag (e.g. `{%- for ... %}`) will strip any whitespace that comes before the block. The second example in the docs looks innocent, but the newline and indentation may end up being included in the output, which is probably not what you want - this is a potential GOTCHA: see the examples in the article / docs and the sketch below
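
A toy illustration with plain `jinja2` (note that `transformers` compiles templates with its own Jinja environment settings, so the exact behaviour there can differ slightly):

```python
from jinja2 import Template

messages = [{"role": "user", "content": "Hi"}]

# The "-" inside the tags strips the surrounding whitespace from the output.
tight = Template(
    "{%- for m in messages -%}{{ m['role'] }}: {{ m['content'] }}{%- endfor -%}"
)

# Without "-", the newlines and indentation of the template source leak into the output.
loose = Template(
    """
    {% for m in messages %}
        {{ m['role'] }}: {{ m['content'] }}
    {% endfor %}
    """
)

print(repr(tight.render(messages=messages)))  # 'user: Hi'
print(repr(loose.render(messages=messages)))  # includes stray newlines and indentation
```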
- Special variables
  - Inside your template, you will have access to several special variables. The most important of these is `messages`, which contains the chat history as a list of message dicts. However, there are several others. Not every variable will be used in every template. The most common other variables are (a minimal template using them is sketched after this list):
    - `tools` contains a list of tools in JSON schema format. Will be `None` or undefined if no tools are passed.
    - `documents` contains a list of documents in the format `{"title": "Title", "contents": "Contents"}`, used for retrieval-augmented generation. Will be `None` or undefined if no documents are passed.
    - `add_generation_prompt` is a bool that is `True` if the user has requested a generation prompt, and `False` otherwise. If this is set, your template should add the header for an assistant message to the end of the conversation. If your model doesn’t have a specific header for assistant messages, you can ignore this flag.
    - Special tokens like `bos_token` and `eos_token`. These are extracted from `tokenizer.special_tokens_map`. The exact tokens available inside each template will differ depending on the parent tokenizer.
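
A minimal sketch of a template that touches most of these variables; the role/marker format is made up for illustration, and the checkpoint is only an example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # example checkpoint

# Uses messages, tools, add_generation_prompt, bos_token and eos_token; `documents`
# would be handled the same way as `tools` in a RAG template.
tokenizer.chat_template = (
    "{{- bos_token }}"
    "{%- if tools is defined and tools %}"
    "system: you may call these tools: {{ tools | tojson }}\n"
    "{%- endif %}"
    "{%- for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}{{ eos_token }}\n"
    "{%- endfor %}"
    "{%- if add_generation_prompt %}assistant:{%- endif %}"
)

print(tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    tokenize=False,
    add_generation_prompt=True,
))
```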
- Compatibility with non-Python Jinja
  - Write long templates to a file and read them back in from there to avoid errors, e.g. `tokenizer.chat_template = open("template.jinja").read()` (round-trip sketch below)
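
For example, a minimal round trip (the checkpoint is only an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # example checkpoint

# Write the template out with real newlines (much easier to edit and diff than a
# single escaped string), then read the edited file back in.
with open("template.jinja", "w") as f:
    f.write(tokenizer.chat_template)

tokenizer.chat_template = open("template.jinja").read()
```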
How do Structured Outputs work?
LLMs generate output text auto-regressively one token at a time through a sampling step that converts a probability distribution indicating the likelihood over all tokens into one selected token at each step. In the case of structured output generation, we modify this sampling step to only emit tokens consistent with the prescribed format. We briefly describe how.
For responses that need to adhere to a specific user-defined format as specified in the `response_format` parameter, we construct a finite state machine (FSM) that only accepts token sequences that are consistent with the format. We rely on an optimized version of the Outlines library for reliable parsing and FSM construction. Specifically, the FSM can be represented as a directed graph where each node represents the currently accepted partial generation, and each outgoing edge from a node represents all possible acceptable tokens from that state consistent with the user-provided format.
While there are several open source alternatives available that can help generate structured outputs, our testing showed they all degrade model performance. To circumvent this, we implemented a number of engineering optimizations and were able to construct these FSMs from JSON schemas efficiently and scalably, up to 80x faster than open source alternatives.
During the decoding phase, instead of directly sampling from the probability distribution emitted by the LLM, our sampling strategy uses the FSM to determine the space of valid tokens, and mutates the probability distribution by pinning the likelihood of all invalid tokens to zero. This ensures that the sampler only picks tokens that are accepted by the FSM, and consequently is guaranteed to adhere to the prescribed response format. Due to various system optimizations we implemented, these additional acceptance checks are done efficiently at almost zero overhead over the vanilla sampling strategy.
— Introducing structured outputs with JSON response format from Cohere
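
The Cohere post doesn't include code, but the idea can be sketched with a toy FSM and NumPy (hypothetical vocabulary and states; real systems build the FSM automatically from a JSON schema over the model's actual tokenizer, e.g. via Outlines):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and a hand-written FSM that only accepts the text `{"a": 1}`
# token by token. fsm maps state -> {allowed_token_id: next_state}.
vocab = ['{', '"a"', ':', ' ', '1', '}', 'hello', '<eos>']
fsm = {
    0: {0: 1},        # expect '{'
    1: {1: 2},        # expect '"a"'
    2: {2: 3},        # expect ':'
    3: {3: 4, 4: 5},  # optional space, or go straight to '1'
    4: {4: 5},        # expect '1'
    5: {5: 6},        # expect '}'
    6: {7: 7},        # expect '<eos>' (state 7 = done)
}

def constrained_sample(logits: np.ndarray, state: int) -> tuple[int, int]:
    """Pin the probability of every FSM-invalid token to zero, then sample."""
    allowed = fsm[state]
    masked = np.full_like(logits, -np.inf)   # everything forbidden by default
    idx = list(allowed)
    masked[idx] = logits[idx]                # keep only FSM-accepted tokens
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    token = int(rng.choice(len(logits), p=probs))
    return token, allowed[token]

state, out = 0, []
while state != 7:
    logits = rng.normal(size=len(vocab))     # stand-in for the LLM's next-token logits
    token, state = constrained_sample(logits, state)
    out.append(vocab[token])

print("".join(out[:-1]))  # always a format-valid string, e.g. {"a": 1}
```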