End-to-End Workflow with torchtune — torchtune 0.3 documentation

In this tutorial, we’ll walk through an end-to-end example of how you can fine-tune, evaluate, optionally quantize, and then run generation with your favorite LLM using torchtune. We’ll also go over how you can use some popular tools and libraries from the community seamlessly with torchtune.

What this tutorial will cover:

  • Different types of recipes available in torchtune beyond fine-tuning

  • End-to-end example connecting all of these recipes

  • Different tools and libraries you can use with torchtune

Overview

Fine-tuning an LLM is usually only one step in a larger workflow. An example workflow might look something like this:

  • Download a popular model from HF Hub

  • Fine-tune the model using a relevant fine-tuning technique. The exact technique used will depend on factors such as the model, amount and nature of training data, your hardware setup and the end task for which the model will be used

  • Evaluate the model on some benchmarks to validate model quality

  • Run some generations to make sure the model output looks reasonable

  • Quantize the model for efficient inference

  • [Optional] Export the model for specific environments such as inference on a mobile phone

In this tutorial, we’ll cover how you can use torchtune for all of the above, leveraging integrations with popular tools and libraries from the ecosystem.

We’ll use the Llama2 7B model for this tutorial. You can find a complete set of models supported by torchtune here.

Download Llama2 7B

In this tutorial, we’ll use the Hugging Face model weights for the Llama2 7B model. For more information on checkpoint formats and how these are handled in torchtune, take a look at this tutorial on checkpoints.

To download the HF format Llama2 7B model, we’ll use the tune CLI.

tune download \
meta-llama/Llama-2-7b-hf \
--output-dir <checkpoint_dir> \
--hf-token <ACCESS TOKEN>

Make a note of <checkpoint_dir>; we’ll use it many times in this tutorial.

Fine-tune the model using LoRA

For this tutorial, we’ll fine-tune the model using LoRA. LoRA is a parameter-efficient fine-tuning technique that is especially helpful when you don’t have a lot of GPU memory to play with. LoRA freezes the base LLM and adds a very small percentage of learnable parameters, which keeps the memory associated with gradients and optimizer state low. Using torchtune, you should be able to fine-tune a Llama2 7B model with LoRA in less than 16GB of GPU memory using bfloat16 on an RTX 3090/4090. For more information on how to use LoRA, take a look at our LoRA Tutorial.
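
To make the idea concrete, here is a minimal sketch of the LoRA idea in plain PyTorch. This is an illustration only, not torchtune’s actual implementation (which handles scaling, dropout, initialization, and weight merging differently):

# Minimal sketch of the LoRA idea; torchtune's own implementation differs in details.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        self.base.weight.requires_grad = False              # freeze the base weight
        self.lora_a = nn.Linear(in_dim, rank, bias=False)   # small trainable projections
        self.lora_b = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.lora_b.weight)                  # start as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(4096, 4096, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} / {total} ({100 * trainable / total:.2f}%)")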

We’ll fine-tune using our single device LoRA recipe and use the standard settings from the default config.

This will fine-tune our model using a batch_size=2 and dtype=bfloat16. With these settings the model should have a peak memory usage of ~16GB and total training time of around two hours for each epoch. We’ll need to make some changes to the config to make sure our recipe can access the right checkpoints.

Let’s look for the right config for this use case by using the tune CLI.

tune ls

RECIPE                                   CONFIG
full_finetune_single_device              llama2/7B_full_low_memory
                                         mistral/7B_full_low_memory
full_finetune_distributed                llama2/7B_full
                                         llama2/13B_full
                                         mistral/7B_full
lora_finetune_single_device              llama2/7B_lora_single_device
                                         llama2/7B_qlora_single_device
                                         mistral/7B_lora_single_device
...

For this tutorial we’ll use the llama2/7B_lora_single_device config.

The config already points to the HF Checkpointer and the right checkpoint files. All we need to do is update the checkpoint directory for both the model and the tokenizer. Let’s do this using the overrides in the tune CLI while starting training!

tune run lora_finetune_single_device \
--config llama2/7B_lora_single_device \
checkpointer.checkpoint_dir=<checkpoint_dir> \
tokenizer.path=<checkpoint_dir>/tokenizer.model \
checkpointer.output_dir=<checkpoint_dir>

Once training is complete, you’ll see the following in the logs.

[_checkpointer.py:473] Model checkpoint of size 9.98 GB saved to <checkpoint_dir>/hf_model_0001_0.pt

[_checkpointer.py:473] Model checkpoint of size 3.50 GB saved to <checkpoint_dir>/hf_model_0002_0.pt

[_checkpointer.py:484] Adapter checkpoint of size 0.01 GB saved to <checkpoint_dir>/adapter_0.pt

The final trained weights are merged with the original model and split across two checkpoint files, similar to the source checkpoints from the HF Hub (see the LoRA Tutorial for more details). In fact, the keys will be identical between these checkpoints. We also have a third checkpoint file which is much smaller in size and contains the learned LoRA adapter weights. For this tutorial, we’ll only use the model checkpoints and not the adapter weights.
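
If you want to verify what ended up in each file, a quick inspection sketch like the one below works; the paths are placeholders for your actual <checkpoint_dir>, and the exact key names depend on the model:

# Quick inspection sketch: count the tensors saved in each checkpoint file.
import torch

model_shard_1 = torch.load("<checkpoint_dir>/hf_model_0001_0.pt", mmap=True, map_location="cpu")
model_shard_2 = torch.load("<checkpoint_dir>/hf_model_0002_0.pt", mmap=True, map_location="cpu")
adapter = torch.load("<checkpoint_dir>/adapter_0.pt", map_location="cpu")

print(len(model_shard_1), "tensors in shard 1")
print(len(model_shard_2), "tensors in shard 2")
print(len(adapter), "tensors in the LoRA adapter")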

Run Evaluation using EleutherAI’s Eval Harness

We’ve fine-tuned a model. But how well does this model really do? Let’s run some Evaluations!

torchtune integrates with EleutherAI’s evaluation harness. An example of this is available through the eleuther_eval recipe. In this tutorial, we’re going to directly use this recipe by modifying its associated config eleuther_evaluation.yaml.

Note

For this section of the tutorial, you should first run pip install lm_eval==0.4.* to install the EleutherAI evaluation harness.

Since we plan to update all of the checkpoint files to point to our fine-tuned checkpoints, let’s first copy over the config to our local working directory so we can make changes. This will be easier than overriding all of these elements through the CLI.

tune cp eleuther_evaluation ./custom_eval_config.yaml

For this tutorial we’ll use the truthfulqa_mc2 task from the harness. This task measures a model’s propensity to be truthful when answering questions: it reports zero-shot accuracy on questions that are each paired with one or more true and one or more false reference answers. Let’s first run a baseline without fine-tuning.

tune run eleuther_eval --config ./custom_eval_config.yaml \
checkpointer.checkpoint_dir=<checkpoint_dir> \
tokenizer.path=<checkpoint_dir>/tokenizer.model

[evaluator.py:324] Running loglikelihood requests
[eleuther_eval.py:195] Eval completed in 121.27 seconds.
[eleuther_eval.py:197] truthfulqa_mc2: {'acc,none': 0.388...

The model has an accuracy of around 38.8%. Let’s compare this with the fine-tuned model.

First, we modify custom_eval_config.yaml to include the fine-tuned checkpoints.

checkpointer:
    _component_: torchtune.training.FullModelHFCheckpointer

    # directory with the checkpoint files
    # this should match the output_dir specified during
    # fine-tuning
    checkpoint_dir: <checkpoint_dir>

    # checkpoint files for the fine-tuned model. This should
    # match what's shown in the logs above
    checkpoint_files: [
        hf_model_0001_0.pt,
        hf_model_0002_0.pt,
    ]

    output_dir: <checkpoint_dir>
    model_type: LLAMA2

# Make sure to update the tokenizer path to the right
# checkpoint directory as well
tokenizer:
    _component_: torchtune.models.llama2.llama2_tokenizer
    path: <checkpoint_dir>/tokenizer.model

Now, let’s run the recipe.

tune run eleuther_eval --config ./custom_eval_config.yaml

The results should look something like this.

[evaluator.py:324] Running loglikelihood requests
[eleuther_eval.py:195] Eval completed in 121.27 seconds.
[eleuther_eval.py:197] truthfulqa_mc2: {'acc,none': 0.489 ...

Our fine-tuned model gets ~48% on this task, which is ~10 points better than the baseline. Great! Seems like our fine-tuning helped.

Generation

We’ve run some evaluations and the model seems to be doing well. But does it really generate meaningful text for the prompts you care about? Let’s find out!

For this, we’ll use the generate recipe and the associated config.

Let’s first copy over the config to our local working directory so we can make changes.

tune cp generation ./custom_generation_config.yaml

Let’s modify custom_generation_config.yaml to include the following changes.

checkpointer:
    _component_: torchtune.training.FullModelHFCheckpointer

    # directory with the checkpoint files
    # this should match the output_dir specified during
    # fine-tuning
    checkpoint_dir: <checkpoint_dir>

    # checkpoint files for the fine-tuned model. This should
    # match what's shown in the logs above
    checkpoint_files: [
        hf_model_0001_0.pt,
        hf_model_0002_0.pt,
    ]

    output_dir: <checkpoint_dir>
    model_type: LLAMA2

# Make sure to update the tokenizer path to the right
# checkpoint directory as well
tokenizer:
    _component_: torchtune.models.llama2.llama2_tokenizer
    path: <checkpoint_dir>/tokenizer.model

Once the config is updated, let’s kick off generation! We’ll use the default sampling settings of top_k=300 and temperature=0.8. These parameters control how the next-token probabilities are computed. They are standard settings for Llama2 7B, and we recommend inspecting the model with these defaults before experimenting with other values.
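
If you’re curious what top-k and temperature actually do to the next-token distribution, here is a small, self-contained sketch of temperature-scaled top-k sampling over dummy logits. It is only an illustration of the idea, not torchtune’s generation code:

# Minimal sketch of temperature + top-k sampling over dummy logits.
import torch

def sample_next_token(logits: torch.Tensor, top_k: int = 300, temperature: float = 0.8) -> int:
    logits = logits / temperature                        # <1 sharpens, >1 flattens the distribution
    topk_vals, topk_idx = torch.topk(logits, k=min(top_k, logits.numel()))
    probs = torch.softmax(topk_vals, dim=-1)             # renormalize over the k best tokens only
    choice = torch.multinomial(probs, num_samples=1)     # sample one token from the truncated distribution
    return topk_idx[choice].item()

vocab_size = 32_000                                      # Llama2's vocabulary size
dummy_logits = torch.randn(vocab_size)
print(sample_next_token(dummy_logits))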

We’ll use a different prompt from the one in the config.

tune run generate --config ./custom_generation_config.yaml \
prompt="What are some interesting sites to visit in the Bay Area?"

Once generation is complete, you’ll see the following in the logs.

[generate.py:92] Exploratorium in San Francisco has made the cover of Time Magazine,
                 and its awesome. And the bridge is pretty cool...

[generate.py:96] Time for inference: 11.61 sec total, 25.83 tokens/sec
[generate.py:99] Memory used: 15.72 GB

Indeed, the bridge is pretty cool! Seems like our LLM knows a little something about the Bay Area!

Speeding up Generation using Quantization

We rely on torchao for post-training quantization. After installing torchao, we can quantize the fine-tuned model by running the following code:

# we also support `int8_weight_only()` and `int8_dynamic_activation_int8_weight()`, see
# https://github.com/pytorch/ao/tree/main/torchao/quantization#other-available-quantization-techniques
# for a full list of techniques that we support
from torchao.quantization.quant_api import quantize_, int4_weight_only

# `model` here is the fine-tuned Llama2 model loaded in memory; quantize_ modifies it in place
quantize_(model, int4_weight_only())

After quantization, we rely on torch.compile for speedups. For more details, please see this example usage.
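
To see the flow end to end (quantize in place, then compile), here is a toy sketch on a small stand-in module rather than the full Llama2 model. It assumes a CUDA GPU and uses int8_weight_only only because it runs on a wider range of hardware:

# Toy sketch of the quantize-then-compile flow on a stand-in module; the
# fine-tuned Llama2 model would be treated the same way. Assumes a CUDA GPU.
import torch
from torchao.quantization.quant_api import quantize_, int8_weight_only

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).to(device="cuda", dtype=torch.bfloat16)

quantize_(model, int8_weight_only())                 # swap weights for int8 versions in place
model = torch.compile(model, mode="max-autotune")    # compile the quantized model for faster inference

x = torch.randn(1, 4096, device="cuda", dtype=torch.bfloat16)
with torch.no_grad():
    y = model(x)                                     # first call triggers compilation, later calls are fast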

torchao also provides this table listing performance and accuracy results for llama2 and llama3.

For Llama models, you can run generation directly in torchao on the quantized model using their generate.py script as discussed in this readme. This way you can compare your own results to those in the previously-linked table.

Using torchtune checkpoints with other libraries

As we mentioned above, one of the benefits of torchtune’s checkpoint handling is that you can work directly with standard formats. This helps with interoperability with other libraries, since torchtune doesn’t add yet another format to the mix.

Let’s take a look at an example of how this would work with a popular codebase used for running performant inference with LLMs - gpt-fast. This section assumes that you’ve cloned that repository on your machine.

gpt-fast makes some assumptions about the checkpoint, in particular the availability of a key-to-file mapping, i.e. a file that maps parameter names to the files containing them. Let’s satisfy these assumptions by creating this mapping file. We’ll assume <new_dir>/Llama-2-7B-hf is the directory for this; gpt-fast expects the checkpoint directory to have the same format as the HF repo-id.

import json
import torch

# create the output dictionary
output_dict = {"weight_map": {}}

# Load the checkpoints
sd_1 = torch.load('<checkpoint_dir>/hf_model_0001_0.pt', mmap=True, map_location='cpu')
sd_2 = torch.load('<checkpoint_dir>/hf_model_0002_0.pt', mmap=True, map_location='cpu')

# create the weight map
for key in sd_1.keys():
    output_dict['weight_map'][key] = "hf_model_0001_0.pt"
for key in sd_2.keys():
    output_dict['weight_map'][key] = "hf_model_0002_0.pt"

with open('<new_dir>/Llama-2-7B-hf/pytorch_model.bin.index.json', 'w') as f:
    json.dump(output_dict, f)

Now that we’ve created the weight_map, let’s copy over our checkpoints.

cp <checkpoint_dir>/hf_model_0001_0.pt <new_dir>/Llama-2-7B-hf/
cp <checkpoint_dir>/hf_model_0002_0.pt <new_dir>/Llama-2-7B-hf/
cp <checkpoint_dir>/tokenizer.model    <new_dir>/Llama-2-7B-hf/

Once the directory structure is set up, let’s convert the checkpoints and run inference!

cd gpt-fast/

# convert the checkpoints into a format readable by gpt-fast
python scripts/convert_hf_checkpoint.py \
--checkpoint_dir <new_dir>/Llama-2-7B-hf/ \
--model 7B

# run inference using the converted model
python generate.py \
--compile \
--checkpoint_path <new_dir>/Llama-2-7B-hf/model.pth \
--device cuda

The output should look something like this:

Hello, my name is Justin. I am a middle school math teacher
at WS Middle School ...

Time for inference 5: 1.94 sec total, 103.28 tokens/sec
Bandwidth achieved: 1391.84 GB/sec

And that’s it! Try your own prompt!

Uploading your model to the Hugging Face Hub

Your new model is working great and you want to share it with the world. The easiest way to do this is to use the huggingface-cli command, which works seamlessly with torchtune. Simply point the CLI to your fine-tuned model directory like so:

huggingface-cli upload <hf-repo-id> <checkpoint-dir>

The command should output a link to your repository on the Hub. If the repository doesn’t exist yet, it will be created automatically:

https://huggingface.co/<hf-repo-id>/tree/main/.

For more details on the huggingface-cli upload feature, check out the Hugging Face docs.
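
If you prefer to upload from Python instead of the CLI, the huggingface_hub library exposes the same functionality. A minimal sketch (the repo id and checkpoint path are placeholders) might look like this:

# Minimal sketch of uploading the checkpoint directory from Python instead of
# the CLI; <hf-repo-id> and <checkpoint_dir> are placeholders.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("<hf-repo-id>", exist_ok=True)   # no-op if the repo already exists
api.upload_folder(
    repo_id="<hf-repo-id>",
    folder_path="<checkpoint_dir>",
)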

Hopefully this tutorial gave you some insights into how you can use torchtune for your own workflows. Happy Tuning!