Pipelines & Prompt Optimization with DSPy | Drew Breunig


I stumbled across DSPy while looking for a framework to build a small agent (I wanted to try out some new techniques to make my weather site more interesting) and found its approach to prompting interesting. From their site, “DSPy is the framework for programming—rather than prompting—language models.”

And it’s true: you spend much, much less time prompting when you use DSPy to build LLM-powered applications. Because you let DSPy handle that bit for you.

There’s something really clean and freeing about ceding the details and nuance of the prompt back to an LLM.

Let’s quickly walk through how DSPy handles prompting for you and step through a simple categorization task as an example.

A Quick Intro to How DSPy Works

At first, DSPy reduces time spent prompting by providing you with boilerplate prompting that frames your tasks, which you define with “signatures”. Signatures are a way of expressing what you want an LLM to do by defining the desired inputs and outputs. They can be as simple as strings, like:

'question -> answer'

You can also specify types, like:

'sentence -> sentiment: bool'

Instinctively, I started looking for a dictionary of input and output types for signatures. But there isn’t one: signatures can use whatever terms you’d like, so long as they’re descriptive of your desired inputs and outputs. For example:

'document -> summary'
'novella -> tldr'
'baseball_player -> affiliated_team'

Signatures can also be defined as a class, which lets you add further specs for more complex tasks. But we’ll get to that later.
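To make the string form concrete, here is a minimal sketch of how such a signature could be split into named, typed fields. `parse_signature` is a hypothetical helper written for illustration, not DSPy's own parser, and it assumes that a field with no type is treated as `str`.

```python
def parse_signature(signature: str) -> dict:
    """Split 'a, b -> c: bool' into input and output fields with type names.

    Hypothetical helper for illustration; not DSPy's real parser.
    """
    def parse_fields(side: str) -> dict:
        fields = {}
        for chunk in side.split(","):
            name, _, type_name = chunk.partition(":")
            # Assumption: untyped fields are treated as strings
            fields[name.strip()] = type_name.strip() or "str"
        return fields

    inputs, arrow, outputs = signature.partition("->")
    if not arrow:
        raise ValueError("a signature needs '->' between inputs and outputs")
    return {"inputs": parse_fields(inputs), "outputs": parse_fields(outputs)}


print(parse_signature("sentence -> sentiment: bool"))
# {'inputs': {'sentence': 'str'}, 'outputs': {'sentiment': 'bool'}}
print(parse_signature("baseball_player -> affiliated_team"))
# {'inputs': {'baseball_player': 'str'}, 'outputs': {'affiliated_team': 'str'}}
```

The field names carry the meaning here: nothing in the sketch knows what a "sentiment" is, which is why descriptive names matter.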

Signatures define your desired work, but they are used to generate prompts by DSPy “modules”. For our purposes today, think of modules as runners which apply a specific set of prompt techniques to generate a prompt and run it against an LLM. The foundational module is Predict, which doesn’t do much out of the box besides frame your signature with some boilerplate instructions.

For example, given the signature question -> answer and the input question “What is the capital of France?”, the Predict module will call an LLM with the following system prompt:

Your input fields are:
1. `question` (str)

Your output fields are:
1. `answer` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## answer ## ]]
{answer}

[[ ## completed ## ]]
In adhering to this structure, your objective is:
    Given the fields `question`, produce the fields `answer`.

And an accompanying user prompt:

[[ ## question ## ]]
What is the capital of France?

Respond with the corresponding output fields, starting with the field `[[ ## answer ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`

(I’ve replaced the \n characters with newlines above, for legibility)

The module is performing some basic string formatting to contextualize your provided signature. If we were to use a different module –– DSPy provides ChainOfThought, ProgramOfThought, ReAct, and MultiChainComparison –– different prompt techniques would be used to contextualize and reformat your signature.
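As a rough illustration of that string formatting, the sketch below fills a fixed template with field names and types to reproduce the layout of the system prompt shown above. `render_system_prompt` is a hypothetical function and only approximates what DSPy does internally.

```python
def render_system_prompt(inputs: dict, outputs: dict, objective: str) -> str:
    """Fill a fixed template with field names and type names.

    Hypothetical approximation of the prompt layout, not DSPy's own code.
    """
    def numbered(fields: dict) -> list:
        return [
            f"{i}. `{name}` ({type_name})"
            for i, (name, type_name) in enumerate(fields.items(), start=1)
        ]

    lines = ["Your input fields are:", *numbered(inputs), ""]
    lines += ["Your output fields are:", *numbered(outputs), ""]
    lines += [
        "All interactions will be structured in the following way, "
        "with the appropriate values filled in.",
        "",
    ]
    # One marker block per field, inputs first, then outputs
    for name in [*inputs, *outputs]:
        lines += [f"[[ ## {name} ## ]]", "{" + name + "}", ""]
    lines += [
        "[[ ## completed ## ]]",
        "In adhering to this structure, your objective is:",
        f"    {objective}",
    ]
    return "\n".join(lines)


prompt = render_system_prompt(
    {"question": "str"},
    {"answer": "str"},
    "Given the fields `question`, produce the fields `answer`.",
)
print(prompt)
```

Swapping in a different module would amount to swapping in a different template, for instance one that asks for reasoning before the answer.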

Off the bat, this is helpful for quick LLM tasks, especially if you’re a beginner with prompts. But where DSPy really shines is when you ask it to optimize your prompts based on a provided training set.

Using DSPy to Categorize Historic Events

To illustrate how we can optimize prompts with DSPy, we’re going to use a simple toy problem: categorizing descriptions of historic events. While we do yield some gains with the following, this is designed to be a demonstration rather than a real-world approach (for a few reasons).

We’ve gathered the event descriptions by scraping Wikipedia’s date pages, obtaining a whole mess of descriptions like, “Battle of Nineveh: A Byzantine army under Emperor Heraclius defeats Emperor Khosrau II’s Persian forces, commanded by General Rhahzadh.”

First, let’s set up DSPy by running pip install -U dspy and the following lines:

import dspy
lm = dspy.LM('ollama_chat/llama3.2:1b', api_base='http://localhost:11434')
dspy.configure(lm=lm)

We’re using Llama 3.2 1b, running locally via Ollama (though you could use any of numerous adaptors). I like to start with small models when getting set up, as they help you iterate faster. DSPy and Ollama make it easy to step up to a larger model once we’ve got what we want running bug-free.

We’re going to use a class-based signature because it lets us explicitly specify the categories we want our events categorized with:

# 2. Set up the categorizer module
from typing import Literal

class Categorize(dspy.Signature):
    """Classify historic events."""

    event: str = dspy.InputField()
    category: Literal[
        "Wars and Conflicts",
        "Politics and Governance",
        "Science and Innovation",
        "Cultural and Artistic Movements",
        "Exploration and Discovery",
        "Economic Events",
        "Social Movements",
        "Man-Made Disasters and Accidents",
        "Natural Disasters and Climate",
        "Sports and Entertainment",
        "Famous Personalities and Achievements"
    ] = dspy.OutputField()
    confidence: float = dspy.OutputField()

classify = dspy.Predict(Categorize)

# Here is how we call this module
classification = classify(event="[YOUR HISTORIC EVENT]")

Let’s quickly look at what prompt the Predict module generates for us based on this definition when we pass in the event, “Second Boer War: In the Battle of Magersfontein the Boers commanded by general Piet Cronjé inflict a defeat on the forces of the British Empire commanded by Lord Methuen trying to relieve the Siege of Kimberley.”

Here’s the system prompt:

Your input fields are:
1. `event` (str)

Your output fields are:
1. `category` (Literal[Wars and Conflicts, Politics and Governance, Science and Innovation, Cultural and Artistic Movements, Exploration and Discovery, Economic Events, Social Movements, Man-Made Disasters and Accidents, Natural Disasters and Climate, Sports and Entertainment, Famous Personalities and Achievements])
2. `confidence` (float)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## event ## ]]
{event}

[[ ## category ## ]]
{category}        # note: the value you produce must be one of: Wars and Conflicts; Politics and Governance; Science and Innovation; Cultural and Artistic Movements; Exploration and Discovery; Economic Events; Social Movements; Man-Made Disasters and Accidents; Natural Disasters and Climate; Sports and Entertainment; Famous Personalities and Achievements

[[ ## confidence ## ]]
{confidence}        # note: the value you produce must be a single float value

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
    Classify historic events.

You can clearly see where the bits of my signature class are being dropped into this prompt. Here’s the user prompt:

[[ ## event ## ]]
Second Boer War: In the Battle of Magersfontein the Boers commanded by general Piet Cronjé inflict a defeat on the forces of the British Empire commanded by Lord Methuen trying to relieve the Siege of Kimberley.

Respond with the corresponding output fields, starting with the field `[[ ## category ## ]]` (must be formatted as a valid Python Literal[Wars and Conflicts, Politics and Governance, Science and Innovation, Cultural and Artistic Movements, Exploration and Discovery, Economic Events, Social Movements, Man-Made Disasters and Accidents, Natural Disasters and Climate, Sports and Entertainment, Famous Personalities and Achievements]), then `[[ ## confidence ## ]]` (must be formatted as a valid Python float), and then ending with the marker for `[[ ## completed ## ]]`.

To which the LLM responds with:

I can help you classify historic events.

[[ ## event ## ]]
The Second Boer War: In the Battle of Magersfontein the Boers commanded by general Piet Cronjé inflict a defeat on the forces of the British Empire commanded by Lord Methuen trying to relieve the Siege of Kimberley.

[[ ## category ## ]]
Wars and Conflicts

[[ ## confidence ## ]]
0.75

But when I instigate this interaction by calling the Predict module, this is what we get back:

Prediction(
    category='Wars and Conflicts',
    confidence=0.75
)

Already, this is a huge win. We’ve spec-ed out our categorization problem in a few lines, in a way that makes it much easier to edit our potential categories, and got back structured results without having to get our hands dirty with prompting boilerplate or manipulating the LLM response.
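The step from raw completion to structured result is essentially marker parsing plus type casting. Here is a minimal sketch of that idea, assuming the `[[ ## field ## ]]` markers shown above; `parse_response` and its `casts` argument are hypothetical names, and DSPy's own parsing handles more edge cases than this.

```python
import re

# Matches headers such as "[[ ## category ## ]]" and captures the field name
FIELD_HEADER = re.compile(r"\[\[ ## (\w+) ## \]\]")


def parse_response(text: str, casts: dict) -> dict:
    """Extract the requested fields from a marker-delimited completion."""
    # With one capture group, split() yields [preamble, name1, body1, name2, body2, ...]
    parts = FIELD_HEADER.split(text)
    parsed = {}
    for name, body in zip(parts[1::2], parts[2::2]):
        if name in casts:  # ignore echoed inputs and the "completed" marker
            parsed[name] = casts[name](body.strip())
    return parsed


response = """I can help you classify historic events.

[[ ## category ## ]]
Wars and Conflicts

[[ ## confidence ## ]]
0.75

[[ ## completed ## ]]
"""

print(parse_response(response, {"category": str, "confidence": float}))
# {'category': 'Wars and Conflicts', 'confidence': 0.75}
```

Note how the chatty preamble before the first marker is simply discarded, which is part of why the small model's extra text does no harm.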

But the actual answers? They’re okay…not great. Lots of war events are categorized as political events (which…fair, I guess) and other times a tricky keyword will throw the results. We could go through and hand sort the results, but let’s take advantage of DSPy’s ease of model switching to compare Llama 3.2 1b to the new, excellent Llama 3.3 70b.

Here’s how:

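import json
import pandas as pd

# classify_event wraps classify() and returns a (category, confidence) tuple;
# its definition appears in the training-set script later in this post
with open("0101_events.json", 'r') as file:
    data = json.load(file)
events = pd.DataFrame(data['events'])

# Classify every event with the small model
with dspy.context(lm=dspy.LM('ollama_chat/llama3.2:1b', api_base='http://localhost:11434')):
    events['category_32_1b'], events['confidence_32_1b'] = zip(*events['description'].apply(classify_event))

# Classify the same events with the large model
with dspy.context(lm=dspy.LM('ollama_chat/llama3.3', api_base='http://localhost:11434')):
    events['category_33'], events['confidence_33'] = zip(*events['description'].apply(classify_event))

events.to_csv("model_compare.csv", index=False)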

The models disagree on 28 of 59 events (about 47%), and where they disagree, Llama 3.3 is in the right. But this comes at a cost: Llama 3.3 ran ~10x slower.
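Counting disagreements like this is a simple column comparison. The sketch below shows one way to do it over rows shaped like `model_compare.csv`; the helper names are hypothetical and the three rows are toy placeholders, not the actual results.

```python
import csv


def load_rows(path):
    """Read a CSV such as model_compare.csv into a list of dicts."""
    with open(path, newline="") as file:
        return list(csv.DictReader(file))


def disagreements(rows, col_a, col_b):
    """Return the rows where the two category columns differ."""
    return [row for row in rows if row[col_a] != row[col_b]]


# Toy rows for illustration only; they are not taken from the real comparison
rows = [
    {"description": "event A", "category_32_1b": "Politics and Governance", "category_33": "Wars and Conflicts"},
    {"description": "event B", "category_32_1b": "Wars and Conflicts", "category_33": "Wars and Conflicts"},
    {"description": "event C", "category_32_1b": "Economic Events", "category_33": "Science and Innovation"},
]
# With the real file, use: rows = load_rows("model_compare.csv")

differing = disagreements(rows, "category_32_1b", "category_33")
print(f"{len(differing)} of {len(rows)} rows disagree ({len(differing) / len(rows):.0%})")
```

Deciding which model is right on each differing row still takes a human read; the code only narrows down where to look.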

Llama 3.3’s size provides much more context to situate these events, many of which couldn’t be categorized without knowledge of their subjects. In these instances, there’s not much we can do to help Llama 3.2 1b. Prompt engineering or fine-tuning isn’t going to add the diverse base knowledge needed for these calls.

But there are enough near misses that I think some improved prompting can eke out some gains from the 1b model.

Optimizing Our Prompts With DSPy

One aspect of DSPy modules we haven’t yet discussed is that we can optimize them. To do this, we need to define a metric and prepare some training data.

In DSPy, metrics are functions that take examples with ideal output and compare them to the output of our system. Here’s the one we’re going to use today:

def validate_category(example, prediction, trace=None):
    return prediction.category == example.category

As simple as it gets. If the predicted category doesn’t match the example’s category, it fails. (Check out DSPy’s docs for details on the Example object.)
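Because the metric only reads a `category` attribute from each argument, it is easy to try out without calling an LLM. The sketch below uses `SimpleNamespace` objects as stand-ins for DSPy's example and prediction objects, and adds a hypothetical, more forgiving variant (`validate_category_lenient`, not something from DSPy) to show how strict the exact match is.

```python
from types import SimpleNamespace


def validate_category(example, prediction, trace=None):
    return prediction.category == example.category


def validate_category_lenient(example, prediction, trace=None):
    """Hypothetical variant: ignore case and surrounding whitespace."""
    expected = str(example.category).strip().lower()
    predicted = str(prediction.category).strip().lower()
    return predicted == expected


# Stand-ins for an example and a prediction: anything with a .category attribute
example = SimpleNamespace(category="Wars and Conflicts")
prediction = SimpleNamespace(category=" wars and conflicts ")

print(validate_category(example, prediction))          # False
print(validate_category_lenient(example, prediction))  # True
```

Since the signature constrains `category` to a fixed set of labels, the strict version is a reasonable default here.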

Next, we’ll generate a training set of 300 categorized events using Llama 3.3:

# Generating example predictions from Llama 3.3
import os
import json
import pandas as pd
from tqdm import tqdm

# Register progress_apply on pandas objects
tqdm.pandas()

# Define a function to classify the event description
def classify_event(description):
    try:
        prediction = classify(event=description)
        return prediction.category, prediction.confidence
    except Exception as e:
        # Fall back to placeholder values if the call or the parsing fails
        return 0, 0

with dspy.context(lm=dspy.LM('ollama_chat/llama3.3', api_base='http://localhost:11434')):
    # Directory containing the JSON files
    events_dir = 'events'
    # Iterate over all files in the directory
    for filename in os.listdir(events_dir):
        if filename.endswith('.json'):
            filepath = os.path.join(events_dir, filename)
            with open(filepath, 'r') as file:
                data = json.load(file)
            events = pd.DataFrame(data['events'])
            events['category'], events['confidence'] = zip(*events['description'].progress_apply(classify_event))
            # Append the results to a global dataframe
            if 'lmma_events' not in globals():
                lmma_events = events
            else:
                lmma_events = pd.concat([lmma_events, events], ignore_index=True)
            # Break if the dataframe has more than 300 rows
            if len(lmma_events) > 300:
                print('Breaking...')
                break

# Save the results to a CSV file
lmma_events.to_csv('llama_3_3_trainset.csv', index=False)

These answers are great, but generating them took a while. A good reminder why we should try to eke as much value out of smaller models for these exercises.
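One thing to watch in the script above: `classify_event` returns `(0, 0)` when a call fails, so any failed rows land in the CSV with a category of `0`. A small filter like the following sketch would keep those out of the training set. `keep_labeled` is a hypothetical helper, and the inline CSV is a toy stand-in for `llama_3_3_trainset.csv`.

```python
import csv
import io

# The labels from the Categorize signature
CATEGORIES = {
    "Wars and Conflicts",
    "Politics and Governance",
    "Science and Innovation",
    "Cultural and Artistic Movements",
    "Exploration and Discovery",
    "Economic Events",
    "Social Movements",
    "Man-Made Disasters and Accidents",
    "Natural Disasters and Climate",
    "Sports and Entertainment",
    "Famous Personalities and Achievements",
}


def keep_labeled(rows, allowed=CATEGORIES):
    """Keep only rows whose category is one of the expected labels."""
    return [row for row in rows if row["category"] in allowed]


# Toy CSV for illustration; the "0" row mimics a failed classification
sample = io.StringIO(
    "description,category,confidence\n"
    "A battle is fought.,Wars and Conflicts,0.9\n"
    "Unparseable response.,0,0\n"
)
rows = keep_labeled(csv.DictReader(sample))
print(len(rows))  # 1
```

The same check could run on the dataframe before saving; filtering at load time just keeps the generation script untouched.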

To see how our tiny model –– Llama 3.2 1b –– fares, we can use DSPy’s evaluator functions:

import csv
import dspy
from dspy.evaluate import Evaluate

# Load the trainset
trainset = []
with open('llama_3_3_trainset.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        example = dspy.Example(event=row['description'], category=row['category']).with_inputs("event")
        trainset.append(example)

# Evaluate our existing function
evaluator = Evaluate(devset=trainset, num_threads=1, display_progress=True, display_table=5)
evaluator(classify, metric=validate_category)

51.9% of the time Llama 3.2 1b gets it right, about in line with our previous comparison. Nice to know this scales.

To improve our system, we specify an optimizer and ask DSPy to run it on our function using our training data:

from dspy.teleprompt import *
# Load our model
lm = dspy.LM('ollama_chat/llama3.2:1b', api_base='http://localhost:11434')
dspy.configure(lm=lm)
# Optimize
tp = dspy.MIPROv2(metric=validate_category, auto="light")
optimized_classify = tp.compile(classify, trainset=trainset, max_labeled_demos=0, max_bootstrapped_demos=0)

Getting into the depths of DSPy optimizers is beyond the scope of this post, but we’re choosing MIPROv2 because we only want to optimize the prompt the module and signature are using. We aren’t fine-tuning any weights, just trying to find a way of prompting our LLM so we get results more in line with our desired output.

DSPy will use the LLM to generate other ways of prompting our model –– trying rephrases, using examples from our training set, and more –– to find a prompt which outperforms the boilerplate it generated above. As you stack modules and signatures, forming more complex prompting chains, the search gets more complex and can yield much larger gains. Here we’re keeping it simple, using only one module and signature and asking that the optimizer not try few-shot prompts (aka prompts that include worked examples of inputs and outputs as demonstrations).
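Stripped to its core, instruction optimization is a search: propose candidate instructions, score each one against the training set with the metric, and keep the winner. The sketch below shows that loop with a plain exhaustive comparison. It is not MIPROv2, whose search is more sophisticated, and `stub_model`, the candidate instructions, and the two training examples are all made up for illustration.

```python
from types import SimpleNamespace


def validate_category(example, prediction, trace=None):
    return prediction.category == example.category


def score_instruction(instruction, trainset, run_model, metric):
    """Fraction of training examples the model gets right with this instruction."""
    hits = sum(bool(metric(ex, run_model(instruction, ex.event))) for ex in trainset)
    return hits / len(trainset)


def pick_best_instruction(candidates, trainset, run_model, metric):
    """Score every candidate and return (best_instruction, best_score)."""
    scored = [(score_instruction(c, trainset, run_model, metric), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def stub_model(instruction, event):
    # Stand-in for an LLM call: it pretends that an instruction mentioning
    # "context" helps it recognise battles. Purely illustrative behaviour.
    if "battle" in event.lower() and "context" in instruction.lower():
        return SimpleNamespace(category="Wars and Conflicts")
    return SimpleNamespace(category="Politics and Governance")


# Toy training examples, not real data
trainset = [
    SimpleNamespace(event="Battle of Nineveh: a Byzantine army defeats Persian forces.",
                    category="Wars and Conflicts"),
    SimpleNamespace(event="A new constitution is adopted.",
                    category="Politics and Governance"),
]
candidates = [
    "Classify historic events.",
    "Classify historic events. Consider the context of each event.",
]

best, best_score = pick_best_instruction(candidates, trainset, stub_model, validate_category)
print(best_score, best)
```

In the real optimizer the candidates come from an LLM rather than a hand-written list, which is where the interesting rephrasings originate.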

And wouldn’t you know it? It works. Our optimizer raised our evaluation from 51.9% to 63.0%.

It did this by making one slight change to our prompt. Where it previously read:

...

In adhering to this structure, your objective is: 
Classify historic events.

It now generates:

...

In adhering to this structure, your objective is:
Classify historic events. Consider using synonyms for "landed", such as "arrived" or "descended". Also, try to include more context about Charles II\"s actions and their potential political consequences.

That second part is some very specific over-fitting! Though the instruction to mind your synonyms seems beneficial and more generic. And the results look…pretty good! Running the new signature on a wider batch of data and eyeballing the results appears promising.
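One way to put a number on that kind of over-fitting, rather than eyeballing it, is to hold back part of the labeled examples and score the optimized prompt only on the part the optimizer never saw. A minimal sketch of such a split follows; `split_examples`, the 20% fraction, and the seed are assumptions, not something the walkthrough above does.

```python
import random


def split_examples(examples, dev_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split it into (train, dev) lists."""
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    dev_size = int(len(shuffled) * dev_fraction)
    return shuffled[dev_size:], shuffled[:dev_size]


# Integers stand in for labeled examples here
train, dev = split_examples(range(10))
print(len(train), len(dev))  # 8 2
```

The train portion would go to the optimizer and the dev portion to the evaluator, so a prompt that memorizes details of specific training events gets no credit for it.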

But we can do better. DSPy has a really neat feature that lets us specify the model we want to use for the task itself and another model for generating prompts. This is perfect for us, as it lets us leverage the much better Llama 3.3 to come up with prompting strategies while evaluating them against the tiny Llama 3.2 1b model.

Here’s how:

from dspy.teleprompt import *
# Load our model
lm = dspy.LM('ollama_chat/llama3.2:1b', api_base='http://localhost:11434')
prompt_gen_lm = dspy.LM('ollama_chat/llama3.3', api_base='http://localhost:11434')
dspy.configure(lm=lm)
# Optimize
tp = dspy.MIPROv2(metric=validate_category, auto="light", prompt_model=prompt_gen_lm, task_model=lm)
optimized_classify = tp.compile(classify, trainset=trainset, max_labeled_demos=0, max_bootstrapped_demos=0)

At first blush, this yields worse results: 62% vs our previous 63%. But the output looks much better on initial review. It’s easy to see how using a big LLM helped us avoid over-fitting and obtain better instructions.

Here’s the new modification:

...

In adhering to this structure, your objective is: 
Analyze the given historical event descriptions, which may pertain to various domains such as politics, science, conflicts, or cultural movements, and categorize each event into its most suitable category (e.g., Science and Innovation, Politics and Governance, Wars and Conflicts). Provide a confidence score for each categorization, indicating the level of certainty in assigning the event to its respective category. Ensure that your analysis is based on the content and context of the event description, utilizing natural language processing techniques to accurately determine the category and confidence score.

We use it like so:

classification = optimized_classify(event="Second Boer War: In the Battle of Magersfontein the Boers commanded by general Piet Cronjé inflict a defeat on the forces of the British Empire commanded by Lord Methuen trying to relieve the Siege of Kimberley.")
print(classification)
# We can save our optimization with:
optimized_classify.save("optimized_event_classifier.json")

Saving allows us to reload the optimized system during a different session.

DSPy is super useful, especially as your pipeline grows from a single, 0-shot call to a multistep, tool-using agent. The pattern of abstracting prompt generation away and leaving it to the models to figure out based on defined metrics is quite powerful.