Structured Output

A very useful feature of OpenAI’s API is the ability to return structured data, most commonly a JSON object that follows a schema you define. The official OpenAI documentation describes structured output in detail.

OpenAI’s API can return responses in structured formats like JSON, making it easier to:

Parse and process responses programmatically
Ensure consistent output formats
Integrate with existing systems and databases

When using structured output, you can:

Define specific JSON schemas for your responses
Get predictable data structures instead of free-form text
Reduce the need for additional parsing/processing

Common use cases include:

Extracting specific fields from text
Converting unstructured data into structured formats
Creating standardized API responses
Building data pipelines with LLM outputs

Put very simply, the difference between structured and unstructured output is illustrated by the following example: Imagine you want to know the current weather in a city.

Unstructured output: The response is free-form text, and its wording can differ from one response to the next.

“The current weather in Bern is 8 degrees Celsius with partly cloudy skies.”

“The weather in Bern is 10° with rain.”

Structured output: The response is a JSON object with the weather information.

{"city": "Bern", 
"temperature": 8, 
"scale": "Celsius",
"condition": "partly cloudy"}

The benefit of structured output is that it is easier to parse and process programmatically. A further advantage is that we can use a data validation library like Pydantic to ensure that the response is in the expected format.
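As a minimal sketch of what "easier to parse" means, the JSON object from the example above can be loaded with Python's standard json module. The variable name raw_response is only illustrative; it stands in for the string an API would return.

```python
import json

# The structured response from the weather example, as a raw JSON string.
# (raw_response is an illustrative name, not part of any API.)
raw_response = '{"city": "Bern", "temperature": 8, "scale": "Celsius", "condition": "partly cloudy"}'

# Parse the JSON text into a Python dictionary.
weather = json.loads(raw_response)

# Individual fields are available by key, with their JSON types preserved.
print(weather["city"])
print(weather["temperature"] + 2)
```

Note that json.loads only checks that the text is valid JSON. It does not check that the expected fields are present or have the right types, which is the gap a validation library fills.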

To use this feature, we first need to install the pydantic package.

pip install pydantic

Then we can define a Pydantic model to describe the expected structure of the response.

from pydantic import BaseModel

class Weather(BaseModel):
    city: str
    temperature: float
    scale: str
    condition: str

We can pass this model class as the response_format parameter of the client.beta.chat.completions.parse method.
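Independently of the API, the same model can validate a JSON string locally. The following sketch assumes Pydantic v2 (where model_validate_json is available) and uses hand-written JSON strings in place of a real API response.

```python
from pydantic import BaseModel, ValidationError  # assumes Pydantic v2


class Weather(BaseModel):
    city: str
    temperature: float
    scale: str
    condition: str


# A complete response: parsing succeeds and the integer 8 is stored as a float.
raw_response = '{"city": "Bern", "temperature": 8, "scale": "Celsius", "condition": "partly cloudy"}'
weather = Weather.model_validate_json(raw_response)
print(weather.temperature)

# An incomplete response: validation fails and reports the missing fields.
incomplete = '{"city": "Bern", "temperature": 8}'
try:
    Weather.model_validate_json(incomplete)
except ValidationError as exc:
    missing = [err["loc"][0] for err in exc.errors()]
    print(missing)
```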

Extracting facts from text

Here is an example of how to use structured output. Since a pre-trained model cannot provide current weather information without calling a weather API, we will instead use a prompt that asks the model to extract facts from a text about a composer: the composer’s name, the years of birth and death, the country of origin, the genre of music they worked in, and some key works.

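from dotenv import load_dotenv
from openai import OpenAI

# Load environment variables (such as OPENAI_API_KEY) from a local .env file
load_dotenv()

# The client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()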

Next we define a Pydantic model to describe the expected structure of the response. The fields of the model correspond to the facts we want to extract.

In this case, we want to extract the following facts (if available):

The composer’s name
The year of birth
The year of death
The country of origin
The genre of music they worked in
Some key works

from pydantic import BaseModel
from typing import List, Optional

class ComposerFactSheet(BaseModel):
    name: str
    birth_year: int
    death_year: Optional[int] = None  # Optional for living composers
    country: str
    genre: str
    key_works: List[str]

This is a Pydantic model that defines a structured data format for storing information about composers:

class ComposerFactSheet(BaseModel): Creates a new class that inherits from Pydantic’s BaseModel, giving it data validation capabilities.
name: str: A required field for the composer’s name.
birth_year: int: A required field for the year of birth.
death_year: Optional[int] = None: An optional field for the year of death.
country: str: A required field for the country of origin.
genre: str: A required field for the genre of music.
key_works: List[str]: A required field for a list of key works.
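To see how these field definitions translate into a JSON schema, we can ask Pydantic to export one. This is a sketch assuming Pydantic v2 (model_json_schema). It shows only what Pydantic generates locally, not how the OpenAI SDK may adjust the schema before sending it.

```python
from typing import List, Optional

from pydantic import BaseModel  # assumes Pydantic v2


class ComposerFactSheet(BaseModel):
    name: str
    birth_year: int
    death_year: Optional[int] = None  # Optional for living composers
    country: str
    genre: str
    key_works: List[str]


# Export the model as a JSON schema (a plain Python dictionary).
schema = ComposerFactSheet.model_json_schema()

# Fields without a default value are listed as required.
print(schema["required"])

# Each field's type annotation becomes a schema entry.
print(schema["properties"]["birth_year"])
print(schema["properties"]["key_works"])
```

Because death_year has a default of None, it does not appear in the list of required fields.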

When used, this model will:

Validate that all required fields are present
Convert input data to the correct types when possible
Raise validation errors if data doesn’t match the schema

Example usage:

composer = ComposerFactSheet(
    name="Johann Sebastian Bach",
    birth_year=1685,
    death_year=1750,
    country="Germany",
    genre="Baroque",
    key_works=["Mass in B minor", "The Well-Tempered Clavier"]
)
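The three validation behaviours listed above can be tried out without calling the API. The sketch below assumes Pydantic v2 and its default (non-strict) mode; the input values are made up for illustration.

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError  # assumes Pydantic v2


class ComposerFactSheet(BaseModel):
    name: str
    birth_year: int
    death_year: Optional[int] = None
    country: str
    genre: str
    key_works: List[str]


# 1. Type conversion: the numeric string "1685" is converted to an int.
bach = ComposerFactSheet(
    name="Johann Sebastian Bach",
    birth_year="1685",
    country="Germany",
    genre="Baroque",
    key_works=["Mass in B minor"],
)
print(type(bach.birth_year), bach.death_year)

# 2. Missing required fields: country, genre and key_works are absent.
try:
    ComposerFactSheet(name="Unknown", birth_year=1700)
except ValidationError as exc:
    missing_count = exc.error_count()
    print(missing_count)

# 3. Wrong type: this string cannot be interpreted as an integer.
try:
    ComposerFactSheet(
        name="Unknown",
        birth_year="seventeenth century",
        country="Germany",
        genre="Baroque",
        key_works=[],
    )
except ValidationError as exc:
    bad_field = exc.errors()[0]["loc"][0]
    print(bad_field)
```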

Let’s try this with a suitable system prompt and a short paragraph about Erik Satie. We will use the GPT-4o model for this.

text = """
Éric Alfred Leslie Satie (1866–1925) was a French composer and pianist known for his eccentric personality and groundbreaking contributions to music. Often associated with the Parisian avant-garde, Satie coined the term “furniture music” (musique d’ameublement) to describe background music intended to blend into the environment, an early precursor to ambient music. He is perhaps best known for his piano compositions, particularly the Gymnopédies and Gnossiennes, which are characterized by their simplicity, haunting melodies, and innovative use of harmony. Satie’s collaborations with artists like Claude Debussy, Pablo Picasso, and Jean Cocteau established him as a central figure in early 20th-century modernism. Despite his whimsical demeanor, he significantly influenced composers such as John Cage and minimalists of the mid-20th century.
"""

system_prompt = """
You are an expert at extracting structured data from unstructured text.
"""

user_message = f"""
Please extract the following information from the text: {text}
"""

The f-string (formatted string literal) embeds the text variable into the user_message string. This lets us construct the prompt dynamically, so that it includes the specific text from which the model should extract structured information. Without the f-string, we would need to concatenate the strings manually, which is more error-prone and less readable.
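As a small illustration of the difference, the following sketch builds the same prompt once with an f-string and once by manual concatenation. The shortened text value is only a placeholder.

```python
# A shortened placeholder text, used only for this comparison.
text = "Satie was a French composer and pianist."

# With an f-string, the variable is inserted where the braces appear.
with_fstring = f"""
Please extract the following information from the text: {text}
"""

# Without an f-string, the pieces (including the newlines) are joined by hand.
with_concat = "\nPlease extract the following information from the text: " + text + "\n"

# Both approaches produce exactly the same string.
print(with_fstring == with_concat)
```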


completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", 
        "content": system_prompt},
        {"role": "user", 
        "content": user_message}
    ],
    response_format=ComposerFactSheet
)


factsheet = completion.choices[0].message.parsed
print(factsheet)

name='Éric Alfred Leslie Satie' birth_year=1866 death_year=1925 country='France' genre='Classical, Avant-Garde' key_works=['Gymnopédies', 'Gnossiennes']

We can now access the fields of the factsheet object.

factsheet.name

'Éric Alfred Leslie Satie'

factsheet.key_works

['Gymnopédies', 'Gnossiennes']
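Because the parsed result is a regular Pydantic object, it can also be converted to a dictionary or a JSON string, for example to store it in a database or pass it to another system. The sketch below assumes Pydantic v2 (model_dump and model_dump_json) and rebuilds the fact sheet by hand from the output shown above, so that it runs without an API call.

```python
import json
from typing import List, Optional

from pydantic import BaseModel  # assumes Pydantic v2


class ComposerFactSheet(BaseModel):
    name: str
    birth_year: int
    death_year: Optional[int] = None
    country: str
    genre: str
    key_works: List[str]


# Rebuilt by hand from the printed result above (no API call is made here).
factsheet = ComposerFactSheet(
    name="Éric Alfred Leslie Satie",
    birth_year=1866,
    death_year=1925,
    country="France",
    genre="Classical, Avant-Garde",
    key_works=["Gymnopédies", "Gnossiennes"],
)

as_dict = factsheet.model_dump()       # plain Python dictionary
as_json = factsheet.model_dump_json()  # JSON string

print(as_dict["key_works"])
print(as_json)
```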

Let’s try another example. This time we will attempt to extract information from a paragraph in which some of the information is missing: the text gives no year of death and does not state the country of origin explicitly.

text_2 = """
Frédéric Chopin (1810) was a composer and virtuoso pianist, renowned for his deeply expressive and technically innovative piano works. Often called the “Poet of the Piano,” Chopin’s music, including his nocturnes, mazurkas, and polonaises, is celebrated for blending Polish folk elements with Romantic lyricism. Born near Warsaw, he spent much of his career in Paris, influencing generations of musicians and cementing his place as one of the greatest composers of all time.
"""

user_message = f"""
Please extract the following information from the text: {text_2}
"""


completion_2 = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", 
        "content": system_prompt},
        {"role": "user", 
        "content": user_message}
    ],
    response_format=ComposerFactSheet
)

completion_2.choices[0].message.parsed

ComposerFactSheet(name='Frédéric Chopin', birth_year=1810, death_year=None, country='Poland', genre='Romantic', key_works=['nocturnes', 'mazurkas', 'polonaises'])

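The model returned None for the missing year of death, but it filled in the country, presumably inferring it from the birthplace near Warsaw. An obvious next step would be to improve our prompting strategy, so that the model indicates which fields it is able to fill in from the text, and which fields are associated with uncertain or missing information.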
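A first step in that direction, which does not require changing the prompt, is to check which optional fields came back empty. The helper missing_fields below is a hypothetical name introduced for this sketch, which assumes Pydantic v2 and rebuilds the Chopin result by hand from the output shown above. It cannot detect values that the model inferred rather than read from the text.

```python
from typing import List, Optional

from pydantic import BaseModel  # assumes Pydantic v2


class ComposerFactSheet(BaseModel):
    name: str
    birth_year: int
    death_year: Optional[int] = None
    country: str
    genre: str
    key_works: List[str]


def missing_fields(factsheet: BaseModel) -> List[str]:
    """Return the names of all fields whose value is None (hypothetical helper)."""
    return [name for name, value in factsheet.model_dump().items() if value is None]


# Rebuilt by hand from the printed result above (no API call is made here).
chopin = ComposerFactSheet(
    name="Frédéric Chopin",
    birth_year=1810,
    death_year=None,
    country="Poland",
    genre="Romantic",
    key_works=["nocturnes", "mazurkas", "polonaises"],
)

print(missing_fields(chopin))  # ['death_year']
```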

Creating a reusable function

Here, however, we will focus on making our code more reusable by wrapping the API call in a function that can be called with different texts.

def extract_composer_facts(text: str) -> ComposerFactSheet:
    system_prompt = """
    You are an expert at extracting structured data from unstructured text.
    """

    user_message = f"""
    Please extract the following information from the text: {text}
    """
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", 
            "content": system_prompt},
            {"role": "user", 
            "content": user_message}
        ],
        response_format=ComposerFactSheet
    )
    return completion.choices[0].message.parsed

bach_text = """
Johann Sebastian Bach (1685–1750) was a German composer and musician of the Baroque era, widely regarded as one of the greatest composers in Western music history. His masterful works, including the Brandenburg Concertos, The Well-Tempered Clavier, and the Mass in B Minor, showcase unparalleled contrapuntal skill and emotional depth. Bach’s music has influenced countless composers and remains a cornerstone of classical music education and performance worldwide.
"""


extract_composer_facts(bach_text)

ComposerFactSheet(name='Johann Sebastian Bach', birth_year=1685, death_year=1750, country='Germany', genre='Baroque', key_works=['Brandenburg Concertos', 'The Well-Tempered Clavier', 'Mass in B Minor'])