Build: Interactive Exploration Labs

Duration: 90 minutes

  • Lab 1: Prompt Design (15 min)
  • Lab 2: Personalization A/B Test (15 min)
  • Lab 3: Data Model Designer (20 min)
  • Lab 4: Parameter Playground (10 min)
  • Lab 5: CLT Analyzer (20 min)
  • Final Reflection (10 min)

Learning Objectives

By the end of this section, you will:

  1. Experience how prompt design shapes learning outcomes
  2. Feel the cognitive load difference between generic and personalized examples
  3. Design the data structure for pedagogically sound worked examples
  4. Understand how model parameters affect educational quality
  5. Evaluate generated examples using CLT criteria

Important: A Different Kind of “Building”

This isn’t about writing code. You won’t become a Python developer in 90 minutes.

This is about understanding design principles. You’ll explore the key decisions that make AI educational tools effective or ineffective.

Why this matters: When you understand the principles, you can:

  • Critically evaluate existing AI tools
  • Request informed modifications from developers
  • Design tools grounded in learning science
  • Adapt templates to your teaching context

The Interactive Exploration App

Below is a live marimo notebook with 5 hands-on labs. You’ll experiment with the design decisions that shape AI educational tools.

How to use it:

  • Scroll through the labs in order
  • Fill in text fields with YOUR information
  • Click buttons to generate examples
  • Compare results and reflect on what you notice


Lab 1: Prompt Design Laboratory (15 minutes)

What You’ll Explore

Learning Question: How does prompt engineering affect the quality of worked examples?

Instructions

  1. Read both prompts in the app:

    • Basic Prompt (no pedagogical grounding)
    • CLT-Grounded Prompt (reduces cognitive load)
  2. Click “Generate Both Examples”

  3. Compare the results:

    • Which problem is clearer and more specific?
    • Which solution breaks down steps better?
    • Which explanation helps you understand WHY, not just WHAT?

Key Insight

The prompt IS your pedagogical design encoded in language.

Every word in your prompt shapes the language model’s output. Generic prompts produce generic examples. Pedagogically grounded prompts produce learning-focused examples.
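
To make the contrast concrete, here is a minimal sketch of the two prompt styles as Python strings. The wording is illustrative only, not the app’s exact prompts:

```python
# Illustrative prompts -- the app's actual prompts may differ.

BASIC_PROMPT = "Write an example problem about Python dictionaries, with a solution."

CLT_GROUNDED_PROMPT = """
You are creating a WORKED example for a novice programmer.
Concept: Python dictionaries.

Follow these cognitive load principles:
- Show the COMPLETE solution (a worked example, not a puzzle to solve).
- Break the solution into short, numbered steps (manage intrinsic load).
- Include only details needed to understand the concept (reduce extraneous load).
- After each step, explain WHY it works, not just WHAT it does (support schema building).
"""
```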

Reflection Questions

Think about:

  • What specific phrases in the CLT-grounded prompt improved the output?
  • How could you apply this to prompts in YOUR teaching domain?
  • What CLT principles are explicitly mentioned in the grounded prompt?

Discussion: Share one phrase from the CLT prompt that you found particularly effective.

Lab 2: Personalization A/B Test (15 minutes)

What You’ll Explore

Learning Question: Can you FEEL the difference in cognitive load?

Instructions

  1. Enter YOUR context:

    • Your hobby or interest (e.g., photography, cooking, gaming)
    • What you want to achieve (e.g., build a recipe app, automate photo editing)
  2. Click “Generate A/B Comparison”

  3. Read both examples:

    • Generic (standard textbook style)
    • Personalized (using your context)
  4. Notice how each one FEELS:

    • Which is more engaging to read?
    • Which feels easier to process mentally?
    • Can you visualize the personalized example more easily?

Key Insight

This is the personalization effect in action!

Familiar contexts require less cognitive effort to process. When you don’t have to decode an unfamiliar scenario, more working memory is available for learning the target concept.
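
As a rough sketch, the personalization step can be as simple as interpolating your context into the prompt. The function and its parameters below are hypothetical, not the app’s actual code:

```python
def personalized_prompt(concept: str, hobby: str, goal: str) -> str:
    """Build a CLT-grounded prompt anchored in the learner's own context.

    `hobby` and `goal` correspond to the text fields you fill in for Lab 2.
    """
    return (
        f"Create a worked example that teaches {concept}.\n"
        f"Set the scenario in the learner's hobby: {hobby}.\n"
        f"Connect it to their goal: {goal}.\n"
        "Keep the scenario familiar so working memory stays free for the concept itself."
    )

# The same concept, framed for one particular learner
print(personalized_prompt("Python dictionaries", "photography", "automate photo tagging"))
```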

Try This

Experiment further:

  1. Generate examples for 2-3 different hobbies
  2. Notice how the SAME concept (Python dictionaries) gets explained differently
  3. Which personalized context resonated most with you?

The takeaway: Personalization isn’t just “nice to have”—it’s a cognitive load reduction strategy.

Reflection Questions

Think about:

  • How did the personalized example reduce extraneous cognitive load?
  • Could you use personalization in YOUR teaching context?
  • What student interests/contexts could you leverage?

Discussion: Share your most effective personalized example with a neighbor.

Lab 3: Data Model Designer (20 minutes)

What You’ll Explore

Learning Question: What makes a worked example “worked”?

Instructions

  1. Read the current data model shown in the app

  2. Select fields you think support learning:

    • problem: str (The problem statement)
    • solution_steps: list[str] (Steps as a list for chunking!)
    • solution: str (Solution as one big block)
    • final_answer: str (Explicit conclusion)
    • key_insight: str (Why this approach works)
    • code_with_comments: str (Annotated code)
    • common_mistakes: str (What to avoid)
    • connection_to_real_world: str (Practical relevance)
  3. See the pedagogical analysis:

    • What CLT principles do your choices implement?
    • What’s your design score?
    • What feedback does the analyzer provide?

Key Insight

The data structure IS the pedagogy.

When you design a Pydantic model (the structure that controls generated outputs), you’re making pedagogical choices. Each field implements (or undermines) a CLT principle.
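
As a sketch, a CLT-informed model built from the fields listed above might look like this (the class name and descriptions are illustrative, not the app’s actual code):

```python
from pydantic import BaseModel, Field


class WorkedExample(BaseModel):
    """Structure the model must fill in -- every field encodes a CLT choice."""

    problem: str = Field(description="A clear, specific problem statement")
    solution_steps: list[str] = Field(
        description="The solution broken into short steps (chunking manages intrinsic load)"
    )
    final_answer: str = Field(description="An explicit conclusion, so nothing is left implicit")
    key_insight: str = Field(description="WHY this approach works (supports schema building)")
    code_with_comments: str = Field(description="Annotated code for the solution")
    common_mistakes: str = Field(description="What novices typically get wrong")
    connection_to_real_world: str = Field(description="Practical relevance for the learner")
```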

Important: Critical Design Choice (Chunking)

Notice the difference between:

  • solution_steps: list[str] (Forces GPT-5.1 to break the solution into chunks)
  • solution: str (Allows GPT-5.1 to generate everything in one block)

For novices: Chunked steps help manage intrinsic cognitive load by presenting one element at a time (Sweller, 1988)

For experts: One block may be fine (expertise reversal effect)

Your choice of data type (list vs. str) controls this!

Reflection Questions

Think about:

  • Why is solution_steps: list[str] better than solution: str for novices?
  • What field would you ADD for your teaching domain?
  • How does structure guide (or constrain) what the model generates?

Discussion: Design the ideal data model for worked examples in YOUR subject area. What fields would you include?

Lab 4: Parameter Playground (10 minutes)

What You’ll Explore

Learning Question: How do model parameters affect pedagogical quality?

Instructions

  1. Adjust the parameters:
    • Reasoning Effort (none, low, medium, high)
    • Verbosity (low, medium, high)
  2. Read the guidance:
    • For novices: Low reasoning (fast), medium-high verbosity (detailed)
    • For experts: Higher reasoning (better solutions), lower verbosity (concise)
  3. Consider the tradeoffs:
    • More reasoning = better quality but slower and more expensive
    • Higher verbosity = clearer explanations but longer to read

Key Insight

The “best” parameters depend on your learners!

There’s no universal setting. You must match technical parameters to pedagogical needs:

  • Novice learners: Need detailed, step-by-step explanations (high verbosity)
  • Expert learners: Want concise, sophisticated solutions (low verbosity, high reasoning)
  • Budget constraints: Lower reasoning is faster and cheaper
  • Quality requirements: Higher reasoning produces better examples

Note: Cost vs. Quality Tradeoffs

Real-world considerations:

| Setting | Speed | Cost | Quality | Best For |
|---|---|---|---|---|
| Low reasoning, low verbosity | Fast | Low | Basic | Quick practice problems |
| Low reasoning, high verbosity | Fast | Medium | Good | Novice learners at scale |
| High reasoning, high verbosity | Slow | High | Excellent | Premium personalized tutoring |

Your choice depends on: budget, learner needs, time constraints, and learning objectives.
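
For orientation, here is a minimal sketch of how these two settings might appear in code, assuming the OpenAI Python SDK’s Responses API. The model name and parameter values are illustrative; check the current API documentation for what your account supports:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

response = client.responses.create(
    model="gpt-5.1",                 # the model referenced in this section
    reasoning={"effort": "low"},     # the "Reasoning Effort" setting
    text={"verbosity": "high"},      # the "Verbosity" setting
    input="Create a worked example that teaches Python dictionaries to a novice.",
)
print(response.output_text)
```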

Reflection Questions

Think about:

  • What parameters would you use for YOUR learners?
  • How would you balance cost and quality?
  • When might you use different settings for different students?

Lab 5: CLT Analyzer (20 minutes)

What You’ll Explore

Learning Question: Can you evaluate examples using CLT principles?

Instructions

  1. Click “Generate Random Example”

  2. Read the example carefully

  3. Evaluate it using the checklist:

    • ✅ Reduces extraneous cognitive load (no unnecessary complexity)
    • ✅ Manages intrinsic load (breaks problem into chunks)
    • ✅ Optimizes germane load (helps build schemas/patterns)
    • ✅ Is a WORKED example (shows complete solution, not a puzzle)
    • ✅ Has clear step-by-step progression
    • ✅ Explains WHY, not just WHAT
  4. See your score:

    • 5-6: Excellent pedagogical design
    • 3-4: Good, but room for improvement
    • 1-2: Needs significant pedagogical revision
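
If it helps to see the rubric as code, here is a hypothetical sketch of the checklist and scoring bands (not the app’s actual implementation):

```python
# Mirrors the six Lab 5 criteria and the 5-6 / 3-4 / 1-2 rating bands.
CLT_CHECKLIST = [
    "Reduces extraneous cognitive load (no unnecessary complexity)",
    "Manages intrinsic load (breaks the problem into chunks)",
    "Optimizes germane load (helps build schemas/patterns)",
    "Is a WORKED example (complete solution, not a puzzle)",
    "Has clear step-by-step progression",
    "Explains WHY, not just WHAT",
]


def clt_score(checks: list[bool]) -> str:
    """Turn six yes/no judgements into the Lab 5 rating bands."""
    score = sum(checks)
    if score >= 5:
        return f"{score}/6: Excellent pedagogical design"
    if score >= 3:
        return f"{score}/6: Good, but room for improvement"
    return f"{score}/6: Needs significant pedagogical revision"


print(clt_score([True, True, True, False, True, False]))  # -> 4/6: Good, but room for improvement
```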

Key Insight

You’re developing a CLT-grounded critical lens for evaluating AI tools!

This skill is more valuable than coding. When you can evaluate generated outputs using learning science principles, you can:

  • Spot pedagogically weak examples
  • Request specific improvements
  • Compare competing language models and tools
  • Design better prompts and data models

Try This

Generate 3-4 examples and evaluate each:

  1. Do you see patterns in what GPT-5.1 generates well?
  2. What does it consistently miss?
  3. How would you revise the prompt to improve low-scoring areas?

The goal: Develop your critical evaluation instinct.

Reflection Questions

Think about:

  • Which CLT criteria are hardest for language models to meet?
  • What prompt changes would improve low-scoring examples?
  • How would you use this checklist when evaluating tools you already use?

Discussion: Share one example you evaluated. What was its score? What would you improve?

Final Reflection (10 minutes)

What You’ve Learned

Through these 5 labs, you explored:

  1. Prompts encode pedagogy (Design drives outputs)
  2. Personalization reduces load (Context matters)
  3. Structure shapes learning (Data models are pedagogical choices)
  4. Parameters affect quality (Settings have learning implications)
  5. Critical evaluation is a skill (You can assess AI tools with CLT)

Integration Questions

Consider:

  1. What surprised you most?
    • Which lab challenged your assumptions?
    • What principle seemed most powerful?
  2. What will you change?
    • How will you modify prompts you write for language models?
    • What will you look for when evaluating educational technology tools?
  3. What will you build?
    • Could you adapt this pattern to your teaching domain?
    • What concepts would you include in your worked example generator?
  4. What questions remain?
    • What do you still want to understand?
    • What would you need to deploy this in your context?

Checkpoint: Can You Answer These?

Pedagogical Understanding:

  • Why does chunking (solution_steps: list[str]) reduce cognitive load?
  • How does personalization reduce extraneous cognitive load?
  • What makes a worked example “worked” versus a problem to solve?

Practical Skills:

  • Can you write a CLT-grounded prompt for your subject?
  • Can you evaluate a generated example using CLT criteria?
  • Can you design a data model for worked examples in your domain?

Critical Thinking:

  • What’s the difference between technically impressive and pedagogically sound?
  • When should you prioritize speed/cost vs. quality?
  • What ethical considerations arise from personalized learning tools powered by language models?

What’s Next?

You’ve explored the design principles. Now you’re ready to:

Option 1: Use the Complete Tool

Try the full Worked Example Weaver

  • See all 3 domains (Programming, Health Sciences, Agronomy)
  • 16 concepts to explore
  • Generate personalized examples for real learners

Option 2: Design Your Extensions

Move to the Extend section where you’ll:

  • Plan how to adapt this to YOUR teaching context
  • Sketch your personalized worked example tool

Option 3: Dive Into the Code

If you want to understand the technical implementation:

Next: Extend Section (Design your own extensions)
