Lightning-Fast Data Analysis: Polars and plotnine for Modern Python

Tags: Python, Polars, plotnine, data manipulation, visualization, performance, tutorial

Published: 06 Jun, 2025

This tutorial explores the powerful combination of Polars and plotnine for high-performance data analysis in Python. Polars brings lightning-fast data manipulation with lazy evaluation, while plotnine provides ggplot2’s elegant grammar of graphics for Python, creating a modern alternative to the pandas + matplotlib/seaborn ecosystem.

Why Polars + plotnine?

The Performance Revolution: Polars

Polars is a blazingly fast DataFrame library that leverages:

  • Rust backend: Memory-efficient and CPU-optimized operations
  • Lazy evaluation: Query optimization before execution (see the sketch after this list)
  • Columnar processing: Apache Arrow format for speed
  • Parallel execution: Automatic multi-threading
  • Expressive API: Clean, readable data manipulation syntax
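
As a quick illustration of lazy evaluation, here is a minimal, self-contained sketch (toy data, not the tutorial dataset): the query plan is built first, can be inspected with explain(), and nothing executes until collect().

import polars as pl

lazy_demo = (
    pl.DataFrame({"x": [1, 2, 3, 4], "grp": ["a", "a", "b", "b"]})
    .lazy()                                   # build a query plan; nothing runs yet
    .filter(pl.col("x") > 1)                  # predicate is recorded, not applied
    .group_by("grp")
    .agg(pl.col("x").mean().alias("mean_x"))
)

print(lazy_demo.explain())    # inspect the optimized query plan
result = lazy_demo.collect()  # optimization + execution happen here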

The Grammar Advantage: plotnine

plotnine brings ggplot2’s grammar of graphics to Python:

  • Declarative syntax: Describe what you want, not how to draw it
  • Layered approach: Build complex plots incrementally (minimal example after this list)
  • Consistent aesthetics: Systematic approach to visual mapping
  • Extensible: Easy customization and theming
  • Educational: Matches R’s ggplot2 for cross-language consistency
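
To see the layered grammar in action, here is a minimal sketch using the mtcars sample data bundled with plotnine (not the tutorial dataset): each + adds one independent layer or setting.

from plotnine import ggplot, aes, geom_point, geom_smooth, labs
from plotnine.data import mtcars  # small example dataset shipped with plotnine

p = (
    ggplot(mtcars, aes(x="wt", y="mpg"))  # data + aesthetic mapping
    + geom_point()                        # layer 1: raw observations
    + geom_smooth(method="lm")            # layer 2: fitted linear trend
    + labs(x="Weight (1000 lbs)", y="Miles per gallon")
)
p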

Setup and Data Preparation

Show the code
import polars as pl
import numpy as np
from plotnine import (
    ggplot, aes, geom_point, geom_smooth, geom_violin, geom_boxplot, geom_col, 
    geom_tile, geom_text, stat_summary, facet_wrap,
    scale_color_brewer, scale_fill_brewer, scale_color_manual, scale_fill_manual,
    scale_color_gradient, scale_fill_gradient, scale_color_gradient2, 
    scale_fill_gradient2, scale_size_continuous, scale_alpha_continuous,
    labs, theme_minimal, theme, element_text, element_rect, element_line,
    element_blank, guide_legend, coord_flip, xlim, ylim, position_dodge
)
import plotnine.options
from datetime import datetime, timedelta
import warnings

# Configure plotnine for better output
plotnine.options.figure_size = (10, 6)
plotnine.options.dpi = 100
warnings.filterwarnings('ignore')

# Ensure proper display in Quarto
from IPython.display import display
import matplotlib
matplotlib.use('Agg')  # Use non-interactive backend

# Display Polars version and configuration
print(f"Polars version: {pl.__version__}")
print(f"Available threads: {pl.thread_pool_size()}")
Polars version: 1.30.0
Available threads: 14
Show the code
# Create a comprehensive educational dataset using Polars
# This demonstrates Polars' syntax while creating realistic data

np.random.seed(42)

# Generate base student data
n_students = 5000
n_courses = 8
n_semesters = 6

# Create students DataFrame
students_df = pl.DataFrame({
    "student_id": range(1, n_students + 1),
    "age": np.random.normal(22, 2, n_students).round().astype(int),
    "program": np.random.choice(["Computer Science", "Mathematics", "Physics", "Statistics"], n_students),
    "entry_year": np.random.choice([2020, 2021, 2022, 2023], n_students),
    "study_mode": np.random.choice(["Full-time", "Part-time"], n_students, p=[0.8, 0.2])
}).with_columns([
    # Add realistic constraints using Polars expressions
    pl.col("age").clip(18, 28).alias("age"),
    # Generate GPA with program-based bias
    pl.when(pl.col("program") == "Computer Science")
      .then(np.random.normal(3.2, 0.5, n_students))
      .when(pl.col("program") == "Mathematics") 
      .then(np.random.normal(3.4, 0.4, n_students))
      .when(pl.col("program") == "Physics")
      .then(np.random.normal(3.1, 0.6, n_students))
      .otherwise(np.random.normal(3.3, 0.5, n_students))
      .clip(1.0, 4.0)
      .round(2)
      .alias("gpa")
])

print("Students DataFrame shape:", students_df.shape)
students_df.head()
Students DataFrame shape: (5000, 6)
shape: (5, 6)
┌────────────┬─────┬────────────────────┬────────────┬─────────────┬──────┐
│ student_id ┆ age ┆ program            ┆ entry_year ┆ study_mode  ┆ gpa  │
│ ---        ┆ --- ┆ ---                ┆ ---        ┆ ---         ┆ ---  │
│ i64        ┆ i64 ┆ str                ┆ i64        ┆ str         ┆ f64  │
╞════════════╪═════╪════════════════════╪════════════╪═════════════╪══════╡
│ 1          ┆ 23  ┆ "Statistics"       ┆ 2022       ┆ "Full-time" ┆ 2.43 │
│ 2          ┆ 22  ┆ "Computer Science" ┆ 2020       ┆ "Full-time" ┆ 2.54 │
│ 3          ┆ 23  ┆ "Statistics"       ┆ 2020       ┆ "Part-time" ┆ 2.88 │
│ 4          ┆ 25  ┆ "Mathematics"      ┆ 2023       ┆ "Full-time" ┆ 3.58 │
│ 5          ┆ 22  ┆ "Statistics"       ┆ 2021       ┆ "Full-time" ┆ 3.94 │
└────────────┴─────┴────────────────────┴────────────┴─────────────┴──────┘
Show the code
# Create course performance data using a simple approach
courses = ["Calculus", "Linear Algebra", "Statistics", "Programming", 
          "Data Structures", "Machine Learning", "Research Methods", "Thesis"]

# Create performance data manually to avoid cross join issues
np.random.seed(42)
performance_records = []

# Get student data as list for iteration
student_records = students_df.to_dicts()

for student in student_records:
    for i, course in enumerate(courses):
        # Course difficulty multipliers
        if course in ["Machine Learning", "Thesis"]:
            base_multiplier = 20
            noise_factor = 1.6
        elif course in ["Calculus", "Linear Algebra"]:
            base_multiplier = 22
            noise_factor = 2.0
        else:
            base_multiplier = 21
            noise_factor = 1.2
        
        # Generate pseudo-random values based on student_id and course
        seed_val = (student["student_id"] * 7 + i * 13) % 1000
        
        # Calculate score
        base_score = student["gpa"] * base_multiplier
        score_variation = (seed_val / 100.0 - 5.0) * noise_factor
        score = max(0, min(100, round(base_score + score_variation, 1)))
        
        # Study hours based on course type
        if course in ["Programming", "Data Structures"]:
            study_hours = 8 + (seed_val % 100) / 10.0
        elif course == "Thesis":
            study_hours = 15 + (seed_val % 150) / 10.0
        else:
            study_hours = 5 + (seed_val % 80) / 10.0
        
        # Attendance
        attendance = max(50, min(100, round(85 + (seed_val % 50) / 5.0 - 5.0, 1)))
        
        performance_records.append({
            "student_id": student["student_id"],
            "program": student["program"],
            "gpa": student["gpa"],
            "course": course,
            "semester": i + 1,
            "score": score,
            "study_hours": round(study_hours, 1),
            "attendance": attendance
        })

# Create Polars DataFrame from the records
performance_df = pl.DataFrame(performance_records)

print("Performance DataFrame shape:", performance_df.shape)
performance_df.head()
Performance DataFrame shape: (40000, 8)
shape: (5, 8)
┌────────────┬──────────────┬──────┬───────────────────┬──────────┬───────┬─────────────┬────────────┐
│ student_id ┆ program      ┆ gpa  ┆ course            ┆ semester ┆ score ┆ study_hours ┆ attendance │
│ ---        ┆ ---          ┆ ---  ┆ ---               ┆ ---      ┆ ---   ┆ ---         ┆ ---        │
│ i64        ┆ str          ┆ f64  ┆ str               ┆ i64      ┆ f64   ┆ f64         ┆ f64        │
╞════════════╪══════════════╪══════╪═══════════════════╪══════════╪═══════╪═════════════╪════════════╡
│ 1          ┆ "Statistics" ┆ 2.43 ┆ "Calculus"        ┆ 1        ┆ 43.6  ┆ 5.7         ┆ 81.4       │
│ 1          ┆ "Statistics" ┆ 2.43 ┆ "Linear Algebra"  ┆ 2        ┆ 43.9  ┆ 7.0         ┆ 84.0       │
│ 1          ┆ "Statistics" ┆ 2.43 ┆ "Statistics"      ┆ 3        ┆ 45.4  ┆ 8.3         ┆ 86.6       │
│ 1          ┆ "Statistics" ┆ 2.43 ┆ "Programming"     ┆ 4        ┆ 45.6  ┆ 12.6        ┆ 89.2       │
│ 1          ┆ "Statistics" ┆ 2.43 ┆ "Data Structures" ┆ 5        ┆ 45.7  ┆ 13.9        ┆ 81.8       │
└────────────┴──────────────┴──────┴───────────────────┴──────────┴───────┴─────────────┴────────────┘
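
The explicit loop above is easy to read, but it materializes 40,000 records one dict at a time. If you don't need the per-row seeding logic, the same student × course grid can be built in one vectorized step with a Polars cross join (a sketch; the score, study-hours, and attendance columns would then be derived with expressions):

# Vectorized alternative: one row per (student, course) pair
courses_df = pl.DataFrame({
    "course": courses,
    "semester": range(1, len(courses) + 1),
})
pairs = students_df.join(courses_df, how="cross")  # 5,000 x 8 = 40,000 rows
print(pairs.shape)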

Polars Data Manipulation Mastery

1. Basic Operations and Lazy Evaluation

Show the code
# Demonstrate Polars' lazy evaluation
lazy_query = (
    performance_df
    .lazy()  # Switch to lazy mode
    .filter(pl.col("score") >= 70)
    .group_by(["program", "course"])
    .agg([
        pl.col("score").mean().alias("avg_score"),
        pl.col("study_hours").mean().alias("avg_study_hours"),
        pl.col("attendance").mean().alias("avg_attendance"),
        pl.len().alias("n_students")
    ])
    .sort("avg_score", descending=True)
)

# Execute the lazy query
program_performance = lazy_query.collect()
print("Top performing program-course combinations:")
program_performance.head(10)
Top performing program-course combinations:
shape: (10, 6)
┌────────────────────┬──────────────────┬───────────┬─────────────────┬────────────────┬────────────┐
│ program            ┆ course           ┆ avg_score ┆ avg_study_hours ┆ avg_attendance ┆ n_students │
│ ---                ┆ ---              ┆ ---       ┆ ---             ┆ ---            ┆ ---        │
│ str                ┆ str              ┆ f64       ┆ f64             ┆ f64            ┆ u32        │
╞════════════════════╪══════════════════╪═══════════╪═════════════════╪════════════════╪════════════╡
│ "Mathematics"      ┆ "Calculus"       ┆ 80.371966 ┆ 8.828641        ┆ 84.946117      ┆ 824        │
│ "Mathematics"      ┆ "Linear Algebra" ┆ 80.352906 ┆ 8.866828        ┆ 84.859564      ┆ 826        │
│ "Statistics"       ┆ "Calculus"       ┆ 80.228754 ┆ 8.865864        ┆ 84.898867      ┆ 706        │
│ "Statistics"       ┆ "Linear Algebra" ┆ 80.203841 ┆ 8.831721        ┆ 84.912376      ┆ 703        │
│ "Physics"          ┆ "Calculus"       ┆ 79.930411 ┆ 9.084794        ┆ 85.010376      ┆ 559        │
│ "Physics"          ┆ "Linear Algebra" ┆ 79.864298 ┆ 8.949378        ┆ 84.705151      ┆ 563        │
│ "Computer Science" ┆ "Linear Algebra" ┆ 79.621408 ┆ 8.850733        ┆ 84.927273      ┆ 682        │
│ "Computer Science" ┆ "Calculus"       ┆ 79.557143 ┆ 8.905102        ┆ 85.037609      ┆ 686        │
│ "Physics"          ┆ "Programming"    ┆ 77.858796 ┆ 13.072222       ┆ 85.144444      ┆ 432        │
│ "Physics"          ┆ "Statistics"     ┆ 77.850229 ┆ 8.884404        ┆ 85.030275      ┆ 436        │
└────────────────────┴──────────────────┴───────────┴─────────────────┴────────────────┴────────────┘
Show the code
# Advanced Polars expressions and window functions
student_rankings = (
    performance_df
    .with_columns([
        # Rank scores within each course (average rank; ties share a rank)
        pl.col("score").rank(method="average").over("course").alias("course_rank"),
        
        # Calculate student average score
        pl.col("score").mean().over("student_id").alias("student_avg"),
        
        # Flag high performers (top 10% in course) - simplified calculation
        (pl.col("score").rank(method="average", descending=True).over("course") <= 
         (pl.col("score").count().over("course") * 0.1).cast(pl.Int64)).alias("top_performer")
    ])
    .filter(pl.col("semester") >= 4)  # Focus on advanced courses
)

print("Student rankings with advanced metrics:")
student_rankings.head()
Student rankings with advanced metrics:
shape: (5, 11)
┌────────────┬──────────────┬──────┬────────────────────┬──────────┬───────┬─────────────┬────────────┬─────────────┬─────────────┬───────────────┐
│ student_id ┆ program      ┆ gpa  ┆ course             ┆ semester ┆ score ┆ study_hours ┆ attendance ┆ course_rank ┆ student_avg ┆ top_performer │
│ ---        ┆ ---          ┆ ---  ┆ ---                ┆ ---      ┆ ---   ┆ ---         ┆ ---        ┆ ---         ┆ ---         ┆ ---           │
│ i64        ┆ str          ┆ f64  ┆ str                ┆ i64      ┆ f64   ┆ f64         ┆ f64        ┆ f64         ┆ f64         ┆ bool          │
╞════════════╪══════════════╪══════╪════════════════════╪══════════╪═══════╪═════════════╪════════════╪═════════════╪═════════════╪═══════════════╡
│ 1          ┆ "Statistics" ┆ 2.43 ┆ "Programming"      ┆ 4        ┆ 45.6  ┆ 12.6        ┆ 89.2       ┆ 137.0       ┆ 44.275      ┆ false         │
│ 1          ┆ "Statistics" ┆ 2.43 ┆ "Data Structures"  ┆ 5        ┆ 45.7  ┆ 13.9        ┆ 81.8       ┆ 136.5       ┆ 44.275      ┆ false         │
│ 1          ┆ "Statistics" ┆ 2.43 ┆ "Machine Learning" ┆ 6        ┆ 41.8  ┆ 12.2        ┆ 84.4       ┆ 115.0       ┆ 44.275      ┆ false         │
│ 1          ┆ "Statistics" ┆ 2.43 ┆ "Research Methods" ┆ 7        ┆ 46.0  ┆ 5.5         ┆ 87.0       ┆ 143.5       ┆ 44.275      ┆ false         │
│ 1          ┆ "Statistics" ┆ 2.43 ┆ "Thesis"           ┆ 8        ┆ 42.2  ┆ 24.8        ┆ 89.6       ┆ 124.0       ┆ 44.275      ┆ false         │
└────────────┴──────────────┴──────┴────────────────────┴──────────┴───────┴─────────────┴────────────┴─────────────┴─────────────┴───────────────┘
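
Note that course_rank is a raw average rank, not a percentile. If you want a true percentile rank in [0, 1], divide by the group size, for example:

# Convert the raw rank into a within-course percentile (0-1)
percentiles = performance_df.with_columns(
    (pl.col("score").rank(method="average").over("course")
     / pl.col("score").count().over("course")).alias("course_percentile")
)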

2. Complex Aggregations and Transformations

Show the code
# Multi-level aggregations using Polars
program_analysis = (
    student_rankings
    .group_by("program")
    .agg([
        # Basic statistics
        pl.col("score").mean().alias("avg_score"),
        pl.col("score").std().alias("std_score"),
        pl.col("score").quantile(0.5).alias("median_score"),
        
        # Advanced metrics
        pl.col("top_performer").sum().alias("top_performers_count"),
        pl.col("top_performer").mean().alias("top_performer_rate"),
        
        # Study behavior
        pl.col("study_hours").mean().alias("avg_study_hours"),
        pl.col("attendance").mean().alias("avg_attendance"),
        
        # Count and range
        pl.len().alias("total_records"),
        (pl.col("score").max() - pl.col("score").min()).alias("score_range")
    ])
    .sort("avg_score", descending=True)
)

print("Comprehensive program analysis:")
program_analysis
Comprehensive program analysis:
shape: (4, 10)
┌────────────────────┬───────────┬───────────┬──────────────┬──────────────────────┬────────────────────┬─────────────────┬────────────────┬───────────────┬─────────────┐
│ program            ┆ avg_score ┆ std_score ┆ median_score ┆ top_performers_count ┆ top_performer_rate ┆ avg_study_hours ┆ avg_attendance ┆ total_records ┆ score_range │
│ ---                ┆ ---       ┆ ---       ┆ ---          ┆ ---                  ┆ ---                ┆ ---             ┆ ---            ┆ ---           ┆ ---         │
│ str                ┆ f64       ┆ f64       ┆ f64          ┆ u32                  ┆ f64                ┆ f64             ┆ f64            ┆ u32           ┆ f64         │
╞════════════════════╪═══════════╪═══════════╪══════════════╪══════════════════════╪════════════════════╪═════════════════╪════════════════╪═══════════════╪═════════════╡
│ "Mathematics"      ┆ 70.218913 ┆ 8.735432  ┆ 70.4         ┆ 737                  ┆ 0.123244           ┆ 13.215217       ┆ 84.922074      ┆ 5980          ┆ 50.5        │
│ "Statistics"       ┆ 67.450602 ┆ 10.359923 ┆ 67.7         ┆ 700                  ┆ 0.11245            ┆ 13.102892       ┆ 84.893012      ┆ 6225          ┆ 62.2        │
│ "Computer Science" ┆ 65.771991 ┆ 10.61858  ┆ 66.0         ┆ 512                  ┆ 0.078108           ┆ 13.151869       ┆ 84.879786      ┆ 6555          ┆ 62.8        │
│ "Physics"          ┆ 63.37617  ┆ 12.308192 ┆ 63.5         ┆ 536                  ┆ 0.085897           ┆ 13.204647       ┆ 84.907051      ┆ 6240          ┆ 74.3        │
└────────────────────┴───────────┴───────────┴──────────────┴──────────────────────┴────────────────────┴─────────────────┴────────────────┴───────────────┴─────────────┘
Show the code
# For correlation analysis, we'll use a simpler approach
# Calculate correlations using pandas (since plotnine uses pandas anyway)
import pandas as pd

correlation_results = []
for program in performance_df["program"].unique():
    program_data = performance_df.filter(pl.col("program") == program).to_pandas()
    
    score_study_corr = program_data["score"].corr(program_data["study_hours"])
    score_attendance_corr = program_data["score"].corr(program_data["attendance"])
    
    correlation_results.append({
        "program": program,
        "score_study_correlation": round(score_study_corr, 3),
        "score_attendance_correlation": round(score_attendance_corr, 3)
    })

correlation_df = pl.DataFrame(correlation_results)

# Combine with program analysis
final_program_analysis = program_analysis.join(correlation_df, on="program")
print("\nProgram analysis with correlations:")
final_program_analysis

Program analysis with correlations:
shape: (4, 12)
┌────────────────────┬───────────┬───────────┬──────────────┬──────────────────────┬────────────────────┬─────────────────┬────────────────┬───────────────┬─────────────┬─────────────────────────┬──────────────────────────────┐
│ program            ┆ avg_score ┆ std_score ┆ median_score ┆ top_performers_count ┆ top_performer_rate ┆ avg_study_hours ┆ avg_attendance ┆ total_records ┆ score_range ┆ score_study_correlation ┆ score_attendance_correlation │
│ ---                ┆ ---       ┆ ---       ┆ ---          ┆ ---                  ┆ ---                ┆ ---             ┆ ---            ┆ ---           ┆ ---         ┆ ---                     ┆ ---                          │
│ str                ┆ f64       ┆ f64       ┆ f64          ┆ u32                  ┆ f64                ┆ f64             ┆ f64            ┆ u32           ┆ f64         ┆ f64                     ┆ f64                          │
╞════════════════════╪═══════════╪═══════════╪══════════════╪══════════════════════╪════════════════════╪═════════════════╪════════════════╪═══════════════╪═════════════╪═════════════════════════╪══════════════════════════════╡
│ "Mathematics"      ┆ 70.218913 ┆ 8.735432  ┆ 70.4         ┆ 737                  ┆ 0.123244           ┆ 13.215217       ┆ 84.922074      ┆ 5980          ┆ 50.5        ┆ -0.113                  ┆ 0.019                        │
│ "Computer Science" ┆ 65.771991 ┆ 10.61858  ┆ 66.0         ┆ 512                  ┆ 0.078108           ┆ 13.151869       ┆ 84.879786      ┆ 6555          ┆ 62.8        ┆ -0.08                   ┆ 0.025                        │
│ "Physics"          ┆ 63.37617  ┆ 12.308192 ┆ 63.5         ┆ 536                  ┆ 0.085897           ┆ 13.204647       ┆ 84.907051      ┆ 6240          ┆ 74.3        ┆ -0.063                  ┆ 0.014                        │
│ "Statistics"       ┆ 67.450602 ┆ 10.359923 ┆ 67.7         ┆ 700                  ┆ 0.11245            ┆ 13.102892       ┆ 84.893012      ┆ 6225          ┆ 62.2        ┆ -0.094                  ┆ 0.021                        │
└────────────────────┴───────────┴───────────┴──────────────┴──────────────────────┴────────────────────┴─────────────────┴────────────────┴───────────────┴─────────────┴─────────────────────────┴──────────────────────────────┘

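The pandas round-trip works, but Polars can also compute Pearson correlations natively with pl.corr, keeping the whole pipeline in one library. A sketch of the equivalent grouped computation:

# Native Polars alternative: grouped Pearson correlations, no pandas round-trip
correlations_native = (
    performance_df
    .group_by("program")
    .agg([
        pl.corr("score", "study_hours").round(3).alias("score_study_correlation"),
        pl.corr("score", "attendance").round(3).alias("score_attendance_correlation"),
    ])
)
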
Declarative Visualization with plotnine

3. Grammar of Graphics Implementation

Show the code
# Convert Polars to pandas for plotnine (plotnine expects pandas)
performance_pd = performance_df.to_pandas()

# Configure plotnine for this specific plot (plotnine.options is already imported)
plotnine.options.figure_size = (12, 8)

# Create a sophisticated multi-faceted visualization
p1 = (
    ggplot(performance_pd, aes(x="study_hours", y="score", color="program")) +
    geom_point(alpha=0.6, size=1.5) +
    geom_smooth(method="lm", se=True, size=1.2) +
    facet_wrap("course", ncol=4, scales="free") +
    scale_color_brewer(type="qual", palette="Set2") +
    labs(
        title="Relationship Between Study Hours and Academic Performance",
        subtitle="Linear trends with 95% confidence intervals across courses and programs",
        x="Weekly Study Hours",
        y="Course Score (%)",
        color="Academic Program",
        caption="Data: Simulated student performance (n=5,000 students, 8 courses)"
    ) +
    theme_minimal() +
    theme(
        plot_title=element_text(size=14, weight="bold"),
        plot_subtitle=element_text(size=11, color="#666666"),
        strip_text=element_text(size=10, weight="bold"),
        legend_position="bottom"
    )
)


# Display the plot
p1

4. Advanced Layered Visualizations

Show the code
# Configure plotnine for this plot
plotnine.options.figure_size = (10, 6)

# Aggregate data for program comparison
program_summary = program_analysis.to_pandas()

# Create a sophisticated comparison plot
p2 = (
    ggplot(program_summary, aes(x="avg_study_hours", y="avg_score")) +
    
    # Bubble layer: size encodes sample size, color encodes top-performer rate
    geom_point(aes(size="total_records", color="top_performer_rate"), alpha=0.8) +
    
    # Add program labels
    geom_text(aes(label="program"), nudge_y=1.5, size=9, fontweight="bold") +
    
    # Add trend line
    geom_smooth(method="lm", color="darkred", linetype="dashed", se=False) +
    
    # Customize scales
    scale_size_continuous(
        name="Total Records",
        range=(8, 15),
        guide=guide_legend(override_aes={"alpha": 1})
    ) +
    scale_color_gradient2(
        name="Top Performer\nRate",
        low="blue", mid="white", high="red",
        midpoint=0.1,
        labels=lambda breaks: [f"{b:.1%}" for b in breaks]
    ) +
    
    # Elegant theming
    labs(
        title="Academic Program Performance Analysis",
        subtitle="Bubble size represents sample size, color indicates top performer rate",
        x="Average Study Hours per Week",
        y="Average Score (%)",
        caption="Programs with higher study hours don't always yield higher scores"
    ) +
    theme_minimal() +
    theme(
        plot_title=element_text(size=14, weight="bold"),
        plot_subtitle=element_text(size=11, color="#666666"),
        legend_position="right",
        panel_grid_minor=element_blank()
    )
)

# Display the plot
p2

5. Distribution Analysis with Multiple Geometries

Show the code
# Focus on advanced courses for distribution analysis
advanced_courses = performance_pd[
    performance_pd["course"].isin(["Machine Learning", "Data Structures", "Research Methods", "Thesis"])
]

# Configure plotnine for this plot
plotnine.options.figure_size = (12, 8)

# Create comprehensive distribution plot
p3 = (
    ggplot(advanced_courses, aes(x="program", y="score", fill="program")) +
    
    # Violin plots for distribution shape
    geom_violin(alpha=0.7, trim=False) +
    
    # Box plots for summary statistics
    geom_boxplot(width=0.3, alpha=0.8, outlier_alpha=0.6) +
    
    # Add mean points
    stat_summary(fun_y=np.mean, geom="point", size=3, color="white", shape="D") +
    
    # Facet by course
    facet_wrap("course", ncol=2) +
    
    # Color scheme
    scale_fill_brewer(type="qual", palette="Dark2") +
    
    # Coordinate system
    coord_flip() +
    
    # Labels and theme
    labs(
        title="Score Distribution Analysis for Advanced Courses",
        subtitle="Violin plots show full distribution, box plots highlight quartiles, diamonds mark means",
        x="Academic Program",
        y="Course Score (%)",
        fill="Program",
        caption="Advanced courses: Machine Learning, Data Structures, Research Methods, Thesis"
    ) +
    theme_minimal() +
    theme(
        plot_title=element_text(size=14, weight="bold"),
        plot_subtitle=element_text(size=11, color="#666666"),
        strip_text=element_text(size=11, weight="bold"),
        legend_position="none",  # Remove legend since x-axis shows programs
        axis_text_x=element_text(angle=45, hjust=1)
    )
)

# Display the plot
p3

Performance Comparison: Polars vs Pandas

6. Speed Benchmarking

Show the code
import time
import pandas as pd

# Create larger dataset for meaningful comparison
large_n = 50000
large_students = pl.DataFrame({
    "student_id": range(1, large_n + 1),
    "program": np.random.choice(["CS", "Math", "Physics", "Stats"], large_n),
    "score": np.random.normal(75, 15, large_n),
    "study_hours": np.random.gamma(3, 2, large_n),
    "semester": np.random.choice(range(1, 9), large_n)
})

# Convert to pandas for comparison
large_students_pd = large_students.to_pandas()

print(f"Dataset size: {large_students.shape[0]:,} rows")
Dataset size: 50,000 rows
Show the code
# Benchmark complex aggregation operations

def benchmark_polars():
    start_time = time.time()
    result = (
        large_students
        .group_by(["program", "semester"])
        .agg([
            pl.col("score").mean().alias("avg_score"),
            pl.col("score").std().alias("std_score"),
            pl.col("study_hours").mean().alias("avg_hours"),
            pl.col("score").quantile(0.9).alias("score_90th"),
            pl.len().alias("count")
        ])
        .filter(pl.col("count") >= 100)
        .sort(["program", "semester"])
    )
    end_time = time.time()
    return end_time - start_time, result.shape[0]

def benchmark_pandas():
    start_time = time.time()
    result = (
        large_students_pd
        .groupby(["program", "semester"])
        .agg({
            "score": ["mean", "std", lambda x: x.quantile(0.9)],
            "study_hours": "mean",
            "student_id": "count"
        })
        .reset_index()
    )
    # Flatten column names
    result.columns = ["_".join(col).strip() if col[1] else col[0] for col in result.columns]
    result = result[result.iloc[:, -1] >= 100]  # Filter by count
    end_time = time.time()
    return end_time - start_time, result.shape[0]

# Run benchmarks
polars_time, polars_rows = benchmark_polars()
pandas_time, pandas_rows = benchmark_pandas()

print(f"Polars: {polars_time:.4f} seconds ({polars_rows} result rows)")
print(f"Pandas: {pandas_time:.4f} seconds ({pandas_rows} result rows)")
print(f"Speedup: {pandas_time/polars_time:.2f}x faster with Polars")
Polars: 0.0016 seconds (32 result rows)
Pandas: 0.0073 seconds (32 result rows)
Speedup: 4.69x faster with Polars
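
One caveat: single runs at millisecond scale are noisy (warm-up, caches, thread scheduling). For more stable numbers, here is a sketch that uses timeit to take the best of several repetitions:

import timeit

# Best-of-N timing smooths out one-off noise; exact numbers vary by machine
polars_best = min(timeit.repeat(benchmark_polars, number=1, repeat=20))
pandas_best = min(timeit.repeat(benchmark_pandas, number=1, repeat=20))
print(f"Best of 20 runs - Polars: {polars_best:.4f}s, pandas: {pandas_best:.4f}s")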

7. Memory Usage Analysis

Show the code
# Memory usage comparison
print("Memory usage comparison:")
print(f"Polars DataFrame: {large_students.estimated_size('mb'):.2f} MB")
print(f"Pandas DataFrame: {large_students_pd.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Show data types efficiency
print("\nData types:")
print("Polars dtypes:")
for col, dtype in zip(large_students.columns, large_students.dtypes):
    print(f"  {col}: {dtype}")
    
print("\nPandas dtypes:")
for col, dtype in large_students_pd.dtypes.items():
    print(f"  {col}: {dtype}")
Memory usage comparison:
Polars DataFrame: 1.74 MB
Pandas DataFrame: 4.08 MB

Data types:
Polars dtypes:
  student_id: Int64
  program: String
  score: Float64
  study_hours: Float64
  semester: Int64

Pandas dtypes:
  student_id: int64
  program: object
  score: float64
  study_hours: float64
  semester: int64
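
Most of the pandas overhead comes from the object-dtype program column, where every string is a separate Python object. Casting the repeated strings to categorical types shrinks both frames; a quick sketch:

# Categorical encoding stores each distinct string once plus compact integer codes
large_students_cat = large_students.with_columns(pl.col("program").cast(pl.Categorical))
large_students_pd_cat = large_students_pd.astype({"program": "category"})

print(f"Polars (Categorical): {large_students_cat.estimated_size('mb'):.2f} MB")
print(f"Pandas (category): {large_students_pd_cat.memory_usage(deep=True).sum() / 1024**2:.2f} MB")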

Advanced plotnine Techniques

8. Custom Themes and Statistical Layers

Show the code
# Configure plotnine for this plot
plotnine.options.figure_size = (10, 6)

# Create a custom theme for academic publications
academic_theme = theme_minimal() + theme(
    plot_title=element_text(size=14, weight="bold", margin={"b": 20}),
    plot_subtitle=element_text(size=11, color="#4d4d4d", margin={"b": 15}),
    axis_title=element_text(size=12, weight="bold"),
    axis_text=element_text(size=10),
    legend_title=element_text(size=11, weight="bold"),
    legend_text=element_text(size=10),
    strip_text=element_text(size=11, weight="bold", margin={"b": 10}),
    panel_grid_major=element_line(color="#e6e6e6", size=0.5),
    panel_grid_minor=element_blank(),
    plot_background=element_rect(fill="white"),
    panel_background=element_rect(fill="white")
)

# Advanced statistical visualization
study_performance = (
    performance_df
    .filter(pl.col("course").is_in(["Programming", "Machine Learning", "Statistics"]))
    .to_pandas()
)

p4 = (
    ggplot(study_performance, aes(x="attendance", y="score")) +
    
    # Add points with transparency to show density
    geom_point(aes(color="program"), alpha=0.3, size=0.8) +
    
    # Add smooth trend lines
    geom_smooth(aes(color="program"), method="loess", se=True) +
    
    # Facet by course
    facet_wrap("course", ncol=3) +
    
    # Custom color palette
    scale_color_manual(
        values=["#2E4057", "#5D737E", "#8FA68E", "#C7D59F"],
        name="Program"
    ) +
    
    # Coordinate limits
    xlim(50, 100) +
    ylim(0, 100) +
    
    # Labels
    labs(
        title="Attendance vs Performance Analysis by Course",
        subtitle="Point density shows distribution, smooth curves indicate trends",
        x="Attendance Rate (%)",
        y="Course Score (%)",
        caption="Statistical analysis of 40,000 course enrollments"
    ) +
    
    # Apply custom theme
    academic_theme +
    theme(legend_position="bottom")
)

# Display the plot
display(p4)

Polars-plotnine Integration Best Practices

9. Efficient Data Pipeline

Show the code
# Demonstrate efficient Polars → plotnine workflow
def create_analysis_pipeline(data: pl.DataFrame, analysis_type: str):
    """
    Efficient pipeline that processes data in Polars and visualizes with plotnine
    """
    
    if analysis_type == "performance_trends":
        # Complex Polars aggregation
        processed = (
            data
            .with_columns([
                pl.when(pl.col("score") >= 90).then(pl.lit("A"))
                  .when(pl.col("score") >= 80).then(pl.lit("B")) 
                  .when(pl.col("score") >= 70).then(pl.lit("C"))
                  .when(pl.col("score") >= 60).then(pl.lit("D"))
                  .otherwise(pl.lit("F")).alias("grade")
            ])
            .group_by(["course", "program", "grade"])
            .agg([
                pl.len().alias("student_count"),
                pl.col("study_hours").mean().alias("avg_study_hours")
            ])
            .with_columns([
                pl.col("student_count").sum().over(["course", "program"]).alias("total_students")
            ])
            .with_columns([
                (pl.col("student_count") / pl.col("total_students") * 100).alias("percentage")
            ])
            .filter(pl.col("total_students") >= 50)  # Sufficient sample size
        )
        
        # Convert to pandas only for plotting
        plot_data = processed.to_pandas()
        
        # Configure plotnine for this plot
        plotnine.options.figure_size = (10, 6)
        
        # Create visualization
        p = (
            ggplot(plot_data, aes(x="grade", y="percentage", fill="program")) +
            geom_col(position="dodge", alpha=0.8) +
            facet_wrap("course", ncol=4) +
            scale_fill_brewer(type="qual", palette="Set3") +
            labs(
                title="Grade Distribution by Program and Course",
                x="Grade", y="Percentage of Students (%)",
                fill="Program"
            ) +
            academic_theme +
            theme(
                axis_text_x=element_text(size=12, weight="bold"),
                legend_position="bottom"
            )
        )
        
        return processed, p
    
    else:
        raise ValueError("Unknown analysis type")

# Execute pipeline
grade_analysis, grade_plot = create_analysis_pipeline(performance_df, "performance_trends")

print("Processed data shape:", grade_analysis.shape)
# Display the plot
display(grade_plot)
Processed data shape: (139, 7)

Real-World Applications

10. Educational Data Science Workflow

Show the code
# Simulate a complete educational analytics workflow

# 1. Data Quality Assessment with Polars
# Create quality report with separate operations to avoid mixing agg types
null_counts = performance_df.null_count()
stats_summary = performance_df.select([
    pl.col("score").min().alias("score_min"),
    pl.col("score").max().alias("score_max"),
    pl.col("score").mean().alias("score_mean"),
])
quality_flags = performance_df.select([
    (pl.col("score") < 0).sum().alias("negative_scores"),
    (pl.col("score") > 100).sum().alias("invalid_scores"),
    (pl.col("study_hours") < 0).sum().alias("negative_hours"),
])

print("Data Quality Report:")
print("Null counts:")
print(null_counts)
print("\nStatistical summary:")
print(stats_summary)
print("\nQuality flags:")
print(quality_flags)
Data Quality Report:
Null counts:
shape: (1, 8)
┌────────────┬─────────┬─────┬────────┬──────────┬───────┬─────────────┬────────────┐
│ student_id ┆ program ┆ gpa ┆ course ┆ semester ┆ score ┆ study_hours ┆ attendance │
│ ---        ┆ ---     ┆ --- ┆ ---    ┆ ---      ┆ ---   ┆ ---         ┆ ---        │
│ u32        ┆ u32     ┆ u32 ┆ u32    ┆ u32      ┆ u32   ┆ u32         ┆ u32        │
╞════════════╪═════════╪═════╪════════╪══════════╪═══════╪═════════════╪════════════╡
│ 0          ┆ 0       ┆ 0   ┆ 0      ┆ 0        ┆ 0     ┆ 0           ┆ 0          │
└────────────┴─────────┴─────┴────────┴──────────┴───────┴─────────────┴────────────┘

Statistical summary:
shape: (1, 3)
┌───────────┬───────────┬────────────┐
│ score_min ┆ score_max ┆ score_mean │
│ ---       ┆ ---       ┆ ---        │
│ f64       ┆ f64       ┆ f64        │
╞═══════════╪═══════════╪════════════╡
│ 15.3      ┆ 98.0      ┆ 67.94946   │
└───────────┴───────────┴────────────┘

Quality flags:
shape: (1, 3)
┌─────────────────┬────────────────┬────────────────┐
│ negative_scores ┆ invalid_scores ┆ negative_hours │
│ ---             ┆ ---            ┆ ---            │
│ u32             ┆ u32            ┆ u32            │
╞═════════════════╪════════════════╪════════════════╡
│ 0               ┆ 0              ┆ 0              │
└─────────────────┴────────────────┴────────────────┘
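
For a faster first pass, Polars' built-in describe() produces a comparable overview (counts, null counts, mean, min/max, quantiles) for every column in one call:

# One-line summary statistics across all columns
print(performance_df.describe())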
Show the code
# 2. Predictive modeling preparation
# Check what columns we have available
print("Performance DataFrame columns:", performance_df.columns)

# Create modeling features directly from performance_df (which already includes key student data)
modeling_data = (
    performance_df
    .with_columns([
        # Feature engineering - simplified approach
        pl.col("score").shift(1, fill_value=0).over("student_id").alias("previous_score"),
        pl.col("study_hours").mean().over("student_id").alias("avg_study_hours_student"),
        (pl.col("attendance") >= 85).alias("high_attendance"),
        
        # Target encoding - course difficulty (average score for each course)
        pl.col("score").mean().over("course").alias("course_difficulty"),
        
        # Interaction features
        (pl.col("study_hours") * pl.col("attendance") / 100.0).alias("effective_study_time"),
        
        # Course progress indicator
        pl.col("semester").rank().over("student_id").alias("course_sequence")
    ])
    .filter(pl.col("score").is_not_null())  # Remove missing values for modeling
)

print("Modeling dataset shape:", modeling_data.shape)
print("Features available for modeling:")
print(modeling_data.columns)
Performance DataFrame columns: ['student_id', 'program', 'gpa', 'course', 'semester', 'score', 'study_hours', 'attendance']
Modeling dataset shape: (40000, 14)
Features available for modeling:
['student_id', 'program', 'gpa', 'course', 'semester', 'score', 'study_hours', 'attendance', 'previous_score', 'avg_study_hours_student', 'high_attendance', 'course_difficulty', 'effective_study_time', 'course_sequence']
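
Before handing these features to a modeling library, the string columns still need encoding. Polars can one-hot encode them directly with to_dummies (a sketch, dropping the identifier and keeping score as the target):

# One-hot encode categoricals so every feature column is numeric
model_matrix = modeling_data.drop("student_id").to_dummies(columns=["program", "course"])
print(model_matrix.shape)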
Show the code
# 3. Final comprehensive visualization
final_plot_data = modeling_data.to_pandas()

# Configure plotnine for this plot
plotnine.options.figure_size = (12, 6)

p_final = (
    ggplot(final_plot_data.sample(2000), aes(x="effective_study_time", y="score")) +
    
    # Use points with alpha for density visualization
    geom_point(aes(color="program"), alpha=0.4, size=1.5) +
    
    # Overlay trend line
    geom_smooth(color="red", method="loess") +
    
    # Facet by program
    facet_wrap("program", ncol=2) +
    
    # Color scale for points
    scale_color_brewer(type="qual", palette="Set2", name="Program") +
    
    # Labels
    labs(
        title="Effective Study Time vs Academic Performance",
        subtitle="Point density shows student distribution, red line indicates trend",
        x="Effective Study Time (hours × attendance rate)",
        y="Course Score (%)",
        caption="Sample of 2,000 students from modeling dataset"
    ) +
    
    # Professional theme
    academic_theme +
    theme(
        strip_text=element_text(size=12, weight="bold"),
        legend_position="right"
    )
)

# Display the plot
display(p_final)

Key Takeaways and Best Practices

Performance Benefits

  1. Polars advantages: typically 2-10x faster than pandas on aggregation-heavy workloads (4.7x in the benchmark above)
  2. Memory efficiency: Lower memory footprint with optimized data types
  3. Lazy evaluation: Query optimization before execution
  4. Parallel processing: Automatic multi-threading

Visualization Excellence

  1. Grammar of graphics: Systematic approach to building complex visualizations
  2. Layer composition: Build plots incrementally for clarity
  3. Consistent aesthetics: Professional appearance with minimal code
  4. Cross-platform: Same syntax as R’s ggplot2

Integration Strategy

  1. Data processing in Polars: Leverage speed for heavy computations
  2. Visualization in plotnine: Convert to pandas only when plotting
  3. Memory management: Process in chunks for very large datasets (see the lazy-scan sketch after this list)
  4. Type consistency: Ensure proper data types throughout pipeline
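
Point 3 can be made concrete with Polars' lazy scanning: scan_csv reads only the columns and rows a query needs, and the streaming engine processes the file in batches rather than loading it whole. A sketch, assuming a hypothetical enrollments.csv with the columns used in this tutorial (on recent Polars releases; older versions used collect(streaming=True)):

# Hypothetical file; nothing is read until collect()
summary = (
    pl.scan_csv("enrollments.csv")          # lazy scan, no full load into memory
    .filter(pl.col("score").is_not_null())
    .group_by("program")
    .agg(pl.col("score").mean().alias("avg_score"))
    .collect(engine="streaming")            # execute in batches
)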

Educational Applications

  • Performance analytics: Fast processing of large student datasets
  • Interactive exploration: Quick iteration during analysis
  • Publication-ready plots: Professional visualizations for research
  • Reproducible workflows: Clear, readable data science pipelines

The combination of Polars and plotnine represents the future of Python data science: blazing-fast processing with elegant, declarative visualization. This powerful duo enables researchers and educators to handle larger datasets while creating more sophisticated analyses and beautiful visualizations.

Conclusion

Polars and plotnine together offer a compelling alternative to the traditional pandas + matplotlib ecosystem:

  • Polars delivers exceptional performance for data manipulation with an intuitive API
  • plotnine provides the grammar of graphics for systematic visualization
  • Together they enable fast, elegant, and reproducible data science workflows

For educational data analysis, this combination is particularly powerful, allowing researchers to:

  • Process large institutional datasets efficiently
  • Create publication-quality visualizations
  • Build reproducible analytical pipelines
  • Scale analyses as data grows

The investment in learning these tools pays dividends in both performance and code clarity, making them excellent choices for modern Python data science.
