High-Performance Data Science with Julia and Tidier.jl

Tags: Julia, Tidier.jl, data manipulation, performance, tutorial

Published: 07 Jun, 2025

This tutorial shows how to use Julia together with Tidier.jl for high-performance data science. Julia compiles to native code while keeping a high-level, dynamic syntax, and Tidier.jl brings the tidyverse grammar from R (dplyr-style verbs and chained pipelines) to Julia on top of DataFrames.jl.

Installation Prerequisites

Before running the examples, you’ll need to install Julia and the required packages:

Installing Julia

  1. Download Julia from julialang.org
  2. Using Homebrew (macOS): brew install julia
  3. Using juliaup (recommended): Follow instructions at github.com/JuliaLang/juliaup

Installing Required Packages

# In Julia REPL, install required packages
using Pkg
Pkg.add(["Tidier", "DataFrames"])

Why Julia + Tidier.jl?

The Performance Advantage: Julia

Julia offers compelling advantages for data science:

  • Near C-speed performance with high-level syntax
  • Multiple dispatch for elegant, extensible code
  • Native parallelism and distributed computing
  • Excellent interoperability with Python, R, and C
  • Growing ecosystem of scientific computing packages
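To make the multiple-dispatch point concrete, here is a minimal sketch using only base Julia. The function name `total_score` is made up for this illustration and is not part of Tidier.jl or DataFrames.jl. The idea is that Julia picks a method from the types of all arguments, so handling `missing` is a matter of adding methods rather than writing branches.

```julia
# Hypothetical helper for illustration only.
# One generic function with a method per combination of argument types.
total_score(a::Number, b::Number) = a + b
total_score(::Missing, ::Number) = missing
total_score(::Number, ::Missing) = missing
total_score(::Missing, ::Missing) = missing

println(total_score(81, 81))        # both arguments are numbers
println(total_score(missing, 85))   # dispatches to the (Missing, Number) method
println(length(methods(total_score)), " methods defined")
```

Base Julia already propagates `missing` through `+`, so the extra methods are only there to show how dispatch selects among them.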

The Familiar Syntax: Tidier.jl

Tidier.jl brings the tidyverse workflow to Julia:

  • Familiar dplyr-style verbs (select, filter, mutate, summarize)
  • @chain pipelines (alongside Julia's native |> operator) for readable chains of operations
  • Consistent grammar for data manipulation
  • Performance benefits of Julia’s compiled execution

Setup and Data Preparation

# Load required packages
using Tidier
using DataFrames
using Random
using Statistics

# Set random seed for reproducibility
Random.seed!(123)

# Display Julia and package versions
println("Julia version: ", VERSION)
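# pkgversion (Julia 1.9 or later) reports the installed version instead of a hard-coded string
println("Tidier.jl version: ", pkgversion(Tidier))
Julia version: 1.11.5
Tidier.jl version: 1.2.0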
# Create a simple dataset for demonstration
students = DataFrame(
    id = 1:100,
    name = ["Student $i" for i in 1:100],
    math_score = rand(60:100, 100),
    science_score = rand(55:100, 100),
    program = rand(["CS", "Math", "Physics"], 100),
    grade_level = rand([1, 2, 3, 4], 100)
)

println("Dataset shape: ", size(students))
first(students, 5)
Dataset shape: (100, 6)
5×6 DataFrame
 Row │ id     name       math_score  science_score  program  grade_level
     │ Int64  String     Int64       Int64          String   Int64
─────┼──────────────────────────────────────────────────────────────────
   1 │     1  Student 1          81             81  Physics            1
   2 │     2  Student 2          84             57  Physics            4
   3 │     3  Student 3          96             82  Physics            3
   4 │     4  Student 4          67             64  Math               4
   5 │     5  Student 5          81             83  Physics            3
# Add a total score column
students = @mutate(students, total = math_score + science_score)
first(students, 5)
5×7 DataFrame
 Row │ id     name       math_score  science_score  program  grade_level  total
     │ Int64  String     Int64       Int64          String   Int64        Int64
─────┼─────────────────────────────────────────────────────────────────────────
   1 │     1  Student 1          81             81  Physics            1    162
   2 │     2  Student 2          84             57  Physics            4    141
   3 │     3  Student 3          96             82  Physics            3    178
   4 │     4  Student 4          67             64  Math               4    131
   5 │     5  Student 5          81             83  Physics            3    164

Basic Tidier.jl Operations

1. Filtering Data

# Filter students with high math scores
high_performers = @filter(students, math_score >= 90)
println("Students with math score >= 90: ", nrow(high_performers))
first(high_performers, 5)
Students with math score >= 90: 26
5×7 DataFrame
 Row │ id     name        math_score  science_score  program  grade_level  total
     │ Int64  String      Int64       Int64          String   Int64        Int64
─────┼──────────────────────────────────────────────────────────────────────────
   1 │     3  Student 3           96             82  Physics            3    178
   2 │     8  Student 8           98             98  CS                 2    196
   3 │    12  Student 12          94             67  CS                 4    161
   4 │    22  Student 22         100             83  Physics            3    183
   5 │    23  Student 23          96             82  CS                 3    178
# Filter by multiple conditions
cs_seniors = @filter(students, program == "CS" && grade_level == 4)
println("CS seniors: ", nrow(cs_seniors))
first(cs_seniors, 5)
CS seniors: 9
5×7 DataFrame
 Row │ id     name        math_score  science_score  program  grade_level  total
     │ Int64  String      Int64       Int64          String   Int64        Int64
─────┼──────────────────────────────────────────────────────────────────────────
   1 │     7  Student 7           61             67  CS                 4    128
   2 │    12  Student 12          94             67  CS                 4    161
   3 │    27  Student 27          63             83  CS                 4    146
   4 │    48  Student 48          90            100  CS                 4    190
   5 │    52  Student 52          69             85  CS                 4    154
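For comparison, the two-condition filter can also be written with plain DataFrames.jl. The sketch below uses a small hand-made table (not the `students` data) so the result is easy to check by eye. It assumes that `@filter` vectorizes `==` and `&&` automatically, as the example above relies on; in DataFrames.jl each condition is spelled out as a column and function pair.

```julia
using Tidier
using DataFrames

# Small made-up table for illustration
df = DataFrame(
    id = 1:6,
    program = ["CS", "Math", "CS", "Physics", "CS", "Math"],
    grade_level = [4, 4, 2, 4, 4, 1],
)

# Tidier.jl: both conditions in one expression
tidier_result = @filter(df, program == "CS" && grade_level == 4)

# Plain DataFrames.jl: conditions are combined with a logical AND
dataframes_result = subset(df,
    :program => ByRow(==("CS")),
    :grade_level => ByRow(==(4)),
)

println(tidier_result.id)
println(dataframes_result.id)
```

Both calls keep the rows with ids 1 and 5 here. The Tidier.jl form is shorter, while the DataFrames.jl form makes the row-wise application explicit.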

2. Selecting Columns

# Select specific columns
scores_only = @select(students, id, math_score, science_score, total)
first(scores_only, 5)
5×4 DataFrame
 Row │ id     math_score  science_score  total
     │ Int64  Int64       Int64          Int64
─────┼────────────────────────────────────────
   1 │     1          81             81    162
   2 │     2          84             57    141
   3 │     3          96             82    178
   4 │     4          67             64    131
   5 │     5          81             83    164
# Select columns using patterns
name_and_scores = @select(students, name, ends_with("score"))
first(name_and_scores, 5)
5×3 DataFrame
 Row │ name       math_score  science_score
     │ String     Int64       Int64
─────┼─────────────────────────────────────
   1 │ Student 1          81             81
   2 │ Student 2          84             57
   3 │ Student 3          96             82
   4 │ Student 4          67             64
   5 │ Student 5          81             83
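Besides `ends_with`, a few other selection forms are handy. The sketch below uses a tiny made-up table and assumes the tidy-selection helpers documented for TidierData.jl (`starts_with`, column ranges, and `-` for exclusion); treat it as an illustration rather than an exhaustive list.

```julia
using Tidier
using DataFrames

# Small made-up table for illustration
df = DataFrame(
    id = 1:3,
    name = ["Student 1", "Student 2", "Student 3"],
    math_score = [81, 84, 96],
    science_score = [81, 57, 82],
)

by_prefix = @select(df, starts_with("math"))   # columns whose name starts with "math"
by_range = @select(df, id:name)                # a contiguous range of columns
without_name = @select(df, -name)              # every column except name

println(names(by_prefix))
println(names(by_range))
println(names(without_name))
```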

3. Creating New Columns with Mutate

# Add calculated columns
students_graded = @mutate(students, 
    average_score = (math_score + science_score) / 2,
    passed = total >= 140
)
first(students_graded, 5)
5×9 DataFrame
 Row │ id     name       math_score  science_score  program  grade_level  total  average_score  passed
     │ Int64  String     Int64       Int64          String   Int64        Int64  Float64        Bool
─────┼────────────────────────────────────────────────────────────────────────────────────────────────
   1 │     1  Student 1          81             81  Physics            1    162           81.0    true
   2 │     2  Student 2          84             57  Physics            4    141           70.5    true
   3 │     3  Student 3          96             82  Physics            3    178           89.0    true
   4 │     4  Student 4          67             64  Math               4    131           65.5   false
   5 │     5  Student 5          81             83  Physics            3    164           82.0    true

4. Summarizing Data

# Basic summary statistics
summary_stats = @summarize(students,
    avg_math = mean(math_score),
    avg_science = mean(science_score),
    max_total = maximum(total),
    min_total = minimum(total),
    n_students = length(id)
)
summary_stats
1×5 DataFrame
 Row │ avg_math  avg_science  max_total  min_total  n_students
     │ Float64   Float64      Int64      Int64      Int64
─────┼────────────────────────────────────────────────────────
   1 │    80.07        76.96        196        119         100
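The row count above uses `length(id)`. A dplyr-style alternative is `n()`, which TidierData.jl provides inside `@summarize`. The sketch below uses a tiny made-up table so the numbers can be checked by hand, and adds `median` and `std` from the Statistics standard library.

```julia
using Tidier
using DataFrames
using Statistics

# Small made-up table for illustration
df = DataFrame(id = 1:4, math_score = [80, 90, 70, 100])

stats = @summarize(df,
    n_students = n(),                  # row count, no column needed
    avg_math = mean(math_score),
    median_math = median(math_score),
    sd_math = std(math_score)
)

println(stats)
```

Here the mean and the median are both 85.0 and `n_students` is 4.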

5. Grouping and Summarizing

# Summary by program
program_summary = @chain students begin
    @group_by(program)
    @summarize(
        count = length(id),
        avg_math = mean(math_score),
        avg_science = mean(science_score),
        avg_total = mean(total)
    )
    @arrange(desc(avg_total))
end
program_summary
3×5 DataFrame
 Row │ program  count  avg_math  avg_science  avg_total
     │ String   Int64  Float64   Float64      Float64
─────┼─────────────────────────────────────────────────
   1 │ Physics     41   81.7073      76.5854    158.293
   2 │ Math        23    78.913      79.2174     158.13
   3 │ CS          36   78.9444      75.9444    154.889
# Summary by grade level
grade_level_summary = @chain students begin
    @group_by(grade_level)
    @summarize(
        n_students = length(id),
        avg_math = round(mean(math_score), digits=1),
        avg_science = round(mean(science_score), digits=1)
    )
    @arrange(grade_level)
end
grade_level_summary
4×4 DataFrame
 Row │ grade_level  n_students  avg_math  avg_science
     │ Int64        Int64       Float64   Float64
─────┼───────────────────────────────────────────────
   1 │           1          23      82.9         84.3
   2 │           2          18      80.1         77.6
   3 │           3          33      80.5         73.6
   4 │           4          26      77.0         74.2

6. Arranging Data

# Sort by total score (descending)
top_students = @chain students begin
    @arrange(desc(total))
    @select(name, program, math_score, science_score, total)
    @slice(1:10)
end
println("Top 10 students by total score:")
top_students
Top 10 students by total score:
10×5 DataFrame
 Row │ name        program  math_score  science_score  total
     │ String      String   Int64       Int64          Int64
─────┼──────────────────────────────────────────────────────
   1 │ Student 8   CS               98             98    196
   2 │ Student 72  Physics          98             98    196
   3 │ Student 59  Physics          96             99    195
   4 │ Student 48  CS               90            100    190
   5 │ Student 50  Physics          84            100    184
   6 │ Student 87  Physics          87             97    184
   7 │ Student 22  Physics         100             83    183
   8 │ Student 80  Math             89             94    183
   9 │ Student 64  Physics          83             98    181
  10 │ Student 86  Physics          85             96    181
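The same sort-then-slice idea extends to a "top N per group" query. The sketch below uses a small made-up table and assumes that `@slice` is applied within each group of a grouped data frame, as TidierData.jl documents; it sorts by two keys and then keeps the first two rows of every program.

```julia
using Tidier
using DataFrames

# Small made-up table for illustration
df = DataFrame(
    name = ["A", "B", "C", "D", "E", "F"],
    program = ["CS", "CS", "CS", "Math", "Math", "Math"],
    total = [150, 190, 170, 160, 140, 180],
)

top2_per_program = @chain df begin
    @arrange(program, desc(total))   # program ascending, total descending
    @group_by(program)
    @slice(1:2)                      # first two rows within each program
    @ungroup
end

println(top2_per_program)
```

With this data the result keeps B and C for CS, and F and D for Math.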

7. Complex Data Transformations

# First, let's verify the DataFrame exists and has the right columns
if @isdefined(students)
    println("Students DataFrame columns: ", names(students))
    println("Number of rows: ", nrow(students))
else
    println("Students DataFrame not found!")
end

# Use DataFrames.jl functions instead of Tidier.jl for this example
# Filter for upper-level students (grade_level >= 3)
upper_level = filter(row -> row.grade_level >= 3, students)

# Add performance column
upper_level.performance = map(upper_level.total) do t
    if t >= 160
        "Excellent"
    elseif t >= 140
        "Good"
    else
        "Average"
    end
end

# Group and summarize using DataFrames.jl
result = combine(groupby(upper_level, [:program, :performance]), nrow => :count)
sort!(result, [:program, order(:count, rev=true)])

println("\nPerformance distribution for upper-level students:")
result
Students DataFrame columns: ["id", "name", "math_score", "science_score", "program", "grade_level", "total"]
Number of rows: 100

Performance distribution for upper-level students:
9×3 DataFrame
 Row │ program  performance  count
     │ String   String       Int64
─────┼────────────────────────────
   1 │ CS       Good            11
   2 │ CS       Average          6
   3 │ CS       Excellent        6
   4 │ Math     Excellent        5
   5 │ Math     Good             5
   6 │ Math     Average          2
   7 │ Physics  Good             9
   8 │ Physics  Excellent        9
   9 │ Physics  Average          6
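The transformation above is written with DataFrames.jl functions. For readers who prefer to stay in the tidyverse grammar, the following is a sketch of one possible Tidier.jl equivalent, run on a small made-up table rather than the `students` data. It assumes TidierData.jl's `case_when` with `condition => value` pairs and `n()` for counting; the explicit `@ungroup` is there because summarizing over two grouping columns is expected to leave the result grouped by the first one.

```julia
using Tidier
using DataFrames

# Small made-up table for illustration
df = DataFrame(
    program = ["CS", "CS", "CS", "CS", "Math", "Math", "Physics"],
    grade_level = [3, 4, 4, 4, 3, 1, 4],
    total = [165, 150, 155, 120, 170, 190, 145],
)

result = @chain df begin
    @filter(grade_level >= 3)                 # upper-level students only
    @mutate(performance = case_when(
        total >= 160 => "Excellent",
        total >= 140 => "Good",
        true => "Average"))                   # fallback branch
    @group_by(program, performance)
    @summarize(count = n())
    @ungroup
    @arrange(program, desc(count))
end

println(result)
```

In this toy data the first row is CS with two "Good" students, and the Math student in grade level 1 is filtered out before counting.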

Practical Examples

8. Working with Missing Data

# Create a DataFrame with some missing values
students_missing = DataFrame(
    id = 1:10,
    name = ["Student $i" for i in 1:10],
    math_score = [85, missing, 92, 78, missing, 88, 95, missing, 82, 90],
    science_score = [78, 85, missing, 82, 88, missing, 92, 85, missing, 87]
)

println("Data with missing values:")
println(students_missing)

# Count missing values
missing_counts = DataFrame(
    math_missing = sum(ismissing.(students_missing.math_score)),
    science_missing = sum(ismissing.(students_missing.science_score))
)
println("\nMissing value counts:")
println(missing_counts)

# Calculate mean, skipping missing values
math_mean = mean(skipmissing(students_missing.math_score))
science_mean = mean(skipmissing(students_missing.science_score))
println("\nMeans (excluding missing): Math = $math_mean, Science = $science_mean")
Data with missing values:
10×4 DataFrame
 Row │ id     name        math_score  science_score
     │ Int64  String      Int64?      Int64?
─────┼──────────────────────────────────────────────
   1 │     1  Student 1           85             78
   2 │     2  Student 2      missing             85
   3 │     3  Student 3           92        missing
   4 │     4  Student 4           78             82
   5 │     5  Student 5      missing             88
   6 │     6  Student 6           88        missing
   7 │     7  Student 7           95             92
   8 │     8  Student 8      missing             85
   9 │     9  Student 9           82        missing
  10 │    10  Student 10          90             87

Missing value counts:
1×2 DataFrame
 Row │ math_missing  science_missing
     │ Int64         Int64
─────┼───────────────────────────────
   1 │            3                3

Means (excluding missing): Math = 87.14285714285714, Science = 85.28571428571429
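The same missing-value handling can be expressed inside Tidier.jl pipelines. The sketch below uses a shorter made-up table and assumes two things from TidierData.jl: that `mean(skipmissing(col))` is passed through unchanged inside `@summarize`, and that `@drop_missing` with no column arguments drops every row containing a missing value.

```julia
using Tidier
using DataFrames
using Statistics

# Small made-up table for illustration
df = DataFrame(
    id = 1:5,
    math_score = [85, missing, 92, 78, missing],
    science_score = [78, 85, missing, 82, 88],
)

# Column means that ignore missing entries
means = @summarize(df,
    avg_math = mean(skipmissing(math_score)),
    avg_science = mean(skipmissing(science_score))
)

# Keep only rows where every column is present
complete_rows = @drop_missing(df)

println(means)
println(complete_rows)
```

Here `avg_math` is 85.0 (three values), `avg_science` is 83.25 (four values), and only the rows with ids 1 and 4 are complete.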

9. Joining DataFrames

# Create a simple grades DataFrame
grades = DataFrame(
    id = [1, 2, 3, 4, 5],
    final_grade = ["A", "B", "A", "C", "B"]
)

# Join with students data
students_with_grades = @left_join(students[1:5, :], grades, id)
students_with_grades
5×8 DataFrame
 Row │ id     name       math_score  science_score  program  grade_level  total  final_grade
     │ Int64  String     Int64       Int64          String   Int64        Int64  String?
─────┼──────────────────────────────────────────────────────────────────────────────────────
   1 │     1  Student 1          81             81  Physics            1    162  A
   2 │     2  Student 2          84             57  Physics            4    141  B
   3 │     3  Student 3          96             82  Physics            3    178  A
   4 │     4  Student 4          67             64  Math               4    131  C
   5 │     5  Student 5          81             83  Physics            3    164  B
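In the join above every id has a match, so the `String?` column type is the only hint that a left join can introduce missing values. The sketch below uses two small made-up tables with partially overlapping ids to show the difference between `@left_join` and `@inner_join`; the table names are hypothetical.

```julia
using Tidier
using DataFrames

# Small made-up tables for illustration: id 2 has no grade, id 4 has no student
students_small = DataFrame(id = [1, 2, 3], name = ["Student 1", "Student 2", "Student 3"])
grades_small = DataFrame(id = [1, 3, 4], final_grade = ["A", "B", "C"])

# Left join: keeps all students, fills unmatched grades with missing
left = @left_join(students_small, grades_small, id)

# Inner join: keeps only ids present in both tables
inner = @inner_join(students_small, grades_small, id)

println(left)
println(inner)
```

The left join returns three rows with one missing `final_grade`; the inner join returns the two rows for ids 1 and 3.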

10. Reshaping Data

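# Create wide data
wide_scores = @chain students[1:5, :] begin
    @select(id, name, math_score, science_score)
end

println("Wide format:")
# println is needed here: only the last expression of a cell is displayed automatically
println(wide_scores)

# Convert to long format (using DataFrames stack function)
long_scores = stack(wide_scores, [:math_score, :science_score],
                    variable_name=:subject, value_name=:score)
println("\nLong format:")
first(long_scores, 10)
Wide format:
5×4 DataFrame
 Row │ id     name       math_score  science_score
     │ Int64  String     Int64       Int64
─────┼────────────────────────────────────────────
   1 │     1  Student 1          81             81
   2 │     2  Student 2          84             57
   3 │     3  Student 3          96             82
   4 │     4  Student 4          67             64
   5 │     5  Student 5          81             83

Long format:
10×4 DataFrame
 Row │ id     name       subject        score
     │ Int64  String     String         Int64
─────┼───────────────────────────────────────
   1 │     1  Student 1  math_score        81
   2 │     2  Student 2  math_score        84
   3 │     3  Student 3  math_score        96
   4 │     4  Student 4  math_score        67
   5 │     5  Student 5  math_score        81
   6 │     1  Student 1  science_score     81
   7 │     2  Student 2  science_score     57
   8 │     3  Student 3  science_score     82
   9 │     4  Student 4  science_score     64
  10 │     5  Student 5  science_score     83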
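The reshaping above uses `stack` from DataFrames.jl. Tidier.jl also offers tidyr-style macros for this. The sketch below runs on a tiny made-up table and assumes that `@pivot_longer` and `@pivot_wider` accept the `names_to`, `values_to`, `names_from`, and `values_from` arguments given as strings, as described in the TidierData.jl documentation; check the documentation for your installed version if the call is rejected.

```julia
using Tidier
using DataFrames

# Small made-up table for illustration
wide = DataFrame(id = [1, 2], math_score = [81, 84], science_score = [81, 57])

# Wide to long: one row per (id, subject) pair
long = @pivot_longer(wide, math_score:science_score,
                     names_to = "subject", values_to = "score")

# Long back to wide: one column per subject again
wide_again = @pivot_wider(long, names_from = "subject", values_from = "score")

println(long)
println(wide_again)
```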

Summary of Basic Tidier.jl Operations

Core Functions

  • @filter: Select rows based on conditions
  • @select: Choose specific columns
  • @mutate: Create or modify columns
  • @summarize: Calculate summary statistics
  • @group_by: Group data for aggregated operations
  • @arrange: Sort rows
  • @chain or |>: Combine multiple operations

Key Benefits of Julia + Tidier.jl

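  1. Familiar Syntax: If you know dplyr from R, most Tidier.jl verbs will look familiar
  2. High Performance: Julia compiles these operations to native code, which helps on large datasets
  3. Clean Code: @chain pipelines keep multi-step transformations readable
  4. Typed Columns: Column types such as Int64 and Int64? make missing-value handling explicit, although Julia is dynamically typed and most type errors still surface at run time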
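The performance point is easiest to judge by measuring it on your own machine. The sketch below is one rough way to do that, using a synthetic one-million-row table invented for this example and the `@elapsed` macro from base Julia. The function name `mean_total_by_program` is hypothetical. The first call includes compilation time, so the second call is the more representative number; for careful measurements a dedicated benchmarking package would be a better fit.

```julia
using Tidier
using DataFrames
using Random
using Statistics

Random.seed!(123)

# Synthetic data, made up for this sketch
n_rows = 1_000_000
big = DataFrame(
    program = rand(["CS", "Math", "Physics"], n_rows),
    total = rand(115:200, n_rows),
)

# Wrapping the pipeline in a function keeps the two timings comparable
function mean_total_by_program(df)
    @chain df begin
        @group_by(program)
        @summarize(avg_total = mean(total))
    end
end

first_call = @elapsed mean_total_by_program(big)    # includes compilation
second_call = @elapsed mean_total_by_program(big)   # mostly run time

println("First call:  ", round(first_call, digits = 3), " s")
println("Second call: ", round(second_call, digits = 3), " s")
```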

Next Steps

To continue learning:

  1. Practice with your own data: Import a CSV and try these operations
  2. Explore more functions: Tidier.jl supports many more operations
  3. Learn Julia basics: Understanding Julia makes you more effective
  4. Join the community: Julia has a welcoming, helpful community
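As a starting point for the first step, here is a minimal sketch of loading a CSV file and applying the verbs from this tutorial. It assumes the CSV.jl package, which is not in the install list above and would need `Pkg.add("CSV")`. The file contents and column names are made up, and the file is written to a temporary path so the example is self-contained.

```julia
using CSV          # assumption: installed separately with Pkg.add("CSV")
using DataFrames
using Tidier

# Write a tiny CSV to a temporary file (stand-in for your own data)
path = tempname() * ".csv"
write(path, """
id,math_score,science_score
1,81,81
2,84,57
3,96,82
""")

# Read the file into a DataFrame
scores = CSV.read(path, DataFrame)

# Apply the same verbs as in the tutorial
high_math = @chain scores begin
    @filter(math_score >= 84)
    @mutate(total = math_score + science_score)
end

println(high_math)
```

With your own data, replace `path` with the location of your file and adjust the column names in the pipeline.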

Resources
