High-Performance Data Science with Julia and Tidier.jl

Julia

Tidier.jl

data manipulation

performance

tutorial

Author: Andrew Ellis

Affiliation: Virtual Academy, Bern University of Applied Sciences

Published: 07 Jun, 2025

This tutorial shows how Julia and Tidier.jl work together for high-performance data science. Julia compiles high-level, readable code to fast native machine code, and Tidier.jl brings the tidyverse syntax familiar from R to Julia, so the pair suits data analysis that needs both speed and readable code.

Installation Prerequisites

Before running the examples, you’ll need to install Julia and the required packages:

Installing Julia

Download Julia from julialang.org
Using Homebrew (macOS): brew install julia
Using juliaup (recommended): Follow instructions at github.com/JuliaLang/juliaup

Installing Required Packages

# In Julia REPL, install required packages
using Pkg
Pkg.add(["Tidier", "DataFrames"])

Why Julia + Tidier.jl?

The Performance Advantage: Julia

Julia offers compelling advantages for data science:

Near C-speed performance with high-level syntax
Multiple dispatch for elegant, extensible code
Native parallelism and distributed computing
Excellent interoperability with Python, R, and C
Growing ecosystem of scientific computing packages
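
To make the multiple-dispatch point concrete, here is a minimal sketch in plain Julia. The types `Circle` and `Square` and the functions `area` and `pair_kind` are hypothetical names invented for this illustration; they do not come from any package.

```julia
# Hypothetical shape types, defined only for this illustration
abstract type Shape end

struct Circle <: Shape
    radius::Float64
end

struct Square <: Shape
    side::Float64
end

# One generic function with one method per concrete type
area(c::Circle) = pi * c.radius^2
area(s::Square) = s.side^2

# Dispatch considers the types of all arguments, not only the first
pair_kind(::Circle, ::Circle) = "circle-circle"
pair_kind(::Circle, ::Square) = "circle-square"
pair_kind(::Shape, ::Shape) = "generic"   # fallback for every other combination

shapes = [Circle(1.0), Square(2.0)]
total_area = sum(area, shapes)            # picks the matching method per element

println(total_area)
println(pair_kind(Circle(1.0), Square(1.0)))
println(pair_kind(Square(1.0), Circle(1.0)))
```

New shapes can be supported by adding methods, without editing the existing ones, which is the extensibility the list refers to.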

The Familiar Syntax: Tidier.jl

Tidier.jl brings the tidyverse workflow to Julia:

Familiar dplyr-style verbs (select, filter, mutate, summarize)
The @chain macro for readable pipelines, in the role of R's pipe operator
Consistent grammar for data manipulation
Performance benefits of Julia’s compiled execution
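
As a rough illustration of how the verbs compose, the sketch below writes the same two-step query once with an intermediate variable and once with `@chain`, which passes each result as the first argument of the next step. The `df` table and its columns are made up for this example.

```julia
using Tidier
using DataFrames

# Small made-up table, for illustration only
df = DataFrame(name = ["a", "b", "c", "d"], score = [55, 72, 91, 64])

# Step by step, with an intermediate variable
passed = @filter(df, score >= 60)
step_by_step = @arrange(passed, desc(score))

# Same query with @chain: each line receives the previous result
chained = @chain df begin
    @filter(score >= 60)
    @arrange(desc(score))
end

println(chained)
```

Both forms should give the same table; the chained form avoids naming intermediate results.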

Setup and Data Preparation

# Load required packages
using Tidier
using DataFrames
using Random
using Statistics

# Set random seed for reproducibility
Random.seed!(123)

# Display Julia and package versions
println("Julia version: ", VERSION)
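# pkgversion (Julia 1.9+) reports the installed version instead of a hard-coded string
println("Tidier.jl version: v", pkgversion(Tidier))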

Julia version: 1.11.5
Tidier.jl version: v1.2.0

# Create a simple dataset for demonstration
students = DataFrame(
    id = 1:100,
    name = ["Student $i" for i in 1:100],
    math_score = rand(60:100, 100),
    science_score = rand(55:100, 100),
    program = rand(["CS", "Math", "Physics"], 100),
    grade_level = rand([1, 2, 3, 4], 100)
)

println("Dataset shape: ", size(students))
first(students, 5)

Dataset shape: (100, 6)

5×6 DataFrame

Row	id	name	math_score	science_score	program	grade_level
	Int64	String	Int64	Int64	String	Int64
1	1	Student 1	81	81	Physics	1
2	2	Student 2	84	57	Physics	4
3	3	Student 3	96	82	Physics	3
4	4	Student 4	67	64	Math	4
5	5	Student 5	81	83	Physics	3

# Add a total score column
students = @mutate(students, total = math_score + science_score)
first(students, 5)

5×7 DataFrame

Row	id	name	math_score	science_score	program	grade_level	total
	Int64	String	Int64	Int64	String	Int64	Int64
1	1	Student 1	81	81	Physics	1	162
2	2	Student 2	84	57	Physics	4	141
3	3	Student 3	96	82	Physics	3	178
4	4	Student 4	67	64	Math	4	131
5	5	Student 5	81	83	Physics	3	164

Basic Tidier.jl Operations

1. Filtering Data

# Filter students with high math scores
high_performers = @filter(students, math_score >= 90)
println("Students with math score >= 90: ", nrow(high_performers))
first(high_performers, 5)

Students with math score >= 90: 26

5×7 DataFrame

Row	id	name	math_score	science_score	program	grade_level	total
	Int64	String	Int64	Int64	String	Int64	Int64
1	3	Student 3	96	82	Physics	3	178
2	8	Student 8	98	98	CS	2	196
3	12	Student 12	94	67	CS	4	161
4	22	Student 22	100	83	Physics	3	183
5	23	Student 23	96	82	CS	3	178

# Filter by multiple conditions
cs_seniors = @filter(students, program == "CS" && grade_level == 4)
println("CS seniors: ", nrow(cs_seniors))
first(cs_seniors, 5)

CS seniors: 9

5×7 DataFrame

Row	id	name	math_score	science_score	program	grade_level	total
	Int64	String	Int64	Int64	String	Int64	Int64
1	7	Student 7	61	67	CS	4	128
2	12	Student 12	94	67	CS	4	161
3	27	Student 27	63	83	CS	4	146
4	48	Student 48	90	100	CS	4	190
5	52	Student 52	69	85	CS	4	154
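
Inside `@filter`, the scalar-looking condition is applied row by row. As a mental model (an assumption about the effect, not a description of Tidier.jl's internals), the multi-condition filter behaves like the broadcasted expression below in plain DataFrames.jl. The small `demo` table is made up for this sketch.

```julia
using DataFrames

# Made-up rows, for illustration only
demo = DataFrame(
    program = ["CS", "Math", "CS", "Physics"],
    grade_level = [4, 4, 2, 4],
)

# Elementwise comparison (.==) and elementwise AND (.&) build a Bool mask
mask = (demo.program .== "CS") .& (demo.grade_level .== 4)

# Index rows with the mask; ":" keeps all columns
cs_seniors_demo = demo[mask, :]
println(cs_seniors_demo)
```

Only the first row satisfies both conditions, so the result has a single row.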

2. Selecting Columns

# Select specific columns
scores_only = @select(students, id, math_score, science_score, total)
first(scores_only, 5)

5×4 DataFrame

Row	id	math_score	science_score	total
	Int64	Int64	Int64	Int64
1	1	81	81	162
2	2	84	57	141
3	3	96	82	178
4	4	67	64	131
5	5	81	83	164

# Select columns using patterns
name_and_scores = @select(students, name, ends_with("score"))
first(name_and_scores, 5)

5×3 DataFrame

Row	name	math_score	science_score
	String	Int64	Int64
1	Student 1	81	81
2	Student 2	84	57
3	Student 3	96	82
4	Student 4	67	64
5	Student 5	81	83
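
Besides `ends_with`, Tidier.jl offers related selection helpers such as `starts_with` and `contains`. The sketch below applies them to a small made-up table (`demo` is not part of the tutorial data).

```julia
using Tidier
using DataFrames

# Made-up table with the same column naming scheme as the tutorial data
demo = DataFrame(
    id = 1:3,
    name = ["a", "b", "c"],
    math_score = [81, 84, 96],
    science_score = [81, 57, 82],
)

# Columns whose names begin with "math", plus id
math_cols = @select(demo, id, starts_with("math"))

# Columns whose names contain "score" anywhere
score_cols = @select(demo, contains("score"))

println(names(math_cols))
println(names(score_cols))
```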

3. Creating New Columns with Mutate

# Add calculated columns
students_graded = @mutate(students, 
    average_score = (math_score + science_score) / 2,
    passed = total >= 140
)
first(students_graded, 5)

5×9 DataFrame

Row	id	name	math_score	science_score	program	grade_level	total	average_score	passed
	Int64	String	Int64	Int64	String	Int64	Int64	Float64	Bool
1	1	Student 1	81	81	Physics	1	162	81.0	true
2	2	Student 2	84	57	Physics	4	141	70.5	true
3	3	Student 3	96	82	Physics	3	178	89.0	true
4	4	Student 4	67	64	Math	4	131	65.5	false
5	5	Student 5	81	83	Physics	3	164	82.0	true

4. Summarizing Data

# Basic summary statistics
summary_stats = @summarize(students,
    avg_math = mean(math_score),
    avg_science = mean(science_score),
    max_total = maximum(total),
    min_total = minimum(total),
    n_students = length(id)
)
summary_stats

1×5 DataFrame

Row	avg_math	avg_science	max_total	min_total	n_students
	Float64	Float64	Int64	Int64	Int64
1	80.07	76.96	196	119	100

5. Grouping and Summarizing

# Summary by program
program_summary = @chain students begin
    @group_by(program)
    @summarize(
        count = length(id),
        avg_math = mean(math_score),
        avg_science = mean(science_score),
        avg_total = mean(total)
    )
    @arrange(desc(avg_total))
end
program_summary

3×5 DataFrame

Row	program	count	avg_math	avg_science	avg_total
	String	Int64	Float64	Float64	Float64
1	Physics	41	81.7073	76.5854	158.293
2	Math	23	78.913	79.2174	158.13
3	CS	36	78.9444	75.9444	154.889

# Summary by grade level
grade_level_summary = @chain students begin
    @group_by(grade_level)
    @summarize(
        n_students = length(id),
        avg_math = round(mean(math_score), digits=1),
        avg_science = round(mean(science_score), digits=1)
    )
    @arrange(grade_level)
end
grade_level_summary

4×4 DataFrame

Row	grade_level	n_students	avg_math	avg_science
	Int64	Int64	Float64	Float64
1	1	23	82.9	84.3
2	2	18	80.1	77.6
3	3	33	80.5	73.6
4	4	26	77.0	74.2
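
Grouping is not limited to `@summarize`. A possible variation, sketched below on made-up data, uses `@group_by` with `@mutate` to compare each row with its own group's mean and then drops the grouping with `@ungroup`. The column name `diff_from_mean` is a hypothetical choice.

```julia
using Tidier
using DataFrames
using Statistics

# Made-up scores, two programs with two students each
demo = DataFrame(
    program = ["CS", "CS", "Math", "Math"],
    math_score = [80, 90, 70, 100],
)

centered = @chain demo begin
    @group_by(program)
    # mean() is computed within each program, not over the whole table
    @mutate(diff_from_mean = math_score - mean(math_score))
    @ungroup()
end

println(centered)
```

Unlike a grouped summary, this keeps one row per student and adds a column.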

6. Arranging Data

# Sort by total score (descending)
top_students = @chain students begin
    @arrange(desc(total))
    @select(name, program, math_score, science_score, total)
    @slice(1:10)
end
println("Top 10 students by total score:")
top_students

Top 10 students by total score:

10×5 DataFrame

Row	name	program	math_score	science_score	total
	String	String	Int64	Int64	Int64
1	Student 8	CS	98	98	196
2	Student 72	Physics	98	98	196
3	Student 59	Physics	96	99	195
4	Student 48	CS	90	100	190
5	Student 50	Physics	84	100	184
6	Student 87	Physics	87	97	184
7	Student 22	Physics	100	83	183
8	Student 80	Math	89	94	183
9	Student 64	Physics	83	98	181
10	Student 86	Physics	85	96	181
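
The top-10 table contains ties (for example, two students with a total of 196). If a reproducible order among tied rows matters, one option is to pass a second sort key to `@arrange`. The sketch below uses made-up names to show the effect.

```julia
using Tidier
using DataFrames

# Made-up rows with a tie on total
demo = DataFrame(name = ["Zoe", "Adam", "Mia"], total = [196, 196, 180])

ranked = @chain demo begin
    # Sort by total (descending), then by name (ascending) to break ties
    @arrange(desc(total), name)
    @slice(1:2)
end

println(ranked)
```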

7. Complex Data Transformations

# First, let's verify the DataFrame exists and has the right columns
if @isdefined(students)
    println("Students DataFrame columns: ", names(students))
    println("Number of rows: ", nrow(students))
else
    println("Students DataFrame not found!")
end

# Use DataFrames.jl functions instead of Tidier.jl for this example
# Filter for upper-level students (grade_level >= 3)
upper_level = filter(row -> row.grade_level >= 3, students)

# Add performance column
upper_level.performance = map(upper_level.total) do t
    if t >= 160
        "Excellent"
    elseif t >= 140
        "Good"
    else
        "Average"
    end
end

# Group and summarize using DataFrames.jl
result = combine(groupby(upper_level, [:program, :performance]), nrow => :count)
sort!(result, [:program, order(:count, rev=true)])

println("\nPerformance distribution for upper-level students:")
result

Students DataFrame columns: ["id", "name", "math_score", "science_score", "program", "grade_level", "total"]
Number of rows: 100

Performance distribution for upper-level students:

9×3 DataFrame

Row	program	performance	count
	String	String	Int64
1	CS	Good	11
2	CS	Average	6
3	CS	Excellent	6
4	Math	Excellent	5
5	Math	Good	5
6	Math	Average	2
7	Physics	Good	9
8	Physics	Excellent	9
9	Physics	Average	6
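
The same transformation can probably be written with Tidier.jl verbs as well. The sketch below is one possible version, using `case_when` for the performance bands; it runs on a small made-up table (`demo`) rather than the random tutorial data, so its counts differ from the output above.

```julia
using Tidier
using DataFrames

# Made-up rows, for illustration only
demo = DataFrame(
    program = ["CS", "CS", "Math", "Math", "CS"],
    grade_level = [3, 4, 4, 2, 3],
    total = [165, 150, 120, 190, 170],
)

result_tidier = @chain demo begin
    @filter(grade_level >= 3)
    # Conditions are checked in order; `true` acts as the fallback branch
    @mutate(performance = case_when(
        total >= 160 => "Excellent",
        total >= 140 => "Good",
        true => "Average",
    ))
    @group_by(program, performance)
    @summarize(count = length(total))
    @ungroup()
    @arrange(program, desc(count))
end

println(result_tidier)
```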

Practical Examples

8. Working with Missing Data

# Create a DataFrame with some missing values
students_missing = DataFrame(
    id = 1:10,
    name = ["Student $i" for i in 1:10],
    math_score = [85, missing, 92, 78, missing, 88, 95, missing, 82, 90],
    science_score = [78, 85, missing, 82, 88, missing, 92, 85, missing, 87]
)

println("Data with missing values:")
println(students_missing)

# Count missing values
missing_counts = DataFrame(
    math_missing = sum(ismissing.(students_missing.math_score)),
    science_missing = sum(ismissing.(students_missing.science_score))
)
println("\nMissing value counts:")
println(missing_counts)

# Calculate mean, skipping missing values
math_mean = mean(skipmissing(students_missing.math_score))
science_mean = mean(skipmissing(students_missing.science_score))
println("\nMeans (excluding missing): Math = $math_mean, Science = $science_mean")

Data with missing values:

10×4 DataFrame
 Row │ id     name        math_score  science_score 
     │ Int64  String      Int64?      Int64?        
─────┼──────────────────────────────────────────────
   1 │     1  Student 1           85             78
   2 │     2  Student 2      missing             85
   3 │     3  Student 3           92        missing 
   4 │     4  Student 4           78             82
   5 │     5  Student 5      missing             88
   6 │     6  Student 6           88        missing 
   7 │     7  Student 7           95             92
   8 │     8  Student 8      missing             85
   9 │     9  Student 9           82        missing 
  10 │    10  Student 10          90             87

Missing value counts:

1×2 DataFrame
 Row │ math_missing  science_missing 
     │ Int64         Int64           
─────┼───────────────────────────────
   1 │            3                3

Means (excluding missing): Math = 87.14285714285714, Science = 85.28571428571429
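
The example above uses Base Julia and DataFrames.jl functions. A Tidier.jl-style alternative, sketched here on a smaller made-up table, drops incomplete rows with `@drop_missing` and computes means inside `@summarize` with `skipmissing`.

```julia
using Tidier
using DataFrames
using Statistics

# Made-up table with one missing value per score column
demo = DataFrame(
    id = 1:4,
    math_score = [85, missing, 92, 78],
    science_score = [78, 85, missing, 82],
)

# Keep only rows that have no missing value in any column
complete_rows = @drop_missing(demo)

# Per-column means that ignore missing entries
means = @summarize(demo,
    avg_math = mean(skipmissing(math_score)),
    avg_science = mean(skipmissing(science_score)),
)

println(complete_rows)
println(means)
```

Dropping rows and skipping values answer different questions: the first discards a student entirely, the second keeps every available score per column.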

9. Joining DataFrames

# Create a simple grades DataFrame
grades = DataFrame(
    id = [1, 2, 3, 4, 5],
    final_grade = ["A", "B", "A", "C", "B"]
)

# Join with students data
students_with_grades = @left_join(students[1:5, :], grades, id)
students_with_grades

5×8 DataFrame

Row	id	name	math_score	science_score	program	grade_level	total	final_grade
	Int64	String	Int64	Int64	String	Int64	Int64	String?
1	1	Student 1	81	81	Physics	1	162	A
2	2	Student 2	84	57	Physics	4	141	B
3	3	Student 3	96	82	Physics	3	178	A
4	4	Student 4	67	64	Math	4	131	C
5	5	Student 5	81	83	Physics	3	164	B
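
In the output above, `final_grade` has type `String?` because a left join may leave rows without a match. The sketch below, on made-up tables, shows that case and contrasts it with `@inner_join`, which keeps only matched rows.

```julia
using Tidier
using DataFrames

# Made-up tables: student 2 has no entry in grades_demo
people = DataFrame(id = [1, 2, 3], name = ["Student 1", "Student 2", "Student 3"])
grades_demo = DataFrame(id = [1, 3], final_grade = ["A", "B"])

# Left join keeps every row of `people`; unmatched rows get `missing`
left = @left_join(people, grades_demo, id)

# Inner join keeps only ids present in both tables
inner = @inner_join(people, grades_demo, id)

println(left)
println(inner)
```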

10. Reshaping Data

# Create wide data
wide_scores = @chain students[1:5, :] begin
    @select(id, name, math_score, science_score)
end

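println("Wide format:")
# Only the last expression of a cell is shown automatically, so display explicitly
display(wide_scores)

# Convert to long format (using DataFrames stack function)
long_scores = stack(wide_scores, [:math_score, :science_score], 
                    variable_name=:subject, value_name=:score)
println("\nLong format:")
first(long_scores, 10)

Wide format:

5×4 DataFrame

Row	id	name	math_score	science_score
	Int64	String	Int64	Int64
1	1	Student 1	81	81
2	2	Student 2	84	57
3	3	Student 3	96	82
4	4	Student 4	67	64
5	5	Student 5	81	83

Long format: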

10×4 DataFrame

Row	id	name	subject	score
	Int64	String	String	Int64
1	1	Student 1	math_score	81
2	2	Student 2	math_score	84
3	3	Student 3	math_score	96
4	4	Student 4	math_score	67
5	5	Student 5	math_score	81
6	1	Student 1	science_score	81
7	2	Student 2	science_score	57
8	3	Student 3	science_score	82
9	4	Student 4	science_score	64
10	5	Student 5	science_score	83
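
The reshaping above relies on `stack` from DataFrames.jl. Tidier.jl also has tidyr-style macros for this, and the sketch below shows a round trip with `@pivot_longer` and `@pivot_wider` on a made-up two-row table. The keyword values (`subject`, `score`) mirror the names used above.

```julia
using Tidier
using DataFrames

# Made-up wide table, for illustration only
wide = DataFrame(id = [1, 2], math_score = [81, 84], science_score = [81, 57])

# Wide to long: one row per (id, subject) pair
long = @pivot_longer(wide, math_score:science_score,
                     names_to = subject, values_to = score)

# Long back to wide: one column per subject
wide_again = @pivot_wider(long, names_from = subject, values_from = score)

println(long)
println(wide_again)
```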

Summary of Basic Tidier.jl Operations

Core Functions

@filter: Select rows based on conditions
@select: Choose specific columns
@mutate: Create or modify columns
@summarize: Calculate summary statistics
@group_by: Group data for aggregated operations
@arrange: Sort rows
@chain: Combine multiple operations into one pipeline

Key Benefits of Julia + Tidier.jl

Familiar Syntax: If you know dplyr from R, you already know Tidier.jl
High Performance: Julia’s speed makes operations on large datasets fast
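Clean Code: @chain pipelines make complex operations readable
Typed Columns: Every column has a concrete element type (for example Int64 or Int64?), so type mismatches and missing values are visible in the data itself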

Next Steps

To continue learning:

Practice with your own data: Import a CSV and try these operations
Explore more functions: Tidier.jl supports many more operations
Learn Julia basics: Understanding Julia makes you more effective
Join the community: Julia has a welcoming, helpful community
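
As a starting point for the first suggestion, the sketch below reads a CSV file and filters it. It assumes the CSV.jl package, which is not in the install list above (add it with `Pkg.add("CSV")`); the file name and columns are made up, and the file is written to a temporary directory so the example is self-contained.

```julia
using CSV
using DataFrames
using Tidier

# Write a tiny made-up CSV file to a temporary directory
path = joinpath(mktempdir(), "scores.csv")
write(path, "id,math_score\n1,81\n2,84\n3,96\n")

# Read the file into a DataFrame, then apply a Tidier.jl verb
scores = CSV.read(path, DataFrame)
high = @filter(scores, math_score >= 84)

println(high)
```

With real data, replace `path` with the location of your own file.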

Reuse

CC BY 4.0