This tutorial explores the power of Julia combined with Tidier.jl for high-performance data science. Julia brings blazing-fast computation with Python-like syntax, while Tidier.jl provides the beloved tidyverse syntax from R, creating a perfect combination for modern data analysis.
Installation Prerequisites
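Before running the examples, you'll need to install Julia and the required packages.
Installing Julia
1. Download Julia from julialang.org (https://julialang.org/downloads/)
2. Using Homebrew (macOS): brew install julia
3. Using juliaup (recommended): follow the instructions at github.com/JuliaLang/juliaup (https://github.com/JuliaLang/juliaup)
Installing Required Packages
# In the Julia REPL, install the required packages
using Pkg
Pkg.add(["Tidier", "DataFrames"])
Why Julia + Tidier.jl?
The Performance Advantage: Julia
Julia offers compelling advantages for data science:
- Near C-speed performance with high-level syntax
- Multiple dispatch for elegant, extensible code
- Native parallelism and distributed computing
- Excellent interoperability with Python, R, and C
- Growing ecosystem of scientific computing packages
The Familiar Syntax: Tidier.jl
Tidier.jl brings the tidyverse workflow to Julia:
- Familiar dplyr-style verbs (select, filter, mutate, summarize)
- Pipe operator (|>) for readable code chains
- Consistent grammar for data manipulation
- Performance benefits of Julia's compiled execution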
Setup and Data Preparation
# Load required packages
using Tidier
using DataFrames
using Random
using Statistics

# Set random seed for reproducibility
Random.seed!(123)

# Display Julia and package versions
println("Julia version: ", VERSION)
println("Tidier.jl version: v1.2.0")
Julia version: 1.11.5
Tidier.jl version: v1.2.0
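The Tidier.jl version printed above is a hard-coded string in the source, so it will not change if a different release is installed. A minimal sketch of querying the installed version instead, assuming Julia 1.9 or later (where `pkgversion` is available) and that Tidier.jl is installed in the active environment; the variable name `tidier_version` is illustrative:

```julia
using Tidier

# pkgversion returns the VersionNumber of the package that provides a module
# (available from Julia 1.9 onward)
tidier_version = pkgversion(Tidier)

println("Julia version: ", VERSION)
println("Tidier.jl version: v", tidier_version)
```

Printing the queried value keeps the reported version in step with whatever the environment actually resolves.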
# Create a simple dataset for demonstration
students = DataFrame(
    id = 1:100,
    name = ["Student $i" for i in 1:100],
    math_score = rand(60:100, 100),
    science_score = rand(55:100, 100),
    program = rand(["CS", "Math", "Physics"], 100),
    grade_level = rand([1, 2, 3, 4], 100)
)

println("Dataset shape: ", size(students))
first(students, 5)
Dataset shape: (100, 6)
5×6 DataFrame
 Row │ id     name       math_score  science_score  program  grade_level
     │ Int64  String     Int64       Int64          String   Int64
─────┼───────────────────────────────────────────────────────────────────
   1 │     1  Student 1          81             81  Physics            1
   2 │     2  Student 2          84             57  Physics            4
   3 │     3  Student 3          96             82  Physics            3
   4 │     4  Student 4          67             64  Math               4
   5 │     5  Student 5          81             83  Physics            3
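`Random.seed!(123)` makes the sampled scores repeatable on a given Julia version, but the stream of random numbers for a seed may change between Julia releases, so the rows shown here can differ on another version. A small sketch of the underlying idea using explicit generators; the names `rng_a`, `rng_b`, `scores_a`, and `scores_b` are illustrative and not part of the tutorial:

```julia
using Random

# Two generators created from the same seed produce the same draws
rng_a = Xoshiro(123)
rng_b = Xoshiro(123)

scores_a = rand(rng_a, 60:100, 5)
scores_b = rand(rng_b, 60:100, 5)

println(scores_a == scores_b)  # prints true
```

Passing a generator to `rand` explicitly keeps data generation independent of whatever else has consumed the global random state.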
# Add a total score column
students = @mutate(students, total = math_score + science_score)
first(students, 5)
5×7 DataFrame
 Row │ id     name       math_score  science_score  program  grade_level  total
     │ Int64  String     Int64       Int64          String   Int64        Int64
─────┼──────────────────────────────────────────────────────────────────────────
   1 │     1  Student 1          81             81  Physics            1    162
   2 │     2  Student 2          84             57  Physics            4    141
   3 │     3  Student 3          96             82  Physics            3    178
   4 │     4  Student 4          67             64  Math               4    131
   5 │     5  Student 5          81             83  Physics            3    164
Basic Tidier.jl Operations
1. Filtering Data
# Filter students with high math scores
high_performers = @filter(students, math_score >= 90)
println("Students with math score >= 90: ", nrow(high_performers))
first(high_performers, 5)
Students with math score >= 90: 26
5×7 DataFrame
 Row │ id     name        math_score  science_score  program  grade_level  total
     │ Int64  String      Int64       Int64          String   Int64        Int64
─────┼───────────────────────────────────────────────────────────────────────────
   1 │     3  Student 3           96             82  Physics            3    178
   2 │     8  Student 8           98             98  CS                 2    196
   3 │    12  Student 12          94             67  CS                 4    161
   4 │    22  Student 22         100             83  Physics            3    183
   5 │    23  Student 23          96             82  CS                 3    178
# Filter by multiple conditions
cs_seniors = @filter(students, program == "CS" && grade_level == 4)
println("CS seniors: ", nrow(cs_seniors))
first(cs_seniors, 5)
CS seniors: 9
5×7 DataFrame
 Row │ id     name        math_score  science_score  program  grade_level  total
     │ Int64  String      Int64       Int64          String   Int64        Int64
─────┼───────────────────────────────────────────────────────────────────────────
   1 │     7  Student 7           61             67  CS                 4    128
   2 │    12  Student 12          94             67  CS                 4    161
   3 │    27  Student 27          63             83  CS                 4    146
   4 │    48  Student 48          90            100  CS                 4    190
   5 │    52  Student 52          69             85  CS                 4    154
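To make the row-selection logic easy to check by hand, here is a sketch on a tiny hand-written table. The `demo` data and all variable names are made up for illustration. It shows the `&&` form used above, an `||` condition, and a DataFrames.jl `subset` call that should select the same rows as the `&&` filter, assuming Tidier.jl and DataFrames.jl are installed:

```julia
using Tidier
using DataFrames

# Hypothetical five-row table, small enough to verify by eye
demo = DataFrame(
    id = 1:5,
    program = ["CS", "Math", "CS", "Physics", "CS"],
    grade_level = [4, 4, 2, 3, 4],
    math_score = [95, 88, 91, 60, 70]
)

# Both conditions must hold: only ids 1 and 5 are CS students in grade 4
cs_seniors_demo = @filter(demo, program == "CS" && grade_level == 4)

# Either condition may hold: ids 1, 2, 3 and 5
cs_or_seniors = @filter(demo, program == "CS" || grade_level == 4)

# The AND filter written directly with DataFrames.jl;
# multiple conditions passed to subset are combined with AND
cs_seniors_df = subset(demo,
    :program => ByRow(==("CS")),
    :grade_level => ByRow(==(4)))

println(cs_seniors_demo.id)
println(cs_or_seniors.id)
println(cs_seniors_df.id)
```

Checking a filter against a table this small is a cheap way to confirm that a compound condition means what was intended before applying it to the full dataset.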
2. Selecting Columns
# Select specific columns
scores_only = @select(students, id, math_score, science_score, total)
first(scores_only, 5)
5×4 DataFrame
 Row │ id     math_score  science_score  total
     │ Int64  Int64       Int64          Int64
─────┼─────────────────────────────────────────
   1 │     1          81             81    162
   2 │     2          84             57    141
   3 │     3          96             82    178
   4 │     4          67             64    131
   5 │     5          81             83    164
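Besides listing columns by name, `@select` can also drop a column or take a contiguous range. The sketch below uses a made-up three-row `demo` table; the `-name` and `math_score:science_score` forms follow the selection syntax as documented for TidierData.jl, and the `select(..., Not(:name))` line is the plain DataFrames.jl counterpart for comparison:

```julia
using Tidier
using DataFrames

# Hypothetical table for illustration only
demo = DataFrame(
    id = 1:3,
    name = ["Ada", "Grace", "Alan"],
    math_score = [90, 85, 70],
    science_score = [80, 95, 75]
)

# Drop one column and keep the rest
without_name = @select(demo, -name)

# Keep a contiguous range of columns
score_range = @select(demo, math_score:science_score)

# DataFrames.jl counterpart of the exclusion
without_name_df = select(demo, Not(:name))

println(names(without_name))
println(names(score_range))
```

Exclusion is usually less brittle than an explicit list when a table gains columns over time.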
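# Select columns using patterns
name_and_scores = @select(students, name, ends_with("score"))
first(name_and_scores, 5)
3. Creating New Columns with Mutate
# Add calculated columns
students_graded = @mutate(students,
    average_score = (math_score + science_score) / 2,
    passed = total >= 140
)
first(students_graded, 5)
4. Summarizing Data
# Basic summary statistics
summary_stats = @summarize(students,
    avg_math = mean(math_score),
    avg_science = mean(science_score),
    max_total = maximum(total),
    min_total = minimum(total),
    n_students = length(id)
)
summary_stats
5. Grouping and Summarizing
# Summary by program
program_summary = @chain students begin
    @group_by(program)
    @summarize(
        count = length(id),
        avg_math = mean(math_score),
        avg_science = mean(science_score),
        avg_total = mean(total)
    )
    @arrange(desc(avg_total))
end
program_summary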
3×5 DataFrame
 Row │ program  count  avg_math  avg_science  avg_total
     │ String   Int64  Float64   Float64      Float64
─────┼──────────────────────────────────────────────────
   1 │ Physics     41   81.7073      76.5854    158.293
   2 │ Math        23    78.913      79.2174     158.13
   3 │ CS          36   78.9444      75.9444    154.889
# Summary by grade level
grade_level_summary = @chain students begin
    @group_by(grade_level)
    @summarize(
        n_students = length(id),
        avg_math = round(mean(math_score), digits=1),
        avg_science = round(mean(science_score), digits=1)
    )
    @arrange(grade_level)
end
grade_level_summary
4×4 DataFrame
 Row │ grade_level  n_students  avg_math  avg_science
     │ Int64        Int64       Float64   Float64
─────┼────────────────────────────────────────────────
   1 │           1          23      82.9         84.3
   2 │           2          18      80.1         77.6
   3 │           3          33      80.5         73.6
   4 │           4          26      77.0         74.2
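The grouped summaries above count rows with `length(id)`. As a sketch on a made-up five-row table, the same pattern can be written with the `n()` helper documented for TidierData.jl, and compared with a direct DataFrames.jl `groupby`/`combine` call. The `demo` values and variable names are illustrative:

```julia
using Tidier
using DataFrames
using Statistics

# Hypothetical data: two CS students and three Math students
demo = DataFrame(
    program = ["CS", "CS", "Math", "Math", "Math"],
    total = [150, 170, 140, 160, 186]
)

# Tidier.jl: group, count rows with n(), average, sort descending
tidy_summary = @chain demo begin
    @group_by(program)
    @summarize(count = n(), avg_total = mean(total))
    @arrange(desc(avg_total))
end

# DataFrames.jl: the same aggregation without macros
df_summary = sort(
    combine(groupby(demo, :program),
            nrow => :count,
            :total => mean => :avg_total),
    :avg_total, rev = true
)

println(tidy_summary)
println(df_summary)
```

Here CS averages (150 + 170) / 2 = 160 and Math averages (140 + 160 + 186) / 3 = 162, so Math sorts first in both results.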
6. Arranging Data
# Sort by total score (descending)
top_students = @chain students begin
    @arrange(desc(total))
    @select(name, program, math_score, science_score, total)
    @slice(1:10)
end

println("Top 10 students by total score:")
top_students
Top 10 students by total score:
10×5 DataFrame
 Row │ name        program  math_score  science_score  total
     │ String      String   Int64       Int64          Int64
─────┼───────────────────────────────────────────────────────
   1 │ Student 8   CS               98             98    196
   2 │ Student 72  Physics          98             98    196
   3 │ Student 59  Physics          96             99    195
   4 │ Student 48  CS               90            100    190
   5 │ Student 50  Physics          84            100    184
   6 │ Student 87  Physics          87             97    184
   7 │ Student 22  Physics         100             83    183
   8 │ Student 80  Math             89             94    183
   9 │ Student 64  Physics          83             98    181
  10 │ Student 86  Physics          85             96    181
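The ranking above contains several tied totals (for example, two students at 196). Sorting on `total` alone leaves the order among tied rows to the sort itself; adding secondary keys to `@arrange` makes the tie-breaking explicit. A sketch on a made-up four-row table, with illustrative names and values:

```julia
using Tidier
using DataFrames

# Hypothetical data with a three-way tie on total
demo = DataFrame(
    name = ["A", "B", "C", "D"],
    math_score = [90, 98, 98, 85],
    total = [196, 196, 180, 196]
)

# Sort by total (descending), break ties by math_score (descending),
# then by name, and keep the first three rows
ranked = @chain demo begin
    @arrange(desc(total), desc(math_score), name)
    @slice(1:3)
end

println(ranked.name)
```

Among the three rows with a total of 196, B has the highest math score (98), followed by A (90) and D (85), so the result is B, A, D.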
7. Complex Data Transformations
# First, let's verify the DataFrame exists and has the right columns
if @isdefined(students)
    println("Students DataFrame columns: ", names(students))
    println("Number of rows: ", nrow(students))
else
    println("Students DataFrame not found!")
end

# Use DataFrames.jl functions instead of Tidier.jl for this example
# Filter for upper-level students (grade_level >= 3)
upper_level = filter(row -> row.grade_level >= 3, students)

# Add performance column
upper_level.performance = map(upper_level.total) do t
    if t >= 160
        "Excellent"
    elseif t >= 140
        "Good"
    else
        "Average"
    end
end

# Group and summarize using DataFrames.jl
result = combine(groupby(upper_level, [:program, :performance]), nrow => :count)
sort!(result, [:program, order(:count, rev=true)])

println("\nPerformance distribution for upper-level students:")
result
Students DataFrame columns: ["id", "name", "math_score", "science_score", "program", "grade_level", "total"]
Number of rows: 100
Performance distribution for upper-level students:
9×3 DataFrame
 Row │ program  performance  count
     │ String   String       Int64
─────┼─────────────────────────────
   1 │ CS       Good            11
   2 │ CS       Average          6
   3 │ CS       Excellent        6
   4 │ Math     Excellent        5
   5 │ Math     Good             5
   6 │ Math     Average          2
   7 │ Physics  Good             9
   8 │ Physics  Excellent        9
   9 │ Physics  Average          6
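The example above falls back to DataFrames.jl functions. As a hedged sketch, the same pipeline can probably be expressed with Tidier.jl macros using `case_when` and `n()` as documented for TidierData.jl. The six-row `demo` table is made up, and this is one possible formulation rather than the tutorial's own method:

```julia
using Tidier
using DataFrames

# Hypothetical data for illustration
demo = DataFrame(
    program = ["CS", "CS", "CS", "Math", "Math", "Math"],
    grade_level = [3, 4, 4, 3, 1, 4],
    total = [165, 150, 145, 120, 190, 170]
)

result_tidy = @chain demo begin
    # Keep upper-level students only
    @filter(grade_level >= 3)
    # Conditions are checked in order; `true` acts as the fallback
    @mutate(performance = case_when(
        total >= 160 => "Excellent",
        total >= 140 => "Good",
        true => "Average"
    ))
    @group_by(program, performance)
    @summarize(count = n())
    @ungroup
    @arrange(program, desc(count))
end

println(result_tidy)
```

After filtering, five rows remain: CS has two "Good" and one "Excellent", and Math has one "Average" and one "Excellent", giving four program and performance combinations in total.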
Practical Examples
8. Working with Missing Data
# Create a DataFrame with some missing values
students_missing = DataFrame(
    id = 1:10,
    name = ["Student $i" for i in 1:10],
    math_score = [85, missing, 92, 78, missing, 88, 95, missing, 82, 90],
    science_score = [78, 85, missing, 82, 88, missing, 92, 85, missing, 87]
)

println("Data with missing values:")
println(students_missing)

# Count missing values
missing_counts = DataFrame(
    math_missing = sum(ismissing.(students_missing.math_score)),
    science_missing = sum(ismissing.(students_missing.science_score))
)

println("\nMissing value counts:")
println(missing_counts)

# Calculate mean, skipping missing values
math_mean = mean(skipmissing(students_missing.math_score))
science_mean = mean(skipmissing(students_missing.science_score))

println("\nMeans (excluding missing): Math = $math_mean, Science = $science_mean")
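The reason for `skipmissing` in the code above is that `missing` propagates through arithmetic, so a plain `mean` over a column containing `missing` returns `missing`. A short sketch on a made-up four-row table, also showing two common follow-ups with DataFrames.jl and Base functions (dropping incomplete rows, and substituting a placeholder). The names and the placeholder value 0 are illustrative choices:

```julia
using DataFrames
using Statistics

# Hypothetical data with one missing value per score column
demo = DataFrame(
    id = 1:4,
    math_score = [85, missing, 92, 78],
    science_score = [78, 85, missing, 82]
)

# mean propagates missing unless missing values are skipped
naive_mean = mean(demo.math_score)
math_mean = mean(skipmissing(demo.math_score))  # (85 + 92 + 78) / 3

# Keep only rows where both scores are present (ids 1 and 4)
complete_rows = dropmissing(demo, [:math_score, :science_score])

# Replace missing math scores with a placeholder value
filled = coalesce.(demo.math_score, 0)

println(ismissing(naive_mean))  # prints true
println(math_mean)              # prints 85.0
println(complete_rows.id)
println(filled)
```

Whether to skip, drop, or fill depends on the analysis; filling with 0 here would pull the average down, which is why the tutorial skips missing values when computing means.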
---
title: "High-Performance Data Science with Julia and Tidier.jl"
date: 7 June, 2025
date-format: "DD MMM, YYYY"
author:
  - name: Andrew Ellis
    url: https://github.com/awellis
    affiliation: Virtual Academy, Bern University of Applied Sciences
    affiliation-url: https://virtuelleakademie.ch
    orcid: 0000-0002-2788-936X
categories: [Julia, Tidier.jl, data manipulation, performance, tutorial]
format:
  html:
    code-fold: true
    code-tools: true
    code-summary: "Show the code"
    toc: true
---

This tutorial explores the power of **Julia** combined with **Tidier.jl** for high-performance data science. Julia brings blazing-fast computation with Python-like syntax, while Tidier.jl provides the beloved tidyverse syntax from R, creating a perfect combination for modern data analysis.

## Installation Prerequisites

Before running the examples, you'll need to install Julia and the required packages:

### Installing Julia

1. **Download Julia** from [julialang.org](https://julialang.org/downloads/)
2. **Using Homebrew** (macOS): `brew install julia`
3. **Using juliaup** (recommended): Follow instructions at [github.com/JuliaLang/juliaup](https://github.com/JuliaLang/juliaup)

### Installing Required Packages

```julia
# In Julia REPL, install required packages
using Pkg
Pkg.add(["Tidier", "DataFrames"])
```

## Why Julia + Tidier.jl?

### The Performance Advantage: Julia

Julia offers compelling advantages for data science:

- **Near C-speed performance** with high-level syntax
- **Multiple dispatch** for elegant, extensible code
- **Native parallelism** and distributed computing
- **Excellent interoperability** with Python, R, and C
- **Growing ecosystem** of scientific computing packages

### The Familiar Syntax: Tidier.jl

Tidier.jl brings the tidyverse workflow to Julia:

- **Familiar dplyr-style verbs** (`select`, `filter`, `mutate`, `summarize`)
- **Pipe operator** (`|>`) for readable code chains
- **Consistent grammar** for data manipulation
- **Performance benefits** of Julia's compiled execution

## Setup and Data Preparation

```{julia}
# Load required packages
using Tidier
using DataFrames
using Random
using Statistics

# Set random seed for reproducibility
Random.seed!(123)

# Display Julia and package versions
println("Julia version: ", VERSION)
println("Tidier.jl version: v1.2.0")
```

```{julia}
# Create a simple dataset for demonstration
students = DataFrame(
    id = 1:100,
    name = ["Student $i" for i in 1:100],
    math_score = rand(60:100, 100),
    science_score = rand(55:100, 100),
    program = rand(["CS", "Math", "Physics"], 100),
    grade_level = rand([1, 2, 3, 4], 100)
)

println("Dataset shape: ", size(students))
first(students, 5)
```

```{julia}
# Add a total score column
students = @mutate(students, total = math_score + science_score)
first(students, 5)
```

## Basic Tidier.jl Operations

### 1. Filtering Data

```{julia}
# Filter students with high math scores
high_performers = @filter(students, math_score >= 90)
println("Students with math score >= 90: ", nrow(high_performers))
first(high_performers, 5)
```

```{julia}
# Filter by multiple conditions
cs_seniors = @filter(students, program == "CS" && grade_level == 4)
println("CS seniors: ", nrow(cs_seniors))
first(cs_seniors, 5)
```

### 2. Selecting Columns

```{julia}
# Select specific columns
scores_only = @select(students, id, math_score, science_score, total)
first(scores_only, 5)
```

```{julia}
# Select columns using patterns
name_and_scores = @select(students, name, ends_with("score"))
first(name_and_scores, 5)
```

### 3. Creating New Columns with Mutate

```{julia}
# Add calculated columns
students_graded = @mutate(students,
    average_score = (math_score + science_score) / 2,
    passed = total >= 140
)
first(students_graded, 5)
```

### 4. Summarizing Data

```{julia}
# Basic summary statistics
summary_stats = @summarize(students,
    avg_math = mean(math_score),
    avg_science = mean(science_score),
    max_total = maximum(total),
    min_total = minimum(total),
    n_students = length(id)
)
summary_stats
```

### 5. Grouping and Summarizing

```{julia}
# Summary by program
program_summary = @chain students begin
    @group_by(program)
    @summarize(
        count = length(id),
        avg_math = mean(math_score),
        avg_science = mean(science_score),
        avg_total = mean(total)
    )
    @arrange(desc(avg_total))
end
program_summary
```

```{julia}
# Summary by grade level
grade_level_summary = @chain students begin
    @group_by(grade_level)
    @summarize(
        n_students = length(id),
        avg_math = round(mean(math_score), digits=1),
        avg_science = round(mean(science_score), digits=1)
    )
    @arrange(grade_level)
end
grade_level_summary
```

### 6. Arranging Data

```{julia}
# Sort by total score (descending)
top_students = @chain students begin
    @arrange(desc(total))
    @select(name, program, math_score, science_score, total)
    @slice(1:10)
end

println("Top 10 students by total score:")
top_students
```

### 7. Complex Data Transformations

```{julia}
# First, let's verify the DataFrame exists and has the right columns
if @isdefined(students)
    println("Students DataFrame columns: ", names(students))
    println("Number of rows: ", nrow(students))
else
    println("Students DataFrame not found!")
end

# Use DataFrames.jl functions instead of Tidier.jl for this example
# Filter for upper-level students (grade_level >= 3)
upper_level = filter(row -> row.grade_level >= 3, students)

# Add performance column
upper_level.performance = map(upper_level.total) do t
    if t >= 160
        "Excellent"
    elseif t >= 140
        "Good"
    else
        "Average"
    end
end

# Group and summarize using DataFrames.jl
result = combine(groupby(upper_level, [:program, :performance]), nrow => :count)
sort!(result, [:program, order(:count, rev=true)])

println("\nPerformance distribution for upper-level students:")
result
```

## Practical Examples

### 8. Working with Missing Data

```{julia}
# Create a DataFrame with some missing values
students_missing = DataFrame(
    id = 1:10,
    name = ["Student $i" for i in 1:10],
    math_score = [85, missing, 92, 78, missing, 88, 95, missing, 82, 90],
    science_score = [78, 85, missing, 82, 88, missing, 92, 85, missing, 87]
)

println("Data with missing values:")
println(students_missing)

# Count missing values
missing_counts = DataFrame(
    math_missing = sum(ismissing.(students_missing.math_score)),
    science_missing = sum(ismissing.(students_missing.science_score))
)

println("\nMissing value counts:")
println(missing_counts)

# Calculate mean, skipping missing values
math_mean = mean(skipmissing(students_missing.math_score))
science_mean = mean(skipmissing(students_missing.science_score))

println("\nMeans (excluding missing): Math = $math_mean, Science = $science_mean")
```

### 9. Joining DataFrames

```{julia}
# Create a simple grades DataFrame
grades = DataFrame(
    id = [1, 2, 3, 4, 5],
    final_grade = ["A", "B", "A", "C", "B"]
)

# Join with students data
students_with_grades = @left_join(students[1:5, :], grades, id)
students_with_grades
```

### 10. Reshaping Data

```{julia}
# Create wide data
wide_scores = @chain students[1:5, :] begin
    @select(id, name, math_score, science_score)
end

println("Wide format:")
# Print explicitly: only the last expression of a cell is displayed automatically
println(wide_scores)

# Convert to long format (using DataFrames stack function)
long_scores = stack(wide_scores, [:math_score, :science_score],
                    variable_name=:subject, value_name=:score)

println("\nLong format:")
first(long_scores, 10)
```

## Summary of Basic Tidier.jl Operations

### Core Functions

- **`@filter`**: Select rows based on conditions
- **`@select`**: Choose specific columns
- **`@mutate`**: Create or modify columns
- **`@summarize`**: Calculate summary statistics
- **`@group_by`**: Group data for aggregated operations
- **`@arrange`**: Sort rows
- **`@chain`** or **`|>`**: Combine multiple operations

### Key Benefits of Julia + Tidier.jl

1. **Familiar Syntax**: If you know dplyr from R, you already know Tidier.jl
2. **High Performance**: Julia's speed makes operations on large datasets fast
3. **Clean Code**: The pipe operator makes complex operations readable
4. **Strong Typing**: Julia is dynamically typed, so type checks happen at run time, but its type system reports mismatches as explicit errors instead of silently coercing values

## Next Steps

To continue learning:

1. **Practice with your own data**: Import a CSV and try these operations
2. **Explore more functions**: Tidier.jl supports many more operations
3. **Learn Julia basics**: Understanding Julia makes you more effective
4. **Join the community**: Julia has a welcoming, helpful community

## Resources

- [Tidier.jl Documentation](https://tidierorg.github.io/Tidier.jl/stable/)
- [Julia Documentation](https://docs.julialang.org)
- [DataFrames.jl Tutorial](https://dataframes.juliadata.org/stable/)
- [Julia for Data Science](https://juliadatascience.io/)