πŸ“Š CS5200 Data Warehousing Project

SportsTV Streaming Data Warehouse

A star schema analytics datamart processing 1.18M streaming transactions with ETL pipelines, partitioned fact tables, and business intelligence reporting for SportsTV Germany.

πŸ“Š
1.18M
Source Transactions
βœ…
97.1%
Data Retention
⚑
~51s
ETL Runtime
πŸš€
23K/s
Processing Speed
πŸ”„
161K
Records Recovered

Key Highlights

What makes this project stand out

⭐
Star Schema Design
Dimensional model with partitioned fact table and denormalized attributes for optimal query performance.
πŸ”
Sport Inference Algorithm
Pattern-matching recovers 161,588 orphaned records by inferring sport from asset ID prefixes (DEL-, AHL-, SKJ-).
πŸ“¦
Table Partitioning
Yearly range partitions on fact table enable efficient date-range query pruning and improved performance.
⚑
Optimized ETL
Vectorized operations, hashmap lookups, and bulk inserts achieve 23K records/second throughput.
☁️
Cloud Deployment
Production-ready MySQL on Aiven Cloud with SSL connections and secure credential management.
πŸ“ˆ
Pre-computed Metrics
Daily aggregations with pre-calculated averages enable sub-100ms analytical query response times.

Architecture

End-to-end data flow from sources to analytics

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        DATA SOURCES (EXTRACT)                           β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”‚
β”‚  β”‚   SQLite Database    β”‚         β”‚      CSV Export      β”‚              β”‚
β”‚  β”‚   subscribersDB      β”‚         β”‚  new-streaming-txns  β”‚              β”‚
β”‚  β”‚   1,083,131 records  β”‚         β”‚    98,732 records    β”‚              β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              β”‚                                β”‚
              β–Ό                                β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      ETL PIPELINE (TRANSFORM)                           β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  β€’ Batch processing (50K records/batch)                                 β”‚
β”‚  β€’ Sport inference for orphaned assets (161,588 recovered)              β”‚
β”‚  β€’ User β†’ Country mapping via postal codes                              β”‚
β”‚  β€’ Vectorized operations + hashmap lookups                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              β”‚
              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    STAR SCHEMA DATAMART (LOAD)                          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                         β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
β”‚   β”‚  dim_date   │───▢│  fact_streaming_summary │◀───│ dim_country β”‚     β”‚
β”‚   β”‚ 1,752 days  β”‚    β”‚     (PARTITIONED)       β”‚    β”‚ 4 countries β”‚     β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
β”‚                                                                         β”‚
β”‚                    MySQL 8.0 on Aiven Cloud                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Data Pipeline

Three-stage ETL process with data recovery

πŸ“₯

Extract

Pull streaming transactions from SQLite operational database and CSV exports from web/app clients.

1,181,863 source records
βš™οΈ

Transform

Map users to countries, infer sports from asset prefixes, aggregate daily metrics, handle orphaned records.

161,588 records recovered
πŸ“€

Load

Bulk insert into partitioned fact tables with ON DUPLICATE KEY UPDATE for idempotent loads.

~51 seconds total

Year-over-Year Growth

Streaming transaction volume by year (2021-2025)

Annual Streaming Transactions
156K
2021
198K
2022
268K
2023
312K
2024
213K*
2025
1.15M
Total Streams
+100%
Growth 2021-2024
312K
Peak Year (2024)
4
Years of Data

Sample Analytics

Key insights from the data warehouse

Streaming by Sport

Sport Streams Share
Ice Hockey 687,234 59.9%
Ski Jumping 298,451 26.0%
Inline Hockey 161,994 14.1%

Top Markets

Country Streams Share
Deutschland 687,234 59.9%
Osterreich 245,891 21.4%
Schweiz 156,432 13.6%
Liechtenstein 58,122 5.1%

Star Schema Design

Dimensional model optimized for analytics queries

fact_streaming_summary FACT
PK date_id INT
PK country_id INT
PK sport_name VARCHAR
transaction_count INT
unique_user_count INT
total_minutes INT
Partitioned by Year
dim_date DIM
PK date_id INT
full_date DATE
year, quarter, month INT
week, day_of_week INT
1,752 rows
dim_country DIM
PK country_id INT
country_name VARCHAR
4 rows (DE, AT, CH, LI)
dim_sport DIM
PK sport_id INT
sport_name VARCHAR
3 rows (Hockey, Ski, Inline)

ETL Pipeline

Data flow from operational sources to analytics datamart

πŸ“₯
Extract
SQLite + CSV sources
1.18M rows
β†’
βš™οΈ
Transform
Infer sports, map countries
161K recovered
β†’
πŸ“€
Load
Bulk insert to MySQL
50K/batch
β†’
πŸ“Š
Analyze
BI Reports
R Markdown

Key Design Decisions

Architectural choices and their rationale

πŸ“…
Daily Granularity
Problem: What level of aggregation balances storage vs. flexibility?
Solution: 99% of queries operate at daily+ level. Daily aggregation reduces storage while supporting most analytics.
GROUP BY date_id, country_id, sport_name
🏷️
Denormalized sport_name
Problem: Should sport be a FK or denormalized in fact table?
Solution: 90% of queries need sport name. Denormalization avoids JOINs with acceptable redundancy.
sport_name VARCHAR(50) -- not sport_id FK
πŸ“¦
Range Partitioning
Problem: How to optimize time-based queries on 1M+ rows?
Solution: Yearly range partitions enable partition pruning. Queries for 2024 only scan 2024 partition.
PARTITION BY RANGE (date_id)
πŸ”
Conservative Data Exclusion
Problem: 17% of records have unrecognizable asset prefixes.
Solution: Exclude rather than guess. 97.1% retention is acceptable; guessing would compromise data quality.
-- Excluded: OXXX-, MSL- (unknown sports)

Tech Stack

MySQL 8.0
SQLite
R 4.3+
RMySQL
RSQLite
DBI
R Markdown
kableExtra
Aiven Cloud
Table Partitioning