Modern Data Engineering: Designing a Real-time Architecture
TL;DR: I built a data pipeline that processes Formula 1 racing data using modern data engineering tools: Python for data collection, PostgreSQL for storage, Kafka and Debezium for streaming, and DuckDB with dbt for analytics. While the source data is batch, I implemented a real-time architecture for learning purposes.
The Journey Begins: Building a Data Pipeline
Have you ever wondered how Formula 1 teams handle their massive data streams? While building my data engineering portfolio, I decided to tackle this challenge by creating a data pipeline that processes F1 racing data. In this series, I’ll share my journey of building a modern data stack that handles real-time data processing, even though I’m running it in batch mode for learning purposes.
Why Formula 1 Data?
When learning data engineering, finding the right dataset is crucial. Many tutorials use simple e-commerce or social media datasets, but I wanted something more challenging and engaging. Formula 1 racing data offers unique characteristics that make it perfect for learning:
- Complex Relationships: The intricate connections between drivers, races, lap times, and weather conditions mirror real-world data complexity
- Time-Series Nature: Race data provides natural time-series patterns, perfect for learning temporal data handling
- Multiple Data Sources: Combining API data with web scraping teaches diverse data collection methods
- Rich Historical Data: Years of historical race data allow for interesting analytical possibilities
The Data Challenge
Every F1 race weekend is a data goldmine, generating various types of information:
- Real-time lap times and sector splits
- Driver positions and pit stop strategies
- Dynamic weather conditions affecting performance
- Race control messages and incidents
- Qualifying and race results with detailed timings
Traditional data processing might struggle with this complexity, so I needed a robust system that could:
- Reliably collect data from multiple sources
- Store it efficiently while maintaining relationships
- Process it in near real-time
- Make it easily accessible for analysis
Modern Data Stack: A Learning Approach
Rather than using a simple ETL pipeline, I chose to build a more sophisticated architecture that reflects modern industry practices. Here’s my tech stack and why I chose each component:
Data Collection Layer
- Python with Async Programming: For efficient, concurrent API calls (a short collection sketch follows this list)
- Beautiful Soup: For structured web scraping
- Poetry: For dependency management
- Supervisord: For reliable process management
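To give a feel for this layer, here is a minimal sketch of async collection plus a small Beautiful Soup parser, assuming an OpenF1-style REST API and a simple HTML results table; the base URL, endpoint names, session_key value, and table structure are illustrative assumptions rather than the pipeline's actual code.

```python
# A minimal sketch of concurrent API collection and a small Beautiful Soup parser.
# The base URL, endpoint names, and table structure are assumptions for illustration.
import asyncio
import aiohttp
from bs4 import BeautifulSoup

BASE_URL = "https://api.openf1.org/v1"  # assumed OpenF1-style API

async def fetch_json(session: aiohttp.ClientSession, path: str, **params) -> list[dict]:
    """Fetch one API resource and return the decoded JSON payload."""
    async with session.get(f"{BASE_URL}/{path}", params=params) as resp:
        resp.raise_for_status()
        return await resp.json()

async def fetch_session_data(session_key: int) -> dict:
    """Collect several resources for one race session concurrently."""
    async with aiohttp.ClientSession() as session:
        laps, positions, weather = await asyncio.gather(
            fetch_json(session, "laps", session_key=session_key),
            fetch_json(session, "position", session_key=session_key),
            fetch_json(session, "weather", session_key=session_key),
        )
    return {"laps": laps, "positions": positions, "weather": weather}

def parse_results_table(html: str) -> list[dict]:
    """Scrape a simple results table; the row/cell layout is assumed."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.select("table tr"):
        cells = row.find_all("td")
        if len(cells) >= 2:
            results.append({
                "position": cells[0].get_text(strip=True),
                "driver": cells[1].get_text(strip=True),
            })
    return results

if __name__ == "__main__":
    data = asyncio.run(fetch_session_data(session_key=9158))
    print(len(data["laps"]), "lap records collected")
```

The point of asyncio.gather here is that the three network calls overlap instead of running one after another, which is where most of the speedup comes from.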
Storage & Streaming Layer
- PostgreSQL: Primary OLTP database for operational data
- Debezium: Change Data Capture for streaming row-level updates out of PostgreSQL (see the connector sketch after this list)
- Apache Kafka: Enterprise-grade message streaming platform
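As a rough illustration of how this layer is wired together, here is a sketch of registering a Debezium PostgreSQL connector through the Kafka Connect REST API. The hostnames, credentials, table list, and topic prefix are assumptions, and property names can vary by Debezium version (older releases use database.server.name instead of topic.prefix).

```python
# A sketch of registering a Debezium PostgreSQL connector with Kafka Connect.
# Hostnames, credentials, database and table names are assumptions for illustration.
import json
import requests

KAFKA_CONNECT_URL = "http://localhost:8083/connectors"

connector_config = {
    "name": "f1-postgres-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "f1_user",
        "database.password": "f1_password",
        "database.dbname": "f1_db",
        "topic.prefix": "f1",  # change events land on topics named f1.<schema>.<table>
        "table.include.list": "public.lap_times,public.positions,public.weather",
        "plugin.name": "pgoutput",  # logical decoding plugin built into PostgreSQL 10+
    },
}

response = requests.post(
    KAFKA_CONNECT_URL,
    data=json.dumps(connector_config),
    headers={"Content-Type": "application/json"},
)
response.raise_for_status()
print("Connector registered:", response.json()["name"])
```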
Analytics Layer
- DuckDB: High-performance analytical database
- dbt: Data transformation and modeling
- Medallion Architecture: For data quality management
Key Technology Decisions Explained
Analytics Engine: DuckDB vs Cloud Data Warehouses
When choosing an analytics engine, cost and development experience were my main concerns. While I had experience with AWS Redshift in previous projects, I found that cloud data warehouses can quickly become expensive, especially during the development phase, when you're constantly iterating and testing.
For instance, with Redshift:
- Basic clusters start at $0.25 per hour
- Query costs add up during development
- Dev/test environments double the costs
Instead, I chose DuckDB because:
- Zero cost for development and testing
- Local development means faster iteration
- Great performance for my data volume
In a larger-scale production setting, a cloud data warehouse like Redshift, Snowflake, or BigQuery might be more suitable, but DuckDB is a great fit for local and small-scale use.
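As a quick illustration of why this matters for iteration speed, here is a minimal DuckDB sketch; the database file name, Parquet export path, and column names are assumptions.

```python
# A minimal sketch of local analytics with DuckDB; file and column names are assumptions.
import duckdb

con = duckdb.connect("f1_analytics.duckdb")  # a single local file, nothing to provision

# Query Parquet exports directly, no load step or running warehouse required.
fastest_laps = con.execute(
    """
    SELECT driver_id, MIN(lap_time_ms) AS best_lap_ms
    FROM read_parquet('exports/lap_times/*.parquet')
    GROUP BY driver_id
    ORDER BY best_lap_ms
    LIMIT 10
    """
).fetchdf()
print(fastest_laps)
```

Everything runs in-process against a single file, so there is no cluster to spin up, pause, or pay for between test runs.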
Real-time Architecture for Batch Processing
An interesting aspect of my project is using real-time technologies (Kafka, Debezium) even though my source data is batch.
Why this approach? Because:
- Learning Value: Understand streaming concepts, practice with industry tools and learn event-driven patterns
- Future Readiness: Ready for real-time sources, scalable architecture, and flexible processing
OLTP vs OLAP: Understanding the Difference
OLTP (Online Transaction Processing) systems are optimized for handling real-time transactions and operational data. They are designed to efficiently record and update data as events happen. On the other hand, OLAP (Online Analytical Processing) systems are optimized for complex querying and data analysis, allowing users to gain insights from large datasets and historical trends.
OLTP Operations:
- Recording lap times as they happen
- Updating driver positions
- Logging pit stops
OLAP Analysis:
- Analyzing tire strategy impact
- Comparing historical performance
- Finding weather impact patterns
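To make the contrast concrete, here is a hedged sketch of the two workloads side by side; the table and column names are assumptions, with psycopg2 standing in for the operational write path and DuckDB for the analytical read path.

```python
# A minimal sketch contrasting OLTP and OLAP workloads; table/column names are assumptions.
import psycopg2
import duckdb

# OLTP: record a single lap time the moment it happens (row-oriented, transactional).
with psycopg2.connect("dbname=f1_db user=f1_user password=f1_password host=localhost") as conn:
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO lap_times (session_key, driver_id, lap_number, lap_time_ms) "
            "VALUES (%s, %s, %s, %s)",
            (9158, "VER", 12, 93451),
        )  # the connection context manager commits the transaction on exit

# OLAP: scan the whole history to compare average pace per driver (column-oriented, analytical).
con = duckdb.connect("f1_analytics.duckdb")
avg_pace = con.execute(
    """
    SELECT driver_id, AVG(lap_time_ms) AS avg_lap_ms, COUNT(*) AS laps
    FROM lap_times
    GROUP BY driver_id
    ORDER BY avg_lap_ms
    """
).fetchdf()
print(avg_pace)
```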
By using both OLTP and OLAP systems, we get the best of both worlds: OLTP handles real-time operations with ease, while OLAP enables analyses that would be difficult in a transactional system. This dual approach turns raw operational data into actionable insights, both in real time and over the long term.
CDC and SCD: Tracking Changes in Racing Data
Change Data Capture (CDC)
CDC ensures that real-time updates from our F1 pipeline are captured and transferred for analysis. When changes occur in PostgreSQL — like a position change (e.g., P1 — Hamilton → P1 — Verstappen) or weather updates (track temperature, rainfall, wind speed) — CDC detects and records these events.
- PostgreSQL Changes: When an event happens, PostgreSQL detects the change.
- Debezium: The change is sent to Debezium, which streams the data changes.
- Kafka Topics: CDC events are published to Kafka topics for consumption.
- DuckDB: The data is then processed and transferred to DuckDB for analysis.
This flow captures updates as they happen: once PostgreSQL emits a change, CDC keeps the data synchronized across the pipeline, ensuring that updates flow seamlessly into the analytical layer.
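For a sense of what the consuming side might look like, here is a sketch using the kafka-python client (confluent-kafka would work just as well); the topic name follows the f1.public.&lt;table&gt; convention assumed earlier, and the envelope handling depends on the connector's converter settings.

```python
# A minimal sketch of consuming Debezium change events; topic name and group id are assumptions.
import json
from kafka import KafkaConsumer  # kafka-python client

consumer = KafkaConsumer(
    "f1.public.positions",
    bootstrap_servers="localhost:9092",
    group_id="duckdb-loader",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
)

for message in consumer:
    event = message.value
    if event is None:  # tombstone records mark deletions after log compaction
        continue
    payload = event.get("payload", event)  # unwrap the schema envelope if present
    op = payload.get("op")                 # c=create, u=update, d=delete, r=snapshot read
    before, after = payload.get("before"), payload.get("after")
    print(f"op={op} before={before} after={after}")
    # From here, rows can be appended to a staging table and merged into DuckDB.
```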
Slowly Changing Dimensions (SCD)
How we handle changes:
- Type 1: Overwrite (current weather)
- Type 2: Keep history (driver careers)
- Type 3: Hybrid, keeping limited history in extra columns (race results with corrections)
I’ve implemented SCD Type 2 using dbt’s Jinja templating. Think of SCD Type 2 as a time machine for our data.
Let’s see how it works with three special time-tracking fields:
- valid_from: When the record starts being true
- valid_to: When the record stops being true
- is_current: Is this the current version? (true/false)
Example: driver team changes
In my F1 pipeline, SCD Type 2 captures how Ricciardo’s team changed from AlphaTauri to RB in 2024, even though it’s essentially the same team rebranded. This shows how the method tracks both driver moves and team identity changes over time:
[
  {
    "driver_id": "RIC",
    "driver_number": 3,
    "team": "AlphaTauri",
    "valid_from": "2023-07-01",
    "valid_to": "2023-12-31",
    "is_current": false
  },
  {
    "driver_id": "RIC",
    "driver_number": 3,
    "team": "RB",
    "valid_from": "2024-01-01",
    "valid_to": null,
    "is_current": true
  }
]
These fields work together to tell the complete story of our data.
The valid_from field acts as our historical starting point, marking the exact moment when each change occurred in a driver's career. When combined with valid_to, we create a precise timeline of changes, where a null valid_to date tells us "this is still true today."
The is_current flag makes it incredibly efficient to find the latest information, which is particularly useful for dashboard displays and current status reports. Together, these fields give us the power to time-travel through our data, analyzing any point in history or tracking changes over time. Think of it as having a complete historical record that's also easy to query for current information.
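As a small illustration of how these fields are queried in practice, here is a sketch against an assumed dim_drivers table in DuckDB: one query for the current state, one for a point-in-time lookup.

```python
# A minimal sketch of querying the SCD Type 2 table; the dim_drivers table name is an assumption.
import duckdb

con = duckdb.connect("f1_analytics.duckdb")

# Current team for every driver: just filter on the is_current flag.
current_teams = con.execute(
    "SELECT driver_id, team FROM dim_drivers WHERE is_current"
).fetchdf()

# Point-in-time lookup: which team was each driver with on a given date?
teams_as_of = con.execute(
    """
    SELECT driver_id, team
    FROM dim_drivers
    WHERE valid_from <= DATE '2023-09-01'
      AND (valid_to IS NULL OR valid_to >= DATE '2023-09-01')
    """
).fetchdf()
print(current_teams)
```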
Looking Forward
Building this F1 data pipeline has been both challenging and rewarding. Throughout this journey, I’ve learned that while modern data architecture can seem daunting at first, each component plays a vital role in handling real-world data challenges. From choosing development-friendly tools like DuckDB to implementing patterns with Kafka and Debezium, every decision was made with both learning and scalability in mind.
The key insights from this architecture:
- Starting with the right tools can significantly impact your learning curve
- Understanding batch and real-time processing provides valuable architectural insights
- Implementing proper historical tracking (SCD) is crucial for data integrity
- Building with scalability in mind prepares you for real-world challenges
In the next part, we’ll explore the data collection layer in detail. I’ll share practical implementations of Python async programming, efficient web scraping techniques, and robust error handling strategies. We’ll examine real code examples and discuss solutions to common data collection challenges.
Share your thoughts on modern data architecture in the comments!