Modern Data Engineering: Designing a Real-time Architecture

Mostafa Nabil
6 min read · Feb 17, 2025


TL;DR: I built a data pipeline that processes Formula 1 racing data using modern data engineering tools: Python for data collection, PostgreSQL for storage, Kafka and Debezium for streaming, and DuckDB with dbt for analytics. While the source data is batch, I implemented a real-time architecture for learning purposes.

The Journey Begins: Building a Data Pipeline

Have you ever wondered how Formula 1 teams handle their massive data streams? While building my data engineering portfolio, I decided to tackle this challenge by creating a data pipeline that processes F1 racing data. In this series, I’ll share my journey of building a modern data stack that handles real-time data processing, even though I’m running it in batch mode for learning purposes.

Why Formula 1 Data?

When learning data engineering, finding the right dataset is crucial. Many tutorials use simple e-commerce or social media datasets, but I wanted something more challenging and engaging. Formula 1 racing data offers unique characteristics that make it perfect for learning:

  • Complex Relationships: The intricate connections between drivers, races, lap times, and weather conditions mirror real-world data complexity
  • Time-Series Nature: Race data provides natural time-series patterns, perfect for learning temporal data handling
  • Multiple Data Sources: Combining API data with web scraping teaches diverse data collection methods
  • Rich Historical Data: Years of historical race data allow for interesting analytical possibilities

The Data Challenge

Every F1 race weekend is a data goldmine, generating various types of information:

  • Real-time lap times and sector splits
  • Driver positions and pit stop strategies
  • Dynamic weather conditions affecting performance
  • Race control messages and incidents
  • Qualifying and race results with detailed timings

Traditional data processing might struggle with this complexity, so I needed a robust system that could:

  1. Reliably collect data from multiple sources
  2. Store it efficiently while maintaining relationships
  3. Process it in near real-time
  4. Make it easily accessible for analysis

Modern Data Stack: A Learning Approach

Rather than using a simple ETL pipeline, I chose to build a more sophisticated architecture that reflects modern industry practices. Here’s my tech stack and why I chose each component:

Data Collection Layer

  • Python with Async Programming: For efficient, concurrent API calls (see the sketch after this list)
  • Beautiful Soup: For structured web scraping
  • Poetry: For dependency management
  • Supervisord: For reliable process management
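
To make this concrete, here’s a minimal sketch of what the async collection approach can look like, assuming a hypothetical laps endpoint and results page (the real pipeline’s URLs and parsing logic differ):

# Sketch: concurrent API collection plus scraping. URLs and the session id are placeholders.
import asyncio

import aiohttp
from bs4 import BeautifulSoup

API_URL = "https://api.example.com/f1/sessions/{session_id}/laps"  # placeholder endpoint
RESULTS_URL = "https://www.example.com/f1/results/2024"            # placeholder page

async def fetch_laps(session: aiohttp.ClientSession, session_id: int) -> list:
    """Fetch lap data for one session; concurrent calls reuse the same HTTP session."""
    async with session.get(API_URL.format(session_id=session_id)) as resp:
        resp.raise_for_status()
        return await resp.json()

async def scrape_results(session: aiohttp.ClientSession) -> list:
    """Scrape result links from a results page with Beautiful Soup."""
    async with session.get(RESULTS_URL) as resp:
        soup = BeautifulSoup(await resp.text(), "html.parser")
        return [link.get_text(strip=True) for link in soup.select("table td a")]

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        laps, results = await asyncio.gather(
            fetch_laps(session, session_id=9158),  # placeholder session id
            scrape_results(session),
        )
        print(len(laps), "lap records,", len(results), "result rows")

if __name__ == "__main__":
    asyncio.run(main())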

Storage & Streaming Layer

  • PostgreSQL: Primary OLTP database for operational data
  • Debezium: Change Data Capture (CDC) for streaming updates (connector setup sketched after this list)
  • Apache Kafka: Enterprise-grade message streaming platform
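
For illustration, here’s a hedged sketch of how a Debezium PostgreSQL connector can be registered through the Kafka Connect REST API (Debezium 2.x configuration keys); the hostnames, credentials, table list, and connector name are placeholders, not the project’s actual configuration:

# Sketch: register a Debezium PostgreSQL connector via the Kafka Connect REST API.
# All connection details and names below are placeholders.
import json

import requests

connector = {
    "name": "f1-postgres-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",            # PostgreSQL logical decoding plugin
        "database.hostname": "localhost",
        "database.port": "5432",
        "database.user": "f1_user",
        "database.password": "f1_password",
        "database.dbname": "f1",
        "topic.prefix": "f1",                 # topics become f1.<schema>.<table>
        "table.include.list": "public.lap_times,public.positions,public.weather",
    },
}

# Kafka Connect's REST interface usually listens on port 8083.
resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print("Registered connector:", resp.json()["name"])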

Analytics Layer

  • DuckDB: High-performance analytical database
  • dbt: Data transformation and modeling
  • Medallion Architecture: Bronze/silver/gold layers for data quality management (see the sketch after this list)
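
To give a feel for the analytics layer, here’s a minimal sketch of a medallion-style layout in a local DuckDB file; the schema and table names are illustrative, and in the project the silver and gold transformations live in dbt models rather than hand-written SQL:

# Sketch: bronze/silver/gold schemas in a local DuckDB file (illustrative names).
import duckdb

con = duckdb.connect("f1_analytics.duckdb")  # placeholder database file

# Bronze holds raw ingested data, silver cleaned data, gold analytics-ready marts.
for schema in ("bronze", "silver", "gold"):
    con.execute(f"CREATE SCHEMA IF NOT EXISTS {schema}")

# Stand-in for raw lap times; the real bronze layer is loaded by the pipeline.
con.execute("""
    CREATE OR REPLACE TABLE bronze.lap_times_raw AS
    SELECT * FROM (VALUES ('VER', 1, '92345'), ('HAM', 1, NULL))
        AS t(driver_id, lap_number, lap_time_ms)
""")

# A cleaned silver table; in the project this transformation would be a dbt model.
con.execute("""
    CREATE OR REPLACE TABLE silver.lap_times AS
    SELECT driver_id, lap_number, CAST(lap_time_ms AS INTEGER) AS lap_time_ms
    FROM bronze.lap_times_raw
    WHERE lap_time_ms IS NOT NULL
""")

print(con.execute("SELECT * FROM silver.lap_times").fetchall())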

Key Technology Decisions Explained

Analytics Engine: DuckDB vs Cloud Data Warehouses

When choosing an analytics engine, cost and development experience were my main concerns. While I had experience with AWS Redshift in my previous projects, I found that cloud data warehouses can quickly become expensive, especially during the development phase where you’re constantly iterating and testing.

For instance, with Redshift:

  • Basic clusters start at $0.25 per hour
  • Query costs add up during development
  • Dev/test environments double the costs

Instead, I chose DuckDB because:

  • Zero cost for development and testing
  • Local development means faster iteration
  • Great performance for my data volume

In a larger-scale production setting, a cloud data warehouse like Redshift, Snowflake, or BigQuery might be more suitable, but DuckDB is great for local and small-scale use.

Real-time Architecture for Batch Processing

An interesting aspect of my project is using real-time technologies (Kafka, Debezium) even though my source data is batch.

Image sourced from Spot Intelligence

Why this approach? Because:

  1. Learning Value: Understand streaming concepts, practice with industry tools and learn event-driven patterns
  2. Future Readiness: Ready for real-time sources, scalable architecture, and flexible processing

OLTP vs OLAP: Understanding the Difference

OLTP (Online Transaction Processing) systems are optimized for handling real-time transactions and operational data. They are designed to efficiently record and update data as events happen. On the other hand, OLAP (Online Analytical Processing) systems are optimized for complex querying and data analysis, allowing users to gain insights from large datasets and historical trends.

Image sourced from Imaginary Cloud.

OLTP Operations:

  • Recording lap times as they happen
  • Updating driver positions
  • Logging pit stops

OLAP Analysis:

  • Analyzing tire strategy impact
  • Comparing historical performance
  • Finding weather impact patterns

By using both OLTP and OLAP systems, we get the best of both worlds: OLTP handles real-time operations with ease, while OLAP enables analyses that would be difficult in a transactional system. This dual approach turns operational data into actionable insights, supporting better decisions both in real time and over the long term.
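
As a rough illustration of the two workload shapes, the sketch below contrasts a row-at-a-time write with an aggregate query. It uses an in-memory DuckDB database purely to keep the example self-contained; in the pipeline, transactional writes go to PostgreSQL and analytical queries run in DuckDB. Table and column names are made up:

# Sketch: OLTP-style vs OLAP-style operations on an illustrative lap_times table.
import duckdb

con = duckdb.connect()  # in-memory database, for illustration only

con.execute("""
    CREATE TABLE lap_times (
        driver_id VARCHAR, lap_number INTEGER, lap_time_ms INTEGER, tyre VARCHAR
    )
""")

# OLTP shape: record a single lap as it happens (small, frequent writes).
con.execute(
    "INSERT INTO lap_times VALUES (?, ?, ?, ?)",
    ["VER", 12, 92345, "medium"],
)

# OLAP shape: scan many rows to compare tyre strategies (large reads, aggregation).
print(con.execute("""
    SELECT tyre, AVG(lap_time_ms) AS avg_lap_time_ms, COUNT(*) AS laps
    FROM lap_times
    GROUP BY tyre
    ORDER BY avg_lap_time_ms
""").fetchall())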

CDC and SCD: Tracking Changes in Racing Data

Change Data Capture (CDC)

CDC ensures that real-time updates from our F1 pipeline are captured and transferred for analysis. When changes occur in PostgreSQL — like a position change (e.g., P1 — Hamilton → P1 — Verstappen) or weather updates (track temperature, rainfall, wind speed) — CDC detects and records these events.

  1. PostgreSQL Changes: When an event happens, PostgreSQL records the change in its write-ahead log (WAL).
  2. Debezium: Debezium reads the WAL and turns each change into a structured change event.
  3. Kafka Topics: CDC events are published to Kafka topics for downstream consumption.
  4. DuckDB: A consumer processes the events and loads them into DuckDB for analysis.

This flow lets us capture updates as they happen. Once events stream out of PostgreSQL, CDC keeps the data synchronized across the pipeline, so every change flows seamlessly into the analytical layer.
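
To make the flow concrete, here’s a hedged sketch of a consumer that reads Debezium change events from a Kafka topic and applies them to DuckDB; the topic, table, and column names are placeholders, and error handling is kept to a minimum:

# Sketch: consume Debezium CDC events and apply them to DuckDB (placeholder names).
import json

import duckdb
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "f1-duckdb-sync",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["f1.public.positions"])  # placeholder Debezium topic

con = duckdb.connect("f1_analytics.duckdb")
con.execute(
    "CREATE TABLE IF NOT EXISTS positions (driver_id VARCHAR PRIMARY KEY, position INTEGER)"
)

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error() or msg.value() is None:
            continue
        event = json.loads(msg.value())
        payload = event.get("payload", event)        # envelope depends on converter settings
        op, after = payload.get("op"), payload.get("after")
        if op in ("c", "u", "r") and after:          # create, update, snapshot read
            con.execute(
                "INSERT OR REPLACE INTO positions VALUES (?, ?)",
                [after["driver_id"], after["position"]],
            )
        elif op == "d" and payload.get("before"):    # delete
            con.execute(
                "DELETE FROM positions WHERE driver_id = ?",
                [payload["before"]["driver_id"]],
            )
finally:
    consumer.close()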

Slowly Changing Dimensions (SCD)

Image sourced from GeeksforGeeks

How we handle changes:

  • Type 1: Overwrite (current weather)
  • Type 2: Keep history (driver careers)
  • Type 3: Keep limited history by storing the previous value alongside the current one (race results with corrections)

I’ve implemented SCD Type 2 using dbt’s Jinja templating. Think of SCD Type 2 as a time machine for our data.

Let’s see how it works with three special time tracking fields:

  • valid_from: When the record starts being true
  • valid_to: When the record stops being true
  • is_current: Is this the current version? (true/false)

Example for Driver team changes:

In my F1 pipeline, SCD Type 2 captures how Ricciardo’s team changed on paper from AlphaTauri to RB in 2024, even though it was essentially the same team rebranded. This shows how the method tracks both driver moves and team identity changes over time.

{
  "driver_id": "RIC",
  "driver_number": 3,
  "team": "AlphaTauri",
  "valid_from": "2023-07-01",
  "valid_to": "2023-12-31",
  "is_current": false
},
{
  "driver_id": "RIC",
  "driver_number": 3,
  "team": "RB",
  "valid_from": "2024-01-01",
  "valid_to": null,
  "is_current": true
}

These fields work together to tell the complete story of our data.

The valid_from field acts as our historical starting point, marking the exact moment when each change occurred in a driver's career. When combined with valid_to, we create a precise timeline of changes, where a null valid_to date tells us "this is still true today."

The is_current flag makes it incredibly efficient to find the latest information, which is particularly useful for dashboard displays and current status reports. Together, these fields give us the power to time-travel through our data, analyzing any point in history or tracking changes over time. Think of it as having a complete historical record that's also easy to query for current information.
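
To show how these fields are used in practice, here’s a small sketch of the two typical query patterns against such a dimension, using the Ricciardo rows above; dim_drivers is an illustrative name, and in the project the table itself is built with dbt:

# Sketch: querying an SCD Type 2 dimension (illustrative table built in-memory).
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE dim_drivers AS
    SELECT * FROM (VALUES
        ('RIC', 3, 'AlphaTauri', DATE '2023-07-01', DATE '2023-12-31', false),
        ('RIC', 3, 'RB',         DATE '2024-01-01', NULL,              true)
    ) AS t(driver_id, driver_number, team, valid_from, valid_to, is_current)
""")

# Current state: the is_current flag makes "latest version" lookups trivial.
print(con.execute("SELECT driver_id, team FROM dim_drivers WHERE is_current").fetchall())

# Point-in-time ("time machine") query: which team was RIC with on 2023-11-26?
print(con.execute("""
    SELECT team FROM dim_drivers
    WHERE driver_id = 'RIC'
      AND valid_from <= DATE '2023-11-26'
      AND (valid_to IS NULL OR valid_to >= DATE '2023-11-26')
""").fetchall())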

Looking Forward

Building this F1 data pipeline has been both challenging and rewarding. Throughout this journey, I’ve learned that while modern data architecture can seem daunting at first, each component plays a vital role in handling real-world data challenges. From choosing development-friendly tools like DuckDB to implementing patterns with Kafka and Debezium, every decision was made with both learning and scalability in mind.

The key insights from this architecture:

  • Starting with the right tools can significantly impact your learning curve
  • Understanding batch and real-time processing provides valuable architectural insights
  • Implementing proper historical tracking (SCD) is crucial for data integrity
  • Building with scalability in mind prepares you for real-world challenges

In the next part, we’ll explore the data collection layer in detail. I’ll share practical implementations of Python async programming, efficient web scraping techniques, and robust error handling strategies. We’ll examine real code examples and discuss solutions to common data collection challenges.

Share your thoughts on modern data architecture in the comments!
