Modern data architectures have evolved from simple ETL pipelines into sophisticated, multi-layered platforms that support batch, streaming, and machine learning workloads simultaneously.
This section outlines the main paradigms and architectural models used in contemporary data systems.
⚙️ ETL vs ELT
🧩 1. Extract
Goal: collect data from multiple heterogeneous sources — APIs, databases, files, logs, sensors, etc.
How it works:
- Connectors or agents pull data from various systems (e.g., PostgreSQL, Salesforce, S3).
- Data is extracted in its raw form, without transformations.
- Often performed incrementally (only new or changed records are fetched).
Typical tools:
- Fivetran, Airbyte, Apache NiFi, AWS Glue, Kafka Connect
- SQL queries, REST APIs, CDC (Change Data Capture)
💡 Example:
SELECT * FROM sales WHERE updated_at > last_sync_time;
→ This pulls the latest records from sales into a staging layer.
⚙️ 2. Load
Goal: store the extracted data in a target system — usually a Data Lake, Data Warehouse, or staging area.
How it works:
- Data is loaded into the raw or staging zone without modifications.
- Common formats: Parquet, ORC, Avro, JSON.
- In streaming pipelines (e.g., Kafka → Delta Lake), loading happens continuously.
Typical tools:
- Amazon S3, Azure Data Lake Storage, Google BigQuery, Snowflake, Delta Lake
- dbt, Airflow, Databricks Autoloader
💡 Example:
Raw data → s3://data-lake/raw/sales/2025-10-06/
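As a minimal sketch of this step, assuming a Snowflake-style warehouse with a hypothetical external stage `@raw_stage` pointing at that bucket and a raw table `raw.sales`, a bulk load could look like:

```sql
-- Illustrative only: load Parquet files from the raw zone into a staging
-- table as-is, with no transformations applied.
-- @raw_stage and raw.sales are assumed names, not from the original text.
COPY INTO raw.sales
FROM @raw_stage/sales/2025-10-06/
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```

A continuous equivalent (for example, Databricks Autoloader picking up new files as they arrive) serves the same purpose in streaming pipelines.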
🔄 3. Transform
Goal: clean, normalize, aggregate, and prepare data for analytics or machine learning.
How it works:
- Deduplication, type casting, filling nulls, standardizing formats.
- Building business models such as facts, dimensions, and data marts.
- Executed via SQL transformations or distributed compute engines like Spark/Flink.
Typical tools:
- dbt, Apache Spark, Databricks, Flink, Snowpark, SQL
💡 Example:
CREATE TABLE IF NOT EXISTS clean.sales AS
SELECT customer_id,
       SUM(amount) AS total_sales,
       DATE_TRUNC('month', sale_date) AS month
FROM raw.sales
GROUP BY customer_id, month;
🧠 Modern evolution
The newer paradigm ELT (Extract → Load → Transform) reverses the last two steps. Data is first loaded as-is into the warehouse or data lake and then transformed in-place using its computational power (e.g., in Snowflake, BigQuery, or Databricks).
This approach improves scalability and reduces load on source systems.
| Aspect | ETL (Extract → Transform → Load) | ELT (Extract → Load → Transform) |
|---|---|---|
| Process flow | Data is extracted, transformed on an external engine, and then loaded into the data warehouse. | Raw data is first loaded into a data lake or warehouse, then transformed in place. |
| Best for | Traditional data warehouses with limited compute resources. | Modern cloud-based architectures (Snowflake, Databricks, BigQuery). |
| Tools | Informatica, Talend, SSIS. | dbt, Spark SQL, Databricks workflows. |
| Advantages | Early data validation, strict schema enforcement. | Scales better, cheaper storage, supports semi/unstructured data. |
| Drawbacks | Difficult to scale, limited flexibility. | May require stronger data governance to control raw data chaos. |
💡 Example:
In a dbt + Snowflake pipeline, you use ELT: raw data lands in a staging schema (raw tables), and dbt transforms it into clean analytical models inside the warehouse.
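A minimal sketch of one such dbt staging model, with illustrative model and column names (not taken from a real project):

```sql
-- models/staging/stg_sales.sql (hypothetical dbt model)
-- dbt compiles the Jinja below and materializes the result inside the
-- warehouse, so the transformation runs where the raw data already lives.
SELECT
    id                              AS sale_id,
    customer_id,
    CAST(amount AS NUMERIC(12, 2))  AS amount,
    CAST(sale_date AS DATE)         AS sale_date
FROM {{ source('raw', 'sales') }}
WHERE amount IS NOT NULL
```

Downstream models then reference it with `{{ ref('stg_sales') }}`, keeping all transformation logic inside the warehouse.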
🏢 Data Warehouse vs Data Lake vs Lakehouse
Data Warehouse
A Data Warehouse is a centralized system designed to store and manage structured, historical data from multiple sources for analytics and reporting.
It integrates data through ETL or ELT processes, organizes it into subject-oriented schemas (like sales, customers, finance), and allows fast querying using SQL.
💡 Typically, a data warehouse supports Business Intelligence (BI), dashboards, and analytical workloads rather than operational transactions.
💡 Examples:
- Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse Analytics.
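As an illustration of the warehouse query pattern, a hypothetical star schema with a `fact_sales` fact table and a `dim_customer` dimension can answer BI questions with plain SQL:

```sql
-- Illustrative star-schema query: monthly revenue per customer segment.
-- Table and column names are assumed for the example.
SELECT d.segment,
       DATE_TRUNC('month', f.sale_date) AS month,
       SUM(f.amount)                    AS revenue
FROM fact_sales f
JOIN dim_customer d
  ON f.customer_id = d.customer_id
GROUP BY d.segment, DATE_TRUNC('month', f.sale_date)
ORDER BY month, d.segment;
```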
Data Lake
A Data Lake is a centralized repository that stores raw, unstructured, semi-structured, and structured data at any scale.
Unlike a data warehouse, it doesn’t require predefined schemas — data is stored as-is (“schema-on-read”) and transformed only when needed for analysis. This flexibility makes it ideal for big data analytics, machine learning, and data exploration.
💡 Data Lakes typically use low-cost, scalable object storage such as Amazon S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage.
Examples of technologies:
- Delta Lake, Apache Iceberg, Apache Hudi — these add ACID transactions, versioning, and schema management to modern data lakes.
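To make "schema-on-read" concrete, here is a sketch in Hive/Athena-style SQL: the files stay untouched in object storage, and a schema is only projected onto them at query time (the path and names are assumptions):

```sql
-- Illustrative schema-on-read: define an external table over Parquet files
-- already sitting in S3. Nothing is copied or rewritten; the schema is
-- applied when the table is queried.
CREATE EXTERNAL TABLE lake_raw_events (
    event_id   STRING,
    user_id    STRING,
    event_time TIMESTAMP,
    payload    STRING
)
STORED AS PARQUET
LOCATION 's3://data-lake/raw/events/';
```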
Lakehouse
A Data Lakehouse is a modern data architecture that combines the flexibility of a Data Lake with the management and performance features of a Data Warehouse.
It allows you to store all types of data — structured, semi-structured, and unstructured — in a single repository, while still supporting ACID transactions, governance, schema enforcement, and fast SQL queries.
💡 The Lakehouse eliminates the traditional split between “data for analysts” (in warehouses) and “data for data scientists” (in lakes), enabling both analytics and AI workloads on the same data.
Typical technologies:
- Databricks Delta Lake, Apache Iceberg, Apache Hudi, Snowflake (with Iceberg tables)
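A minimal sketch of what that warehouse-style management looks like on lake storage, using Delta Lake's SQL MERGE for an ACID upsert (table names are illustrative):

```sql
-- Illustrative Delta Lake upsert: an atomic MERGE directly on tables backed
-- by object storage, something a plain file-based data lake cannot guarantee.
MERGE INTO silver.customers AS t
USING staging.customer_updates AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```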
Comparison
| Feature | Data Warehouse | Data Lake | Lakehouse |
|---|---|---|---|
| Data type | Structured (SQL tables). | All types (structured, semi-structured, unstructured). | Unified — handles all formats. |
| Schema | Schema-on-write (defined upfront). | Schema-on-read (flexible). | Hybrid: schema enforcement + flexibility. |
| Storage | Proprietary, high-performance. | Cheap object storage (S3, ADLS, GCS). | Object storage with ACID transactions. |
| Processing | Batch SQL queries. | Batch + Streaming. | Batch + Streaming + ML. |
| Use cases | Business intelligence, reporting. | Data exploration, ML training. | Unified analytics and machine learning. |
| Examples | Snowflake, Redshift, BigQuery. | Hadoop, S3, ADLS. | Databricks Lakehouse, Delta Lake, Apache Iceberg. |
💡 In practice:
Modern companies integrate the three — collecting raw data in a lake, curating and modeling it into warehouse-style tables, and running analytics and ML on top of a lakehouse architecture.
⚡ Lambda vs Kappa Architectures
These are two models for handling real-time and batch data processing.
Lambda Architecture
Combines batch processing for accuracy and stream processing for real-time results.
```mermaid
flowchart TD
    classDef source fill:#cce5ff,stroke:#003366,stroke-width:1px,color:#003366,font-weight:bold
    classDef batch fill:#e6ffe6,stroke:#006600,stroke-width:1px,color:#003300,font-weight:bold
    classDef speed fill:#fff0b3,stroke:#b38f00,stroke-width:1px,color:#664d00,font-weight:bold
    classDef serving fill:#ffe6e6,stroke:#990000,stroke-width:1px,color:#660000,font-weight:bold
    A(["🌐 Data Stream"]):::source
    B(["🗂️ Batch Layer<br/>(HDFS, Spark, Hive, dbt)"]):::batch
    C(["⚡ Speed Layer<br/>(Kafka, Flink, Spark Streaming)"]):::speed
    D(["📊 Serving Layer"]):::serving
    A --> B
    A --> C
    B --> D
    C --> D
```
- Pros: Accurate and near-real-time results; fault tolerance.
- Cons: Duplicate logic in the batch and speed layers (hard to maintain).
- Used by: Legacy systems (the early architectures at Netflix and LinkedIn).
Kappa Architecture
Simplifies the Lambda model by removing the batch layer.
All processing — real-time and historical — is done through the streaming pipeline.
```mermaid
flowchart TD
    classDef source fill:#cce5ff,stroke:#004080,stroke-width:1px,color:#00264d,font-weight:bold
    classDef process fill:#e6ffe6,stroke:#007a00,stroke-width:1px,color:#003300,font-weight:bold
    classDef storage fill:#fff0b3,stroke:#b38f00,stroke-width:1px,color:#664d00,font-weight:bold
    A(["🌐 Data Stream<br/>(Kafka, Kinesis)"]):::source
    B(["⚙️ Stream Processor<br/>(Flink, Spark, Beam)"]):::process
    C(["💾 Serving / Storage<br/>(Delta, Cassandra)"]):::storage
    A --> B --> C
```
- Pros: Simpler, consistent logic; a true real-time architecture.
- Cons: Reprocessing historical data is complex if the schema changes (events must be replayed through the pipeline).
- Used by: Modern streaming-first companies (Uber, Twitter, Confluent).
💡 In short:
- Lambda = batch + streaming.
- Kappa = streaming-only; reprocesses by replaying all events.
🧩 Components of a Modern Data System
A modern data platform is built from modular components that handle each stage of the data lifecycle:
| Layer | Main Responsibility | Common Tools |
|---|---|---|
| Data Ingestion | Collect and move data from sources. | Kafka, Kinesis, Airbyte, Fivetran. |
| Storage | Persist raw and curated data. | S3, ADLS, Delta Lake, Iceberg. |
| Processing | Transform, clean, aggregate. | Spark, Flink, dbt, Beam. |
| Metadata & Governance | Manage schema, lineage, catalog. | Hive Metastore, Unity Catalog, Amundsen, DataHub. |
| Orchestration | Coordinate and schedule pipelines. | Airflow, Dagster, Prefect. |
| Serving & Analytics | Query, visualize, and build ML models. | Trino, Databricks SQL, Power BI, MLflow. |
This modularity allows teams to evolve their platforms incrementally — swapping tools without rewriting the entire stack.
🥇 Medallion Architecture (Bronze–Silver–Gold)
The Medallion Architecture (developed by Databricks) organizes data into logical layers that improve data quality, traceability, and governance.
```mermaid
flowchart TD
    classDef bronze fill:#fce5cd,stroke:#b45f06,stroke-width:1px,color:#783f04,font-weight:bold
    classDef silver fill:#d9d9d9,stroke:#7f7f7f,stroke-width:1px,color:#404040,font-weight:bold
    classDef gold fill:#fff2cc,stroke:#b38f00,stroke-width:1px,color:#665c00,font-weight:bold
    C(["🥉 BRONZE Layer<br/>(Raw, ingested data)"]):::bronze
    B(["🥈 SILVER Layer<br/>(Cleaned, conformed data)"]):::silver
    A(["🥇 GOLD Layer<br/>(Business-ready data, BI)"]):::gold
    C --> B --> A
```
| Layer | Purpose | Example |
|---|---|---|
| Bronze | Raw ingestion from sources (no transformations). | Kafka → Delta Lake raw tables. |
| Silver | Data cleaned, deduplicated, joined with reference data. | Remove nulls, enrich user IDs with profile data. |
| Gold | Aggregated and business-ready datasets. | KPI dashboards, machine learning features. |
💡 Typical flow:
Raw IoT telemetry → Bronze (raw JSON) → Silver (cleaned sensor readings) → Gold (aggregated device performance metrics).
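A minimal sketch of the final Silver → Gold step in that flow, assuming Delta tables named `silver.sensor_readings` and `gold.device_performance` (illustrative names):

```sql
-- Illustrative Gold-layer build: roll cleaned sensor readings up into
-- daily per-device performance metrics ready for dashboards.
CREATE OR REPLACE TABLE gold.device_performance AS
SELECT device_id,
       DATE_TRUNC('day', reading_time) AS day,
       AVG(temperature)                AS avg_temperature,
       COUNT(*)                        AS reading_count
FROM silver.sensor_readings
GROUP BY device_id, DATE_TRUNC('day', reading_time);
```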
🌐 Summary
- ETL transforms before loading (best for legacy systems); ELT transforms inside the target system (ideal for cloud platforms).
- Data Warehouse → Data Lake → Lakehouse represents the natural evolution of analytical architectures.
- Lambda vs. Kappa: hybrid batch + streaming vs. unified streaming.
- Modern data platforms are modular, cloud-native, and automation-driven.
- Medallion architecture standardizes data curation and quality.