Modern data architectures have evolved from simple ETL pipelines into sophisticated, multi-layered platforms that support batch, streaming, and machine learning workloads simultaneously.
This section outlines the main paradigms and architectural models used in contemporary data systems.

⚙️ ETL vs ELT

🧩 1. Extract

Goal: collect data from multiple heterogeneous sources — APIs, databases, files, logs, sensors, etc.

How it works:

  • Connectors or agents pull data from various systems (e.g., PostgreSQL, Salesforce, S3).

  • Data is extracted in its raw form, without transformations.

  • Often performed incrementally (only new or changed records are fetched).

Typical tools:

  • Fivetran, Airbyte, Apache NiFi, AWS Glue, Kafka Connect

  • SQL queries, REST APIs, CDC (Change Data Capture)

💡 Examples:

-- Incremental pull: last_sync_time is the high-water mark stored from the previous run
SELECT * FROM sales WHERE updated_at > last_sync_time;

→ This pulls the latest records from sales into a staging layer.

⚙️ 2. Load

Goal: store the extracted data in a target system — usually a Data Lake, Data Warehouse, or staging area.

How it works:

  • Data is loaded into the raw or staging zone without modifications.

  • Common formats: Parquet, ORC, Avro, JSON.

  • In streaming pipelines (e.g., Kafka → Delta Lake), loading happens continuously.

Typical tools:

  • Amazon S3, Azure Data Lake Storage, Google BigQuery, Snowflake, Delta Lake

  • dbt, Airflow, Databricks Autoloader

💡 Examples:

Raw data → s3://data-lake/raw/sales/2025-10-06/
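
As a minimal sketch of the batch case (Databricks SQL syntax, assuming a pre-created raw.sales Delta table and the hypothetical bucket path above):

-- Load the day's raw Parquet files into the raw zone table
COPY INTO raw.sales
FROM 's3://data-lake/raw/sales/2025-10-06/'
FILEFORMAT = PARQUET;

In Snowflake the equivalent is COPY INTO from an external stage; in streaming pipelines, tools like Kafka Connect or Databricks Autoloader replace this batch copy with continuous loading.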

🔄 3. Transform

Goal: clean, normalize, aggregate, and prepare data for analytics or machine learning.

How it works:

  • Deduplication, type casting, filling nulls, standardizing formats.

  • Building business models such as facts, dimensions, and data marts.

  • Executed via SQL transformations or distributed compute engines like Spark/Flink.

Typical tools:

  • dbt, Apache Spark, Databricks, Flink, Snowpark, SQL

💡 Examples:

-- Aggregate raw sales into a monthly per-customer summary table
CREATE TABLE IF NOT EXISTS clean.sales AS
SELECT customer_id,
       SUM(amount) AS total_sales,
       DATE_TRUNC('month', sale_date) AS month
FROM raw.sales
GROUP BY customer_id, DATE_TRUNC('month', sale_date);

🧠 Modern evolution

The newer paradigm ELT (Extract → Load → Transform) reverses the last two steps. Data is first loaded as-is into the warehouse or data lake and then transformed in-place using its computational power (e.g., in Snowflake, BigQuery, or Databricks).

This approach improves scalability and reduces load on source systems.

| Aspect | ETL (Extract → Transform → Load) | ELT (Extract → Load → Transform) |
|---|---|---|
| Process flow | Data is extracted, transformed on an external engine, and then loaded into the data warehouse. | Raw data is first loaded into a data lake or warehouse, then transformed in place. |
| Best for | Traditional data warehouses with limited compute resources. | Modern cloud-based architectures (Snowflake, Databricks, BigQuery). |
| Tools | Informatica, Talend, SSIS. | dbt, Spark SQL, Databricks workflows. |
| Advantages | Early data validation, strict schema enforcement. | Scales better, cheaper storage, supports semi- and unstructured data. |
| Drawbacks | Difficult to scale, limited flexibility. | May require stronger data governance to control raw data chaos. |

💡 Example:

In a dbt + Snowflake pipeline, you use ELT: raw data lands in a staging schema (raw tables), and dbt transforms it into clean analytical models inside the warehouse.
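
As a minimal sketch of that pattern (model and source names are hypothetical), a dbt model is simply a SELECT statement that dbt materializes inside the warehouse:

-- models/marts/monthly_sales.sql (hypothetical dbt model)
-- dbt resolves {{ ref('stg_sales') }} to the staging table it manages
select
    customer_id,
    date_trunc('month', sale_date) as month,
    sum(amount) as total_sales
from {{ ref('stg_sales') }}
group by 1, 2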

🏢 Data Warehouse vs Data Lake vs Lakehouse

Data Warehouse

A Data Warehouse is a centralized system designed to store and manage structured, historical data from multiple sources for analytics and reporting.

It integrates data through ETL or ELT processes, organizes it into subject-oriented schemas (like sales, customers, finance), and allows fast querying using SQL.

💡 Typically, a data warehouse supports Business Intelligence (BI), dashboards, and analytical workloads rather than operational transactions.

💡 Examples:

  • Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse Analytics.
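
For illustration, a typical warehouse workload is an analytical query over a subject-oriented star schema; the fact and dimension tables below are hypothetical:

-- Monthly revenue per region from a hypothetical star schema
SELECT d.region,
       DATE_TRUNC('month', f.order_date) AS month,
       SUM(f.amount) AS revenue
FROM fact_orders f
JOIN dim_customer d ON f.customer_id = d.customer_id
GROUP BY d.region, DATE_TRUNC('month', f.order_date);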

Data Lake

A Data Lake is a centralized repository that stores raw, unstructured, semi-structured, and structured data at any scale.

Unlike a data warehouse, it doesn’t require predefined schemas — data is stored as-is (“schema-on-read”) and transformed only when needed for analysis. This flexibility makes it ideal for big data analytics, machine learning, and data exploration.

💡 Data Lakes typically use low-cost, scalable object storage such as Amazon S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage.

Examples of technologies:

  • Delta Lake, Apache Iceberg, Apache Hudi — these add ACID transactions, versioning, and schema management to modern data lakes.
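
To illustrate schema-on-read, Spark SQL can query raw files in place without any predefined table; the path below is hypothetical and the schema is inferred at query time:

-- Query raw JSON files directly from object storage
SELECT device_id, COUNT(*) AS events
FROM json.`s3://data-lake/raw/telemetry/`
GROUP BY device_id;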

Lakehouse

A Data Lakehouse is a modern data architecture that combines the flexibility of a Data Lake with the management and performance features of a Data Warehouse.

It allows you to store all types of data — structured, semi-structured, and unstructured — in a single repository, while still supporting ACID transactions, governance, schema enforcement, and fast SQL queries.

💡 The Lakehouse eliminates the traditional split between “data for analysts” (in warehouses) and “data for data scientists” (in lakes), enabling both analytics and AI workloads on the same data.

Typical technologies:

  • Databricks Delta Lake, Apache Iceberg, Apache Hudi, Snowflake (via Iceberg tables)
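
As a small sketch of warehouse-style management on lake storage (Delta Lake SQL, hypothetical table names), an ACID upsert looks like this:

-- Transactional upsert into a governed lakehouse table
MERGE INTO silver.customers AS t
USING updates AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;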

Comparison

| Feature | Data Warehouse | Data Lake | Lakehouse |
|---|---|---|---|
| Data type | Structured (SQL tables). | All types (structured, semi-structured, unstructured). | Unified: handles all formats. |
| Schema | Schema-on-write (defined upfront). | Schema-on-read (flexible). | Hybrid: schema enforcement + flexibility. |
| Storage | Proprietary, high-performance. | Cheap object storage (S3, ADLS, GCS). | Object storage with ACID transactions. |
| Processing | Batch SQL queries. | Batch + streaming. | Batch + streaming + ML. |
| Use cases | Business intelligence, reporting. | Data exploration, ML training. | Unified analytics and machine learning. |
| Examples | Snowflake, Redshift, BigQuery. | Hadoop, S3, ADLS. | Databricks Lakehouse, Delta Lake, Apache Iceberg. |

💡 In practice:

In modern platforms the three often coexist: raw data is collected in a lake, curated and modeled into warehouse-style tables, and both analytics and ML run on top of a lakehouse architecture.

⚡ Lambda vs Kappa Architectures

These are two models for handling real-time and batch data processing.

Lambda Architecture

Combines batch processing for accuracy and stream processing for real-time results.

flowchart TD
    classDef source fill:#cce5ff,stroke:#003366,stroke-width:1px,color:#003366,font-weight:bold
    classDef batch fill:#e6ffe6,stroke:#006600,stroke-width:1px,color:#003300,font-weight:bold
    classDef speed fill:#fff0b3,stroke:#b38f00,stroke-width:1px,color:#664d00,font-weight:bold
    classDef serving fill:#ffe6e6,stroke:#990000,stroke-width:1px,color:#660000,font-weight:bold

    A(["🌐 Data Stream"]):::source
    B(["🗂️ Batch Layer<br/>(HDFS, Spark, Hive, dbt)"]):::batch
    C(["⚡ Speed Layer<br/>(Kafka, Flink, Spark Streaming)"]):::speed
    D(["📊 Serving Layer"]):::serving

    A --> B
    A --> C
    B --> D
    C --> D
  • Pros: Accurate + near real-time data; fault tolerance.

  • Cons: Duplicate logic in batch and speed layers (hard to maintain).

  • Used by: earlier large-scale data platforms (e.g., the early Netflix and LinkedIn architectures).

Kappa Architecture

Simplifies the Lambda model by removing the batch layer.
All processing — real-time and historical — is done through the streaming pipeline.

flowchart TD
    classDef source fill:#cce5ff,stroke:#004080,stroke-width:1px,color:#00264d,font-weight:bold
    classDef process fill:#e6ffe6,stroke:#007a00,stroke-width:1px,color:#003300,font-weight:bold
    classDef storage fill:#fff0b3,stroke:#b38f00,stroke-width:1px,color:#664d00,font-weight:bold

    A(["🌐 Data Stream<br/>(Kafka, Kinesis)"]):::source
    B(["⚙️ Stream Processor<br/>(Flink, Spark, Beam)"]):::process
    C(["💾 Serving / Storage<br/>(Delta, Cassandra)"]):::storage

    A --> B --> C
  • Pros: Simpler, consistent logic, true real-time architecture.

  • Cons: Reprocessing historical data means replaying the entire stream, which becomes costly when schemas or business logic change.

  • Used by: Modern streaming-first companies (Uber, Twitter, Confluent).
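
A minimal Kappa-style sketch in Flink SQL, assuming a hypothetical clicks topic: a single continuous query serves both fresh and historical results, and reprocessing simply means replaying the topic from the earliest offset.

-- Kafka-backed source table (connector options abbreviated)
CREATE TABLE clicks (
  user_id    STRING,
  url        STRING,
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'clicks',
  'properties.bootstrap.servers' = 'broker:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- Continuous hourly aggregation over the stream
SELECT user_id, window_start, COUNT(*) AS clicks
FROM TABLE(TUMBLE(TABLE clicks, DESCRIPTOR(event_time), INTERVAL '1' HOUR))
GROUP BY user_id, window_start, window_end;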

💡 In short:

  • Lambda = batch + streaming.

  • Kappa = streaming-only, reprocesses all events.

🧩 Components of a Modern Data System

A modern data platform is built from modular components that handle each stage of the data lifecycle:

| Layer | Main Responsibility | Common Tools |
|---|---|---|
| Data Ingestion | Collect and move data from sources. | Kafka, Kinesis, Airbyte, Fivetran. |
| Storage | Persist raw and curated data. | S3, ADLS, Delta Lake, Iceberg. |
| Processing | Transform, clean, aggregate. | Spark, Flink, dbt, Beam. |
| Metadata & Governance | Manage schema, lineage, catalog. | Hive Metastore, Unity Catalog, Amundsen, DataHub. |
| Orchestration | Coordinate and schedule pipelines. | Airflow, Dagster, Prefect. |
| Serving & Analytics | Query, visualize, and build ML models. | Trino, Databricks SQL, Power BI, MLflow. |

This modularity allows teams to evolve their platforms incrementally — swapping tools without rewriting the entire stack.

🥇 Medallion Architecture (Bronze–Silver–Gold)

The Medallion Architecture (developed by Databricks) organizes data into logical layers that improve data quality, traceability, and governance.

flowchart TD
    classDef bronze fill:#fce5cd,stroke:#b45f06,stroke-width:1px,color:#783f04,font-weight:bold
    classDef silver fill:#d9d9d9,stroke:#7f7f7f,stroke-width:1px,color:#404040,font-weight:bold
    classDef gold fill:#fff2cc,stroke:#b38f00,stroke-width:1px,color:#665c00,font-weight:bold

    C(["🥉 BRONZE Layer<br/>(Raw, ingested data)"]):::bronze
    B(["🥈 SILVER Layer<br/>(Cleaned, conformed data)"]):::silver
    A(["🥇 GOLD Layer<br/>(Business-ready data, BI)"]):::gold

    C --> B --> A

| Layer | Purpose | Example |
|---|---|---|
| Bronze | Raw ingestion from sources (no transformations). | Kafka → Delta Lake raw tables. |
| Silver | Data cleaned, deduplicated, joined with reference data. | Remove nulls, enrich user IDs with profile data. |
| Gold | Aggregated and business-ready datasets. | KPI dashboards, machine learning features. |

💡 Typical flow:

Raw IoT telemetry → Bronze (raw JSON) → Silver (cleaned sensor readings) → Gold (aggregated device performance metrics).
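
As a sketch of that flow in Databricks/Delta Lake SQL (table and column names are hypothetical):

-- Silver: clean and deduplicate the raw Bronze telemetry
CREATE TABLE IF NOT EXISTS silver.sensor_readings AS
SELECT DISTINCT
       device_id,
       CAST(reading AS DOUBLE) AS reading,
       to_timestamp(event_ts)  AS event_ts
FROM bronze.sensor_raw
WHERE reading IS NOT NULL;

-- Gold: business-ready daily performance metrics per device
CREATE TABLE IF NOT EXISTS gold.device_daily_metrics AS
SELECT device_id,
       DATE(event_ts) AS day,
       AVG(reading)   AS avg_reading,
       COUNT(*)       AS n_readings
FROM silver.sensor_readings
GROUP BY device_id, DATE(event_ts);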

🌐 Summary

  • ETL transforms before loading — best for legacy systems; ELT transforms inside the target system — ideal for cloud.

  • Data Warehouse → Data Lake → Lakehouse represents the natural evolution of analytical architectures.

  • Lambda vs. Kappa: hybrid vs. unified streaming.

  • Modern data platforms are modular, cloud-native, and automation-driven.

  • Medallion architecture standardizes data curation and quality.