⚙️ Challenges in Integrating Streaming Data with BI

  • Latency mismatch: BI tools expect batch-style queries, while streams are continuous.

  • Data freshness: dashboards must update in seconds without overwhelming the query engine.

  • Schema evolution: real-time data structures change often — BI layers must adapt dynamically.

  • Query cost: frequent updates can stress underlying data warehouses.

📊 Tools for Visualization and Data Analysis

  • Tableau / Power BI / Looker — connect to Delta tables or to data stores fed by Kafka sink connectors.

  • Databricks SQL Dashboards — native support for streaming queries.

  • Grafana + Prometheus / Kafka Connect Sink — for real-time operational metrics.

💡 Example:

Spark Structured Streaming → writes to Delta Lake → Databricks SQL Dashboard → refresh every 10 seconds.
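
A minimal PySpark sketch of this flow, assuming a local Kafka broker at localhost:9092, a topic named events, and placeholder paths; the 10-second trigger matches the dashboard refresh interval:

from pyspark.sql import SparkSession

# Requires the spark-sql-kafka connector package on the classpath
spark = SparkSession.builder.appName("events-to-delta").getOrCreate()

# Read the event stream from Kafka (broker address and topic name are assumptions)
raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load())

# Kafka delivers the payload as binary; cast it to a string column
events = raw.selectExpr("CAST(value AS STRING) AS value", "timestamp")

# Append to a Delta table, producing a micro-batch roughly every 10 seconds
query = (events.writeStream
    .format("delta")
    .outputMode("append")
    .trigger(processingTime="10 seconds")
    .option("checkpointLocation", "/tmp/checkpoints/events/")
    .start("/delta/events"))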

🗄️ Data Storage and Optimization

  1. Parquet – columnar file format optimized for analytical queries.

    • Supports predicate pushdown and efficient compression.
  2. Delta Lake – transactional layer on top of Parquet providing:

    • ACID transactions.

    • Time travel and schema evolution (a time-travel read is sketched after this list).

    • Support for both batch and streaming writes.

# df is a streaming DataFrame (e.g., obtained from spark.readStream)
(df.writeStream
    .format("delta")                                    # write in Delta Lake format
    .outputMode("append")                               # append new rows only
    .option("checkpointLocation", "/tmp/checkpoints/")  # track progress for fault tolerance
    .start("/delta/events"))                            # target Delta table path
  3. Databricks – unified analytics platform for Spark and Delta Lake.

    • Simplifies orchestration, visualization, and governance of Big Data pipelines.

    • Native connectors for BI tools and SQL-based streaming queries.
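
As a rough illustration of Delta Lake time travel (item 2 above), the snippet below re-reads the /delta/events table at an earlier version; version 0 is purely illustrative:

# spark is the active SparkSession
# Current state of the Delta table
current = spark.read.format("delta").load("/delta/events")

# Time travel: read the table as it existed at an earlier version
first_version = (spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/delta/events"))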

🗄️ Parquet File Format

Parquet is a columnar storage format optimized for analytical workloads, widely used in Big Data ecosystems (Spark, Hive, Presto, Trino, etc.).

Columnar Storage

  • Data is stored by columns, not by rows.

  • Advantage: Analytical queries like SELECT sum(sales) FROM table WHERE region='EU' read only the relevant columns instead of the entire table.

  • Result: significant reduction in disk I/O.
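
As a small PySpark illustration, the query below scans only the sales and region columns of a Parquet dataset; the path and column names follow the example above and are placeholders:

from pyspark.sql import functions as F

# spark is the active SparkSession; only 'sales' and 'region' are read from disk (column pruning)
eu_sales = (spark.read.parquet("/data/sales.parquet")
    .where(F.col("region") == "EU")
    .groupBy("region")
    .agg(F.sum("sales").alias("total_sales")))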

Predicate Pushdown

  • Parquet stores min/max statistics per column for each row group, so filters can skip blocks that cannot match.

  • Example: WHERE date >= '2025-01-01' reads only the blocks of data where the date matches the condition.

  • Saves I/O and speeds up queries.
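
A quick PySpark sketch of the same filter; the path and column name are assumptions. When pushdown applies, explain() lists the condition under PushedFilters in the Parquet scan:

# Filter on the date column; row groups whose min/max statistics
# fall entirely outside the range are skipped at read time
recent = spark.read.parquet("/data/events.parquet").where("date >= '2025-01-01'")

# Inspect the physical plan to confirm the pushed-down filter
recent.explain()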

Efficient Compression

  • Columns of the same type compress well (e.g., numeric columns, repeated string values).

  • Supports various compression algorithms: Snappy, GZIP, Brotli, etc.

  • Columnar storage + compression greatly reduces file size.
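
For example, the codec can be chosen per write or per session in PySpark; the path is a placeholder and df is any DataFrame (Snappy is the usual Spark default):

# Write Parquet with an explicit compression codec
df.write.option("compression", "gzip").parquet("/data/events_gzip.parquet")

# Or set the default codec for the whole session
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")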

Schema Evolution

  • Parquet stores schema metadata inside the file, allowing:

    • Adding new columns without rewriting old data.

    • Combining data with different schema versions.
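
A brief sketch of reading a directory whose files were written with different schema versions; mergeSchema asks Spark to union the column sets, and the path is a placeholder:

# Older files may lack columns that newer files have; mergeSchema combines them,
# filling missing values with null
combined = (spark.read
    .option("mergeSchema", "true")
    .parquet("/data/events/"))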

Compatibility

  • Works with most popular analytics tools:

    • Apache Spark, Hive, Impala, Trino, Presto

    • Python: pandas, pyarrow

    • Java/Scala via Apache Arrow / Parquet libraries
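
For instance, the same files can be read and written from Python with pandas and pyarrow; the file names are placeholders:

import pandas as pd

# pandas delegates Parquet I/O to pyarrow (or fastparquet) under the hood
df = pd.read_parquet("events.parquet", engine="pyarrow")
df.to_parquet("events_copy.parquet", engine="pyarrow", compression="snappy")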

When Parquet is Especially Useful

  • Large-scale datasets (TB+)

  • OLAP and analytical queries

  • Fast access to specific columns with efficient storage

Parquet vs CSV vs Avro

| Feature | Parquet | CSV | Avro |
| --- | --- | --- | --- |
| Storage type | Columnar | Row-based | Row-based |
| Compression | Built-in, efficient | None (external only) | Built-in, efficient (Snappy, etc.) |
| Predicate pushdown | ✅ Yes | ❌ No | ❌ Limited (depends on reader) |
| Schema | Embedded (self-describing) | None (schema must be external) | Embedded (self-describing) |
| File size | Small (compressed) | Large | Medium |
| Read performance (analytical) | Very fast (column pruning) | Slow (full scan) | Moderate |
| Write performance | Moderate | Fast | Fast |
| Schema evolution | ✅ Supported | ❌ Not supported | ✅ Supported |
| Best use case | Big Data analytics, OLAP | Simple data exchange, logs | Event streaming, data interchange |

🧠 Key Takeaways

  • Real-time data processing enables immediate insight and reaction to streaming events.

  • Apache Kafka handles ingestion; Spark and Flink perform computation; Delta Lake ensures reliable storage.

  • Integration with BI tools bridges the gap between operational streams and business insights, enabling modern data-driven decision-making.