| Term | Definition |
| --- | --- |
| ACID Transactions | An acronym for atomicity, consistency, isolation, and durability. A set of properties of database transactions intended to guarantee data validity despite errors, power failures, and other mishaps. The data lakehouse notably supports ACID transactions, a departure from the original data lake. |
| Batch Layer | A component of the Lambda architecture that stores the complete, immutable master dataset. It periodically recomputes aggregates and derived views (batch views) over the entire dataset, ensuring data correctness and completeness. |
| Batch Processing | A method of processing data where inputs are of a known, finite size. It encourages deterministic, pure functions and is well-suited for complex logic and reprocessing large amounts of historical data. |
| Cloud Data Warehouse | An evolution of the on-premises data warehouse architecture, characterized by pay-as-you-go pricing and the separation of compute from storage. These traits make cloud data warehouses scalable and put petabyte-scale processing within reach of companies of all sizes. |
| Data as a Product | A core principle of the Data Mesh architecture, where domains host and serve their datasets in an easily consumable way, treating the data itself as a primary product. |
| Data Consumption | The layer in a big data system where end-users, applications, or other systems utilize the processed data. |
| Data Flow Model | A model, implemented by frameworks like Apache Beam, that views all data as events and performs aggregation over various types of windows. In this model, batch data is simply a bounded event stream, treating batch as a special case of streaming to unify processing logic. |
| Data Lake | A storage system where data can be stored in its “as is” or natural form, both structured and unstructured. Its primary purpose is to break data silos and democratize data for exploration and discovery. |
| Data Lake Governance | A horizontal layer in a data lake architecture that encompasses all data management activities, including metadata management, data cataloging, data quality, data lineage, and data auditing. |
| Data Lakehouse | A modern data architecture that represents a convergence of data lakes and data warehouses. It incorporates the controls and data management features of a data warehouse while housing data in object storage, supporting various query engines, and, notably, providing ACID transactions. |
| Data Mart | A refined subset of a data warehouse designed to serve the analytics and reporting needs of a single department or line of business. |
| Data Mesh | A decentralized data architecture that applies concepts of domain-driven design to data. It inverts the centralized model by having domains host and serve their own datasets as products, supported by a self-serve data infrastructure platform. |
| Data Swamp | A derogatory term for a data lake that has become a disorganized dumping ground for data. This often occurs due to a lack of schema management, cataloging, and discovery tools, making the data difficult to use and manage. |
| Data Warehouse | A central data hub used for reporting and analysis, as defined by Bill Inmon. It is a subject-oriented, integrated, nonvolatile, and time-variant collection of highly formatted and structured data. |
| ELT (Extract, Load, Transform) | A data integration process where data is extracted from source systems and loaded directly into a staging area in the target system (like a data warehouse). Transformations are then handled within the target system, taking advantage of its computational power. |
| ETL (Extract, Transform, Load) | A data integration process where data is extracted from source systems, cleaned and standardized in a dedicated transformation phase, and then loaded into the target data warehouse. |
| Event Sourcing | A pattern where all changes to application state are stored as a sequence of immutable events. The Lambda architecture’s core idea is similar, recording incoming data by appending immutable events to an always-growing dataset. |
| IoT (Internet of Things) | A distributed collection of internet-connected devices (computers, sensors, mobile devices) that generate data by collecting it periodically or continuously from the surrounding environment. |
| Kappa Architecture | An alternative to the Lambda architecture that eliminates the batch processing layer. It uses a single stream-processing platform as the backbone for all data handling, where the immutable event log (e.g., Kafka) is the source of truth, and reprocessing is done by replaying the log. |
| Lambda Architecture | A big data architecture pattern that processes data using two parallel systems: a batch layer for accurate, comprehensive views on historical data, and a speed layer for low-latency updates on recent data. A serving layer combines outputs from both layers. |
| MPP (Massively Parallel Processing) | A technical architecture, common in data warehouses, optimized to scan massive amounts of data in parallel. MPP systems allow for high-performance aggregation and statistical calculations on large datasets. |
| Modern Data Stack | A trendy analytics architecture that uses cloud-based, plug-and-play, easy-to-use components to create a modular and cost-effective system. It emphasizes self-service, agile data management, and open-source or simple proprietary tools. |
| OLAP (Online Analytical Processing) | Refers to systems and processes designed for analytics, such as those performed in a data warehouse. This is typically separated from production database workloads. |
| OLTP (Online Transaction Processing) | Refers to systems that handle production database transactions, such as recording sales or updating user accounts. A key feature of data warehouse architecture is separating OLAP workloads from OLTP systems. |
| Raw Zone (Landing Zone) | The first storage layer in a data lake, used as a staging area for ingesting all data in its native format without transformation. It serves as a complete historical archive. |
| Refined Zone | A storage layer in a data lake where data is cleansed, enriched, and conformed to specific subject areas for dedicated line-of-business use cases. Business context is applied to the data in this zone. |
| Reprocessing | The process of running a batch job over large amounts of accumulated historical data to derive new views or restructure a dataset. It is a key mechanism for maintaining systems and evolving them to support new features. |
| Sandbox Zone | An exploration or experimentation area in a data lake where authorized users (data scientists, analysts, engineers) can explore data, discover insights, create prototypes, and innovate. |
| Serving Layer | The layer in the Lambda architecture responsible for merging the pre-computed views from the batch layer with the real-time updates from the speed layer to provide a comprehensive answer to queries. |
| Speed Layer | The layer in the Lambda architecture that processes incoming data streams in real time. It provides “fast, but approximate” analytics to deliver results with very low latency. |
| Stream Layer | The primary processing layer in the Kappa architecture. It performs real-time processing on the event stream and outputs computed values to a dedicated serving layer. |
| Stream Processing | A method of processing data that operates on unbounded datasets (continuous streams of events). It allows operators to manage fault-tolerant state and enables changes in input to be reflected in outputs with low delay. |
| Trusted Zone | A storage layer in a data lake that holds structured data that has been modeled and standardized to serve as a single source of truth for enterprise-wide use cases. |
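The guarantees described in the ACID Transactions entry can be illustrated with a minimal atomicity sketch. SQLite stands in for the storage engine, and the table and account names are hypothetical; the point is that a failed transaction leaves no partial writes behind.

```python
import sqlite3

# A minimal sketch of atomicity; SQLite is a stand-in engine and the
# accounts table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # the context manager commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 "
                     "WHERE name = 'alice'")
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass

# The rollback discarded the partial update, so the balance is unchanged.
balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()[0]
print(balance)  # 100
```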
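The Data Flow Model entry's idea that batch is a bounded event stream can be sketched with a tumbling-window count. The function below is a toy illustration, not Apache Beam's API: the same windowing logic works whether `events` is a finite list (batch) or an unbounded generator (stream).

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per fixed-size (tumbling) window, keyed by window start.

    `events` is any iterable of (timestamp, payload) pairs: a bounded
    list behaves like a batch, an unbounded generator like a stream,
    so one piece of logic covers both cases.
    """
    counts = defaultdict(int)
    for ts, _payload in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

# Timestamps in seconds; windows of 5 seconds: [0, 5), [5, 10), [10, 15).
events = [(0, "a"), (3, "b"), (5, "c"), (11, "d")]
print(tumbling_window_counts(events, 5))  # {0: 2, 5: 1, 10: 1}
```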
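The Serving Layer entry's merge of batch and speed views can be sketched as follows. This is a simplified, assumed merge strategy (overlay incremental counts from the speed layer onto the batch layer's precomputed totals); the page names are hypothetical.

```python
def merge_views(batch_view, speed_view):
    """Serving-layer merge: start from the batch view's precomputed
    totals and overlay incremental counts from the speed layer."""
    merged = dict(batch_view)
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

batch_view = {"page_a": 1000, "page_b": 250}  # recomputed periodically
speed_view = {"page_a": 7, "page_c": 3}       # low-latency recent updates
print(merge_views(batch_view, speed_view))
# {'page_a': 1007, 'page_b': 250, 'page_c': 3}
```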
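The Event Sourcing and Kappa Architecture entries share one mechanism: current state is derived by folding over an immutable event log, and reprocessing is just another replay. A toy sketch, with a hypothetical account-balance event schema:

```python
# An append-only, immutable event log; events are never updated in place.
log = [
    {"type": "deposit", "amount": 100},
    {"type": "withdraw", "amount": 30},
    {"type": "deposit", "amount": 5},
]

def replay(events):
    """Rebuild current state by folding over the event log.
    Deploying new logic just means replaying the same log through it."""
    balance = 0
    for event in events:
        if event["type"] == "deposit":
            balance += event["amount"]
        elif event["type"] == "withdraw":
            balance -= event["amount"]
    return balance

print(replay(log))  # 75
```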
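The ELT entry's load-then-transform flow can be sketched in miniature. SQLite stands in for the target warehouse, and the table and column names are hypothetical; the key point is that cleaning and typing happen inside the target system's own engine, after the raw load.

```python
import sqlite3

# SQLite stands in for the warehouse; raw_orders/orders are hypothetical.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents TEXT)")

# Load: raw extracts land in a staging table untransformed.
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                      [(1, "1250"), (2, "399"), (3, "0")])

# Transform: typing and filtering run in the warehouse's own SQL engine.
warehouse.execute("""
    CREATE TABLE orders AS
    SELECT id, CAST(amount_cents AS INTEGER) / 100.0 AS amount_dollars
    FROM raw_orders
    WHERE CAST(amount_cents AS INTEGER) > 0
""")
rows = warehouse.execute(
    "SELECT id, amount_dollars FROM orders ORDER BY id").fetchall()
print(rows)  # [(1, 12.5), (2, 3.99)]
```

An ETL flow would instead run the cleaning step in a separate transformation stage before anything reached the warehouse.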