Lean Data-Vault for Databricks
Data Vault is a proven methodology for building scalable, auditable data platforms. I consider myself a firm advocate of Data Vault: its concepts of hubs, satellites, and links provide a solid foundation for traceability and parallelization.
However, in real-world projects I often found that the full methodology was too heavy for organizations to adopt consistently. Out of this need, I developed a Lean Data Vault approach: rooted in the principles of Data Vault, but simplified to focus on what truly matters, namely data quality, flexibility, and speed of delivery.
For more details, visit https://andyloewen.de/2025/09/17/lean-data-vault-a-pragmatic-approach-to-data-modeling-in-the-lakehouse/
This project provides a code generation framework for building a multi-layer data warehouse on Databricks Unity Catalog. Instead of writing SQL by hand, developers define data models in XML (using DnAML — Data Modeling Language), and XSLT templates automatically generate all the required SQL scripts for each warehouse layer.
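As an illustration only, the sketch below shows the general idea of the workflow: an entity defined in XML, parsed, and turned into layer DDL. The element and attribute names (`Entity`, `BusinessKey`, `Attribute`) are assumptions for this example, not the actual DnAML vocabulary, and the real project generates SQL through XSLT templates rather than Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical DnAML-style entity definition. Element and attribute
# names here are illustrative assumptions, not the real DnAML schema.
dnaml = """
<Entity name="Suppliers" source="WideWorldImporters">
  <BusinessKey column="SupplierID" type="INT"/>
  <Attribute column="SupplierName" type="STRING"/>
  <Attribute column="PhoneNumber" type="STRING"/>
</Entity>
"""

root = ET.fromstring(dnaml)
entity = root.attrib["name"]
columns = [(c.attrib["column"], c.attrib["type"]) for c in root]

# Sketch of the kind of Bronze-layer DDL the XSLT templates would emit.
col_defs = ",\n  ".join(f"{name} {typ}" for name, typ in columns)
ddl = f"CREATE TABLE bronze.{entity} (\n  {col_defs}\n);"
print(ddl)
```

In the actual framework this transformation is expressed declaratively in XSLT, so the same model file can feed the templates for every layer.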
Data flows through three layers:
– Bronze — raw ingestion from source systems (e.g., Wide World Importers). Stores data as-is with technical audit fields and provides Access views on top.
– Silver — historized, vault-style storage. Splits data into Key tables (surrogate + business key) and Hist tables (SCD-2 history), following Lean Data Vault principles.
– Gold — business-ready dimensional model. Combines Key and Hist patterns into clean dimension tables (e.g., DimSuppliers, DimStockItems) without historization fields, ready for consumption.
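To make the Silver-layer Key/Hist split concrete, here is a minimal Python sketch of the two table shapes for a Suppliers entity. This is not the project's generator; the concrete names (`KeySuppliers`, `HistSuppliers`, the `SK` suffix, `ValidFrom`, `ValidTo`, `IsCurrent`) are assumptions chosen to illustrate the pattern.

```python
# Illustrative sketch of the Silver-layer split in Lean Data Vault:
# a Key table (surrogate + business key) and an SCD-2 Hist table.
# All concrete names below are assumptions, not the generated names.

def silver_ddl(entity: str, business_key: str) -> dict:
    key_table = (
        f"CREATE TABLE silver.Key{entity} (\n"
        f"  {entity}SK BIGINT GENERATED ALWAYS AS IDENTITY,\n"
        f"  {business_key} INT\n"
        f");"
    )
    hist_table = (
        f"CREATE TABLE silver.Hist{entity} (\n"
        f"  {entity}SK BIGINT,\n"
        f"  SupplierName STRING,\n"
        f"  ValidFrom TIMESTAMP,\n"   # SCD-2 history fields
        f"  ValidTo TIMESTAMP,\n"
        f"  IsCurrent BOOLEAN\n"
        f");"
    )
    return {"key": key_table, "hist": hist_table}

ddl = silver_ddl("Suppliers", "SupplierID")
print(ddl["key"])
print(ddl["hist"])
```

A Gold dimension such as DimSuppliers would then join Key and Hist back together while dropping the historization fields, which is why the Gold tables stay clean for consumption.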
Key Concepts:
Two deployment strategies are supported, selected via a single configuration flag:
PerLayer — each layer gets its own Unity Catalog, sources live in schemas
Unified — a single catalog, layers are separated by schema naming
Configuration-driven design — all naming conventions (schema names, table prefixes, catalog references) are centralized in Configuration.xslt. There are no hardcoded values in the individual templates.
Helper templates like BuildSchemaName, BuildTableName, and BuildCrossLayerReference ensure consistent naming across all layers and both strategies.
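The helper template names above come from the project itself; the logic below is a Python re-sketch of how a BuildSchemaName-style helper might resolve names under each strategy. The concrete catalog and schema spellings (`lakehouse`, the `layer_source` pattern) are illustrative assumptions, not the actual conventions in Configuration.xslt.

```python
# Python re-sketch of BuildSchemaName-style resolution. The PerLayer /
# Unified behavior mirrors the README; concrete catalog and schema
# spellings are illustrative assumptions.

def build_schema_name(strategy: str, layer: str, source: str) -> str:
    if strategy == "PerLayer":
        # Each layer is its own Unity Catalog; sources live as schemas.
        return f"{layer}.{source}"
    if strategy == "Unified":
        # One catalog; layers are separated by schema naming.
        return f"lakehouse.{layer}_{source}"
    raise ValueError(f"unknown strategy: {strategy}")

print(build_schema_name("PerLayer", "bronze", "wwi"))  # bronze.wwi
print(build_schema_name("Unified", "bronze", "wwi"))   # lakehouse.bronze_wwi
```

Centralizing this decision in one place is what lets a single flag switch the whole generated codebase between the two strategies, including cross-layer references.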
How to Work With It:
– Define your entity in a DnAML XML file
– Run the appropriate XSLT template against it to generate SQL
– Execute the generated SQL on your Databricks environment
Before you start, read the 00ReadmeFirst.md file in the “Common” folder.
