Data Mesh is an emerging concept gaining momentum across organizations. It is often described as a sociotechnical paradigm because a paradigm shift towards Data Mesh does not simply involve technological changes but also has sociological implications. As such, discussing a technical framework like Smart Data Lake Builder and analyzing how well it fits the Data Mesh paradigm can inherently only be part of the whole picture. Nevertheless, we want to do the exercise here to see how a technological framework can support the adoption.
In this article, we'll explore how the Smart Data Lake Builder aligns with the four principles as outlined by Zhamak Dehghani and assess which key concepts it supports.
Please note that we expect some familiarity with the Data Mesh principles to follow along. There are plenty of resources (books from O'Reilly and Manning, articles and dedicated websites) if you want to dive deeper into the topic.
Individual domains or business units are responsible for the data they generate and maintain, fostering a sense of ownership and accountability.
Scale Out Complexity
With a Data Mesh, your data organization can scale out as the centralized data team gets reduced. With increased usage, you will have more data sources, more consumers, more interfaces and with that, a large variety of systems and technologies.
SDLB comes with a large set of connectors out-of-the-box with the addition of Airbyte source connectors. Anything missing can be easily implemented as the whole ecosystem is open and builds on open standards.
Data Mesh embraces continuous change in your Data Products. Changing source systems or additional requirements all require you to adapt (quickly).
SDLB is built for changes. With built-in schema evolution, changing schemas can be handled automatically. Thanks to the declarative approach, adding additional data objects can be done in minutes and the dependencies don't need to be defined by hand.
Data as a Product
Data is treated as a product, with clear documentation, standardized interfaces, and well-defined quality measures to ensure its usability and value.
It's important to note that in the end, data products can be implemented in different technologies as long as they adhere to the agreed upon interfaces.
Again, there are different aspects where SDLB can help.
A Data Product has well-defined quality measures and should define Service-Level Objectives (SLOs).
SDLB also gathers metrics with every run. With the metrics gathered, you can detect anomalies like number of records or bytes written. These are often helpful in finding latent problems in your data pipeline.
As another aspect of quality, you expect your source system to adhere to the agreed interface.
SDLB supports schema validation for all its data objects. In your configuration, you can (optionally) define a schema and enforce it. This saves you from surprises if an upstream system suddenly changes the schema of a data object you use.
To provide transparency about the origin and transformations of data in your Data Product, a lineage diagram can greatly help.
SDLB is a metadata driven framework. You define Data Objects, you define Actions between them. That's it. SDLB will figure out the dependencies at runtime but they can also be easily displayed from static configurations. This is done i.e. in our sdl-visualization, see also the UI-Demo. SDL configuration files can be easily displayed and analyzed. This provides the needed transparency of a Data Product.
SDLB offers many standard Data Objects for all kinds of interfaces. With the declarative approach and schema enforcement, you can make sure that your data always adheres to the defined interface specification.
Often in Cloud Environments, we work with Delta or Iceberg Tables today. These open standards are a powerful tool to share your Data Product, even in multicloud deployments.
Self-Serve infrastructure platform
Data infrastructure is designed to be self-serve, enabling teams to access and manage their data autonomously without relying on central data teams.
Mobilize more developers
With increased usage of the Mesh, you need more developers in your domain teams building Data Products.
SDLB has an easy to learn, declarative approach that new users can learn quickly. It supports SQL, Scala and Python for the definition of transformations so your developers can still use the language they are most comfortable in. Complex logic and transformations like historization and deduplication are either built-in or can be extended in a generic way so your developers can leverage them easily.
Support your domain teams
The self-serve infrastructure should support your domain team to quickly build new Data Products.
SDLB is ideal to integrate into DevOps pipelines as all configurations are written in textfile based HOCON format. Even if you have custom code, it will be written in SQL, Scala or Python which also integrates well in any existing DevOps environments. It is designed to support code reviews, automated testing and deployment. With the right templates, your domain teams can set up new Data Products quickly.
Governance of data and its quality is distributed across domains, with a focus on collaboration, standardized practices, and federated decision-making processes to maintain data integrity and trust.
While many of the concepts for a federated governance are on an organizational level and not on a technical level, SDLB can again help on various points.
When it comes to Data Catalogs, there is no open, agreed upon standard (yet).
SDLB's configuration files are open and can be visualized as mentioned before. The concepts of Data Objects and Actions are very similar to many Data Catalog notations and with that, they also integrate nicely in existing solutions. There is i.e. an exporter to Apache Atlas which also builds the basis of Azure Purview.
One aspect with increasing importance is privacy and compliance. For these topics, it helps to have a unified, generic implementation to prevent your domain teams to each start their own implementation.
SDLB now supports encrypted columns that help i.e. with PII data (Personally Identifiable Information).
There are a lot of aspects to consider when adopting a Data Mesh, many of which are on a technical level. While you want to give your domain teams the independence to choose the technologies they know best, you still want a framework that is easy to learn, quick to adapt and open to work with any other Data Product. This article hopefully gave you a basic overview of how Smart Data Lake Builder can help you.