
· 6 min read

Data Mesh is an emerging concept gaining momentum across organizations. It is often described as a sociotechnical paradigm because a paradigm shift towards Data Mesh does not simply involve technological changes but also has sociological implications. As such, discussing a technical framework like Smart Data Lake Builder and analyzing how well it fits the Data Mesh paradigm can inherently only be part of the whole picture. Nevertheless, we want to do the exercise here to see how a technological framework can support the adoption.

In this article, we'll explore how the Smart Data Lake Builder aligns with the four principles as outlined by Zhamak Dehghani and assess which key concepts it supports.

Please note that we expect some familiarity with the Data Mesh principles to follow along. There are plenty of resources (books from O'Reilly and Manning, articles and dedicated websites) if you want to dive deeper into the topic.

Domain Ownership

Individual domains or business units are responsible for the data they generate and maintain, fostering a sense of ownership and accountability.

Scale Out Complexity

With a Data Mesh, your data organization can scale out because responsibility shifts away from a single, centralized data team. With increased usage, you will have more data sources, more consumers and more interfaces, and with that, a larger variety of systems and technologies.

SDLB comes with a large set of connectors out of the box, and the integration of Airbyte source connectors extends this list even further. Anything missing can be implemented easily, as the whole ecosystem is open and builds on open standards.
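As a rough illustration (the connection, table and object names below are invented, and the exact parameters may vary between SDLB versions), a relational source could be declared like this:

connections {
  my_source_db {
    # illustrative connection, not from the original article
    type = JdbcTableConnection
    url = "jdbc:postgresql://my-host:5432/sales"
    driver = org.postgresql.Driver
  }
}
dataObjects {
  ext_orders {
    type = JdbcTableDataObject
    connectionId = my_source_db
    table = { db = public, name = orders }
  }
}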

Continuous change

Data Mesh embraces continuous change in your Data Products. Changing source systems and additional requirements both require you to adapt, and to adapt quickly.

SDLB is built for change. With built-in schema evolution, changing schemas can be handled automatically. Thanks to the declarative approach, additional data objects can be added in minutes, and the dependencies don't need to be defined by hand.
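As a sketch of what this looks like in practice (object and feed names are invented), a new pipeline step is just one more DataObject plus an Action referencing it; the dependency graph follows from inputId and outputId:

dataObjects {
  stg_customers {
    # illustrative objects, names are made up
    type = CsvFileDataObject
    path = "stg/customers"
  }
  int_customers {
    type = DeltaLakeTableDataObject
    table = { db = integration, name = customers }
  }
}
actions {
  copy_customers {
    type = CopyAction
    inputId = stg_customers
    outputId = int_customers
    metadata.feed = customers
  }
}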

Data as a Product

Data is treated as a product, with clear documentation, standardized interfaces, and well-defined quality measures to ensure its usability and value.

It's important to note that in the end, data products can be implemented in different technologies as long as they adhere to the agreed upon interfaces.

Again, there are different aspects where SDLB can help.

Data Quality

A Data Product has well-defined quality measures and should define Service-Level Objectives (SLOs).

SDLB builds data quality into your pipelines. It supports constraints and expectations that are evaluated each time a pipeline is executed, instead of relying on downstream checks after the fact.
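As a hedged sketch (column, constraint and expectation names are invented, and the exact options may differ between SDLB versions), constraints and expectations are declared directly on a DataObject and checked on every write:

int_orders {
  # illustrative DataObject, names are made up
  type = DeltaLakeTableDataObject
  table = { db = integration, name = orders }
  constraints = [{
    name = amountNotNegative
    expression = "amount >= 0"
  }]
  expectations = [{
    type = SQLExpectation
    name = hasRecords
    aggExpression = "count(*)"
    expectation = "> 0"
  }]
}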

SDLB also gathers metrics with every run. With these metrics, you can detect anomalies in, for example, the number of records or bytes written. They are often helpful in finding latent problems in your data pipeline.

Schema enforcement

As another aspect of quality, you expect your source system to adhere to the agreed interface.

SDLB supports schema validation for all its data objects. In your configuration, you can (optionally) define a schema and enforce it. This saves you from surprises if an upstream system suddenly changes the schema of a data object you use.
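A minimal sketch (column names invented), using the same schema syntax shown in the housekeeping example later in this blog:

ext_orders {
  # illustrative object and columns
  type = JsonFileDataObject
  path = uploadFolder/
  schema = """order_id string, customer_id string, amount decimal(10,2), order_ts timestamp"""
}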

Lineage

To provide transparency about the origin and transformations of data in your Data Product, a lineage diagram can greatly help.

SDLB is a metadata-driven framework. You define Data Objects and the Actions between them. That's it. SDLB figures out the dependencies at runtime, but they can also be displayed easily from the static configuration. This is done, for example, in our sdl-visualization (see also the UI-Demo). SDL configuration files can be easily displayed and analyzed, which provides the needed transparency for a Data Product.

Interfaces

SDLB offers many standard Data Objects for all kinds of interfaces. With the declarative approach and schema enforcement, you can make sure that your data always adheres to the defined interface specification.

In cloud environments today, we often work with Delta or Iceberg tables. These open standards are a powerful tool to share your Data Product, even in multi-cloud deployments.
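As a sketch (database, table and path names invented), exposing a Data Product as a Delta table can be as simple as declaring it as a DeltaLakeTableDataObject:

dp_sales {
  # illustrative Data Product interface
  type = DeltaLakeTableDataObject
  path = "dataproducts/sales"
  table = { db = data_products, name = sales }
}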

Self-Serve infrastructure platform

Data infrastructure is designed to be self-serve, enabling teams to access and manage their data autonomously without relying on central data teams.

Mobilize more developers

With increased usage of the Mesh, you need more developers in your domain teams building Data Products.

SDLB has an easy-to-learn, declarative approach that new users pick up quickly. It supports SQL, Scala and Python for the definition of transformations, so your developers can use the language they are most comfortable with. Complex logic and transformations like historization and deduplication are either built in or can be extended in a generic way, so your developers can leverage them easily.
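As an illustrative sketch (object names are invented, and the exact transformer options may vary by version), a SQL transformation can be attached to an action like this:

transform_orders {
  # illustrative action with an inline SQL transformer
  type = CopyAction
  inputId = stg_orders
  outputId = int_orders
  transformers = [{
    type = SQLDfTransformer
    code = "select order_id, customer_id, amount from stg_orders where amount >= 0"
  }]
  metadata.feed = orders
}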

Support your domain teams

The self-serve infrastructure should support your domain teams in quickly building new Data Products.

SDLB is ideal to integrate into DevOps pipelines, as all configurations are written in the text-based HOCON format. Even custom code is written in SQL, Scala or Python, which also integrates well into any existing DevOps environment. It is designed to support code reviews, automated testing and deployment. With the right templates, your domain teams can set up new Data Products quickly.

Federated governance

Governance of data and its quality is distributed across domains, with a focus on collaboration, standardized practices, and federated decision-making processes to maintain data integrity and trust.

While many of the concepts for a federated governance are on an organizational level and not on a technical level, SDLB can again help on various points.

Data Catalogs

When it comes to Data Catalogs, there is no open, agreed upon standard (yet).

SDLB's configuration files are open and can be visualized as mentioned before. The concepts of Data Objects and Actions are very similar to the notation of many Data Catalogs, so they integrate nicely into existing solutions. For example, there is an exporter for Apache Atlas, which also forms the basis of Azure Purview.

Encryption

One aspect of increasing importance is privacy and compliance. For these topics, it helps to have a unified, generic implementation so that your domain teams don't each start their own.

SDLB now supports encrypted columns, which help, for example, with PII (Personally Identifiable Information) data.

Summary

There are a lot of aspects to consider when adopting a Data Mesh, many of which are on a technical level. While you want to give your domain teams the independence to choose the technologies they know best, you still want a framework that is easy to learn, quick to adapt and open to work with any other Data Product. This article hopefully gave you a basic overview of how Smart Data Lake Builder can help you.

· 6 min read

In this article, we're taking a look at how we use SDLB's housekeeping features to keep our pipelines running efficiently.

Some DataObjects come with housekeeping features of their own. Make sure you use them! For example, Delta tables support commands like optimize and vacuum to optimize storage and delete files that are no longer needed.
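For example, on Databricks these maintenance commands can be run as plain Spark SQL (the table name below is invented; check your retention requirements before vacuuming):

-- table name is illustrative
OPTIMIZE lakehouse.stage.orders;
VACUUM lakehouse.stage.orders RETAIN 168 HOURS;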

But usually, those commands do not re-organize your partitions. This is where SDLB's housekeeping mode comes in.

The example is taken from a real-world project we've implemented.

Context

In this particular project, we collect data from various reporting units and process it in batches. The reporting units use an Azure Function to upload JSON files to an Azure Data Lake Storage account. From there, we pick them up for validation and processing. Reporting units can upload data anytime, but it is only processed a few times a day.

Once validated, we use Delta Lake tables in Databricks to process data through the layers of the Lakehouse.

Partitioning

The Azure Function puts uploaded JSON files in a subfolder for each reporting unit. As such, JSON files are already neatly partitioned by reporting_unit:

uploadFolder/
  reporting_unit=rp01/
    file1.json
    file2.json
    file3.json
  reporting_unit=rp02/
    file1.json
  reporting_unit=rp03/
    fileX.json

To read these JSON files, we can therefore use the following DataObject definition:

import_json {
  type = JsonFileDataObject
  path = uploadFolder/
  partitions = [reporting_unit]
}

These files are then processed with a FileTransferAction into an output DataObject stage_json:

stage_json {
  type = FileTransferAction
  inputId = import_json
  outputId = stage_json
  executionMode = { type = FileIncrementalMoveMode }
  metadata.feed = stage_json
}

Each time we start to process uploaded data, we use the run_id to keep track of all batch jobs and the versions of files delivered. If you use a state path (see commandLine), your runs automatically generate a run_id to identify each run, and you can use it by extending your DataObject:

stage_json {
  type = JsonFileDataObject
  path = processedFolder
  partitions = [run_id,reporting_unit]
  schema = """reporting_unit string, run_id string, ...."""
}

Note how we just use run_id as part of the schema without any further declaration. Since we use the state path, SDLB maintains a run_id internally, and if it is referenced as a partition column in a DataObject, processed data automatically gets assigned the id of the current run.

Drawback

Let's take a look at the resulting partition layout of stage_json:

processedFolder/
  run_id=1/
    reporting_unit=rp01/
      file1.json
      file2.json
      file3.json
    reporting_unit=rp02/
      file1.json
    reporting_unit=rp03/
      fileX.json

This partition layout has many advantages in our case, as we know exactly during which run a particular file was processed and which reporting unit uploaded it. In further stages, we can work only with files that were processed in the current run and leave older run_ids untouched.

For this use case, a few things are important to note:

  • Some reporting units don't upload data for days. You end up with only a few reporting_unit partitions per run_id.
  • File sizes are rather small (< 1 MiB), so partition sizes end up very small too.
  • If you use hourly runs and run 24/7, you end up with 168 partitions per week, plus sub-partitions for reporting units.
  • Once files are correctly processed, we don't read the uploaded files anymore. We still keep them as raw files should we ever need to re-process them.

The drawback becomes apparent when you have actions working with all partitions: they become very slow. Spark doesn't handle a large number of small partitions well.

To mitigate that, we use SDLB's Housekeeping Feature.

HousekeepingMode

If you take a look at DataObject's parameters, you will see a housekeepingMode. There are two modes available:

  • PartitionArchiveCompactionMode: to compact / archive partitions
  • PartitionRetentionMode: to delete certain partitions completely

PartitionArchiveCompactionMode

In this mode, you solve two tasks at once:

  • You define how many smaller partitions are aggregated into one larger partition (archive)
  • You rewrite all files in a partition to combine many small files into larger files (compact)

Archive

In our example above, we stated that we don't want to alter any input files, so we won't use compaction; we want to keep them as they are (raw data). But we do want to get rid of all the small partitions after a certain amount of time. For that, we extend stage_json to include the housekeepingMode with an archivePartitionExpression:

stage_json {
  type = JsonFileDataObject
  path = processedFolder
  partitions = [run_id,reporting_unit]
  schema = """reporting_unit string, run_id string, ...."""
  housekeepingMode = {
    type = PartitionArchiveCompactionMode
    archivePartitionExpression = "if( elements.run_id < (runId - 500), map('run_id', (cast(elements.run_id as integer) div 500) * 500, 'reporting_unit', elements.reporting_unit), elements)"
  }
}

This expression probably needs some explanation:
The Spark SQL expression works with attributes of PartitionExpressionData. In this case, we use runId (the current run id) and elements (all partition values as a map(string,string)). The expression needs to return a map(string,string) defining the new partition values. In our case, it needs to define run_id and reporting_unit because these are the partitions defined in stage_json.

Let's take the expression apart:
if(elements.run_id < (runId - 500), ...
Only archive the partition if its run_id is more than 500 runs older than the current runId.

map('run_id', (cast(elements.run_id as integer) div 500) * 500, 'reporting_unit', elements.reporting_unit)
Creates the map with the new values for the partitions. The run_id is floored to the nearest multiple of 500; for example, run_id 1984 becomes 1500 (integer division: 1984 div 500 = 3, and 3 * 500 = 1500).
Remember that we need to return all partition values in the map, including the ones we don't want to alter. For reporting_unit we simply return the existing value elements.reporting_unit.

..., elements)
This is the else condition and simply returns the existing partition values if there is nothing to archive.

info

The housekeeping mode is applied after writing a DataObject. Keep in mind that it is executed with every run.

Compaction

We don't want to compact files in our case. But from the documentation you can see that compaction works very similarly:
you again work with attributes of PartitionExpressionData, but instead of new partition values, you return a boolean indicating for each partition whether it should be compacted.
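As a hedged sketch of what such a configuration could look like (we did not use it in this project, and the parameter name compactPartitionExpression is an assumption to be checked against the documentation), compacting all partitions older than 100 runs:

housekeepingMode = {
  # illustrative only; verify the parameter name for your SDLB version
  type = PartitionArchiveCompactionMode
  compactPartitionExpression = "elements.run_id < (runId - 100)"
}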

PartitionRetentionMode

Again, this mode is not used in our example, as we never delete old files. But if you need to, you define a Spark SQL expression returning a boolean that indicates whether a partition should be retained or deleted.

stage_json {
  type = JsonFileDataObject
  path = processedFolder
  partitions = [run_id,reporting_unit]
  schema = """reporting_unit string, run_id string, ...."""
  housekeepingMode = {
    type = PartitionRetentionMode
    retentionCondition = "elements.run_id > (runId - 500)"
  }
}

Result

In our example, performance had gradually decreased because Spark had to read more than 10'000 partitions and sub-partitions. Just listing all available partitions, even when only working with the most recent one, took a few minutes, and these operations added up.

With the housekeeping mode enabled, older partitions continuously get merged into larger partitions containing up to 500 runs. This brought the duration of list operations back to a few seconds.

These operations are fully automated; no manual intervention is required.

· 13 min read

In many cases, datasets are not static: new data points are created, values change, and data expires. We are interested in keeping track of all these changes. This article first presents collecting data via JDBC with deduplication on the fly. Then, a Change Data Capture (CDC) enabled (MS)SQL table is transferred and historized in the data lake using the Airbyte MS SQL connector, which supports CDC. Methods for reducing the computational and storage effort are also mentioned.

· 6 min read

Many analytics applications are being ported to the cloud, and Data Lakes and Lakehouses in the cloud are becoming more and more popular. The Databricks platform provides an easily accessible and easily configurable way to implement a modern analytics platform. Smart Data Lake Builder, on the other hand, provides an open-source, portable automation tool to load and transform the data.

This article describes the deployment of Smart Data Lake Builder (SDLB) on Databricks.