Features
Smart Data Lake Builder is still under heavy development, so new features are added all the time. The following list gives a rough overview of current and planned features. More details on the roadmap will follow shortly.
File-based metadata
- Easy to version with a VCS for DevOps
- Flexible structure by splitting over multiple files and subdirectories
- Easy to generate from third-party metadata (e.g. a source system's table catalog) to automate the transformation of large numbers of DataObjects
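For illustration, a pipeline is described entirely in HOCON configuration files that can be split across directories and versioned like code. The following minimal sketch shows the typical structure with a `dataObjects` and an `actions` section; the object types, attributes and ids are examples and may differ from your setup and SDLB version.

```hocon
# Minimal illustrative pipeline: copy a staged CSV file into a DeltaLake table.
# Ids, types and paths are examples only.
dataObjects {
  stg-airports {
    type = CsvFileDataObject
    path = "data/stg-airports"
  }
  int-airports {
    type = DeltaLakeTableDataObject
    path = "data/int-airports"
    table {
      db = "default"
      name = "int_airports"
      primaryKey = [ident]
    }
  }
}

actions {
  copy-airports {
    type = CopyAction
    inputId = stg-airports
    outputId = int-airports
    metadata {
      feed = compute
    }
  }
}
```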
Support for complex workflows & streaming
- Fork, join, parallel execution and multiple start & end nodes possible
- Recovery of failed runs
- Switch a workflow between batch and streaming execution with a single command line switch
Execution Engines
- Spark (DataFrames)
- File (InputStream & OutputStream)
- Future: Kafka Streams, Flink, …
Connectivity
- Spark: diverse connectors (HadoopFS, Hive, DeltaLake, JDBC, Kafka, Splunk, Webservice, JMS) and formats (CSV, JSON, XML, Avro, Parquet, Excel, Access …)
- File: SFTP, Local, Webservice
- Easy to extend by implementing predefined Scala traits
- Support for getting secrets from different secret providers
- Support for SQL update & merge (JDBC, DeltaLake)
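As a sketch of how connections and secrets fit together, a JDBC connection can be defined once and referenced by DataObjects, with credentials resolved through a secret provider instead of being stored in clear text. The attribute names and secret reference syntax below follow a common SDLB pattern but vary between versions, so treat them as assumptions and check the reference documentation.

```hocon
# Illustrative JDBC connection with credentials resolved from environment variables
# via a secret provider (syntax is an assumption; verify against your SDLB version).
connections {
  my-postgres {
    type = JdbcTableConnection
    url = "jdbc:postgresql://localhost:5432/mydb"
    driver = org.postgresql.Driver
    authMode {
      type = BasicAuthMode
      userVariable = "ENV#MYDB_USER"
      passwordVariable = "ENV#MYDB_PASSWORD"
    }
  }
}

dataObjects {
  ext-orders {
    type = JdbcTableDataObject
    connectionId = my-postgres
    table {
      db = "public"
      name = "orders"
    }
  }
}
```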
Generic Transformations
- Spark-based: Copy, Historization, Deduplication (incl. incremental update/merge mode for streaming)
- File-based: FileTransfer
- Easy to extend by implementing predefined Scala traits
- Future: applying MLflow machine learning models
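For illustration, the predefined transformations are used by simply choosing the corresponding action type; no code is needed. The sketch below assumes the usual DeduplicateAction and HistorizeAction types with example DataObject ids.

```hocon
# Illustrative use of predefined actions; DataObject ids are examples only.
actions {
  deduplicate-customers {
    type = DeduplicateAction
    inputId = stg-customers
    outputId = int-customers
  }
  historize-customers {
    type = HistorizeAction
    inputId = int-customers
    outputId = btl-customers-history
  }
}
```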
Customizable Transformations
- Spark transformations:
  - Chain predefined standard transformations (e.g. filter, row-level data validation and more) and custom transformations within the same action
  - Custom transformation languages: SQL, Scala (class, compiled from config), Python (see the sketch after this list)
  - Many input DataFrames to many output DataFrames (normally only one output is recommended, in order to define dependencies as precisely as possible for the lineage)
  - Add metadata to each transformation to explain your data pipeline
- File transformations:
  - Language: Scala
  - Only one-to-one (one InputStream to one OutputStream)
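As a sketch of a custom Spark transformation, a SQL transformer can be chained into an action; further predefined or custom transformers (Scala class, Python) could be added to the same `transformers` list. The transformer type and attribute names are based on common SDLB conventions and should be verified against the reference of your version.

```hocon
# Illustrative custom SQL transformation inside a CopyAction.
# By convention, the input DataObject is referenced as a temporary view
# named after its id (with dashes replaced by underscores).
actions {
  transform-orders {
    type = CopyAction
    inputId = stg-orders
    outputId = int-orders
    transformers = [{
      type = SQLDfTransformer
      code = "select id, customer_id, upper(status) as status from stg_orders"
    }]
  }
}
```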
Early Validation
(see execution phases for details)
- Three phases run before the actual execution:
  - Load Config: validate the configuration
  - Prepare: validate connections
  - Init: validate the Spark DataFrame lineage (missing columns in transformations of later actions will stop the execution)
Execution Modes
(see execution modes for details; a configuration sketch follows this list)
- Process all data
- Partition parameters: pass the partition values to be processed by the start nodes as parameters
- Partition Diff: detect missing partitions and use them as parameters
- Spark Incremental: compare a sortable column between source and target and load only the difference
- Spark Streaming: asynchronous incremental processing using Spark Structured Streaming
- Spark Streaming Once: synchronous incremental processing using Spark Structured Streaming with Trigger=Once mode
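For illustration, an execution mode is configured per action. The sketch below uses PartitionDiffMode to process only partitions that exist in the input but are missing in the output; other modes are configured the same way, with mode-specific attributes.

```hocon
# Illustrative: process only partitions missing in the output DataObject.
actions {
  copy-airports {
    type = CopyAction
    inputId = stg-airports
    outputId = int-airports
    executionMode = {
      type = PartitionDiffMode
    }
  }
}
```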
Schema Evolution
- Automatic evolution of data schemas (new columns, removed columns, changed datatypes)
- Support for changes in complex datatypes (e.g. a new column in an array of struct)
- Automatic adaptation of DataObjects with a fixed schema (JDBC, DeltaLake)
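As a sketch, schema evolution is typically enabled per DataObject with a dedicated flag; the attribute name below is an assumption and should be checked against the DataObject reference of your SDLB version.

```hocon
# Illustrative: a table DataObject that may evolve its schema automatically.
# The attribute name allowSchemaEvolution is an assumption; verify for your version.
dataObjects {
  int-customers {
    type = DeltaLakeTableDataObject
    path = "data/int-customers"
    table {
      db = "default"
      name = "int_customers"
    }
    allowSchemaEvolution = true
  }
}
```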
Metrics
- Number of rows written per DataObject
- Execution duration per Action
- StateListener interface to get notified about progress & metrics
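For illustration, a custom state listener implementing SDLB's StateListener interface can be registered in the global configuration section to forward progress and metrics to an external system. The class name below is hypothetical, and the exact configuration key should be verified against the reference.

```hocon
# Illustrative: register a hypothetical custom state listener for progress & metrics.
global {
  stateListeners = [{
    className = "com.example.MyMetricsForwarder"
  }]
}
```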
Data Catalog
- Report all DataObject attributes (incl. foreign keys, if defined) for visualisation of the data catalog in a BI tool
- Metadata support for categorizing Actions and DataObjects
- Custom metadata attributes
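As a sketch, catalog information is attached to DataObjects (and Actions) through their metadata section; the attributes and values shown below are examples only.

```hocon
# Illustrative DataObject metadata used to build a data catalog (values are examples).
dataObjects {
  int-airports {
    type = DeltaLakeTableDataObject
    path = "data/int-airports"
    table {
      db = "default"
      name = "int_airports"
    }
    metadata {
      name = "Airports"
      description = "Cleansed and deduplicated list of airports"
      layer = "integration"
      tags = ["master-data", "aviation"]
    }
  }
}
```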
Lineage
- Report all dependencies between DataObjects for visualisation of the lineage in a BI tool
Data Quality
- Metadata support for primary & foreign keys
- Check & report primary key violations by executing a primary key checker action
- Future: Metadata support for arbitrary data quality checks
- Future: Report data quality (foreign key matching & arbitrary data quality checks) by executing a data quality reporter action
Testing
- Support for CI
  - Config validation
  - Custom transformation unit tests
  - Spark data pipeline simulation (acceptance tests)
- Support for Deployment
  - Dry-run
Spark Performance
- Execute multiple Spark jobs in parallel within the same Spark Session to save resources
- Automatically cache and release intermediate results (DataFrames)
Housekeeping
- Delete, or archive and compact, partitions according to configurable expressions (see the sketch below)
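For illustration, housekeeping is configured per DataObject; the sketch below assumes a retention mode that drops partitions not matching a configurable expression. The mode name, attribute name and expression syntax are assumptions based on this feature and should be checked against the housekeeping documentation.

```hocon
# Illustrative: keep only recent partitions of a partitioned DataObject.
# Mode name, attribute name and expression syntax are assumptions; verify for your version.
dataObjects {
  int-events {
    type = DeltaLakeTableDataObject
    path = "data/int-events"
    table {
      db = "default"
      name = "int_events"
    }
    partitions = [dt]
    housekeepingMode = {
      type = PartitionRetentionMode
      retentionCondition = "dt > '2024-01-01'"
    }
  }
}
```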