Features
Smart Data Lake Builder is under active development, so new features are added all the time. The following list gives a rough overview of current and planned features. More details on the roadmap will follow shortly.
Declarative approach, file-based metadata
- Easy to version with a VCS for DevOps
- Flexible structure by splitting over multiple files and subdirectories
- Easy to generate from third-party metadata (e.g. a source system's table catalog) to automate the transformation of a large number of DataObjects
- Support for handling multiple environments (see the configuration sketch below)
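For illustration, a minimal configuration sketch in SDLB's HOCON format is shown below. It declares two DataObjects and one Action connecting them; the type names used (CsvFileDataObject, DeltaLakeTableDataObject, CopyAction) follow the Getting Started examples, and the identifiers and paths are purely illustrative.

```hocon
# Configuration can be split over multiple files and subdirectories
# and versioned in a VCS; environment-specific values can be factored
# out into separate files.
dataObjects {
  stg-airports {
    type = CsvFileDataObject
    path = "stg-airports"
  }
  int-airports {
    type = DeltaLakeTableDataObject
    path = "int-airports"
    table {
      db = "default"
      name = "int_airports"
    }
  }
}

actions {
  load-airports {
    type = CopyAction
    inputId = stg-airports
    outputId = int-airports
    metadata {
      feed = airports
    }
  }
}
```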
Support for complex workflows & streaming
- Fork, join, parallel execution, multiple start- & end-nodes possible
- Recovery of failed runs
- Switch a workflow between batch and streaming execution with a single command-line switch
Multi-Engine
- Spark (DataFrames)
- Snowflake (DataFrames)
- File (Input&OutputStream)
- Future: SQL, Kafka Streams, Flink, …
Connectivity
- Spark: diverse connectors (HadoopFS, Hive, DeltaLake, JDBC, Kafka, Splunk, Webservice, JMS) and formats (CSV, JSON, XML, Avro, Parquet, Excel, Access …)
- File: SFTP, Local, Webservice
- Easy to extend by implementing predefined Scala traits
- Support for retrieving secrets from different secret providers (see the connection sketch after this list)
- Support for SQL update & merge (JDBC, DeltaLake)
- Support for integration of Airbyte sources
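As a sketch of how connectivity is configured, the example below declares a JDBC connection and a table DataObject referencing it. The type and key names (JdbcTableConnection, JdbcTableDataObject, connectionId) follow the documentation, but the secret-provider syntax for credentials differs between versions, so it is only hinted at in the comments.

```hocon
connections {
  my-postgres {
    type = JdbcTableConnection
    url = "jdbc:postgresql://localhost:5432/mydb"
    driver = org.postgresql.Driver
    # Credentials are not written in clear text here; they are resolved via an
    # authMode that reads them from a configured secret provider (e.g. environment
    # variables or a vault). The exact syntax depends on the SDLB version.
  }
}

dataObjects {
  int-customers {
    type = JdbcTableDataObject
    connectionId = my-postgres
    table {
      db = "public"
      name = "customers"
    }
  }
}
```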
Generic Transformations
- Spark-based: Copy, Historization, Deduplication (incl. incremental update/merge mode for streaming)
- File-based: FileTransfer
- Easy to extend by implementing predefined Scala traits
- Future: applying MLFlow machine learning models
Customizable Transformations
- Spark Transformations:
  - Chain predefined standard transformations (e.g. filter, row-level data validation and more) and custom transformations within the same action (see the sketch after this list)
  - Custom transformation languages: SQL, Scala (class, or code compiled from the config), Python
  - Many input DataFrames to many output DataFrames (normally only one output is recommended, in order to define dependencies as precisely as possible for the lineage)
  - Add metadata to each transformation to explain your data pipeline.
- File Transformations:
  - Language: Scala
  - Only one-to-one (one InputStream to one OutputStream)
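A rough sketch of chaining transformers within one action is shown below; the transformer types (SQLDfTransformer, ScalaClassSparkDfTransformer) are documented, while the SQL statement, class name and DataObject ids are invented for illustration.

```hocon
actions {
  compute-distances {
    type = CopyAction
    inputId = int-flights
    outputId = btl-distances
    transformers = [
      {
        # SQL transformation defined directly in the config;
        # %{inputViewName} resolves to the temporary view of the input DataObject
        type = SQLDfTransformer
        code = "select dep_airport, arr_airport, count(*) as cnt from %{inputViewName} group by dep_airport, arr_airport"
      },
      {
        # custom Scala transformation implementing a predefined trait,
        # referenced by class name (illustrative)
        type = ScalaClassSparkDfTransformer
        className = com.example.ComputeDistanceTransformer
      }
    ]
  }
}
```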
Early Validation
Each run is validated in three phases before actual execution:
- Load Config: validate configuration
- Prepare: validate connections
- Init: validate Spark DataFrame Lineage (missing columns in transformations of later actions will stop the execution)
See execution phases for details.
Execution Modes
Select the data to process, e.g.:
- Process all data
- Partition parameters: pass the partition values to process for start nodes as a parameter
- Partition Diff: detect missing partitions and use them as parameters (see the sketch after this list)
- Incremental: use a stateful input DataObject, or compare a sortable column between source and target and load only the difference
- Spark Streaming: asynchronous incremental processing by using Spark Structured Streaming
- Spark Streaming Once: synchronous incremental processing by using Spark Structured Streaming with Trigger=Once mode
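Execution modes are selected per Action in the configuration. The sketch below uses the documented PartitionDiffMode; the surrounding Action and DataObject ids are illustrative.

```hocon
actions {
  load-airports {
    type = CopyAction
    inputId = stg-airports
    outputId = int-airports
    executionMode {
      # process only partitions that exist in the input
      # but are still missing in the output
      type = PartitionDiffMode
    }
  }
}
```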
Schema Evolution
- Automatic evolution of data schemas (new columns, removed columns, changed data types)
- Support for changes in complex data types (e.g. a new column in an array of structs)
- Automatic adaptation of DataObjects with a fixed schema (JDBC, DeltaLake)
Metrics
- Number of rows read/written per DataObject
- Execution duration per Action
- Arbitrary custom metrics defined by aggregation expressions
- Predefined metrics for transfer rate, completeness, and ensuring unique constraints
- StateListener interface to get notified about progress & metrics
Data Catalog
- Report all DataObject attributes (incl. foreign keys if defined) for visualization of the data catalog in a BI tool
- Metadata support for categorizing Actions and DataObjects
- Custom metadata attributes (see the metadata sketch after this list)
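As an illustration, descriptive metadata can be attached to DataObjects (and Actions) in the configuration. The attribute names below (name, description, layer, tags) follow the documented metadata element; the values are invented.

```hocon
dataObjects {
  int-airports {
    type = DeltaLakeTableDataObject
    path = "int-airports"
    table {
      db = "default"
      name = "int_airports"
    }
    metadata {
      name = "Airports"
      description = "Cleaned and deduplicated airport master data"
      layer = "integration"
      tags = ["masterdata", "aviation"]
    }
  }
}
```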
Lineage
- Browse lineage of DataObjects and Actions in the UI
Data Quality
- Metadata support for primary & foreign keys
- Check & report primary key violations by executing primary key checker action
- Define and validate row-level Constraints before writing DataObject
- Define and evaluate Expectations when writing DataObject, trigger warning or error, collect result as custom metric
- Future: Report data quality (foreign key matching & expectations) by executing data quality reporter action
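A sketch of row-level Constraints and Expectations declared on a DataObject is shown below. The elements shown (constraints with an expression, SQLExpectation with an aggregation expression) follow the documentation, but the exact attribute names should be checked against the version in use; the values are illustrative.

```hocon
dataObjects {
  btl-distances {
    type = DeltaLakeTableDataObject
    path = "btl-distances"
    table {
      db = "default"
      name = "btl_distances"
      primaryKey = [dep_airport, arr_airport]
    }
    # row-level constraints are validated before writing
    constraints = [{
      name = distancePositive
      expression = "distance_km > 0"
    }]
    # expectations are evaluated on the written data and collected as custom metrics
    expectations = [{
      type = SQLExpectation
      name = countNotEmpty
      aggExpression = "count(*)"
      expectation = "> 0"
    }]
  }
}
```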
Testing
- Support for CI
  - Config validation
  - Custom transformation unit tests
  - Spark data pipeline simulation (acceptance tests)
- Support for Deployment
  - Dry-run
Spark Performance
- Execute multiple Spark jobs in parallel within the same Spark Session to save resources
- Automatically cache and release intermediate results (DataFrames)
Housekeeping
- Delete, or archive & compact, partitions according to configurable expressions (see the sketch after this list)
- Extend with custom housekeeping logic
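For example, housekeeping can be configured per DataObject. The sketch below assumes the documented PartitionRetentionMode with a retention condition; the expression and the attributes available in it depend on the version in use and are shown here only as an illustration.

```hocon
dataObjects {
  stg-events {
    type = CsvFileDataObject
    path = "stg-events"
    partitions = [dt]
    housekeepingMode {
      # keep only partitions with a date after the cutoff; older partitions are deleted
      # (the attributes usable in the condition are version-dependent)
      type = PartitionRetentionMode
      retentionCondition = "elements['dt'] > '20240101'"
    }
  }
}
```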
User Interface
- Configuration viewer with catalog and lineage view
- Comprehensive workflow visualization
- Documentation from metadata and code: all configuration elements can be described in the metadata and are enriched with documentation from the code where possible.
See also the UI Demo visualizing the Getting Started data pipeline.