Features
Smart Data Lake Builder is still under heavy development, so new features are added all the time. The following list gives a rough overview of current and planned features. More details on the roadmap will follow shortly.
File-based metadata
- Easy to version with a VCS for DevOps
- Flexible structure by splitting over multiple files and subdirectories
- Easy to generate from third-party metadata (e.g. a source system's table catalog) to automate the transformation of large numbers of DataObjects
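For illustration, a pipeline is described entirely in HOCON configuration files that can be split across directories and versioned like code. The following minimal sketch shows the typical structure with a `dataObjects` and an `actions` section; the object types, attributes and ids are examples and may differ from your setup and SDLB version.

```hocon
# Minimal illustrative pipeline: copy a staged CSV file into a DeltaLake table.
# Ids, types and paths are examples only.
dataObjects {
  stg-airports {
    type = CsvFileDataObject
    path = "data/stg-airports"
  }
  int-airports {
    type = DeltaLakeTableDataObject
    path = "data/int-airports"
    table {
      db = "default"
      name = "int_airports"
      primaryKey = [ident]
    }
  }
}

actions {
  copy-airports {
    type = CopyAction
    inputId = stg-airports
    outputId = int-airports
    metadata {
      feed = compute
    }
  }
}
```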
Support for complex workflows & streaming
- Fork, join, parallel execution and multiple start & end nodes possible
- Recovery of failed runs
- Switch a workflow between batch and streaming execution with a single command line switch
Execution Engines
- Spark (DataFrames)
- File (InputStream & OutputStream)
- Future: Kafka Streams, Flink, …
Connectivity
- Spark: diverse connectors (HadoopFS, Hive, DeltaLake, JDBC, Kafka, Splunk, Webservice, JMS) and formats (CSV, JSON, XML, Avro, Parquet, Excel, Access …)
- File: SFTP, Local, Webservice
- Easy to extend by implementing predefined Scala traits
- Support for getting secrets from different secret providers
- Support for SQL update & merge (JDBC, DeltaLake)
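As a sketch of how connections and secrets fit together, a JDBC connection can be defined once and referenced by DataObjects, with credentials resolved through a secret provider instead of being stored in clear text. The attribute names and secret reference syntax below follow a common SDLB pattern but vary between versions, so treat them as assumptions and check the reference documentation.

```hocon
# Illustrative JDBC connection with credentials resolved from environment variables
# via a secret provider (syntax is an assumption; verify against your SDLB version).
connections {
  my-postgres {
    type = JdbcTableConnection
    url = "jdbc:postgresql://localhost:5432/mydb"
    driver = org.postgresql.Driver
    authMode {
      type = BasicAuthMode
      userVariable = "ENV#MYDB_USER"
      passwordVariable = "ENV#MYDB_PASSWORD"
    }
  }
}

dataObjects {
  ext-orders {
    type = JdbcTableDataObject
    connectionId = my-postgres
    table {
      db = "public"
      name = "orders"
    }
  }
}
```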
Generic Transformations
- Spark-based: Copy, Historization, Deduplication (incl. incremental update/merge mode for streaming)
- File-based: FileTransfer
- Easy to extend by implementing predefined Scala traits
- Future: applying MLflow machine learning models
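For illustration, the predefined transformations are used by simply choosing the corresponding action type; no code is needed. The sketch below assumes the usual DeduplicateAction and HistorizeAction types with example DataObject ids.

```hocon
# Illustrative use of predefined actions; DataObject ids are examples only.
actions {
  deduplicate-customers {
    type = DeduplicateAction
    inputId = stg-customers
    outputId = int-customers
  }
  historize-customers {
    type = HistorizeAction
    inputId = int-customers
    outputId = btl-customers-history
  }
}
```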
Customizable Transformations
- Spark transformations:
  - Chain predefined standard transformations (e.g. filter, row-level data validation and more) and custom transformations within the same action
  - Custom transformation languages: SQL, Scala (class, compiled from config), Python (see the sketch after this list)
  - Many input DataFrames to many output DataFrames (normally only one output is recommended, in order to define dependencies as precisely as possible for the lineage)
  - Add metadata to each transformation to explain your data pipeline
- File transformations:
  - Language: Scala
  - Only one-to-one (one InputStream to one OutputStream)
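As a sketch of a custom Spark transformation, a SQL transformer can be chained into an action; further predefined or custom transformers (Scala class, Python) could be added to the same `transformers` list. The transformer type and attribute names are based on common SDLB conventions and should be verified against the reference of your version.

```hocon
# Illustrative custom SQL transformation inside a CopyAction.
# By convention, the input DataObject is referenced as a temporary view
# named after its id (with dashes replaced by underscores).
actions {
  transform-orders {
    type = CopyAction
    inputId = stg-orders
    outputId = int-orders
    transformers = [{
      type = SQLDfTransformer
      code = "select id, customer_id, upper(status) as status from stg_orders"
    }]
  }
}
```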
Early Validation
(see execution phases for details)
- Three phases run before the actual execution:
  - Load Config: validate the configuration
  - Prepare: validate connections
  - Init: validate the Spark DataFrame lineage (missing columns in transformations of later actions will stop the execution)
Execution Modes
(see execution modes for details; a configuration sketch follows this list)
- Process all data
- Partition parameters: pass the partition values to be processed by the start nodes as parameters
- Partition Diff: detect missing partitions and use them as parameters
- Spark Incremental: compare a sortable column between source and target and load only the difference
- Spark Streaming: asynchronous incremental processing using Spark Structured Streaming
- Spark Streaming Once: synchronous incremental processing using Spark Structured Streaming with Trigger=Once mode
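For illustration, an execution mode is configured per action. The sketch below uses PartitionDiffMode to process only partitions that exist in the input but are missing in the output; other modes are configured the same way, with mode-specific attributes.

```hocon
# Illustrative: process only partitions missing in the output DataObject.
actions {
  copy-airports {
    type = CopyAction
    inputId = stg-airports
    outputId = int-airports
    executionMode = {
      type = PartitionDiffMode
    }
  }
}
```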
Schema Evolution
- Automatic evolution of data schemas (new columns, removed columns, changed datatypes)
- Support for changes in complex datatypes (e.g. a new column in an array of struct)
- Automatic adaptation of DataObjects with a fixed schema (JDBC, DeltaLake)
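As a sketch, schema evolution is typically enabled per DataObject with a dedicated flag; the attribute name below is an assumption and should be checked against the DataObject reference of your SDLB version.

```hocon
# Illustrative: a table DataObject that may evolve its schema automatically.
# The attribute name allowSchemaEvolution is an assumption; verify for your version.
dataObjects {
  int-customers {
    type = DeltaLakeTableDataObject
    path = "data/int-customers"
    table {
      db = "default"
      name = "int_customers"
    }
    allowSchemaEvolution = true
  }
}
```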
Metrics
- Number of rows written per DataObject
- Execution duration per Action
- StateListener interface to get notified about progress & metrics
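For illustration, a custom state listener implementing SDLB's StateListener interface can be registered in the global configuration section to forward progress and metrics to an external system. The class name below is hypothetical, and the exact configuration key should be verified against the reference.

```hocon
# Illustrative: register a hypothetical custom state listener for progress & metrics.
global {
  stateListeners = [{
    className = "com.example.MyMetricsForwarder"
  }]
}
```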
Data Catalog
- Report all DataObject attributes (incl. foreign keys, if defined) for visualisation of the data catalog in a BI tool
- Metadata support for categorizing Actions and DataObjects
- Custom metadata attributes
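As a sketch, catalog information is attached to DataObjects (and Actions) through their metadata section; the attributes and values shown below are examples only.

```hocon
# Illustrative DataObject metadata used to build a data catalog (values are examples).
dataObjects {
  int-airports {
    type = DeltaLakeTableDataObject
    path = "data/int-airports"
    table {
      db = "default"
      name = "int_airports"
    }
    metadata {
      name = "Airports"
      description = "Cleansed and deduplicated list of airports"
      layer = "integration"
      tags = ["master-data", "aviation"]
    }
  }
}
```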
Lineage
- Report all dependencies between DataObjects for visualisation of the lineage in a BI tool
Data Quality
- Metadata support for primary & foreign keys
- Check & report primary key violations by executing a primary key checker action
- Future: Metadata support for arbitrary data quality checks
- Future: Report data quality (foreign key matching & arbitrary data quality checks) by executing a data quality reporter action
Testing
- Support for CI
  - Config validation
  - Custom transformation unit tests
  - Spark data pipeline simulation (acceptance tests)
- Support for Deployment
  - Dry-run
Spark Performance
- Execute multiple Spark jobs in parallel within the same Spark Session to save resources
- Automatically cache and release intermediate results (DataFrames)
Housekeeping
- Delete, or archive and compact, partitions according to configurable expressions (see the sketch below)
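For illustration, housekeeping is configured per DataObject; the sketch below assumes a retention mode that drops partitions not matching a configurable expression. The mode name, attribute name and expression syntax are assumptions based on this feature and should be checked against the housekeeping documentation.

```hocon
# Illustrative: keep only recent partitions of a partitioned DataObject.
# Mode name, attribute name and expression syntax are assumptions; verify for your version.
dataObjects {
  int-events {
    type = DeltaLakeTableDataObject
    path = "data/int-events"
    table {
      db = "default"
      name = "int_events"
    }
    partitions = [dt]
    housekeepingMode = {
      type = PartitionRetentionMode
      retentionCondition = "dt > '2024-01-01'"
    }
  }
}
```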