
Architecture

Smart Data Lake Builder (SDLB) is essentially a Java application that is started on the command line. It can run in many environments and platforms, such as a Databricks cluster, Azure Synapse, Google Dataproc, or your local machine; see Getting Started.
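
For illustration, a minimal local launch could look as follows. This is a sketch, not a definitive invocation: the JAR name and paths are placeholders for your own build and configuration, and it assumes the SDLB command-line options --config and --feed-sel together with the LocalSmartDataLakeBuilder entry point.

```bash
# Minimal sketch: start SDLB as a plain Java application on the command line.
# JAR name, config path and feed name are placeholders for your own setup.
java -cp target/smartdatalake-assembly.jar \
  io.smartdatalake.app.LocalSmartDataLakeBuilder \
  --config ./config \
  --feed-sel download
```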

Find below an overview of requirements, versions and supported configurations.

Basic Requirements

  • Needs Java 8+ to run
  • Uses Hadoop Java library to read local and remote files (S3, ADLS, HDFS, ...)
  • Is programmed in Scala
  • Uses Maven 3+ as build system (see the build sketch after this list)
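
As a sketch, checking the toolchain and building from source could look like this; these are standard Java and Maven commands, run from a checkout of the SDLB repository.

```bash
# Verify the required toolchain before building.
java -version   # should report Java 8 or newer
mvn -version    # should report Maven 3 or newer

# Build SDLB from source with Maven.
mvn clean package
```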

Versions and supported configurations

SDLB currently maintains the following two major versions, which are published as Maven artifacts on Maven Central:

| SDL Version | Java/Scala Version | File Engine (Hadoop Version) | Spark Engine | Snowflake/Snowpark Engine | Comments |
|---|---|---|---|---|---|
| 1.x, branch master/develop-spark2 | Java 8, Scala 2.11 & 2.12 | Hadoop 2.7.x | Spark 2.4.x | not supported | Delta Lake has limited functionality in Spark 2.x |
| 2.x, branch master/develop-spark3 | Java 8+, Scala 2.12 | Hadoop 3.3.x (2.7.x) | Spark 3.2.x (3.1.x) | Snowpark 1.2.x | Delta Lake, spark-snowflake and spark-extensions need specific library versions matching the corresponding Spark minor version |

Configurations using the alternative versions mentioned in parentheses can be built manually by activating the corresponding Maven profiles.
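
For example, such a build could be triggered as follows. The profile name below is an assumption for illustration only; the actual profile names are defined in the SDLB pom.xml.

```bash
# Hypothetical sketch: build against an alternative supported version by
# activating a Maven profile. "hadoop-2.7" is an assumed profile name;
# check the SDLB pom.xml for the profiles that are actually defined.
mvn clean package -P hadoop-2.7
```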

It is possible to make Smart Data Lake Builder work with other version combinations, but this requires manual tuning of the dependencies in your own Maven project.

In general, Java library versions are held as close as possible to the ones used in the corresponding Spark version.
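
When tuning such dependencies, the resolved versions can be checked with standard Maven tooling; for example, the dependency tree shows which Hadoop and Spark libraries actually end up on the classpath.

```bash
# Print the resolved dependency tree of your own Maven project, filtered to
# Hadoop and Spark artifacts, to check that library versions line up with
# those of the targeted Spark release.
mvn dependency:tree -Dincludes=org.apache.hadoop,org.apache.spark
```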

Release Notes

See the SDLB release notes, including breaking changes, on GitHub.

Logging

By default, SDLB uses the logging libraries included in the corresponding Spark version. This is Log4j 1.2.x for Spark 2.4.x up to Spark 3.2.x; starting with Spark 3.3.x, Log4j 2.x is used (see SPARK-6305).
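
Which configuration file is picked up can typically be overridden with the usual Log4j system properties at launch. A sketch, with placeholder file names and reusing the hypothetical launch command from above:

```bash
# Log4j 1.2.x (Spark 2.4.x up to 3.2.x): point to a custom properties file.
java -Dlog4j.configuration=file:./log4j.properties \
  -cp target/smartdatalake-assembly.jar \
  io.smartdatalake.app.LocalSmartDataLakeBuilder --config ./config --feed-sel download

# Log4j 2.x (Spark 3.3.x and later): use the Log4j 2 configuration property.
java -Dlog4j2.configurationFile=./log4j2.properties \
  -cp target/smartdatalake-assembly.jar \
  io.smartdatalake.app.LocalSmartDataLakeBuilder --config ./config --feed-sel download
```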

You can customize the logging dependencies manually by creating your own Maven project.