Architecture

Smart Data Lake Builder (SDLB) is a Java application that is started from the command line. It runs in many environments and on many platforms, such as a Databricks cluster, Azure Synapse, or Google Dataproc, as well as on your local machine; see Getting Started.

Below is an overview of requirements, versions, and supported configurations.

Basic Requirements

  • Needs Java 8+ to run
  • Uses Hadoop Java library to read local and remote files (S3, ADLS, HDFS, ...)
  • Is programmed in Scala
  • Uses Maven 3+ as build system
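
The Java 8+ requirement above can be verified before launching a job. The following is a minimal sketch, not part of SDLB: the class `JavaVersionCheck` and its methods are hypothetical names introduced for illustration. It relies only on the standard `java.specification.version` system property, which reports `1.8` on Java 8 and a plain major number (`9`, `11`, `17`, ...) on later releases.

```java
// Hypothetical helper, not part of SDLB: checks the "Java 8+" requirement.
final class JavaVersionCheck {

    // Extracts the major version from a "java.specification.version" value.
    // Java 8 and older report "1.x" (e.g. "1.8"); Java 9+ report "9", "11", ...
    static int majorVersion(String specVersion) {
        String[] parts = specVersion.trim().split("\\.");
        int first = Integer.parseInt(parts[0]);
        if (first == 1 && parts.length > 1) {
            return Integer.parseInt(parts[1]);
        }
        return first;
    }

    // True if the given specification version is Java 8 or newer.
    static boolean meetsRequirement(String specVersion) {
        return majorVersion(specVersion) >= 8;
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        System.out.println("Java specification version: " + spec);
        if (!meetsRequirement(spec)) {
            System.err.println("Java 8 or newer is required.");
            System.exit(1);
        }
    }
}
```

Running the class on the target machine or cluster node exits with a non-zero status if the JVM is older than Java 8.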

Versions and supported configuration

SDLB is published as Maven artifacts on Maven Central. Each SDLB version is built with specific versions of Apache Spark and other libraries. This page gives an overview of the respective versions.

SDLB Version 1.X

SDLB version 1.X used Apache Spark 2.X. This branch of SDLB is no longer maintained. To benefit from the latest developments, please upgrade to a more recent version of SDLB.

SDLB Version 2.X

The following table gives an overview of dependency versions that are delivered with each major branch of SDLB.

| SDL Version | Java/Scala Version | Hadoop Version | Spark Engine | Log4j | Snowflake/Snowpark Engine | Delta Lake | Iceberg |
|---|---|---|---|---|---|---|---|
| 2.8.X | Java 8+, Scala 2.12/2.13 | 3.3.6 | 3.5.3 | 2.20.0 | 3.1.1 / 1.15.0 (*) | 3.2.0 | 1.6.1 |
| 2.7.X | Java 8+, Scala 2.12/2.13 | 3.3.6 | 3.5.2 | 2.20.0 | 3.0.0 / 1.13.2 (*) | 3.2.0 | 1.6.1 |
| 2.6.X | Java 8+, Scala 2.12/2.13 | 3.3.6 | 3.4.3 | 2.20.0 | 2.12.0 / 1.9.0 (*) | 2.4.0 | 1.3.1 |
| 2.5.X | Java 8+, Scala 2.12 | 3.3.2 | 3.3.2 | 2.17.2 | 2.11.0 / 1.6.2 | 2.2.0 | 1.1.0 |
| 2.4.X | Java 8+, Scala 2.12 | 3.3.1 | 3.2.2 | 1.2.17 | 2.10.0 / 1.2.0 | 2.0.0 | - |
| 2.3.X | Java 8+, Scala 2.12 | 3.3.1 | 3.2.2 | 1.2.17 | 2.10.0 / 1.2.0 | 2.0.0 | - |
| 2.2.X | Java 8+, Scala 2.12 | 3.3.1 | 3.2.1 | 1.2.17 | 2.9.2 / 0.11.0 | 1.1.0 | - |
| 2.1.X | Java 8+, Scala 2.12 | 2.7.4 | 3.1.1 | 1.2.17 | 2.8.4 | 1.0.0 | - |

(*) Snowpark is not supported for Scala 2.13, see also this note.
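
For scripted environment checks, the Spark Engine column of the table above can be encoded as a simple lookup. This is only a sketch: the class `BundledSparkLookup` and its methods are hypothetical and not part of SDLB, and the values are copied from the table, so they need to be updated whenever the table changes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of SDLB: maps an SDLB 2.X branch to the
// Spark version listed for it in the table above.
final class BundledSparkLookup {

    private static final Map<String, String> SPARK_BY_BRANCH = new LinkedHashMap<>();
    static {
        SPARK_BY_BRANCH.put("2.8", "3.5.3");
        SPARK_BY_BRANCH.put("2.7", "3.5.2");
        SPARK_BY_BRANCH.put("2.6", "3.4.3");
        SPARK_BY_BRANCH.put("2.5", "3.3.2");
        SPARK_BY_BRANCH.put("2.4", "3.2.2");
        SPARK_BY_BRANCH.put("2.3", "3.2.2");
        SPARK_BY_BRANCH.put("2.2", "3.2.1");
        SPARK_BY_BRANCH.put("2.1", "3.1.1");
    }

    // Reduces a version such as "2.7.3" or "2.7.X" to its branch, "2.7".
    static String branchOf(String sdlbVersion) {
        String[] parts = sdlbVersion.trim().split("\\.");
        if (parts.length < 2) {
            throw new IllegalArgumentException("Unexpected version: " + sdlbVersion);
        }
        return parts[0] + "." + parts[1];
    }

    // Returns the Spark version listed for the branch of the given SDLB version.
    static String sparkVersionFor(String sdlbVersion) {
        String spark = SPARK_BY_BRANCH.get(branchOf(sdlbVersion));
        if (spark == null) {
            throw new IllegalArgumentException("Branch not in table: " + sdlbVersion);
        }
        return spark;
    }

    public static void main(String[] args) {
        System.out.println(sparkVersionFor("2.7.X"));
    }
}
```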

It is possible to customize dependencies and make Smart Data Lake Builder work with other version combinations, but this requires manually tuning the dependencies in your own Maven project.

In general, Java library versions are kept as close as possible to the ones used in the corresponding Spark version.

Release Notes

See the SDLB Release Notes on GitHub, which include breaking changes.

Context

[Figure: context diagram with legend]

Components of an SDLB Job

[Figure: components of an SDLB job, with legend]

Cross cutting concerns

Logging

By default, SDLB uses the logging libraries included in the corresponding Spark version: Log4j 1.2.x for Spark 2.4.x through Spark 3.2.x, and Log4j 2.x starting with Spark 3.3.x (see SPARK-6305).
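
The rule above can be written as a small function, which may help when deciding which Log4j configuration format a job needs. This is a minimal sketch: `Log4jLineForSpark` is a hypothetical name and not an SDLB or Spark API, and the function only restates the version boundaries given in the text.

```java
// Hypothetical helper, not part of SDLB or Spark: restates the Log4j rule above.
final class Log4jLineForSpark {

    // Returns "1.2.x" for Spark 2.4.x through 3.2.x and "2.x" for Spark 3.3.x
    // and later. Older Spark versions are not covered by the text and are rejected.
    static String log4jLine(String sparkVersion) {
        String[] parts = sparkVersion.trim().split("\\.");
        if (parts.length < 2) {
            throw new IllegalArgumentException("Unexpected version: " + sparkVersion);
        }
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);

        boolean atLeast33 = major > 3 || (major == 3 && minor >= 3);
        boolean atLeast24 = major > 2 || (major == 2 && minor >= 4);

        if (atLeast33) {
            return "2.x";
        }
        if (atLeast24) {
            return "1.2.x";
        }
        throw new IllegalArgumentException("Spark version not covered: " + sparkVersion);
    }

    public static void main(String[] args) {
        System.out.println("Spark 3.2.2 -> Log4j " + log4jLine("3.2.2"));
        System.out.println("Spark 3.5.3 -> Log4j " + log4jLine("3.5.3"));
    }
}
```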

You can customize logging dependencies manually by creating your own Maven project.