Smart Data Lake Builder can be executed in multiple ways on Microsoft Azure:
- on Databricks
- as containers orchestrated with Kubernetes
- as a virtual machine
SDLB on Databricks
Databricks has the advantage of preconfigured features such as ready-to-use Spark clusters, a metastore, notebook support and integrated SQL endpoints.
At the time of this writing, a few extra steps are needed to override specific libraries. A Databricks job runs with a set of predefined dependencies, and these cannot simply be replaced with your own as described in the Azure documentation. Because SDLB uses a newer version of typesafe config, we need to force the override of this dependency. To do so, we create a cluster init script that downloads the library and saves it on the cluster, then use Spark's ChildFirstURLClassLoader to explicitly load our library first. This can hopefully be simplified in the future.
In your Azure portal, create a Databricks Workspace and launch it.
Create a cluster that fits your needs. For a first test you can use the minimal configuration of 1 Worker and 1 Driver node. This example was tested on Databricks Runtime Version 6.2.
Open the Advanced Options, Init Scripts and configure the path:
On your local machine, create a simple script called config-install.sh with the following content:
wget -O /databricks/jars/-----config-1.3.4.jar https://repo1.maven.org/maven2/com/typesafe/config/1.3.4/config-1.3.4.jar
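One possible way to create this file from a terminal and catch syntax errors before uploading it is sketched below. The `#!/bin/bash` line inside the generated script and the `bash -n` check are additions for illustration and not part of the original instructions; the wget line is copied unchanged.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write the init script to the current directory.
# The quoted 'EOF' keeps the content literal (no variable expansion).
cat > config-install.sh <<'EOF'
#!/bin/bash
wget -O /databricks/jars/-----config-1.3.4.jar https://repo1.maven.org/maven2/com/typesafe/config/1.3.4/config-1.3.4.jar
EOF

# Parse the script without executing it, so nothing is downloaded locally.
bash -n config-install.sh
echo "config-install.sh written and syntax-checked"
```

The check only validates shell syntax. Whether the download succeeds is still only visible when the cluster starts.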
To copy this local file to your Databricks filesystem, use the Databricks CLI:
databricks fs mkdirs dbfs:/databricks/scripts
databricks fs cp <path-to/config-install.sh> dbfs:/databricks/scripts/
Now this script gets executed every time the cluster starts. It will download the config library and put it in a place where the classloader can find it.
Start your cluster and check the event log to see whether it is up. If something is wrong with the init script, the cluster will not start.
On your local machine, create the SDLB configuration file(s), e.g. application.conf. For more details on the configuration file(s), see the HOCON overview.
Upload the file(s) to a conf folder in DBFS:
databricks fs mkdirs dbfs:/conf
databricks fs cp path-to/application.conf dbfs:/conf/
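If you keep several configuration files, the two commands above can be wrapped in a small script. The following is only a sketch: the local folder `./conf`, the variables `DRY_RUN` and `CONF_DIR`, and the helper functions `run` and `upload_confs` are hypothetical names chosen for this example. It assumes the `--overwrite` option of `databricks fs cp` is available in your CLI version. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
set -euo pipefail

# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"
# Hypothetical local folder that holds the *.conf files.
CONF_DIR="${CONF_DIR:-./conf}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create the target folder and copy every *.conf file into it.
upload_confs() {
  run databricks fs mkdirs dbfs:/conf
  local f
  for f in "$CONF_DIR"/*.conf; do
    [ -e "$f" ] || continue   # skip when the glob matches nothing
    run databricks fs cp --overwrite "$f" dbfs:/conf/
  done
}

upload_confs
```

Running it once in dry-run mode lets you review the exact CLI calls before anything is written to DBFS.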
Now create a Job with the following details. If you don't have the JAR file yet, see build fat jar on how to build it (using the Maven profile fat-jar).
- Task: Upload JAR, then choose the smartdatalake-<version>-jar-with-dependencies.jar
- Main Class: io.smartdatalake.app.LocalSmartDataLakeBuilder
- Arguments: ["-c", "file:///dbfs/conf/", "--feed-sel", "download"]
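Note that the files were uploaded to dbfs:/conf, while the job argument points to file:///dbfs/conf/. This works under the assumption that DBFS is mounted at /dbfs on the cluster nodes, which is the usual Databricks setup. The hypothetical helper `dbfs_to_file_uri` below is not part of SDLB or the Databricks CLI; it only illustrates the mapping between the two notations.

```shell
#!/usr/bin/env bash

# Hypothetical helper: translate a dbfs:/ URI into the local file URI
# seen on the cluster nodes, assuming DBFS is mounted at /dbfs.
dbfs_to_file_uri() {
  local uri="$1"
  case "$uri" in
    dbfs:/*)
      # Strip the "dbfs:/" prefix and prepend the local mount point.
      printf 'file:///dbfs/%s\n' "${uri#dbfs:/}"
      ;;
    *)
      echo "not a dbfs:/ URI: $uri" >&2
      return 1
      ;;
  esac
}

dbfs_to_file_uri "dbfs:/conf/"   # prints file:///dbfs/conf/
```

If you upload the configuration to a different DBFS folder, adjust the value passed to -c accordingly.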
The option --override-jars is set automatically to the correct value for DatabricksConfigurableApp. If you want to override any additional libraries, you can provide a list with this option.
Finally, start the job and check the result.
For a detailed example, see the Deployment on Databricks blog post.