Many analytics applications are being ported to the cloud, and Data Lakes and Lakehouses in the cloud are becoming more and more popular. The Databricks platform provides an easily accessible and easily configurable way to implement a modern analytics platform. Smart Data Lake Builder, on the other hand, provides an open-source, portable automation tool to load and transform the data.
This article describes the deployment of Smart Data Lake Builder (SDLB) on Databricks.
Before jumping in, it should be mentioned that there are many other methods to deploy SDLB in the cloud, e.g. using containers on Azure, Azure Kubernetes Service, Azure Synapse clusters, Google Dataproc, etc. The method presented here has the advantage that many aspects are taken care of by Databricks, such as cluster management, job scheduling and integrated data science notebooks. Further, the presented SDLB pipeline is just a simple example, focusing on the integration into the Databricks environment. SDLB provides a wide range of features, and its full power is not revealed here.
Let's get started:
- Databricks accounts can be created as a Free Trial or as a Community Account.
  - Account and Workspace creation are described in detail here; a few hints and modifications are presented below.
  - I selected the AWS backend, but conceptually there are no differences to the other providers. If you already have an Azure, AWS or Google Cloud account/subscription, it can be used; otherwise you can register a trial subscription there.
- The Workspace stack is created using the Quickstart as described in the documentation. When finished, launch the Workspace.
- Databricks CLI: for transferring configuration files, scripts and data, the Databricks CLI is installed locally. Configure the CLI using the Workspace URL and a personal access token; a new token can be created in the Workspace under "Settings" -> "User Settings" -> "Access tokens".
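For example, installing and configuring the CLI could look like the following sketch; it assumes a local Python/pip installation of the legacy databricks-cli package, and the configure command prompts for the Workspace URL and the token:

```
# install the (legacy) Databricks CLI -- assumes Python/pip is available locally
pip install databricks-cli

# interactive configuration: prompts for the Workspace URL (host) and the access token
databricks configure --token
```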
- Cluster creation: in the Workspace, open the cluster creation form.
  - Spark version: when selecting the Databricks runtime version, pay attention to the related Spark version. It needs to match the Spark version we build SDLB with later. Here, `10.4 LTS` is selected with `Spark 3.2.1` and `Scala 2.12`. Alternatively, SDLB can be built with a different Spark version; see also Architecture for supported versions.
  - typesafe library version correction script: the Workspace currently includes version 1.2.1 of the `com.typesafe:config` Java library. SDLB relies on functions of a newer version (>1.3.0) of this library. Thus, we provide a newer version of the `com.typesafe:config` library via an initialization script: under Advanced options -> Init Scripts, specify `dbfs:/databricks/scripts/config-install.sh`.
  - Further, the script needs to be created and uploaded. You can use the following commands in a local terminal:

```
cat << EOF >> ./config-install.sh
#!/bin/bash
wget -O /databricks/jars/-----config-1.4.1.jar https://repo1.maven.org/maven2/com/typesafe/config/1.4.1/config-1.4.1.jar
EOF
databricks fs mkdirs dbfs:/databricks/scripts
databricks fs cp ./config-install.sh dbfs:/databricks/scripts/
```

Alternatively, you can also use a Databricks notebook for the script upload by executing the following cell:

```
%sh
cat << EOF >> ./config-install.sh
#!/bin/bash
wget -O /databricks/jars/-----config-1.4.1.jar https://repo1.maven.org/maven2/com/typesafe/config/1.4.1/config-1.4.1.jar
EOF
mkdir /dbfs/databricks/scripts
cp ./config-install.sh /dbfs/databricks/scripts/
```

Note: to double-check the library version, I ran `grep typesafe pom.xml` in the Smart Data Lake source.
Note: the added `-----` prefix ensures that this `.jar` is preferred over the default Workspace Spark version (which starts with `----`). If you are curious, you can double-check, e.g. by running `ls /databricks/jars/*config*` in a Workspace shell notebook.
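To verify the upload from your local machine, the init script location can be listed with the Databricks CLI; the cluster start typically fails if the configured init script cannot be found:

```
# double-check that the init script is in place before starting the cluster
databricks fs ls dbfs:/databricks/scripts/
```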
- fat-jar: we need to provide the SDLB sources and all required libraries. Therefore, we compile and pack the Scala code into a JAR including its dependencies. We use the getting-started project as a dummy project, which itself pulls in the SDLB sources.
  - Download the getting-started source and build it with the `-P fat-jar` profile:

```
podman run -v ${PWD}:/mnt/project -v ${PWD}/.mvnrepo:/mnt/.mvnrepo maven:3.6.0-jdk-11-slim -- mvn -DskipTests -P fat-jar -f /mnt/project/pom.xml "-Dmaven.repo.local=/mnt/.mvnrepo" package
```

General build instructions can be found in the getting-started documentation. Therewith, the file `target/getting-started-1.0-jar-with-dependencies.jar` is created. The fat-jar profile includes all required dependencies; it is defined in the smart-data-lake pom.xml.
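If podman is not available, the same build can also be run with a local Maven installation; a minimal sketch, assuming Maven 3.6+ and JDK 11 are installed and the commands are executed in the getting-started project root:

```
# build the fat-jar with the same profile, using a local Maven/JDK 11 installation
mvn -DskipTests -P fat-jar package

# the packaged artifact should now be present under target/
ls -lh target/getting-started-*-jar-with-dependencies.jar
```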
- upload files
  - JAR: in the "Workspace" -> your user, create a directory `jars` and "import" the library using the link in "(To import a library, such as a jar or egg, click here)", then select the fat-jar created above to upload. As a result, the jar will be listed in the Workspace directory.
  - SDLB application: as an example, a dataset from Airbnb NYC will be downloaded from GitHub, first written into a CSV file and later partially ported into a table. Therefore, the pipeline is first defined locally in a new file `application.conf`:
```
dataObjects {
ext-ab-csv-web {
type = WebserviceFileDataObject
url = "https://raw.githubusercontent.com/adishourya/Airbnb/master/new-york-city-airbnb-open-data/AB_NYC_2019.csv"
followRedirects = true
readTimeoutMs=200000
}
stg-ab {
type = CsvFileDataObject
schema = """id integer, name string, host_id integer, host_name string, neighbourhood_group string, neighbourhood string, latitude double, longitude double, room_type string, price integer, minimum_nights integer, number_of_reviews integer, last_review timestamp, reviews_per_month double, calculated_host_listings_count integer, availability_365 integer"""
path = "file:///dbfs/data/~{id}"
}
int-ab {
type = DeltaLakeTableDataObject
path = "~{id}"
table {
db = "default"
name = "int_ab"
primaryKey = [id]
}
}
}
actions {
loadWeb2Csv {
type = FileTransferAction
inputId = ext-ab-csv-web
outputId = stg-ab
metadata {
feed = download
}
}
loadCsvLoc2Db {
type = CopyAction
inputId = stg-ab
outputId = int-ab
transformers = [{
type = SQLDfTransformer
code = "select id, name, host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude from stg_ab"
}]
metadata {
feed = copy
}
}
}
```

  - upload using the Databricks CLI:

```
databricks fs mkdirs dbfs:/conf/
databricks fs cp application.conf dbfs:/conf/application.conf
```
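To double-check that the configuration arrived on DBFS, a quick verification sketch with the CLI:

```
# list and print the uploaded SDLB configuration
databricks fs ls dbfs:/conf/
databricks fs cat dbfs:/conf/application.conf
```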
- Job creation: here the Databricks job gets defined, specifying the SDLB library, the entry point and the arguments. We specify only the download feed for now. Therefore, open Jobs -> Create Job in the sidebar and set:
  - Type: `JAR`
  - Main Class: `io.smartdatalake.app.LocalSmartDataLakeBuilder`
  - Dependent Libraries: add via "Workspace" -> select the previously uploaded "getting-started..." file in the `jars` directory
  - Cluster: select the cluster created above with the corrected typesafe library
  - Parameters: `["-c", "file:///dbfs/conf/", "--feed-sel", "download"]`, which specifies the location of the SDLB configuration and selects the feed "download"
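Alternatively, the job could be created with the Databricks CLI instead of the UI. This is only a rough sketch: it assumes the legacy CLI with the Jobs API 2.0 request format, a placeholder cluster id, and that the fat-jar was additionally copied to a hypothetical DBFS location such as dbfs:/FileStore/jars/ (instead of the Workspace import used above):

```
# hypothetical job definition (Jobs API 2.0 format); cluster id and jar path are placeholders
cat << 'EOF' > sdlb-download-job.json
{
  "name": "sdlb-download",
  "existing_cluster_id": "<your-cluster-id>",
  "libraries": [
    { "jar": "dbfs:/FileStore/jars/getting-started-1.0-jar-with-dependencies.jar" }
  ],
  "spark_jar_task": {
    "main_class_name": "io.smartdatalake.app.LocalSmartDataLakeBuilder",
    "parameters": ["-c", "file:///dbfs/conf/", "--feed-sel", "download"]
  }
}
EOF

# create the job from the JSON definition
databricks jobs create --json-file sdlb-download-job.json
```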
- Launch the job. When it has finished, we can verify the successful run status in the "Runs" section of that job.
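The job can also be triggered and monitored from the CLI; job id and run id below are placeholders returned by the previous steps:

```
# trigger the job (use the job id shown in the Jobs UI or returned by 'jobs create')
databricks jobs run-now --job-id <job-id>

# inspect the run state (use the run id returned by 'run-now')
databricks runs get --run-id <run-id>
```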
- Results: after running the SDLB pipeline, the data should be downloaded into the staging file `stg_ab/result.csv` and selected parts should be ported into the table `int_ab`.
  - csv file: in the first step we downloaded the CSV file. This can be verified, e.g. by inspecting the data directory via the Databricks CLI with `databricks fs ls dbfs:/data/stg-ab`, or by running `ls /dbfs/data/stg-ab` in a Workspace shell notebook.
  - database: in the second phase, specific columns are put into the database. This can be verified in the Workspace -> Data -> default -> int_ab.
    Info: note that our final table was defined as `DeltaLakeTableDataObject`. With that, Smart Data Lake Builder automatically generates a Delta Lake table in your Databricks workspace.
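For a quick look at the staged data itself, the CSV file can also be printed via the CLI; a small sketch, assuming the `result.csv` file name mentioned in the Results section:

```
# print the first lines of the staged CSV file (file name as mentioned above)
databricks fs cat dbfs:/data/stg-ab/result.csv | head -n 5
```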
Lessons Learned
A few steps are necessary, including building and uploading SDLB, and we need to be careful with the versions of the underlying libraries. With these few steps we can leverage the power of SDLB and Databricks, creating a portable and reproducible pipeline into a Databricks Lakehouse.