Technical Setup
Requirements
To run this tutorial you just need two things:
- Podman, a free Docker alternative, on Linux. On Windows you can use it through WSL2; see also Podman as an alternative to docker. A quick check that Podman is working is shown right after this list.
- The source code of the example.
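If you want to verify that Podman is installed and working before you continue, the following standard Podman commands (nothing specific to this tutorial) give a quick sanity check:

```bash
# Print the installed Podman version
podman --version

# Show runtime details; this fails if the Podman machine / WSL2 backend is not running
podman info
```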
Note for Windows users (this includes WSL!): Deactivate git's autocrlf feature before cloning, otherwise it will break some podman scripts. To do this, run
git config --global core.autocrlf false
in your terminal before cloning the repository.
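For example, a minimal sequence to apply and then verify this setting (plain git commands) looks like this:

```bash
# Disable automatic CRLF conversion so shell scripts keep their LF line endings
git config --global core.autocrlf false

# Verify the setting; this should print "false"
git config --global --get core.autocrlf
```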
Build Spark docker image
- Download the source code of the example, either via git or by downloading and extracting the zip.
- Open a terminal and change to the folder containing the source; you should see a file called Dockerfile.
- Then run (note: this might take some time, but it's only needed once):
./buildSpark.sh
This creates a docker image including Spark, Python and the SDLB libraries, according to the SDLB version configured in pom.xml as parent.version.
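To confirm that the build succeeded, you can list your local images. The exact image name and tag depend on the build script, so treat this only as a quick sanity check:

```bash
# List locally available container images; the freshly built Spark/SDLB image should appear near the top
podman images
```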
Compile Scala Classes
Using a Maven container, the getting-started project, together with the required SDLB Scala sources and all required libraries, is compiled and packaged with the following command:
./buildJob.sh
This might take some time, but it's only needed at the beginning or if Scala code has changed.
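As a recap, the one-time setup so far roughly looks as follows. The repository URL is only an assumption for illustration; use whatever location you obtained the source from:

```bash
# Get the example source (URL is an assumption; downloading and extracting the zip works as well)
git clone https://github.com/smart-data-lake/getting-started.git
cd getting-started

# Build the container image with Spark, Python and the SDLB libraries (one-time, may take a while)
./buildSpark.sh

# Compile and package the Scala classes using a Maven container
./buildJob.sh
```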
Run SDLB with Spark docker image
Now let's see Smart Data Lake in action! First, copy the part-1 solution configuration files into place:
pushd config && cp departures.conf.part-1-solution departures.conf && cp airports.conf.part-1-solution airports.conf && cp btl.conf.part-1-solution btl.conf && popd
Then start the job:
./startJob.sh --config /mnt/config,/mnt/envConfig/dev.conf --feed-sel download
This executes a simple data pipeline that downloads two files from two different websites into the data folder.
When the execution is complete, you should see two new directories in the data folder. Wondering what happened? You will create the data pipeline that does just this in the first steps of this guide.
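To take a first look at what was downloaded, simply list the data folder. The subdirectory names depend on the configuration, so the command below is just a generic way to inspect the result:

```bash
# Recursively list the data folder written by the download feed
ls -R data
```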
If you wish, you can start with part 1 right away. For part 2 and part 3, it is recommended to set up a Development Environment.
Development Environment
For some parts of this tutorial it is beneficial to have a working development environment ready. In the following, we explain how to configure a working environment for Windows or Linux. We will focus on the community version of IntelliJ. Please download the version that suits your operating system.
Hadoop Setup (Needed for Windows only)
Windows users need to follow the steps below to get a working Hadoop installation:
- First, download the Windows binaries for Hadoop here.
- Extract the desired version to a folder (e.g. <prefix>\hadoop-<version>\bin). For this tutorial we use version 3.2.2.
- Set the HADOOP_HOME environment variable to point to the folder <prefix>\hadoop-<version>\
- Add %HADOOP_HOME%\bin to the PATH environment variable.
Run SDLB in IntelliJ
This requires an IntelliJ and Java SDK installation. Please make sure you have:
- A Java 17 or Java 11 JDK
- Scala version 2.12. You need to install the Scala plugin and stay on this exact Scala version; DO NOT UPGRADE to Scala 3. For the complete list of versions at play in SDLB, you can consult the Reference.
- Load the project as a maven project: Right-click on pom.xml file -> add as Maven Project
- Ensure all correct dependencies are loaded: Right-click on pom.xml file, Maven -> Reload Project
- Configure and run the following run configuration in IntelliJ IDEA:
- Main class:
io.smartdatalake.app.LocalSmartDataLakeBuilder
- Program arguments:
--feed-sel <regex-feedname-selector> --config $ProjectFileDir$/config
- Working directory:
/path/to/sdl-examples/target
or just target
- VM Options:
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED
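If you prefer launching the same run configuration from a terminal instead of IntelliJ, it translates roughly to the command below. This is only a sketch: the jar name is an assumption (check the target folder for the actual artifact built by Maven), and for brevity only a few of the --add-opens options are repeated; in practice pass the full list from above:

```bash
# Rough terminal equivalent of the IntelliJ run configuration (jar name is an assumption)
cd target
java \
  --add-opens=java.base/sun.nio.ch=ALL-UNNAMED \
  --add-opens=java.base/java.lang=ALL-UNNAMED \
  -cp getting-started-1.0-jar-with-dependencies.jar \
  io.smartdatalake.app.LocalSmartDataLakeBuilder \
  --feed-sel download --config ../config
```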
Congratulations! You're now all set up! Head over to the next step to analyse these files...