
Configure for different environments

Goal

Industrializing data pipelines also means running them in different environments, e.g. DEV (Development), INTE (Integration) and PROD (Production). Different environments usually need parameterized settings for connections and secrets. In this step we will introduce the SDLB approach to environment configuration, create a configuration for the DEV environment, and discuss how it can be adapted for PROD.

SDLB approach for environment configuration

As SDLB configuration files use the HOCON format, environment configuration can be implemented using HOCON substitution.

SDLB conventions suggest using a separate configuration file per environment, located in a special envConfig folder. Let's call these files "environment configuration files". The envConfig folder is separate from the config folder.
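
A project layout could then look like this (the exact configuration file names may differ in your project):

config/
  airports.conf
  departures.conf
  btl.conf
envConfig/
  dev.conf
  prd.conf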

When starting an SDLB job, one environment configuration file is selected together with all configuration files from the config folder. Through HOCON substitution, the definitions of the selected environment configuration file can be used in the normal configuration files.

Selecting additional configuration files on the SDLB command line is easy, as you can give a list of configuration file locations. To start our SDLB job for a specific environment, we can just add the corresponding environment configuration file as follows:

./startJob.sh -c /mnt/config,/mnt/envConfig/dev.conf --feed-sel compute

Creating DEV configuration file

The command above doesn't do anything new yet, as we first need to create the envConfig/dev.conf file.

As part of this tutorial, let's make the following configurations customizable per environment:

  • database: The database name to be used in DeltaLakeTableDataObjects
  • basePath: The root path where data files are stored

For this, create an environment file envConfig/dev.conf with the following content, if it doesn't exist yet:
env {
  database = default
  basePath = "./"
}

Then let's replace all table.db and path configuration entries with a HOCON substitution as follows:

  int-departures {
    type = DeltaLakeTableDataObject
    path = ${env.basePath}"~{id}"
    table = {
      db = ${env.database}
      name = int_departures
      primaryKey = [icao24, estdepartureairport, dt]
    }
    allowSchemaEvolution = true
  }
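
With the DEV settings above, the substitutions should resolve roughly as follows, assuming ~{id} is replaced with the DataObject id:

# illustrative resolved values for the DEV environment
path = "./int-departures"
table.db = default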

Note that the HOCON substitution syntax needs to be placed outside of string double quotes, as in the path attribute above. Now you can test the configuration without running any feed. This can be done with the command line parameter --test config:

./startJob.sh -c /mnt/config,/mnt/envConfig/dev.conf --feed-sel compute --test config

Automated Testing

Testing the configuration is a very good starting point for automated integration tests. It makes for the simplest possible CI pipeline and is recommended for every project, as shown below. See also Testing.
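
As a minimal sketch, a CI job could simply run the configuration test and fail on a non-zero exit code. The command is the same as above; how it is wired into your CI system is an assumption and depends on your setup:

#!/usr/bin/env bash
# Minimal CI check: parse and validate the SDLB configuration for the DEV environment.
# The job fails if the configuration cannot be parsed or resolved.
set -euo pipefail
./startJob.sh -c /mnt/config,/mnt/envConfig/dev.conf --feed-sel compute --test config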

Configuring other environments

To configure other environments like PROD (Production), an envConfig/prd.conf file is created and the relevant configurations are adapted. Then dev.conf in the startJob.sh command is replaced with prd.conf, as sketched below.
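
For example, envConfig/prd.conf could look as follows; the database name and base path are placeholders and need to be adapted to your production setup:

env {
  # placeholder values, adapt to your production environment
  database = prod_db
  basePath = "/data/prod/"
}

The SDLB job is then started with the PROD environment file instead of the DEV one:

./startJob.sh -c /mnt/config,/mnt/envConfig/prd.conf --feed-sel compute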

A special case is managing secrets for different environments, e.g. passwords. SDLB supports various Secret Providers, which can be configured differently per environment.

Summary

You have now seen different aspects of industrializing a data pipeline, such as robust data formats, taking care of historical data, and configuring different environments. Further, you have explored data interactively with the spark-shell.

The final solution for departures/airports/btl.conf should look like the files ending with part-2-solution in this directory.

In part 3 we will see how to incrementally load fresh flight data. See you!