Launch Java application using Spark-Submit
SmartDataLakeBuilder is a Java application. To run it on a cluster with spark-submit, use the DefaultSmartDataLakeBuilder application. It can be started with the following command (for details, see YARN):
spark-submit --master yarn --deploy-mode client --class io.smartdatalake.app.DefaultSmartDataLakeBuilder target/smartdatalake_2.12-2.5.1-jar-with-dependencies.jar [arguments]
and takes the following arguments:
Usage: DefaultSmartDataLakeBuilder [options]
-f, --feed-sel <operation?><prefix:?><regex>[,<operation?><prefix:?><regex>...]
Select actions to execute by one or multiple expressions separated by comma (,). Results from multiple expressions are combined from left to right.
- pipe symbol (|): the two sets are combined by union operation (default)
- ampersand symbol (&): the two sets are combined by intersection operation
- minus symbol (-): the second set is subtracted from the first set
- 'feeds': select actions where metadata.feed is matched by regex pattern (default)
- 'names': select actions where metadata.name is matched by regex pattern
- 'ids': select actions where id is matched by regex pattern
- 'layers': select actions where metadata.layer of all output DataObjects is matched by regex pattern
- 'startFromActionIds': select actions whose id is matched by regex pattern and any dependent action (=successors)
- 'endWithActionIds': select actions whose id is matched by regex pattern and their predecessors
- 'startFromDataObjectIds': select actions which have an input DataObject whose id is matched by regex pattern and any dependent action (=successors)
- 'endWithDataObjectIds': select actions which have an output DataObject whose id is matched by regex pattern and their predecessors
All matching is case-insensitive.
Example: to filter action 'A' and its successors but only in layer L1 and L2, use the following pattern: "startFromActionIds:a,&layers:(l1|l2)"
-n, --name <value>
Optional name of the application. If not specified, feed-sel is used.
-c, --config <file1>[,<file2>...]
One or multiple configuration files or directories containing configuration files, separated by comma.
Entries must be valid Hadoop URIs or a special URI with scheme "cp" which is treated as classpath entry.
--partition-values <partitionColName>=<partitionValue>[,<partitionValue>,...]
Partition values to process for one single partition column.
--multi-partition-values <partitionColName1>=<partitionValue>,<partitionColName2>=<partitionValue>[;<partitionColName1>=<partitionValue>,<partitionColName2>=<partitionValue>;...]
Partition values to process for multiple partition columns.
-s, --streaming
Enable streaming mode for continuous processing.
--parallelism <int>
Parallelism for DAG run.
--state-path <path>
Path to save run state files. Must be set to enable recovery in case of failures.
--override-jars <jar1>[,<jar2>...]
Comma separated list of jar filenames for child-first class loader. The jars must be present in classpath.
--test <config|dry-run>
Run in test mode: config -> validate configuration, dry-run -> execute prepare- and init-phase only to check environment and spark lineage
--help
Display the help text.
--version
Display version information.
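Several option values contain characters that a POSIX shell interprets itself: '&', '|' and parentheses in --feed-sel, and ';' in --multi-partition-values. The following is a minimal sketch of how such values could be quoted, not an official recipe. The run helper only prints the command so the sketch works without a Spark installation. The fully qualified class name, the config URI and the partition columns dt and type are assumptions for illustration.

```shell
#!/bin/sh
# Dry-run helper: prints the command instead of executing it.
# Replace the body with "$@" to actually launch the job.
run() { echo "+ $*"; }

# Placeholders (assumptions): adjust jar path, class name and config location to your build.
APP_JAR='target/smartdatalake_2.12-2.5.1-jar-with-dependencies.jar'
APP_CLASS='io.smartdatalake.app.DefaultSmartDataLakeBuilder'
CONFIG='file:///path/to/config'

# Single quotes stop the shell from interpreting '&', '|', '(' and ')'.
FEED_SEL='startFromActionIds:a,&layers:(l1|l2)'

# Hypothetical partition columns "dt" and "type". The ';' separates partition
# value sets and would otherwise terminate the shell command.
MULTI_PARTITION_VALUES='dt=20240101,type=A;dt=20240102,type=B'

run spark-submit --master yarn --deploy-mode client --class "$APP_CLASS" "$APP_JAR" \
  --config "$CONFIG" --feed-sel "$FEED_SEL" \
  --multi-partition-values "$MULTI_PARTITION_VALUES"
```

Keeping each value in a single-quoted variable and expanding it inside double quotes passes it to the application as exactly one argument.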
The DefaultSmartDataLakeBuilder class should be fine in most situations. There are two additional, adapted application versions you can use:
- LocalSmartDataLakeBuilder: the default for Spark master is local[*] in this case, and it has additional properties to configure Kerberos authentication. You can use this application to run in a local environment (e.g. IntelliJ) without cluster deployment.
- DatabricksSmartDataLakeBuilder: see Microsoft Azure; special class for running on a Databricks cluster.
Launching SDL container
Depending on the container definition, especially the entrypoint, the arguments may vary. Furthermore, we distinguish between starting the container with docker and with podman.
In general, a container launch would look like:
- Docker
docker run [docker-args] sdl-spark --config [config-file] --feed-sel [feed] [further-SDL-args]
- Podman
podman run [podman-args] sdl-spark --config [config-file] --feed-sel [feed] [further-SDL-args]
The container arguments can also include mounted directories for configurations, additional Scala classes, data directories, etc. For example:
- Docker
docker run --rm -v ${PWD}/data:/mnt/data -v ${PWD}/target:/mnt/lib -v ${PWD}/config:/mnt/config sdl-spark:latest --config /mnt/config --feed-sel download
- Podman
podman run --rm -v ${PWD}/data:/mnt/data -v ${PWD}/target:/mnt/lib -v ${PWD}/config:/mnt/config sdl-spark:latest --config /mnt/config --feed-sel download
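The --test option from the usage text can be combined with the same mounts to check a setup before running the feed. The following sketch assumes that the image entrypoint passes additional arguments through to SDLB unchanged, as in the commands above. The run and sdl helpers are hypothetical names introduced here, and run only prints each command instead of executing it.

```shell
#!/bin/sh
# Dry-run helper: prints the command. Replace the body with "$@" to execute it.
run() { echo "+ $*"; }

RUNTIME=docker   # or: podman

# Wraps the container launch shown above; extra arguments are appended to the SDLB options.
sdl() {
  run "$RUNTIME" run --rm \
    -v "${PWD}/data:/mnt/data" -v "${PWD}/target:/mnt/lib" -v "${PWD}/config:/mnt/config" \
    sdl-spark:latest --config /mnt/config --feed-sel download "$@"
}

# 1. Validate the configuration only.
# 2. Execute the prepare- and init-phase only.
# 3. Run the feed. Each step starts only if the previous one succeeded.
sdl --test config && sdl --test dry-run && sdl
```

Quoting each -v argument keeps the mounts intact when the working directory contains spaces.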
Pods with Podman
When multiple containers interact, e.g. an SDL container and a metastore container, pods are used to manage the containers and especially the network. A set of containers is launched using
Assuming an existing pod mypod is running, another container can be started within this pod using the additional podman arguments --pod mypod --hostname myhost --add-host myhost:
The hostname specification fixes an issue with resolving the container's own localhost.