Deploy on YARN
This page is under review and currently not visible in the menu.
Smart Data Lake can be run on a YARN cluster with spark-submit. The following steps show how to set everything up and start a first data load. See Running Spark on YARN for detailed Spark configuration options.
-
Make sure you have a Spark 2.x release built for Scala 2.11 installed (the spark-submit command is required).
-
Build the project (with activated profile fat-jar) if you haven't done that already:
mvn package -DskipTests -Pscala-2.11 -Pfat-jar
-
Copy the test data file to your HDFS home directory:
hdfs dfs -put src/test/resources/AB_NYC_2019.csv
-
Create an application.conf file:
dataObjects {
  ab-csv1 {
    type = CsvFileDataObject
    path = "AB_NYC_2019.csv"
  }
  ab-csv2 {
    type = CsvFileDataObject
    path = "AB_NYC_copy.csv"
  }
}
actions {
  loadCsv2Csv {
    type = CopyAction
    inputId = ab-csv1
    outputId = ab-csv2
    metadata {
      feed = ab-csv
    }
  }
}
-
Submit the application to the YARN cluster with spark-submit. Don't forget to replace the SmartDataLake version, which appears twice in the command. On Windows you also need to manually specify the local directory of the application.conf file in the following command (in place of the `pwd` expression).
spark-submit --master yarn --deploy-mode client --jars target/smartdatalake_2.11-1.0.3-jar-with-dependencies.jar --class io.smartdatalake.app.DefaultSmartDataLakeBuilder target/smartdatalake_2.11-1.0.3-jar-with-dependencies.jar --feed-sel ab-csv -c file://`pwd`/application.conf
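Because the version appears twice in the command, it can help to keep it in a single variable. The following is a minimal sketch of such a wrapper, not part of Smart Data Lake itself: the variable names `SDL_VERSION`, `CONF_DIR`, `SDL_JAR`, `SDL_CONF`, `DRY_RUN` and the helper `run` are hypothetical, and the jar name pattern is taken from the command above. By default it only prints the assembled command so it can be reviewed before anything is submitted.

```shell
#!/bin/sh
# Hypothetical wrapper: set the SmartDataLake version and config directory once.
SDL_VERSION="${SDL_VERSION:-1.0.3}"   # assumption: replace with the version you built
CONF_DIR="${CONF_DIR:-$(pwd)}"        # directory containing application.conf

# Jar name pattern copied from the spark-submit command in this guide.
SDL_JAR="target/smartdatalake_2.11-${SDL_VERSION}-jar-with-dependencies.jar"
SDL_CONF="file://${CONF_DIR}/application.conf"

# Print the command when DRY_RUN=1 (the default here); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

run spark-submit --master yarn --deploy-mode client \
  --jars "$SDL_JAR" \
  --class io.smartdatalake.app.DefaultSmartDataLakeBuilder "$SDL_JAR" \
  --feed-sel ab-csv -c "$SDL_CONF"
```

Setting `DRY_RUN=0` in the environment submits the job instead of printing it. After a successful run, `hdfs dfs -ls AB_NYC_copy.csv` should list the copied data, assuming the relative path in application.conf resolves to your HDFS home directory.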