Deploy on YARN
Smart Data Lake can be easily executed on a YARN cluster with spark-submit. The following steps show you how to set everything up and run a first data load. See Running Spark on YARN for detailed Spark configuration options.
- Make sure you have a Spark 2.x Scala 2.11 release installed (the spark-submit command is needed).
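To quickly verify the installation, you can print the Spark and Scala versions of your spark-submit command:
spark-submit --version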
- Build the project (with the fat-jar profile activated) if you haven't done so already:
mvn package -DskipTests -Pscala-2.11 -Pfat-jar
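The build should produce a fat jar under target/. You can verify it is there with a quick listing (the exact file name depends on the SmartDataLake version you built):
ls -lh target/smartdatalake_2.11-*-jar-with-dependencies.jar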
- Copy the test data file to your HDFS home directory:
hdfs dfs -put src/test/resources/AB_NYC_2019.csv
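To confirm the upload, list the file in your HDFS home directory:
hdfs dfs -ls AB_NYC_2019.csv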
- Create an application.conf file:
dataObjects {
  ab-csv1 {
    type = CsvFileDataObject
    path = "AB_NYC_2019.csv"
  }
  ab-csv2 {
    type = CsvFileDataObject
    path = "AB_NYC_copy.csv"
  }
}
actions {
  loadCsv2Csv {
    type = CopyAction
    inputId = ab-csv1
    outputId = ab-csv2
    metadata {
      feed = ab-csv
    }
  }
}
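Optionally, you can verify the configuration with a local run before going to YARN. This is a sketch that assumes the fat jar from the build step and the application.conf in the current directory; with HADOOP_CONF_DIR set, the relative paths in the config still resolve against your HDFS home directory:
spark-submit --master "local[*]" --class io.smartdatalake.app.DefaultSmartDataLakeBuilder target/smartdatalake_2.11-1.0.3-jar-with-dependencies.jar --feed-sel ab-csv -c file://`pwd`/application.conf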
- Submit the application to the YARN cluster with spark-submit. Don't forget to replace the SmartDataLake version (it appears twice in the command). On Windows you also need to manually replace `pwd` with the local directory of the application.conf file in the following command.
spark-submit --master yarn --deploy-mode client --jars target/smartdatalake_2.11-1.0.3-jar-with-dependencies.jar --class io.smartdatalake.app.DefaultSmartDataLakeBuilder target/smartdatalake_2.11-1.0.3-jar-with-dependencies.jar --feed-sel ab-csv -c file://`pwd`/application.conf
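If you want to use --deploy-mode cluster instead, the driver runs on the cluster and cannot read the local configuration file. The following sketch ships the file with the job via --files and references it by name; it assumes SmartDataLake resolves the relative path against the container's working directory:
spark-submit --master yarn --deploy-mode cluster --files application.conf --class io.smartdatalake.app.DefaultSmartDataLakeBuilder target/smartdatalake_2.11-1.0.3-jar-with-dependencies.jar --feed-sel ab-csv -c application.conf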