Deploy on YARN

warning

This page is under review and currently not visible in the menu.

Smart Data Lake can be run on a YARN cluster with spark-submit. The following steps show how to set everything up and start a first data load. See Running Spark on YARN for detailed Spark configuration options.

Make sure you have a Spark 2.x release built for Scala 2.11 installed (the spark-submit command is required).
Build the project with the fat-jar profile activated, if you haven't done so already:
```
mvn package -DskipTests -Pscala-2.11 -Pfat-jar
```
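
Before continuing, it can help to confirm that the build produced the fat jar and that spark-submit is on the PATH. The following is a minimal sketch, not part of Smart Data Lake: the helper names `find_fat_jar` and `check_spark_submit` are hypothetical, and the jar location (`target/`, suffix `-jar-with-dependencies.jar`) is assumed from the submit command used later on this page.

```shell
#!/bin/sh
# Hypothetical helper: print the first fat jar found in the given directory
# (default: target). Returns non-zero if no such jar exists.
find_fat_jar() {
  dir="${1:-target}"
  for jar in "$dir"/smartdatalake_*-jar-with-dependencies.jar; do
    # An unmatched glob is left as the literal pattern, so check for a real file.
    if [ -f "$jar" ]; then
      printf '%s\n' "$jar"
      return 0
    fi
  done
  return 1
}

# Hypothetical helper: report whether spark-submit is available on the PATH.
check_spark_submit() {
  if command -v spark-submit >/dev/null 2>&1; then
    echo "spark-submit found"
  else
    echo "spark-submit not found on PATH"
    return 1
  fi
}
```

After sourcing these functions, `find_fat_jar target && check_spark_submit` succeeds only if both prerequisites are in place.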

Copy the test data file to your HDFS home directory:

```
hdfs dfs -put src/test/resources/AB_NYC_2019.csv
```
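
Note that `hdfs dfs -put` fails if the target file already exists, so repeating this step produces an error. A minimal sketch of an upload that can be re-run safely is shown below. The function name `upload_if_missing` is hypothetical; it relies only on the standard `hdfs dfs -test -e` and `hdfs dfs -put` commands and assumes the file goes to the HDFS home directory, as above.

```shell
#!/bin/sh
# Hypothetical helper: upload a local file to the HDFS home directory
# unless a file with the same name is already there.
upload_if_missing() {
  src="$1"
  name=$(basename "$src")
  # 'hdfs dfs -test -e' exits with 0 if the path exists.
  if hdfs dfs -test -e "$name"; then
    echo "$name already exists in HDFS, skipping upload"
  else
    hdfs dfs -put "$src" && echo "uploaded $name"
  fi
}
```

It would be called as `upload_if_missing src/test/resources/AB_NYC_2019.csv`.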

Create an application.conf file in the current directory:

```
dataObjects {
  ab-csv1 {
    type = CsvFileDataObject
    path = "AB_NYC_2019.csv"
  }
  ab-csv2 {
    type = CsvFileDataObject
    path = "AB_NYC_copy.csv"
  }
}

actions {
  loadCsv2Csv {
    type = CopyAction
    inputId = ab-csv1
    outputId = ab-csv2
    metadata {
      feed = ab-csv
    }
  }
}
```

Submit the application to the YARN cluster with spark-submit. Don't forget to replace the SmartDataLake version, which appears twice in the command. On Windows, you also need to manually specify the local directory of the application.conf file in the following command, in place of the pwd substitution.

```
spark-submit --master yarn --deploy-mode client --jars target/smartdatalake_2.11-1.0.3-jar-with-dependencies.jar --class io.smartdatalake.app.DefaultSmartDataLakeBuilder target/smartdatalake_2.11-1.0.3-jar-with-dependencies.jar --feed-sel ab-csv -c file://`pwd`/application.conf
```
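
Because the version has to be replaced in two places and the configuration directory depends on the platform, it may be convenient to assemble the command from parameters. The sketch below only prints the command (a dry run) and does not submit anything. The function name `build_submit_cmd` and its two arguments are illustrative and not part of Smart Data Lake; the options are the same as in the command above.

```shell
#!/bin/sh
# Hypothetical helper: print the spark-submit command for a given
# SmartDataLake version and configuration directory.
# Arguments: $1 = version (e.g. 1.0.3), $2 = directory containing application.conf
build_submit_cmd() {
  version="$1"
  conf_dir="$2"
  jar="target/smartdatalake_2.11-${version}-jar-with-dependencies.jar"
  # echo joins its arguments with single spaces, yielding a one-line command.
  echo "spark-submit --master yarn --deploy-mode client" \
       "--jars $jar" \
       "--class io.smartdatalake.app.DefaultSmartDataLakeBuilder" \
       "$jar" \
       "--feed-sel ab-csv" \
       "-c file://${conf_dir}/application.conf"
}

# Dry run: print the command for version 1.0.3 and the current directory.
build_submit_cmd "1.0.3" "$(pwd)"
```

Once the printed command looks right, it can be copied and run as is. Keeping the version in a single argument avoids updating only one of the two jar paths.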