Deploy on YARN

Danger: This page is under review and currently not visible in the menu.

Smart Data Lake can be run on a YARN cluster using spark-submit. The following steps show how to set everything up and start a first data load. See Running Spark on YARN for detailed Spark configuration options.

  1. Make sure you have a Spark 2.x release built for Scala 2.11 installed (the spark-submit command is required)

  2. Build the project (with activated profile fat-jar) if you haven't done that already:

    mvn package -DskipTests -Pscala-2.11 -Pfat-jar
  3. Copy the test data file to your HDFS home directory:

    hdfs dfs -put src/test/resources/AB_NYC_2019.csv
  4. Create an application.conf file:

    dataObjects {
      ab-csv1 {
        type = CsvFileDataObject
        path = "AB_NYC_2019.csv"
      }
      ab-csv2 {
        type = CsvFileDataObject
        path = "AB_NYC_copy.csv"
      }
    }

    actions {
      loadCsv2Csv {
        type = CopyAction
        inputId = ab-csv1
        outputId = ab-csv2
        metadata {
          feed = ab-csv
        }
      }
    }
  5. Submit the application to the YARN cluster with spark-submit. Don't forget to replace the SmartDataLake version, which appears twice in the command. On Windows you also need to manually specify the local directory of the application.conf file in the following command (in place of `pwd`).

    spark-submit --master yarn --deploy-mode client --jars target/smartdatalake_2.11-1.0.3-jar-with-dependencies.jar --class io.smartdatalake.app.DefaultSmartDataLakeBuilder target/smartdatalake_2.11-1.0.3-jar-with-dependencies.jar --feed-sel ab-csv -c file://`pwd`/application.conf
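
Because the version appears twice in the command above, it can be convenient to assemble the command from a single variable. The following is a minimal sketch, not part of Smart Data Lake: the function name `submit_cmd` and the environment variables `SDL_VERSION` and `CONF_DIR` are hypothetical names chosen for this example, and the jar path assumes the Scala 2.11 fat-jar built in step 2. The script only prints the command so it can be reviewed before it is run.

```shell
#!/bin/sh
# Sketch: print the spark-submit command from step 5 with the version set in one place.
# submit_cmd, SDL_VERSION and CONF_DIR are hypothetical names used only in this example.
set -eu

# Print the spark-submit command line.
#   $1: SmartDataLake version used in the jar file name
#   $2: local directory that contains application.conf
submit_cmd() {
  version="$1"
  conf_dir="$2"
  # Assumed jar location: the fat-jar produced by the Maven build in step 2.
  jar="target/smartdatalake_2.11-${version}-jar-with-dependencies.jar"
  echo "spark-submit --master yarn --deploy-mode client" \
       "--jars ${jar}" \
       "--class io.smartdatalake.app.DefaultSmartDataLakeBuilder ${jar}" \
       "--feed-sel ab-csv -c file://${conf_dir}/application.conf"
}

# Defaults: version 1.0.3 and the current directory; override via the environment.
submit_cmd "${SDL_VERSION:-1.0.3}" "${CONF_DIR:-$(pwd)}"
```

Run the script from the project root, check the printed line, and paste it into your shell. Setting `CONF_DIR` explicitly covers the case where the directory of application.conf has to be given by hand.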