
smart-data-lake

Framework to quickly build and maintain Smart Data Lakes

Run on YARN

Smart Data Lake can easily be run on a YARN cluster using spark-submit. The following steps show how to set everything up and start a first data load. See Running Spark on YARN for detailed Spark configuration options.

  1. Make sure you have a Spark 2.x release built for Scala 2.11 installed (the spark-submit command is needed)

  2. Build the project with the scala-2.11 and fat-jar profiles activated, if you haven't done so already:
    mvn package -DskipTests -Pscala-2.11 -Pfat-jar
    
  3. Copy the test data file to your HDFS home directory:
    hdfs dfs -put src/test/resources/AB_NYC_2019.csv
    
  4. Create an application.conf file (an optional sanity-check sketch for this configuration follows after step 5):
    dataObjects {
      ab-csv1 {
        type = CsvFileDataObject
        path = "AB_NYC_2019.csv"
      }
      ab-csv2 {
        type = CsvFileDataObject
        path = "AB_NYC_copy.csv"
      }
    }
        
    actions {
      loadCsv2Csv {
        type = CopyAction
        inputId = ab-csv1
        outputId = ab-csv2
        metadata {
          feed = ab-csv
        }
      }
    }
    
  5. Submit the application to the YARN cluster with spark-submit. Don't forget to replace the Smart Data Lake version (it appears twice in the command below). On Windows you also need to manually set the local directory of the application.conf file in the command. An optional verification sketch follows after this step.
    spark-submit --master yarn --deploy-mode client \
      --jars target/smartdatalake_2.11-1.0.3-jar-with-dependencies.jar \
      --class io.smartdatalake.app.DefaultSmartDataLakeBuilder \
      target/smartdatalake_2.11-1.0.3-jar-with-dependencies.jar \
      --feed-sel ab-csv -c file://`pwd`/application.conf
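
Optional: sanity-check the configuration from step 4 before submitting. The sketch below is not part of Smart Data Lake; it is a minimal, hypothetical helper that uses the Typesafe Config library (the HOCON parser) to verify that every action's inputId and outputId refers to a data object defined in application.conf. It assumes the config library is on your classpath and that application.conf lies in the current directory; the object name ConfigSanityCheck is made up for this example.

    import java.io.File
    import scala.collection.JavaConverters._
    import com.typesafe.config.ConfigFactory

    // Hypothetical helper, not part of Smart Data Lake: checks that action inputs/outputs
    // reference data objects that are actually defined in application.conf.
    object ConfigSanityCheck extends App {
      val config = ConfigFactory.parseFile(new File("application.conf")).resolve()

      // Ids of all defined data objects, e.g. Set("ab-csv1", "ab-csv2")
      val dataObjectIds = config.getConfig("dataObjects").root().keySet().asScala.toSet

      // For every action, report whether its inputId/outputId point to known data objects
      config.getConfig("actions").root().keySet().asScala.foreach { actionId =>
        val action = config.getConfig("actions").getConfig(actionId)
        Seq("inputId", "outputId").filter(k => action.hasPath(k)).foreach { key =>
          val id = action.getString(key)
          val status = if (dataObjectIds.contains(id)) "ok" else "MISSING"
          println(s"$actionId.$key = $id ($status)")
        }
      }
    }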
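
Optional: verify the copy once the spark-submit run from step 5 has finished. The following spark-shell sketch assumes the output was written to AB_NYC_copy.csv in your HDFS home directory, as configured above, and that the written CSV contains a header row (drop the header option otherwise). The spark session is the one provided by spark-shell.

    // Run inside `spark-shell --master yarn`; the `spark` session is provided by the shell.
    // Paths are relative to your HDFS home directory, matching the configuration above.
    // Assumption: the copied CSV has a header row; remove the option if it does not.
    val source = spark.read.option("header", "true").csv("AB_NYC_2019.csv")
    val copy   = spark.read.option("header", "true").csv("AB_NYC_copy.csv")

    println(s"source rows: ${source.count()}, copied rows: ${copy.count()}")
    copy.show(5)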