Package

io.smartdatalake.workflow

dataobject

Permalink

package dataobject

Visibility
  1. Public
  2. All

Type Members

  1. case class AccessTableDataObject(id: DataObjectId, path: String, preSql: Option[String] = None, postSql: Option[String] = None, schemaMin: Option[StructType] = None, table: Table, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends TableDataObject with Product with Serializable

    Permalink

    DataObject of type JDBC / Access.

    DataObject of type JDBC / Access. Provides access to a Access DB to an Action. The functionality is handled seperately from JdbcTableDataObject to avoid problems with net.ucanaccess.jdbc.UcanaccessDriver

  2. case class ActionsExporterDataObject(id: DataObjectId, config: Option[String] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends DataObject with CanCreateDataFrame with ParsableFromConfig[ActionsExporterDataObject] with Product with Serializable

    Permalink

    Exports a util DataFrame that contains properties and metadata extracted from all io.smartdatalake.workflow.action.Actions that are registered in the current InstanceRegistry.

    Exports a util DataFrame that contains properties and metadata extracted from all io.smartdatalake.workflow.action.Actions that are registered in the current InstanceRegistry.

    Alternatively, it can export the properties and metadata of all io.smartdatalake.workflow.action.Actions defined in config files. For this, the configuration "config" has to be set to the location of the config.

    Example:

    dataObjects = {
     ...
     actions-exporter {
       type = ActionsExporterDataObject
       config = path/to/myconfiguration.conf
     }
     ...
    }

    The config value can point to a configuration file or a directory containing configuration files.

    See also

    Refer to ConfigLoader.loadConfigFromFilesystem() for details about the configuration loading.

  3. case class AvroFileDataObject(id: DataObjectId, path: String, partitions: Seq[String] = Seq(), schema: Option[StructType] = None, schemaMin: Option[StructType] = None, saveMode: SaveMode = SaveMode.Overwrite, sparkRepartition: Option[SparkRepartitionDef] = None, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, filenameColumn: Option[String] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends SparkFileDataObjectWithEmbeddedSchema with CanCreateDataFrame with CanWriteDataFrame with Product with Serializable

    Permalink

    A io.smartdatalake.workflow.dataobject.DataObject backed by an Avro data source.

    A io.smartdatalake.workflow.dataobject.DataObject backed by an Avro data source.

    It manages read and write access and configurations required for io.smartdatalake.workflow.action.Actions to work on Avro formatted files.

    Reading and writing details are delegated to Apache Spark org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter respectively. The reader and writer implementations are provided by the databricks spark-avro project.

    schema

    An optional schema for the spark data frame used when writing new Avro files. Note: Existing Avro files contain a source schema. Therefore, this schema is ignored when reading from existing Avro files.

    sparkRepartition

    Optional definition of repartition operation before writing DataFrame with Spark to Hadoop.

    See also

    org.apache.spark.sql.DataFrameWriter

    org.apache.spark.sql.DataFrameReader

  4. case class ConnectionTestException(msg: String, ex: Throwable) extends RuntimeException with Product with Serializable

    Permalink
  5. case class CsvFileDataObject(id: DataObjectId, path: String, csvOptions: Map[String, String] = Map(), partitions: Seq[String] = Seq(), schema: Option[StructType] = None, schemaMin: Option[StructType] = None, dateColumnType: DateColumnType = DateColumnType.Date, saveMode: SaveMode = SaveMode.Overwrite, sparkRepartition: Option[SparkRepartitionDef] = None, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, filenameColumn: Option[String] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends SparkFileDataObject with CanCreateDataFrame with CanWriteDataFrame with Product with Serializable

    Permalink

    A DataObject backed by a comma-separated value (CSV) data source.

    A DataObject backed by a comma-separated value (CSV) data source.

    It manages read and write access and configurations required for io.smartdatalake.workflow.action.Actions to work on CSV formatted files.

    CSV reading and writing details are delegated to Apache Spark org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter respectively.

    Read Schema specifications:

    If a data object schema is not defined via the schema attribute (default) and inferSchema option is disabled (default) in csvOptions, then all column types are set to String and the first row of the CSV file is read to determine the column names and the number of fields.

    If the header option is disabled (default) in csvOptions, then the header is defined as "_c#" for each column where "#" is the column index. Otherwise the first row of the CSV file is not included in the DataFrame content and its entries are used as the column names for the schema.

    If a data object schema is not defined via the schema attribute and inferSchema is enabled in csvOptions, then the samplingRatio (default: 1.0) option in csvOptions is used to extract a sample from the CSV file in order to determine the input schema automatically.

    csvOptions

    Settings for the underlying org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter.

    schema

    An optional data object schema. If defined, any automatic schema inference is avoided.

    dateColumnType

    Specifies the string format used for writing date typed data.

    sparkRepartition

    Optional definition of repartition operation before writing DataFrame with Spark to Hadoop.

    Note

    This data object sets the following default values for csvOptions: delimiter = "|", quote = null, header = false, and inferSchema = false. All other csvOption default to the values defined by Apache Spark.

    See also

    org.apache.spark.sql.DataFrameWriter

    org.apache.spark.sql.DataFrameReader

  6. case class CustomDfDataObject(id: DataObjectId, creator: CustomDfCreatorConfig, schemaMin: Option[StructType] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends DataObject with CanCreateDataFrame with SchemaValidation with Product with Serializable

    Permalink

    Generic DataObject containing a config object.

    Generic DataObject containing a config object. E.g. used to implement a CustomAction that reads a Webservice.

  7. case class DataObjectMetadata(name: Option[String] = None, description: Option[String] = None, layer: Option[String] = None, subjectArea: Option[String] = None, tags: Seq[String] = Seq()) extends Product with Serializable

    Permalink

    Additional metadata for a DataObject

    Additional metadata for a DataObject

    name

    Readable name of the DataObject

    description

    Description of the content of the DataObject

    layer

    Name of the layer this DataObject belongs to

    subjectArea

    Name of the subject area this DataObject belongs to

    tags

    Optional custom tags for this object

  8. case class DataObjectsExporterDataObject(id: DataObjectId, config: Option[String] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends DataObject with CanCreateDataFrame with ParsableFromConfig[DataObjectsExporterDataObject] with Product with Serializable

    Permalink

    Exports a util DataFrame that contains properties and metadata extracted from all DataObjects that are registered in the current InstanceRegistry.

    Exports a util DataFrame that contains properties and metadata extracted from all DataObjects that are registered in the current InstanceRegistry.

    Alternatively, it can export the properties and metadata of all DataObjects defined in config files. For this, the configuration "config" has to be set to the location of the config.

    Example:

    ```dataObjects = {
     ...
     dataobject-exporter {
       type = DataObjectsExporterDataObject
       config = path/to/myconfiguration.conf
     }
     ...
    }

    The config value can point to a configuration file or a directory containing configuration files.

    See also

    Refer to ConfigLoader.loadConfigFromFilesystem() for details about the configuration loading.

  9. case class DatePartitionColumnDef(colName: String, timeFormat: String = "yyyyMMdd", timeUnit: String = "days", timeZone: Option[String] = None) extends Product with Serializable

    Permalink

    Definition of date partition column to extract formatted timestamp into column.

    Definition of date partition column to extract formatted timestamp into column.

    colName

    date partition column name to extract timestamp into column on batch read

    timeFormat

    time format for timestamp in date partition column, definition according to java DateTimeFormatter. Default is "yyyyMMdd".

    timeUnit

    time unit for timestamp in date partition column, definition according to java ChronoUnit. Default is "days".

    timeZone

    time zone used for date logic. If not specified, java system default is used.

  10. case class DeltaLakeTableDataObject(id: DataObjectId, path: String, partitions: Seq[String] = Seq(), dateColumnType: DateColumnType = DateColumnType.Date, schemaMin: Option[StructType] = None, table: Table, numInitialHdfsPartitions: Int = 16, saveMode: SaveMode = SaveMode.Overwrite, retentionPeriod: Int = 7*24, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends TransactionalSparkTableDataObject with CanHandlePartitions with Product with Serializable

    Permalink

    DataObject of type DeltaLakeTableDataObject.

    DataObject of type DeltaLakeTableDataObject. Provides details to access Hive tables to an Action

    id

    unique name of this data object

    path

    hadoop directory for this table. If it doesn't contain scheme and authority, the connections pathPrefix is applied. If pathPrefix is not defined or doesn't define scheme and authority, default schema and authority is applied.

    partitions

    partition columns for this data object

    dateColumnType

    type of date column

    schemaMin

    An optional, minimal schema that this DataObject must have to pass schema validation on reading and writing.

    table

    DeltaLake table to be written by this output

    numInitialHdfsPartitions

    number of files created when writing into an empty table (otherwise the number will be derived from the existing data)

    saveMode

    spark SaveMode to use when writing files, default is "overwrite"

    retentionPeriod

    DeltaLake table retention period of old transactions for time travel feature in hours

    acl

    override connections permissions for files created tables hadoop directory with this connection

    connectionId

    optional id of io.smartdatalake.workflow.connection.HiveTableConnection

    metadata

    meta data

  11. case class ExcelFileDataObject(id: DataObjectId, path: String, excelOptions: ExcelOptions, partitions: Seq[String] = Seq(), schema: Option[StructType] = None, schemaMin: Option[StructType] = None, saveMode: SaveMode = SaveMode.Overwrite, sparkRepartition: Option[SparkRepartitionDef] = ..., acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, filenameColumn: Option[String] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends SparkFileDataObject with CanCreateDataFrame with CanWriteDataFrame with Product with Serializable

    Permalink

    A DataObject backed by an Microsoft Excel data source.

    A DataObject backed by an Microsoft Excel data source.

    It manages read and write access and configurations required for io.smartdatalake.workflow.action.Actions to work on Microsoft Excel (.xslx) formatted files.

    Reading and writing details are delegated to Apache Spark org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter respectively. The reader and writer implementation is provided by the Crealytics spark-excel project.

    Read Schema:

    When useHeader is set to true (default), the reader will use the first row of the Excel sheet as column names for the schema and not include the first row as data values. Otherwise the column names are taken from the schema. If the schema is not provided or inferred, then each column name is defined as "_c#" where "#" is the column index.

    When a data object schema is provided, it is used as the schema for the DataFrame. Otherwise if inferSchema is enabled (default), then the data types of the columns are inferred based on the first excerptSize rows (excluding the first). When no schema is provided and inferSchema is disabled, all columns are assumed to be of string type.

    excelOptions

    Settings for the underlying org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter.

    schema

    An optional data object schema. If defined, any automatic schema inference is avoided.

    sparkRepartition

    Optional definition of repartition operation before writing DataFrame with Spark to Hadoop. Default is numberOfTasksPerPartition = 1.

  12. case class ExcelOptions(sheetName: String, numLinesToSkip: Option[Int] = None, startColumn: Option[Int] = None, endColumn: Option[Int] = None, rowLimit: Option[Int] = None, useHeader: Boolean = true, treatEmptyValuesAsNulls: Option[Boolean] = Some(true), inferSchema: Option[Boolean] = Some(true), timestampFormat: Option[String] = Some("dd-MM-yyyy HH:mm:ss"), dateFormat: Option[String] = None, maxRowsInMemory: Option[Int] = None, excerptSize: Option[Int] = None) extends Product with Serializable

    Permalink

    Options passed to org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter for reading and writing Microsoft Excel files.

    Options passed to org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter for reading and writing Microsoft Excel files. Excel support is provided by the spark-excel project (see link below).

    sheetName

    the name of the Excel Sheet to read from/write to. This option is required.

    numLinesToSkip

    the number of rows in the excel spreadsheet to skip before any data is read. This option must not be set for writing.

    startColumn

    the first column in the specified Excel Sheet to read from (1-based indexing). This option must not be set for writing.

    endColumn

    TODO: this is not used anymore as far as I can tell --> crealytics now uses dataAddress.

    rowLimit

    Limit the number of rows being returned on read to the first rowLimit rows. This is applied after numLinesToSkip.

    useHeader

    If true, the first row of the excel sheet specifies the column names. This option is required (default: true).

    treatEmptyValuesAsNulls

    Empty cells are parsed as null values (default: true).

    inferSchema

    Infer the schema of the excel sheet automatically (default: true).

    timestampFormat

    A format string specifying the format to use when writing timestamps (default: dd-MM-yyyy HH:mm:ss).

    dateFormat

    A format string specifying the format to use when writing dates.

    maxRowsInMemory

    The number of rows that are stored in memory. If set, a streaming reader is used which can help with big files.

    excerptSize

    Sample size for schema inference.

    See also

    https://github.com/crealytics/spark-excel

  13. case class ForeignKey(db: Option[String], table: String, columns: Map[String, String], name: Option[String]) extends Product with Serializable

    Permalink

    Foreign key definition

    Foreign key definition

    db

    target database, if not defined it is assumed to be the same as the table owning the foreign key

    table

    referenced target table name

    columns

    mapping of source column(s) to referenced target table column(s)

    name

    optional name for foreign key, e.g to depict it's role

  14. case class HiveTableDataObject(id: DataObjectId, path: Option[String] = None, partitions: Seq[String] = Seq(), analyzeTableAfterWrite: Boolean = false, dateColumnType: DateColumnType = DateColumnType.Date, schemaMin: Option[StructType] = None, table: Table, numInitialHdfsPartitions: Int = 16, saveMode: SaveMode = SaveMode.Overwrite, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends TableDataObject with CanWriteDataFrame with CanHandlePartitions with SmartDataLakeLogger with Product with Serializable

    Permalink

    DataObject of type Hive.

    DataObject of type Hive. Provides details to access Hive tables to an Action

    id

    unique name of this data object

    path

    hadoop directory for this table. If it doesn't contain scheme and authority, the connections pathPrefix is applied. If pathPrefix is not defined or doesn't define scheme and authority, default schema and authority is applied. If DataObject is only used for reading or if the HiveTable already exist, the path can be omitted. If the HiveTable already exists but with a different path, a warning is issued

    partitions

    partition columns for this data object

    analyzeTableAfterWrite

    enable compute statistics after writing data (default=false)

    dateColumnType

    type of date column

    schemaMin

    An optional, minimal schema that this DataObject must have to pass schema validation on reading and writing.

    table

    hive table to be written by this output

    numInitialHdfsPartitions

    number of files created when writing into an empty table (otherwise the number will be derived from the existing data)

    saveMode

    spark SaveMode to use when writing files, default is "overwrite"

    acl

    override connections permissions for files created tables hadoop directory with this connection

    connectionId

    optional id of io.smartdatalake.workflow.connection.HiveTableConnection

    metadata

    meta data

  15. case class JdbcTableDataObject(id: DataObjectId, createSql: Option[String] = None, preSql: Option[String] = None, postSql: Option[String] = None, schemaMin: Option[StructType] = None, table: Table, jdbcFetchSize: Int = 1000, connectionId: ConnectionId, jdbcOptions: Map[String, String] = Map(), metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends TransactionalSparkTableDataObject with Product with Serializable

    Permalink

    DataObject of type JDBC.

    DataObject of type JDBC. Provides details for an action to access tables in a database through JDBC.

    id

    unique name of this data object

    createSql

    DDL-statement to be executed in prepare phase

    preSql

    SQL-statement to be executed before writing to table

    postSql

    SQL-statement to be executed after writing to table

    schemaMin

    An optional, minimal schema that this DataObject must have to pass schema validation on reading and writing.

    table

    The jdbc table to be read

    jdbcFetchSize

    Number of rows to be fetched together by the Jdbc driver

    connectionId

    Id of JdbcConnection configuration

    jdbcOptions

    Any jdbc options according to https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html. Note that some options above set and override some of this options explicitly.

  16. case class JmsDataObject(id: DataObjectId, jndiContextFactory: String, jndiProviderUrl: String, schemaMin: Option[StructType], authMode: AuthMode, batchSize: Int, maxWaitSec: Int, maxBatchAgeSec: Int, txBatchSize: Int, connectionFactory: String, queue: String, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends DataObject with CanCreateDataFrame with SchemaValidation with Product with Serializable

    Permalink

    DataObject of type JMS queue.

    DataObject of type JMS queue. Provides details to an Action to access JMS queues.

    jndiContextFactory

    JNDI Context Factory

    jndiProviderUrl

    JNDI Provider URL

    authMode

    authentication information: for now BasicAuthMode is supported.

    batchSize

    JMS batch size

    connectionFactory

    JMS Connection Factory

    queue

    Name of MQ Queue

  17. case class JsonFileDataObject(id: DataObjectId, path: String, jsonOptions: Option[Map[String, String]] = None, partitions: Seq[String] = Seq(), schema: Option[StructType] = None, schemaMin: Option[StructType] = None, saveMode: SaveMode = SaveMode.Overwrite, sparkRepartition: Option[SparkRepartitionDef] = None, stringify: Boolean = false, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, filenameColumn: Option[String] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends SparkFileDataObject with CanCreateDataFrame with CanWriteDataFrame with Product with Serializable

    Permalink

    A io.smartdatalake.workflow.dataobject.DataObject backed by a JSON data source.

    A io.smartdatalake.workflow.dataobject.DataObject backed by a JSON data source.

    It manages read and write access and configurations required for io.smartdatalake.workflow.action.Actions to work on JSON formatted files.

    Reading and writing details are delegated to Apache Spark org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter respectively.

    jsonOptions

    Settings for the underlying org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter.

    sparkRepartition

    Optional definition of repartition operation before writing DataFrame with Spark to Hadoop.

    stringify

    Set the data type for all values to string.

    Note

    By default, the JSON option multiline is enabled.

    See also

    org.apache.spark.sql.DataFrameWriter

    org.apache.spark.sql.DataFrameReader

  18. case class KafkaTopicDataObject(id: DataObjectId, topicName: String, connectionId: ConnectionId, keyType: KafkaColumnType = KafkaColumnType.String, valueType: KafkaColumnType = KafkaColumnType.String, schemaMin: Option[StructType] = None, selectCols: Seq[String] = Seq("key", "value"), datePartitionCol: Option[DatePartitionColumnDef] = None, batchReadConsecutivePartitionsAsRanges: Boolean = false, batchReadMaxOffsetsPerTask: Option[Int] = None, dataSourceOptions: Map[String, String] = Map(), metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends DataObject with CanCreateDataFrame with CanCreateStreamingDataFrame with CanWriteDataFrame with CanHandlePartitions with SchemaValidation with Product with Serializable

    Permalink

    DataObject of type KafkaTopic.

    DataObject of type KafkaTopic. Provides details to an action to read from Kafka Topics using either org.apache.spark.sql.DataFrameReader or org.apache.spark.sql.streaming.DataStreamReader

    topicName

    The name of the topic to read

    keyType

    Optional type the key column should be converted to. If none is given it will remain a bytearray / binary.

    valueType

    Optional type the value column should be converted to. If none is given it will remain a bytearray / binary.

    schemaMin

    An optional, minimal schema that this DataObject must have to pass schema validation on reading and writing.

    selectCols

    Columns to be selected when reading the DataFrame. Available columns are key, value, topic, partition, offset, timestamp, timestampType. If key/valueType is AvroSchemaRegistry the key/value column are convert to a complex type according to the avro schema. To expand it select "value.*". Default is to select key and value.

    datePartitionCol

    definition of date partition column to extract formatted timestamp into column. This is used to list existing partition and is added as additional column on batch read.

    batchReadConsecutivePartitionsAsRanges

    Set to true if consecutive partitions should be combined as one range of offsets when batch reading from topic. This results in less tasks but can be a performance problem when reading many partitions. (default=false)

    batchReadMaxOffsetsPerTask

    Set number of offsets per Spark task when batch reading from topic.

    dataSourceOptions

    Options for the Kafka stream reader (see https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html). These options override connection.kafkaOptions.

  19. case class PKViolatorsDataObject(id: DataObjectId, config: Option[String] = None, flattenOutput: Boolean = false, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends DataObject with CanCreateDataFrame with ParsableFromConfig[PKViolatorsDataObject] with Product with Serializable

    Permalink

    Checks for Primary Key violations for all DataObjects with Primary Keys defined that are registered in the current InstanceRegistry.

    Checks for Primary Key violations for all DataObjects with Primary Keys defined that are registered in the current InstanceRegistry. Returns the list of Primary Key violations as a DataFrame.

    Alternatively, it can check for Primary Key violations of all DataObjects defined in config files. For this, the configuration "config" has to be set to the location of the config.

    Example:

    ```dataObjects = {
     ...
     primarykey-violations {
       type = PKViolatorsDataObject
       config = path/to/myconfiguration.conf
     }
     ...
    }
    See also

    Refer to ConfigLoader.loadConfigFromFilesystem() for details about the configuration loading.

  20. case class ParquetFileDataObject(id: DataObjectId, path: String, partitions: Seq[String] = Seq(), schema: Option[StructType] = None, schemaMin: Option[StructType] = None, saveMode: SaveMode = SaveMode.Overwrite, sparkRepartition: Option[SparkRepartitionDef] = None, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, filenameColumn: Option[String] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends SparkFileDataObjectWithEmbeddedSchema with CanCreateDataFrame with CanWriteDataFrame with Product with Serializable

    Permalink

    A io.smartdatalake.workflow.dataobject.DataObject backed by an Apache Hive data source.

    A io.smartdatalake.workflow.dataobject.DataObject backed by an Apache Hive data source.

    It manages read and write access and configurations required for io.smartdatalake.workflow.action.Actions to work on Parquet formatted files.

    Reading and writing details are delegated to Apache Spark org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter respectively.

    id

    unique name of this data object

    path

    Hadoop directory where this data object reads/writes it's files. If it doesn't contain scheme and authority, the connections pathPrefix is applied. If pathPrefix is not defined or doesn't define scheme and authority, default schema and authority is applied. Optionally defined partitions are appended with hadoop standard partition layout to this path. Only files ending with *.parquet* are considered as data for this DataObject.

    partitions

    partition columns for this data object

    saveMode

    spark SaveMode to use when writing files, default is "overwrite"

    sparkRepartition

    Optional definition of repartition operation before writing DataFrame with Spark to Hadoop.

    acl

    override connections permissions for files created with this connection

    connectionId

    optional id of io.smartdatalake.workflow.connection.HadoopFileConnection

    metadata

    Metadata describing this data object.

    See also

    org.apache.spark.sql.DataFrameWriter

    org.apache.spark.sql.DataFrameReader

  21. case class QueryTimeInterval(from: LocalDateTime, to: LocalDateTime) extends Product with Serializable

    Permalink
  22. case class RawFileDataObject(id: DataObjectId, path: String, partitions: Seq[String] = Seq(), saveMode: SaveMode = SaveMode.Overwrite, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends HadoopFileDataObject with Product with Serializable

    Permalink

    DataObject of type raw for files with unknown content.

    DataObject of type raw for files with unknown content. Provides details to an Action to access raw files.

    saveMode

    Overwrite or Append new data.

  23. case class SFtpFileRefDataObject(id: DataObjectId, path: String, connectionId: ConnectionId, partitions: Seq[String] = Seq(), partitionLayout: Option[String] = None, saveMode: SaveMode = SaveMode.Overwrite, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends FileRefDataObject with CanCreateInputStream with CanCreateOutputStream with SmartDataLakeLogger with Product with Serializable

    Permalink

    Connects to SFtp files Needs java library "com.hieronymus % sshj % 0.21.1" The following authentication mechanisms are supported -> public/private-key: private key must be saved in ~/.ssh, public key must be registered on server.

    Connects to SFtp files Needs java library "com.hieronymus % sshj % 0.21.1" The following authentication mechanisms are supported -> public/private-key: private key must be saved in ~/.ssh, public key must be registered on server. -> user/pwd authentication: user and password is taken from two variables set as parameters. These variables could come from clear text (CLEAR), a file (FILE) or an environment variable (ENV)

    partitionLayout

    partition layout defines how partition values can be extracted from the path. Use "%<colname>%" as token to extract the value for a partition column. With "%<colname:regex>%" a regex can be given to limit search. This is especially useful if there is no char to delimit the last token from the rest of the path or also between two tokens.

    saveMode

    Overwrite or Append new data.

  24. case class SplunkDataObject(id: DataObjectId, params: SplunkParams, connectionId: ConnectionId, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends DataObject with CanCreateDataFrame with SplunkService with Product with Serializable

    Permalink

    DataObject of type Splunk.

    DataObject of type Splunk. Provides details to an action to access Splunk logs.

  25. case class SplunkParams(query: String, queryFrom: LocalDateTime, queryTo: LocalDateTime, queryTimeInterval: Duration = ofMinutes(10), columnNames: Seq[String] = Seq("_raw", "_time"), parallelRequests: Int = 2) extends Product with Serializable

    Permalink
  26. case class Table(db: Option[String], name: String, query: Option[String] = None, primaryKey: Option[Seq[String]] = None, foreignKeys: Option[Seq[ForeignKey]] = None, options: Option[Map[String, String]] = None) extends Product with Serializable

    Permalink

    Table attributes

    Table attributes

    db

    optional override of db defined by connection

    name

    table name

    query

    optional select query

    primaryKey

    optional sequence of primary key columns

    foreignKeys

    optional sequence of foreign key definitions. This is used as metadata for a data catalog.

  27. case class TickTockHiveTableDataObject(id: DataObjectId, path: Option[String] = None, partitions: Seq[String] = Seq(), analyzeTableAfterWrite: Boolean = false, dateColumnType: DateColumnType = DateColumnType.Date, schemaMin: Option[StructType] = None, table: Table, numInitialHdfsPartitions: Int = 16, saveMode: SaveMode = SaveMode.Overwrite, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends TransactionalSparkTableDataObject with CanHandlePartitions with Product with Serializable

    Permalink
  28. case class WebserviceFileDataObject(id: DataObjectId, webserviceOptions: WebserviceOptions, partitionDefs: Seq[WebservicePartitionDefinition] = Seq(), partitionLayout: Option[String] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends FileRefDataObject with CanCreateInputStream with SmartDataLakeLogger with Product with Serializable

    Permalink

    DataObject to call webservice and return response as InputStream This is implemented as FileRefDataObject because the response is treated as some file content.

    DataObject to call webservice and return response as InputStream This is implemented as FileRefDataObject because the response is treated as some file content. FileRefDataObjects support partitioned data. For a WebserviceFileDataObject partitions are mapped as query parameters to create query string. All possible query parameter values must be given in configuration.

    partitionDefs

    list of partitions with list of possible values for every entry

    partitionLayout

    definition of partitions in query string. Use %<partitionColName>% as placeholder for partition column value in layout.

  29. case class WebserviceOptions(url: String, connectionTimeoutMs: Option[Int] = None, readTimeoutMs: Option[Int] = None, authHeader: Option[String] = None, clientIdVariable: Option[String] = None, clientSecretVariable: Option[String] = None, keycloakAuth: Option[KeycloakConfig] = None, userVariable: Option[String] = None, passwordVariable: Option[String] = None, token: Option[String] = None) extends Product with Serializable

    Permalink
  30. case class WebservicePartitionDefinition(name: String, values: Seq[String]) extends Product with Serializable

    Permalink
  31. case class XmlFileDataObject(id: DataObjectId, path: String, rowTag: Option[String] = None, xmlOptions: Option[Map[String, String]] = None, partitions: Seq[String] = Seq(), schema: Option[StructType] = None, schemaMin: Option[StructType] = None, saveMode: SaveMode = SaveMode.Overwrite, sparkRepartition: Option[SparkRepartitionDef] = None, flatten: Boolean = false, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, filenameColumn: Option[String] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends SparkFileDataObject with CanCreateDataFrame with CanWriteDataFrame with Product with Serializable

    Permalink

    A io.smartdatalake.workflow.dataobject.DataObject backed by an XML data source.

    A io.smartdatalake.workflow.dataobject.DataObject backed by an XML data source.

    It manages read and write access and configurations required for io.smartdatalake.workflow.action.Actions to work on XML formatted files.

    Reading and writing details are delegated to Apache Spark org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter respectively. The reader and writer implementations are provided by the databricks spark-xml proect.

    xmlOptions

    Settings for the underlying org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter.

    sparkRepartition

    Optional definition of repartition operation before writing DataFrame with Spark to Hadoop.

    See also

    org.apache.spark.sql.DataFrameWriter

    org.apache.spark.sql.DataFrameReader

Value Members

  1. object AccessTableDataObject extends FromConfigFactory[DataObject] with Serializable

    Permalink
  2. object ActionsExporterDataObject extends FromConfigFactory[ActionsExporterDataObject] with Serializable

    Permalink
  3. object AvroFileDataObject extends FromConfigFactory[DataObject] with Serializable

    Permalink
  4. object CsvFileDataObject extends FromConfigFactory[DataObject] with Serializable

    Permalink
  5. object CustomDfDataObject extends FromConfigFactory[DataObject] with Serializable

    Permalink
  6. object DataObjectsExporterDataObject extends FromConfigFactory[DataObjectsExporterDataObject] with Serializable

    Permalink
  7. object ExcelFileDataObject extends FromConfigFactory[DataObject] with Serializable

    Permalink
  8. object HiveTableDataObject extends FromConfigFactory[DataObject] with Serializable

    Permalink
  9. object JdbcTableDataObject extends FromConfigFactory[DataObject] with Serializable

    Permalink
  10. object JmsDataObject extends FromConfigFactory[DataObject] with Serializable

    Permalink
  11. object JsonFileDataObject extends FromConfigFactory[DataObject] with Serializable

    Permalink
  12. object KafkaColumnType extends Enumeration

    Permalink
  13. object KafkaTopicDataObject extends FromConfigFactory[DataObject] with Serializable

    Permalink
  14. object PKViolatorsDataObject extends FromConfigFactory[PKViolatorsDataObject] with Serializable

    Permalink
  15. object ParquetFileDataObject extends FromConfigFactory[DataObject] with Serializable

    Permalink
  16. object RawFileDataObject extends FromConfigFactory[DataObject] with Serializable

    Permalink
  17. object SFtpFileRefDataObject extends FromConfigFactory[DataObject] with Serializable

    Permalink
  18. object SplunkDataObject extends FromConfigFactory[DataObject] with Serializable

    Permalink
  19. object SplunkFormatter

    Permalink
  20. object SplunkParams extends Serializable

    Permalink
  21. object TickTockHiveTableDataObject extends FromConfigFactory[DataObject] with Serializable

    Permalink
  22. object WebserviceFileDataObject extends FromConfigFactory[DataObject] with SmartDataLakeLogger with Serializable

    Permalink
  23. object XmlFileDataObject extends FromConfigFactory[DataObject] with Serializable

    Permalink

Ungrouped