Execution Engines

An execution engine is a technology or library used by SDLB to transform data. SDLB supports several execution engines and can combine different execution engines in the same data pipeline or job. The data structure used to transport data between DataObjects and Actions is called a SubFeed. Each execution engine has its own SubFeed type and a set of Actions and DataObjects associated with it.

Currently SDLB supports the following execution engines:

| Category | Execution Engine | SubFeed Name | Description | Supported Actions | Supported DataObjects |
|---|---|---|---|---|---|
| Java-Byte-Stream | File Engine | FileSubFeed | Transfer byte streams without further knowledge about their content | FileTransferAction, CustomFileAction | all HadoopFileDataObjects, WebserviceFileDataObject, SFtpFileDataObject |
| Generic DataFrame API | Spark Engine | SparkSubFeed | Transform data with the Spark DataFrame API | CopyAction, CustomDataFrameAction, DeduplicateAction, HistorizeAction | all Hadoop/SparkFileDataObjects, AccessTableDataObject, AirbyteDataObject, CustomDfDataObject, DeltaLakeTableDataObject, HiveTableDataObject, JdbcTableDataObject, JmsDataObject, KafkaTopicDataObject, SnowflakeTableDataObject, SplunkDataObject, TickTockHiveTableDataObject |
| Generic DataFrame API | Snowflake-Snowpark Engine | SnowparkSubFeed | Transform data within Snowflake with the Snowpark DataFrame API | CopyAction, CustomDataFrameAction, DeduplicateAction, (HistorizeAction) | SnowflakeTableDataObject |
| Script | Script Engine | ScriptSubFeed | Coordinate script task execution and notify DataObjects about script results | no public implementation for now | all DataObjects |

Connecting different execution engines

To build a data pipeline that uses different execution engines, you need a DataObject that both execution engines support as an interface, so that one execution engine can write data into the DataObject and the other can read from it. The following combinations are possible (a configuration sketch follows the list):

  • from FileSubFeed to SparkSubFeed (and vice-versa): any Hadoop/SparkFileDataObject like ParquetFileDataObject
  • from SparkSubFeed to SnowparkSubFeed (and vice-versa): SnowflakeTableDataObject
  • from ScriptSubFeed to any (and vice-versa): every DataObject is suitable
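
As a minimal sketch of the first combination: the configuration below hands data over from the File Engine to the Spark Engine through a CsvFileDataObject, which supports both FileSubFeed and SparkSubFeed. All DataObject/Action names and the webservice URL are hypothetical; connection and authentication details are omitted.

```hocon
dataObjects {
  # File Engine side: raw bytes downloaded from a webservice (hypothetical URL)
  ext-download {
    type = WebserviceFileDataObject
    url = "https://example.com/export/data.csv"
  }
  # Interface DataObject: accessible as files (FileSubFeed) and as a DataFrame (SparkSubFeed)
  stg-data {
    type = CsvFileDataObject
    path = "data/stg-data"
  }
  # Spark Engine side: target table
  btl-data {
    type = DeltaLakeTableDataObject
    path = "data/btl-data"
    table = { db = "default", name = "btl_data" }
  }
}
actions {
  # Runs with the File Engine (FileSubFeed): plain byte transfer, content not interpreted
  download {
    type = FileTransferAction
    inputId = ext-download
    outputId = stg-data
    metadata.feed = example
  }
  # Runs with the Spark Engine (SparkSubFeed): reads the same files as a Spark DataFrame
  load {
    type = CopyAction
    inputId = stg-data
    outputId = btl-data
    metadata.feed = example
  }
}
```

No extra glue is needed between the two engines: sharing the stg-data DataObject as output of one Action and input of the next is what connects FileSubFeed and SparkSubFeed.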

Schema propagation

Note that a schema can only be propagated within a data pipeline across consecutive Actions running on an execution engine of category "Generic DataFrame API". Whenever such an Action has an input from a different category, the schema is read again from the DataObject.

SDLB is able to convert schemas between different execution engines of category "Generic DataFrame API", e.g. Spark and Snowpark.

Determining execution engine to use in "Generic DataFrame API" Actions

A "Generic DataFrame API" Action can run with different execution engines like Spark or Snowpark. It determines the execution engine to use in Init-phase by checking the supported types of inputs, outputs and transformations. The first common type is chosen. If there is no common type an exception is thrown. To check which execution engine was chosen, look for logs like the following:

```
INFO CustomDataFrameAction - (Action~...) selected subFeedType SparkSubFeed
```
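
For example, here is a minimal sketch of an Action whose input and output are both SnowflakeTableDataObjects (account, credentials, environment variable names and table names are all hypothetical). Since SnowparkSubFeed is a type supported by both sides, SDLB selects the Snowpark Engine and the copy runs inside Snowflake:

```hocon
connections {
  sf-con {
    type = SnowflakeConnection
    url = "https://myaccount.snowflakecomputing.com"  # hypothetical account
    warehouse = "MY_WH"
    database = "MY_DB"
    role = "MY_ROLE"
    authMode {
      type = BasicAuthMode
      # credentials read from hypothetical environment variables
      userVariable = "ENV#SNOWFLAKE_USER"
      passwordVariable = "ENV#SNOWFLAKE_PWD"
    }
  }
}
dataObjects {
  sf-input {
    type = SnowflakeTableDataObject
    connectionId = sf-con
    table = { db = "STG", name = "ORDERS" }
  }
  sf-output {
    type = SnowflakeTableDataObject
    connectionId = sf-con
    table = { db = "BTL", name = "ORDERS" }
  }
}
actions {
  # Input and output both support SnowparkSubFeed, so this Action
  # is executed with the Snowflake-Snowpark Engine
  copy-orders {
    type = CopyAction
    inputId = sf-input
    outputId = sf-output
    metadata.feed = snowflake
  }
}
```

If one side of the Action were, say, a CsvFileDataObject instead, SparkSubFeed would be the only common type and the Action would fall back to the Spark Engine, reading from or writing to Snowflake over the network.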

Execution Engines vs Execution Environments

As mentioned in Architecture, SDLB is first and foremost a Java (Scala) application. It can run in any Execution Environment where you can install a JVM, executing Actions with any of its Execution Engines. SDLB chooses the Execution Engines for your data pipeline independently of the Execution Environment it runs in. For example, say you run SDLB in a distributed fashion on a Spark cluster using spark-submit. If one of your Actions has only SnowflakeTableDataObjects as input and output, SDLB will run it using the Snowpark Engine. In practice, this means that SDLB connects to the Snowflake environment from inside your Spark cluster and executes the Action from there using Snowpark's Java/Scala library.

Of course, the Execution Environment influences which DataObjects you have at your disposal: for instance, if you want to connect to Snowflake, you need a Snowflake account and network access to it. But the Execution Environment does not determine the Execution Engines SDLB will use; your DataObjects, Actions and Transformations do.