# Execution Engines
An execution engine is a technology or library used by SDLB to transform data. SDLB supports several execution engines and can combine different execution engines in the same data pipeline or job. The data structure used to transport data between DataObjects and Actions is called a SubFeed. Each Execution Engine has its own SubFeed type as well as Actions and DataObjects associated with it.
Currently SDLB supports the following execution engines:
Category | Execution Engine | SubFeed Name | Description | Supported Actions | Supported DataObjects |
---|---|---|---|---|---|
Java-Byte-Stream | File Engine | FileSubFeed | Transfer byte streams without further knowledge of their content | FileTransferAction, CustomFileAction | all HadoopFileDataObjects, WebserviceFileDataObject, SFtpFileDataObject |
Generic DataFrame API | Spark Engine | SparkSubFeed | Transform data with the Spark DataFrame API | CopyAction, CustomDataFrameAction, DeduplicateAction, HistorizeAction | all Hadoop/SparkFileDataObjects, AccessTableDataObject, AirbyteDataObject, CustomDfDataObject, DeltaLakeTableDataObject, HiveTableDataObject, JdbcTableDataObject, JmsDataObject, KafkaTopicDataObject, SnowflakeTableDataObject, SplunkDataObject, TickTockHiveTableDataObject |
Generic DataFrame API | Snowflake-Snowpark Engine | SnowparkSubFeed | Transform data within Snowflake with the Snowpark DataFrame API | CopyAction, CustomDataFrameAction, DeduplicateAction, (HistorizeAction) | SnowflakeTableDataObject |
Script | Script Engine | ScriptSubFeed | Coordinate script task execution and notify DataObjects about script results | No public implementation for now | all DataObjects |
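For illustration, a minimal configuration for a single action running on the Spark Engine could look like the following sketch; the ids, connection, table names and paths are hypothetical:

```hocon
dataObjects {
  src-orders {
    type = JdbcTableDataObject
    connectionId = my-jdbc-con              # hypothetical connection defined elsewhere
    table { db = "sales", name = "orders" }
  }
  stg-orders {
    type = DeltaLakeTableDataObject
    path = "/data/stg/orders"               # hypothetical path
    table { db = "stg", name = "orders" }
  }
}
actions {
  # both DataObjects appear in the Spark row of the table above,
  # so this action is processed as a SparkSubFeed
  copy-orders {
    type = CopyAction
    inputId = src-orders
    outputId = stg-orders
    metadata.feed = orders
  }
}
```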
## Connecting different execution engines
To build a data pipeline that combines different execution engines, you need a DataObject that supports both execution engines as an interface, so that one engine can write data to the DataObject and the other can read it from there. Suitable interface DataObjects are (see the configuration sketch after this list):
- from FileSubFeed to SparkSubFeed (and vice-versa): any Hadoop/SparkFileDataObject like ParquetFileDataObject
- from SparkSubFeed to SnowparkSubFeed (and vice-versa): SnowflakeTableDataObject
- from ScriptSubFeed to any (and vice-versa): every DataObject is suitable
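As a sketch of the first case, the following hypothetical configuration uses a CsvFileDataObject (a Hadoop/SparkFileDataObject) as interface between the File Engine and the Spark Engine; the ids, URL and paths are assumptions:

```hocon
dataObjects {
  ext-report {
    type = WebserviceFileDataObject
    url = "https://example.com/api/report"  # hypothetical endpoint
  }
  stg-report {
    # Hadoop/SparkFileDataObject: written by the File Engine, read by the Spark Engine
    type = CsvFileDataObject
    path = "/data/stg/report"               # hypothetical path
  }
  btl-report {
    type = DeltaLakeTableDataObject
    path = "/data/btl/report"
    table { db = "btl", name = "report" }
  }
}
actions {
  # File Engine (FileSubFeed): plain byte transfer, no knowledge about the content
  download-report {
    type = FileTransferAction
    inputId = ext-report
    outputId = stg-report
    metadata.feed = report
  }
  # Spark Engine (SparkSubFeed): reads the CSV files written by the action above
  load-report {
    type = CopyAction
    inputId = stg-report
    outputId = btl-report
    metadata.feed = report
  }
}
```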
## Schema propagation
Note that a schema can only be propagated through a data pipeline across consecutive Actions running with an execution engine of the "Generic DataFrame API" category. Whenever such an Action has an input coming from an engine of a different category, the schema is read again from the DataObject.
SDLB is able to convert schemas between different execution engines of category "Generic DataFrame API", e.g. Spark and Snowpark.
## Determining the execution engine to use in "Generic DataFrame API" Actions
A "Generic DataFrame API" Action can run with different execution engines like Spark or Snowpark. It determines the execution engine to use in Init-phase by checking the supported types of inputs, outputs and transformations. The first common type is chosen. If there is no common type an exception is thrown. To check which execution engine was chosen, look for logs like the following:
```
INFO CustomDataFrameAction - (Action~...) selected subFeedType SparkSubFeed
```
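For illustration, in the hypothetical configuration below the first action only touches SnowflakeTableDataObjects and would select SnowparkSubFeed, while the second writes to a ParquetFileDataObject and therefore falls back to SparkSubFeed; the connection id, tables and path are assumptions:

```hocon
dataObjects {
  stage-orders {
    type = SnowflakeTableDataObject
    connectionId = sf-con                   # hypothetical Snowflake connection
    table { db = "STAGE", name = "ORDERS" }
  }
  mart-orders {
    type = SnowflakeTableDataObject
    connectionId = sf-con
    table { db = "MART", name = "ORDERS" }
  }
  export-orders {
    type = ParquetFileDataObject
    path = "/data/export/orders"            # hypothetical path
  }
}
actions {
  # input and output both support Snowpark -> selected subFeedType SnowparkSubFeed
  transform-orders {
    type = CopyAction
    inputId = stage-orders
    outputId = mart-orders
  }
  # ParquetFileDataObject only supports Spark -> selected subFeedType SparkSubFeed
  export-orders-to-lake {
    type = CopyAction
    inputId = mart-orders
    outputId = export-orders
  }
}
```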
## Execution Engines vs Execution Environments
As mentioned in Architecture, SDLB is first and foremost a Java (Scala) application. It can run in any Execution Environment where a JVM can be installed, executing Actions with any of its Execution Engines. SDLB chooses the Execution Engines for your data pipeline independently of the Execution Environment that SDLB runs in. For example, let's say you run SDLB in a distributed fashion on a Spark cluster using spark-submit. If one of your Actions only has SnowflakeTableDataObjects as input and output, SDLB will run it using the Snowpark Engine. In practice, this means that SDLB will connect to the Snowflake environment from inside your Spark cluster and execute the Action from there using Snowpark's Java/Scala library.
Of course, the Execution Environment influences which DataObjects you have at your disposal: for instance, if you want to connect to Snowflake, you need a Snowflake account and must be able to reach Snowflake from your environment. But the Execution Environment does not determine the Execution Engines SDLB will use - your DataObjects, Actions and Transformations do.