Execution Phases
Early validation
Execution of a SmartDataLakeBuilder run is designed with "early validation" in mind. This means it tries to fail as early as possible if something is wrong.
The following phases are involved during each execution:
- Parse configuration:
  Parses and validates your configuration files. This step fails if anything is wrong with your configuration, e.g. if a required attribute is missing or a whole block like actions {} is missing or misspelled. There's also a neat feature that will warn you of typos and suggest spelling corrections where it can.
- DAG prepare:
  Preconditions are validated. This includes testing Connections and DataObject structures that must exist (see the sketch after this list).
- DAG init:
  Creates and validates the whole lineage of Actions according to the DAG. For Spark Actions this involves validating the DataFrame lineage: a column that doesn't exist but is referenced in a later Action will fail the execution.
- DAG exec:
  Execution Modes are applied to select data, and Actions are executed. Data is effectively transferred in this phase (and only in this phase!).
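To make the prepare phase concrete, here is a conceptual sketch of the kind of fail-fast precondition checks it performs. This is not SDLB's actual implementation; the path and JDBC URL are placeholders:

```scala
import java.nio.file.{Files, Paths}
import java.sql.DriverManager

// Conceptual sketch only: NOT SDLB's implementation, just the kind of
// fail-fast checks the prepare phase performs before any Action runs.
def prepare(inputPath: String, jdbcUrl: String): Unit = {
  // DataObject structure check: the input location must already exist.
  require(Files.exists(Paths.get(inputPath)), s"input path $inputPath does not exist")
  // Connection check: opening (and closing) a connection proves it is reachable.
  val connection = DriverManager.getConnection(jdbcUrl) // placeholder URL; driver must be on the classpath
  connection.close()
}
```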
Implications
Early validation
As mentioned, the init phase in particular is very powerful, as SDL validates your whole lineage.
This even includes custom transformations in your pipeline.
So if you have a typo in a column name, or reference a column that will not exist at that stage of the pipeline,
SDL will report this error and fail within seconds.
This saves a lot of time as you don't have to wait for the whole pipeline to execute to catch these errors.
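As an illustration, consider a custom Scala transformer like the ones from the getting-started guide. CustomDfTransformer is SDLB's interface for custom transformations, but its package path and exact signature depend on your SDLB version, and the column name below is made up:

```scala
// Sketch of a custom transformer. The CustomDfTransformer package path
// varies across SDLB versions, so check the API docs of the version you use.
import io.smartdatalake.workflow.action.customlogic.CustomDfTransformer
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

class SelectDeparturesTransformer extends CustomDfTransformer {
  override def transform(session: SparkSession, options: Map[String, String],
                         df: DataFrame, dataObjectId: String): DataFrame = {
    // "icao24" is an illustrative column name: with a typo such as "icao42",
    // the init phase fails within seconds, before any data is read.
    df.select(col("icao24"))
  }
}
```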
No data during Init Phase
At some point, you will start implementing your own transformers.
When analyzing problems, you will want to debug them, most likely by setting breakpoints somewhere in the transform method.
It's important to know that execution will pass your breakpoint twice:
once during the init phase and once during the exec phase.
In the init phase, the whole execution is validated, but without actually moving any data.
If you take a look at your DataFrames at this point, they will be empty.
We can guarantee that you will fall into this trap at least once. ;-)
If you debug your code and wonder why your DataFrame is completely empty, you probably stopped execution during the init phase.
Continue execution and make sure you're in the exec phase before taking a look at data in your DataFrame.
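A safe debugging habit is to inspect metadata instead of data. The sketch below (same assumed transformer signature as above) shows what you can look at in both phases:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// A breakpoint inside this method is hit twice: once in the init phase
// (df is empty) and once in the exec phase (df carries real data).
def transform(session: SparkSession, options: Map[String, String],
              df: DataFrame, dataObjectId: String): DataFrame = {
  // Safe in both phases: the schema is metadata, no data is moved.
  println(df.schema.treeString)
  // Avoid df.show() or df.count() for debugging here: both are Spark actions
  // and trigger execution even during the init phase (see the next section).
  df
}
```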
How does the Init Phase work for Spark Actions?
During the init phase, the whole Spark DAG is evaluated by executing all the code of the SDLB Actions, but without triggering any Spark action such as show, count or write. See this Spark tutorial for more on Spark actions. This is how SDLB is able to check your lineage for Spark Actions in the init phase: under the hood, it relies on Spark to validate the execution DAG. If you trigger Spark actions in your custom transformers (which is considered bad practice in most cases), you basically break that mechanism.
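You can observe the same mechanism in plain Spark. In the sketch below (file and column names are made up), transformations only build and validate the plan; data is moved only when a Spark action runs:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("lazy-demo").getOrCreate()

// Transformations only build the execution plan; no rows are moved yet.
val df = spark.read.option("header", "true").csv("departures.csv")

// Fails right here with an AnalysisException if the column doesn't exist,
// even though no Spark action has run: Spark resolves the plan eagerly.
val selected = df.select(col("no_such_column"))

// Only a Spark action such as write, show or count actually moves data.
selected.write.parquet("output/")
```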
Watch the log output
The phases are also clearly marked in the log output. Here is the sample output from part 3 of the getting-started guide again, with a few things removed:
Action~download-departures[CopyAction]: Prepare started
Action~download-departures[CopyAction]: Prepare succeeded
Action~download-departures[CopyAction]: Init started
Action~download-departures[CopyAction]: Init succeeded
Action~download-departures[CopyAction]: Exec started
...
If execution stops, always check during which phase that happens. If it is still in the init phase, the problem probably has nothing to do with the data itself, but rather with the structure of your DataObjects.