Data Quality
Data quality is an important topic in data governance. To monitor and improve data quality, data pipelines should implement data quality measures.
SDLB provides the following features to improve data quality:
- Runtime metrics to monitor data pipeline output and track it over time
- Row-level constraints to stop before wrong data is written to an output
- Expectations on dataset level to stop or warn on implausible data (see the configuration sketch after this list)
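For illustration, the following is a minimal configuration sketch of a DataObject with a row-level constraint and a dataset-level expectation. The attribute names follow the SDLB configuration schema for constraints and expectations, but the DataObject name, table, and expressions are hypothetical, and the available options should be verified against the SDLB configuration reference for the version in use:

```
dataObjects {
  tgt1 {
    type = DeltaLakeTableDataObject
    table = { db = "default", name = "tgt1" }
    # row-level constraint: stops the job before wrong data is written to the output
    constraints = [{
      name = idNotNull
      description = "primary key must not be null"
      expression = "id is not null"
    }]
    # dataset-level expectation: stops or warns on implausible data
    expectations = [{
      type = SQLExpectation
      name = notEmpty
      description = "output table must not be empty"
      aggExpression = "count(*)"
      expectation = "> 0"
      failedSeverity = Warn
    }]
  }
}
```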
Metrics
Every SDLB job collects metrics for each Action and DataObject written. They are logged with log statements like the following:
2020-07-21 11:36:34 INFO CopyAction:105 - (Action~a) finished writing to DataObject~tgt1: job_duration=PT0.906S count=1 records_written=1 bytes_written=1142 num_tasks=1 stage=save
`job_duration` is always recorded. For DataFrame based Actions, the number of records written is recorded as `count`, and the number of records read as `count#<dataObjectId>`. Further metrics are recorded depending on the DataObject type, e.g. `rows_inserted`/`updated`/`deleted` for merge statements. It is also possible to record custom metrics, see chapter "Expectations" below.
Metrics are also stored in the state file. If you want to sync them to a monitoring system in real time, you can implement a StateListener: it gets notified about new action events and metrics as soon as they are available. State listeners are configured with the config attribute `global.stateListeners = [{className = ...}]`.
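As a sketch, such a state listener could look as follows in Scala. The StateListener trait is part of the SDLB API, but its exact method signature differs between SDLB versions and should be checked against the version in use; the class name, package, and logging body here are hypothetical:

```scala
package com.example.monitoring

import io.smartdatalake.app.StateListener
import io.smartdatalake.config.SdlConfigObject.ActionId
import io.smartdatalake.workflow.{ActionDAGRunState, ActionPipelineContext}

// Hypothetical listener forwarding action states and metrics to a monitoring system.
// Note: verify the notifyState signature against the SDLB version in use.
class PushMetricsStateListener extends StateListener {

  override def notifyState(state: ActionDAGRunState, context: ActionPipelineContext, changedActionId: Option[ActionId]): Unit = {
    // actionsState maps each Action to its current runtime state incl. metrics;
    // replace the println with a call to your monitoring system's API.
    state.actionsState.foreach { case (actionId, runtimeInfo) =>
      println(s"$actionId -> $runtimeInfo")
    }
  }
}
```

The listener is then registered by setting its fully qualified class name in `global.stateListeners` as shown above.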