In many cases, datasets are not static: new data points are created, values change, and data expires. We are interested in keeping track of all these changes. This article first presents collecting data via JDBC with deduplication on the fly. Then, a Change Data Capture (CDC) enabled MS SQL table is transferred and historized in the data lake using the Airbyte MS SQL connector, which supports CDC. Methods for reducing computational and storage effort are also discussed.
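As a rough illustration of the "deduplication on the fly" step, the following PySpark sketch keeps only the latest record per business key. SDLB provides this declaratively (e.g. via its deduplicate action); the snippet only shows the underlying idea, and the column names `id` and `extracted_at` are purely illustrative:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

# Illustrative input: repeated extracts of the same source table, where `id`
# is the business key and `extracted_at` is the load timestamp.
df = spark.createDataFrame(
    [(1, "alice", "2024-01-01"), (1, "alice", "2024-01-02"), (2, "bob", "2024-01-02")],
    ["id", "name", "extracted_at"],
)

# Deduplicate on the fly: keep only the latest record per business key.
w = Window.partitionBy("id").orderBy(F.col("extracted_at").desc())
deduplicated = (
    df.withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
deduplicated.show()
```

Historization extends the same idea: instead of dropping superseded records, their validity intervals are closed (as in a slowly changing dimension type 2), so the full change history remains queryable.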
Deployment on Databricks
Many analytics applications are being ported to the cloud, and data lakes and lakehouses in the cloud are becoming more and more popular. The Databricks platform provides an easily accessible and easily configurable way to implement a modern analytics platform. Smart Data Lake Builder, on the other hand, is an open-source, portable automation tool to load and transform the data.
This article describes the deployment of Smart Data Lake Builder (SDLB) on Databricks.
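The article walks through the deployment itself; as a sketch of one possible automation around it, the snippet below creates a Databricks job for the SDLB assembly jar via the Jobs API 2.1 using plain `requests`. Host, token, jar path, main class, and SDLB parameters are assumptions for illustration, not values from the article:

```python
import requests

# Sketch: create a Databricks job that runs the SDLB assembly jar via the
# Jobs API 2.1. All values below (host, token, paths, main class, feed
# selection) are placeholders.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "sdlb-run",
    "tasks": [
        {
            "task_key": "run_sdlb",
            "spark_jar_task": {
                # Assumed SDLB entry point; check the SDLB docs for your version.
                "main_class_name": "io.smartdatalake.app.LocalSmartDataLakeBuilder",
                "parameters": ["--config", "dbfs:/sdlb/config", "--feed-sel", "mylayer"],
            },
            "libraries": [{"jar": "dbfs:/sdlb/smartdatalake-assembly.jar"}],
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 1,
            },
        }
    ],
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # contains the new job_id
```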
Combine Spark and Snowpark to ingest and transform data in one pipeline
This article shows how to create a single, unified data pipeline that uses Spark to ingest data into Snowflake and Snowpark to transform it inside Snowflake.
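A minimal sketch of such a pipeline in Python, assuming the Spark-Snowflake connector is available on the Spark classpath; connection settings, table, and column names are placeholders:

```python
from pyspark.sql import SparkSession
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_

# Placeholder connection settings; in practice read them from a secret store.
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "STAGING",
    "sfWarehouse": "COMPUTE_WH",
}

# Step 1 (Spark): ingest a raw CSV file into a Snowflake staging table
# through the Spark-Snowflake connector.
spark = SparkSession.builder.appName("ingest").getOrCreate()
raw = spark.read.option("header", True).csv("/data/raw/orders.csv")
(raw.write
    .format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "ORDERS_RAW")
    .mode("overwrite")
    .save())

# Step 2 (Snowpark): transform inside Snowflake, without moving data out again.
session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "database": "ANALYTICS",
    "schema": "STAGING",
    "warehouse": "COMPUTE_WH",
}).create()

(session.table("ORDERS_RAW")
    .group_by(col("CUSTOMER_ID"))
    .agg(sum_(col("AMOUNT").cast("double")).alias("TOTAL_AMOUNT"))
    .write.save_as_table("ORDERS_AGG", mode="overwrite"))
```

Keeping the transformation in Snowpark means the aggregation runs inside Snowflake's warehouse, so the data crosses the network only once, during ingestion.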
Using an Airbyte connector to inspect GitHub data
This article presents the deployment of an Airbyte connector with Smart Data Lake Builder (SDLB). In particular, the GitHub connector from Airbyte's Python sources is used.
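Airbyte Python sources speak the Airbyte protocol as JSON messages on stdout; in the article, SDLB drives this interaction. The sketch below instead runs such a source's `read` command directly and collects its RECORD messages, which can help when inspecting a connector. The file paths follow common Airbyte connector conventions but are assumptions here:

```python
import json
import subprocess

# Sketch: invoke an Airbyte Python source (e.g. the GitHub connector checked
# out from the Airbyte repository) and collect its RECORD messages.
cmd = [
    "python", "main.py", "read",
    "--config", "secrets/config.json",        # connector credentials and settings
    "--catalog", "configured_catalog.json",   # streams and sync modes to read
]

proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
records = []
for line in proc.stdout:
    # The source emits one JSON AirbyteMessage per line (LOG, STATE, RECORD, ...).
    msg = json.loads(line)
    if msg.get("type") == "RECORD":
        records.append(msg["record"]["data"])
proc.wait()

print(f"received {len(records)} records")
```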