This is a sponsored blog post by Delta Lake.
Data pipelines are used by companies across many industries, from eCommerce to healthcare and financial services, and they can serve either batch or streaming workloads. Traditional data lakes store large repositories of raw data, but on their own they act as little more than a scratchpad for big data analysts to test out different analytics methods. Delta Lake solves many of the limitations presented by traditional data lakes. In this post we will look at what pipelines and data lakes are, and how the limitations of data lakes can be addressed using Delta Lake.
Data Pipelines
Delta Lake is often used in data pipelines. Data pipelines are a series of processes that move and transform data from source to destination. They are used to extract data from various sources, transform it into a format that can be used by analytics or machine learning tools, and load it into a data warehouse or data lake.
When it comes to data integration, there are two widely used approaches: Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT). In my experience, most teams start out with ETL when the data is small or when keeping the raw data around is not that important. When dealing with large volumes of data that are difficult to re-extract from their sources, or when audits of the raw data are needed, I recommend ELT instead.
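To make the difference concrete, here is a minimal, self-contained sketch of the two orderings in Python. The data, the in-memory "lake", and the transform are illustrative stand-ins, not part of any real pipeline framework.

```python
# Illustrative only: plain Python lists stand in for a source system,
# a data lake, and a warehouse.
source = [{"order_id": 1, "amount": "19.99"}, {"order_id": 2, "amount": "5.00"}]

def transform(rows):
    # Cast amounts to float so downstream analytics can aggregate them.
    return [{**row, "amount": float(row["amount"])} for row in rows]

# ETL: transform first, then load only the curated result into the warehouse.
warehouse = transform(source)

# ELT: land the raw data first (an auditable copy), transform later on demand.
data_lake = list(source)          # raw, untouched copy kept for audits/replays
warehouse = transform(data_lake)
```

The end result is the same; the difference is whether the raw extract is preserved, which is what makes ELT attractive when re-extraction is expensive or audits matter.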
Regardless of whether you choose to use ETL or ELT, it is important to ensure that each step of the process has access to sufficient data storage. This can take the form of intermediary storage during the process or storage of final data at the end. One excellent option for this is to use a data lake.
What is a data lake?
A data lake is a large, centralized repository that allows for the storage of all structured and unstructured data at any scale. Data lakes are used by organizations to store data in their native format, which makes it easier to analyze and extract insights from the data.
Data lakes: a great candidate for pipelines
For data processing and data pipelines, data lakes are an excellent solution, offering a myriad of benefits. Let's explore some of the advantages in greater detail:
Centralization of data: One of the key benefits of data lakes is that they provide a centralized location for storing data. Rather than pulling data from multiple servers, all of the data can be stored in a single bucket or blob store, making it easier to process and analyze.
Scalability: Data lakes can store vast amounts of data, making them ideal for handling big data. This scalability ensures that companies are better equipped to deal with large volumes of data, which can be crucial in today's data-driven world.
Cost efficiency: Compared to a data warehouse like Redshift or a relational database like SQL Server running in AWS, data lakes are much more cost-effective. This is especially true when storing large amounts of data, as data lakes can store it at a much lower price point. Furthermore, many data lakes offer storage tiering that makes it even cheaper to keep historical data.
Support for multiple file formats: Data lakes can store data in various formats, including structured, semi-structured, and unstructured data. This flexibility ensures that companies can work with the file types that are most convenient for them, making data processing and analysis more efficient and streamlined.
In summary, data lakes are a powerful tool for data processing and pipelines. They provide centralized storage, scalability, cost efficiency, and support for multiple file formats, making them an excellent choice for data processing.
Before you begin to celebrate your freedom from whatever tool you are currently using, take a minute to examine the downfalls.
Data Lakes’ Downfall
But these same strengths are also the cause of data lakes' downfall.
Let’s take a look at a few commonly seen scenarios.
To set the stage, picture a team of 5 data engineers and 5 data analysts (I know! A dream come true) at a medium-sized eCommerce company that sells digital courses. Most of their data sources are third-party services (think Google Analytics for website analytics and Zendesk for customer service). The company is doing extremely well and has over 500,000 users (or customers). Each API is called numerous times throughout the day, so fresh data arrives frequently and in large volumes. A great use case for a data lake.
The data engineers simply pull the data from the APIs and dump it into the data lake, with buckets such as stg/zendesk and prod/zendesk. If the API changes and the schema is now different, the data lake just passively absorbs the new shape alongside the old data, corrupting the data when it is read.
Each data analyst also pays for how much data they scan per query. The first dataset covered only the last 3 months, so a query scanned about 100 GB. Over the next 2.5 years, the data grew to 5 TB. The same query now has to scan about 50x as much data, takes about 50x as long, and costs about 50x as much.
In the following weeks, the data analysts found a bug on one of their dashboards. After a week of searching, the data engineers traced it to one of the data sources duplicating data in the data lake. They verified that the sources themselves were not at fault, but no one knows how the duplication happened.
I wonder if there’s a tool that could alleviate some of those problems. Oh wait, there is! Delta Lake!
Delta Lake
If the previous company had chosen to use Delta Lake instead, most of their problems would have gone away. Before addressing those problems, let's dive into what Delta Lake is.
Delta Lake is an open-source data storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It does not replace the data lake, but instead builds on top of it.
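Here is a minimal sketch of what that looks like in practice with PySpark, assuming the delta-spark package is installed; the path and column names are illustrative.

```python
from pyspark.sql import SparkSession

# Standard configuration for enabling Delta Lake in a Spark session
# (assumes the delta-spark package is on the classpath).
spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

events = spark.createDataFrame(
    [(1, "page_view"), (2, "purchase")], ["user_id", "event_type"]
)

# The data still lives in the lake (S3, ADLS, GCS, or local files); Delta just
# adds a transaction log alongside the Parquet files it writes.
events.write.format("delta").mode("append").save("/tmp/lake/prod/events")

spark.read.format("delta").load("/tmp/lake/prod/events").show()
```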
Now, we can look at how Delta Lake will make the team’s problem disappear.
Data engineers often encounter upstream schema changes that happen without their knowledge. On every write, Delta Lake automatically verifies that the schema of the incoming data is compatible with the table's schema, and rejects the write if it is not. If the change is intentional, we can decide to evolve the schema and add the column: Delta Lake updates the table schema as part of the same transaction (either appending or overwriting), making it compatible with the written data. This prevents schema drift from silently corrupting the table.
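A short sketch of both behaviors, continuing from the illustrative table above (the extra column is hypothetical):

```python
# A new batch arrives where the upstream API added a "country" column.
new_batch = spark.createDataFrame(
    [(3, "page_view", "US")], ["user_id", "event_type", "country"]
)

# Schema enforcement: without opting in, this append fails with an
# AnalysisException instead of silently corrupting the table.
# new_batch.write.format("delta").mode("append").save("/tmp/lake/prod/events")

# Schema evolution: opt in and the new column is added in the same
# transaction; existing rows read back with country = NULL.
(new_batch.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/tmp/lake/prod/events"))
```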
Data analysts want to be good stewards of the data entrusted to them, and that obviously includes the cost and speed of their queries. Delta Lake can reduce the cost and improve the speed of queries. It implements two maintenance operations (see the sketch after this list):
Optimize improves query speed by compacting small files, reducing the number of files and minimizing the overhead of opening many tiny files on subsequent reads.
Z-order colocates related information so that query engines can skip entire files based on metadata statistics.
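Both are available as SQL commands in recent Delta Lake releases; here is a sketch against the same illustrative table path used above:

```python
# Compact many small files into fewer, larger ones.
spark.sql("OPTIMIZE delta.`/tmp/lake/prod/events`")

# Co-locate rows by a commonly filtered column so per-file statistics let the
# engine skip whole files for selective queries.
spark.sql("OPTIMIZE delta.`/tmp/lake/prod/events` ZORDER BY (user_id)")
```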
Plus, they will be able to look at table details such as who created the table, when it was created, the partition columns, the size of the table, other properties, and even custom metadata. Tables are no longer a mystery.
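For example, a single command returns that metadata for the illustrative table used above:

```python
# Returns creation time, location, format, partition columns, size in bytes,
# number of files, table properties, and more.
spark.sql("DESCRIBE DETAIL delta.`/tmp/lake/prod/events`").show(truncate=False)
```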
Data engineers were having trouble finding out what happened to the underlying data. Delta Lake versions a table with every operation that modifies it. You can use that history to audit operations or query the table as it was at a specific point in time.
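Continuing the same illustrative example, the audit and time-travel reads look like this:

```python
# Every write is a new table version; history records the operation, the
# timestamp, and who ran it, which makes audits straightforward.
spark.sql("DESCRIBE HISTORY delta.`/tmp/lake/prod/events`").show(truncate=False)

# Read the table as it was at an earlier version (timestampAsOf also works).
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("/tmp/lake/prod/events"))
v0.show()
```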
Though there are many more benefits to Delta Lake, these are some of the most influential when it comes to data pipelining.
Conclusion
Delta Lake is a powerful layer on top of data lakes that brings reliability to data pipelines. It provides ACID transactions, scalable metadata handling, and unified streaming and batch processing. Delta Lake builds on top of data lakes and can reduce costs, improve query speed, and provide detailed information about tables. Schema validation happens on every write, preventing bad data from silently corrupting tables, and every operation that modifies a table creates a new version, making it easy to audit operations or query a table at a specific point in time.