Get Started with Delta Lake Using Azure Data Factory

By: Ron L'Esteve | Updated: 08-17-2020 | Comments (4) | Related: Azure Data Factory


Problem

While working with Azure Data Lake Gen2 and Apache Spark, I began to learn about both the limitations of Apache Spark and the many challenges of data lake implementations. I also learned that ACID compliance is essential in a data lake and that Delta Lake offers solutions to many of these existing problems. What is a Delta Lake, and why do we need an ACID compliant lake? What are the benefits of Delta Lake, and what is a good way to get started with it?

Solution

Delta Lake is an open source storage layer that ensures the atomicity, consistency, isolation, and durability of data in the lake. In short, a Delta Lake is ACID compliant. In addition to providing ACID transactions, scalable metadata management, and more, Delta Lake runs on an existing Data Lake and is compatible with Apache Spark APIs. There are a few methods to get started with Delta Lake. Databricks offers notebooks along with compatible Apache Spark APIs to create and manage Delta Lakes. Alternatively, Azure Data Factory mapping dataflows, using scalable Apache Spark clusters, can be used to perform ACID-compliant CRUD operations via GUI-designed ETL pipelines. This article will demonstrate how to get started with Delta Lake using the new Azure Data Factory Delta Lake connector through examples of how to create, insert, update, and delete in a Delta Lake.
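
For readers who prefer the notebook route that Databricks offers, a minimal PySpark sketch of creating a Delta table looks like the following. This is only an illustration under assumptions: it assumes a Spark environment with the open source delta-spark package available, and the storage account and container paths are hypothetical placeholders, not values from this article.

from pyspark.sql import SparkSession

# Spark session configured for Delta Lake (open source delta-spark package)
spark = (SparkSession.builder
    .appName("delta-quickstart")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())

# Read a sample parquet file from the raw zone and write it out as an ACID-compliant Delta table
df = spark.read.parquet("abfss://raw@<storageaccount>.dfs.core.windows.net/userdata1.parquet")
(df.write
    .format("delta")
    .mode("overwrite")
    .save("abfss://staging@<storageaccount>.dfs.core.windows.net/delta/userdata"))

The rest of this article performs the equivalent create, insert, update, and delete operations through the ADF Delta Lake connector rather than notebook code.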

Why an ACID Delta Lake?

Introducing Delta Lake into a modern cloud data architecture offers many benefits. Traditionally, data lakes and Apache Spark are not ACID compliant, and Delta Lake adds ACID compliance to address the following issues.

Atomicity: Either all of the data is written or none of it is. Apache Spark's save modes do not use any locking and are not atomic. Because of this, a failed job may leave an incomplete file behind and corrupt the data. Additionally, a failed job may remove the old file and corrupt the new one. While this seems concerning, Spark does have built-in DataFrame writer APIs that are not atomic but behave that way for append operations; however, this comes with a performance overhead when used with cloud storage.

Consistency: Data is always in a valid state. Since the Spark API writer deletes the old file and then creates a new one, and the operation is not transactional, there is always a period of time between deleting the old file and creating the new one in which the file does not exist. In that scenario, if the overwrite operation fails, the previous file's data is lost and the new file may never be created. This is a typical consistency issue with Spark overwrites.

Isolation: Multiple transactions occur independently without interference. This means that while writing to a dataset, other concurrent reads or writes on the same dataset should not be affected by the write operation. Typical transactional databases offer several isolation levels. While Spark has commits at the task and job level, since it lacks atomicity it does not have isolation types.

Durability: Committed data is never lost. When Spark does not correctly implement a commit, it overrides all of the great durability features that cloud storage options offer and either corrupts or loses the data, which violates durability.
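
To make the atomicity and durability points concrete, here is a small, hedged comparison in PySpark; df is the dataframe from the earlier sketch and the paths are hypothetical placeholders.

# A plain parquet overwrite deletes the old files before the new ones are fully committed,
# so a mid-write failure can leave the folder corrupted or empty.
df.write.format("parquet").mode("overwrite").save("/lake/staging/users_parquet")

# A Delta overwrite is a single atomic commit in the transaction log: readers continue to see
# the previous table version until the new version is committed, and the old files remain
# available until they are vacuumed.
df.write.format("delta").mode("overwrite").save("/lake/staging/users_delta")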

Now that we understand today's data lake and the challenges along with the benefits of an ACID compliant Delta Lake, let's get started with the demo.

Prerequisites

For this demo, be sure to create the following prerequisites.

1) Create a Data Factory V2: Data Factory will be used to perform the ELT orchestrations. Additionally, ADF's Mapping Data Flows Delta Lake connector will be used to create and manage the Delta Lake. For more details on creating a Data Factory V2, see Quickstart: Create a data factory using the Azure Data Factory UI.

2) Create a Data Lake Storage Gen2 account: ADLS Gen2 will be the data lake repository in which the Delta Lake will be created. For more details on creating an ADLS Gen2 account, see: Creating Your First ADLS Gen2 Data Lake.

3) Create Data Lake Storage Gen2 containers and zones: Once your Data Lake Gen2 account is created, also create the corresponding containers and zones. For more information on ADLS Gen2 zone design, see: Create your data lake on Azure Data Lake Storage Gen2. This demo will use the raw zone to store a sample source parquet file. Additionally, the staging zone will be used for the Delta updates, inserts, deletes, and other transformations. While the curated zone will not be used in this demo, it is important to note that this zone may contain the final transformed E-T-L, advanced analytics, or data science models curated from the staging zone.

4) Upload data to the raw zone: Finally, you will need some data for this demo. Searching for "sample parquet files" will surface various online GitHub repositories and downloadable sample data. The data from the following GitHub repository was used for this demonstration.

5) Create a Data Factory parquet dataset pointing to the raw zone: The final prerequisite is to create a parquet-format dataset in the newly created ADF V2 instance that points to the sample parquet file stored in the raw zone (a quick way to inspect this file is sketched below).
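
Before building the pipelines, it can help to look at the sample data itself. A quick PySpark check of the raw-zone parquet file, reusing the spark session from the first sketch and a hypothetical placeholder path:

raw_path = "abfss://raw@<storageaccount>.dfs.core.windows.net/userdata1.parquet"
sample = spark.read.parquet(raw_path)
sample.printSchema()    # the sample file used here exposes 13 columns, e.g. id, first_name, last_name, gender
print(sample.count())   # number of source rows that will be inserted into the Delta table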

Create and Insert into Delta Lake

Now that all the prerequisites are met, we're ready to create the initial delta tables and insert data from our raw zone into the delta tables.

Let's start by creating a new Data Factory pipeline and adding a new Mapping Data Flow to it. Remember to give the pipeline and data flow sensible names, as in the example below.

Inside the data flow, add a source and a sink with the following settings. Schema drift can be enabled as needed for the specific use case.

For more details on schema drift, see Schema drift in mapping data flow.

Sampling provides a method of limiting the number of rows in the source, which is used primarily for testing and debugging purposes.

Since Delta Lake takes advantage of Spark's distributed processing power, it is capable of partitioning the data appropriately on its own. However, to demonstrate the ability to manually configure partitioning, I have configured 20 hash partitions on the id column. Read more about the different partition type options in the ADF mapping data flow documentation.
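
For comparison only, the same manual partitioning can be expressed directly in Spark. This is an illustrative sketch reusing the df dataframe from the first example, not what the ADF connector generates under the hood.

# Hash-style repartitioning into 20 partitions keyed on the id column
partitioned = df.repartition(20, "id")
print(partitioned.rdd.getNumPartitions())   # 20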

After adding the sink activity, make sure the sink type is set to Delta. For more information on the Delta connector in ADF, see Delta format in Azure Data Factory. Note that Delta is available as both a source and a sink in Mapping Data Flows. You will also be prompted to select the linked service once Delta is chosen as the sink type.

On the Settings tab, make sure the staging folder is selected and choose 'Allow insert' as the update method. Also select 'Truncate table' if the Delta table needs to be truncated before loading it.

For more information on the Vacuum command, see: Vacuum a Delta table (Delta Lake on Databricks). Essentially, Vacuum will delete files that are no longer referenced by the Delta table and that are older than the retention threshold in hours. The default is 30 days if the value is left at 0 or empty.
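
Outside of the ADF sink options, the same cleanup can be run with the Delta Lake Python API. A minimal sketch, reusing the spark session from earlier and a hypothetical staging path:

from delta.tables import DeltaTable

delta_path = "abfss://staging@<storageaccount>.dfs.core.windows.net/delta/userdata"  # hypothetical staging path
DeltaTable.forPath(spark, delta_path).vacuum(168)   # remove unreferenced files older than 168 hours (7 days)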

Finally, on the Optimize tab, simply use the current partitioning, since the partitioning configured at the source will flow downstream to the sink.

As expected, once the pipeline is triggered and finishes running, we can see that 13 new columns have been created across 20 different partitions.

Looking at the ADLS Gen2 staging folder, we can see that a _delta_log folder has been created along with 20 snappy-compressed parquet files.

Open the _delta_log folder to see the two transaction log files. For more information on understanding Delta logs, read: Diving into Delta Lake: Unpacking the transaction log.

Upon querying the new data in the staging Delta Lake, we can see that the new records have been inserted.
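
If you want to verify the load outside of ADF, a short PySpark check against the same hypothetical staging path shows both the inserted rows and the commit recorded in the transaction log:

from delta.tables import DeltaTable

staged = spark.read.format("delta").load(delta_path)
staged.show(5)

# The table history surfaces the WRITE commit created by the insert pipeline
DeltaTable.forPath(spark, delta_path).history().select("version", "timestamp", "operation").show()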

Update Delta Lake

So far, we've covered Insert in Delta Lake. Next, let's take a look at how Data Factory can handle updates to our delta tables.

As with the inserts, create a new ADF pipeline with a mapping data flow for the updates.

For this update demo, we will update the users' first and last names, converting them to lowercase. To do this, add a Derived Column and an Alter Row transformation to the update mapping data flow canvas.

The source data is still our staging Delta Lake, which was configured in the insert section. For more details on Delta time travel, see: Introducing Delta Time Travel for large-scale data lakes.
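
As a brief, hedged illustration of time travel, the same staging table can be read as of an earlier version (or a timestamp) with PySpark, reusing the hypothetical delta_path from the earlier sketches:

# Read the staging table as it looked at version 0, i.e. right after the initial insert
v0 = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)

# Or pin the read to a point in time instead
# v_ts = spark.read.format("delta").option("timestampAsOf", "2020-07-15 00:00:00").load(delta_path)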

The derived columns convert the first and last names to lowercase using the lower() expression function. Mapping Data Flows is capable of handling far more complex transformations at this stage.
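
The equivalent transformation in PySpark is a pair of lower() calls on the name columns; this is only a sketch of what the derived column expressions do, reusing the hypothetical delta_path from earlier:

from pyspark.sql.functions import lower

staged = spark.read.format("delta").load(delta_path)   # staging Delta table
updated = (staged
    .withColumn("first_name", lower("first_name"))
    .withColumn("last_name", lower("last_name")))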

In the Alter Row settings, we must specify an 'Update if' condition of true() to update all rows that meet the criteria.

Check sink settings.

Make sure the sink is still pointing to the Staging Delta Lake data. Also select Allow update as the update method. To show that multiple key columns can be selected simultaneously, 3 columns have been selected.
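For reference, the Delta Lake API equivalent of this keyed update is a merge on the same three key columns (id, registration_dttm, ip_address, which also appear in the commit log later in this article). A minimal sketch, reusing the hypothetical delta_path and the updated dataframe from the previous sketch:

from delta.tables import DeltaTable

(DeltaTable.forPath(spark, delta_path).alias("target")
    .merge(
        updated.alias("source"),   # 'updated' holds the lowercased names
        "source.id = target.id AND "
        "source.registration_dttm = target.registration_dttm AND "
        "source.ip_address = target.ip_address")
    .whenMatchedUpdateAll()        # update matching rows with the lowercased names
    .execute())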

After saving and triggering the pipeline, we can see from the results that the first and last names have been updated to lowercase.

Delete from Delta Lake

In summary, so far we have covered inserts and updates. Next, let's look at an example of how Mapping Data Flows handles deletions in Delta Lake.

Similar to the inserts and updates, create a new Data Factory pipeline and mapping data flow.

Configure the Delta source settings.

Since we are still working with the same staging Delta Lake, the source will be configured in the same way as for the inserts and updates.

For this example, let's delete all records where gender = 'Male'. To do this, set the Alter Row condition to: Delete if gender == 'Male'.
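
For reference, the same delete expressed against the Delta Lake API is a one-liner; again, delta_path is the hypothetical staging path used in the earlier sketches:

from delta.tables import DeltaTable

# Delete every row in the staging Delta table where gender is 'Male'
DeltaTable.forPath(spark, delta_path).delete("gender = 'Male'")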

Finally, configure the Delta settings of the sink.

Select Staging Delta Lake for the sink, select 'Allow delete' and select the desired key columns.

After publishing and triggering this pipeline, notice that all records where gender = Male have been removed.

Explore delta logs

Finally, let's take a look at the Delta logs to briefly understand how they were created and populated. The main commit information is generated and stored in the insert, update, and delete JSON commit files. Additionally, CRC files are created. CRC is a popular technique for checking data integrity because it has excellent error detection capabilities, uses few resources, and is easy to apply.
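
The JSON commit files are plain text, so they can also be inspected programmatically. A small, hedged PySpark sketch that reads every commit file in the hypothetical staging table's _delta_log folder and lists the recorded operations:

# Each commit is a JSON-lines file; commitInfo rows describe the operation that produced the commit
log = spark.read.json(delta_path + "/_delta_log/*.json")
(log.where("commitInfo is not null")
    .select("commitInfo.operation", "commitInfo.timestamp", "commitInfo.operationParameters")
    .show(truncate=False))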

Insert

As we can see by opening the Insert JSON commit file, it contains the commit information for the insert operations.

{"commitInfo":{"timestamp":1594782281467,"operation":"WRITE","operationParameters":{"mode":"Add","partitionBy":"[]"},"isolationLevel":"WriteSerializable" "esBlindAppend":verdadero}}

Update

Similarly, when we open the Update JSON commit file, it contains commit information for update operations.

{"commitInfo":{"timestamp":1594782711552,"operation":"MERGE","operationParameters":{"predicate":"(((source.`id` = target.`id`) AND (source.` registration_dttm` = target.`registration_dttm`)) AND (source.`ip_address` = target.`ip_address`))","updatePredicate":"((NOT ((source.`ra2b434a305b34f2f96cd5b4b4b4149455e`0))) O (NO ( (fuente.`ra2b434a305b34f2f96cd5b4b4149455e` & 8) = 0)))","deletePredicate":"(NO ((fuente.`ra2b434a305b34f2f96cd5b49455e` & 8) = 0)))","deletePredicate":"(NO ((fuente .`ra2b434a305b34f2f96cd5b49455e` & 8) = 0))" nivel de aislamiento":"EscribirSerializable","isBlindAppend":falso}}

Delete

Finally, when we open the Delete JSON commit file, it contains commit information for delete operations.

"commitInfo":{"timestamp":1594783812366,"operation":"MERGE","operationParameters":{"predicate":"(((source.`id` = target.`id`) AND (source.`registration_dttm ` = target.`registration_dttm`)) AND (source.`ip_address` = target.`ip_address`))","updatePredicate":"((NOT ((source.`ra079d97a688347b581710234d2cc4b63` & 2) = 0)) O NO ((source.`ra079d97a688347b581710234d2cc4b63` & 8) = 0)))","deletePredicate":"(NOT ((source.`ra079d97a688347b581710234d2cc)","isolation": 41" ":"WriteSerializable","isBlindAppend"); :FALSO}}

Next Steps
  • For more details related to Delta Lake, read the Databricks Delta Lake documentation.
  • For more information on understanding Delta Lake logs, read Diving into Delta Lake: Unpacking the transaction log.
  • For more information about the Delta connector in Azure Data Factory, see Delta format in Azure Data Factory.
  • Explore Delta Engine as an efficient way to process data in data lakes, including data stored in the open source Delta Lake.
About the Author

Ron L'Esteve is a trusted information technology thought leader and professional writer based in Illinois. He brings more than 20 years of IT experience and is known for his books and article publications on AI and data architecture, cloud engineering, and management. Ron completed his Master's in Business Administration and Finance from Loyola University Chicago.

See all my tips

Last article update: 2020-08-17
