By: Ron L'Esteve | Updated: 08-17-2020 | Related: Azure Data Factory
Problem
While working with Azure Data Lake Gen2 and Apache Spark, I began to learn about both the limitations of Apache Spark and the many data lake implementation challenges. I also learned that an ACID compliant feature set is essential within a lake and that Delta Lake offers many solutions to these existing problems. What is a Delta Lake and why do we need an ACID compliant lake? What are the benefits of Delta Lake, and what is a good way to get started with it?
Solution
Delta Lake is an open source storage layer that ensures the atomicity, consistency, isolation, and durability of data in the lake. In short, a Delta Lake is ACID compliant. In addition to providing ACID transactions, scalable metadata management, and more, Delta Lake runs on an existing Data Lake and is compatible with Apache Spark APIs. There are a few methods to get started with Delta Lake. Databricks offers notebooks along with compatible Apache Spark APIs to create and manage Delta Lakes. Alternatively, Azure Data Factory mapping dataflows, using scalable Apache Spark clusters, can be used to perform ACID-compliant CRUD operations via GUI-designed ETL pipelines. This article will demonstrate how to get started with Delta Lake using the new Azure Data Factory Delta Lake connector through examples of how to create, insert, update, and delete in a Delta Lake.
Why an ACID Delta Lake?
The introduction of Delta Lake into a modern cloud data architecture offers many benefits. Traditionally, data lakes and Apache Spark are not ACID compliant. Delta Lake introduces ACID compliance to address the following issues.
Atomicity: Write all the data or nothing. Apache Spark's save modes do not use any locking and are not atomic. As a result, a failed job may leave behind an incomplete file and corrupt the data, or it may remove the old file and corrupt the new one. While this sounds concerning, Spark does have built-in DataFrame writer APIs that are not atomic but behave that way for append operations; however, this comes with a performance overhead when used with cloud storage. (A short code sketch after this list contrasts the two write paths.)
Consistency: Data is always in a valid state. If the Spark API writer deletes an old file and creates a new one, and the operation is not transactional, there will always be a window of time between deleting the old file and creating the new one in which the file does not exist. If the overwrite operation fails in that window, the data from the previous file is lost and the new file may never be created. This is a typical consistency issue with Spark overwrites.
Isolation: Multiple transactions occur independently without interference. When you write to a dataset, other concurrent reads or writes on the same dataset should not be affected by the write operation. Typical transactional databases offer several isolation levels. While Spark has task-level and job-level commits, since it lacks atomicity it does not offer isolation guarantees.
Durability: Committed data is never lost. When Spark does not implement a commit correctly, it overrides the excellent durability features that cloud storage options offer and can corrupt or lose data, violating the durability of the data.
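To make the contrast concrete, here is a minimal PySpark sketch, assuming a Spark session where Delta Lake is available (for example a Databricks cluster, or the open source delta-spark package) and hypothetical ADLS Gen2 paths. A plain parquet overwrite removes the old files before the new ones are completely written, while a delta format write only becomes visible once a new version is committed to the transaction log.

from pyspark.sql import SparkSession

# Assumes Delta Lake is available on the cluster (e.g. Databricks) or via the delta-spark package.
spark = SparkSession.builder.getOrCreate()

# A small sample dataframe standing in for real source data.
df = spark.createDataFrame([(1, "John", "Smith"), (2, "Jane", "Doe")], ["id", "first_name", "last_name"])

# Plain parquet overwrite: old files are deleted before new ones are fully written,
# so a mid-write failure can leave the folder incomplete or empty.
df.write.mode("overwrite").parquet("abfss://staging@mydatalake.dfs.core.windows.net/users_parquet")

# Delta overwrite: new files are written first and only become the current version
# once the commit is appended to the _delta_log, so readers never see a partial state.
df.write.format("delta").mode("overwrite").save("abfss://staging@mydatalake.dfs.core.windows.net/delta/users")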
Now that we understand today's data lake and the challenges along with the benefits of an ACID compliant Delta Lake, let's get started with the demo.
Prerequisites
For this demo, be sure to create the following prerequisites.
1) Create a Data Factory V2: Data Factory will be used to perform the ELT orchestrations. Additionally, ADF's Mapping Data Flows Delta Lake connector will be used to create and manage the Delta Lake. For more details on creating a Data Factory V2, see Quickstart: Create a data factory using the Azure Data Factory UI.
2) Create a Data Lake Storage Gen2 account: ADLS Gen2 will be the data lake repository in which the Delta Lake will be created. For more details on creating ADLS Gen2, see: Creating Your First ADLS Gen2 Data Lake.
3) Create Data Lake Storage Gen2 Containers and Zones: Once your Data Lake Gen2 is created, also create the corresponding containers and zones. For more information on designing ADLS Gen2 zones, see: Create your data lake on Azure Data Lake Storage gen2. This demo will use the Raw Zone to store a sample source parquet file. Additionally, the Staging Zone will be used for delta updates, inserts, deletes, and other transformations. While the Curated Zone will not be used in this demo, it is worth noting that this zone may contain the final transformed E-T-L, advanced analytics, or data science models curated from the Staging Zone.
4) Upload Data to the Raw Zone: Finally, you need some data for this demo. Searching for "sample parquet files" will give you access to various online GitHub repositories or downloadable sample data. The following GitHub repository of sample parquet data was used for this demonstration. (A quick PySpark sanity check of the uploaded file follows this list.)
5) Create a Data Factory parquet dataset pointing to the Raw Zone: The final prerequisite is to create a parquet format dataset in the newly created ADF V2 instance, pointing to the sample parquet file stored in the Raw Zone.
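Optionally, the uploaded file can be inspected with a few lines of PySpark before building the pipeline. This is only a sketch; the storage account, container, and file name below are placeholders for whatever you created and uploaded in the steps above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical path to the sample parquet file uploaded to the Raw Zone.
raw_path = "abfss://raw@mydatalake.dfs.core.windows.net/userdata1.parquet"

df = spark.read.parquet(raw_path)
df.printSchema()   # expect columns such as id, first_name, last_name, gender, ip_address, registration_dttm
print(df.count())  # number of rows in the sample file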
Create and Insert into Delta Lake
Now that all the prerequisites are met, we're ready to create the initial delta tables and insert data from our raw zone into the delta tables.
Let's start by creating a new Data Factory pipeline and adding a new Mapping Data Flow to it. Also, remember to give the pipeline and data flow sensible names, like in the example below.
Inside the data flow, add a source and sink with the following settings. Allow schema drift can be enabled as needed for the specific use case.
For more details on schema drift, see Schema drift in mapping data flow.
Sampling provides a method of limiting the number of rows in the source, which is used primarily for testing and debugging purposes.
Since Delta Lake leverages Spark's distributed processing power, it is capable of partitioning the data appropriately; however, to demonstrate the ability to manually configure partitioning, I have configured 20 Hash partitions on the ID column. Click here to read more about the different partition type options.
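In Spark terms, the manual setting above is roughly equivalent to a hash repartition on the key column. The following is a small sketch for illustration only, since the data flow handles this from its own settings; df is the sample dataframe read in the earlier sketch.

# Distribute rows across 20 partitions by the hash of the id column,
# analogous to configuring 20 Hash partitions on ID in the data flow.
partitioned_df = df.repartition(20, "id")
print(partitioned_df.rdd.getNumPartitions())  # 20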
After adding the sink activity, make sure the sink type is set to Delta. For more information on the Delta connector in ADF, see Delta format in Azure Data Factory. Note that Delta is available as both a source and a sink in Mapping Data Flows. You will also be prompted to select the associated linked service when selecting Delta as the sink type.
On the Settings tab, make sure the Staging folder is selected and select Allow insert as the update method. Also, select Truncate table if the Delta table needs to be truncated before loading it.
For more information on the Vacuum command, see: Vacuum a Delta table (Delta Lake on Databricks). Essentially, Vacuum will delete files that are no longer referenced by the delta tables and that are older than the retention threshold in hours. The default is 30 days if the value is 0 or empty.
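For reference, the same cleanup can be run directly against the Delta table with the Delta Lake Python API. This is a sketch assuming the delta-spark package (or Databricks), the spark session from the earlier sketches, and the hypothetical staging path used above.

from delta.tables import DeltaTable

delta_path = "abfss://staging@mydatalake.dfs.core.windows.net/delta/users"  # hypothetical path
delta_table = DeltaTable.forPath(spark, delta_path)

# Delete files that are no longer referenced by the table and are older than
# 720 hours (30 days). Files newer than the retention threshold are kept.
delta_table.vacuum(720)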
Finally, on the Optimize tab, simply select Use current partitioning so that the partitioning from the source flows downstream to the sink.
As expected, when the pipeline is triggered and finishes running, we can see that 13 new columns have been created across 20 different partitions.
Looking at the ADLS Gen2 Staging folder, we can see that a delta_log folder has been created along with 20 snappy compressed parquet files.
Open the delta_log folder to see the two log files. For more information on understanding delta logs, please read: Diving into Delta Lake: Unpacking the transaction log.
After checking the new data in Staging Delta Lake, we can see that new records have been inserted.
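Outside of the ADF data preview, the inserted data and the growing commit history can also be verified with a couple of lines of PySpark; a sketch that reuses the spark session and hypothetical delta_path from the earlier sketches.

from delta.tables import DeltaTable

# Read the current state of the staging Delta table.
staged_df = spark.read.format("delta").load(delta_path)
staged_df.show(5)

# Every pipeline run shows up as a new version in the table history.
DeltaTable.forPath(spark, delta_path).history().select("version", "timestamp", "operation").show()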
Update Delta Lake
So far, we've covered Insert in Delta Lake. Next, let's take a look at how Data Factory can handle updates to our delta tables.
Similar to the inserts, create a new ADF pipeline with a Mapping Data Flow for the updates.
For this update demo, we will update the users' first and last names and convert them to lowercase. To do this, add a Derived Column and an Alter Row transformation to the update Mapping Data Flow canvas.
The source data is still our Staging Delta Lake, configured the same way as it was for the inserts. Note that the Delta source settings also expose time travel options. For more details on time travel, see: Introducing Delta Time Travel for large-scale data lakes.
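Time travel means earlier versions of the Delta table remain queryable. A brief sketch using the same hypothetical delta_path, assuming version 0 corresponds to the initial insert run:

# Read the table as it looked at an earlier version (version 0 = the initial insert).
v0_df = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)

# Or read it as of a specific point in time.
ts_df = spark.read.format("delta").option("timestampAsOf", "2020-07-15 00:00:00").load(delta_path)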
The Derived Column transformation converts the first and last names to lowercase using a lower() expression on each column. Mapping Data Flows is capable of handling much more complex transformations at this stage.
In the Alter Row transformation, specify an Update if condition of true() to update all rows that meet the criteria.
Check sink settings.
Make sure the sink is still pointing to the Staging Delta Lake data. Also select Allow update as the update method. To show that multiple key columns can be selected simultaneously, 3 columns have been selected.
After saving and activating the pipeline, we can see that the results reflect the first and last names that have been updated to lowercase.
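For comparison, the equivalent update expressed through the Delta Lake Python API looks like the following. This is only a sketch against the hypothetical staging path, not what ADF generates internally (the logs later in this article show that ADF issues a MERGE).

from delta.tables import DeltaTable
from pyspark.sql.functions import lower, col

delta_table = DeltaTable.forPath(spark, delta_path)

# Update every row (condition true), setting first_name and last_name to lowercase.
delta_table.update(
    condition="true",
    set={
        "first_name": lower(col("first_name")),
        "last_name": lower(col("last_name")),
    },
)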
Delete from Delta Lake
In summary, so far we have covered inserts and updates. Next, let's look at an example of how Mapping Data Flows handles deletions in Delta Lake.
Similar to the inserts and updates, create a new Data Factory pipeline and Mapping Data Flow for the deletes.
Configure the Delta source settings.
Since we are still working with the same Staging Delta Lake, this source will be configured the same way as it was for the inserts and updates.
For this example, let's remove all records where gender = Male. To do this, set the Alter Row condition to Delete if gender == 'Male'.
Finally, configure the Delta sink settings.
Select Staging Delta Lake for the sink, select 'Allow delete' and select the desired key columns.
After publishing and activating this pipeline, notice how all records where gender = Male have been removed.
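Again for comparison, the same delete expressed through the Delta Lake Python API is a one-liner; a sketch against the hypothetical staging path used earlier.

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, delta_path)

# Remove all rows where gender is 'Male', mirroring the Delete if condition above.
delta_table.delete("gender = 'Male'")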
Explore delta logs
Finally, let's go ahead and take a look at Delta logs to briefly understand how the logs were created and populated. The main commit information files are generated and stored in Insert, Update, and Delete JSON commit files. Additionally, CRC files are created. CRC is a popular technique for verifying data integrity because it has excellent error detection capabilities, uses few resources, and is easy to use.
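Because the transaction log is just a set of numbered JSON files under the _delta_log folder, it can also be inspected with PySpark. This sketch reuses the hypothetical delta_path from the earlier sketches.

# Each commit is a numbered JSON file, e.g. 00000000000000000000.json, 00000000000000000001.json, ...
log_df = spark.read.json(delta_path + "/_delta_log/*.json")

# The commitInfo records carry the operation (WRITE or MERGE), timestamp, and operation parameters.
log_df.select("commitInfo").where("commitInfo is not null").show(truncate=False)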
Insert
As we can see by opening the Insert JSON commit file, it contains the commit information for the insert operations.
{"commitInfo":{"timestamp":1594782281467,"operation":"WRITE","operationParameters":{"mode":"Add","partitionBy":"[]"},"isolationLevel":"WriteSerializable" "esBlindAppend":verdadero}}
Update
Similarly, when we open the Update JSON commit file, it contains commit information for update operations.
{"commitInfo":{"timestamp":1594782711552,"operation":"MERGE","operationParameters":{"predicate":"(((source.`id` = target.`id`) AND (source.` registration_dttm` = target.`registration_dttm`)) AND (source.`ip_address` = target.`ip_address`))","updatePredicate":"((NOT ((source.`ra2b434a305b34f2f96cd5b4b4b4149455e`0))) O (NO ( (fuente.`ra2b434a305b34f2f96cd5b4b4149455e` & 8) = 0)))","deletePredicate":"(NO ((fuente.`ra2b434a305b34f2f96cd5b49455e` & 8) = 0)))","deletePredicate":"(NO ((fuente .`ra2b434a305b34f2f96cd5b49455e` & 8) = 0))" nivel de aislamiento":"EscribirSerializable","isBlindAppend":falso}}
Delete
Finally, when we open the Delete JSON commit file, it contains commit information for delete operations.
"commitInfo":{"timestamp":1594783812366,"operation":"MERGE","operationParameters":{"predicate":"(((source.`id` = target.`id`) AND (source.`registration_dttm ` = target.`registration_dttm`)) AND (source.`ip_address` = target.`ip_address`))","updatePredicate":"((NOT ((source.`ra079d97a688347b581710234d2cc4b63` & 2) = 0)) O NO ((source.`ra079d97a688347b581710234d2cc4b63` & 8) = 0)))","deletePredicate":"(NOT ((source.`ra079d97a688347b581710234d2cc)","isolation": 41" ":"WriteSerializable","isBlindAppend"); :FALSO}}
Next Steps
- For more details related to Delta Lake, please read the following Databricks documentation.
- For more information on understanding Delta Lake logs, read Diving into Delta Lake: Unpacking the transaction log.
- For more information about the Delta connector in Azure Data Factory, see Delta format in Azure Data Factory.
- Explore the Delta Engine as an efficient way to process data in data lakes, including data stored in the open source Delta Lake.
About the Author
Ron L'Esteve is a trusted information technology thought leader and professional writer based in Illinois. He brings more than 20 years of IT experience and is known for his books and article publications on AI and data architecture, cloud engineering, and management. Ron completed his Master's in Business Administration and Finance from Loyola University Chicago.
See all my tips