This is the third post in a 3-part blog series on Power BI with Azure Databricks SQL, written by Andrey Mirskyy and Diego Fanesi. Read Part 1: Power your BI with Microsoft Power BI and Azure Databricks Lakehouse: Part 1 - Fundamentals and Part 2: Power your BI with Microsoft Power BI and Lakehouse on Azure Databricks: Part 2 - Tuning Power BI.
In the previous part of this series, we discussed some of the Power BI optimization techniques to get better performance from your reports and dashboards. In this part, we'll focus on tuning your Delta Lake and Azure Databricks SQL Warehouses for better performance.
We have previously discussed how to use Power BI on top of Databricks Lakehouse effectively. However, a well-designed and efficient Lakehouse is the foundation of overall performance and a good user experience. We will discuss recommendations for the physical design of Delta tables, data modeling, as well as recommendations for Databricks SQL Warehouse.
These tips and techniques have proven to be effective based on our field experience. We hope you find them relevant to your Lakehouse implementations as well.
Because your data model (tables and columns) is driven by business requirements and your data modeling approach (e.g., dimensional model, Data Vault), you don't have many options for tuning the model itself. However, you can achieve higher performance by optimizing the physical layout of your tables.
First of all, we recommend using the Delta format for your tables in Lakehouse. Delta offers performance optimizations such as data skipping, dynamic file pruning, and many others.
With our high-performance Photon engine, you can achieve much better performance compared to Parquet tables.
There is a common misconception that table partitioning helps performance. While this has been true for years or even decades in legacy on-premises data warehouses and even Parquet file-based cloud data lakes, this is not always the case with Delta tables. Delta Lake maintains table metadata, allowing fast query performance even without partitioning in most cases. Therefore, we do not recommend partitioning tables smaller than 1 TB.
When you use partitions for larger tables, you should keep the size of each partition to at least 10 GB. Partitions 1 TB in size are still acceptable. This is to ensure that Delta OPTIMIZE and Z-Ordering can still optimize your data layout: each partition should contain at least 10 active Parquet files.
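One way to check these thresholds before committing to a partitioning scheme is to look at the table's total size and at how rows are spread across a candidate partition key. The following is a minimal sketch, assuming a hypothetical Delta table named events with a timestamp column eventTime; row counts are only a proxy for partition size.

```sql
-- Total size and file count of the table (see the sizeInBytes and numFiles columns).
DESCRIBE DETAIL events;

-- Row distribution across a candidate year/month partition key.
-- The table and column names are hypothetical.
SELECT
  year(eventTime)  AS year,
  month(eventTime) AS month,
  count(*)         AS row_count
FROM events
GROUP BY year(eventTime), month(eventTime)
ORDER BY year, month;
```

Multiplying each partition's share of the rows by sizeInBytes gives a rough per-partition size to compare against the 10 GB guideline.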
Auto-generated columns
When choosing the correct partitioning column, you may need to generate one with a value derived from an existing column. An example would be an event transaction table that has a timestamp column, where you might only want to partition the table at the year level, or at the year and month level. In this case, you would create a new calculated column with the year and month and partition by it.
The problem in this case arises when users filter only on the timestamp and don't include an explicit filter on the partition column: they end up with a full table scan. You can mitigate this by using auto-generated columns for Delta. See the Delta documentation on generated columns for details.
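CREATE TABLE events (
  -- eventType and eventTime are referenced below; their types are assumed here
  eventType STRING,
  eventTime TIMESTAMP,
  year INT GENERATED ALWAYS AS (YEAR(eventTime)),
  month INT GENERATED ALWAYS AS (MONTH(eventTime)),
  day INT GENERATED ALWAYS AS (DAY(eventTime))
)
PARTITIONED BY (eventType, year, month, day)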
Note that the minimum partition size recommendation still applies to the example above. We recommend estimating partition sizes when deciding on a partition strategy.
By doing this, you don't need to add processing logic for the extra column in your code, and Spark will be able to derive the value of the partition column when only the timestamp column is used in filters.
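To illustrate, the query below filters only on the timestamp column. This is a sketch assuming the events table defined above with an eventTime timestamp column; the date range is arbitrary. Delta can derive filters on year, month, and day from such a predicate, and EXPLAIN is one way to check which partition filters appear in the plan.

```sql
-- No explicit filter on year, month, or day: only the timestamp is used.
EXPLAIN
SELECT eventType, count(*) AS event_count
FROM events
WHERE eventTime >= '2023-05-01' AND eventTime < '2023-06-01'
GROUP BY eventType;
```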
By default, the Delta engine automatically sets the file size based on the size of the table. Delta uses the query history and the table size to work out the best file size for your use case. But if you have just created new tables for a proof of concept, Delta won't have enough data to optimize the file size right away. In that case, consider adjusting the file size manually.
There are some considerations to make here. First, even if Delta has no query history to consider, it can still see the size of the table. So you can use a simple property to tell Delta what kind of workload you are running on the table: delta.tuneFileSizesForRewrites. When set to true, it tells Delta to optimize for frequent updates and deletes by choosing smaller file sizes.
ALTER TABLE mytable SET TBLPROPERTIES (delta.tuneFileSizesForRewrites = true);
Alternatively, you can manually set the file size to a specific value. Below is an example of setting the target file size manually to 32 MB.
ALTER TABLE mytable SET TBLPROPERTIES (delta.targetFileSize = 33554432);
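The value above is expressed in bytes (32 × 1024 × 1024 = 33554432). As a sketch, the same setting can also be written with a unit suffix, and the resulting configuration can be checked afterwards; mytable is a placeholder table name.

```sql
-- Same 32 MB target, written with a unit suffix instead of raw bytes.
ALTER TABLE mytable SET TBLPROPERTIES ('delta.targetFileSize' = '32mb');

-- List the table properties currently set on the table.
SHOW TBLPROPERTIES mytable;

-- The average file size can be derived from sizeInBytes / numFiles.
DESCRIBE DETAIL mytable;
```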
Z-Ordering is another optimization technique similar to database indexes, but without creating additional files or data structures to parse and process. Instead, the data in the files is arranged to colocate similar data, which boosts the data skipping algorithm for faster query performance at run time. Below is an example of how to apply Z-Ordering to a table.
OPTIMIZE mytable ZORDER BY (joinkey1, predicate2);
Tables should be Z-ordered using the columns that are most commonly used in WHERE or JOIN predicates. However, it is not recommended to use more than 5 columns for Z-Ordering. Data types can also affect join performance: joins on string keys are definitely less efficient than joins on integers, even when Z-Ordering is applied.
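For a large partitioned table, Z-Ordering can be limited to recently loaded partitions rather than run on the whole table. The sketch below assumes a hypothetical table sales partitioned by sale_date, with customer_id and product_id as frequently used integer join keys; the names and the date are illustrative only.

```sql
-- Z-order only recent partitions; the WHERE clause may reference partition columns only.
OPTIMIZE sales
WHERE sale_date >= '2023-05-01'
ZORDER BY (customer_id, product_id);

-- Inspect the most recent operations on the table, including the OPTIMIZE run and its metrics.
DESCRIBE HISTORY sales LIMIT 5;
```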
Adaptive Query Execution (AQE) uses table statistics to select the correct join type and other query optimizations. Therefore, it is important to have up-to-date table statistics. This can be achieved by running ANALYZE TABLE.
ANALYZE TABLE mytable COMPUTE STATISTICS FOR ALL COLUMNS;
However, Delta will only calculate statistics for the first 32 columns of a table. This means that the order of the columns in a table matters. Make sure that all columns used in WHERE clauses or joins are among the first 32 columns. Also remember that Z-Ordering must be applied to columns that are among the first 32 columns of the table.
You can use a table property to extend statistics collection beyond 32 columns. However, this property should never be set to hundreds of columns, as this would make the Delta metadata significantly larger and slower to process, affecting all queries against the table. Below is an example of how you can set it to 40 columns.
ALTER TABLE mytable SET TBLPROPERTIES (delta.dataSkippingNumIndexedCols = 40);
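If a frequently filtered column sits beyond the first 32 columns, an alternative to raising the limit is to move that column forward in the schema. The following is a sketch; mytable and predicate2 are placeholder names, and the exact output of the DESCRIBE command depends on the runtime version.

```sql
-- Move a frequently filtered column to the first position in the table schema.
ALTER TABLE mytable ALTER COLUMN predicate2 FIRST;

-- After running ANALYZE TABLE, inspect the column-level statistics for that column.
DESCRIBE EXTENDED mytable predicate2;
```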
Power BI requires a date dimension table for date and time intelligence functions. Although Power BI offers multiple options for generating date tables, we recommend creating a persistent date table in Delta Lake. This approach allows Power BI to generate concise queries that are executed more efficiently by Azure Databricks SQL.
create table default.date_dim as
with dates as (
  select explode(sequence(to_date('2010-01-01'), current_date(), interval 1 day)) as date
)
select
  cast(date_format(date, 'yyyyMMdd') as int) as date_key,
  cast(date_format(date, 'yyyy') as int) as year,
  cast(date_format(date, 'yyyyMM') as int) as year_month
from dates
For the time dimension, we recommend using a separate Delta table with the necessary time granularity, e.g. hours only. Having separate date and time dimensions provides better data compression, query performance, and more flexibility for end users.
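A time dimension at hourly granularity can be generated in much the same way as the date dimension. This is a minimal sketch; the table name default.time_dim and its columns are assumptions made for illustration.

```sql
-- One row per hour of the day (0-23), generated with the range() table function.
CREATE OR REPLACE TABLE default.time_dim AS
SELECT
  CAST(id AS INT)                            AS hour_key,
  lpad(CAST(id AS STRING), 2, '0') || ':00'  AS hour_label,  -- zero-padded label such as 09:00
  CASE WHEN id < 12 THEN 'AM' ELSE 'PM' END  AS am_pm
FROM range(0, 24);
```

A fact table would then carry an integer hour_key alongside its date_key, keeping both dimensions small.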
A common performance optimization technique is to pre-aggregate data and persist the results for faster query performance in BI tools like Power BI. In Azure Databricks, there are several options you can use to create aggregate tables.
First, you can use the familiar CREATE TABLE AS SELECT statement in your data preparation pipelines. Such tables naturally belong to the gold layer in your Lakehouse.
CREATE OR REPLACE TABLE q13_aggregated AS ...
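The statement above is shown without its SELECT body. A complete aggregate table of this kind could look like the following sketch, in which the gold and silver schemas, fact_sales, and all column names are hypothetical.

```sql
-- Pre-aggregate a large fact table to one row per month and product.
CREATE OR REPLACE TABLE gold.sales_by_month AS
SELECT
  d.year_month,
  f.product_id,
  sum(f.amount) AS total_amount,
  count(*)      AS order_count
FROM silver.fact_sales AS f
JOIN default.date_dim AS d
  ON f.date_key = d.date_key
GROUP BY d.year_month, f.product_id;
```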
You can also take advantage of DLT (Delta Live Tables) to create and maintain aggregate tables. DLT provides a declarative framework for building reliable, maintainable, and testable data processing pipelines. With DLT, your materialized aggregate tables can be maintained automatically.
CREATE LIVE TABLE q13 AS ...
Aggregate tables are especially beneficial for large data sets and complex calculations. Organizations can optimize query execution and reduce processing times, resulting in faster data retrieval, more efficient reporting, and therefore a better end-user experience.
Primary and foreign keys
To improve the quality of Power BI models and developer productivity, we recommend defining primary and foreign keys on your tables in Lakehouse. Although primary and foreign keys are informational only (not enforced) in Azure Databricks SQL, Power BI can take advantage of this information to automatically create table relationships in models.
This feature is expected to be available to customers in the May 2023 Power BI update. Primary and foreign keys can be created by applying constraints to Delta tables.
Please note that with the Assume Referential Integrity option enabled in table relationships, Power BI uses INNER JOIN in SQL queries, which can lead to better query performance in Azure Databricks SQL. Therefore, correctly configuring table relationships in Power BI can improve report performance.
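As an illustration, informational constraints can be declared with ALTER TABLE. This is a sketch assuming Unity Catalog tables named dim_customer and fact_sales that share a customer_id column; all table, column, and constraint names are hypothetical.

```sql
-- Primary key columns must be declared NOT NULL first.
ALTER TABLE dim_customer ALTER COLUMN customer_id SET NOT NULL;

-- Informational primary key on the dimension table.
ALTER TABLE dim_customer
  ADD CONSTRAINT pk_dim_customer PRIMARY KEY (customer_id);

-- Informational foreign key from the fact table to the dimension table.
ALTER TABLE fact_sales
  ADD CONSTRAINT fk_sales_customer
  FOREIGN KEY (customer_id) REFERENCES dim_customer (customer_id);
```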
Pushdown calculations to Databricks SQL
As discussed in the previous part of this blog series, pushing calculations down to Azure Databricks SQL can sometimes improve overall performance by minimizing the number of SQL queries and simplifying the calculations on the BI tool side. This is especially true in the case of Power BI.
In simple cases, calculations can be included in a view. In more complex cases, we recommend persisting the data in tables in the gold layer or taking advantage of materialized views.
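For the simple case, a calculation can be wrapped in a view so that Power BI queries a ready-made column instead of computing it on the BI side. A minimal sketch with hypothetical schema, table, and column names:

```sql
-- Expose the net amount as a precomputed column for reports.
CREATE OR REPLACE VIEW gold.v_sales AS
SELECT
  order_id,
  customer_id,
  quantity * unit_price * (1 - discount) AS net_amount
FROM silver.fact_sales;
```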
Azure Databricks SQL Warehouse
To render reports, Power BI generates SQL queries and sends them to Azure Databricks SQL Warehouse through an ODBC connection. Therefore, it is important to know the options available to achieve great performance and, in turn, a good user experience.
Azure Databricks SQL uses disk cache to speed up data reads by copying data files to the local storage of the nodes. This happens automatically when executing SQL queries. However, to improve the performance of future queries, the cache can also be populated explicitly by running the CACHE SELECT statement. It is worth mentioning that Azure Databricks automatically detects changes to the underlying data, so there is no need to refresh the cache after data is loaded.
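As a sketch, the statement below warms the disk cache for the columns a report reads most often. mytable, joinkey1, and predicate2 are placeholder names, and the filter value is arbitrary.

```sql
-- Preload selected columns of a subset of rows into the disk cache.
CACHE SELECT joinkey1, predicate2
FROM mytable
WHERE predicate2 >= '2023-01-01';
```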
Serverless Query Results Cache
In addition to disk cache, Azure Databricks SQL has a query result cache, which stores the results of SELECT queries and allows faster result retrieval on subsequent executions. In the Databricks SQL Serverless SKU, this feature provides even better capabilities. Query result caching is available across all Azure Databricks SQL warehouses and the clusters within those warehouses. This means that a result cached in one cluster is available in all clusters and even in other SQL warehouses. This dramatically improves performance and user experience for highly concurrent BI reports.
Intelligent workload management
Power BI can generate multiple SQL queries per report, at least 1 SQL query per visual. While some queries are quite complex and process data from large fact tables, others can be trivial, selecting data from smaller fact or dimension tables.
To accommodate this mix of queries, Azure Databricks SQL uses a dual queuing system that prioritizes small queries over large ones. In other words, small queries are not blocked by large ones. Therefore, the overall query performance and BI reporting performance is better.
Concurrency and sizing
As mentioned above, Power BI can generate multiple SQL queries per single report. Therefore, even for a small number of reports, you may see tens or even hundreds of concurrent queries arriving at Azure Databricks SQL Warehouse. To achieve good performance for all users, your SQL Warehouse must be configured with the appropriate size and scaling.
The general guide is:
- Use multiple clusters to handle multiple users/concurrent queries.
- Use a larger cluster size for larger data sets.
Select the correct Azure Databricks SQL SKU
Last but not least, Azure Databricks SQL is available in 3 SKUs: Classic, Pro, and Serverless. Choosing the right SKU is important when planning your solution for future workloads.
- Serverless, generally available as of May 3, 2023, includes all the additional features of the Pro SKU and is generally the most efficient for BI use cases, regardless of the tool used to query the Lakehouse. With compute capacity fully managed by Azure Databricks, customers get their SQL Warehouse clusters instantly without waiting for cluster provisioning. Features such as intelligent workload management and serverless query result caching enable great performance in highly concurrent BI workloads for 100 or even 1000 users.
- Pro provides several additional features on top of the Classic SKU that directly impact performance. With Predictive I/O, Photon I/O, Materialized Views, and Python UDFs, you can achieve better reporting performance when querying data directly from the Lakehouse without in-memory caching in the BI tool. The result is faster and better business decisions.
- Classic provides all the standard SQL features, including full ANSI SQL support, the Photon engine, and Unity Catalog integration. It can be a good choice for simple scenarios, such as scheduled refresh of Power BI datasets, where you don't need maximum performance and cluster startup time is not an issue.
In conclusion, optimizing the performance of your Power BI dashboards on top of your Databricks Lakehouse requires a combination of Power BI-specific and Databricks-specific tuning techniques. We've discussed a number of optimization techniques that can help you improve the performance of your dashboards, including logical table partitioning, Cloud Fetch, support for Azure Databricks SQL Native Query, and pushing complex formulas down to Azure Databricks SQL. It's important to note that while these techniques can be effective, basic performance best practices still apply, such as data filtering, aggregation, and reducing the number of visuals on a page.
To get the best performance from your Lakehouse, it's critical to find the right balance between the complexity of your queries, the size of your data, and the complexity of your dashboard. By implementing a combination of techniques and following best practices, you can ensure that your Power BI dashboards on top of your Databricks Lakehouse deliver insights and value quickly and efficiently.
Finally, Azure Databricks SQL Pro and Serverless SKUs are currently under an extended promotional offer, resulting in potential cost savings of up to 30% depending on the specific Azure region. So it's a great opportunity to try the latest and greatest Databricks SQL features at a discounted price and discover the full potential of your data.
And if you missed it, read Part 1: Power your BI with Microsoft Power BI and Azure Databricks Lakehouse: Part 1 - Fundamentals and Part 2: Power your BI with Microsoft Power BI and Lakehouse on Azure Databricks: Part 2 - Tuning Power BI.