Aggregations became generally available in Power BI in July 2019. Aggregations are a modeling optimization technique that can improve the user experience when dealing with large datasets. Combined with Power BI composite models, they make it possible to analyze big datasets efficiently by allowing high-level analytical queries to be answered from memory while more detailed queries are sent back to the source database. The trillion-row demo by Christian Wade is one example of this scenario, showing how aggregations can make it possible to analyze big datasets in Power BI. Even though that demo is based on a trillion rows (a quarter petabyte) of data, any dataset that is expensive to cache in memory because of space, CPU, or time can benefit from aggregations. Additionally, even Power BI reports built on smaller datasets can sometimes load faster if they can use aggregations.
The concept of aggregations
Aggregations are not a new concept in the world of data. In fact, you may have used them in the past without knowing it. Essentially, an aggregation is a summary object created from a detail object, and it can answer some summary reports in place of the detail object.
Summary tables in SQL Server
For example, consider the following scenario. Assume you have a Sales table containing sales amounts for 5,000 products sold at 2,000 stores over 10 years, and that every product is sold at every store on every day. This results in a table with 36.5 billion rows. If you were to write a SQL query against this data that returned Sales for each Store for each year, the query would take a long time to run. Instead, you could create a smaller summary (aggregation) table that already stores the data at a higher level than the original table and query that instead. For example, assuming there are 200 Product Groups, a summary table by Product Group, Year, and Store would have 4 million rows, which is a tiny fraction of the main table. Querying this table is much faster than querying the main table.
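In SQL, the summary table for this scenario could be built with a simple GROUP BY. This is only a sketch; all table and column names here (Sales, Product, ProductGroup, StoreKey, SalesAmount) are hypothetical:

```sql
-- Sketch only: object and column names are hypothetical.
-- Build the summary (aggregation) table once from the detail table.
SELECT p.ProductGroup,
       YEAR(s.SaleDate)   AS SaleYear,
       s.StoreKey,
       SUM(s.SalesAmount) AS SalesAmount
INTO   dbo.Sales_Summary_By_Group_Year_Store
FROM   dbo.Sales   s
JOIN   dbo.Product p ON p.ProductKey = s.ProductKey
GROUP BY p.ProductGroup, YEAR(s.SaleDate), s.StoreKey;

-- A Sales-by-Store-by-Year report can now read the small summary table
-- instead of scanning billions of detail rows:
SELECT StoreKey, SaleYear, SUM(SalesAmount) AS SalesAmount
FROM   dbo.Sales_Summary_By_Group_Year_Store
GROUP BY StoreKey, SaleYear;
```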
Even though this design improves query performance, it makes report development more complex. The report developer must decide which table to use for each report: high-level queries can use the summary table, whereas more detailed queries must go back to the detail table. The developer must be aware of both tables and consciously use one vs. the other for different parts of the report.
In this design, the aggregation table needs to be maintained as part of the overall database maintenance. When the main table is updated, the aggregation table needs to be updated as well.
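A simple (though not the only) refresh pattern is to rebuild the summary table whenever the detail table is reloaded. Again, the object names in this sketch are hypothetical:

```sql
-- Sketch only: rebuild the hypothetical summary table after the detail table changes.
TRUNCATE TABLE dbo.Sales_Summary_By_Group_Year_Store;

INSERT INTO dbo.Sales_Summary_By_Group_Year_Store (ProductGroup, SaleYear, StoreKey, SalesAmount)
SELECT p.ProductGroup,
       YEAR(s.SaleDate),
       s.StoreKey,
       SUM(s.SalesAmount)
FROM   dbo.Sales   s
JOIN   dbo.Product p ON p.ProductKey = s.ProductKey
GROUP BY p.ProductGroup, YEAR(s.SaleDate), s.StoreKey;
```

For very large tables, an incremental approach (for example, rebuilding only the current year) keeps this maintenance cost down.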
Aggregations in SQL Server Analysis Services multidimensional (SSAS MD)
In SSAS MD, the predecessor to SSAS Tabular, aggregations are explicitly defined as actual objects that become part of the database. When designed properly, aggregations could significantly improve end user query performance; for example, in larger cubes you could see the difference between a 1 minute and a 5 second query response time. Even though SSAS MD's storage engine is disk-based (vs. memory-based in Tabular), the aggregation files on disk are a fraction of the size of the measure group (fact table) files, so they can be loaded into memory and searched faster.
Like the summary tables in the previous SQL Server example, these aggregations have a cost in terms of disk space and the time it takes to process them to keep them updated. The more aggregations you create, the more queries they cover, but they take more space and need more time to process. There is also a point where adding more aggregations has a negative impact on performance, much like having too many indexes on a SQL table. I had a case where removing one aggregation that was automatically created by the wizard improved performance, because queries could then hit another manually designed aggregation.
Ultimately, the SSAS developer must come up with the right balance of aggregations and make sure that they are actually being used. In SSAS, this is an advanced task that many developers did not want to get into. If you used the wizard, you had to have all attribute counts specified correctly, the correct relationships in place, and so on, and even then you might not have gotten what you wanted. There were many rules in place: even if you requested an aggregation using hints in the wizard, the engine could refuse to create it. Usually there was a good reason for this decision, as the engine weighed the cost of using the potential aggregation against going directly to the fact objects; if that was not the case, you could create the aggregation manually. For complex cubes (lots of dimensions), it was sometimes better to skip the wizard entirely and manually create a custom set of aggregations that covered a certain set of queries.
Once aggregations are created, report developers or end users do not need to know about their existence. The engine automatically uses them to answer queries when possible.
At the time of writing this post, SSAS Tabular does not offer aggregations; however, since SSAS Tabular is the engine behind Power BI, it would make sense to expect aggregations to make their way into SSAS Tabular and Azure Analysis Services at some point in the future.
Aggregations in Power BI
In Power BI, aggregations start as tables, just like any other table in a model. They can be based on a view or table in the source database, or created in Power BI with Power Query. They can be in Import or Direct Query storage mode.
Once in the model, these tables can be configured so that the engine can use them instead of a detail table to answer queries when possible. The process of creating and configuring aggregations in Power BI is significantly easier than the process of creating aggregations in SSAS multidimensional.
Once an aggregation table is configured, it becomes hidden from end users. Report developers and end users don’t know that it exists and don’t need to change anything in how they query the dataset.
At this time, aggregation tables can only be used with detail tables that are in Direct Query storage mode. It is not possible to configure aggregation tables to work with detail tables that are in Import storage mode.
Like SQL Server Summary tables and aggregation objects in SSAS multidimensional, aggregations in Power BI have a maintenance cost. Once they are configured, it is important to make sure they are actually being used.
Comparing Aggregations in different products
The following table compares aggregations as described in previous examples.
| | Power BI | SSAS MD | SQL Server |
| --- | --- | --- | --- |
| Summary object explicitly called an Aggregation | Yes | Yes | No |
| Reports must be aware of aggregations to use them | No | No | Yes |
| Modeler has full control over their design | Yes | No if designed by using the wizard; Yes if designed manually | Yes |
| Modeler can control the order in which aggregations are considered by the engine to answer queries | Yes | No | N/A |
| Modeler can create a set of aggregations in one batch to cover a set of reports | No | Yes | N/A |
| Modeler can create a set of aggregations based on a set of input queries to be covered (Usage-Based Optimization) | No | Yes | N/A |
Setting up aggregations in Power BI
For the rest of this blog post, I use the AdventureWorks sample database to walk through a couple of examples that show how to configure aggregation tables in Power BI, and I use DAX Studio to see how they are being used.
To do this, I created a couple of views on top of the fact/dimension tables to simplify the model and take out the clutter. I also created two summary tables in the SQL database to be used as aggregation tables. You can download this version of AdventureWorks, the queries I use, and the pbix file from here.
This is how the model looks:
The Internet Sales table has 60,398 rows, the Sales_Agg1_By_Product_Year has 288 rows, and the Sales_Agg2_By_Year has 5 rows.
Here is the SQL code I used to create the two aggregation tables:
CREATE TABLE [dbo].[Sales_Agg1_By_Product_Year](
    [ProductKey] [int] NOT NULL,
    [Calendar Year] [smallint] NOT NULL,
    [Sales Amount] [money] NULL,
    [Unit Price Sum] [money] NULL,
    [Unit Price Count] [int] NULL,
    [InternetSalesRowCount] [int] NULL
) ON [PRIMARY]

INSERT INTO dbo.Sales_Agg1_By_Product_Year
SELECT fs.ProductKey,
       dd.[Calendar Year],
       SUM([Sales Amount]) AS [Sales Amount],
       SUM([Unit Price]) AS [Unit Price Sum],
       COUNT([Unit Price]) AS [Unit Price Count],
       COUNT(*) AS InternetSalesRowCount
FROM [dbo].[vFactInternetSales] fs
JOIN [dbo].[vDimDate] dd ON fs.DateKey = dd.DateKey
GROUP BY fs.ProductKey, dd.[Calendar Year]
CREATE TABLE [dbo].[Sales_Agg2_By_Year](
    [Calendar Year] [smallint] NOT NULL,
    [Sales Amount] [money] NULL
) ON [PRIMARY]

INSERT INTO dbo.Sales_Agg2_By_Year
SELECT dd.[Calendar Year],
       SUM([Sales Amount]) AS [Sales Amount]
FROM [dbo].[vFactInternetSales] fs
JOIN [dbo].[vDimDate] dd ON fs.DateKey = dd.DateKey
GROUP BY dd.[Calendar Year]
For the rest of this blog post, I will start with all the tables in Direct Query storage mode and configure the two aggregation tables. Then I will use DAX Studio to see whether the queries I plan for the aggregations to cover get answered by them. Then I will move the aggregation tables into memory by changing their storage mode to Import. You can read more about storage modes and composite models here. In short, Direct Query storage mode keeps the actual table data in the data source, while Import storage mode loads the table data into Power BI's in-memory engine, called VertiPaq. Finally, there is the Dual storage mode, where a table can behave either as a Direct Query table or as an Import table depending on the query. In Dual mode, a copy of the table data is loaded into VertiPaq, but it is not always used.
Example 1: No Aggregations defined
To establish a baseline, let’s first take a look at what happens when a query runs against the model that is entirely in Direct Query and no Aggregation tables have been configured yet.
Here is a sample query that returns Sales Amount for each ProductKey.
EVALUATE
SUMMARIZECOLUMNS(
    'Product'[ProductKey],
    "Sales Amount", SUM('Internet Sales'[Sales Amount])
)
If you run this query in DAX Studio (make sure the Server Timings option is selected in the top ribbon), you will see the following:
This query generates a SQL event, which sends a SQL query back to SQL Server, because this model is in Direct Query mode. If you look at the details of the query, you can see that it joins the views that the Internet Sales and Product tables are based on.
Example 2: Detail and Aggregation tables in Direct Query mode.
Even though the ultimate use case for an aggregation table is when it is in Import mode, as you will see, queries can benefit from an aggregation table even while it is in Direct Query mode itself.
You can configure an aggregation table by right-clicking on any table in the model and choosing Manage Aggregations. In the dialog that opens, you will see one row for each column of the aggregation table. Not all rows need to be filled out for an aggregation table to work.
Example of an Aggregation based on relationships
To cover the query mentioned before, the only row that needs to be filled out is Sales Amount. Here you are telling Power BI Desktop that if a query comes in asking for the Sales Amount column from the detail table Internet Sales, it can use the sum of the Sales Amount column from the aggregation table Sales_Agg1_By_Product_Year.
Note: The reason this is all that is needed to cover the previous query is that the aggregation table is connected via a relationship to the Product table. In the absence of relationships, you have to use a different method that I will explain later in another example.
Now if you run the same query in DAX Studio, you will see that it runs in 27 ms vs. 49 ms. Even in such a small dataset, being able to use a smaller table to answer the query made a difference in total run time. In this example, the detail table has 60,398 rows and the aggregation table has 288 rows. You can imagine the difference in run time you would see if the detail table had millions of rows and the aggregation table had thousands of rows.
If I run the same query as before, this time a RewriteAttempt event is generated before the SQL event. In this case, the Rewrite is successful.
In the details of the Rewrite Attempt you can see how Power BI was able to make the match.
If you look at the SQL event now, you will see that this time it used Sales_Agg1_By_Product_Year instead of the detail table. You can also see how it is able to join the aggregation table with vDimProduct based on ProductKey.
At this point, with this simple configuration, the aggregation table is able to answer any query asking for Sales Amount by any column of the Product, Product Subcategory, and Product Category tables. This is possible because of the existing relationship between the aggregation table and Product.
Example of an Aggregation Miss
If you are wondering what an aggregation miss looks like, run the following query; it will result in a miss since Order Quantity is not included in the aggregation table. Note that I did this only to show what a miss looks like. In real life, you would typically include all quantity columns from the detail table in the aggregation table.
EVALUATE
SUMMARIZECOLUMNS(
    'Product'[ProductKey],
    "Order Quantity", SUM('Internet Sales'[Order Quantity])
)
If you run this query in DAX Studio, you will see the following:
This time the RewriteAttempt fails, as expected. If you look in the Details, you will see the following information, which shows that the Power BI engine could not find any information about Order Quantity in the aggregation table:
Obviously, the SQL event reverts back to using the detail table in this case.
Example of covering an Average calculation with an aggregation
If you look at the dropdown choices for the aggregation summarization options, there is no option for Average. The choices are Sum, Min, Max, Count, and Count table rows. (Ignore GroupBy for now.)
So how would you cover the following query?
EVALUATE
SUMMARIZECOLUMNS(
    'Product'[Product Name],
    "Avg Unit Price", AVERAGE('Internet Sales'[Unit Price])
)
Thankfully, Power BI is able to use the basic summarization functions to cover many more queries than you might think. If it can break a calculation down into these basic calculations, then it can use an aggregation to answer more complex calculations. In the case of Average, it is able to break it down into Sum/Count. As long as you provide a Sum and a Count entry in the aggregation table, it is able to use them.
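You can see this equivalence directly in SQL against the aggregation table created earlier. Both queries below should return the same average unit price per product (up to rounding); the second reconstructs the average as Sum/Count, which is effectively what the Power BI engine does with the Sum and Count summarizations:

```sql
-- Average computed directly over the detail rows:
SELECT fs.ProductKey,
       AVG(fs.[Unit Price]) AS AvgUnitPrice
FROM dbo.vFactInternetSales fs
GROUP BY fs.ProductKey;

-- The same average reconstructed from the aggregation table as Sum / Count:
SELECT ProductKey,
       SUM([Unit Price Sum]) / SUM([Unit Price Count]) AS AvgUnitPrice
FROM dbo.Sales_Agg1_By_Product_Year
GROUP BY ProductKey;
```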
Consider the following two queries:
EVALUATE
SUMMARIZECOLUMNS(
    'Product'[Product Name],
    "Unit Price Sum", SUM('Internet Sales'[Unit Price])
)
EVALUATE
SUMMARIZECOLUMNS(
    'Product'[Product Name],
    "Unit Price Count", COUNT('Internet Sales'[Unit Price])
)
The aggregation table can easily be configured to cover each of the above queries. All you have to do is fill out the rows for Unit Price Sum and Unit Price Count in the aggregation properties as shown below:
Once this is done, both of the queries are able to benefit from the aggregation as shown in the following pictures:
Now if you run the Average query, you will see how Power BI is able to break it down into Sum/Count.
If you look at the Details, you will see the following:
Finally, note that you cannot use "Count table rows" in place of defining a Count summarization for Unit Price Count. Even though the two are semantically the same in this example (there are no blank Unit Price values, so counting rows and counting the column give the same result), Power BI looks for a Count summarization on the column to calculate Average.
Example of an aggregation not based on relationships
So far, the queries we have looked at have used an attribute of the Product table. Now consider this query:
EVALUATE
SUMMARIZECOLUMNS(
    'Date'[Calendar Year],
    "Sales Amount", SUM('Internet Sales'[Sales Amount])
)
If you run this query at this point, it will not use Sales_Agg1_By_Product_Year, since this aggregation table does not have a relationship with the Date table. The aggregation table is at the Year level whereas the Date table is at the Day level, so it is not possible to connect the two. This is the design situation you would run into with some big data systems.
To define aggregations that are not based on relationships, you must use the GroupBy summarization. Note that the Detail Table here is set to the Date table rather than Internet Sales. The GroupBy summarization has another use case later that will use Internet Sales as the Detail Table.
If you run the previous query after you configure the GroupBy for Calendar Year, you will see the following:
Example of covering a Distinct Count calculation with an aggregation
A Distinct Count summarization is not available as a built-in summarization; however, you can use the GroupBy summarization to implement it. Remember that GROUP BY in SQL produces the same set of values as DISTINCT.
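The SQL below illustrates the idea against the sample tables: because Sales_Agg1_By_Product_Year was built by grouping on ProductKey and Calendar Year, counting its distinct ProductKey values gives the same answer as counting them on the much larger detail view:

```sql
-- Distinct products sold, computed on the detail view:
SELECT COUNT(DISTINCT ProductKey) AS DistinctProductsSold
FROM dbo.vFactInternetSales;

-- The same number computed on the much smaller aggregation table;
-- the GROUP BY used to build the table preserved the distinct ProductKey values:
SELECT COUNT(DISTINCT ProductKey) AS DistinctProductsSold
FROM dbo.Sales_Agg1_By_Product_Year;
```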
Consider the following query:
EVALUATE
SUMMARIZECOLUMNS(
    'Product'[Color],
    "Distinct Products Sold", DISTINCTCOUNT('Internet Sales'[ProductKey])
)
To cover this query, you must add a GroupBy summarization for ProductKey to Sales_Agg1_By_Product_Year:
Note how the GroupBy for Calendar Year has the Date table as its Detail Table, whereas the GroupBy for ProductKey has Internet Sales as its Detail Table. This is because the GroupBy serves a different purpose in each scenario. In the latter case, its purpose is to provide a distinct count of ProductKey.
If you run the previous query in DAX Studio, you will get a match. The following shows the data request:
If you look at the SQL query submitted, you will see that it uses the ProductKey column from Sales_Agg1_By_Product_Year.
Performance consideration: this method can be useful when there are billions of rows in the detail table but no more than two to five million distinct values in the column to be distinctly counted. Beyond two to five million distinct values, query performance suffers, meaning the aggregation will not help the query run faster.
Example 3: Adding a second Aggregation table, all tables still in Direct Query mode.
You can have more than one aggregation table in a model based on the same detail table. For example, consider the second aggregation table, Sales_Agg2_By_Year. This table has only 5 rows compared to the 288 rows of Sales_Agg1_By_Product_Year. For a query asking for Sales Amount by Year, it would be much faster if Power BI could use this smaller table. Keep in mind that this is just an example; in real life you will not see a tangible difference between using a table with 5 rows and a table with 288 rows. You would see a difference if Sales_Agg1_By_Product_Year had 288,000 rows.
Before configuring Sales_Agg2_By_Year, if you run the following query, you will see that it is answered by Sales_Agg1_By_Product_Year:
EVALUATE
SUMMARIZECOLUMNS(
    'Date'[Calendar Year],
    "Sales Amount", SUM('Internet Sales'[Sales Amount])
)
Now let’s configure Sales_Agg2_By_Year. This aggregation table does not have a relationship with the Date table in the model, so the Calendar Year column must be configured using the GroupBy summarization as shown below:
Now if you run the same query as before, you will see that it will be answered from Sales_Agg2_By_Year.
You can control the order in which aggregation tables are considered by Power BI to answer queries with a Precedence property on each aggregation table.
For example, let’s say that for some reason you wanted the previous query to be answered from Sales_Agg1_By_Product_Year. In this scenario that does not make sense, since it is the larger table, but I am only using it as an example.
To control the order in which aggregation tables are considered, you can set the Precedence property on Sales_Agg1_By_Product_Year. The higher the number, the earlier the table is considered by Power BI.
Now if you run the previous query, it will be answered from Sales_Agg1_By_Product_Year.
Example 4: Detail table in Direct Query mode and Aggregation table in Import mode.
So far, all tables have been in Direct Query mode. In most cases, you get the most benefit from aggregation tables when you can load them into memory.
You can change the storage mode of a table by right-clicking on the table in the Model view and opening its properties. You will see Storage mode under Advanced.
You can change the storage mode from Direct Query to Import. Note that this is not a reversible change.
Example of an Aggregation Hit
Let’s change the storage mode of both aggregation tables to Import. Power BI gives a set of warnings when you do this: it warns you that this is not a reversible change and that it may take some time for the data to load. It also suggests that you change the storage mode of the tables connected to Sales_Agg1_By_Product_Year to Dual. If they are not set to Dual, the relationships between them and the aggregation table are considered weak, and the aggregation table will not be used to answer queries by the set of Product tables.
For an aggregation to be considered to answer a query based on relationships, the aggregation table and the related tables involved in the query must have the same storage mode, or one side must be set to Dual.
The following shows all the combinations in which a relationship is considered Strong and an aggregation table is considered:
| Many side (aggregation table) | One side (dimension table) |
| --- | --- |
| Import | Dual or Import |
| Direct Query * | Dual or Direct Query * |

\* Both sides in Direct Query have to be from the same data source.
Finally note that Many-to-many relationships are always considered weak regardless of the storage mode of tables.
Once you change the storage mode of Sales_Agg1_By_Product_Year to Import, run the following query:
EVALUATE
SUMMARIZECOLUMNS(
    'Product'[Color],
    "Sales Amount", SUM('Internet Sales'[Sales Amount])
)
You will see that this time the RewriteAttempt succeeds, similar to before, as expected. Also note that this query ran in just 5 ms.
This event is followed by a Scan event instead of a SQL event, because the aggregation table is in Import mode. If you look at the Details, you will see the following code, which is in the internal language Power BI uses to query its in-memory data.
Example of an Aggregation Miss
To see what an Aggregation Miss looks like, run the following query:
EVALUATE
SUMMARIZECOLUMNS(
    'Product'[Color],
    "Order Qty", SUM('Internet Sales'[Order Quantity])
)
Order Quantity does not exist in the Aggregation tables so this results in a RewriteAttempt failure.
Note how this query took 101 ms versus 5 ms for the previous query that was answered from the aggregation table in memory.
The RewriteAttempt is followed by a SQL event.
Aggregations are a great optimization technique that can significantly improve the end user experience, even for datasets that are far smaller than a trillion rows.
Currently there are some limitations. The main one is that aggregations can only be used on top of detail tables that are in Direct Query mode. Hopefully this limitation will be lifted in the future.
Aggregations are not useful for all calculations. For example, they cannot answer queries with calculations that iterate over the data row by row. In these cases, a potential workaround is to see if the calculation that is being done row by row can be done with a calculated column instead.
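For example, assuming a measure that iterates rows to multiply Order Quantity by Unit Price, one way to apply this workaround at the source (equivalent to adding a calculated column in the model) is to precompute the row-level expression in the view, so that a plain Sum, which aggregations can cover, does the rest. This is only a sketch; the view name and the assumption that these columns exist in vFactInternetSales are hypothetical:

```sql
-- Sketch only: precompute the row-by-row expression so a simple Sum
-- summarization in an aggregation table can cover it.
CREATE VIEW dbo.vFactInternetSalesWithLineAmount AS
SELECT fs.ProductKey,
       fs.DateKey,
       fs.[Sales Amount],
       fs.[Order Quantity],
       fs.[Unit Price],
       -- precomputed per-row value; a measure can now SUM this column
       fs.[Order Quantity] * fs.[Unit Price] AS [Line Amount]
FROM dbo.vFactInternetSales fs;
```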
Aggregations have a cost no matter how or where they are created, so their actual usage should be monitored. This can be done with DAX Studio for individual queries, or by monitoring the ‘Query Processing\Aggregate Table Rewrite Query’ event with SQL Server Profiler/Extended Events for a larger set of queries over a period of time.
Finally, even though Aggregations are a great technique for improving query performance, following best practices for a good model design should be the first step in optimizing a model.
Here you can watch a video of a presentation I did for the PASS BA Virtual Chapter on the same topic covered in this post.