Incremental Refresh for Power BI datasets based on a numeric column (Year): Working with Parquet/Delta files in Azure Synapse Analytics

The most common scenario for setting up the out of the box incremental refresh in Power BI is to base it off of a datetime column; however, there are cases when you may want to set up incremental refresh based off of a column with a data type other than datetime. Examples are when you are working with a smart date ID (01012023 for Jan 1,2023) column or when you are working with a source system that has partitioned data using a column such as Year that has a numeric data type.

A use case for the latter scenario is when you are working with Parquet/Delta files via Azure Synapse Analytics Serverless SQL Pool. When working with larger datasets, it is typical to see the Parquet/Delta files partitioned by date ranges. Depending on how much data there is, the partitioning may be at the Year level instead of Day.

From Power BI perspective, you always want your Power BI data refresh queries to include the partitioning key if applicable for faster refreshes. From Azure Synapse Analytics side, paying attention to the partitioning key on the Parquet/Delta files becomes extremely important. If you don’t benefit from partition pruning on the Synapse side, you may be reading the entire dataset for each partition. This is costly, inefficient, and can make your incremental refresh go even slower than a regular full refresh.

Azure Synapse Analytics Serverless SQL Pool calculates costs as $5 per TB of data processed. When queries include the partitioning key, Synapse is able to partition prune, meaning that it will only read the files that are needed to answer the query. This results in:

In Azure Synapse Analytics Serverless SQL PoolLess data processed = less cost
In Power BIFaster refreshes and less pressure on your capacity

In this post, I show you how to set up incremental refresh off a numeric column such as Year to cover the above use case. The same method can apply when using a smart date id column. You would just need a different Power Query function.

High Level Steps:

  • Create a function called YearValue in PowerQuery that takes a datetime value and returns its Year.
  • Create a RangeStart parameter.
  • Create a RangeEnd parameter
  • Filter the Year column using fixed values.
  • Go to advanced editor and replace the fixed values with YearValue(RangeStart) and YearValue(RangeEnd) in the function.
  • Close and Apply
  • Enable Incremental Refresh from the model view. Make sure you choose Year as the granularity for both the archive (full) load and incremental load. In this example, we are assuming data for the current year keeps changing and we want to reload it with every refresh.
  • Publish and refresh from Power BI Service.
  • Watch your refresh queries from Azure Synapse Studio to make sure you are generating correct queries and hitting the right data.

Detailed Steps:

In Power BI got to Get Data and select “Azure Synapse Analytics SQL”.

Type in your Azure Synapse Serverless SQL End Point and select Import. Reminder: Anytime we talk about incremental refresh, by default, we are talking about import. Direct Query does not load any data into Power BI hence there is no refresh.

Choose your source object. It is best practice to use a view or at least an entire table. This will ensure that query folding will happen later on. Click on Transform Data.

In Power Query Editor, add a new blank query.

Enter the following as the formula: = (x as datetime) => Date.Year(x)

Rename the query to YearValue.

Next you need to create two parameters that will be used by incremental refresh later on.

RangeStart = 1/1/2023

RangeEnd = 1/1/2024

Pay attention to case sensitivity. These are reserved words. The data type must be Date/Time.

Add a filter to the column that you want to base incremental refresh off of. In my case, it is puYear.

Choose Between.

Make sure the second dropdown is set to “is less than” not “is less than or equal to” and enter two year values that exist in your data.

Click on OK. Right click on your fact table and choose Advanced Editor.

As you can see, the query is using the fixed values that you provided for the Year filter.

Modify the query by replacing the fixed Year values with YearValue(RangeStart) and YearValue(RangeEnd) respectively.

This is how the end formula should look like.

each [puYear] >= YearValue(RangeStart) and [puYear] < YearValue(RangeEnd))

Click on Close and Apply. Go to the Model view, right click on the fact table and choose Incremental Refresh.

Define the Incremental Refresh Policy. In the following example, I have chosen to archive data for the last 5 Years and incrementally refresh the last Year. The first load (refresh) from Power BI service is going to load 6 years’ worth of data. The next refreshes will only reload the data for the current year. In my example, since the data is partitioned by Year, I expect the delta files for the current year to change with every new (ETL) update hence I need to read the data for the current year with every refresh.

Going forward, Power BI will roll the window automatically when needed. For more information on how Power BI maintains the partitions, see my previous blog post here.

Apply, save, and publish the Power BI report (dataset) to Power BI Service. You can monitor the refresh queries from Azure Synapse Analytics Studio Monitor hub. It is highly recommended that you do this or ask someone who has access to Azure Synapse Analytics Studio to help you look at them. If you have made any mistakes in setting up the partitions on the Power BI side, you can catch them here.

The following shows the refresh queries in my example. The first refresh generates 6 queries, one for each year and you can see the amount of data processed for each query. (1 TB of data processed = $5)

Here is the query for 2023.

select [_].[puYear],
from [dbo].[vNYCTaxiGreenDelta] as [_]
where [_].[puYear] >= 2023 and [_].[puYear] < 2024

The next refresh (if you did everything correctly 🙂 ) only generates one query which only processes 19 MB of data instead of 2.8 GB.

This results in much faster refreshes from Power BI side as well as less cost for each refresh going forward.

More on Incremental Refresh:

4 thoughts on “Incremental Refresh for Power BI datasets based on a numeric column (Year): Working with Parquet/Delta files in Azure Synapse Analytics

  1. @shabnam : Thanks for this detailed example. I have a question, i am using Synapse serverless pool (Cosmos Db as a source) and power BI but while incremental refreshing first load , sometimes second load is succesful then i am getting the duplicate id error. This id is not duplicate in the source but it is getting updated . I have a fact and dimension tables in my model (1 to many relationship) . when i changed the relationship to many to many data loads works fine .. but some of the query performance is very bad due to change in relationship. Can you please help me in this?

    1. Hi. I have not seen this problem myself, but I have heard from someone else that they ran into this. Is there any possibility that you have either blanks or Nulls on any of the key columns on either side of the relationship? Limit the amount of data you are working with to a few months at a time if you can (put a where clause in the main view for example) and see if you can spot the trouble rows.

Leave a Reply