How does the DirectQuery partition of a Hybrid table in Power BI get used in queries?

While I was tracing some Power BI datasets with Hybrid tables, I noticed that that the DirectQuery partition of a Hybrid table is used in all queries submitted to the Hybrid table, whether or not they are asking for data from that Direct Query partition. On first look, this does not look good because Power BI should be able to use partition pruning for a partitioned table, meaning it should only scan certain partitions. For example, if you ask for historical data from 5 years ago, the current day’s data is not needed to answer that query and hence, the Direct Query partition for today should not be used in the query.

Thankfully, upon further investigation, it turns out that while a query is submitted for the Direct Query partition to the backend for all queries against a Hybrid table, the query itself has certain criteria in its Joins and Where clause and it does not actually return any data unless the main query is actually asking for data from that partition explicitly.

While this is not ideal because the query does go to the backend and the backend has to figure out that the query is empty, this is much better than what at first, I had feared was happening. This is a much better scenario than the backend resolving the query for the Direct Query partition every time with all the data and sending all that data to Power BI to filter out. That is not what’s happening.

If you are interested to see more details, in the remainder of this post, I will go over a few queries submitted against a sample Hybrid table and look the SQL/Scan queries generated using Performance Analyzer in Power BI Desktop and DAX studio.

First a little background on Partitioned tables:

Partitioning Benefits /Partition Pruning

Any time a table is partitioned, regardless of the storage mode of its partitions, it should achieve at least one of the two following goals:

1) Faster processing (refresh/import) time from Power BI service, assuming the backend source can handle parallel queries.

2) Better query performance, at least for some of queries that explicitly target one or a couple of the partitions.

This latter is achieved by partition pruning, meaning Power BI should understand that a query is only asking for the data in one or more partitions and only query those partitions.

Partitions of a Hybrid table

In its simplest form, a hybrid table is a table that has a set of partitions in Import and one partition in DirectQuery storage mode. This scenario can easily be set up using the Incremental Refresh GUI from Power BI Desktop on top of a table that has Incremental Refresh enabled.

This is one of the use cases for Hybrid tables which allows you to load (import) historical data into Power BI’s memory for fast performance (and do it fully only once followed by only incremental updates as needed) and at the same time, present the data for the most recent time period to users in near real time (Direct Query). This satisfies a very specific use case. Obviously, if the data for the last time period does not change frequently, it itself should also be imported on a schedule for the best performance.

The following section shows the partitions of a Hybrid table that I set up for my other blog post where I documented how Power BI automatically maintains the partitions over time, e.g., creates new partitions/merges old ones/…

Sample Hybrid table

My test Hybrid table is set up with the following Incremental Refresh Policy:

Archive (import) last 5 years

Incrementally Refresh (import) Last one month

Get the data for the current month with a Direct Query partition (near real time)

At the time of writing this blog, November 6th, 2022, the Refresh Policy yields the following partitions:

The DirectQuery partition is the last one, 2022Q411-onward. The following shows the Patrion script:

{
  "createOrReplace": {
    "object": {
      "database": "Hybrid Tables Demo 8 - Last 5 Years Last 1 Month IRH Published Oct 30",
      "table": "vFactInternetSalesBig",
      "partition": "2022Q411-onward"
    },
    "partition": {
      "name": "2022Q411-onward",
      "mode": "directQuery",
      "source": {
        "type": "policyRange",
        "start": "2022-11-01T00:00:00",
        "end": "2028-01-01T00:00:00",
        "granularity": "month"
      }
    }
  }
}

Notice how the start and end Propeties of the partition are 2022-11-01 and 2028-11-01. We will see these dates in the SQL query submitted to the backend later on. (As a side note, if you have future dates facts, they will show up with the design. You should use other methods to throw them out.)

Test Report

To generate sample queries, I put together a simple report by live connecting to the dataset and I used the query for the Sales Amount card:

Test Queries

I chose certain values from the Year/Month filters and captured the query for the Sales Amount card using the Performance Analyzer feature in Power BI Desktop under the View tab. Then I took the query to DAX Studio and ran it while running a trace and saw how Power BI resolved it.

Query 1: Year = 2022

We expect all the partitions in 2022, including the Direct Query one for Nov, to be involved in answering this query.

We can capture the query by clicking on Copy query:

// DAX Query
DEFINE VAR __DS0FilterTable = 
  TREATAS({2022}, 'vDimDate'[Calendar Year])

EVALUATE
  SUMMARIZECOLUMNS(__DS0FilterTable, "Sales_Amount", IGNORE('vFactInternetSalesBig'[Sales Amount]))

Running the above query in DAX Studio with the trace (Server Timings) running, shows that as expected, two events are generated:

  1. SQL: To get the data for Nov 2022 from the SQL source
  2. Scan: To get the data from all imported partitions from Power BI’s memory (VertiPaq engine).

The SQL query looks like this:


SELECT SUM([t0].[SalesAmount])
 AS [a0]
FROM 
((
select [_].[ProductKey],
    [_].[CustomerKey],
    [_].[SalesOrderNumber],
    [_].[SalesOrderLineNumber],
    [_].[OrderQuantity],
    [_].[ExtendedAmount],
    [_].[SalesAmount],
    [_].[OrderDate],
    [_].[OrderDateKey],
    [_].[ShipDate],
    [_].[ShipDateKey],
    [_].[DueDate],
    [_].[DueDateKey]
from [dbo].[vFactInternetSalesBig] as [_]
where [_].[OrderDate] >= convert(datetime2, '2022-11-01 00:00:00') and [_].[OrderDate] < convert(datetime2, '2028-01-01 00:00:00')
) AS [t0]

 LEFT OUTER JOIN 

(
select [$Table].[DateKey] as [DateKey],
    [$Table].[Date] as [Date],
    [$Table].[Day] as [Day],
    [$Table].[Month] as [Month],
    [$Table].[Calnedar Quarter] as [Calnedar Quarter],
    [$Table].[Calendar Year] as [Calendar Year]
from [dbo].[vDimDate] as [$Table]
) AS [t1] on 
(
[t0].[OrderDateKey] = [t1].[DateKey]
)
)

The above query returns the Sales data for Nov from the fact table in the backend. This is expected and the right behavior.

The Scan query which gets the rest of the data from Power BI’s memory (import) looks like this:

SET DC_KIND="AUTO";
SELECT
SUM ( '<ccon>vFactInternetSalesBig ( 364 ) </ccon>'[<ccon>SalesAmount</ccon>] )
FROM '<ccon>vFactInternetSalesBig ( 364 ) </ccon>'
	LEFT OUTER JOIN '<ccon>vDimDate ( 15763 ) </ccon>' ON '<ccon>vFactInternetSalesBig ( 364 ) </ccon>'[<ccon>OrderDateKey</ccon>]='<ccon>vDimDate ( 15763 ) </ccon>'[<ccon>DateKey</ccon>]
WHERE
	'<ccon>vDimDate ( 15763 ) </ccon>'[<ccon>Calendar Year</ccon>] = <ccon>2022</ccon>;


'Estimated size ( volume, marshalling bytes ) : 1, 16'

Query 2: Year = 2021

This time we expect that only the imported year partition for 2021 be used; however, as I mentioned earlier in this post, we will still see a SQL query submitted to the backenda as well.

We can capture the query by clicking on Copy query:

// DAX Query
DEFINE VAR __DS0FilterTable = 
  TREATAS({2021}, 'vDimDate'[Calendar Year])

EVALUATE
  SUMMARIZECOLUMNS(__DS0FilterTable, "Sales_Amount", IGNORE('vFactInternetSalesBig'[Sales Amount]))

Here is the SQL query for a closer look:


SELECT SUM([t0].[SalesAmount])
 AS [a0]
FROM 
((
select [_].[ProductKey],
    [_].[CustomerKey],
    [_].[SalesOrderNumber],
    [_].[SalesOrderLineNumber],
    [_].[OrderQuantity],
    [_].[ExtendedAmount],
    [_].[SalesAmount],
    [_].[OrderDate],
    [_].[OrderDateKey],
    [_].[ShipDate],
    [_].[ShipDateKey],
    [_].[DueDate],
    [_].[DueDateKey]
from [dbo].[vFactInternetSalesBig] as [_]
where [_].[OrderDate] >= convert(datetime2, '2022-11-01 00:00:00') and [_].[OrderDate] < convert(datetime2, '2028-01-01 00:00:00')
) AS [t0]

 LEFT OUTER JOIN 

(
select [$Table].[DateKey] as [DateKey],
    [$Table].[Date] as [Date],
    [$Table].[Day] as [Day],
    [$Table].[Month] as [Month],
    [$Table].[Calnedar Quarter] as [Calnedar Quarter],
    [$Table].[Calendar Year] as [Calendar Year]
from [dbo].[vDimDate] as [$Table]
) AS [t1] on 
(
[t0].[OrderDateKey] = [t1].[DateKey]
)
)

WHERE 
(
[t1].[Calendar Year] = 2021
)
 

Thankfully, the above query does not return any rows because the date filters don’t overlap; however, this is still a query submitted to the backend.

Here is how the Scan query looks like.

SET DC_KIND="AUTO";
SELECT
SUM ( '<ccon>vFactInternetSalesBig ( 364 ) </ccon>'[<ccon>SalesAmount</ccon>] )
FROM '<ccon>vFactInternetSalesBig ( 364 ) </ccon>'
	LEFT OUTER JOIN '<ccon>vDimDate ( 15763 ) </ccon>' ON '<ccon>vFactInternetSalesBig ( 364 ) </ccon>'[<ccon>OrderDateKey</ccon>]='<ccon>vDimDate ( 15763 ) </ccon>'[<ccon>DateKey</ccon>]
WHERE
	'<ccon>vDimDate ( 15763 ) </ccon>'[<ccon>Calendar Year</ccon>] = <ccon>2021</ccon>;


'Estimated size ( volume, marshalling bytes ) : 1, 16'

Query 3: Year = 2022, Month = October

Here is the DAX code generated:

// DAX Query
DEFINE
  VAR __DS0FilterTable = 
    TREATAS({2022}, 'vDimDate'[Calendar Year])

  VAR __DS0FilterTable2 = 
    TREATAS({"October"}, 'vDimDate'[Month])

EVALUATE
  SUMMARIZECOLUMNS(
    __DS0FilterTable,
    __DS0FilterTable2,
    "Sales_Amount", IGNORE('vFactInternetSalesBig'[Sales Amount])
  )

Running this query in DAX Studio, we see both a SQL and Scan events generated:

Here is the SQL code generated which will not return any rows because of the date filters not overlapping:


SELECT SUM([t0].[SalesAmount])
 AS [a0]
FROM 
((
select [_].[ProductKey],
    [_].[CustomerKey],
    [_].[SalesOrderNumber],
    [_].[SalesOrderLineNumber],
    [_].[OrderQuantity],
    [_].[ExtendedAmount],
    [_].[SalesAmount],
    [_].[OrderDate],
    [_].[OrderDateKey],
    [_].[ShipDate],
    [_].[ShipDateKey],
    [_].[DueDate],
    [_].[DueDateKey]
from [dbo].[vFactInternetSalesBig] as [_]
where [_].[OrderDate] >= convert(datetime2, '2022-11-01 00:00:00') and [_].[OrderDate] < convert(datetime2, '2028-01-01 00:00:00')
) AS [t0]

 LEFT OUTER JOIN 

(
select [$Table].[DateKey] as [DateKey],
    [$Table].[Date] as [Date],
    [$Table].[Day] as [Day],
    [$Table].[Month] as [Month],
    [$Table].[Calnedar Quarter] as [Calnedar Quarter],
    [$Table].[Calendar Year] as [Calendar Year]
from [dbo].[vDimDate] as [$Table]
) AS [t1] on 
(
[t0].[OrderDateKey] = [t1].[DateKey]
)
)

WHERE 
(
(
[t1].[Month] = 'October'
)
 AND 
(
[t1].[Calendar Year] = 2022
)
) 

Here is the Scan code generated which will return the data for October from Power BI:

SET DC_KIND="AUTO";
SELECT
SUM ( '<ccon>vFactInternetSalesBig ( 364 ) </ccon>'[<ccon>SalesAmount</ccon>] )
FROM '<ccon>vFactInternetSalesBig ( 364 ) </ccon>'
	LEFT OUTER JOIN '<ccon>vDimDate ( 15763 ) </ccon>' ON '<ccon>vFactInternetSalesBig ( 364 ) </ccon>'[<ccon>OrderDateKey</ccon>]='<ccon>vDimDate ( 15763 ) </ccon>'[<ccon>DateKey</ccon>]
WHERE
	'<ccon>vDimDate ( 15763 ) </ccon>'[<ccon>Month</ccon>] = <ccon>'October'</ccon> VAND
	'<ccon>vDimDate ( 15763 ) </ccon>'[<ccon>Calendar Year</ccon>] = <ccon>2022</ccon>;


'Estimated size ( volume, marshalling bytes ) : 1, 16'

Query 4: Year = 2022, Month = November

For this one, we expect to only see a SQL query only; however, you will see that we will still have a Scan query as well!

Here is the DAX query generarted from Power BI Desktop:

// DAX Query
DEFINE
  VAR __DS0FilterTable = 
    TREATAS({2022}, 'vDimDate'[Calendar Year])

  VAR __DS0FilterTable2 = 
    TREATAS({"November"}, 'vDimDate'[Month])

EVALUATE
  SUMMARIZECOLUMNS(
    __DS0FilterTable,
    __DS0FilterTable2,
    "Sales_Amount", IGNORE('vFactInternetSalesBig'[Sales Amount])
  )

Here is the SQL query which will return the data for Nov:


SELECT SUM([t0].[SalesAmount])
 AS [a0]
FROM 
((
select [_].[ProductKey],
    [_].[CustomerKey],
    [_].[SalesOrderNumber],
    [_].[SalesOrderLineNumber],
    [_].[OrderQuantity],
    [_].[ExtendedAmount],
    [_].[SalesAmount],
    [_].[OrderDate],
    [_].[OrderDateKey],
    [_].[ShipDate],
    [_].[ShipDateKey],
    [_].[DueDate],
    [_].[DueDateKey]
from [dbo].[vFactInternetSalesBig] as [_]
where [_].[OrderDate] >= convert(datetime2, '2022-11-01 00:00:00') and [_].[OrderDate] < convert(datetime2, '2028-01-01 00:00:00')
) AS [t0]

 LEFT OUTER JOIN 

(
select [$Table].[DateKey] as [DateKey],
    [$Table].[Date] as [Date],
    [$Table].[Day] as [Day],
    [$Table].[Month] as [Month],
    [$Table].[Calnedar Quarter] as [Calnedar Quarter],
    [$Table].[Calendar Year] as [Calendar Year]
from [dbo].[vDimDate] as [$Table]
) AS [t1] on 
(
[t0].[OrderDateKey] = [t1].[DateKey]
)
)

WHERE 
(
(
[t1].[Month] = 'November'
)
 AND 
(
[t1].[Calendar Year] = 2022
)
)
 

Here is the Scan query which returns no data:

SET DC_KIND="AUTO";
SELECT
SUM ( '<ccon>vFactInternetSalesBig ( 364 ) </ccon>'[<ccon>SalesAmount</ccon>] )
FROM '<ccon>vFactInternetSalesBig ( 364 ) </ccon>'
	LEFT OUTER JOIN '<ccon>vDimDate ( 15763 ) </ccon>' ON '<ccon>vFactInternetSalesBig ( 364 ) </ccon>'[<ccon>OrderDateKey</ccon>]='<ccon>vDimDate ( 15763 ) </ccon>'[<ccon>DateKey</ccon>]
WHERE
	'<ccon>vDimDate ( 15763 ) </ccon>'[<ccon>Month</ccon>] = <ccon>'November'</ccon> VAND
	'<ccon>vDimDate ( 15763 ) </ccon>'[<ccon>Calendar Year</ccon>] = <ccon>2022</ccon>;


'Estimated size ( volume, marshalling bytes ) : 1, 16'

Summary

Regardless of which partition a query is targeting to get its data from, for a Hybrid table, you will always see a SQL event and a Scan event. Depending on the query, one of these queries may return no rows; however, they are always submitted.
It is possible that in future, Power BI changes so that the Direct Query partition is not included in answering all queries but for now, it is good to know that Power BI does send the right filters (Query Folding is super important here) to at least zero out the number of rows returned by such a query when appropriate.

5 thoughts on “How does the DirectQuery partition of a Hybrid table in Power BI get used in queries?

  1. Hi Shabnam,

    I had no idea about this feature of partition pruning in tabular models, which you refer to, as it happens in multidimensional models!?
    I was convinced that partition pruning didn’t exist in Vertipaq, or maybe you’re referring to a sort of segment “elimination” that someone (Marco Russo I think) has already mentioned that occurs during segment scanning.

    1. Hi Jorge,
      When you set up partitions either manually or automatically by using the Incremental Refresh feature, the Power BI engine is able to do partition pruning when answering queries. A segment is inside the partition. The post was at partition level. Hope this helps.

Leave a Reply