Convert surrogate keys from hash to int for Power BI #33
Comments
I'm very interested in this issue. Thank you for sharing your thoughts!
Yay, finally someone who cares about this too! :) What are your thoughts @fritz3000g?
It looks like you're basically maintaining a key map at the same grain as the fact table. So if you truncate the fact table you can keep the key map, unless you're changing the grain. My basic question is performance. For example, using a 1,000,000 row fact table and a 100,000 row dimension with GUID keys, what would be your load time on:
What would be the size and performance of the data model in:
And all of this compared for the same workload using the GUID keys directly vs. the integer keys. Maintaining a key map isn't that unusual of a thing to do in my mind, though it's only slightly less annoying than maintaining surrogate keys in the fact table. I'd be interested to see it at work, though, and to know how it performs.
Hey @infused-kim, did you continue testing this issue any further? I don't want to go the hash route, because that will just increase the size of all the things I am importing into Power BI, but there does not seem to be a better alternative (I am using Snowflake) 🤔
I have been using hash keys for the time being, but I think my solution above should work.
Yeah, it should work. We are making our own attempt at int surrogates following these steps:
We intentionally decided against generating the surrogate keys outside of dbt, to avoid issues with multiple environments in different databases and schemas ending up with different surrogate keys.
Hey Guys,
yes... ANOTHER surrogate key issue from me... But I think you guys are the right people to nerd around with for this type of thing :)
Problem Description
I am visualizing the models I create with dbt in Power BI, and I imagine many of you are doing the same.
One of the best practices in Power BI is to keep columns small and to reduce their uniqueness. Unfortunately, hash surrogate keys don't fit that best practice at all.
You can't use the binary column type for relationships and instead have to use the string type. So all your keys are 32 characters long and highly unique.
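For context, a dbt-style hash surrogate key is usually just an MD5 over the concatenated business key columns, which is why every key ends up as a 32-character hex string. A minimal sketch with made-up column names (Snowflake-flavoured; SQL Server would use HASHBYTES instead of md5):

```sql
select
    -- MD5 of the business key -> 32-character hex string used as the surrogate key
    md5(cast(customer_id as varchar(50)) || '-' || cast(source_system as varchar(50))) as customer_key,
    customer_id,
    source_system
from stg_customers
```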
I analysed my model with DAX Studio and found that a very large portion of the model size is due to surrogate keys.
So one negative impact that is definitely happening is a bigger model size and higher RAM consumption for the Power BI datasets.
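In case anyone wants to reproduce this kind of analysis: DAX Studio can run the standard VertiPaq DMVs against the model, e.g. something along these lines to see how much space each column's dictionary takes (my sketch, not taken from the original issue):

```sql
-- Dictionary size per column; hash-key columns tend to dominate this list
select DIMENSION_NAME, COLUMN_ID, DICTIONARY_SIZE
from $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMNS
where COLUMN_TYPE = 'BASIC_DATA'
```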
Additionally, there are claims that it also affects performance when you create relationships on large text columns.
Here is an article explaining the performance impact, and here is a Reddit discussion.
To be honest, I am skeptical about that. Since relationships are pre-defined, it would be silly if Power BI didn't optimize away the strings... (but I don't know for sure).
Is this even a problem that needs solving?
So my first question is: How do you handle the problem? Or do you think it's not important to handle at all?
Does the simplicity of dbt's hash keys and the productivity increase of dbt outweigh the downsides when using Power BI for you?
Do you have any idea on how to benchmark whether hash keys have an impact on query performance for imported Power BI datasets?
Possible solution
I have also come up with a prototype to solve this issue. I have not implemented it in production yet (and I am still undecided whether I actually should), but it seems to work well enough to get your opinion here.
Dimension table example
The model below has a `test_int_key` column that is set to null when the model is built. It also has a post_hook that updates that column with an int value corresponding to the hash key.
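For illustration, a minimal sketch of what such a dimension model could look like (the model, table and column names other than `test_int_key` are made up, and `update_int_keys` is a placeholder name for the macro described below):

```sql
-- models/dim_customer.sql (hypothetical)
{{
    config(
        materialized='table',
        post_hook="{{ update_int_keys(this, 'customer_hash_key', 'test_int_key') }}"
    )
}}

select
    md5(cast(customer_id as varchar(50))) as customer_hash_key,  -- usual dbt-style hash key
    cast(null as int)                     as test_int_key,       -- placeholder, filled by the post_hook
    customer_id,
    customer_name
from {{ ref('stg_customers') }}
```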
Macro that maintains a table with int keys for each hash key
This macro creates a key-map table for the primary key of the dimension, with an identity property on the `int_key` column and another column for the `hash_key`.
When the post_hook runs this macro, it merges all new hash keys into the table, which generates new rows with a new unique `int_key` for them (thanks to the identity column property).
Then the macro updates the `int_key` column of the dimension table with the int keys that correspond to the hash keys.
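A rough sketch of what I have in mind for the macro, in SQL Server flavoured SQL (the macro and key-map table names are placeholders, and I'm using an insert ... where not exists instead of a real merge to keep it short):

```sql
{% macro update_int_keys(relation, hash_key_col, int_key_col) %}

    -- 1. Key-map table: identity int_key plus the hash_key, created on first run
    if object_id('{{ relation.schema }}.{{ relation.identifier }}__key_map') is null
        create table {{ relation.schema }}.{{ relation.identifier }}__key_map (
            int_key  int identity(1, 1) not null,
            hash_key char(32) not null primary key
        );

    -- 2. Add hash keys we haven't seen yet; the identity property assigns
    --    each new row its own unique int_key
    insert into {{ relation.schema }}.{{ relation.identifier }}__key_map (hash_key)
    select distinct d.{{ hash_key_col }}
    from {{ relation }} d
    where not exists (
        select 1
        from {{ relation.schema }}.{{ relation.identifier }}__key_map km
        where km.hash_key = d.{{ hash_key_col }}
    );

    -- 3. Write the int keys back into the freshly built dimension table
    update d
    set {{ int_key_col }} = km.int_key
    from {{ relation }} d
    join {{ relation.schema }}.{{ relation.identifier }}__key_map km
        on km.hash_key = d.{{ hash_key_col }};

{% endmacro %}
```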
Fact Table View
I generate table models for my facts and dimensions, and then I create views with renamed columns that I actually import into Power BI.
The underlying dimension tables contain the `int_key` columns. The fact tables are not regenerated to include the int key columns. Instead, I join to the dimensions in the report views on the hash key and retrieve the int key from the dimension:
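A sketch of such a report view (names are placeholders again): the fact keeps only the hash key on disk, and the view swaps it for the dimension's int key before Power BI ever sees it:

```sql
-- models/rpt_sales.sql (hypothetical), materialized as a view
{{ config(materialized='view') }}

select
    dim.test_int_key  as customer_key,   -- small int key instead of the 32-character hash
    fct.order_date,
    fct.sales_amount
from {{ ref('fct_sales') }} fct
left join {{ ref('dim_customer') }} dim
    on dim.customer_hash_key = fct.customer_hash_key
```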
What do you think?
This increases the model generation time and probably makes the Power BI refresh a bit longer due to the join in the report view.
But the upside is that the model size and RAM consumption would be lower. It also has the potential to improve query performance.
Do you think this is an issue worth bothering with? Or is the upside too small?
And what do you think of my idea to solve it? Do you perhaps have improvement suggestions or other ideas of how this could be solved better?
And do you think there is a big enough need for this that we should integrate the solution into tsql_tools?