Skip to content

Separate attrs into another table (reduces database size)#68224

Merged
bdraco merged 62 commits intohome-assistant:devfrom
bdraco:state_attr_table_poc
Mar 18, 2022
Merged

Separate attrs into another table (reduces database size)#68224
bdraco merged 62 commits intohome-assistant:devfrom
bdraco:state_attr_table_poc

Conversation

@bdraco
Copy link
Copy Markdown
Member

@bdraco bdraco commented Mar 16, 2022

Breaking change

Attributes are now stored in a state_attributes table, which stores the same set of attributes once (many to one relationship).

Attributes represent roughly 21% of the database size. (28% if you exclude statistics)
On a few of my production instances attributes ranged from 82-88% duplicates of another set of attributes.
I expect this will reduce the size of the database between 13-16% on average

My CPU history graphs which have a lot of duplicate attributes now load faster in unscientific testing. (134ms instead of 385ms). I expect this to speed up other queries as we now mix in the attributes after filtering most of the time so it saves a bit of I/O as well in some cases.

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New integration (thank you!)
  • New feature (which adds functionality to an existing integration)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • The code has been formatted using Black (black --fast homeassistant tests)
  • Tests have been added to verify that the new code works.

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:

  • The manifest file has all fields filled out correctly.
    Updated and included derived files by running: python3 -m script.hassfest.
  • New or updated dependencies have been added to requirements_all.txt.
    Updated by running python3 -m script.gen_requirements_all.
  • For the updated dependencies - a link to the changelog, or at minimum a diff between library versions is added to the PR description.
  • Untested files have been added to .coveragerc.

The integration reached or maintains the following Integration Quality Scale:

  • No score or internal
  • 🥈 Silver
  • 🥇 Gold
  • 🏆 Platinum

To help with the load of incoming pull requests:

@probot-home-assistant
Copy link
Copy Markdown

Hey there @home-assistant/core, mind taking a look at this pull request as it has been labeled with an integration (recorder) you are listed as a code owner for? Thanks!
(message by CodeOwnersMention)

@bdraco
Copy link
Copy Markdown
Member Author

bdraco commented Mar 16, 2022

Need to handle dupe attributes that happen before commit

Also not write the attributes if it's a dupe

@frenck
Copy link
Copy Markdown
Member

frenck commented Mar 16, 2022

I think this is a great idea! Technically, we could optimize this further, if we store each attribute separately (instead of dumping the JSON that contains all every single time).

From that perspective, I also think we could allow marking attributes as "not recorded" or not relevant for recording or something. For example, it is really unnecessary to record the effect_list attribute of lights... (and if you use WLED, like I do, with hundreds of effects; that is quite a lot of wasted space).

@bdraco
Copy link
Copy Markdown
Member Author

bdraco commented Mar 16, 2022

I think this is a great idea! Technically, we could optimize this further, if we store each attribute separately (instead of dumping the JSON that contains all every single time).

That was the first approach with this, but I ran into a problem: each attribute value had to be json.dump since the data is opaque to us (An attribute value can be anything that is serializable since attributes are typed dict[str, Any]).

Additionally that meant another table to store state_id, attribute_id, ....

That created more overhead to load them in testing since json.loads had to be called on each one. I tried implementing a cache but even then it still wasn't great.

Finally there were a lot more selects on the database to find attributes, but that did mean no need for hashing. A cache might make this ok

I couldn't come up with a way to implement it that didn't trade performance in a way that seemed worth it.

Maybe there is still a way to make it work though. Probably will give it another go before going this direction.

@bdraco
Copy link
Copy Markdown
Member Author

bdraco commented Mar 16, 2022

I tried a few approaches with the single attribute split up and the cost of the additional indexes needed to make the purge run at a reasonable speed exceeded the savings from the deduplication. We have a lot of small attributes that make the overhead of the database storage quickly offset the savings

@bdraco
Copy link
Copy Markdown
Member Author

bdraco commented Mar 16, 2022

TODO:

  • purge shared attributes that are no longer linked to any state
  • Fix case where there a mixed old and new states in the db and the old states are unseen since the join is missing
  • update recorder tests since they are doing raw selects from States without attrs
  • tests for purge

@bdraco bdraco changed the title POC: Separate attributes into another table (reduces database size) POC: Separate attrs into another table (reduces database size) Mar 16, 2022
@bdraco bdraco force-pushed the state_attr_table_poc branch from d430820 to bd97dcc Compare March 16, 2022 22:51
@bdraco
Copy link
Copy Markdown
Member Author

bdraco commented Mar 17, 2022

Also need to disconnect the rows before deleting

@balloob
Copy link
Copy Markdown
Member

balloob commented Mar 18, 2022

One slightly off-topic note:

One of the ideas that Erik and I had is that we want all sensors with a state class in the frontend to use just the statistics table and not be stored in the states/events tables. In that case our statistics/event tables would be a lot smaller and might not require such aggressive optimizations.

@bdraco
Copy link
Copy Markdown
Member Author

bdraco commented Mar 18, 2022

One slightly off-topic note:

One of the ideas that Erik and I had is that we want all sensors with a state class in the frontend to use just the statistics table and not be stored in the states/events tables. In that case our statistics/event tables would be a lot smaller and might not require such aggressive optimizations.

That sounds like it would save a lot of space that those tend to be the chatty ones. We could come up with an api to abstract away the access to the data to replace the sql queries that the statistics, stats history, plant, etc integrations.

Copy link
Copy Markdown
Contributor

@emontnemery emontnemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, left a few questions though.

@bdraco
Copy link
Copy Markdown
Member Author

bdraco commented Mar 18, 2022

Retest looks good

@bdraco bdraco merged commit 9215702 into home-assistant:dev Mar 18, 2022
@bdraco bdraco deleted the state_attr_table_poc branch March 18, 2022 10:23
@github-actions github-actions bot locked and limited conversation to collaborators Mar 21, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants