Add Backblaze backup integration#145508
Conversation
|
Didn't realize there was already another older pr with similar features. But I'll have a look at it tomorrow to merge the best of both worlds (at least doing it from scratch taught me how HA integrations work) |
|
setting this back to draft while you figure that out |
I made some minor changes, but it should all be ready for review now. I tested with a slightly bigger backup (500MB) and I didn't notice any slowdowns or hanging in the rest of Home Assistant |
|
Just making sure, this does not load the entire backup file into memory before uploading? That was one of the issues @frenck had to fix in his implementations. But he does use a different way of uploading the data, he uses And is there a convenient way to test this on a production setting? I could try and extract this implementation and run it on my production home assistant install as a test. But not sure what the steps required for that are. (I did it once for the tado integration when the quite suddenly changed their login mechanism, breaking the production tado integrations. Ran the version in a pull request for a week or two that way.) |
|
Please take a look at the requested changes, and use the Ready for review button when you are done, thanks 👍 |
Yes, the
No idea. I'd be happy to test this on my own actual HA instance if you find a way |
|
I'm not exactly sure why there are tests failing (on only python 3.13). The assertion diff is not an object i'm creating directly |
| patch("b2sdk.v2.B2Api", return_value=sim) as mock_client, | ||
| patch("homeassistant.components.backblaze.B2Api", return_value=sim), |
There was a problem hiding this comment.
we should only patch libraries where we use them, so the first patch is incorrect, while the second patch is a correct one. I think you want to patch the first one to homeassistant.components.backblaze.config_flow.B2Api
| @pytest.fixture(autouse=True) | ||
| def b2_fixture(): | ||
| """Create account and application keys.""" | ||
| sim = RawSimulator() |
| application_key: str = key["applicationKey"] | ||
|
|
||
| bucket = sim.create_bucket( | ||
| api_url=api_url, | ||
| account_id=account_id, | ||
| account_auth_token=auth_token, | ||
| bucket_name=USER_INPUT[CONF_BUCKET], | ||
| bucket_type="allPrivate", | ||
| ) | ||
|
|
||
| # Create a test backup | ||
| test_backup_data = b"backup data" | ||
| upload_url = sim.get_upload_url(api_url, auth_token, bucket["bucketId"]) | ||
| stream = io.BytesIO(test_backup_data) | ||
| stream.seek(0) | ||
|
|
||
| filename = TEST_BACKUP.name | ||
| sha1 = hashlib.sha1(test_backup_data).hexdigest() | ||
| file = sim.upload_file( | ||
| upload_url["uploadUrl"], | ||
| upload_url["authorizationToken"], | ||
| filename, | ||
| len(test_backup_data), | ||
| "application/octet-stream", | ||
| sha1, | ||
| BACKUP_METADATA, | ||
| stream, | ||
| ) | ||
|
|
||
| def ls( | ||
| self, | ||
| prefix: str = "", | ||
| ) -> list[tuple[FileVersion, str]]: | ||
| """List files in the bucket.""" | ||
| return [ | ||
| ( | ||
| FileVersion( | ||
| sim, | ||
| file["fileId"], | ||
| file["fileName"], | ||
| file["contentLength"], | ||
| "application/octet-stream", | ||
| sha1, | ||
| BACKUP_METADATA, | ||
| file["uploadTimestamp"], | ||
| file["accountId"], | ||
| file["bucketId"], | ||
| "action", | ||
| None, | ||
| None, | ||
| ), | ||
| file["fileName"], | ||
| ) | ||
| ] | ||
|
|
||
| BucketSimulator.ls = ls | ||
|
|
||
| yield BackblazeFixture(application_key_id, application_key, bucket, sim, auth) |
There was a problem hiding this comment.
I have a little clue what happens here, but I am wondering why you keep using certain parts of the library. I personally always like to completely patch out the library and this way our code does not use library code (except for some dataclasses and mapping logic to create many objects).
There was a problem hiding this comment.
I think by using more mocks you can remove more inline patches, which is a big improvement imo
| async def test_form_invalid_auth(hass: HomeAssistant) -> None: | ||
| """Test config flow.""" | ||
| result = await _async_start_flow(hass, "invalid", "invalid") | ||
| assert result.get("type") is FlowResultType.FORM | ||
| assert result.get("errors") == {"base": "invalid_credentials"} | ||
|
|
||
|
|
||
| async def test_form_invalid_bucket_name( | ||
| hass: HomeAssistant, | ||
| b2_fixture: BackblazeFixture, | ||
| ) -> None: | ||
| """Test config flow.""" | ||
| result = await _async_start_flow( | ||
| hass, | ||
| b2_fixture.key_id, | ||
| b2_fixture.application_key, | ||
| { | ||
| **USER_INPUT, | ||
| "bucket": "invalid-bucket-name", | ||
| }, | ||
| ) | ||
| assert result.get("type") is FlowResultType.FORM | ||
| assert result.get("errors") == {"bucket": "invalid_bucket_name"} | ||
|
|
||
|
|
||
| async def test_form_cannot_connect( | ||
| hass: HomeAssistant, | ||
| b2_fixture: BackblazeFixture, | ||
| ) -> None: | ||
| """Test config flow.""" | ||
| with patch( | ||
| "b2sdk.v2.RawSimulator.authorize_account", | ||
| side_effect=exception.ConnectionReset("test"), | ||
| ): | ||
| result = await _async_start_flow( | ||
| hass, | ||
| b2_fixture.key_id, | ||
| b2_fixture.application_key, | ||
| USER_INPUT, | ||
| ) | ||
|
|
||
| assert result.get("type") is FlowResultType.FORM | ||
| assert result.get("errors") == {"base": "cannot_connect"} |
There was a problem hiding this comment.
Let's finish config flow tests so we know they are able to recover once they experienced an error
| ) -> None: | ||
| """Test config flow.""" | ||
| with patch( | ||
| "b2sdk.v2.RawSimulator.get_bucket_by_name", |
There was a problem hiding this comment.
you could use the mock for this I think
| downloaded_file = await self._hass.async_add_executor_job(file.download) | ||
| response = downloaded_file.response |
There was a problem hiding this comment.
correct me if I'm wrong, but this sounds like you now have the entire file in memory?
There was a problem hiding this comment.
No this is the http response object, not the body of the response, which we start streaming after this
| r_fd, w_fd = os.pipe() | ||
|
|
||
| async def writer() -> None: | ||
| """Write async stream to the pipe.""" | ||
| with os.fdopen(w_fd, "wb") as w: | ||
| async for chunk in stream: | ||
| w.write(chunk) | ||
| w.close() | ||
|
|
||
| # Schedule the writer coroutine | ||
| writer_task = self.async_create_task(self._hass, writer()) |
There was a problem hiding this comment.
this also sounds dangerous to me? Who guarantees we're not writing the entire file to the pipe, before the other task is able to upload (enough) chunks?
There was a problem hiding this comment.
Yeah I think you might be right. I'm not sure how the streams work internally, but if they are push-based that is possible (but also an issue with any other implementation?)
There was a problem hiding this comment.
iirc none of the other implementations are using a separate task for this, but do read and write for a single chunk together
| self._bucket.upload_unbound_stream( | ||
| r, | ||
| filename, | ||
| file_info=file_info, |
There was a problem hiding this comment.
what's the maximum metadata size that backblaze allows?
There was a problem hiding this comment.
From the docs:
Each key is a UTF-8 string up to 50 bytes. There is an overall 7000-byte limit on the headers that are needed for file name and file information, unless the file is uploaded with server-side encryption in which case the limit is 2048 bytes.
There was a problem hiding this comment.
But that'll be a problem then. Not 100% sure what a key is in that context, but aren't you dropping the entire serialized json into one key? In any case even 2kB are not safe enough, which means we'd need to use metadata files in those cases.
There was a problem hiding this comment.
I think the docs might be wrong (or I misunderstood). The backup I have been uploading has 370 bytes of metadata and that's been working fine. I don't know how big the metadata can get
There was a problem hiding this comment.
Potentially bigger than 2kB. I think with anything bigger than 4 we should be safe, but 2 is not enough.
There was a problem hiding this comment.
7kb is the limit for files when server-side encryption is off (including the filename). I can add a note to the docs that server-side encryption is not recommended for larger instances, because of this limit
There was a problem hiding this comment.
I'd rather switch to file based metadata (like AWS S3 does it as well), instead of discouraging users to opt out of a security feature.
|
FYI. I've been following this conversation, and submitted #149627 based on the feedback here. Thank you @hugo-vrijswijk. Happy to merge into you branch if that's the preference here, since I didn't want to duplicate your brand and documentation pull requests. |
|
I'm closing this since #149627 which completed the work started in this PR has been merged. |
Proposed change
Adds a new core integration for Backblaze B2. Can be used as a backup source for HA Backups.
Type of change
Additional information
Integration is mostly based on the aws_s3 source, with some inspiration from azure_storage. It has some prerequisites (existing bucket and key). Also allows using a prefix to store backups in a certain folder in Backblaze. The integration will also make some checks before setup to ensure the configuration works.
The library used (
b2sdk) is completely sync, so many calls are wrapped inasync_add_executor_job. b2sdk also uses therequestslibrary, instead of the async http libraries used in HA. There is an open issue for async support, but it is noted this would be essentially a new library. Tests use theRawSimulatorfrom b2sdk to test interactions with the SDK, rather than patching.This is my first time contributing to HA, so please verify I am following all conventions. I intend to make more improvements to raise the integration quality scale.
Also I named the integration
backblaze_b2, as that seemed in line withaws_s3. Not sure if there ever would be another backblaze integration for a different product. If needed i can change the name to justbackblazeThis PR fixes or closes issue: fixesLink to developer documentation pull request:Link to frontend pull request:Checklist
ruff format homeassistant tests)If user exposed functionality or configuration variables are added/changed:
If the code communicates with devices, web services, or third-party tools:
Updated and included derived files by running:
python3 -m script.hassfest.requirements_all.txt.Updated by running
python3 -m script.gen_requirements_all.To help with the load of incoming pull requests: