Skip to content

Add Backblaze backup integration#145508

Closed
hugo-vrijswijk wants to merge 19 commits into
home-assistant:devfrom
hugo-vrijswijk:integration-backblaze-b2
Closed

Add Backblaze backup integration#145508
hugo-vrijswijk wants to merge 19 commits into
home-assistant:devfrom
hugo-vrijswijk:integration-backblaze-b2

Conversation

@hugo-vrijswijk
Copy link
Copy Markdown
Contributor

@hugo-vrijswijk hugo-vrijswijk commented May 23, 2025

Proposed change

Adds a new core integration for Backblaze B2. Can be used as a backup source for HA Backups.

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New integration (thank you!)
  • New feature (which adds functionality to an existing integration)
  • Deprecation (breaking change to happen in the future)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

Integration is mostly based on the aws_s3 source, with some inspiration from azure_storage. It has some prerequisites (existing bucket and key). Also allows using a prefix to store backups in a certain folder in Backblaze. The integration will also make some checks before setup to ensure the configuration works.

The library used (b2sdk) is completely sync, so many calls are wrapped in async_add_executor_job. b2sdk also uses the requests library, instead of the async http libraries used in HA. There is an open issue for async support, but it is noted this would be essentially a new library. Tests use the RawSimulator from b2sdk to test interactions with the SDK, rather than patching.

This is my first time contributing to HA, so please verify I am following all conventions. I intend to make more improvements to raise the integration quality scale.

Also I named the integration backblaze_b2, as that seemed in line with aws_s3. Not sure if there ever would be another backblaze integration for a different product. If needed i can change the name to just backblaze

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • I have followed the perfect PR recommendations
  • The code has been formatted using Ruff (ruff format homeassistant tests)
  • Tests have been added to verify that the new code works.

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:

  • The manifest file has all fields filled out correctly.
    Updated and included derived files by running: python3 -m script.hassfest.
  • New or updated dependencies have been added to requirements_all.txt.
    Updated by running python3 -m script.gen_requirements_all.
  • For the updated dependencies - a link to the changelog, or at minimum a diff between library versions is added to the PR description.

To help with the load of incoming pull requests:

@hugo-vrijswijk
Copy link
Copy Markdown
Contributor Author

Didn't realize there was already another older pr with similar features. But I'll have a look at it tomorrow to merge the best of both worlds (at least doing it from scratch taught me how HA integrations work)

@zweckj
Copy link
Copy Markdown
Member

zweckj commented May 24, 2025

setting this back to draft while you figure that out

@zweckj zweckj marked this pull request as draft May 24, 2025 19:54
@hugo-vrijswijk hugo-vrijswijk marked this pull request as ready for review May 26, 2025 11:32
@hugo-vrijswijk
Copy link
Copy Markdown
Contributor Author

setting this back to draft while you figure that out

I made some minor changes, but it should all be ready for review now. I tested with a slightly bigger backup (500MB) and I didn't notice any slowdowns or hanging in the rest of Home Assistant

@hugo-vrijswijk hugo-vrijswijk changed the title Add Backblaze B2 integration Add Backblaze integration May 26, 2025
@jeroenleenarts
Copy link
Copy Markdown

jeroenleenarts commented May 30, 2025

Just making sure, this does not load the entire backup file into memory before uploading?

That was one of the issues @frenck had to fix in his implementations. But he does use a different way of uploading the data, he uses upload_bytes I think.

And is there a convenient way to test this on a production setting? I could try and extract this implementation and run it on my production home assistant install as a test. But not sure what the steps required for that are.

(I did it once for the tado integration when the quite suddenly changed their login mechanism, breaking the production tado integrations. Ran the version in a pull request for a week or two that way.)

Comment thread homeassistant/components/backblaze/__init__.py Outdated
Comment thread homeassistant/components/backblaze/__init__.py Outdated
Comment thread homeassistant/components/backblaze/__init__.py Outdated
Comment thread homeassistant/components/backblaze/__init__.py Outdated
Comment thread homeassistant/components/backblaze/config_flow.py Outdated
@home-assistant home-assistant Bot marked this pull request as draft May 30, 2025 18:08
@home-assistant
Copy link
Copy Markdown
Contributor

Please take a look at the requested changes, and use the Ready for review button when you are done, thanks 👍

Learn more about our pull request process.

@hugo-vrijswijk
Copy link
Copy Markdown
Contributor Author

Just making sure, this does not load the entire backup file into memory before uploading?

That was one of the issues @frenck had to fix in his implementations. But he does use a different way of uploading the data, he uses upload_bytes I think.

Yes, the upload_bytes I think requires you to load the whole thing into memory. That was my initial implementation. But it should now stream without loading the whole backup into memory. The downside is there's no way of indicating how big the stream will be, and thus have some checks on the upload size from backblaze.

And is there a convenient way to test this on a production setting? I could try and extract this implementation and run it on my production home assistant install as a test. But not sure what the steps required for that are.

(I did it once for the tado integration when the quite suddenly changed their login mechanism, breaking the production tado integrations. Ran the version in a pull request for a week or two that way.)

No idea. I'd be happy to test this on my own actual HA instance if you find a way

@hugo-vrijswijk hugo-vrijswijk marked this pull request as ready for review June 2, 2025 11:18
@home-assistant home-assistant Bot requested a review from joostlek June 2, 2025 11:18
@hugo-vrijswijk
Copy link
Copy Markdown
Contributor Author

I'm not exactly sure why there are tests failing (on only python 3.13). The assertion diff is not an object i'm creating directly

Comment thread homeassistant/components/backblaze/__init__.py Outdated
Comment thread homeassistant/components/backblaze/config_flow.py Outdated
Comment on lines +39 to +40
patch("b2sdk.v2.B2Api", return_value=sim) as mock_client,
patch("homeassistant.components.backblaze.B2Api", return_value=sim),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should only patch libraries where we use them, so the first patch is incorrect, while the second patch is a correct one. I think you want to patch the first one to homeassistant.components.backblaze.config_flow.B2Api

@pytest.fixture(autouse=True)
def b2_fixture():
"""Create account and application keys."""
sim = RawSimulator()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this exactly?

Comment on lines +77 to +134
application_key: str = key["applicationKey"]

bucket = sim.create_bucket(
api_url=api_url,
account_id=account_id,
account_auth_token=auth_token,
bucket_name=USER_INPUT[CONF_BUCKET],
bucket_type="allPrivate",
)

# Create a test backup
test_backup_data = b"backup data"
upload_url = sim.get_upload_url(api_url, auth_token, bucket["bucketId"])
stream = io.BytesIO(test_backup_data)
stream.seek(0)

filename = TEST_BACKUP.name
sha1 = hashlib.sha1(test_backup_data).hexdigest()
file = sim.upload_file(
upload_url["uploadUrl"],
upload_url["authorizationToken"],
filename,
len(test_backup_data),
"application/octet-stream",
sha1,
BACKUP_METADATA,
stream,
)

def ls(
self,
prefix: str = "",
) -> list[tuple[FileVersion, str]]:
"""List files in the bucket."""
return [
(
FileVersion(
sim,
file["fileId"],
file["fileName"],
file["contentLength"],
"application/octet-stream",
sha1,
BACKUP_METADATA,
file["uploadTimestamp"],
file["accountId"],
file["bucketId"],
"action",
None,
None,
),
file["fileName"],
)
]

BucketSimulator.ls = ls

yield BackblazeFixture(application_key_id, application_key, bucket, sim, auth)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a little clue what happens here, but I am wondering why you keep using certain parts of the library. I personally always like to completely patch out the library and this way our code does not use library code (except for some dataclasses and mapping logic to create many objects).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think by using more mocks you can remove more inline patches, which is a big improvement imo

Comment thread tests/components/backblaze/conftest.py Outdated
Comment thread tests/components/backblaze/test_config_flow.py Outdated
Comment thread tests/components/backblaze/test_config_flow.py
Comment on lines +84 to +126
async def test_form_invalid_auth(hass: HomeAssistant) -> None:
"""Test config flow."""
result = await _async_start_flow(hass, "invalid", "invalid")
assert result.get("type") is FlowResultType.FORM
assert result.get("errors") == {"base": "invalid_credentials"}


async def test_form_invalid_bucket_name(
hass: HomeAssistant,
b2_fixture: BackblazeFixture,
) -> None:
"""Test config flow."""
result = await _async_start_flow(
hass,
b2_fixture.key_id,
b2_fixture.application_key,
{
**USER_INPUT,
"bucket": "invalid-bucket-name",
},
)
assert result.get("type") is FlowResultType.FORM
assert result.get("errors") == {"bucket": "invalid_bucket_name"}


async def test_form_cannot_connect(
hass: HomeAssistant,
b2_fixture: BackblazeFixture,
) -> None:
"""Test config flow."""
with patch(
"b2sdk.v2.RawSimulator.authorize_account",
side_effect=exception.ConnectionReset("test"),
):
result = await _async_start_flow(
hass,
b2_fixture.key_id,
b2_fixture.application_key,
USER_INPUT,
)

assert result.get("type") is FlowResultType.FORM
assert result.get("errors") == {"base": "cannot_connect"}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's finish config flow tests so we know they are able to recover once they experienced an error

) -> None:
"""Test config flow."""
with patch(
"b2sdk.v2.RawSimulator.get_bucket_by_name",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could use the mock for this I think

@home-assistant home-assistant Bot marked this pull request as draft June 12, 2025 14:34
Comment thread homeassistant/components/backblaze/strings.json Outdated
Comment on lines +107 to +108
downloaded_file = await self._hass.async_add_executor_job(file.download)
response = downloaded_file.response
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct me if I'm wrong, but this sounds like you now have the entire file in memory?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this is the http response object, not the body of the response, which we start streaming after this

Comment on lines +145 to +155
r_fd, w_fd = os.pipe()

async def writer() -> None:
"""Write async stream to the pipe."""
with os.fdopen(w_fd, "wb") as w:
async for chunk in stream:
w.write(chunk)
w.close()

# Schedule the writer coroutine
writer_task = self.async_create_task(self._hass, writer())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this also sounds dangerous to me? Who guarantees we're not writing the entire file to the pipe, before the other task is able to upload (enough) chunks?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think you might be right. I'm not sure how the streams work internally, but if they are push-based that is possible (but also an issue with any other implementation?)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iirc none of the other implementations are using a separate task for this, but do read and write for a single chunk together

self._bucket.upload_unbound_stream(
r,
filename,
file_info=file_info,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the maximum metadata size that backblaze allows?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the docs:

Each key is a UTF-8 string up to 50 bytes. There is an overall 7000-byte limit on the headers that are needed for file name and file information, unless the file is uploaded with server-side encryption in which case the limit is 2048 bytes.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that'll be a problem then. Not 100% sure what a key is in that context, but aren't you dropping the entire serialized json into one key? In any case even 2kB are not safe enough, which means we'd need to use metadata files in those cases.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the docs might be wrong (or I misunderstood). The backup I have been uploading has 370 bytes of metadata and that's been working fine. I don't know how big the metadata can get

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially bigger than 2kB. I think with anything bigger than 4 we should be safe, but 2 is not enough.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

7kb is the limit for files when server-side encryption is off (including the filename). I can add a note to the docs that server-side encryption is not recommended for larger instances, because of this limit

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather switch to file based metadata (like AWS S3 does it as well), instead of discouraging users to opt out of a security feature.

@ElCruncharino
Copy link
Copy Markdown
Contributor

FYI. I've been following this conversation, and submitted #149627 based on the feedback here. Thank you @hugo-vrijswijk. Happy to merge into you branch if that's the preference here, since I didn't want to duplicate your brand and documentation pull requests.

@emontnemery emontnemery changed the title Add Backblaze integration Add Backblaze backup integration Sep 17, 2025
@emontnemery
Copy link
Copy Markdown
Contributor

I'm closing this since #149627 which completed the work started in this PR has been merged.
Thanks a lot @hugo-vrijswijk for initiating it 👍

@github-actions github-actions Bot locked and limited conversation to collaborators Oct 31, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants