
Fix s3 backups of tables > 50gb #3844

Merged
sougou merged 1 commit into vitessio:master from HubSpot:backup_partsize
May 1, 2018

Conversation

@hmcgonig
Contributor

@hmcgonig hmcgonig commented Apr 18, 2018

By default the s3 uploader allows at most 10,000 parts with a default part size of 5 MB, so the largest file it can upload successfully is 10,000 × 5 MB = 50 GB.
Backing up any table with more than 50 GB of data fails with:

E0417 14:54:36.546921   50578 main.go:58] E0417 14:54:36.546467 vtctl.go:1042] E0417 14:54:36.545616 backup.go:243] backup is not usable, aborting it: cannot copy data: MultipartUpload: upload multipart failed
    upload id: crmzwJ5gN7OhgpKuwrre4lNl9QDpudJeaiOdSvvKj04NZho_NRogaIoOFE62iqpghEEpawgEkZmk2NYDifXUYoTCOApvV7nUn7trb3HE4KaWt_bAbeV3uAxdqnMlI_11
caused by: TotalPartsExceeded: exceeded total allowed configured MaxUploadParts (10000). Adjust PartSize to fit in this limit
E0417 14:54:37.454383   50578 main.go:61] Remote error: rpc error: code = Unknown desc = TabletManager.Backup on prod-0368789102 error: cannot copy data: MultipartUpload: upload multipart failed
    upload id: crmzwJ5gN7OhgpKuwrre4lNl9QDpudJeaiOdSvvKj04NZho_NRogaIoOFE62iqpghEEpawgEkZmk2NYDifXUYoTCOApvV7nUn7trb3HE4KaWt_bAbeV3uAxdqnMlI_11
caused by: TotalPartsExceeded: exceeded total allowed configured MaxUploadParts (10000). Adjust PartSize to fit in this limit

This PR dynamically adjusts the uploader part size based on the size of the source file: it computes filesize / max parts and rounds that up to the nearest MB. I tested this successfully on a db with a 120gb table.

Signed-off-by: Harrison McGonigal <hmcgonigal@hubspot.com>

// Calculate s3 upload part size using the source filesize
partSizeBytes := s3manager.DefaultUploadPartSize
if filesize > 0 {
	minimumPartSize := float64(filesize) / float64(s3manager.MaxUploadParts)
	if int64(minimumPartSize) > partSizeBytes {
		partSizeBytes = int64(math.Ceil(minimumPartSize/1024/1024)) * 1024 * 1024
	}
}
Contributor

I'm not sure about dynamically changing the part size without specifying a max limit. The problem with tuning the part size is that each part now has to fit in memory, multiplied by the concurrency. So I think there should be a flag for the max part size and a flag for max upload parts. Using those two along with concurrency, someone can tune the memory footprint of their backups.

Contributor Author

Given a 120gb table, that would be 12mb parts. For that to have any kind of significant impact you would have to set concurrency pretty high.

I didn't think this change would affect anyone, given that tables under 50gb would just be using the same 5mb default part size as before and tables over 50gb currently don't even upload.
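For reference, that arithmetic can be checked with a small standalone sketch. The constants below mirror the aws-sdk-go s3manager defaults (10,000 parts, 5 MiB); the `partSize` helper is illustrative, not the actual patch:

```go
package main

import (
	"fmt"
	"math"
)

// Constants mirroring aws-sdk-go's s3manager defaults:
// at most 10,000 parts, 5 MiB default part size.
const (
	maxUploadParts        = 10000
	defaultUploadPartSize = 5 * 1024 * 1024
)

// partSize returns the default part size unless filesize/maxUploadParts
// would exceed it, in which case the minimum viable part size is
// rounded up to the nearest MiB.
func partSize(filesize int64) int64 {
	size := int64(defaultUploadPartSize)
	if filesize > 0 {
		minimum := float64(filesize) / float64(maxUploadParts)
		if minimum > float64(size) {
			size = int64(math.Ceil(minimum/(1024*1024))) * 1024 * 1024
		}
	}
	return size
}

func main() {
	fmt.Println(partSize(40 * 1024 * 1024 * 1024))  // 40 GiB table: 5242880 (keeps the 5 MiB default)
	fmt.Println(partSize(120 * 1000 * 1000 * 1000)) // 120 GB table: 12582912 (~12 MiB parts)
}
```

So a 120 GB file works out to roughly 12 MB parts, matching the estimate above, while anything under 50 GB keeps the 5 MB default.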

Contributor Author

I'd definitely like to hear what others have to say about their existing concurrency settings etc.

Contributor Author

Just to add, I was also cautious of introducing a blanket part-size flag, because that would allocate large upload buffers even for very small files. I really wanted to use a larger part size only for files that need it and leave the rest at the default of 5mb.

Contributor

Looks like MaxUploadParts is not changeable, so this is probably the only course of action. 🤞

@sougou
Contributor

sougou commented Apr 21, 2018

@bbeaudreault is this ready to go?

@hmcgonig
Contributor Author

This should be ready. We've been using this internally for about a week to allow backups on one of our larger dbs.

@bbeaudreault
Contributor

travis is failing, but this lgtm otherwise

@hmcgonig
Contributor Author

hmcgonig commented Apr 27, 2018

@sougou is there anything left on this one? Travis failures seem to be unrelated.

@sougou
Contributor

sougou commented Apr 27, 2018

I've been backlogged. Hopefully will have time this weekend to catch up. I've restarted the failed test for now.

@hmcgonig
Contributor Author

Thanks!
