vtbackup, mysqlctl: detailed backup and restore metrics #11979
deepthi merged 23 commits into vitessio:main
go/cmd/vtbackup/vtbackup.go
Because vtbackup also picks up stats from mysqlctl, there will be some overlap between this metric and those metrics. I think that's OK, personally, but if we want I can take more care to avoid any overlap.
go/vt/mysqlctl/backup.go
Not deleted, just moved to backupstats package.
go/vt/mysqlctl/backup.go
This is a bit awkward, and copying structs makes me uncomfortable. Is there a way to lint for exhaustive field copying?
Just thinking out loud here: can we have two stats, enginestats and storagestats, in backupparam, so you won't end up copying?
I think if we did that approach we might end up needing three stats, one for the engine, one for the storage, and one for the controlling function (mysqlctl.Backup).
Overall I think I am OK with the copy, and one thing I didn't know about Golang when I wrote the comment above is that this kind of struct creation is exhaustive:
a1 := A{
"hello",
true,
3,
}
It won't let you omit any struct fields. So I at least feel good about that. The only risk now is if someone swaps the order of two struct fields that have the same type 😬
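A minimal sketch of both points, using hypothetical types A and Pair that are not part of this PR: an unkeyed struct literal must list every field, so a copy written that way stops compiling when a field is added, but it cannot detect two same-typed fields trading places.

```go
package main

import "fmt"

// A is a hypothetical struct used only for illustration.
type A struct {
	Name    string
	Enabled bool
	Count   int
}

// Pair is a hypothetical struct with two fields of the same type,
// which is where unkeyed literals become risky.
type Pair struct {
	Src string
	Dst string
}

// copyA copies every field of a with an unkeyed literal. If a field is
// later added to A, this literal no longer compiles until it is updated.
func copyA(a A) A {
	return A{
		a.Name,
		a.Enabled,
		a.Count,
	}
}

func main() {
	a1 := copyA(A{"hello", true, 3})
	fmt.Println(a1.Name, a1.Enabled, a1.Count)

	// The compiler cannot catch this case: if Src and Dst swap positions in
	// the type definition, this literal still compiles, but the values end
	// up in the wrong fields.
	p := Pair{"from", "to"}
	fmt.Println(p.Src, p.Dst)
}
```

Keyed literals (Pair{Src: "from", Dst: "to"}) avoid the reordering risk, at the cost of losing the exhaustiveness check.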
Not strictly necessary, but makes a bit easier to avoid nil dereferences.
If Stats is not set then mysqlctl will use backupstats.NopStats. This way any out-of-tree code can opt in to the new stats, or not (and not have to worry about nil dereference errors).
Signed-off-by: Max Englander <max@planetscale.com>
go/ioutil/meter.go
	duration time.Duration
}

// Bytes reports the total bytes read in calls to f so far.
Hm this refers to the argument f in func measure, but it's not very helpful here. Let me reword this.
maxenglander left a comment
Improve code comments in ioutil/meter
Signed-off-by: Max Englander <max.englander@gmail.com>
maxenglander left a comment
@rsajwani thanks for all the helpful suggestions. I think I addressed all of your feedback, ready for another look!
rsajwani left a comment
LGTM. Thanks Max. This is awesome work.
Description
Addresses #11977.
I would like to have better instrumentation on backups, in particular backups generated by vtbackup. While backup stats are exposed via servenv since #11388 (which is great!), ideally I would like more fine-grained stats on:
Design
There were a couple of design goals that shaped the way the code is written.
No breaking interface changes
After discussion with Deepthi we decided to introduce a breaking change after all.
A quick Google/GitHub search didn't reveal any out-of-tree backupstorage plugins, but the way the backup/restore APIs are laid out makes it seem like they were designed to support out-of-tree plugins.
I tried to keep that in mind when writing this PR, in particular by not making any changes that would require anyone using out-of-tree plugins to make code changes when they upgrade to a Vitess version with these changes.
Separate policy and mechanism
It wouldn't be great if every out-of-tree backupstorage generated metrics in ways that conflicted with each other or varied widely from the way Vitess users are used to consuming in-tree metrics.
In this PR, I tried to create a minimal stats mechanism that can be used by in-tree and out-of-tree code, but where the policies for stats (metric names and labels, stats sink, etc.) are kept in-tree and under the control of the Vitess user.
This approach seems similar in spirit to what is already being done with BackupParams.Logger and RestoreParams.Logger.
Changes
This PR adds several new metrics:
- vtbackup_duration_by_phase_seconds with phase label
- {vtbackup,vttablet}_backup_bytes with component, implementation, and operation labels
- {vtbackup,vttablet}_backup_count with component, implementation, and operation labels
- {vtbackup,vttablet}_backup_duration_nanoseconds with component, implementation, and operation labels
- {vtbackup,vttablet}_restore_bytes with component, implementation, and operation labels
- {vtbackup,vttablet}_restore_count with component, implementation, and operation labels
- {vtbackup,vttablet}_restore_duration_nanoseconds with component, implementation, and operation labels

It also deprecates these older backup/restore metrics:
- {vtbackup,vttablet}_backup_duration_seconds
- {vtbackup,vttablet}_restore_duration_seconds

Notes
Changes to vtbackup:
- Adds a duration_seconds metric which reports durations of additional phases not covered by mysqlctl.
- Phases include initmysqld, initialbackup, restorelastbackup, catchupreplication, etc.

Changes to mysqlctl:
- Adds backup_bytes, backup_count, backup_duration_nanoseconds, restore_bytes, restore_count, restore_duration_nanoseconds metrics.
- Each metric is labeled with component, implementation, and operation.
- The labels distinguish measurements across different components (- = unscoped, top-level, backupstorage, backupengine, etc.), across different implementations (s3, file, etc.), and across different operations (backup, restore, read, compress, encrypt).

Other notes:
- Why nanoseconds on those two new metrics? Because as we're reporting on read/write times for individual files, if all the files are small and take less than a second to process then they end up reporting as zero.
- Why backupengine.(Parameterizable)? I wasn't sure how safe it would be to break any of the APIs like BackupEngine and BackupStorage. Figured it was better to introduce changes this way until I get some guidance.

Samples
Sample metrics generated by running vtbackup from this branch against the local example cluster. Processed with jq and sorted for readability.
Performance
At @deepthi's suggestion, I compared performance of backups on main versus this branch.
I set up the commerce example cluster, created a table with ~20 GiB of data, then ran:
Multiple times with this branch and main, comparing the values of vttablet_backup_duration_seconds.
main branch (7fc1b48)
Assuming the differences aren't due to vagaries of CPU and disk usage on my Mac M1, this branch adds roughly 3.8% performance overhead.
Out-of-scope
This PR doesn't add new metrics to all backup engines or storage engines. Would like to get buy-in on this approach (or a different one) first, and then expand whatever approach we adopt in follow-on PRs to cover additional backup engines & storages.