Skip to content

go/vt/mysqlctl{/backupstats}: throttle backupstats for perf#12385

Closed
maxenglander wants to merge 1 commit intovitessio:mainfrom
planetscale:maxeng-fix-backup-stats-perf
Closed

go/vt/mysqlctl{/backupstats}: throttle backupstats for perf#12385
maxenglander wants to merge 1 commit intovitessio:mainfrom
planetscale:maxeng-fix-backup-stats-perf

Conversation

@maxenglander
Copy link
Collaborator

@maxenglander maxenglander commented Feb 17, 2023

Description

#11979 added detailed backup/restore metrics, but appeared to add ~4% overhead in a local test. I'm no longer able to reproduce the findings of that test.

In spite of that, this PR attempts to reduce any overhead by throttling stats calls.

Takes inspiration from logutil.ThrottledLogger, but does not drop any stats. Instead batches and and combines unflushed stats and then flushes them either when Flush() is called by the user, or a subsequent stats call occurs after a user-specified time interval.

Performance

A repeat of the performance test in #11979.

I set up the commerce example cluster, and created a table with ~30 GiB of data by repeatedly inserting REPEAT(UUID(),10), and then ran...

$ vtctlclient Backup -- --allow_primary <tablet>

...multiple times, collecting the value of vttablet_backup_duration_seconds.

Did this with the main commit before backupstats was introduced, the main commit when backupstats was introduced, and the HEAD of this PR.

# main (7fc1b48) main (d5364e6) this PR (65ac8fe2)
1 89 90 89
2 89 90 89
3 90 89 88
4 91 89 90
5 89 89 91
Δ N/A -0.22% -0.22%

In the last PR I found that d5364e6 added a bit of overhead 7fc1b48.

adds a roughly 3.8% performance overhead.

In this PR, it's not evident that d5364e6 adds any overhead. It's also not evident that 65ac8fe2 is any improvement to d5364e6.

I'm not sure it makes sense to merge this code, since it's not clear that there's anything to fix.

One possible explanation for getting misleading results in the last PR. I used BackupShard in the last PR whereas I'm using Backup against a specific tablet in this PR. Correct me if I'm wrong but I think that BackupShard picks a random replica tablet. Maybe small differences between replicas could result in different performance results from runs against randomly selected replicas.

Related Issue(s)

#11977
#11979

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

@vitess-bot vitess-bot bot added NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels Feb 17, 2023
@vitess-bot
Copy link
Contributor

vitess-bot bot commented Feb 17, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be a documentation on top of the test to explain what the expected behavior is what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@maxenglander maxenglander force-pushed the maxeng-fix-backup-stats-perf branch from 0eaeeb4 to 65ac8fe Compare February 17, 2023 16:52
@maxenglander maxenglander changed the title checkpoint go/vt/mysqlctl{/backupstats}: throttle backupstats for perf Feb 17, 2023
@maxenglander maxenglander marked this pull request as ready for review February 21, 2023 16:18
@deepthi
Copy link
Collaborator

deepthi commented Feb 27, 2023

If there is no difference in performance, I'd like us to hold off on merging this PR.
Maybe when you implement the same metrics for s3, you can re-do the benchmarks, and let's see if this makes a difference?

@maxenglander
Copy link
Collaborator Author

Returning this PR to draft until we have compelling evidence that it's needed.

@maxenglander maxenglander marked this pull request as draft February 27, 2023 17:50
Signed-off-by: Max Englander <max@planetscale.com>
@maxenglander maxenglander force-pushed the maxeng-fix-backup-stats-perf branch from 65ac8fe to d865a29 Compare March 17, 2023 04:29
@github-actions
Copy link
Contributor

This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:

  • Push additional commits to the associated branch.
  • Remove the stale label.
  • Add a comment indicating why it is not stale.

If no action is taken within 7 days, this PR will be closed.

@github-actions github-actions bot added the Stale Marks PRs as stale after a period of inactivity, which are then closed after a grace period. label Apr 17, 2023
@github-actions
Copy link
Contributor

This PR was closed because it has been stale for 7 days with no activity.

@github-actions github-actions bot closed this Apr 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says Stale Marks PRs as stale after a period of inactivity, which are then closed after a grace period.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants