Conversation
Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>
Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>
Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>
Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>
Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>
Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>
Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>
Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>
Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>
Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>
Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>
mattlord
left a comment
There was a problem hiding this comment.
LGTM! ❤️ I only had a few minor comments/questions.
pkg/controller/vitessbackupschedule/vitessbackupschedule_controller.go
Outdated
Show resolved
Hide resolved
Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>
deepthi
left a comment
There was a problem hiding this comment.
I only had a few comments on the diff in this PR, but it's unclear to me how AllowFirstBackup works. Is that functionality that was already present, and hence not showing up as a diff in this PR? If I'm just missing it, please point me to the file where I can see it.
| - “Allow” (default): allows CronJobs to run concurrently; | ||
| - “Forbid”: forbids concurrent runs, skipping next run if previous run hasn’t finished yet; | ||
| - “Allow”: allows CronJobs to run concurrently; | ||
| - “Forbid” (default): forbids concurrent runs, skipping next run if previous run hasn’t finished yet; |
docs/api/index.html
Outdated
| <em>(Optional)</em> | ||
| <p>A list of pointers to currently running jobs.</p> | ||
| <p>A list of pointers to currently running jobs. | ||
| This field is deprecated and no longer used in >= v2.15. It will be removed in a future release.</p> |
There was a problem hiding this comment.
Is this file generated? I'm not repeating my comments from api.md here.
There was a problem hiding this comment.
docs/api.md and docs/api/index.html are both generated using the same source.
|
@deepthi vtbackup pod. The decision whether to set the flag or not is done in vtbackupSpec.
|
|
We have this constant: That is used in both places, once for the vtbackup init pod, and another for the scheduled backup: This flag will be used if: we have no backup and we already have a primary running in the shard. |
Based on an out-of-band discussion, we should not allow people to take "first" backups on a long-running cluster. If they have not been using VItessBackupSchedules and they want to start using them, there must be a usable backup that can be used as the basis for the first scheduled backup. |
Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>
Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>
Context
This Pull Request modifies the scheduled backup feature that was introduced in #553. The implementation in the original PR is rather simple and was a "v1". The CRD
VitessBackupSchedulewas introduced, giving the option to end-users to configure differentVitessBackupScheduleStrategy, which instruct vitess-operator which keyspace/shard the user wants to backup. All the strategies in aVitessBackupSchedulewere merged together to form a singlevtctldclientcommand that would be executed in a K8S'sJobfollowing the schedule set in theVitessBackupScheduleCRD.While this approach works, it has limitation: it uses
vtctldclient Backupwhich will temporarily take out a serving tablet from the serving tablet pool and use it to take the backup. While this limitation is documented in the release notes, we want to encourage people to usevtbackupas an alternative.Proposed Change
This Pull Request modifies the
VitessBackupSchedulecontroller to create one KubernetesJobperVitessBackupScheduleStrategydefined theVitessBackupScheduleby the user - instead of merging all strategies into a singleJob. EachJobwill now create avtbackuppod - instead of creating a pod where avtctldclientcommand was executed.Changes
Since
vtbackupcan only backup a single keyspace/shard, we need oneJobper strategy, meaning that now eachVitessBackupSchedulemanages multipleJobresources (one per strategy), which was not the case before.The previous code logic relied heavily on the CRD's Status to schedule next runs, this logic was removed to make Status purely informative to only the end users, not the vitess-operator. Moreover, the previous logic was updating the Status of the resource on, almost, every request without knowing if we actually need to update the Status. This was modified to only update the Status when needed and make it ignore conflicts.
The reconciling loop of
VitessBackupSchedulewas, at times, not returning the correct result+error expected by the controller-runtimeTypedReconcilerinterface, leading to unnecessary re-queue of requests or missed re-queues. The logic has been modified to return correct values. A periodic resync of the controller has also been added to safeguard potential issues and to ensure resources (Jobs, PVCs, etc) are cleaned up correctly no matter what.The upgrade test has been modified to also run the
401_scheduled_backups.yamlmanifest.The default
ConcurrentPolicyof theJobshas been modified fromAllowtoForbid. With the new strategies requiring far more resources and time to complete, this new default makes sense. This value is still configurable by end-users.Previously, the
vtbackupcreated with everyVitessShardwas always attempting to do an initial backup (--initial_backup), which would fail if theVitessBackupdefinition was added after theVitessShardstarted running. That logic has been slightly modified so the vitess-operator is more aware of the current state of theVitessShardbefore creating thevtbackuppod. If there is already a primary in the shard, it will not be an initial backup, but an--allow_first_backupbackup - as we only create this pod if there is no backup for this shard. TheVitessBackupSchedulerelies on the same logic to determine the type of backup it has to run, either: initial, allow first, or normal. Concurrency issues between the initialvtbackuppod and the first scheduled run of aVitessBackupScheduleis, depending on the schedule, likely. In such situation bothvtbackuppods come up with the--initial_backupflag set to true. One of the two pods will simply fail early.Related PRs
vtbackupin scheduled backups vitessio/website#1946vtbackupin scheduled backups vitessio/vitess#17869