DOC-967 Add more details to the node upgrade doc for Kubernetes #960

Merged: JakeSCahill merged 15 commits into main from new-k8-upgrade on Feb 21, 2025


@JakeSCahill (Contributor) commented Jan 22, 2025

Description

Review deadline: 25 Jan

We recently had a P1 involving data loss on ephemeral storage, caused by how Azure handles automated node upgrades. We already added a clarification to the docs recommending that automated node upgrades be disabled: https://redpandadata.atlassian.net/browse/DOC-875

However, @chrisseto and I met and discussed improvements we can make to the node upgrade guide, including:

  • Make sure to disable the Decommission/NodeWatcher controllers before upgrading so that they don't interfere with the manual steps.
  • Provide step-by-step instructions for the network-backed PV and ephemeral storage cases.

Also fixes https://redpandadata.atlassian.net/browse/DOC-170

Page previews

https://deploy-preview-960--redpanda-docs-preview.netlify.app/current/upgrade/k-upgrade-kubernetes/

Checks

  • New feature
  • Content gap
  • Support Follow-up
  • Small fix (typos, links, copyedits, etc)

@JakeSCahill JakeSCahill requested a review from a team as a code owner January 22, 2025 14:37
@JakeSCahill JakeSCahill requested a review from chrisseto January 22, 2025 14:37

netlify bot commented Jan 22, 2025

Deploy Preview for redpanda-docs-preview ready!

🔨 Latest commit: 79314ed
🔍 Latest deploy log: https://app.netlify.com/sites/redpanda-docs-preview/deploys/67b890551553b50008bf0e41
😎 Deploy Preview: https://deploy-preview-960--redpanda-docs-preview.netlify.app

@Feediver1 (Contributor) left a comment


Please take a look at my comment about the diagram. Otherwise, well written!

@chrisseto (Contributor) commented
Sorry for taking so long on this one. IIRC, this ask was kicked off while we were trying to determine how best to migrate an Azure cluster, for which we had written up some instructions. That operation ended up going the worst way imaginable, which got me thinking more deeply about this topic, especially because the self-hosted landscape is much broader than cloud.

Here's a heavily annotated flow chart that should capture most cases. It doesn't cover how to perform the operation manually, only whether it needs to be manual 😓

@chrisseto (Contributor) left a comment


Reviewed with Jake on a hangout. We came away with a few changes to make, but I'm marking myself as an approver so that I don't accidentally become a blocker.

Loose notes:

  • Use kubectl drain instead of kubectl delete pod <pod name>, because drain cordons the node and evicts its pods through the Eviction API.
  • Changing the StatefulSet update strategy to OnDelete is generally useful, and it should be required if taints, tolerations, or node selectors need to be updated.
  • The guides for node pool upgrades and Kubernetes upgrades can be consolidated, because they're largely the same operation.
  • Adding a buffer node isn't required for network-backed storage, but having replicas >= 3 is.
  • For local volumes, the flow is: kubectl drain <node name>, kubectl delete pvc <pvc name>, kubectl delete pod <pod name>.
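The notes above describe two kubectl operations: switching the StatefulSet to OnDelete, and the drain, delete PVC, delete pod flow for local volumes. What follows is a minimal Python sketch that assembles those commands. It is not the documented procedure. It assumes a Helm-style install where the StatefulSet is named redpanda in the redpanda namespace and each broker's PVC is named datadir-redpanda-<ordinal>. The helper names (on_delete_patch, local_volume_replacement, run), the node name worker-1, and the extra flags (--ignore-daemonsets, --delete-emptydir-data, --wait=false) are illustrative assumptions. By default the sketch only prints the commands. It runs them only when dry_run=False.

```python
import json
import shlex
import subprocess


def on_delete_patch(statefulset, namespace):
    """Build a kubectl patch that switches a StatefulSet to the OnDelete strategy."""
    # rollingUpdate is only valid with RollingUpdate, so clear it (null in a merge patch).
    patch = {"spec": {"updateStrategy": {"type": "OnDelete", "rollingUpdate": None}}}
    return ["kubectl", "patch", "statefulset", statefulset,
            "--namespace", namespace, "--type", "merge", "-p", json.dumps(patch)]


def local_volume_replacement(node, pvc, pod, namespace):
    """Build the drain / delete PVC / delete pod sequence for one broker on local storage."""
    return [
        # Cordon the node and evict its pods through the Eviction API,
        # which respects PodDisruptionBudgets.
        ["kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data"],
        # Delete the claim bound to the node-local PV. It stays in Terminating
        # while a pod still references it, so don't block waiting for it.
        ["kubectl", "delete", "pvc", pvc, "--namespace", namespace, "--wait=false"],
        # Delete the pod so the StatefulSet recreates it with a fresh PVC.
        ["kubectl", "delete", "pod", pod, "--namespace", namespace],
    ]


def run(commands, dry_run=True):
    """Print each command, and execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical names: replace them with the values from your cluster.
    steps = [on_delete_patch("redpanda", "redpanda")]
    steps += local_volume_replacement("worker-1", "datadir-redpanda-0", "redpanda-0", "redpanda")
    run(steps)  # dry run: prints the commands without executing them
```

Repeat the replacement for one broker at a time, and wait for the cluster to report healthy (for example, with rpk cluster health) before moving on to the next node. The PVC deletion uses --wait=false because the kubernetes.io/pvc-protection finalizer keeps the claim in Terminating until the pod that references it is deleted in the following step. If Helm or the operator manages the StatefulSet, a later reconcile may revert a manual patch, so setting the update strategy through that tool's configuration may be more durable.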

@JakeSCahill JakeSCahill merged commit 9be44a0 into main Feb 21, 2025
7 checks passed
@JakeSCahill JakeSCahill deleted the new-k8-upgrade branch February 21, 2025 14:44
3 participants