
[FEATURE] Selective V2 Data Engine Activation #7015

Closed
derekbit opened this issue Nov 1, 2023 · 5 comments
Assignees
Labels
area/v2-data-engine: v2 data engine (SPDK)
highlight: Important feature/issue to highlight
kind/feature: Feature request, new feature
priority/0: Must be implemented or fixed in this release (managed by PO)
require/auto-e2e-test: Require adding/updating auto e2e test cases if they can be automated
require/backport: Require backport. Only used when the specific versions to backport have not been defined.
require/doc: Require updating the longhorn.io documentation
Milestone

Comments

@derekbit
Member

derekbit commented Nov 1, 2023

Is your improvement request related to a feature? Please describe (👍 if you like this request)

In a large cluster, both powerful nodes and low-spec nodes exist, and many kinds of applications run inside it. Currently, the v2-data-engine setting enables the instance-manager pod for v2 volumes on all Longhorn nodes regardless of the machines' specs. To address this, the v2 data engine can be activated selectively through the global setting v2-data-engine combined with per-node labels, annotations, or spec fields.

This ticket can be extended to the instance-manager pod for v1 volumes in the future.
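As a rough operational sketch of what this proposal describes (the setting name v2-data-engine and the node label node.longhorn.io/disable-v2-data-engine come from this issue; the node name "worker-3" and the patch invocation are assumptions):

```shell
# Illustrative only; "worker-3" and the exact setting-update command
# are assumptions for this sketch.

# Enable the v2 data engine cluster-wide through the Longhorn setting:
kubectl -n longhorn-system patch settings.longhorn.io v2-data-engine \
  --type merge -p '{"value":"true"}'

# Opt a low-spec node out of the v2 data engine with the per-node label:
kubectl label node worker-3 node.longhorn.io/disable-v2-data-engine=true

# Opt the node back in later by removing the label:
kubectl label node worker-3 node.longhorn.io/disable-v2-data-engine-
```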

Describe the solution you'd like

Describe alternatives you've considered

Additional context

cc @shuo-wu @innobead

@derekbit derekbit added priority/0 Must be implemented or fixed in this release (managed by PO) require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated require/doc Require updating the longhorn.io documentation kind/improvement Request for improvement of existing function require/backport Require backport. Only used when the specific versions to backport have not been defined. labels Nov 1, 2023
@derekbit derekbit added this to the v1.6.0 milestone Nov 1, 2023
@derekbit derekbit added the area/v2-data-engine v2 data engine (SPDK) label Nov 1, 2023
@derekbit derekbit self-assigned this Nov 1, 2023
@longhorn-io-github-bot
Collaborator

longhorn-io-github-bot commented Nov 13, 2023

Pre Ready-For-Testing Checklist

  • Where are the reproduce/test steps documented?
    The reproduce/test steps are at:
  1. Create a 3-node Longhorn cluster and enable both v1-data-engine and v2-data-engine
  2. Create v1 and v2 volumes with 2 replicas each
  3. Add the label node.longhorn.io/disable-v2-data-engine: "true" to two of the Kubernetes nodes
  4. IMs and their pods for v1 volumes should not be impacted.
  5. For IMs and their pods for v2 volumes, if an IM on a labeled node holds no replicas or engines, it should be deleted.
  6. Detach all v2 volumes. After the v2 volumes are detached, the IMs and pods on the labeled nodes should be deleted.
  7. Remove the label; the deleted IMs and pods should be recreated.
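The label-based steps above could be driven from the command line roughly as follows (the node names are placeholders, and the instance-manager pod label selector is an assumption, not taken from this issue):

```shell
# Placeholder node names; run against a test cluster only.
for n in node-2 node-3; do
  kubectl label node "$n" node.longhorn.io/disable-v2-data-engine=true
done

# Watch instance-manager pods: v2 IMs on the labeled nodes should be
# deleted once they no longer hold any engine or replica (for example,
# after all v2 volumes are detached). The selector below is an assumption.
kubectl -n longhorn-system get pods \
  -l longhorn.io/component=instance-manager -o wide

# Removing the label should cause the deleted IMs and pods to be recreated:
kubectl label node node-2 node.longhorn.io/disable-v2-data-engine-
kubectl label node node-3 node.longhorn.io/disable-v2-data-engine-
```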
  • Does the PR include the explanation for the fix or the feature?

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore, etc.) (including backport-needed/*)?
    The PR is at

longhorn/longhorn-manager#2292

  • Which areas/issues might this PR have potential impacts on?
    Area: v2 volume, instance manager
    Issues

@innobead innobead added the highlight Important feature/issue to highlight label Jan 3, 2024
@innobead innobead changed the title [IMPROVEMENT] Support instance-manager pod for v2 volumes on selected nodes [FEATURE] Support instance-manager pod for v2 volumes on selected nodes Jan 3, 2024
@innobead innobead added kind/feature Feature request, new feature and removed kind/improvement Request for improvement of existing function labels Jan 3, 2024
@chriscchien chriscchien self-assigned this Jan 4, 2024
@chriscchien
Contributor

Verified passing on Longhorn master (longhorn-manager 970ba4).

Following the test steps, I did not encounter any problems. Closing this ticket, thank you.

@khushboo-rancher
Contributor

@chriscchien Some priority 1/2 scenarios that can be further tested:

Prerequisite: have a v1 volume with three replicas and its data checksum computed.

  1. Disable the v2 data engine on all nodes, then create a v2 volume. The volume should show as unschedulable. Delete the v2 disable label and verify the volume becomes schedulable.
  2. Create a v2 volume, then add the v2 disable label to one of the replica nodes. Crash the IM on that replica node and check the replica behavior.
  3. Create a v2 volume, then add the v2 disable label to the node the volume is attached to. Crash the IM on the attached node and check the replica behavior.
  4. Create a v2 volume with 2 replicas and trigger replica rebuilding. While rebuilding is in progress, add the v2 disable label to the rebuilding replica node. Verify that the rebuild finishes successfully.

The v1 volume should not be impacted in any of the above scenarios.
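Scenario 1 above could be exercised roughly like this (a sketch for a test cluster; the per-node label comes from this issue, while the exact resource names shown by the get command are assumptions):

```shell
# Disable the v2 data engine on every node (test cluster only):
for n in $(kubectl get nodes -o name); do
  kubectl label "$n" node.longhorn.io/disable-v2-data-engine=true
done

# Create a v2 volume (via the Longhorn UI or a Volume manifest) and
# confirm it is reported as unschedulable:
kubectl -n longhorn-system get volumes.longhorn.io

# Remove the labels and verify the volume becomes schedulable again:
for n in $(kubectl get nodes -o name); do
  kubectl label "$n" node.longhorn.io/disable-v2-data-engine-
done
```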

@chriscchien
Contributor


Hi @khushboo-rancher,

  1. The v2 volume becomes schedulable after removing the label or setting it to false.
  2. The replica on the crashed IM is gone and the volume becomes degraded. After removing the label (or setting it to false) and performing offline rebuilding, all replicas are ready and the data is intact.
  3. The replica on the crashed IM is gone; the volume tries to reattach to the same node but gets stuck attaching. After removing the label (or setting it to false), the volume attaches and the data is intact.
  4. Following the steps, the replica rebuild succeeds.

After all of the above tests completed, the v1 volume remained healthy and its data was intact.

@innobead
Member

@chriscchien let's automate these cases if they haven't been implemented. Create a ticket for it.

@derekbit derekbit changed the title [FEATURE] Support instance-manager pod for v2 volumes on selected nodes [FEATURE] Selective V2 Data Engine Activation Jan 14, 2024
@derekbit derekbit moved this to Closed in Longhorn Sprint Aug 3, 2024