Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DFSM] Support live updates on compute nodes. #6003

Merged

Conversation

gmarciani
Copy link
Contributor

@gmarciani gmarciani commented Jan 10, 2024

Description of changes

Add support for live updates on compute nodes, using cfn-hup. In particular:

  • Move definition of dna.json from compute user data to compute launch template. In this way changes to cluster config file, which are reflected to the dna.json, are reflected on the launch template and so detectable by cfn-hup.
  • Add cfn-hup configuration files.
  • Add cluster config versioning to the dna.json.
  • Add permissions for compute nodes to describe the resources in the compute fleet nested stack, required by cfn-init and cfn-hup to describe the stack and detect updates.
  • Add permissions for compute nodes to read/write items in the DDB table, required to put the version of the cluster config file deployed in the nodes.
  • Relaxed the unit test test_cluster_config_limits to unblock these changes by reducing the maximum number of queues validated by the test from 50 to 46. We will revert this relaxation in a follow up PR where we will be able to move the content causing the limit breach from the template to the AMI.
  • Added new unit tests covering IAM permissions for compute nodes and generation of dna.json (generation of cfn-hup config file intentionally not covered by unit tests because we should move it to cookbook in the short term).
  • Ignore bandit rules B036 and B038, that started failing in this PR but not due to the changes included in this PR (all B038 findings are also false positive)

This PR relies on aws/aws-parallelcluster-cookbook#2614

Tests

  • Existing unit tests (some of them are still failing, this is the reason why the Pr is still in draft)
  • New unit tests covering IAM permissions for compute nodes and generation of dna.json.
  • Manual testing: cluster creation + forced update verifying that the update succeeds and the log line proving it is emitted see [DFSM] Support live updates on compute nodes. aws-parallelcluster-cookbook#2614).
  • Integ test test test_update_slurm succeeded.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@gmarciani gmarciani added skip-changelog-update Disables the check that enforces changelog updates in PRs 3.x labels Jan 10, 2024
@gmarciani gmarciani force-pushed the wip/mgiacomo/390/dfsm2/forced-update-0108 branch 2 times, most recently from a0bada2 to 0955b3c Compare January 15, 2024 14:32
@gmarciani gmarciani force-pushed the wip/mgiacomo/390/dfsm2/forced-update-0108 branch 2 times, most recently from 4de4228 to f786235 Compare January 15, 2024 18:14
@gmarciani gmarciani added the skip-security-exclusions-check Skip the checks regarding the security exclusions label Jan 15, 2024
@gmarciani gmarciani force-pushed the wip/mgiacomo/390/dfsm2/forced-update-0108 branch from f786235 to 0848dd7 Compare January 15, 2024 21:39
Copy link

codecov bot commented Jan 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (2b10b88) 90.20% compared to head (088ce00) 90.21%.

❗ Current head 088ce00 differs from pull request most recent head 9ad63df. Consider uploading reports for the commit 9ad63df to get more accurate results

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #6003      +/-   ##
===========================================
+ Coverage    90.20%   90.21%   +0.01%     
===========================================
  Files          180      180              
  Lines        15808    15818      +10     
===========================================
+ Hits         14259    14270      +11     
+ Misses        1549     1548       -1     
Flag Coverage Δ
unittests 90.21% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@gmarciani gmarciani marked this pull request as ready for review January 15, 2024 21:46
@gmarciani gmarciani requested review from a team as code owners January 15, 2024 21:46
@gmarciani gmarciani force-pushed the wip/mgiacomo/390/dfsm2/forced-update-0108 branch from 0848dd7 to c0c3184 Compare January 16, 2024 11:19
@gmarciani gmarciani enabled auto-merge (rebase) January 16, 2024 12:16
@gmarciani gmarciani force-pushed the wip/mgiacomo/390/dfsm2/forced-update-0108 branch from 40f5bd7 to c0c3184 Compare January 16, 2024 13:52
@gmarciani gmarciani force-pushed the wip/mgiacomo/390/dfsm2/forced-update-0108 branch from c0c3184 to 088ce00 Compare January 16, 2024 13:53
Copy link
Contributor

@enrico-usai enrico-usai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good to me. I verified that all the variables used in the user-data.sh of the compute node are still populated and that we just moved the variables required for dna.json.

@gmarciani gmarciani merged commit f22c6d0 into aws:develop Jan 16, 2024
27 checks passed
@gmarciani gmarciani deleted the wip/mgiacomo/390/dfsm2/forced-update-0108 branch January 17, 2024 12:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.x skip-changelog-update Disables the check that enforces changelog updates in PRs skip-security-exclusions-check Skip the checks regarding the security exclusions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants