Skip to content

Conversation

@AgustinBettati
Copy link
Member

@AgustinBettati AgustinBettati commented Oct 31, 2025

Proposed changes

Jira ticket: CLOUDP-356110

Problem

Cluster resource does not support unpausing. This use case must be handled by defining a separate request only adjusting paused property due to how the API behaves.

Current implement supports pausing a cluster handling a separate request within update callback logic.

Solution

Before sending an update operation information of the complete model (which would result in an error) a check is made to see if the update operation is doing an unpause. Under this scenario an isolated PATCH request is crafted only sending paused property with false. Follow up PATCH request can be sent after this without encountering any errors (in case other changes are present).

Testing

  • Introduced a new e2e test for verifying successful creation, pause, and unpause operations.
  • Created stack using private registry verifying pause/unpause operations.
Screenshot 2025-11-03 at 13 14 27

Type of change:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as
    expected)
  • This change requires a documentation update
  • If changes include removal or addition of 3rd party GitHub actions, I updated our internal document. Reach out to the APIx Integration slack channel to get access to the internal document.

Manual QA performed:

  • cfn invoke for each of CRUDL/cfn test
  • Updated resource in example
  • Published to AWS private registry
  • Used the template in example to create and update a stack in AWS
  • Deleted stack to ensure resources are deleted
  • Created multiple resources in same stack
  • Validated in Atlas UI
  • Included screenshots

Required Checklist:

  • I have signed the MongoDB CLA
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that this change does not generate any credentials and that they are NOT accidentally logged anywhere.
  • I have added any necessary documentation (if appropriate)
  • I have run make fmt and formatted my code
  • For CFN Resources: I have released by changes in the private registry and proved by change
    works in Atlas

Further comments

Comment on lines 96 to 99
# run idividual test allowing for parallel execution. Due to usage of t.Setenv() in test code, t.Parallel() is not possible
run: |
cd cfn-resources/test/e2e/cluster
go test -timeout 90m -v -run '^TestClusterPauseCFN$' .
Copy link
Member Author

@AgustinBettati AgustinBettati Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While not ideal this was a solution that did not require a significant refactor to how our CFN e2e are currently executed and allow the new test to run in parallel. Currently e2e tests rely on executing shell scripts and setting env variables during execution.

@AgustinBettati AgustinBettati marked this pull request as ready for review November 3, 2025 16:17
@AgustinBettati AgustinBettati requested a review from a team as a code owner November 3, 2025 16:17
Copilot AI review requested due to automatic review settings November 3, 2025 16:17
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a bug where unpausing a cluster through CloudFormation would fail due to how the Atlas API handles the unpause operation. The fix introduces a separate PATCH request for unpausing that only updates the paused property before applying any other changes.

  • Adds special handling for unpause operations before the main update request
  • Introduces comprehensive e2e tests for cluster pause/unpause functionality
  • Updates CI workflow to run the new pause test independently

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
cfn-resources/cluster/cmd/resource/resource.go Implements unpause-specific handling logic to avoid API errors
cfn-resources/test/e2e/cluster/cluster_pause_test.go Adds e2e test validating create, pause, and unpause operations
cfn-resources/test/e2e/cluster/cluster_pause.json.template Provides CloudFormation template for pause/unpause testing
.github/workflows/e2e-testing.yaml Updates workflow to run cluster pause tests separately

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

go test -timeout 90m -v cluster_test.go
go test -timeout 90m -v -run '^TestClusterCFN$' .
cluster-pause:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why we are introducing a new workflow as opposed to using the cluster one for new tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial motivation is to add the new test but have them run in parallel (sequentially they take 40min + 20min). Due to how are e2e tests are implemented t.Parallel() cannot be used, Go returns an explicit error panic: testing: test using t.Setenv or t.Chdir can not use t.Parallel.
This is the reasoning behind adding a separate test group so both tests can run in parallel, tried to capture this why in the comment.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I see that you changed to a single workflow running 2 packages now. Smart! Ty!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

single workflow running 2 packages now

Was close but did not work 😞 , publishing script modifies specific files in cluster package (for private publishing), need to run the tests in isolated filesystems to avoid race conditions.

}

// Unpausing must be handled separately from other updates to avoid errors from the API.
if pe := handleUnpausingUpdate(client, currentCluster, currentModel); pe != nil {
Copy link
Contributor

@EspenAlbert EspenAlbert Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any need to wait for a stable cluster state? Is it ok we immediately send a different update afterwards?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is not, was able to verify that once a first PATCH is sent doing the unpause, having the cluster in UPDATING allows sending other PATCH requests modifying the cluster

Copy link
Member

@lantoli lantoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, let me know if you want to discuss about testing here

@AgustinBettati AgustinBettati added this pull request to the merge queue Nov 4, 2025
Merged via the queue into master with commit 4f7a75b Nov 4, 2025
40 checks passed
@AgustinBettati AgustinBettati deleted the CLOUDP-356110 branch November 4, 2025 17:00
# Run idividual test in separate test group for parallel execution.
# Due to usage of t.Setenv() in test code t.Parallel() is not possible.
# Having both tests run in same `go test` execution is not possible due to scripts used for private registry publishing modifying same files
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ty for the comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants