Build: Improve NX Cloud performance and flakiness with enterprise plan#34122
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a new QA GitHub Actions workflow that runs Nx/Playwright e2e attempts, collects per-attempt artifacts and summaries; updates an onboarding e2e test to wait for story navigation and perform survey interactions inside the "Storybook user survey" dialog. Changes
Sequence Diagram(s)sequenceDiagram
participant GH as GitHub Actions
participant Matrix as Matrix Coordinator
participant Job as QA Job (attempt)
participant Runner as Job Runner
participant NxCloud as Nx Cloud
participant Artifact as Artifact Storage
GH->>Matrix: trigger QA workflow
Matrix->>Job: spawn attempt jobs
loop per attempt
Job->>Runner: checkout + setup node + install deps + playwright deps
Runner->>NxCloud: start nx distributed run (nx run-many ...)
NxCloud-->>Runner: return nx_cloud_run link & status
Runner->>Artifact: upload nx-output.log
alt failure
Runner->>Artifact: upload test-results and playwright artifacts
end
Runner->>GH: emit per-attempt qa-status/result.json
end
GH->>Artifact: collect per-attempt artifacts
GH->>GH: write workflow step summary with links
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/qa.yml:
- Line 155: The f-string passed to handle.write currently uses shell-style
`${...}` so it will not interpolate; update the string in the handle.write call
that references results (the expression "handle.write(f\"- Target:
`${results[0]['template']}:{results[0]['target']}`\\n\" if results else \"-
Target: unknown\\n\")") to use Python curly-brace interpolation (e.g.
`{results[0]['template']}` and `{results[0]['target']}`) so the template and
target values are correctly formatted when results is non-empty.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ebecf822-b5a5-4044-b71d-4a1017c020ea
📒 Files selected for processing (1)
.github/workflows/qa.yml
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
.github/workflows/qa.yml (1)
59-59: Avoidnx@latestin CI to keep runs reproducible.Line 59 pulls whatever Nx is latest at runtime, which can introduce sudden breakage or behavior drift across attempts.
♻️ Suggested change
- npx nx@latest start-ci-run + yarn nx start-ci-run🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/qa.yml at line 59, The CI step currently runs "npx nx@latest start-ci-run", which makes runs non-reproducible; update that step to pin NX to a specific version or use the repository's pinned local binary (for example replace "npx nx@latest start-ci-run" with a pinned invocation like "npx nx@<SPECIFIC_VERSION> start-ci-run" or call the project script/local binary such as "npm run start-ci-run" / "pnpm run start-ci-run" to ensure deterministic CI behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/qa.yml:
- Around line 36-43: Job-level expressions use the env context incorrectly
(QA_TEMPLATE, QA_TARGET, QA_MAX_PARALLEL) and will fail evaluation; update the
job "name" and the "strategy.max-parallel" and "matrix.attempt" expressions to
use a valid context (e.g., vars.*, inputs.*, or needs.*) or hardcoded values:
replace `${{ env.QA_TEMPLATE }}`, `${{ env.QA_TARGET }}`, and `${{
fromJSON(env.QA_MAX_PARALLEL) }}` with appropriate `vars`/`inputs`/`needs`
references and ensure the matrix `attempt` source
(`needs.prepare.outputs.matrix`) is parsed in a valid context (use needs.prepare
outputs or vars instead of env). Locate these in the job definition where "name"
and "strategy" are defined and update the expressions accordingly.
---
Nitpick comments:
In @.github/workflows/qa.yml:
- Line 59: The CI step currently runs "npx nx@latest start-ci-run", which makes
runs non-reproducible; update that step to pin NX to a specific version or use
the repository's pinned local binary (for example replace "npx nx@latest
start-ci-run" with a pinned invocation like "npx nx@<SPECIFIC_VERSION>
start-ci-run" or call the project script/local binary such as "npm run
start-ci-run" / "pnpm run start-ci-run" to ensure deterministic CI behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 72dc380b-11f8-4b4c-b9b1-632012377286
📒 Files selected for processing (1)
.github/workflows/qa.yml
|
View your CI Pipeline Execution ↗ for commit a471ffa
☁️ Nx Cloud last updated this comment at |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/qa.yml:
- Around line 4-8: The workflow trigger is hard-coded to a feature branch
("push" -> "branches" -> kasper/stabilize-onboarding-e2e-waits); update the
trigger so the workflow runs on the intended long-lived branch(s) instead of the
feature branch—for example replace or extend the branches list under on: push:
to include "next" (or "main"/"develop" as appropriate) and/or add a manual
trigger (workflow_dispatch) so the job runs post-merge; modify the on: push:
branches array in the QA workflow accordingly.
- Around line 35-39: Replace the runtime-installed Nx invocation "npx nx@latest
start-ci-run" with the workspace-pinned runner "yarn nx start-ci-run" so the CI
uses the repository's pinned Nx version (matching the later step that uses "yarn
nx"); keep the existing flags (--distribute-on and --stop-agents-after)
unchanged so the behavior is identical except for using the pinned Nx version.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a1316a82-88e3-4ef4-8386-4755fc5fc3b9
📒 Files selected for processing (1)
.github/workflows/qa.yml
Package BenchmarksCommit: No significant changes detected, all good. 👏 |
7930d4d to
dbbfabf
Compare
Replace fixed 19-agent default with changeset-based scaling so smaller PRs use fewer agents and larger PRs get more.
…arding-e2e-waits # Conflicts: # scripts/build/utils/generate-types.ts
valentinpalkovic
left a comment
There was a problem hiding this comment.
Pair-reviewed with @kasperpeulen. LGTM!
Closes #
What I did
We're trying out NX Cloud as an alternative for Circle CI — we have the free enterprise plan for a month to evaluate. This PR improves the NX Cloud CI pipeline.
Removed uncacheable
prepare-sandboxNX targetThe sandbox preparation step (copying from cache, waiting for verdaccio, running
yarn install) is now inlined into each downstream task'srun()function. These targets were uncacheable because they moved cached output to the working directory — NX Cloud reruns work best when every target in the graph is cacheable. This also makes NX Cloud runs significantly faster.Dynamic changeset-based agent distribution
Now that we're on the NX enterprise plan for a month, we can use more agents. Agents now scale based on changeset size (small/medium/large/extra-large) instead of a fixed 20-agent pool. A separate daily distribution config uses even more agents for the larger daily sandbox set — the daily run now takes less than 10 minutes.
Reduced agent disk usage
NX Cloud agents were running out of disk space. To fix this:
enableGlobalCache: true) so each sandbox doesn't duplicate its own.yarn/cachedirectory.node_modulesfrom the NX Cloud cache paths (the yarn global cache is sufficient).Because the global cache is shared across sandboxes, stale
@storybook/*packages (same version, different contents from verdaccio) can causeyarn installto pick up published versions instead of local builds. A cleanup step purges these entries before each run to force fresh copies from verdaccio.Core compile DTS retry logic
Found a way to make the core compile task less flaky. Parallel DTS generation processes can race when one reads another's partially-written
.d.tsoutput (TS2306). Instead of immediately bailing and killing all processes on the first failure, failed entries are now retried sequentially after all parallel processes complete. By that point all dependency.d.tsfiles are fully written, so the retry almost always succeeds. This adds ~8s only when the race actually occurs.NX Cloud remote caching disabled on Circle CI (temporary)
NX Cloud remote caching is disabled on Circle CI so we can measure the true cost of NX Cloud without also getting cache-hit costs from Circle CI runs. This will be reverted once the NX Cloud experiment concludes.
E2E test stabilization
setupproject that waits for Storybook readiness through/index.json, opensexample-button--primary, verifies the manager selection, and waits for the preview to settle before the browser tests start.trySelectStoryfor reliable story selection after creation.Misc
cypressfromALL_TASKSin the NX workflow. We can likely remove cypress from CI altogether, but for now it's removed from the NX experiment.isNxTaskExecution()helper replacing scatteredNX_CLI_SETchecks.$schemareferences toproject.jsonfiles.Checklist for Contributors
Testing
The changes in this PR are covered in the following automated tests:
Manual testing
Verify in the NX Cloud dashboard that runs are faster and less flaky compared to previous runs.
Documentation
MIGRATION.MD
Checklist for Maintainers
When this PR is ready for testing, make sure to add
ci:normal,ci:mergedorci:dailyGH label to it to run a specific set of sandboxes. The particular set of sandboxes can be found incode/lib/cli-storybook/src/sandbox-templates.tsMake sure this PR contains one of the labels below:
Available labels
bug: Internal changes that fixes incorrect behavior.maintenance: User-facing maintenance tasks.dependencies: Upgrading (sometimes downgrading) dependencies.build: Internal-facing build tooling & test updates. Will not show up in release changelog.cleanup: Minor cleanup style change. Will not show up in release changelog.documentation: Documentation only changes. Will not show up in release changelog.feature request: Introducing a new feature.BREAKING CHANGE: Changes that break compatibility in some way with current major version.other: Changes that don't fit in the above categories.🦋 Canary release
This PR does not have a canary release associated. You can request a canary release of this pull request by mentioning the
@storybookjs/coreteam here.core team members can create a canary release here or locally with
gh workflow run --repo storybookjs/storybook publish.yml --field pr=<PR_NUMBER>