Skip to content

Conversation

@vivekgsharma
Copy link
Collaborator

@vivekgsharma vivekgsharma commented Oct 19, 2025

Summary

Fixes deployment failures that occur during initial network deployments.

Changes

  1. Terraform AMI lookup fix (terraform/aws/main.tf)

    • Fixed Ubuntu AMI name pattern from server* to server-* for both AMD64 and ARM64
    • Prevents "AMI not found" errors by correctly matching Ubuntu image names like ubuntu-jammy-22.04-amd64-server-20240927
  2. Rolling restart improvements (ansible/roles/dashmate/tasks/rolling_restart.yml)

    • Made --safe flag conditional via dashmate_safe_restart variable
    • Added failed_when: false to prevent DKG timeout failures on fresh deployments
  3. Elasticsearch certificate handling (ansible/roles/elastic_stack/tasks/main.yml)

    • Added file existence checks before fetching ca.zip and certs.zip
    • Prevents fetch errors when certificates haven't been generated yet

Testing

Tested on devnet-mahua deployment - all services deployed successfully without errors and chain progressing.

Summary by CodeRabbit

  • Chores
    • Enhanced restart command to support a conditional safety flag and adjusted error handling so non-zero exits are ignored
    • Added pre-validation checks for certificate archives to only fetch when present
    • Refined cloud image selection with more precise name-matching patterns for instances

@coderabbitai
Copy link

coderabbitai bot commented Oct 19, 2025

Walkthrough

Three configuration files updated: dashmate restart command is rebuilt as a multiline command and now ignores non‑zero exits; elastic_stack adds stat checks before fetching ca.zip and certs.zip; Terraform AMI name filters gain an extra hyphen for more specific matches.

Changes

Cohort / File(s) Summary
Dashmate restart logic
ansible/roles/dashmate/tasks/rolling_restart.yml
Rewrote restart command composition into a multiline block, conditionally injects --safe when dashmate_safe_restart is true, keeps conditional --platform and --verbose placement, and sets failed_when: false to ignore non-zero exit codes.
Elastic stack pre-flight checks
ansible/roles/elastic_stack/tasks/main.yml
Added ansible.builtin.stat checks for ca.zip and certs.zip (registered as ca_zip_file and certs_zip_file) and made the existing fetch tasks conditional on those results; unarchive/install logic unchanged.
AWS AMI filter refinement
terraform/aws/main.tf
Tightened AMI name filters from "ubuntu-jammy-22.04-*-server*" to "ubuntu-jammy-22.04-*-server-*" in the ubuntu_amd and ubuntu_arm data sources.

Sequence Diagrams

sequenceDiagram
    participant Task as Dashmate Restart Task
    participant Cond as Eval: dashmate_safe_restart
    participant Cmd as Build Command
    participant Exec as Execute Restart
    participant Handle as Result Handler

    Task->>Cond: Check dashmate_safe_restart
    alt safe = true
        Cond->>Cmd: include --safe
    else safe = false
        Cond->>Cmd: omit --safe
    end
    Cmd->>Cmd: ensure --verbose and conditional --platform
    Cmd->>Exec: run restart command
    Exec->>Handle: return exit code
    Note over Handle: failed_when: false -> ignore non-zero exit
Loading
sequenceDiagram
    participant Task as Elastic Stack Tasks
    participant StatCA as Stat: ca.zip
    participant StatCert as Stat: certs.zip
    participant FetchCA as Fetch CA (if missing)
    participant FetchCert as Fetch certs (if missing)
    participant Unarchive as Unarchive / Install

    Task->>StatCA: stat ca.zip
    StatCA-->>Task: exists? true/false
    Task->>StatCert: stat certs.zip
    StatCert-->>Task: exists? true/false

    alt ca.zip not exists
        Task->>FetchCA: fetch ca.zip
    end
    alt certs.zip not exists
        Task->>FetchCert: fetch certs.zip
    end
    FetchCA->>Unarchive: pass files
    FetchCert->>Unarchive: pass files
    Unarchive->>Unarchive: proceed with existing logic
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I nibbled flags and stitched a line,

--safe hops in when fences align,
I peeked in sacks before I fetch,
tuned AMI names to a tighter stretch,
a quiet restart, errors left behind.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "fix: deployment failures in terraform AMI lookup, rolling restart, and elasticsearch (devnet/testnet)" directly aligns with the three main changes in the changeset. The title accurately identifies the three files modified (terraform/aws/main.tf, ansible/roles/dashmate/tasks/rolling_restart.yml, and ansible/roles/elastic_stack/tasks/main.yml) and clearly conveys that these changes address deployment failures. The title is concise, uses the conventional "fix:" prefix, specifies the scope (devnet/testnet), and avoids vague terminology or noise. A developer scanning git history would immediately understand that this PR fixes specific deployment issues across infrastructure provisioning and orchestration tasks.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch vivek/devnet-fixes

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@vivekgsharma vivekgsharma changed the title Vivek/devnet fixes devnet deployement fixes, applicable to testnet as well. Oct 19, 2025
@vivekgsharma vivekgsharma changed the title devnet deployement fixes, applicable to testnet as well. fix: deployment failures in terraform AMI lookup, rolling restart, and elasticsearch (devnet/testnet) Oct 19, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
ansible/roles/dashmate/tasks/rolling_restart.yml (1)

6-7: Use is defined checks instead of | default() per coding guidelines.

The coding guidelines specify using is defined checks for variables like dashmate_safe_restart and needs_core_restart to prevent undefined variable errors, rather than relying on | default() filters.

Apply this diff to use proper conditional checks:

  ansible.builtin.command: >-
    {{ dashmate_cmd }} restart
-   {{ '--safe' if dashmate_safe_restart | default(false) else '' }}
-   --verbose{{ '' if needs_core_restart | default(false) else ' --platform' }}
+   {% if dashmate_safe_restart is defined and dashmate_safe_restart %}--safe{% endif %}
+   --verbose{% if needs_core_restart is not defined or not needs_core_restart %} --platform{% endif %}

Alternatively, keep the inline Jinja2 approach but add explicit is defined checks:

  ansible.builtin.command: >-
    {{ dashmate_cmd }} restart
-   {{ '--safe' if dashmate_safe_restart | default(false) else '' }}
-   --verbose{{ '' if needs_core_restart | default(false) else ' --platform' }}
+   {{ '--safe' if dashmate_safe_restart is defined and dashmate_safe_restart else '' }}
+   --verbose{{ '' if needs_core_restart is defined and needs_core_restart else ' --platform' }}

Based on coding guidelines.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 18211b8 and 2e4244c.

📒 Files selected for processing (1)
  • ansible/roles/dashmate/tasks/rolling_restart.yml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
ansible/roles/dashmate/tasks/*.yml

📄 CodeRabbit inference engine (CLAUDE.md)

ansible/roles/dashmate/tasks/*.yml: Use is defined checks and conditional execution for variables such as dashmate_group_check, dashmate_user_check, dash_conf_stat, dash_conf_changed, logrotate_config_stat, dashmate_update, dashmate_start_all, dashmate_restart_all, dashmate_install_result, and template_result in Ansible tasks to prevent undefined variable errors
Implement conditional restart logic in Ansible tasks: in fast mode, each node restarts itself independently; in regular mode, coordinate chunked restarts to prevent network disruption
Check current vs required dashmate version in Ansible tasks and only install/update if versions differ, then restart services after version changes
Add meta: flush_handlers before coordinated operations in regular mode to ensure all nodes reach synchronization points

Files:

  • ansible/roles/dashmate/tasks/rolling_restart.yml
ansible/**/*.yml

📄 CodeRabbit inference engine (CLAUDE.md)

Use force flags (force_dashmate_rebuild, force_dashmate_reinstall, force_ssl_regenerate, force_logs_config, skip_dashmate_image_update) as manual overrides in Ansible playbooks and tasks when needed

Files:

  • ansible/roles/dashmate/tasks/rolling_restart.yml
🧠 Learnings (5)
📚 Learning: 2025-08-06T08:09:00.292Z
Learnt from: CR
PR: dashpay/dash-network-deploy#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-06T08:09:00.292Z
Learning: Applies to ansible/roles/dashmate/tasks/*.yml : Implement conditional restart logic in Ansible tasks: in fast mode, each node restarts itself independently; in regular mode, coordinate chunked restarts to prevent network disruption

Applied to files:

  • ansible/roles/dashmate/tasks/rolling_restart.yml
📚 Learning: 2025-08-06T08:09:00.292Z
Learnt from: CR
PR: dashpay/dash-network-deploy#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-06T08:09:00.292Z
Learning: Applies to ansible/roles/dashmate/tasks/*.yml : Check current vs required dashmate version in Ansible tasks and only install/update if versions differ, then restart services after version changes

Applied to files:

  • ansible/roles/dashmate/tasks/rolling_restart.yml
📚 Learning: 2025-08-06T08:09:00.292Z
Learnt from: CR
PR: dashpay/dash-network-deploy#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-06T08:09:00.292Z
Learning: Applies to ansible/roles/dashmate/tasks/quick_update.yml : Create a streamlined update-only task in `ansible/roles/dashmate/tasks/quick_update.yml` for fast updates

Applied to files:

  • ansible/roles/dashmate/tasks/rolling_restart.yml
📚 Learning: 2025-08-06T08:09:00.292Z
Learnt from: CR
PR: dashpay/dash-network-deploy#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-06T08:09:00.292Z
Learning: Applies to ansible/roles/dashmate/tasks/main.yml : Implement conditional execution for user/group checks, dash.conf generation, log configuration, and Docker image updates in `ansible/roles/dashmate/tasks/main.yml` based on fast mode flags

Applied to files:

  • ansible/roles/dashmate/tasks/rolling_restart.yml
📚 Learning: 2025-08-06T08:09:00.292Z
Learnt from: CR
PR: dashpay/dash-network-deploy#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-06T08:09:00.292Z
Learning: Applies to ansible/roles/dashmate/tasks/*.yml : Use `is defined` checks and conditional execution for variables such as `dashmate_group_check`, `dashmate_user_check`, `dash_conf_stat`, `dash_conf_changed`, `logrotate_config_stat`, `dashmate_update`, `dashmate_start_all`, `dashmate_restart_all`, `dashmate_install_result`, and `template_result` in Ansible tasks to prevent undefined variable errors

Applied to files:

  • ansible/roles/dashmate/tasks/rolling_restart.yml
🔇 Additional comments (2)
ansible/roles/dashmate/tasks/rolling_restart.yml (2)

4-7: Verify the inverted logic of the --platform flag condition.

Line 7 conditionally includes --platform when needs_core_restart is false, which appears semantically inverted. When the variable name suggests we need a core restart, the flag is excluded. Conversely, when we don't need a core restart, the flag is included.

Confirm that this logic is intentional and clarify the semantics of --platform vs. needs_core_restart.


13-14: Review error-masking implications of failed_when: false.

Setting failed_when: false causes the task to always succeed, even on non-zero exit codes. While the PR objectives note this is intentional to avoid DKG timeout failures on fresh deployments, this will also mask legitimate restart failures that should be surfaced for investigation.

Consider whether a more granular approach might be better, such as:

  • Checking the exit code and only ignoring specific timeout-related failures.
  • Adding validation in a follow-up task to confirm services are actually running after the "failed" restart.
  • Documenting why fresh deployments are expected to fail and what operators should monitor.

Per PR objectives: This is an intentional trade-off for fresh deployments; however, ensure the downstream tasks and monitoring account for this permissive error handling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants