Transform into vendor-neutral upstream project with zero hardcoded values #6

Merged
tangyisheng2 merged 2 commits into openshift:main from rrasouli:terraform-scripts
Jan 13, 2026

Conversation

@rrasouli
Contributor

@rrasouli rrasouli commented Oct 23, 2025

Summary

BYOH Provisioner for deploying Windows worker nodes to OpenShift clusters across multiple cloud platforms.

Overview

This tool provisions Windows worker nodes as BYOH (Bring Your Own Host) for OpenShift clusters. It supports AWS, Azure, GCP, vSphere, Nutanix, and bare metal environments with automated credential management and infrastructure discovery from existing Linux worker nodes.

What This Tool Does

Multi-Cloud Windows Node Provisioning

Deploy Windows Server 2019/2022 nodes across platforms:

  • AWS
  • Azure
  • GCP
  • vSphere
  • Nutanix
  • Bare metal (platform "none")

Automated Infrastructure Discovery

The tool extracts infrastructure configuration from Linux worker nodes in the cluster:

# AWS: Region, VPC, subnet, security groups
linux_machine_spec=$(oc get machines -n openshift-machine-api \
  -l machine.openshift.io/cluster-api-machine-role=worker \
  -o=jsonpath='{.items[0].spec}')
region=$(echo "$linux_machine_spec" | jq -r '.providerSpec.value.placement.region')
vpc=$(echo "$linux_machine_spec" | jq -r '.providerSpec.value.vpc')

# Azure: VNet, subnet
vnet=$(echo "$linux_machine_spec" | jq -r '.providerSpec.value.vnet')
subnet=$(echo "$linux_machine_spec" | jq -r '.providerSpec.value.subnet')

# vSphere: Datacenter, datastore, network (fields live under providerSpec.value, like the other platforms)
datacenter=$(echo "$linux_machine_spec" | jq -r '.providerSpec.value.workspace.datacenter')
datastore=$(echo "$linux_machine_spec" | jq -r '.providerSpec.value.workspace.datastore')
network=$(echo "$linux_machine_spec" | jq -r '.providerSpec.value.network.devices[0].networkName')

Benefits:

  • No manual infrastructure parameter entry
  • Network configuration matches Linux workers
  • Works immediately with any OpenShift cluster
  • No Windows MachineSet prerequisite

Automated Credential Management

Auto-Generated Passwords:

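# Cryptographically secure 18-character random alphanumeric password
function generate_random_password() {
    # Read 1024 random bytes so that far more than 18 characters survive the
    # alphanumeric filter (about 248 on average); 101 bytes can fall short of 18
    head -c 1024 /dev/urandom | LC_ALL=C tr -dc 'a-zA-Z0-9' | head -c 18
}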
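
A reviewer notes further down in this conversation that the alphanumeric password has no special characters. A minimal sketch of a generator that guarantees one character from each class might look like the following. The names generate_complex_password and random_chars, and the chosen special-character set, are hypothetical and are not taken from lib/credentials.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not the code in lib/credentials.sh.

# Print $2 random characters drawn from the tr character set $1.
random_chars() {
    local charset="$1" count="$2" pool
    # 4096 random bytes leave a large pool even for small character sets
    pool=$(head -c 4096 /dev/urandom | LC_ALL=C tr -dc "$charset")
    printf '%s' "${pool:0:count}"
}

# Print a password of length $1 (default 18, minimum 4) that contains at
# least one uppercase letter, lowercase letter, digit, and special character.
generate_complex_password() {
    local length="${1:-18}"
    local specials='!@#%^*_+='   # assumed set; adjust to what the platform accepts
    (( length >= 4 )) || return 1
    local upper lower digit special rest
    upper=$(random_chars 'A-Z' 1)
    lower=$(random_chars 'a-z' 1)
    digit=$(random_chars '0-9' 1)
    special=$(random_chars "$specials" 1)
    rest=$(random_chars "A-Za-z0-9${specials}" $((length - 4)))
    printf '%s%s%s%s%s\n' "$upper" "$lower" "$digit" "$special" "$rest"
}
```

The first four characters always follow the same class order in this sketch; shuffling the result would remove that pattern at the cost of an extra dependency.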

Auto-Extracted SSH Keys:

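# Extract from cloud-private-key secret in WMCO namespace
function get_ssh_public_key_from_secret() {
    # Declare separately so that `local` does not mask the exit status of oc
    local wmco_namespace private_key
    wmco_namespace=$(oc get deployment --all-namespaces \
      -o=jsonpath="{.items[?(@.metadata.name=='windows-machine-config-operator')].metadata.namespace}") || return 1
    [[ -n "$wmco_namespace" ]] || { echo "WMCO deployment not found" >&2; return 1; }

    private_key=$(oc get secret cloud-private-key -n "$wmco_namespace" \
      -o jsonpath='{.data.private-key\.pem}' | base64 -d) || return 1

    # Derive the public key from the private key
    ssh-keygen -y -f <(echo "$private_key")
}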
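
Because the extracted key is later written into the Windows authorized-keys file, a cheap format check before use can catch an empty or malformed value early. The following is a sketch only: is_plausible_ssh_public_key is a hypothetical name, and the accepted key types are an assumption rather than the list used by this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: checks the shape of an OpenSSH public key line
# ("<type> <base64> [comment]"). It does not verify the key material itself.
is_plausible_ssh_public_key() {
    local key="$1"
    local re='^(ssh-rsa|ssh-ed25519|ecdsa-sha2-nistp(256|384|521)) [A-Za-z0-9+/]+={0,3}( .*)?$'
    [[ "$key" =~ $re ]]
}
```

A caller could run this on the output of the extraction function and abort with a clear message when it returns non-zero.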

Platform-Specific Credentials:

  • Azure: Extracts all 6 variables from cluster secrets (ARM_CLIENT_ID, ARM_CLIENT_SECRET, ARM_SUBSCRIPTION_ID, ARM_TENANT_ID, ARM_RESOURCE_PREFIX, ARM_RESOURCEGROUP)
  • AWS: Supports multiple credential methods (profile + shared file, direct keys, default credentials)
  • GCP: Service account from cluster secrets
  • vSphere: vCenter credentials from cluster secrets
  • Nutanix: Prism Central credentials from cluster secrets

Generic Windows Bootstrap Template

Single cross-platform bootstrap script at lib/windows-vm-bootstrap.tf:

<powershell>
# Configure Administrator account for OpenSSH authentication
# OpenSSH on Windows requires valid password to generate security token (LogonUser API)
$UserAccount = Get-LocalUser -Name "${var.admin_username}" -ErrorAction SilentlyContinue
if ($UserAccount -ne $null) {
    $password = ConvertTo-SecureString "${var.admin_password}" -AsPlainText -Force
    $UserAccount | Set-LocalUser -Password $password -PasswordNeverExpires $true
    if (!$UserAccount.Enabled) {
        $UserAccount | Enable-LocalUser
    }
}

# Setup SSH authorized keys
$authorizedKeyConf = "$env:ProgramData\ssh\administrators_authorized_keys"
Write-Output "${var.ssh_public_key}" | Out-File -FilePath $authorizedKeyConf -Encoding ascii

# Install and configure OpenSSH Server
Add-WindowsCapability -Online -Name OpenSSH.Server~~~~0.0.1.0
Set-Service -Name sshd -StartupType 'Automatic'
Start-Service sshd

# Configure firewall
New-NetFirewallRule -DisplayName "ContainerLogsPort" -LocalPort ${var.container_logs_port} -Enabled True -Direction Inbound -Protocol TCP -Action Allow -EdgeTraversalPolicy Allow
</powershell>

Platform-specific directories contain symlinks to this generic template:

  • aws/windows-vm-bootstrap.tf → ../lib/windows-vm-bootstrap.tf
  • gcp/windows-vm-bootstrap.tf → ../lib/windows-vm-bootstrap.tf
  • azure/windows-vm-bootstrap.tf → ../lib/windows-vm-bootstrap.tf
  • nutanix/windows-vm-bootstrap.tf → ../lib/windows-vm-bootstrap.tf

Image Selection Strategy

Per-platform priority ordering:

Platform   Priority
AWS        User Config → AWS API (version-specific) → MachineSet
Azure      User Config → MachineSet → Default SKU (latest)
GCP        Image family (always latest)
vSphere    User Config → MachineSet → Error
Nutanix    User Config → MachineSet → Error

AWS Example:

# Priority 1: User override
windows_ami=$(get_config "AWS_WINDOWS_AMI" "")

# Priority 2: AWS API query (version-specific)
if [[ -z "$windows_ami" ]] && command -v aws &> /dev/null; then
    image_pattern="Windows_Server-${win_version}-English-Full-Base"
    windows_ami=$(aws ec2 describe-images \
        --filters "Name=name,Values=${image_pattern}*" \
        --region "${region}" \
        --query 'sort_by(Images, &CreationDate)[-1].[ImageId]' \
        --output text)
fi

# Priority 3: MachineSet fallback
if [[ -z "$windows_ami" ]]; then
    windows_ami=$(oc get machineset -n openshift-machine-api ...)
fi
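
The priority chain above can also be expressed as a generic "first non-empty result wins" helper. This is a sketch under assumptions: resolve_first and the resolver function names in the usage note are hypothetical. It also treats the literal string None as empty, because the AWS CLI prints None in text output when a query yields no value, which a plain -z test would not catch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run each resolver function in order and print the
# first usable result. Returns 1 if no resolver produced a value.
resolve_first() {
    local resolver result
    for resolver in "$@"; do
        # A failing resolver is treated the same as an empty result
        result=$("$resolver" 2>/dev/null) || result=""
        if [[ -n "$result" && "$result" != "None" ]]; then
            printf '%s\n' "$result"
            return 0
        fi
    done
    return 1
}
```

With three hypothetical resolvers wrapping the user override, the AWS API query, and the MachineSet lookup, the call would read windows_ami=$(resolve_first ami_from_config ami_from_aws_api ami_from_machineset).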

Configuration System

Configuration priority (highest to lowest):

  1. Environment variables
  2. User config file (~/.config/byoh-provisioner/config)
  3. Project config file (./configs/defaults.conf)
  4. Built-in defaults

Implementation:

# lib/config.sh - Only export if not already set in environment
if [[ -z "${!key:-}" ]]; then
    export "$key=$value"
fi
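
One way to get the four-level priority from the "export only if unset" rule is to load the sources from highest to lowest priority, so the first value seen wins. The sketch below assumes a simple KEY=VALUE file format; load_config_file and trim are hypothetical names, and the real lib/config.sh may work differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of layered config loading (bash 3.x compatible).

# Strip leading and trailing whitespace from $1.
trim() {
    local s="$1"
    s="${s#"${s%%[![:space:]]*}"}"
    s="${s%"${s##*[![:space:]]}"}"
    printf '%s' "$s"
}

# Export KEY=VALUE pairs from file $1 without overriding existing values.
load_config_file() {
    local file="$1" line key value
    local skip_re='^[[:space:]]*(#|$)'
    local name_re='^[A-Za-z_][A-Za-z0-9_]*$'
    [[ -f "$file" ]] || return 0
    while IFS= read -r line || [[ -n "$line" ]]; do
        [[ "$line" =~ $skip_re ]] && continue      # blank line or comment
        [[ "$line" == *=* ]] || continue           # not an assignment
        key=$(trim "${line%%=*}")
        value=$(trim "${line#*=}")
        [[ "$key" =~ $name_re ]] || continue       # not a valid variable name
        # First writer wins: environment beats files, earlier files beat later ones
        if [[ -z "${!key:-}" ]]; then
            export "$key=$value"
        fi
    done < "$file"
}

# Usage: load the higher-priority file first, for example
#   load_config_file "${HOME}/.config/byoh-provisioner/config"
#   load_config_file "./configs/defaults.conf"
```

Built-in defaults can then be applied last with the usual ${VAR:-default} expansion.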

Modular Architecture

terraform-windows-provisioner/
├── byoh.sh                      # Main entry point
├── lib/                         # Library modules
│   ├── config.sh               # Configuration loading (bash 3.x compatible)
│   ├── credentials.sh          # Credential management
│   ├── platform.sh             # Platform detection & tfvars generation
│   ├── terraform.sh            # Terraform operations (cp -LR for symlinks)
│   ├── validation.sh           # Input validation
│   └── windows-vm-bootstrap.tf # Generic Windows bootstrap (all platforms)
├── configs/                    # Configuration files
│   ├── defaults.conf           # Default values
│   └── examples/               # Platform-specific examples
└── <platform>/                 # Platform-specific Terraform
    ├── aws/
    ├── azure/
    ├── gcp/
    ├── vsphere/
    ├── nutanix/
    └── none/

Key Configuration Variables

Variable                  Description                      Default
WMCO_NAMESPACE            WMCO deployment namespace        Auto-detected
WMCO_IDENTIFIER_TYPE      ConfigMap identifier: ip or dns  ip
WINC_ADMIN_USERNAME       Windows admin username           Azure=capi, Others=Administrator
WINC_ADMIN_PASSWORD       Windows password                 Auto-generated
WINC_SSH_PUBLIC_KEY       SSH public key                   Auto-extracted
AWS_WINDOWS_AMI           AWS AMI override                 Auto-detected
AZURE_WINDOWS_SKU         Azure image SKU                  {version}-Datacenter-smalldisk
VSPHERE_WINDOWS_TEMPLATE  vSphere template                 From MachineSet
NUTANIX_WINDOWS_IMAGE     Nutanix image                    From MachineSet

Usage Examples

Basic Deployment

./byoh.sh apply mywindows 2

Specific Windows Version

./byoh.sh apply mywindows 2 '' 2019

Custom Configuration

export WINC_ADMIN_PASSWORD="MyPassword123"
export AWS_WINDOWS_AMI="ami-0abcdef1234567890"
./byoh.sh apply production-win 4

Cleanup

./byoh.sh destroy mywindows 2

Technical Details

Files in This PR

File                             Purpose
byoh.sh                          Main entry point, command parsing
lib/config.sh                    Configuration loading (bash 3.x compatible)
lib/credentials.sh               Credential management, auto-generation
lib/platform.sh                  Platform detection, infrastructure discovery, AWS version selection
lib/terraform.sh                 Terraform operations, ConfigMap management (cp -LR for symlinks)
lib/validation.sh                Input validation
lib/windows-vm-bootstrap.tf      Generic Windows bootstrap template
aws/main.tf                      AWS Terraform configuration
aws/variables.tf                 AWS variables
azure/main.tf                    Azure Terraform configuration
azure/variables.tf               Azure variables
gcp/main.tf                      GCP Terraform configuration
gcp/variables.tf                 GCP variables (includes container_logs_port)
vsphere/main.tf                  vSphere Terraform configuration
vsphere/variables.tf             vSphere variables
nutanix/main.tf                  Nutanix Terraform configuration
nutanix/variables.tf             Nutanix variables (includes container_logs_port)
none/main.tf                     Platform "none" Terraform configuration
none/variables.tf                Platform "none" variables
configs/defaults.conf            Default configuration values
configs/examples/*.conf.example  Platform-specific config examples
README.md                        Comprehensive documentation

Bash 3.x Compatibility

Some systems still ship bash 3.2 (macOS, for example; RHEL 8 itself ships bash 4.4), so the code avoids bash 4.0+ features:

# Avoid associative arrays (bash 4.0+)
# Use space-delimited strings with pattern matching instead:
ENV_VARS_BEFORE_CONFIG=""
for var in "${config_vars[@]}"; do
    if [[ -n "${!var:-}" ]]; then
        ENV_VARS_BEFORE_CONFIG="${ENV_VARS_BEFORE_CONFIG} ${var}"
    fi
done

# Check membership
if [[ ! " ${ENV_VARS_BEFORE_CONFIG} " =~ " ${key} " ]]; then
    export "$key=$value"
fi

AWS Version Selection

Respects requested Windows version with proper priority:

function write_aws_tfvars() {
    local win_version="${4:-2022}"
    local skip_ami_lookup="${5:-false}"  # Optimize destroy operations

    if [[ "$skip_ami_lookup" == "true" ]]; then
        windows_ami="ami-dummy-not-used-for-destroy"
    else
        # Version-specific query with 30s timeout
        image_pattern="Windows_Server-${win_version}-English-Full-Base"
        windows_ami=$(timeout 30s aws ec2 describe-images ...)
    fi
}

Symlink Handling

Terraform doesn't follow symlinks, so use cp -LR to dereference:

# lib/terraform.sh
cp -LR "${script_dir}/${platform}/." "$templates_dir"
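
To illustrate what the dereferencing copy does, here is a small sketch. stage_templates is a hypothetical wrapper name, and the real lib/terraform.sh may do more than this.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy a platform directory into a working directory,
# replacing every symlink with a real copy of the file it points to.
stage_templates() {
    local src_dir="$1" dest_dir="$2"
    mkdir -p "$dest_dir" || return 1
    # -L follows symlinks, -R recurses; "src/." copies the directory's contents
    cp -LR "${src_dir}/." "$dest_dir"
}
```

After staging, a symlink such as aws/windows-vm-bootstrap.tf becomes a regular file in the destination, so Terraform sees the shared bootstrap template next to the platform files.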

Platform Coverage

  • AWS (tested)
  • Azure (tested)
  • GCP (tested)
  • vSphere (tested)
  • Nutanix (tested)
  • Platform "none" / bare metal (tested)

Integration with OpenShift CI

This tool integrates with Prow CI via step-registry. See ci-operator/step-registry/windows/byoh/ in openshift/release repository.

Benefits

For Users:

  • No manual credential generation
  • No manual infrastructure parameter entry
  • Works immediately with existing clusters
  • Single command deployment

For CI/CD:

  • Multi-platform support
  • Environment variable configuration
  • No interactive prompts
  • Clean Terraform state management

For Operations:

  • Automated infrastructure discovery
  • Platform auto-detection
  • Graceful error handling
  • Transparent configuration priority

Testing

Validated on:

  • AWS: IPI clusters, Windows 2019/2022, credential methods
  • Azure: IPI clusters, Windows 2019/2022, image SKUs
  • GCP: IPI clusters, Windows 2019/2022, service accounts
  • vSphere: Golden image templates, Windows 2019/2022
  • Nutanix: Golden images, Windows 2019/2022
  • Platform "none": Bare metal with AWS infrastructure

Reviewers

Please review:

  • Modular architecture and separation of concerns
  • Generic Windows bootstrap template approach
  • Infrastructure discovery from Linux workers
  • Credential auto-generation security
  • AWS version selection priority
  • Bash 3.x compatibility
  • Configuration priority implementation
  • Error handling and user messaging
  • Documentation completeness

@openshift-ci openshift-ci bot requested review from jrvaldes and weinliu October 23, 2025 05:28
@rrasouli rrasouli force-pushed the terraform-scripts branch 8 times, most recently from a5b3c0f to 89e70f7 Compare October 26, 2025 15:57
@rrasouli rrasouli changed the title Transform into vendor-neutral upstream project with zero hardcoded values [WIP] Transform into vendor-neutral upstream project with zero hardcoded values Oct 26, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 26, 2025
@rrasouli rrasouli force-pushed the terraform-scripts branch 17 times, most recently from 83c810f to a89afc4 Compare October 29, 2025 10:03
@weinliu
Contributor

weinliu commented Dec 1, 2025

Thank you for this comprehensive PR! The overall direction and features are excellent, but I have some concerns that need to be addressed before merging.

  1. Password generation doesn't meet Windows requirements in lib/credentials.sh, shall we update it?
  2. Do we need to add error handling in SSH key extraction function?

@rrasouli
Contributor Author

rrasouli commented Dec 1, 2025

> Thank you for this comprehensive PR! The overall direction and features are excellent, but I have some concerns that need to be addressed before merging.
>
>   1. Password generation doesn't meet Windows requirements in lib/credentials.sh, shall we update it?
>   2. Do we need to add error handling in SSH key extraction function?

  1. Which Windows requirements? I don't understand.
  2. I think we have added such error handling. It is nice to have; perhaps we can add a debug mode in the future.

@weinliu
Contributor

weinliu commented Dec 2, 2025

  1. lib/credentials.sh: the generated password is missing special characters. I think it's OK, since the Windows nodes are used for testing.
    The other parts look good.

Waiting for the fix for

Some issues found in GCP, Azure WMCO configuration with SSH

@rrasouli rrasouli marked this pull request as ready for review December 3, 2025 08:37
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 3, 2025
@openshift-ci openshift-ci bot requested a review from sebsoto December 3, 2025 08:38
@rrasouli rrasouli force-pushed the terraform-scripts branch 4 times, most recently from 8182f21 to e428634 Compare December 10, 2025 14:19
This commit resolves critical SSH connectivity issues that prevented WMCO
from configuring Windows BYOH nodes. The root cause was OpenSSH on Windows
failing to generate user tokens for authentication.

Critical Fixes:
 - GCP/Azure: Always set Administrator/user password with PasswordNeverExpires
   to ensure OpenSSH can generate authentication tokens, regardless of account
   enabled state. This was THE root cause of "unable to connect to Windows VM"
   timeouts.

 - GCP/Azure: Create scheduled task to start ssh-agent and sshd services at
   every boot. Services configured during sysprep don't reliably persist after
   Windows reboot (RC bug).

 - GCP/Azure: Insert UseDNS directive before Match block in sshd_config to
   prevent syntax errors that stop sshd from starting.

 - GCP/Azure: Ensure ssh-agent service starts before sshd (Windows requirement).

Infrastructure Improvements:
 - lib/terraform.sh: Use ConfigMap PATCH instead of DELETE+CREATE to preserve
   other BYOH node entries in multi-node environments.

 - byoh.sh: Add SKIP_CONFIGMAP_CREATION support for manual ConfigMap management.
   Regenerate terraform.auto.tfvars before destroy to prevent variable errors.

 - lib/credentials.sh: Add SSH key validation and improved password generation
   meeting Azure complexity requirements.

 - lib/platform.sh: Extract Azure image version from existing MachineSet instead
   of hardcoding "latest" to avoid buggy releases.

 - lib/config.sh: Improve config file parsing to handle whitespace robustly.

 - azure/main.tf: Add 120s wait before VM extension execution to prevent race
   conditions during VM boot.

Tested on:
 - GCP: windows-2022-core ✓
 - Azure: windows-2019-datacenter ✓
 - vSphere: ✓
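
The ConfigMap PATCH change described in the commit message above can be sketched as a merge patch that adds one entry and leaves the others alone. This is an illustration under assumptions: the ConfigMap name windows-instances and the username= value format follow the usual WMCO BYOH convention and are not quoted from this PR, and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not the code in lib/terraform.sh.

# Build a JSON merge patch that adds one BYOH instance entry.
# $1 = node address (IP or DNS name), $2 = Windows username
build_instance_patch() {
    local address="$1" username="$2"
    printf '{"data":{"%s":"username=%s"}}' "$address" "$username"
}

# Apply the patch; existing entries for other nodes are left untouched.
# The ConfigMap is assumed to exist already.
add_byoh_instance() {
    local namespace="$1" address="$2" username="$3"
    oc patch configmap windows-instances -n "$namespace" \
        --type merge -p "$(build_instance_patch "$address" "$username")"
}
```

A merge patch only touches the keys it names, which is why it avoids the data loss a DELETE+CREATE cycle can cause when several nodes share the ConfigMap.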
@sebsoto

sebsoto commented Jan 7, 2026

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 7, 2026
@jrvaldes

/lgtm

@rrasouli
Contributor Author

/approve

@openshift-ci

openshift-ci bot commented Jan 13, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rrasouli

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 13, 2026
@rrasouli
Contributor Author

rrasouli commented Jan 13, 2026

@sdodson Could you please merge this PR? It has both the lgtm and approved labels, and all requirements are met.
The repository doesn't have Tide auto-merge configured, so it needs manual merging. Since you merged PR #1, you have the necessary permissions.
Thank you!

@rrasouli
Contributor Author

@JimCann Could you please help with two things?

  1. Merge this PR: It has both the lgtm and approved labels, and all requirements are met. The repository doesn't have Tide auto-merge configured, so it needs manual merging.

  2. Grant me write permissions: I'm listed as an approver in the OWNERS file, but I only have READ permissions on the GitHub repository itself, so I can't merge PRs. Could you grant me write access so I can manage PRs going forward?

Thank you!

