@jhcipar jhcipar commented Sep 23, 2025

Overview: update existing resources instead of wholesale recreation

Objective: Enable in-place resource updates instead of costly redeploys when configuration changes are detected.

Previously, any configuration change would trigger a complete resource teardown and redeploy cycle.

Solution Overview

This PR introduces an update system that:

- Detects configuration changes using content-based hashing
- Updates resources in-place via platform APIs when possible
- Handles complex updates that may require both template and endpoint modifications

Key Changes

1. Durable Resource Identity System

`resource_id`: Now represents a logical, human-readable identifier (`ResourceType_name`)

- Provides stable identity across configuration changes
- Enables resource reuse and update tracking
- Replaces the previous hash-based approach for resource identification

`resource_hash`: Content-based hash for change detection (see the sketch below)

- Built from `_hashed_fields`, i.e. only mutable configuration parameters
- Excludes platform state (IDs, deployment metadata) to focus on user-controllable config
- Triggers the update flow when the hash changes between runs
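
Below is a minimal sketch of how these two identifiers could be derived. The class and field names (`ServerlessResource`, `image_name`, `env`, `gpu_count`) are illustrative assumptions, not the actual Tetra classes; only the `resource_id`/`resource_hash` split and the `_hashed_fields` mechanism come from this PR.

```python
import hashlib
import json


class ServerlessResource:
    # Illustrative set of mutable config fields; the real class defines its own _hashed_fields
    _hashed_fields = ("image_name", "env", "gpu_count")

    def __init__(self, name, image_name, env=None, gpu_count=1):
        self.name = name
        self.image_name = image_name
        self.env = env or {}
        self.gpu_count = gpu_count

    @property
    def resource_id(self) -> str:
        # Logical identity: resource type + user-supplied name, stable across config changes
        return f"{type(self).__name__}_{self.name}"

    @property
    def resource_hash(self) -> str:
        # Content hash over mutable config only; platform state (IDs, metadata) is excluded
        payload = {field: getattr(self, field) for field in self._hashed_fields}
        return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
```

Changing `env` or `gpu_count` changes `resource_hash` but not `resource_id`, which is what lets the manager recognize "same logical resource, new config" and take the update path instead of redeploying.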

2. Update Logic
```python
# New update path in ResourceManager.get_or_deploy_resource()
if existing.resource_hash != config.resource_hash:
    # Identify specific changed fields
    for field in existing.__class__._hashed_fields:
        if getattr(existing, field) != getattr(config, field):
            config.fields_to_update.add(field)

    # Sync deployment state and update in-place
    await config.sync_config_with_deployed_resource(existing)
    deployed_resource = await config.update()
```
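
Diffing per field, rather than acting on the hash mismatch alone, is what lets the update step apply only the platform operations that are actually needed: for example, a template-only change such as new env vars never has to touch the endpoint.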

3. Platform Integration

- New GraphQL operations: `update_endpoint()` and `update_template()` mutations (see the sketch below)
- Granular updates: the system determines whether the template, the endpoint, or both need updating
- State preservation: platform IDs and deployment metadata are maintained across updates
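
A rough sketch of how that decision could look. The field-to-mutation split (`TEMPLATE_FIELDS`/`ENDPOINT_FIELDS`) and the class shape are assumptions for illustration; only the `update_template()`/`update_endpoint()` mutation names and the `fields_to_update` set come from this PR.

```python
# Illustrative split of which fields live on the template vs. the endpoint;
# the real resource classes define this mapping themselves.
TEMPLATE_FIELDS = {"image_name", "env"}
ENDPOINT_FIELDS = {"gpu_count", "workers_max"}


class ServerlessResource:
    def __init__(self):
        self.fields_to_update = set()

    async def update_template(self) -> None:
        # Placeholder for the PR's update_template() GraphQL mutation
        ...

    async def update_endpoint(self) -> None:
        # Placeholder for the PR's update_endpoint() GraphQL mutation
        ...

    async def update(self):
        """Apply only the mutations required by the fields that actually changed."""
        if self.fields_to_update & TEMPLATE_FIELDS:
            await self.update_template()
        # Template-only changes (e.g. env vars) skip the endpoint mutation entirely
        if self.fields_to_update & ENDPOINT_FIELDS:
            await self.update_endpoint()
        self.fields_to_update.clear()
        return self
```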

4. Enhanced Resource Model

- `_hashed_fields`: class-level definition of the configuration fields that trigger updates
- `fields_to_update`: runtime tracking of specific changes to optimize update operations
- `sync_config_with_deployed_resource()`: transfers deployment state between resource instances (see the sketch below)
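
A loose sketch of the state transfer, assuming the platform state lives in attributes such as `endpoint_id` and `template_id`; those names are hypothetical, the PR only states that platform IDs and deployment metadata are carried over from the deployed resource to the new config.

```python
class ServerlessResource:
    # Hypothetical platform-side attributes; deliberately excluded from _hashed_fields
    _platform_state_fields = ("endpoint_id", "template_id", "deployed_at")

    async def sync_config_with_deployed_resource(self, existing) -> None:
        """Copy platform state from the already-deployed resource onto this config
        object so update() addresses the live endpoint instead of creating a new one."""
        for field in self._platform_state_fields:
            setattr(self, field, getattr(existing, field, None))
```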

Bug Fixes

- GPU configuration persistence: fixed an issue where `gpuIds` wasn't being properly stored in pickled resource state
- Template ID tracking: ensures template relationships are maintained through update cycles

Logic flow for resource update/creation (a condensed code sketch follows the diagram)

```mermaid
flowchart TD
    A[get_or_deploy_resource called] --> B[Acquire resource lock]
    B --> C{Resource exists?}
    
    C -->|No| D[Deploy new resource]
    D --> E[Add to manager & save]
    E --> F[Return deployed resource]
    
    C -->|Yes| G{Is resource deployed?}
    G -->|No| H[Remove invalid resource]
    H --> I[Deploy new resource]
    I --> J[Add to manager & save]
    J --> K[Return deployed resource]
    
    G -->|Yes| L{resource_hash changed?}
    L -->|No| M[Resource unchanged]
    M --> N[Return existing resource]
    
    L -->|Yes| O[Config change detected]
    O --> P[Compare _hashed_fields]
    P --> Q[Identify changed fields]
    Q --> R[Add to fields_to_update set]
    R --> S[sync_config_with_deployed_resource]
    S --> T[Call resource.update]
    
    T --> U{Pod template needs update?}
    U -->|Yes| V[Update template via GraphQL]
    V --> W{Template-only changes?}
    W -->|Yes| X[Return updated resource]
    W -->|No| Y[Update endpoint via GraphQL]
    
    U -->|No| Y
    Y --> Z[Remove old resource]
    Z --> AA[Add updated resource]
    AA --> BB[Return updated resource]
```
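
For readers who prefer code to diagrams, here is a condensed sketch of the same flow. The manager internals (`self._resources`, `self._lock`, `self._save()`, `is_deployed()`) are hypothetical names used for illustration; the branch structure mirrors the flowchart above.

```python
import asyncio


class ResourceManager:
    def __init__(self):
        self._resources = {}          # resource_id -> deployed resource
        self._lock = asyncio.Lock()   # "Acquire resource lock"

    def _save(self):
        ...  # persist (pickle) the manager state

    async def get_or_deploy_resource(self, config):
        async with self._lock:
            existing = self._resources.get(config.resource_id)

            # Missing or invalid (undeployed) resource: deploy from scratch
            if existing is None or not existing.is_deployed():
                self._resources.pop(config.resource_id, None)
                deployed = await config.deploy()
                self._resources[config.resource_id] = deployed
                self._save()
                return deployed

            # Same hash: resource unchanged, return it as-is
            if existing.resource_hash == config.resource_hash:
                return existing

            # Hash changed: diff fields, carry over platform state, update in place
            for field in existing.__class__._hashed_fields:
                if getattr(existing, field) != getattr(config, field):
                    config.fields_to_update.add(field)
            await config.sync_config_with_deployed_resource(existing)
            updated = await config.update()

            # Replace the old entry with the updated resource and persist
            self._resources[config.resource_id] = updated
            self._save()
            return updated
```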

In the future, we'll have to integrate with durable Tetra state on the server side.

changed the default resource "identifier" to a resource id built from an
input config name rather than generated from config args, so a logical
resource isn't defined by its config (and we can change the config for the
same resource)

added a "resource hash" to the base sls resource class so we can detect
changes to input config args and update resources in place instead of
redeploying

adds update methods to serverless resources

changed the deploy method to add platform-related state (e.g. durable resource
ids) back to pickled state and config objects at runtime so we can fetch
and interact with runpod sls endpoints created via tetra

added update template methods to the sls resource so we can update
template-only variables via gql (e.g. env vars)

changed the defaults for some sls resource configs to reflect existing
defaults in runpod

added an update path to the resource manager class for when the existing and
new config have different resource hashes

changed the behavior of syncing the gpu and gpuIds fields because there was a
bug where GPUs would always get created and pickled as the ANY GPU group
(see the sketch below)
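
A loose sketch of what the corrected sync could look like, assuming a user-facing `gpu` name and a platform-facing `gpuIds` value; the `GPU_GROUPS` mapping and all names here are illustrative, the PR only states that `gpuIds` is no longer reset to the ANY group and now survives pickling.

```python
# Illustrative mapping from a user-facing GPU name to a platform GPU-group id
GPU_GROUPS = {"A100": "AMPERE_80"}


class ServerlessResource:
    def __init__(self, gpu=None, gpuIds=None):
        self.gpu = gpu
        self.gpuIds = gpuIds

    def _sync_gpu_fields(self) -> None:
        # Old behavior: gpuIds was regenerated on every sync, so resources were
        # always created and pickled with the "ANY" GPU group regardless of gpu.
        # New behavior: an already-set gpuIds is preserved; otherwise derive it.
        if self.gpuIds:
            return
        self.gpuIds = GPU_GROUPS.get(self.gpu, "ANY") if self.gpu else "ANY"
```
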
@jhcipar jhcipar changed the title Jhcipar/ae 1196/update existing resources Feat: update existing resources instead of wholesale recreation Sep 23, 2025
@jhcipar jhcipar changed the title Feat: update existing resources instead of wholesale recreation [WIP] Feat: update existing resources instead of wholesale recreation Sep 23, 2025