[WIP] Feat: update existing resources instead of wholesale recreation #92
Overview: update existing resources instead of wholesale recreation
Objective: Enable in-place resource updates instead of costly redeploys when configuration changes are detected.
Previously, any configuration change would trigger a complete resource teardown and redeploy cycle.
Solution Overview
This PR introduces an update system that:
Detects configuration changes using content-based hashing
Updates resources in-place via platform APIs when possible
Handles complex updates that may require both template and endpoint modifications
Key Changes
resource_id: Now represents a logical, human-readable identifier (ResourceType_name)
Provides stable identity across configuration changes
Enables resource reuse and update tracking
Replaces the previous hash-based approach for resource identification
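A minimal sketch of how such a logical identifier could be derived (class and attribute names here are illustrative, not the PR's actual code):

```python
# Hypothetical sketch of the "ResourceType_name" identity scheme.
class Resource:
    def __init__(self, name: str):
        self.name = name

    @property
    def resource_id(self) -> str:
        # Stable logical identity: depends only on the type and the
        # user-chosen name, never on configuration content, so it
        # survives configuration changes.
        return f"{type(self).__name__}_{self.name}"

class ServerlessEndpoint(Resource):
    pass

endpoint = ServerlessEndpoint("whisper")
print(endpoint.resource_id)  # ServerlessEndpoint_whisper
```

Because the identifier no longer changes when configuration changes, the manager can look up the previously deployed resource under the same key and decide to update it rather than create a new one.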
resource_hash: Content-based hash for change detection
Built from _hashed_fields - only mutable configuration parameters
Excludes platform state (IDs, deployment metadata) to focus on user-controllable config
Triggers update flow when hash changes between runs
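A hedged sketch of this content-hashing approach, assuming `_hashed_fields` lists the mutable configuration attributes (field names below are made up for illustration):

```python
import hashlib
import json

class Resource:
    # Only mutable, user-controllable configuration participates in the hash.
    _hashed_fields = ("image", "gpu_count", "env")

    def __init__(self, image, gpu_count, env):
        self.image = image
        self.gpu_count = gpu_count
        self.env = env
        self.endpoint_id = None  # platform state: deliberately excluded

    @property
    def resource_hash(self) -> str:
        # Canonical JSON of the hashed fields -> stable content hash.
        config = {f: getattr(self, f) for f in self._hashed_fields}
        payload = json.dumps(config, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

a = Resource("runpod/base:1.0", 1, {"MODE": "prod"})
b = Resource("runpod/base:1.0", 1, {"MODE": "prod"})
b.endpoint_id = "abc123"     # platform state differs...
assert a.resource_hash == b.resource_hash   # ...but the hash is unchanged
b.gpu_count = 2              # a real config change...
assert a.resource_hash != b.resource_hash   # ...triggers the update flow
```

Excluding platform IDs and deployment metadata means a redeploy on the platform side never masquerades as a user-initiated configuration change.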
New GraphQL operations: update_endpoint() and update_template() mutations
Granular updates: System determines whether template, endpoint, or both need updating
State preservation: Maintains platform IDs and deployment metadata across updates
_hashed_fields: Class-level definition of configuration fields that trigger updates
fields_to_update: Runtime tracking of specific changes to optimize update operations
sync_config_with_deployed_resource(): Transfers deployment state between resource instances
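A sketch of how these three pieces could fit together; the method names follow the PR description, but the bodies are assumptions for illustration:

```python
class Resource:
    _hashed_fields = ("image", "gpu_count")

    def __init__(self, image, gpu_count):
        self.image = image
        self.gpu_count = gpu_count
        self.endpoint_id = None   # deployment state, carried across updates
        self.template_id = None
        self.fields_to_update = set()

    def diff_fields(self, deployed: "Resource") -> set:
        # Record only the fields that actually changed, so the update
        # call can be as narrow as possible (template-only vs endpoint).
        changed = {f for f in self._hashed_fields
                   if getattr(self, f) != getattr(deployed, f)}
        self.fields_to_update |= changed
        return changed

    def sync_config_with_deployed_resource(self, deployed: "Resource"):
        # Carry platform IDs forward so the update targets the live resource.
        self.endpoint_id = deployed.endpoint_id
        self.template_id = deployed.template_id

old = Resource("base:1.0", 1)
old.endpoint_id, old.template_id = "ep-1", "tpl-1"
new = Resource("base:1.0", 2)
new.diff_fields(old)                         # only gpu_count changed
new.sync_config_with_deployed_resource(old)  # ep-1 / tpl-1 preserved
```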
Bug Fixes
GPU configuration persistence: Fixed issue where gpuIds wasn't being properly stored in pickled resource state
Template ID tracking: Ensures template relationships are maintained through update cycles
Logic flow for resource update/creation
```mermaid
flowchart TD
    A[get_or_deploy_resource called] --> B[Acquire resource lock]
    B --> C{Resource exists?}
    C -->|No| D[Deploy new resource]
    D --> E[Add to manager & save]
    E --> F[Return deployed resource]
    C -->|Yes| G{Is resource deployed?}
    G -->|No| H[Remove invalid resource]
    H --> I[Deploy new resource]
    I --> J[Add to manager & save]
    J --> K[Return deployed resource]
    G -->|Yes| L{resource_hash changed?}
    L -->|No| M[Resource unchanged]
    M --> N[Return existing resource]
    L -->|Yes| O[Config change detected]
    O --> P[Compare _hashed_fields]
    P --> Q[Identify changed fields]
    Q --> R[Add to fields_to_update set]
    R --> S[sync_config_with_deployed_resource]
    S --> T[Call resource.update]
    T --> U{Pod template needs update?}
    U -->|Yes| V[Update template via GraphQL]
    V --> W{Template-only changes?}
    W -->|Yes| X[Return updated resource]
    W -->|No| Y[Update endpoint via GraphQL]
    U -->|No| Y
    Y --> Z[Remove old resource]
    Z --> AA[Add updated resource]
    AA --> BB[Return updated resource]
```

In the future, we'll have to integrate with durable Tetra state on the server side.
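The decision flow above can be condensed into a short sketch. Everything here is a hypothetical stand-in (the `Manager` and `FakeResource` stubs exist only to make the flow runnable), not the PR's real classes:

```python
from contextlib import contextmanager

class Manager:
    """In-memory stand-in for the resource manager."""
    def __init__(self):
        self._resources = {}
    @contextmanager
    def lock(self, resource_id):
        yield  # a real implementation would hold a per-resource lock
    def get(self, rid): return self._resources.get(rid)
    def add(self, r): self._resources[r.resource_id] = r
    def remove(self, r): self._resources.pop(r.resource_id, None)
    def save(self): pass  # a real implementation persists state

class FakeResource:
    """Minimal stub exposing the interface the flow relies on."""
    def __init__(self, name, gpu_count):
        self.resource_id = f"FakeResource_{name}"
        self.gpu_count = gpu_count
        self.deployed = False
    @property
    def resource_hash(self): return str(self.gpu_count)
    def is_deployed(self): return self.deployed
    def deploy(self): self.deployed = True; return self
    def sync_config_with_deployed_resource(self, other): pass
    def update(self): self.deployed = True; return self

def get_or_deploy_resource(manager, resource):
    with manager.lock(resource.resource_id):
        existing = manager.get(resource.resource_id)
        if existing is None or not existing.is_deployed():
            if existing is not None:
                manager.remove(existing)        # drop invalid state
            deployed = resource.deploy()
            manager.add(deployed); manager.save()
            return deployed
        if resource.resource_hash == existing.resource_hash:
            return existing                     # unchanged: reuse as-is
        # Config change: carry platform IDs over, then update in place.
        resource.sync_config_with_deployed_resource(existing)
        updated = resource.update()
        manager.remove(existing)
        manager.add(updated); manager.save()
        return updated

m = Manager()
r1 = get_or_deploy_resource(m, FakeResource("api", 1))  # fresh deploy
r2 = get_or_deploy_resource(m, FakeResource("api", 1))  # hash equal: reuse
r3 = get_or_deploy_resource(m, FakeResource("api", 2))  # hash changed: update
```

Note that the update branch ends by swapping the old entry for the updated one under the same logical `resource_id`, mirroring the "Remove old resource → Add updated resource" tail of the chart.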