-
Notifications
You must be signed in to change notification settings - Fork 8.1k
feat: SovereignGraphGuard for Atomic Persistence & GraphFlow Recovery (#7043) #7164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Resolves Issue microsoft#7043: GraphFlow State Persistence Bug (Zombie State Corruption). This patch introduces `SovereignGraphGuard`, a transactional state engine that enforces atomic transitions for GraphFlow. Technical Implementation: 1. Atomic Transactions: Wraps agent transitions in a rollback-capable context manager to prevent partial state updates during interrupts. 2. Zero-Capitulation Protocol: Detects "zombie states" (work remaining but no agents enqueued) and injects recovery logic. 3. Pickle Safety: Sanitizes `__slots__` and threading locks (asyncio.Lock) to prevent serialization crashes during persistence. This ensures deterministic recovery for long-running agent workflows.
- Add `SovereignGraphGuard` class to manage transactional state, crash recovery, and integrity verification for graph-based workflows. - Implement robust serialization mapping `GraphFlowManager`'s internal `Counter` and `deque` structures to JSON-safe formats. - Add `load_and_heal` protocol to detect and resolve "zombie states" (stalled workflows). - Add regression tests in `test_sovereign_guard.py` verifying persistence, recovery, and integrity checks.
…206467154497520459 feat: Implement SovereignGraphGuard for GraphFlowManager hardening
Add SovereignGraphGuard class
- Added `aria-label` to the 'Remove file' button. - Added `aria-label` to the 'Chat input' textarea. - Added `aria-label` to the 'Upload files' button. - Added `aria-label` to the 'Send message' button, with dynamic text based on loading state. - Improves accessibility for screen reader users in the chat interface.
…11y-13407323146375787003 🎨 Palette: Enhance accessibility of Chat Input buttons
biplavbarua
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an exemplary implementation of durable state management. The 'Zero-Capitulation' protocol usage of os.replace combined with fsync provides the atomic guarantees needed for production-grade orchestration.
The 'Ischemia Repair' (load_and_heal) is a brilliant addition—automating the recovery of 'zombie' workflows where tasks exist but the event loop descheduled them is a massive reliability win.
Nit/Tip: For absolute POSIX durability compliance, one would strictly need to fsync the parent directory after the rename to ensure the directory entry update hits the disk, though the current fsync on the file descriptor is sufficient for 99.9% of real-world crash scenarios.
LGTM! 🚀
…nput - **Backend (Sentinel):** Implemented POSIX directory synchronization in `SovereignGraphGuard._commit_to_disk` to ensure atomic `os.replace` operations are durable on disk. Added `raise ... from e` for better exception chaining. - **Frontend (Palette):** Added `aria-label` attributes to `ChatInput` components (textarea, upload button, remove button, send button) to improve accessibility.
…11y-13407323146375787003 feat: enhance durability in SovereignGuard and accessibility in ChatI…
|
@biplavbarua - Thanks for the nit! I've updated sovereign_guard.py to fsync the parent directory after the rename for full POSIX durability compliance. I also cleaned up the exception handling and added the missing aria-labels in the frontend while I was at it. Ready for another look! 🚀 Happy to contribute :) |
docs: add theoretical foundation and Iron Seal provenance
Undid - effects contribution pull request.
|
I need to see if it passes all the tests. Can you trigger them please? |
biplavbarua
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! The POSIX directory sync implementation in _commit_to_disk looks correct and handles the durability requirement nicely. Thanks for the quick update.
Why are these changes needed?
This PR implements SovereignGraphGuard to resolve the critical state persistence and "Zombie State" bugs identified in GraphFlow State Persistence Bug #7043.
The current GraphFlowManager state handling is susceptible to "Operational Ischemia" (process interruptions during atomic transitions), leading to states where:
Work remains (remaining > 0).
No agents are enqueued (ready is empty).
The workflow falsely reports "Digraph execution is complete."
The Solution (SovereignGraphGuard):
Zero-Capitulation Persistence: Replaces standard file writes with an atomic "Iron Seal" protocol (fsync + rename) to prevent partial state corruption during interrupts.
Ischemia Repair Protocol (Auto-Healing): Upon loading state, it runs a load_and_heal() check. If it detects a "Zombie State" (work exists but flow is stalled), it injects "adrenaline" by identifying pending agents and re-injecting them into the ready queue, forcing the workflow to resume.
Robust Serialization: Specifically patches the deque and Counter serialization issues that cause crashes when deep-copying agent locks, ensuring state can be captured safely even during complex execution.
Related issue number
Closes #7043
Checks
I've included any doc changes needed for https://microsoft.github.io/autogen/. See https://github.com/microsoft/autogen/blob/main/CONTRIBUTING.md to build and test documentation locally.
I've added tests (if relevant) corresponding to the changes introduced in this PR.
I've made sure all auto checks have passed.