Fix/agentgym server cleanup #53
Merged
📑 Description
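Improve the management and startup verification of the AgentGym background servers in the `train_ppo.sh` script. Main changes:
- Start the AgentGym servers with `setsid` in separate process groups so they can be managed as a unit. The `trap EXIT` cleanup logic now uses `kill -- -PGID` to terminate the whole process group, so that all child processes (including the actual server) are reliably shut down. The variable that stores the process IDs is renamed from `PIDS` to `PGIDS` accordingly.
- Use the `nc` command to try connecting to each server's port, with retry logic. If any server is still unreachable after several attempts, the script aborts and tries to clean up the processes it has already started, so training never begins while a server is not ready. This replaces the previous, unreliable PID-based check.

Together these changes make the script's server management more robust, fix the incomplete server cleanup on script exit, and ensure that training starts only after all servers have started and responded.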
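The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from `train_ppo.sh`: the function names `start_server`, `cleanup`, and `wait_for_port` and the default retry count and delay are assumptions, and `nc -z` is assumed to be supported by the installed netcat variant.

```bash
#!/usr/bin/env bash
# Sketch only: start_server, cleanup and wait_for_port are illustrative names,
# not taken from train_ppo.sh.

PGIDS=""   # space-separated process-group IDs of the started servers

# Launch a command in its own session (and therefore its own process group)
# and record the group ID. In a non-interactive script the background child
# is not a group leader, so setsid does not fork and $! equals the new PGID.
start_server() {
    setsid "$@" &
    PGIDS="$PGIDS $!"
}

# Send SIGTERM to every recorded process group, which also reaches
# grandchildren such as the actual server process.
cleanup() {
    for pgid in $PGIDS; do
        kill -- "-$pgid" 2>/dev/null
    done
    PGIDS=""
}
trap cleanup EXIT

# Poll a TCP port with nc until it accepts a connection or retries run out.
# Arguments: host port [retries] [delay_seconds]
wait_for_port() {
    host=$1 port=$2 retries=${3:-10} delay=${4:-2}
    i=0
    while [ "$i" -lt "$retries" ]; do
        nc -z "$host" "$port" >/dev/null 2>&1 && return 0
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

A script built on this sketch would call `start_server` once per server, then `wait_for_port` for each port, and `exit 1` on the first failure so that the `EXIT` trap tears down whatever was already started.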
✅ Checks
type/descript (e.g. feature/add-llm-agents)
ℹ Additional Information
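TODO, to be checked later:
`train_grpo.sh` should implement similar logic. A corner case still remains: when the port is already occupied by another process, the connection-based startup check reports success.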
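One possible way to narrow this corner case, which is not part of this PR, is to check that the port is free before launching and to confirm that the server's process group is still alive after the port responds. The helpers `port_is_free` and `group_alive` below are hypothetical, and the second check assumes the server exits when it fails to bind its port.

```bash
#!/usr/bin/env bash
# Sketch only: port_is_free and group_alive are hypothetical helpers.

# Succeeds when nothing accepts connections on host:port (checked with nc -z).
# Arguments: host port
port_is_free() {
    ! nc -z "$1" "$2" >/dev/null 2>&1
}

# Succeeds while at least one process remains in process group $1.
group_alive() {
    kill -0 -- "-$1" 2>/dev/null
}
```

Calling `port_is_free` before starting each server catches a stale listener early, and calling `group_alive` with the recorded PGID after a successful port check helps distinguish "our server answered" from "something else was already listening".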