[sglang-miles] Cherry-pick #24767: Make request dump robust to unpicklable server_args and large meta_info#24902
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a feature to exclude specific keys from request metadata dumps to optimize storage and avoid serialization issues. It also adds a retry mechanism that attempts to dump data without server_args if the initial pickling fails. Feedback highlights the need for consistent metadata filtering across all dump types and suggests making the retry logic more resilient by catching potential errors during the fallback serialization step.
| if self.dump_requests_exclude_meta_keys and isinstance( | ||
| out_dict.get("meta_info"), dict | ||
| ): | ||
| exclude = self.dump_requests_exclude_meta_keys | ||
| if any(k in out_dict["meta_info"] for k in exclude): | ||
| filtered_meta = { | ||
| k: v for k, v in out_dict["meta_info"].items() if k not in exclude | ||
| } | ||
| out_dict = {**out_dict, "meta_info": filtered_meta} |
There was a problem hiding this comment.
The meta_info filtering logic is correctly implemented here for periodic request dumps. However, this filtering is not applied to unfinished requests in dump_requests_before_crash or to requests recorded in record_request_for_crash_dump. If the goal is to exclude bulky keys from all dumps to save space and avoid pickling issues, consider applying this filtering consistently across all dumping paths. Since these functions are not all within the current diff hunks, ensure the logic is shared or repeated where necessary.
| try: | ||
| pickle.dump(to_dump_with_server_args, f) | ||
| except Exception as e: | ||
| # When the server is launched with --trust-remote-code, | ||
| # server_args sometimes fails to pickle. Retry without | ||
| # server_args so the request data still gets persisted. | ||
| logger.error( | ||
| f"Failed to pickle dump with server_args: {e!r}; " | ||
| "retrying without server_args" | ||
| ) | ||
| f.seek(0) | ||
| f.truncate() | ||
| to_dump_with_server_args["server_args"] = None | ||
| pickle.dump(to_dump_with_server_args, f) |
There was a problem hiding this comment.
Consider making the retry logic more robust by wrapping the second pickle.dump in a try...except block. If the failure is caused by something other than server_args (e.g., unpicklable data in the requests themselves), the current implementation will raise an unhandled exception in the background thread. Also, note that this logic is duplicated in dump_requests_before_crash.
| try: | |
| pickle.dump(to_dump_with_server_args, f) | |
| except Exception as e: | |
| # When the server is launched with --trust-remote-code, | |
| # server_args sometimes fails to pickle. Retry without | |
| # server_args so the request data still gets persisted. | |
| logger.error( | |
| f"Failed to pickle dump with server_args: {e!r}; " | |
| "retrying without server_args" | |
| ) | |
| f.seek(0) | |
| f.truncate() | |
| to_dump_with_server_args["server_args"] = None | |
| pickle.dump(to_dump_with_server_args, f) | |
| try: | |
| pickle.dump(to_dump_with_server_args, f) | |
| except Exception as e: | |
| # When the server is launched with --trust-remote-code, | |
| # server_args sometimes fails to pickle. Retry without | |
| # server_args so the request data still gets persisted. | |
| logger.error( | |
| f"Failed to pickle dump with server_args: {e!r}; " | |
| "retrying without server_args" | |
| ) | |
| f.seek(0) | |
| f.truncate() | |
| to_dump_with_server_args["server_args"] = None | |
| try: | |
| pickle.dump(to_dump_with_server_args, f) | |
| except Exception as e2: | |
| logger.error(f"Failed to pickle dump even without server_args: {e2!r}") |
| try: | ||
| pickle.dump(data_to_dump_with_server_args, f) | ||
| except Exception as e: | ||
| # When the server is launched with --trust-remote-code, | ||
| # server_args sometimes fails to pickle. Retry without | ||
| # server_args so the request data still gets persisted. | ||
| logger.error( | ||
| f"Failed to pickle dump with server_args: {e!r}; " | ||
| "retrying without server_args" | ||
| ) | ||
| f.seek(0) | ||
| f.truncate() | ||
| data_to_dump_with_server_args["server_args"] = None | ||
| pickle.dump(data_to_dump_with_server_args, f) |
There was a problem hiding this comment.
Similar to the periodic dump, consider wrapping the second pickle.dump in a try...except block here. Since this is called during crash handling, an unhandled exception here could interfere with the shutdown process or prevent other cleanup tasks from completing.
| try: | |
| pickle.dump(data_to_dump_with_server_args, f) | |
| except Exception as e: | |
| # When the server is launched with --trust-remote-code, | |
| # server_args sometimes fails to pickle. Retry without | |
| # server_args so the request data still gets persisted. | |
| logger.error( | |
| f"Failed to pickle dump with server_args: {e!r}; " | |
| "retrying without server_args" | |
| ) | |
| f.seek(0) | |
| f.truncate() | |
| data_to_dump_with_server_args["server_args"] = None | |
| pickle.dump(data_to_dump_with_server_args, f) | |
| try: | |
| pickle.dump(data_to_dump_with_server_args, f) | |
| except Exception as e: | |
| # When the server is launched with --trust-remote-code, | |
| # server_args sometimes fails to pickle. Retry without | |
| # server_args so the request data still gets persisted. | |
| logger.error( | |
| f"Failed to pickle dump with server_args: {e!r}; " | |
| "retrying without server_args" | |
| ) | |
| f.seek(0) | |
| f.truncate() | |
| data_to_dump_with_server_args["server_args"] = None | |
| try: | |
| pickle.dump(data_to_dump_with_server_args, f) | |
| except Exception as e2: | |
| logger.error(f"Failed to pickle dump even without server_args: {e2!r}") |
Summary
Cherry-pick of #24767 (merge commit
1e6c6d1) ontosglang-miles.server_args=Noneso request data is still persisteddump_requests_exclude_meta_keysto strip bulky keys (routed_experts,hidden_states) from meta_info in dumps--dump-requests-exclude-meta-keysConflict resolution: None — clean cherry-pick (auto-merged).
Test plan
--trust-remote-codemodels that previously caused pickle failuresMade with Cursor