Skip to content

Fix corrupted JSONL metrics file due to concurrent writes#19011

Merged
Fridge003 merged 4 commits intosgl-project:mainfrom
talorabr:bugfix/file_requests_metric_exporter_concurrency
Feb 22, 2026
Merged

Fix corrupted JSONL metrics file due to concurrent writes#19011
Fridge003 merged 4 commits intosgl-project:mainfrom
talorabr:bugfix/file_requests_metric_exporter_concurrency

Conversation

@talorabr
Copy link
Contributor

Motivation

When using sglang server with --export-metrics-to-file --export-metrics-to-file-dir /tmp/sglang-metrics, concurrent writes can corrupt the jsonl file of the metrics.

Example of corrupted jsonl file with 32 concurrent requests:

, "id": "a9667994aff3400686d6d3eb95ee0235", "finish_reason": {"type": "stop", "matched": 128009}, "prompt_tokens": 3538, "weight_version": "default", "total_retractions": 0, "queue_time": 0.021411570953205228, "prefill_launch_delay": 0.0005364660173654556, "prefill_launch_latency": 0.06839973502792418, "prefill_finished_ts": 1771503801.2129033, "completion_tokens": 482, "cached_tokens": 45, "e2e_latency": 5.103912115097046, "spec_accept_rate": 0.9147286821705426, "spec_accept_length": 3.7364341085271318, "spec_accept_token_num": 354, "spec_draft_token_num": 387, "spec_verify_ct": 129, "request_received_ts": 1771503800.9961853, "request_sent_to_scheduler_ts": 1771503801.002038, "decode_finished_ts": 1771503806.1000974, "inference_time": 5.0451679839752614, "decode_throughput": 144.13696466037615, "response_sent_to_client_ts": 1771503806.1002116}

Write lock mechanism will prevent this from happening.

Modifications

Added RWLock for the FileRequestMetricsExporter, which is acquired when we open the metrics file in _write_record.

Accuracy Tests

Not expected to impact model output/accuracy.

Benchmarking and Profiling

Not expected to impact latency/throughput.

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @talorabr, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical data integrity issue where concurrent writes to JSONL metrics files could lead to corruption. By introducing a read-write lock, the system now safely handles parallel write operations, ensuring that all exported metrics data remains valid and uncorrupted, which is crucial for reliable performance monitoring.

Highlights

  • Concurrent Write Protection: Implemented a RWLock mechanism within the FileRequestMetricsExporter to prevent data corruption when multiple concurrent requests attempt to write to the same JSONL metrics file.
  • Metrics File Integrity: Ensured the integrity of exported JSONL metrics files by acquiring a writer lock before opening and writing records, resolving issues observed with corrupted files under high concurrency.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • python/sglang/srt/managers/request_metrics_exporter.py
    • Imported RWLock from sglang.srt.utils.aio_rwlock to enable asynchronous read-write locking.
    • Added a new instance variable _current_file_lock of type RWLock to manage access to the metrics file.
    • Wrapped the file handling and writing logic within the write_record method with an async with self._current_file_lock.writer_lock block, ensuring exclusive write access during file operations.
Activity
  • No human activity has been recorded on this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly addresses a potential race condition that could lead to corrupted JSONL metrics files from concurrent writes. The use of a lock to serialize file access within FileRequestMetricsExporter.write_record is the appropriate solution. The implementation is clean and effectively resolves the issue. My review includes a suggestion to simplify the locking mechanism by using a standard asyncio.Lock instead of an RWLock, as only exclusive write access is currently required.


from sglang.srt.managers.io_struct import EmbeddingReqInput, GenerateReqInput
from sglang.srt.server_args import ServerArgs
from sglang.srt.utils.aio_rwlock import RWLock
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since only the writer lock is used in FileRequestMetricsExporter, a simpler asyncio.Lock would suffice and be slightly more direct. You can remove this import and use the asyncio module, which is already imported.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@talorabr Can we replace with a simpler asyncio.Lock as suggested


# File handler state management
self._current_file_handler = None
self._current_file_lock = RWLock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using a standard asyncio.Lock is simpler here since there are no concurrent read operations. While RWLock is not incorrect, asyncio.Lock is more idiomatic for purely exclusive access. If you anticipate adding concurrent read-only operations in the future, keeping RWLock is reasonable.

Suggested change
self._current_file_lock = RWLock()
self._current_file_lock = asyncio.Lock()


# Ensure correct file handler is open for current hour
self._ensure_file_handler(hour_suffix)
async with self._current_file_lock.writer_lock:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To align with the suggestion of using asyncio.Lock, this should be updated to acquire the lock directly.

Suggested change
async with self._current_file_lock.writer_lock:
async with self._current_file_lock:

@Fridge003
Copy link
Collaborator

Do we have example of json file before/after this PR as comparison

@talorabr
Copy link
Contributor Author

Do we have example of json file before/after this PR as comparison

sglang-request-metrics-20260222_04.after.log
sglang-request-metrics-20260222_04.before.log

In the .before.log we can see several corrupted lines, for example: 412, 413, 583, 584
and in the .after.log we can load the file and see that all the lines are valid json

@Fridge003 Fridge003 merged commit c6a99e4 into sgl-project:main Feb 22, 2026
53 of 61 checks passed
magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants