[historyserver][collector] Add file-level idempotency check for prev-logs processing on container restart#4321
Conversation
Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
```go
	}
}

// isFileAlreadyPersisted checks if a file has already been persisted to persist-complete-logs
```
isFileAlreadyPersisted detects files already moved to persist-complete-logs and avoids duplicate uploads.
Future-Outlier
left a comment
Hi, thank you for this PR!
Do you mind providing a reproduction script like this so I can test it on my env more easily?
Hi @my-vegetable-has-exploded, thanks for your contribution. Would you mind modifying the PR title and description to clarify that this feature is for container restart, not pod restart? I believe this works only for cases in which the collector sidecar fails and recovers. For now, the data is permanently erased if the head pod restarts, since we use an
If I have misunderstood anything, please correct me. Thanks!
```go
r.prevLogsDir = "/tmp/ray/prev-logs"
r.persistCompleteLogsDir = "/tmp/ray/persist-complete-logs"
```
Could we move these to constants in historyserver/pkg/utils/utils.go, similar to KunWuLuan@9b2ca52?
Hi @justinyeh1995,
Thanks for pointing this out. I’m currently working on it (handling all constants at once) and will open a PR soon. I’m not sure whether it should be included here or handled in a separate PR.
No problem. I wasn't aware of that when I wrote the comment. Is the PR related to any issue, though? I think I need to catch up quite a bit.
Future-Outlier
left a comment
Please also add an e2e test in this PR.
Got it! I will add it tonight.
Future-Outlier
left a comment
cc @lorriexingfang to review
nit: consider using filepath functions to handle file paths.
Thanks for the review. There are similar issues elsewhere in this file. Maybe we can file a new ticket to handle them?
Agreed, let's handle it as a follow-up.
Hi @my-vegetable-has-exploded, do you mind creating an issue so that I can assign it to someone?
no problem.
bugbot run

cursor review
Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
cc @JiangJiaWei1103 @CheyuWu @seanlaii to test this PR locally before you approve this, thank you!
Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
```go
// 2. Inject "leftover" logs into prev-logs via the ray-head container while collector is down.
// Note: We only inject logs, not node_events, because the collector's prev-logs processing
// currently only handles the logs directory. node_events are handled by the EventServer separately.
```
Just curious. How can we ensure that the injection is completed during the collector downtime?
Because Kubernetes restarts the collector immediately after `kill 1`, I can't absolutely guarantee the process stays down until the injection finishes. But the injection is short-lived, so the current approach works well in my local environment.
If you have a better way to handle this, I'd be glad to implement it.
```go
	return
}

// Check if this directory has already been processed by checking in persist-complete-logs
```
No changes are needed here. This note is just to clarify that we still need to check for leftover log files since the presence of log directories under persist-complete-logs directory alone does not guarantee that all logs have been fully persisted.
Thanks for the review! If the collector crashes halfway through a node directory, persist-complete-logs already exists but some logs remain in prev-logs. Keeping this check would cause the collector to skip the entire directory upon restart, leading to data loss. The file-level check isFileAlreadyPersisted handles this "partial success" scenario correctly without duplicate uploads.
Replace it with the following code; you can then trigger this case using `cd historyserver && go test -v ./pkg/collector/logcollector/runtime/logcollector/ -run TestScanAndProcess`. @JiangJiaWei1103

```go
completeDir := filepath.Join(r.persistCompleteLogsDir, sessionID, nodeID, "logs")
if _, err := os.Stat(completeDir); err == nil {
	logrus.Infof("Session %s node %s logs already processed, skipping", sessionID, nodeID)
	return
}
```

And the e2e test also covers this case.
Wow, thanks for the detailed explanation! I originally just wanted to leave a brief note so maintainers understand why we removed the code snippet here. All tests pass in my local env. Thanks!
```go
}

// Walk through the logs directory and process all files
err := filepath.WalkDir(logsDir, func(path string, info fs.DirEntry, err error) error {
```
When the collector restarts, WatchPrevLogsLoops calls processPrevLogsDir, so we can process leftover log files in the prev-logs/{sessionID}/{nodeID}/logs/ directories.
Co-authored-by: 江家瑋 <36886416+JiangJiaWei1103@users.noreply.github.com>
Signed-off-by: yi wang <48236141+my-vegetable-has-exploded@users.noreply.github.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
## Troubleshooting

### "too many open files" error

If you encounter `level=fatal msg="Create fsnotify NewWatcher error too many open files"` in the collector logs, it is likely due to the inotify limits on the Kubernetes nodes.

To fix this, increase the limits on the **host nodes** (not inside the container):

```bash
# Apply changes immediately
sudo sysctl -w fs.inotify.max_user_instances=8192
sudo sysctl -w fs.inotify.max_user_watches=524288
```

To make these changes persistent across reboots, use the following lines:

```bash
echo "fs.inotify.max_user_instances=8192" | sudo tee -a /etc/sysctl.conf
echo "fs.inotify.max_user_watches=524288" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```
Very cool, I'd never heard of this. I learned something!
Future-Outlier
left a comment
LGTM, just revise it, thank you!
cc @rueian to merge
Thanks all!!!
Log persistence can fail if the collector container crashes or restarts. To ensure reliability, the collector must resume processing any logs left in the prev-logs directory upon recovery.
In this PR, we enhanced the WatchPrevLogsLoops() function to perform an initial scan of the prev-logs directory at startup, before entering the watch loop. This ensures that any session or node folders left over from previous runs (e.g., after a container restart) are correctly processed and uploaded.
Why are these changes needed?
Related issue number
Closes #4281
Checks