[tests][dask] solve timeouts #5505
Suggestion: if it fails in a Linux environment, could we run strace on the test process to see which syscall the process is hanging on?
@Remy-Luciani I don't know what strace is. If you could describe how to do what you're referring to (even just a link to documentation), we could try it out.
To keep it simple, strace is a CLI tool that traces the system calls a process makes to the Linux kernel. You can attach strace to another process, and it prints the syscalls to stderr (or to a file you specify). At first the output looks hard to read because system calls have abbreviated C-style names like open, mmap... But the API becomes clear fairly quickly, and most of the time we don't need 100% of the information to find the cause of a bug. strace is not part of bash, and most Linux distros don't ship it by default, so you need to install it with your package manager. So some resources:
What could be done for the hanging test process is to launch the command under strace and specify an output file:
strace --output=test_strace.log test-command
Moreover, since there seems to be some parallelism involved, you might need to follow forks/child processes:
strace --follow-forks --output=test_strace.log test-command
If you want to write the traces of each sub-process to a separate file, I encourage you to take a look at the -ff flag in the manual. Let me know if you're struggling with the tool or with reading the trace! :)
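If the tests are launched from a Python script, the strace invocations described above could be assembled programmatically. The following is only a minimal sketch: `build_strace_command` and `build_attach_command` are hypothetical helper names, `test-command` is the placeholder used in this thread, and the short flags `-f`, `-ff`, `-o` and `-p` are the standard strace options for following forks, writing one file per process, choosing the output file, and attaching to a running PID.

```python
import shlex


def build_strace_command(test_command, log_path="test_strace.log",
                         follow_forks=True, per_process_files=False):
    """Build the argv list that runs `test_command` under strace.

    Hypothetical helper for illustration; it only assembles the command.
    """
    argv = ["strace"]
    if per_process_files:
        # -ff: follow forks and write each process to <log_path>.<pid>
        argv.append("-ff")
    elif follow_forks:
        # -f: follow child processes and threads into the same log
        argv.append("-f")
    argv += ["-o", log_path]
    argv += shlex.split(test_command)
    return argv


def build_attach_command(pid, log_path="test_strace.log"):
    """Build the argv list that attaches strace to an already running process."""
    return ["strace", "-f", "-p", str(pid), "-o", log_path]


if __name__ == "__main__":
    # Print the command instead of running it, since strace may not be installed.
    print(shlex.join(build_strace_command("test-command")))
```

The resulting list can be passed to `subprocess.run(...)` on a machine where strace is installed. Attaching with `-p` is the option to reach for when the test process is already stuck.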
Very cool, thank you for taking the time to write that out!! That could be an interesting way to approach debugging this, and I know I'll use it for other things in the future.
@jmoralez since I see you're pushing commits here, I want to be sure... did you see @shiyu1994's description of what the root cause might be, and my suggestion for something to try?
Yes, I saw it. I just wanted to test on macOS in the CI here because I wasn't able to replicate it on my Mac and wanted to check whether it was Linux-specific. I see many failures atm, but they don't seem to be stuck yet.
nvm, seems like 2/3 are going to time out
Sometimes the Dask tests get stuck in the _train_part function with the following call stack:
I'm able to reproduce this locally by trying to train 100 consecutive times.
I haven't been able to reproduce this with the number of threads in the workers equal to 1.