Pool hangs under heavy load #17055
Comments
@scineram why the downvote?
@owlshrimp Ignore the guy. I honestly have no idea why he hasn't been banned from the repo yet. He follows the discussions 24/7 and gives downvotes to everything without ever providing any useful information.
@owlshrimp zfs-2.1.5 is fairly old. It's possible the hang may have been fixed in subsequent releases. Consider upgrading to 2.1.16 if you want to stay on the 2.1.x branch.
If that is indeed the case I guess someone is going to need to go convince Ubuntu to ship a newer version :/
With 2.3.0 out, the 2.1.x branch will receive no further updates, so updating to 2.2.x would be even better. I can't speak for Ubuntu, but 2.1.5 was tagged Jun 21, 2022. Meanwhile, looking at the logs provided, I'd say the problem may not be in ZFS but in some very slow (dying?) storage device. The kernel regularly complains about 120-second delays in the sync thread, but the number is not growing, which means that at some point the writes complete, until the problem happens again in the next TXG.
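To check whether the reported delay is growing or resetting, the hung-task warnings can be pulled out of the kernel log and grouped per task. The following is only a minimal sketch: it assumes the standard kernel wording "INFO: task NAME:PID blocked for more than N seconds", and the helper names `hung_task_durations` and `is_growing` are made up for illustration.

```python
import re
from collections import defaultdict

# Matches the kernel hung-task warning, e.g.
# "INFO: task txg_sync:1234 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def hung_task_durations(dmesg_text):
    """Map (task name, pid) to the reported blocked durations, in log order."""
    durations = defaultdict(list)
    for line in dmesg_text.splitlines():
        match = HUNG_TASK_RE.search(line)
        if match:
            key = (match.group("name"), int(match.group("pid")))
            durations[key].append(int(match.group("secs")))
    return dict(durations)


def is_growing(values):
    """True if the durations strictly increase, i.e. the task never got unstuck."""
    return len(values) > 1 and all(a < b for a, b in zip(values, values[1:]))
```

Feeding it the text output of `dmesg` gives one list per task. A list that stays at 120 suggests the task keeps getting unstuck between warnings, while a strictly increasing list suggests it never recovers.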
@amotin that would be quite odd since these are almost-new WD Reds, but I suppose not impossible. I've had brand new drives fail on me before. Is there a good way to narrow things down to a particular drive if that's the case?
@owlshrimp Use whatever tools you have in your distro to view active device queue depth and request latency. ZFS has own |
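As one distro-independent way to look at queue depth and request latency per device, the counters in Linux's `/proc/diskstats` can be sampled twice and compared. This is a rough sketch rather than a recommended tool: the function names are hypothetical, and it relies on the documented field order (reads completed, ms spent reading, writes completed, ms spent writing, I/Os in progress).

```python
def parse_diskstats(text):
    """Parse /proc/diskstats text into per-device counters."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip malformed or truncated lines
        name = fields[2]
        nums = [int(x) for x in fields[3:14]]
        stats[name] = {
            "reads": nums[0],      # reads completed
            "read_ms": nums[3],    # time spent reading (ms)
            "writes": nums[4],     # writes completed
            "write_ms": nums[7],   # time spent writing (ms)
            "in_flight": nums[8],  # I/Os currently in progress
        }
    return stats


def avg_latency_ms(before, after):
    """Average per-request latency (ms) for each device between two snapshots."""
    result = {}
    for name, b in after.items():
        a = before.get(name)
        if a is None:
            continue
        ios = (b["reads"] - a["reads"]) + (b["writes"] - a["writes"])
        ms = (b["read_ms"] - a["read_ms"]) + (b["write_ms"] - a["write_ms"])
        result[name] = ms / ios if ios else 0.0
    return result
```

Reading `/proc/diskstats` a few seconds apart and passing both snapshots to `avg_latency_ms` would show whether one member of the mirror is much slower than the other. A persistently high `in_flight` on a single disk points the same way.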
Weirdly it's now switched to triggering nearly as soon as I reboot the machine. I haven't even started hammering it, before:
The only change in behaviour on my end is I haven't run a scrub immediately after booting and waited for it to complete. Probably can't, if it occurs in the first 484 seconds. I don't see anything particularly weird in zpool iostat. Nothing unbalanced between the disks in the mirror, and latency writing to disks is in the low tens of ms, which seems reasonable for spinning disks that need to seek. Even with things broken:
I suppose if this is a kernel bug the best next step would be to rip out the Ubuntu package and put in a DKMS build of the latest ZFS in this stable series, then if that makes it go away file a bug with Ubuntu.
Some of the WD Reds were/are SMR. SMR is not a good choice for ZFS.
I'm aware of that and these drives are known not to be SMR. (They even proclaim their suitability for ZFS on the back of the box, as WD started doing to mark CMR drives after the bad press they received.) They also test fine as far as I can tell from a performance standpoint. I'll share my findings as soon as I can get a newer version of ZFS up and running.
System information
This machine is a fairly lightweight AMD fam16h quad-core machine with 16 GB of RAM.
Describe the problem you're observing
During a large download job of videos, with some transcoding going on as well, the pool locks up and becomes unresponsive after a few hours. dmesg shows stack traces of hung kernel tasks.
The occurrences below happened about a week apart both from relatively fresh boots of the system (a few days at most), but I have no reason to think those time frames are relevant. It seems to happen pretty reliably as soon as I start archiving friends' livestreams, no more than about 12 hours in at the latest.
This could be coincidence, but it seems like yt-dlp is usually in the final stage of the download, which involves fixing up the final file's video container and which I assume is heavy in both reading and writing ([FixupM3u8] Fixing MPEG-TS in MP4 container of "<video file name>"). I believe I had two instances of yt-dlp running concurrently this time and both seem to have hung during this stage.
I've turned up a few other possibly-related issues, but some happen under different conditions and none seem at first glance to have similar backtraces:
#15481
#15283
#15217
Describe how to reproduce the problem
Install Ubuntu Server 22.04, install ZFS from the Ubuntu repo, make sure the system is up to date, and then use the yt-dlp utility to download several large (multiple GB) arbitrary VoD video archives of recent streams from Twitch. (Several calls of "yt-dlp <video link>" in a batch file appear sufficient, or one invocation with many links as arguments, say 10-15; repeatedly downloading and then rm-ing the same file in a loop would also likely work.)
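The download load described above could be driven by a small script along these lines. It is a sketch only: the URLs are placeholders to be replaced with real VoD links, the two-worker default just mirrors the two concurrent yt-dlp instances mentioned earlier, and `build_commands` and `run_all` are illustrative names.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_commands(urls):
    """One plain "yt-dlp <video link>" invocation per URL."""
    return [["yt-dlp", url] for url in urls]


def run_all(urls, workers=2):
    """Run the downloads with `workers` yt-dlp processes at a time.

    Returns the exit code of each yt-dlp process, in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(
            pool.map(lambda cmd: subprocess.run(cmd).returncode,
                     build_commands(urls))
        )
```

Calling `run_all` with 10-15 links from a directory on the affected pool should approximate the workload that triggered the hang.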
Include any warning/errors/backtraces from the system logs
First occurrence, relevant (post boot) portion of dmesg:
Second occurrence, relevant portion of dmesg: