odroidxu4: Lockups with heavy xhci/SATA load #120

scotte · 2015-08-11T16:24:23Z

This is fairly vague at this point, but I have been able to reliably lock up my XU4 under heavy I/O load to a drive plugged into the CloudShell SATA connector. This is on 4.2-rc1-41.

The best way I've been able to do this consistently is by compiling the kernel on an SSD drive via:

$ make -j9 all

I've been able to reproduce on two different brands/types of SSD. It never happens when running the same command off an SD card.

When it locks up, the fan stays running and I have to unplug/replug the power cable. Oddly, the SATA drive won't be recognized until I reboot a second time. The only interesting bit of kernel log after the first reboot is:
usb 4-1.2: reset SuperSpeed USB device number 3 using xhci-hcd

Again, sorry for the vague issue, but curious if anyone else can reproduce.

tobetter · 2015-08-12T02:38:42Z

I have the same issue with my XU4 and CloudShell, and I have been chasing it for a long time with multiple types of HDD and SSD. It happens under very heavy I/O, especially when compiling with "-j" greater than 4, but not with other workloads such as video streaming or disk seeks, perhaps because they do not load the system as heavily. I've tried adding more components to the CloudShell's LCD board, close to the SATA and USB connectors, for signal and power, but still had the same issue. So I am assuming that this might be a USB 3.0 issue.

When I reverted the commit below from the odroidxu4-v4.2-rc1 branch, the XU4 showed fewer errors.
66c28f9

scotte · 2015-08-12T17:15:53Z

Thanks, at least I know it's not just me. It's a plausible theory that it's a USB 3.0 issue, and since it locks up the whole board it's not going to be an easy one to debug. I'll try with usbmon when I get a chance and see if anything jumps out at me.

scotte · 2015-08-12T20:58:02Z

Well, I'm not smart enough to do anything with usbmon output, but I did try a few different things and I found something interesting.

If I set scaling_max_freq for all CPUs to 1400000, "make -j9 all" on the Linux kernel completes without any problem on an SSD. I've noticed when it has locked up before that the CPU temp shown on the LCD is ~90C, yet when I run "stress" (with CPUs at max native clock, 1.4GHz or 2GHz as appropriate) the temp never goes above ~70C. In my compile with CPUs locked to 1.4GHz, the max temp was ~60C.

No hotpluggable CPUs or I'd offline half the cores to see what happens.

None of this means it's a thermal issue necessarily, just some curious data.
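For anyone who wants to repeat the frequency-cap experiment, here is a minimal sketch of writing the same value to every CPU. It assumes the standard cpufreq sysfs layout (`/sys/devices/system/cpu/cpuN/cpufreq/scaling_max_freq`, values in kHz) and root privileges; the function name `cap_max_freq` and its `cpu_root` parameter are made up for illustration and are not something used in this thread.

```python
from pathlib import Path


def cap_max_freq(khz, cpu_root="/sys/devices/system/cpu"):
    """Write `khz` to every cpuN/cpufreq/scaling_max_freq below cpu_root.

    Assumption: the usual cpufreq sysfs layout. `cpu_root` is a
    hypothetical parameter so the function can be pointed at a test
    directory instead of the real sysfs tree.
    Returns the list of files that were written.
    """
    written = []
    # "cpu[0-9]*" matches cpu0, cpu1, ... but not cpufreq/ or cpuidle/.
    pattern = "cpu[0-9]*/cpufreq/scaling_max_freq"
    for node in sorted(Path(cpu_root).glob(pattern)):
        node.write_text(f"{khz}\n")
        written.append(node)
    return written
```

Calling `cap_max_freq(1400000)` as root would apply the 1.4GHz cap described above; the setting does not persist across reboots.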

[Update]
Since "stress" is pretty simple, I tried something a bit more complex - 8 copies of hpcc. I was able to cause a thermal shutdown, but it did not hang:

thermal thermal_zone3: critical temperature reached(121 C),shutting down
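To log temperatures while a stress run is in progress, something along these lines might help. It is only a sketch, assuming the generic Linux thermal sysfs interface (`/sys/class/thermal/thermal_zoneN/temp`, reported in millidegrees Celsius); `read_zone_temps` and `thermal_root` are hypothetical names, and which zone maps to which sensor on the XU4 is not established here.

```python
from pathlib import Path


def read_zone_temps(thermal_root="/sys/class/thermal"):
    """Return {zone_name: temperature in degrees C} for each thermal zone.

    Assumption: each thermal_zoneN directory has a `temp` file holding
    an integer in millidegrees Celsius. Unreadable zones are skipped.
    """
    temps = {}
    for zone in sorted(Path(thermal_root).glob("thermal_zone[0-9]*")):
        try:
            raw = (zone / "temp").read_text().strip()
            temps[zone.name] = int(raw) / 1000.0
        except (OSError, ValueError):
            # Some zones may refuse reads or return non-numeric data.
            continue
    return temps
```

Polling this once a second during "make -j9 all" would show whether the lockup is preceded by a temperature spike.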

tobetter · 2015-08-13T03:15:35Z

Nice try, so 1.4GHz would make the XU4 more stable under heavy load. :)
Why don't you post this to the ODROID forum and keep discussing it with others? Probably someone else who is smarter than me would help us. :)

scotte · 2015-08-13T16:33:16Z

I was thinking this probably belongs on the forum too... I'll post over there and see where it goes. Thanks!

scotte · 2015-08-13T16:57:29Z

Posted here: http://forum.odroid.com/viewtopic.php?f=99&t=15557

Feel free to close this GitHub issue if you think it is no longer productive, though adding your comments to that post might be helpful too. :-)

scotte · 2015-08-19T22:17:26Z

I rebased onto 4.2-rc7, which was trivial. The only conflict is in the RTL8152/RTL8153 driver (r8152.c), which is easily resolved by taking v2.04.0 from the XU4 branch rather than v1.08.1 from the official kernel tree.

I got a little bit better debugging info when trying to reproduce my earlier issues, so it might be worth upgrading this experimental kernel with the latest 4.2-rc7 mainline (and perhaps keep up with it from there) just to see if anyone else finds some additional meaningful debug messages.

You can see the additional debug info in my post at http://forum.odroid.com/viewtopic.php?f=99&t=15557&p=102228#p102228

Thanks!

jobenvil · 2016-02-06T22:23:55Z

Maybe I have the same issue. Kernel 3.10.94-odroidxu4, rootfs on USB3, JMicron controller: Bus 004 Device 003: ID 152d:0551 JMicron Technology Corp. / JMicron USA Technology Corp.

I observed that basic commands are not affected, only some related to heavy load. I checked all the relevant hdparm power-management parameters (-y, -S0, etc.) because I thought the problem was there. I was confused because the system would run well for many hours, but after running certain commands the login shell stays connected while the command cannot be executed. It looks like the controller has hung. Only powering the system off and on helps. After many tests and googling here and there, I found the same behaviour reported with the same controller here: https://bugzilla.redhat.com/show_bug.cgi?id=895085

I updated the JMicron firmware, because I found this as well:
http://www.heise.de/ct/hotline/USB-3-0-Platte-laeuft-nicht-an-Alternative-1155769.html (German)

I came here to follow the kernel development and found your post.

My question: is this a controller issue on the JMicron side or on the kernel side?

I could see I/O errors when I had the rootfs on the SD card.

qknight · 2016-02-20T17:34:00Z

Just saw this under heavy load:

lsusb:
Bus 003 Device 005: ID 152d:2509 JMicron Technology Corp. / JMicron USA Technology Corp. JMS539 SuperSpeed SATA II 3.0G Bridge

[    7.465741] usb 6-1: reset SuperSpeed USB device number 2 using xhci-hcd
[    7.510986] r8152 6-1:1.0 eth0: v2.04.0 (2015/03/06)
[    7.511000] r8152 6-1:1.0 eth0: This product is covered by one or more of the following patents:
                        US6,570,884, US6,115,776, and US6,327,625.
               
[    8.359404] scsi 0:0:0:0: Direct-Access     Jmicron  Corp.                 PQ: 0 ANSI: 2 CCS
[    8.361974] sd 0:0:0:0: [sda] 1953524995 512-byte logical blocks: (1.00 TB/931 GiB)
[    8.362023] sd 0:0:0:0: Attached scsi generic sg0 type 0
[    8.362502] sd 0:0:0:0: [sda] Write Protect is off
[    8.362518] sd 0:0:0:0: [sda] Mode Sense: 28 00 00 00
[    8.362900] sd 0:0:0:0: [sda] No Caching mode page found
[    8.362915] sd 0:0:0:0: [sda] Assuming drive cache: write through
[    8.732395]  sda: sda1 sda2
[    8.735610] sd 0:0:0:0: [sda] Attached SCSI disk
[    9.152874] Adding 4194300k swap on /dev/sda1.  Priority:-1 extents:1 across:4194300k 
[    9.381289] EXT4-fs (sda2): recovery complete
[    9.381313] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)
[   10.124210] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   13.934238] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[27362.248366] usb 3-1.2: reset high-speed USB device number 3 using xhci-hcd
[27431.208260] usb 3-1.2: reset high-speed USB device number 3 using xhci-hcd
[27436.297052] sd 0:0:0:0: [sda] UNKNOWN(0x2003) Result: hostbyte=0x03 driverbyte=0x00
[27436.297120] sd 0:0:0:0: [sda] CDB: opcode=0x28 28 00 6f c0 87 10 00 00 08 00
[27436.297169] blk_update_request: I/O error, dev sda, sector 1874888464
[27436.319194] xhci-hcd xhci-hcd.6.auto: WARN Event TRB for slot 3 ep 2 with no TDs queued?
[27436.319337] sd 0:0:0:0: [sda] UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=0x00
[27436.319427] sd 0:0:0:0: [sda] CDB: opcode=0x28 28 00 6f c0 87 18 00 00 f0 00
[27436.319475] blk_update_request: I/O error, dev sda, sector 1874888472
[27437.855255] xhci-hcd xhci-hcd.6.auto: WARN Event TRB for slot 3 ep 2 with no TDs queued?
[27439.202027] usb 3-1.2: USB disconnect, device number 3
[27439.208299] sd 0:0:0:0: [sda] UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
[27439.208361] sd 0:0:0:0: [sda] CDB: opcode=0x28 28 00 6f c0 87 20 00 00 e8 00
[27439.208416] blk_update_request: I/O error, dev sda, sector 1874888480
[27439.208846] sd 0:0:0:0: [sda] UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
[27439.208943] sd 0:0:0:0: [sda] CDB: opcode=0x2a 2a 00 3a 45 1e f0 00 00 f0 00
[27439.208995] blk_update_request: I/O error, dev sda, sector 977608432
[27439.209172] EXT4-fs error (device sda2): __ext4_get_inode_loc:3927: inode #58391728: block 233312234: comm svn: unable to read itable block
[27439.209553] Write-error on swap-device (8:0:15512)
[27439.210075] Read-error on swap-device (8:0:4864)
[27439.210937] Aborting journal on device sda2-8.
[27439.212396] Read-error on swap-device (8:0:15296)
[27439.213410] JBD2: Error -5 detected when updating journal superblock for sda2-8.
[27439.218231] Read-error on swap-device (8:0:15304)
[27439.218315] Read-error on swap-device (8:0:15312)
[27439.218564] Read-error on swap-device (8:0:15320)
[27440.166054] Read-error on swap-device (8:0:9192)
[27440.166066] Read-error on swap-device (8:0:9200)
[27440.166074] Read-error on swap-device (8:0:9208)
[27440.166081] Read-error on swap-device (8:0:9216)
[27440.187517] Read-error on swap-device (8:0:9320)
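The `CDB:` and `hostbyte=` fields in the log above can be decoded by hand, which helps when correlating them with the `blk_update_request` sector numbers. The sketch below assumes the standard 10-byte READ(10)/WRITE(10) layout (opcode, then a big-endian 32-bit LBA in bytes 2-5 and a 16-bit block count in bytes 7-8) and the Linux SCSI host byte values; `decode_cdb10` and the two tables are illustrative names, and the tables list only the codes that occur in this log.

```python
# Subset of Linux SCSI host byte codes, limited to those seen above.
HOST_BYTES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",  # device gone (after the USB disconnect)
    0x03: "DID_TIME_OUT",    # command timed out
    0x07: "DID_ERROR",       # internal/transport error
}

# The two opcodes that appear in the log.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def decode_cdb10(cdb_hex):
    """Decode a 10-byte READ(10)/WRITE(10) CDB given as hex bytes.

    Returns (operation, starting LBA, number of blocks).
    """
    raw = bytes.fromhex(cdb_hex)
    if len(raw) != 10 or raw[0] not in OPCODES:
        raise ValueError(f"not a READ(10)/WRITE(10) CDB: {cdb_hex!r}")
    lba = int.from_bytes(raw[2:6], "big")     # bytes 2..5
    blocks = int.from_bytes(raw[7:9], "big")  # bytes 7..8
    return OPCODES[raw[0]], lba, blocks


print(decode_cdb10("28 00 6f c0 87 10 00 00 08 00"))
# -> ('READ(10)', 1874888464, 8)
```

The decoded LBA matches the sector reported on the following `blk_update_request` line. Read this way, the first failure is a timeout (hostbyte 0x03), followed by a transport error (0x07), and then "no connect" (0x01) once the bridge has disconnected.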

Obihoernchen · 2016-03-05T12:54:00Z

Maybe related: #166 ?

uDude · 2016-07-15T17:52:40Z

You should consider switching to the odroidxu4-v4.6.3 kernel. A lot of changes are going on there, and I believe @tobetter will be releasing a server OS based on the 4.6 kernel. Any problems in that kernel should be discussed there (probably close this and start a new ticket under that code base if needed).

jobenvil · 2016-07-15T18:22:13Z

Yes, indeed. I'm trying to work out where this issue comes from. I thought it would be solved with newer kernels such as 4.6.x, but it wasn't; even with 4.7-rc4 I observe the same behaviour. Yesterday I got the following error while running, for example, rsync /media/usb1 /media/usb2. During this command the hard disk is unmounted and rsync aborts:

root@hiperborea:~# [ 5796.947852] blk_update_request: I/O error, dev sda, sector 251955144
[ 5796.953522] blk_update_request: I/O error, dev sda, sector 251955400
[ 5797.299869] EXT4-fs error (device sda2): ext4_find_entry:1456: inode #4456451: comm rsync: reading directory lblock 0
[ 5797.309102] Aborting journal on device sda2-8.
[ 5797.313422] Buffer I/O error on dev sda2, logical block 60325888, lost sync page write
[ 5797.321372] JBD2: Error -5 detected when updating journal superblock for sda2-8.
[ 5797.328774] EXT4-fs (sda2): Remounting filesystem read-only

Now I'm doing the same copy with a Banana Pro instead of the ODROID-XU4. In this case, with the same rsync command, I also observe errors on USB, but they are handled more intelligently: the SATA link speed is reduced automatically and rsync carries on (it is still copying at this moment):

[ 1299.408654] usb 2-1: new high-speed USB device number 2 using ehci-platform
[ 1299.590648] usb 2-1: New USB device found, idVendor=152d, idProduct=3562
[ 1299.590681] usb 2-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[ 1299.590707] usb 2-1: Product: AD TO BE II
[ 1299.590724] usb 2-1: Manufacturer: ADMKIV
[ 1299.590740] usb 2-1: SerialNumber: DB123456789628
[ 1299.594331] scsi host2: uas
[ 1299.596990] scsi 2:0:0:0: Direct-Access     ADplus   SuperVer         6302 PQ: 0 ANSI: 6
[ 1299.671401] sd 2:0:0:0: Attached scsi generic sg2 type 0
[ 1299.671859] sd 2:0:0:0: [sdc] 976773168 512-byte logical blocks: (500 GB/466 GiB)
[ 1299.671887] sd 2:0:0:0: [sdc] 4096-byte physical blocks
[ 1299.674352] sd 2:0:0:0: [sdc] Write Protect is off
[ 1299.674396] sd 2:0:0:0: [sdc] Mode Sense: 53 00 00 08
[ 1299.675435] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 1299.733752]  sdc: sdc1 sdc2
[ 1299.743770] sd 2:0:0:0: [sdc] Attached SCSI disk
[ 1393.488354] EXT4-fs (sdc2): mounted filesystem with ordered data mode. Opts: (null)
[ 3108.943181] ata1: exception Emask 0x10 SAct 0x0 SErr 0x10200 action 0xe frozen
[ 3108.958612] ata1: irq_stat 0x00400000, PHY RDY changed
[ 3108.968376] ata1: SError: { Persist PHYRdyChg }
[ 3108.977295] ata1: hard resetting link
[ 3112.664401] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 3112.703827] ata1.00: configured for UDMA/133
[ 3112.703882] ata1: EH complete
[ 3112.997537] ata1: exception Emask 0x10 SAct 0x0 SErr 0x10200 action 0xe frozen
[ 3113.008408] ata1: irq_stat 0x00400000, PHY RDY changed
[ 3113.017104] ata1: SError: { Persist PHYRdyChg }
[ 3113.025302] ata1: hard resetting link
[ 3115.084371] ata1: SATA link down (SStatus 0 SControl 300)
[ 3115.084409] ata1.00: link offline, clearing class 1 to NONE
[ 3115.112720] ata1: hard resetting link
[ 3120.474327] ata1: link is slow to respond, please be patient (ready=0)
[ 3124.154497] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 3124.156609] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x100)
[ 3124.156637] ata1.00: revalidation failed (errno=-5)
[ 3124.166027] ata1: limiting SATA link speed to 1.5 Gbps
[ 3129.154604] ata1: hard resetting link
[ 3129.478364] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 3129.481557] ata1.00: configured for UDMA/133
[ 3129.481609] ata1: EH complete
[ 4499.520937] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x10200 action 0xe frozen
[ 4499.537317] ata1.00: irq_stat 0x00400000, PHY RDY changed
[ 4499.547481] ata1: SError: { Persist PHYRdyChg }
[ 4499.556923] ata1.00: failed command: SMART
[ 4499.566013] ata1.00: cmd b0/d5:01:01:4f:c2/00:00:00:00:00/00 tag 20 pio 512 in
                        res 50/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
[ 4499.595255] ata1.00: status: { DRDY }
[ 4499.604053] ata1: hard resetting link
[ 4502.670925] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 4502.694206] ata1.00: configured for UDMA/133
[ 4502.694388] ata1: EH complete
root@locky:/media# uname -a
Linux locky 4.6.2-sunxi #1 SMP Thu Jun 9 02:59:09 PDT 2016 armv7l armv7l armv7l GNU/Linux

For comparison, the ODROID-XU4 is using xhci, uas and the same bus 04, while the Banana Pro is using ehci, uas and different buses.
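The speed fallback in the Banana Pro log can be read directly from the `SStatus`/`SControl` values that libata prints in hex. The sketch below assumes the standard SATA register layout (DET in bits 0-3, SPD in bits 4-7, IPM in bits 8-11); `decode_sata_reg` is a hypothetical helper name.

```python
# SPD field meaning. In SStatus it is the negotiated speed; in SControl
# it is the highest speed allowed (0 means no restriction).
SPD_NAMES = {0: "none / no limit", 1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}


def decode_sata_reg(hex_value):
    """Split an SStatus or SControl value (hex string) into its fields."""
    value = int(hex_value, 16)
    return {
        "det": value & 0xF,         # device detection / PHY state
        "spd": (value >> 4) & 0xF,  # speed (see SPD_NAMES)
        "ipm": (value >> 8) & 0xF,  # interface power management
    }


for reg in ("123", "113", "300", "310"):
    fields = decode_sata_reg(reg)
    print(reg, fields, SPD_NAMES.get(fields["spd"], "unknown"))
```

In the log, `SStatus 123 SControl 300` is a 3.0 Gbps link with no speed limit, while `SStatus 113 SControl 310` after "limiting SATA link speed to 1.5 Gbps" shows the limit written into SControl and the link renegotiated at 1.5 Gbps.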

I agree, this can be closed and reopened against a newer kernel if needed, though the problem may not be kernel-related.

[ Upstream commit 994ec64 ] The probe function is not allowed to fail after registering the RTC because the following may happen: CPU0: CPU1: sys_load_module() do_init_module() do_one_initcall() cmos_do_probe() rtc_device_register() __register_chrdev() cdev->owner = struct module* open("/dev/rtc0") rtc_device_unregister() module_put() free_module() module_free(mod->module_core) /* struct module *module is now freed */ chrdev_open() spin_lock(cdev_lock) cdev_get() try_module_get() module_is_live() /* dereferences already freed struct module* */ Also, the interrupt handler: ac100_rtc_irq() is dereferencing chip->rtc but this may still be NULL when it is called, resulting in: Unable to handle kernel NULL pointer dereference at virtual address 00000194 pgd = (ptrval) [00000194] *pgd=00000000 Internal error: Oops: 5 [hardkernel#1] SMP ARM Modules linked in: CPU: 0 PID: 72 Comm: irq/71-ac100-rt Not tainted 4.15.0-rc1-next-20171201-dirty hardkernel#120 Hardware name: Allwinner sun8i Family task: (ptrval) task.stack: (ptrval) PC is at mutex_lock+0x14/0x3c LR is at ac100_rtc_irq+0x38/0xc8 pc : [<c06543a4>] lr : [<c04d9a2c>] psr: 60000053 sp : ee9c9f28 ip : 00000000 fp : ee9adfdc r10: 00000000 r9 : c0a04c48 r8 : c015ed18 r7 : ee9bd600 r6 : ee9c9f28 r5 : ee9af590 r4 : c0a04c48 r3 : ef3cb3c0 r2 : 00000000 r1 : ee9af590 r0 : 00000194 Flags: nZCv IRQs on FIQs off Mode SVC_32 ISA ARM Segment none Control: 10c5387d Table: 4000406a DAC: 00000051 Process irq/71-ac100-rt (pid: 72, stack limit = 0x(ptrval)) Stack: (0xee9c9f28 to 0xee9ca000) 9f20: 00000000 7c2fd1be c015ed18 ee9adf40 ee9c0400 ee9c0400 9f40: ee9adf40 c015ed34 ee9c8000 ee9adf64 ee9c0400 c015f040 ee9adf80 00000000 9f60: c015ee24 7c2fd1be ee9adfc0 ee9adf80 00000000 ee9c8000 ee9adf40 c015eef4 9f80: ef1eba34 c0138f14 ee9c8000 ee9adf80 c0138df4 00000000 00000000 00000000 9fa0: 00000000 00000000 00000000 c01010e8 00000000 00000000 00000000 00000000 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
9fe0: 00000000 00000000 00000000 00000000 00000013 00000000 ffffffff ffffffff [<c06543a4>] (mutex_lock) from [<c04d9a2c>] (ac100_rtc_irq+0x38/0xc8) [<c04d9a2c>] (ac100_rtc_irq) from [<c015ed34>] (irq_thread_fn+0x1c/0x54) [<c015ed34>] (irq_thread_fn) from [<c015f040>] (irq_thread+0x14c/0x214) [<c015f040>] (irq_thread) from [<c0138f14>] (kthread+0x120/0x150) [<c0138f14>] (kthread) from [<c01010e8>] (ret_from_fork+0x14/0x2c) Solve both issues by moving to devm_rtc_allocate_device()/rtc_register_device() Reported-by: Quentin Schulz <[email protected]> Tested-by: Quentin Schulz <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]> Signed-off-by: Sasha Levin <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 5c4c450 upstream. The parameters of v4l2_ctrl_new_std_menu_items() are tricky: instead of the number of possible values, it requires the number of the maximum value. In other words, the ARRAY_SIZE() value should be decremented, otherwise it will go past the array bounds, as warned by KASAN: [ 279.839688] BUG: KASAN: global-out-of-bounds in v4l2_querymenu+0x10d/0x180 [videodev] [ 279.839709] Read of size 8 at addr ffffffffc10a4cb0 by task v4l2-compliance/16676 [ 279.839736] CPU: 1 PID: 16676 Comm: v4l2-compliance Not tainted 4.18.0-rc2+ #120 [ 279.839741] Hardware name: /NUC5i7RYB, BIOS RYBDWi35.86A.0364.2017.0511.0949 05/11/2017 [ 279.839743] Call Trace: [ 279.839758] dump_stack+0x71/0xab [ 279.839807] ? v4l2_querymenu+0x10d/0x180 [videodev] [ 279.839817] print_address_description+0x1c9/0x270 [ 279.839863] ? v4l2_querymenu+0x10d/0x180 [videodev] [ 279.839871] kasan_report+0x237/0x360 [ 279.839918] v4l2_querymenu+0x10d/0x180 [videodev] [ 279.839964] __video_do_ioctl+0x2c8/0x590 [videodev] [ 279.840011] ? copy_overflow+0x20/0x20 [videodev] [ 279.840020] ? avc_ss_reset+0xa0/0xa0 [ 279.840028] ? check_stack_object+0x21/0x60 [ 279.840036] ? __check_object_size+0xe7/0x240 [ 279.840080] video_usercopy+0xed/0x730 [videodev] [ 279.840123] ? copy_overflow+0x20/0x20 [videodev] [ 279.840167] ? v4l_enumstd+0x40/0x40 [videodev] [ 279.840177] ? __handle_mm_fault+0x9f9/0x1ba0 [ 279.840186] ? __pmd_alloc+0x2c0/0x2c0 [ 279.840193] ? __vfs_write+0xb6/0x350 [ 279.840200] ? kernel_read+0xa0/0xa0 [ 279.840244] ? video_usercopy+0x730/0x730 [videodev] [ 279.840284] v4l2_ioctl+0xa1/0xb0 [videodev] [ 279.840295] do_vfs_ioctl+0x117/0x8a0 [ 279.840303] ? selinux_file_ioctl+0x211/0x2f0 [ 279.840313] ? ioctl_preallocate+0x120/0x120 [ 279.840319] ? 
selinux_capable+0x20/0x20 [ 279.840332] ksys_ioctl+0x70/0x80 [ 279.840342] __x64_sys_ioctl+0x3d/0x50 [ 279.840351] do_syscall_64+0x6d/0x1c0 [ 279.840361] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 279.840367] RIP: 0033:0x7fdfb46275d7 [ 279.840369] Code: b3 66 90 48 8b 05 b1 48 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48 [ 279.840474] RSP: 002b:00007ffee1179038 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [ 279.840483] RAX: ffffffffffffffda RBX: 00007ffee1179180 RCX: 00007fdfb46275d7 [ 279.840488] RDX: 00007ffee11790c0 RSI: 00000000c02c5625 RDI: 0000000000000003 [ 279.840493] RBP: 0000000000000002 R08: 0000000000000020 R09: 00000000009f0902 [ 279.840497] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffee117a5a0 [ 279.840501] R13: 00007ffee11790c0 R14: 0000000000000002 R15: 0000000000000000 [ 279.840515] The buggy address belongs to the variable: [ 279.840535] tvp5150_test_patterns+0x10/0xffffffffffffe360 [tvp5150] Fixes: c43875f ("[media] tvp5150: replace MEDIA_ENT_F_CONN_TEST by a control") Cc: [email protected] Signed-off-by: Mauro Carvalho Chehab <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit c278c25 upstream. There is a race between arc_emac_tx() and arc_emac_tx_clean(). sk_buff got freed by arc_emac_tx_clean() while arc_emac_tx() submitting sk_buff. In order to free sk_buff arc_emac_tx_clean() checks: if ((info & FOR_EMAC) || !txbd->data) break; ... dev_kfree_skb_irq(skb); If condition false, arc_emac_tx_clean() free sk_buff. In order to submit txbd, arc_emac_tx() do: priv->tx_buff[*txbd_curr].skb = skb; ... priv->txbd[*txbd_curr].data = cpu_to_le32(addr); ... ... <== arc_emac_tx_clean() check condition here ... <== (info & FOR_EMAC) is false ... <== !txbd->data is false ... *info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len); In order to reproduce the situation, run device: # iperf -s run on host: # iperf -t 600 -c <device-ip-addr> [ 28.396284] ------------[ cut here ]------------ [ 28.400912] kernel BUG at .../net/core/skbuff.c:1355! [ 28.414019] Internal error: Oops - BUG: 0 [#1] SMP ARM [ 28.419150] Modules linked in: [ 28.422219] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G B 4.4.0+ #120 [ 28.429516] Hardware name: Rockchip (Device Tree) [ 28.434216] task: c0665070 ti: c0660000 task.ti: c0660000 [ 28.439622] PC is at skb_put+0x10/0x54 [ 28.443381] LR is at arc_emac_poll+0x260/0x474 [ 28.447821] pc : [<c03af580>] lr : [<c028fec4>] psr: a0070113 [ 28.447821] sp : c0661e58 ip : eea68502 fp : ef377000 [ 28.459280] r10: 0000012c r9 : f08b2000 r8 : eeb57100 [ 28.464498] r7 : 00000000 r6 : ef376594 r5 : 00000077 r4 : ef376000 [ 28.471015] r3 : 0030488b r2 : ef13e880 r1 : 000005ee r0 : eeb57100 [ 28.477534] Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none [ 28.484658] Control: 10c5387d Table: 8eaf004a DAC: 00000051 [ 28.490396] Process swapper/0 (pid: 0, stack limit = 0xc0660210) [ 28.496393] Stack: (0xc0661e58 to 0xc0662000) [ 28.500745] 1e40: 00000002 00000000 [ 28.508913] 1e60: 00000000 ef376520 00000028 f08b23b8 00000000 ef376520 ef7b6900 c028fc64 [ 28.517082] 1e80: 2f158000 c0661ea8 c0661eb0 0000012c c065e900 c03bdeac ffff95e9 
c0662100 [ 28.525250] 1ea0: c0663924 00000028 c0661ea8 c0661ea8 c0661eb0 c0661eb0 0000001e c0660000 [ 28.533417] 1ec0: 40000003 00000008 c0695a00 0000000a c066208c 00000100 c0661ee0 c0027410 [ 28.541584] 1ee0: ef0fb700 2f158000 00200000 ffff95e8 00000004 c0662100 c0662080 00000003 [ 28.549751] 1f00: 00000000 00000000 00000000 c065b45c 0000001e ef005000 c0647a30 00000000 [ 28.557919] 1f20: 00000000 c0027798 00000000 c005cf40 f0802100 c0662ffc c0661f60 f0803100 [ 28.566088] 1f40: c0661fb8 c00093bc c000ffb4 60070013 ffffffff c0661f94 c0661fb8 c00137d4 [ 28.574267] 1f60: 00000001 00000000 00000000 c001ffa0 00000000 c0660000 00000000 c065a364 [ 28.582441] 1f80: c0661fb8 c0647a30 00000000 00000000 00000000 c0661fb0 c000ffb0 c000ffb4 [ 28.590608] 1fa0: 60070013 ffffffff 00000051 00000000 00000000 c005496c c0662400 c061bc40 [ 28.598776] 1fc0: ffffffff ffffffff 00000000 c061b680 00000000 c0647a30 00000000 c0695294 [ 28.606943] 1fe0: c0662488 c0647a2c c066619c 6000406a 413fc090 6000807c 00000000 00000000 [ 28.615127] [<c03af580>] (skb_put) from [<ef376520>] (0xef376520) [ 28.621218] Code: e5902054 e590c090 e3520000 0a000000 (e7f001f2) [ 28.627307] ---[ end trace 4824734e2243fdb6 ]--- [ 34.377068] Internal error: Oops: 17 [#1] SMP ARM [ 34.382854] Modules linked in: [ 34.385947] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.4.0+ #120 [ 34.392219] Hardware name: Rockchip (Device Tree) [ 34.396937] task: ef02d040 ti: ef05c000 task.ti: ef05c000 [ 34.402376] PC is at __dev_kfree_skb_irq+0x4/0x80 [ 34.407121] LR is at arc_emac_poll+0x130/0x474 [ 34.411583] pc : [<c03bb640>] lr : [<c028fd94>] psr: 60030013 [ 34.411583] sp : ef05de68 ip : 0008e83c fp : ef377000 [ 34.423062] r10: c001bec4 r9 : 00000000 r8 : f08b24c8 [ 34.428296] r7 : f08b2400 r6 : 00000075 r5 : 00000019 r4 : ef376000 [ 34.434827] r3 : 00060000 r2 : 00000042 r1 : 00000001 r0 : 00000000 [ 34.441365] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none [ 34.448507] Control: 10c5387d Table: 8f25c04a DAC: 00000051 
[ 34.454262] Process ksoftirqd/0 (pid: 3, stack limit = 0xef05c210) [ 34.460449] Stack: (0xef05de68 to 0xef05e000) [ 34.464827] de60: ef376000 c028fd94 00000000 c0669480 c0669480 ef376520 [ 34.473022] de80: 00000028 00000001 00002ae ef376520 ef7b6900 c028fc64 2f158000 ef05dec0 [ 34.481215] dea0: ef05dec8 0000012c c065e900 c03bdeac ffff983f c0662100 c0663924 00000028 [ 34.489409] dec0: ef05dec0 ef05dec0 ef05dec8 ef05dec8 ef7b6000 ef05c000 40000003 00000008 [ 34.497600] dee0: c0695a00 0000000a c066208c 00000100 ef05def8 c0027410 ef7b6000 40000000 [ 34.505795] df00: 04208040 ffff983e 00000004 c0662100 c0662080 00000003 ef05c000 ef027340 [ 34.513985] df20: ef05c000 c0666c2c 00000000 00000001 00000002 00000000 00000000 c0027568 [ 34.522176] df40: ef027340 c003ef48 ef027300 00000000 ef027340 c003edd4 00000000 00000000 [ 34.530367] df60: 00000000 c003c37c ffffff7f 00000001 00000000 ef027340 00000000 00030003 [ 34.538559] df80: ef05df80 ef05df80 00000000 00000000 ef05df90 ef05df90 ef05dfac ef027300 [ 34.546750] dfa0: c003c2a4 00000000 00000000 c000f578 00000000 00000000 00000000 00000000 [ 34.554939] dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 34.563129] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000 ffffffff dfff7fff [ 34.571360] [<c03bb640>] (__dev_kfree_skb_irq) from [<c028fd94>] (arc_emac_poll+0x130/0x474) [ 34.579840] [<c028fd94>] (arc_emac_poll) from [<c03bdeac>] (net_rx_action+0xdc/0x28c) [ 34.587712] [<c03bdeac>] (net_rx_action) from [<c0027410>] (__do_softirq+0xcc/0x1f8) [ 34.595482] [<c0027410>] (__do_softirq) from [<c0027568>] (run_ksoftirqd+0x2c/0x50) [ 34.603168] [<c0027568>] (run_ksoftirqd) from [<c003ef48>] (smpboot_thread_fn+0x174/0x18c) [ 34.611466] [<c003ef48>] (smpboot_thread_fn) from [<c003c37c>] (kthread+0xd8/0xec) [ 34.619075] [<c003c37c>] (kthread) from [<c000f578>] (ret_from_fork+0x14/0x3c) [ 34.626317] Code: e8bd8010 e3a00000 e12fff1e e92d4010 (e59030a4) [ 34.632572] ---[ end trace cca5a3d86a82249a 
]--- Signed-off-by: Alexander Kochetkov <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Ben Hutchings <[email protected]>

commit 1413ef6 upstream. The struct cdev is embedded in the struct i2c_dev. In the current code, we would free the i2c_dev struct directly in put_i2c_dev(), but the cdev is manged by a kobject, and the release of it is not predictable. So it is very possible that the i2c_dev is freed before the cdev is entirely released. We can easily get the following call trace with CONFIG_DEBUG_KOBJECT_RELEASE and CONFIG_DEBUG_OBJECTS_TIMERS enabled. ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x38 WARNING: CPU: 19 PID: 1 at lib/debugobjects.c:325 debug_print_object+0xb0/0xf0 Modules linked in: CPU: 19 PID: 1 Comm: swapper/0 Tainted: G W 5.2.20-yocto-standard+ hardkernel#120 Hardware name: Marvell OcteonTX CN96XX board (DT) pstate: 80c00089 (Nzcv daIf +PAN +UAO) pc : debug_print_object+0xb0/0xf0 lr : debug_print_object+0xb0/0xf0 sp : ffff00001292f7d0 x29: ffff00001292f7d0 x28: ffff800b82151788 x27: 0000000000000001 x26: ffff800b892c0000 x25: ffff0000124a2558 x24: 0000000000000000 x23: ffff00001107a1d8 x22: ffff0000116b5088 x21: ffff800bdc6afca8 x20: ffff000012471ae8 x19: ffff00001168f2c8 x18: 0000000000000010 x17: 00000000fd6f304b x16: 00000000ee79de43 x15: ffff800bc0e80568 x14: 79616c6564203a74 x13: 6e6968207473696c x12: 5f72656d6974203a x11: ffff0000113f0018 x10: 0000000000000000 x9 : 000000000000001f x8 : 0000000000000000 x7 : ffff0000101294cc x6 : 0000000000000000 x5 : 0000000000000000 x4 : 0000000000000001 x3 : 00000000ffffffff x2 : 0000000000000000 x1 : 387fc15c8ec0f200 x0 : 0000000000000000 Call trace: debug_print_object+0xb0/0xf0 __debug_check_no_obj_freed+0x19c/0x228 debug_check_no_obj_freed+0x1c/0x28 kfree+0x250/0x440 put_i2c_dev+0x68/0x78 i2cdev_detach_adapter+0x60/0xc8 i2cdev_notifier_call+0x3c/0x70 notifier_call_chain+0x8c/0xe8 blocking_notifier_call_chain+0x64/0x88 device_del+0x74/0x380 device_unregister+0x54/0x78 i2c_del_adapter+0x278/0x2d0 unittest_i2c_bus_remove+0x3c/0x80 platform_drv_remove+0x30/0x50 
device_release_driver_internal+0xf4/0x1c0 driver_detach+0x58/0xa0 bus_remove_driver+0x84/0xd8 driver_unregister+0x34/0x60 platform_driver_unregister+0x20/0x30 of_unittest_overlay+0x8d4/0xbe0 of_unittest+0xae8/0xb3c do_one_initcall+0xac/0x450 do_initcall_level+0x208/0x224 kernel_init_freeable+0x2d8/0x36c kernel_init+0x18/0x108 ret_from_fork+0x10/0x1c irq event stamp: 3934661 hardirqs last enabled at (3934661): [<ffff00001009fa04>] debug_exception_exit+0x4c/0x58 hardirqs last disabled at (3934660): [<ffff00001009fb14>] debug_exception_enter+0xa4/0xe0 softirqs last enabled at (3934654): [<ffff000010081d94>] __do_softirq+0x46c/0x628 softirqs last disabled at (3934649): [<ffff0000100b4a1c>] irq_exit+0x104/0x118 This is a common issue when using cdev embedded in a struct. Fortunately, we already have a mechanism to solve this kind of issue. Please see commit 233ed09 ("chardev: add helper function to register char devs with a struct device") for more detail. In this patch, we choose to embed the struct device into the i2c_dev, and use the API provided by the commit 233ed09 to make sure that the release of i2c_dev and cdev are in sequence. Signed-off-by: Kevin Hao <[email protected]> Signed-off-by: Wolfram Sang <[email protected]> Cc: Ben Hutchings <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

[ Upstream commit 1413ef6 ] The struct cdev is embedded in the struct i2c_dev. In the current code, we would free the i2c_dev struct directly in put_i2c_dev(), but the cdev is manged by a kobject, and the release of it is not predictable. So it is very possible that the i2c_dev is freed before the cdev is entirely released. We can easily get the following call trace with CONFIG_DEBUG_KOBJECT_RELEASE and CONFIG_DEBUG_OBJECTS_TIMERS enabled. ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x38 WARNING: CPU: 19 PID: 1 at lib/debugobjects.c:325 debug_print_object+0xb0/0xf0 Modules linked in: CPU: 19 PID: 1 Comm: swapper/0 Tainted: G W 5.2.20-yocto-standard+ #120 Hardware name: Marvell OcteonTX CN96XX board (DT) pstate: 80c00089 (Nzcv daIf +PAN +UAO) pc : debug_print_object+0xb0/0xf0 lr : debug_print_object+0xb0/0xf0 sp : ffff00001292f7d0 x29: ffff00001292f7d0 x28: ffff800b82151788 x27: 0000000000000001 x26: ffff800b892c0000 x25: ffff0000124a2558 x24: 0000000000000000 x23: ffff00001107a1d8 x22: ffff0000116b5088 x21: ffff800bdc6afca8 x20: ffff000012471ae8 x19: ffff00001168f2c8 x18: 0000000000000010 x17: 00000000fd6f304b x16: 00000000ee79de43 x15: ffff800bc0e80568 x14: 79616c6564203a74 x13: 6e6968207473696c x12: 5f72656d6974203a x11: ffff0000113f0018 x10: 0000000000000000 x9 : 000000000000001f x8 : 0000000000000000 x7 : ffff0000101294cc x6 : 0000000000000000 x5 : 0000000000000000 x4 : 0000000000000001 x3 : 00000000ffffffff x2 : 0000000000000000 x1 : 387fc15c8ec0f200 x0 : 0000000000000000 Call trace: debug_print_object+0xb0/0xf0 __debug_check_no_obj_freed+0x19c/0x228 debug_check_no_obj_freed+0x1c/0x28 kfree+0x250/0x440 put_i2c_dev+0x68/0x78 i2cdev_detach_adapter+0x60/0xc8 i2cdev_notifier_call+0x3c/0x70 notifier_call_chain+0x8c/0xe8 blocking_notifier_call_chain+0x64/0x88 device_del+0x74/0x380 device_unregister+0x54/0x78 i2c_del_adapter+0x278/0x2d0 unittest_i2c_bus_remove+0x3c/0x80 platform_drv_remove+0x30/0x50 
device_release_driver_internal+0xf4/0x1c0 driver_detach+0x58/0xa0 bus_remove_driver+0x84/0xd8 driver_unregister+0x34/0x60 platform_driver_unregister+0x20/0x30 of_unittest_overlay+0x8d4/0xbe0 of_unittest+0xae8/0xb3c do_one_initcall+0xac/0x450 do_initcall_level+0x208/0x224 kernel_init_freeable+0x2d8/0x36c kernel_init+0x18/0x108 ret_from_fork+0x10/0x1c irq event stamp: 3934661 hardirqs last enabled at (3934661): [<ffff00001009fa04>] debug_exception_exit+0x4c/0x58 hardirqs last disabled at (3934660): [<ffff00001009fb14>] debug_exception_enter+0xa4/0xe0 softirqs last enabled at (3934654): [<ffff000010081d94>] __do_softirq+0x46c/0x628 softirqs last disabled at (3934649): [<ffff0000100b4a1c>] irq_exit+0x104/0x118 This is a common issue when using cdev embedded in a struct. Fortunately, we already have a mechanism to solve this kind of issue. Please see commit 233ed09 ("chardev: add helper function to register char devs with a struct device") for more detail. In this patch, we choose to embed the struct device into the i2c_dev, and use the API provided by the commit 233ed09 to make sure that the release of i2c_dev and cdev are in sequence. Signed-off-by: Kevin Hao <[email protected]> Signed-off-by: Wolfram Sang <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 031af50 ]

The inline assembly for arm64's cmpxchg_double*() implementations use a +Q constraint to hazard against other accesses to the memory location being exchanged. However, the pointer passed to the constraint is a pointer to unsigned long, and thus the hazard only applies to the first 8 bytes of the location.

GCC can take advantage of this, assuming that other portions of the location are unchanged, leading to a number of potential problems.

This is similar to what we fixed back in commit:

  fee960b ("arm64: xchg: hazard against entire exchange variable")

... but we forgot to adjust cmpxchg_double*() similarly at the same time.

The same problem applies, as demonstrated with the following test:

| struct big {
|         u64 lo, hi;
| } __aligned(128);
|
| unsigned long foo(struct big *b)
| {
|         u64 hi_old, hi_new;
|
|         hi_old = b->hi;
|         cmpxchg_double_local(&b->lo, &b->hi, 0x12, 0x34, 0x56, 0x78);
|         hi_new = b->hi;
|
|         return hi_old ^ hi_new;
| }

... which GCC 12.1.0 compiles as:

| 0000000000000000 <foo>:
|    0: d503233f paciasp
|    4: aa0003e4 mov x4, x0
|    8: 1400000e b 40 <foo+0x40>
|    c: d2800240 mov x0, #0x12 // #18
|   10: d2800681 mov x1, #0x34 // #52
|   14: aa0003e5 mov x5, x0
|   18: aa0103e6 mov x6, x1
|   1c: d2800ac2 mov x2, #0x56 // #86
|   20: d2800f03 mov x3, #0x78 // #120
|   24: 48207c82 casp x0, x1, x2, x3, [x4]
|   28: ca050000 eor x0, x0, x5
|   2c: ca060021 eor x1, x1, x6
|   30: aa010000 orr x0, x0, x1
|   34: d2800000 mov x0, #0x0 // #0 <--- BANG
|   38: d50323bf autiasp
|   3c: d65f03c0 ret
|   40: d2800240 mov x0, #0x12 // #18
|   44: d2800681 mov x1, #0x34 // #52
|   48: d2800ac2 mov x2, #0x56 // #86
|   4c: d2800f03 mov x3, #0x78 // #120
|   50: f9800091 prfm pstl1strm, [x4]
|   54: c87f1885 ldxp x5, x6, [x4]
|   58: ca0000a5 eor x5, x5, x0
|   5c: ca0100c6 eor x6, x6, x1
|   60: aa0600a6 orr x6, x5, x6
|   64: b5000066 cbnz x6, 70 <foo+0x70>
|   68: c8250c82 stxp w5, x2, x3, [x4]
|   6c: 35ffff45 cbnz w5, 54 <foo+0x54>
|   70: d2800000 mov x0, #0x0 // #0 <--- BANG
|   74: d50323bf autiasp
|   78: d65f03c0 ret

Notice that at the lines with "BANG" comments, GCC has assumed that the higher 8 bytes are unchanged by the cmpxchg_double() call, and that `hi_old ^ hi_new` can be reduced to a constant zero, for both LSE and LL/SC versions of cmpxchg_double().

This patch fixes the issue by passing a pointer to __uint128_t into the +Q constraint, ensuring that the compiler hazards against the entire 16 bytes being modified.

With this change, GCC 12.1.0 compiles the above test as:

| 0000000000000000 <foo>:
|    0: f9400407 ldr x7, [x0, #8]
|    4: d503233f paciasp
|    8: aa0003e4 mov x4, x0
|    c: 1400000f b 48 <foo+0x48>
|   10: d2800240 mov x0, #0x12 // #18
|   14: d2800681 mov x1, #0x34 // #52
|   18: aa0003e5 mov x5, x0
|   1c: aa0103e6 mov x6, x1
|   20: d2800ac2 mov x2, #0x56 // #86
|   24: d2800f03 mov x3, #0x78 // #120
|   28: 48207c82 casp x0, x1, x2, x3, [x4]
|   2c: ca050000 eor x0, x0, x5
|   30: ca060021 eor x1, x1, x6
|   34: aa010000 orr x0, x0, x1
|   38: f9400480 ldr x0, [x4, #8]
|   3c: d50323bf autiasp
|   40: ca0000e0 eor x0, x7, x0
|   44: d65f03c0 ret
|   48: d2800240 mov x0, #0x12 // #18
|   4c: d2800681 mov x1, #0x34 // #52
|   50: d2800ac2 mov x2, #0x56 // #86
|   54: d2800f03 mov x3, #0x78 // #120
|   58: f9800091 prfm pstl1strm, [x4]
|   5c: c87f1885 ldxp x5, x6, [x4]
|   60: ca0000a5 eor x5, x5, x0
|   64: ca0100c6 eor x6, x6, x1
|   68: aa0600a6 orr x6, x5, x6
|   6c: b5000066 cbnz x6, 78 <foo+0x78>
|   70: c8250c82 stxp w5, x2, x3, [x4]
|   74: 35ffff45 cbnz w5, 5c <foo+0x5c>
|   78: f9400480 ldr x0, [x4, #8]
|   7c: d50323bf autiasp
|   80: ca0000e0 eor x0, x7, x0
|   84: d65f03c0 ret

... sampling the high 8 bytes before and after the cmpxchg, and performing an EOR, as we'd expect.

For backporting, I've tested this atop linux-4.9.y with GCC 5.5.0. Note that linux-4.9.y is oldest currently supported stable release, and mandates GCC 5.1+. Unfortunately I couldn't get a GCC 5.1 binary to run on my machines due to library incompatibilities.

I've also used a standalone test to check that we can use a __uint128_t pointer in a +Q constraint at least as far back as GCC 4.8.5 and LLVM 3.9.1.

Fixes: 5284e1b ("arm64: xchg: Implement cmpxchg_double")
Fixes: e9a4b79 ("arm64: cmpxchg_dbl: patch in lse instructions when supported by the CPU")
Reported-by: Boqun Feng <[email protected]>
Link: https://lore.kernel.org/lkml/Y6DEfQXymYVgL3oJ@boqun-archlinux/
Reported-by: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Mark Rutland <[email protected]>
Cc: [email protected]
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Steve Capper <[email protected]>
Cc: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>

scotte changed the title ~~Lockups with heavy xhci/SATA~~ Lockups with heavy xhci/SATA load Aug 11, 2015

scotte changed the title ~~Lockups with heavy xhci/SATA load~~ odroidxu4: Lockups with heavy xhci/SATA load Aug 11, 2015

qknight mentioned this issue Mar 24, 2016

issues on xu4 with branch: odroidxu4-v4.2 tobetter/linux#2

Open
