Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPI overflow in CAN bus HAT communication - IRQ handler mcp251xfd_handle_tefif() #6644

Open
filogold opened this issue Feb 2, 2025 · 5 comments

Comments

@filogold
Copy link

filogold commented Feb 2, 2025

Describe the bug

Hi everyone,

I'm encountering an issue with the IRQ handler. When I simulate multiple periodic messages over the CAN bus, at a certain point, I suspect that SPI communication fails to properly manage data exchange with the two HAT modules(https://www.waveshare.com/wiki/2-CH_CAN_FD_HAT) I am using.

I updated the kernel to the latest version (see below), which seemed to improve the situation slightly. However, it only mitigated the buffer issue by delaying the occurrence of the fault rather than fully resolving it.

My setup consists of a Raspberry Pi 5 with two CAN HAT 2CH FD modules, each using the MCP251XFD chip.

To ensure the problem is related to the transmitter and not the receiver, I tested two different scenarios:

  1. Using two channels of the Raspberry Pi to transmit and the remaining two channels to receive (all configured with the same parameters: bitrate, data rate, and sampling time).

  2. Using two channels of the Raspberry Pi to transmit and two channels of a Vector VN1640 (with the 1057Gcap installed) to receive. Also in this case, I carefully verified the CAN configuration.

In both cases the termination are correctly verified, providing 60 ohms in the buses.

I am aware that I am stressing the module, as I am sending multiple periodic messages (128 periodic messages per CAN FD, generating a 40% bus load). The messages require some processing to calculate the internal CRC and MC in the payload, which is correctly handled by the Python code.

Based on the information provided, the issue appears to be related to overrun packets. I suspect this is the cause of the failure.

ifconfig:

can0: flags=193<UP,RUNNING,NOARP> mtu 72
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 65536
(UNSPEC)
RX packets 6455202 bytes 68902927 (65.7 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 6455202 bytes 68902927 (65.7 MiB)
TX errors 0 dropped 0 overruns 8471 carrier 0 collisions 0
device interrupt 186
can1: flags=193<UP,RUNNING,NOARP> mtu 72
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 65536
(UNSPEC)
RX packets 13736158 bytes 146632299 (139.8 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 13736158 bytes 146632299 (139.8 MiB)
TX errors 0 dropped 0 overruns 11052 carrier 0 collisions 0
device interrupt 187
can2: flags=193<UP,RUNNING,NOARP> mtu 72
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 65536
(UNSPEC)
RX packets 13736162 bytes 146632347 (139.8 MiB)
RX errors 0 dropped 0 overruns 3 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 188
can3: flags=193<UP,RUNNING,NOARP> mtu 72
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 65536
(UNSPEC)
RX packets 6455206 bytes 68902975 (65.7 MiB)
RX errors 0 dropped 0 overruns 1 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 189

The information I provided above is essentially the same as in the second case, where I use the Vector hardware as the receiver. One observation I have made is that after a bus failure, the overruns seem to stop or decrease. However, after a certain period (several hours), at some point, the second bus also fails.

Thanks in advance

Steps to reproduce the behaviour

I can't find a piece of code that fully replicates the bug. I wrote a program that sends periodic messages, which generates some overruns, but not as many as my main code.

I use Bluetooth serial communication to share information about the message I want to simulate, including the ID, initial payload, and CAN bus settings. This communication is not constant.

I could try running the code without using Bluetooth, but I still need to work on it.

How I Initialize the bus:

           self.canDB[canName] = SocketcanBus(
               channel=canName,
               fd=True if isFD == 'True' else False,
               receive_own_messages=False,  
               local_loopback=False, 
               can_filters=None,
               ignore_rx_error_frames=False
           )

To run a periodic message, I create a thread using the library function:

send_periodic(self.messageInfo[_bus][_id]['MSG_CLASS'], self.messageInfo[_bus][_id]['PERIOD'], modifier_callback=self.messageInfo[_bus][_id]['CALLBACK'], on_error=self.messageInfo[_bus][_id]['ERROR'], autostart=False)

The message object is defined as follow:

can.Message(arbitration_id=_id, data=payload, dlc=_dlc, is_extended_id=_id_extended, is_fd=fdParam, bitrate_switch=fdParam)

Device (s)

Raspberry Pi 5

System

uname -a:

Linux raspberrypi 6.6.74+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.74-1+rpt1
(2025-01-27) aarch64 GNU/Linux

modinfo can:

filename: /lib/modules/6.6.74+rpt-rpi-2712/kernel/net/can/can.ko.xz
alias: net-pf-29
author: Urs Thuermann <[email protected]>, Oliver Hartkopp
<[email protected]>
license: Dual BSD/GPL
description: Controller Area Network PF_CAN core
srcversion: 37BF79470A254916FD8C595
depends:
intree: Y
name: can
vermagic: 6.6.74+rpt-rpi-2712 SMP preempt mod_unload modversions aarch64
parm: stats_timer:enable timer for statistics (default:on) (int)

modinfo mcp251xfd:

filename:
/lib/modules/6.6.74+rpt-rpi-2712/kernel/drivers/net/can/spi/mcp251xfd/mcp251xfd.ko.xz
license: GPL v2
description: Microchip MCP251xFD Family CAN controller driver
author: Marc Kleine-Budde <[email protected]>
srcversion: 672D5B0159EB5AA465D4A79
alias: spi:mcp251xfd
alias: spi:mcp251863
alias: spi:mcp2518fd
alias: spi:mcp2517fd
alias: of:N*T*Cmicrochip,mcp251xfdC*
alias: of:N*T*Cmicrochip,mcp251xfd
alias: of:N*T*Cmicrochip,mcp251863C*
alias: of:N*T*Cmicrochip,mcp251863
alias: of:N*T*Cmicrochip,mcp2518fdC*
alias: of:N*T*Cmicrochip,mcp2518fd
alias: of:N*T*Cmicrochip,mcp2517fdC*
alias: of:N*T*Cmicrochip,mcp2517fd
depends: can-dev
intree: Y
name: mcp251xfd
vermagic: 6.6.74+rpt-rpi-2712 SMP preempt mod_unload modversions aarch64

Logs

dmesg | grep can:

[ 3.165808] mcp251xfd spi0.1 can0: MCP2518FD rev0.0 (-RX_INT -PLL
-MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD o:40.00MHz c:40.00MHz
m:20.00MHz rs:17.00MHz es:16.66MHz rf:17.00MHz ef:16.66MHz) successfully initialized.
[ 3.170311] mcp251xfd spi0.0 can1: MCP2518FD rev0.0 (-RX_INT -PLL -MAB_NO_WARN
+CRC_REG +CRC_RX +CRC_TX +ECC -HD o:40.00MHz c:40.00MHz m:20.00MHz
rs:17.00MHz es:16.66MHz rf:17.00MHz ef:16.66MHz) successfully initialized.
[ 3.179808] mcp251xfd spi1.1 can2: MCP2518FD rev0.0 (-RX_INT -PLL
-MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD o:40.00MHz c:40.00MHz
m:20.00MHz rs:17.00MHz es:16.66MHz rf:17.00MHz ef:16.66MHz) successfully initialized.
[ 3.186278] mcp251xfd spi1.0 can3: MCP2518FD rev0.0 (-RX_INT -PLL
-MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD o:40.00MHz c:40.00MHz
m:20.00MHz rs:17.00MHz es:16.66MHz rf:17.00MHz ef:16.66MHz) successfully initialized.
[ 112.788097] can: controller area network core
[ 112.795770] can: raw protocol
[ 2339.305706] mcp251xfd spi0.1 can0: IRQ handler mcp251xfd_handle_tefif() returned -22.
[ 2339.305714] mcp251xfd spi0.1 can0: IRQ handler returned -22 (intf=0x3f1a0014).
[ 4809.135009] mcp251xfd spi0.0 can1: IRQ handler mcp251xfd_handle_tefif() returned -22.
[ 4809.135017] mcp251xfd spi0.0 can1: IRQ handler returned -22 (intf=0x3f1a0014).

Additional context

@marckleinebudde

Thanks in advance

@filogold
Copy link
Author

filogold commented Feb 3, 2025

Attached is a code snippet that seems to replicate the bug (in my setup). The longer it runs, the more overflows occur, although I'm not 100% sure, as it's difficult to gather statistics.

import can
import time
import random
from can.interfaces.socketcan import SocketcanBus
import os

period_choices = [0.01, 0.05, 0.1, 0.5, 1]

class CANHandler:
    def __init__(self, interface1='can0', interface2='can3'):
        self.interface1 = interface1
        self.interface2 = interface2
        self.messageInfo = {self.interface1: {}, self.interface2: {}}

    def setup_can_interface(self, interface):
        os.system(f"sudo ip link set down {interface}")
        time.sleep(0.5)
        os.system(f"sudo ip link set {interface} up type can bitrate 500000 sample-point 0.75 dbitrate 2000000 dsample-point 0.75 restart-ms 1000 berr-reporting on fd on")
        time.sleep(0.5)
        os.system(f"sudo ip link set {interface} txqueuelen 65536")
        time.sleep(0.5)

    def send_can_messages(self):
        self.setup_can_interface(self.interface1)
        self.setup_can_interface(self.interface2)

        self.bus0 = can.interfaces.socketcan.SocketcanBus(channel=self.interface1, fd=True)
        self.bus1 = can.interfaces.socketcan.SocketcanBus(channel=self.interface2, fd=True)
        
        for i in range(1, 120):
            _dlc = random.randint(8, 32)
            _id = i
            data = [random.randint(0, 255) for _ in range(_dlc)]
            period = random.choice(period_choices)
            msg_class = can.Message(arbitration_id=_id, data=bytearray(data), is_fd=True, dlc=_dlc, is_extended_id=False, bitrate_switch=True)
            
            self.messageInfo[self.interface1][i] = {
                'MSG_CLASS': msg_class,
                'PERIOD': period,
            }
            
            self.messageInfo[self.interface2][i] = {
                'MSG_CLASS': msg_class,
                'PERIOD': period,
            }
            
            self.messageInfo[self.interface1][i]['THREAD'] = self.bus0.send_periodic(
                self.messageInfo[self.interface1][i]['MSG_CLASS'],
                self.messageInfo[self.interface1][i]['PERIOD'],
                autostart=True
            )
            time.sleep(0.01)
            self.messageInfo[self.interface2][i]['THREAD'] = self.bus1.send_periodic(
                self.messageInfo[self.interface2][i]['MSG_CLASS'],
                self.messageInfo[self.interface2][i]['PERIOD'],
                autostart=True
            )
        
        time.sleep(1000) 
        
        self.bus0.shutdown()
        self.bus1.shutdown()

if __name__ == "__main__":
    can_handler = CANHandler()
    can_handler.send_can_messages()

@marckleinebudde
Copy link
Contributor

Can you please copy https://github.com/linux-can/can-utils/tree/master/mcp251xfd/99-devcoredump.rules to /etc/udev/rules.d and https://github.com/linux-can/can-utils/blob/master/mcp251xfd/devcoredump to /usr/sbin

If the IRQ handler fails with an error, the driver will generate a dump of the driver and chip state and write it to /var/log/devcoredump-*.dump. Please send me this file.

@Vinz1911
Copy link

Vinz1911 commented Feb 3, 2025

i also discovered this problem. I'm on the latest available kernel
Linux robomaster 6.13.0-v8-16k+ #1 SMP PREEMPT_RT Sat Feb 1 22:01:29 CET 2025 aarch64 GNU/Linux

it dont happens every time, it randomly occurs. also the driver feels a bit unstable currently. in the most of the time I randomly face some "timeouts" where im unable to send or read. but not every time I get IRQ handler fails.

@filogold
Copy link
Author

filogold commented Feb 3, 2025

Can you please copy https://github.com/linux-can/can-utils/tree/master/mcp251xfd/99-devcoredump.rules to /etc/udev/rules.d and https://github.com/linux-can/can-utils/blob/master/mcp251xfd/devcoredump to /usr/sbin

If the IRQ handler fails with an error, the driver will generate a dump of the driver and chip state and write it to /var/log/devcoredump-*.dump. Please send me this file.

Did it without success, no file has been generated, but the IRQ handler message is present in dmesg | grep can

filogold@raspberrypi:/etc/udev/rules.d $ ls

99-com.rules  99-devcoredump.rules

filogold@raspberrypi:/usr/sbin $ ls

aa-remove-unknown      grpck                       pppoe-discovery
aa-status              grpconv                     pppstats
aa-teardown            grpunconv                   pwck
accessdb               halt                        pwconv
addgnupghome           hwclock                     pwhistory_helper
addgroup               i2cdetect                   pwunconv
add-shell              i2cdump                     rarp
adduser                i2cget                      readprofile
agetty                 i2cset                      reboot
alsabat-test           i2c-stub-from-dump          remove-default-ispell
alsactl                i2ctransfer                 remove-default-wordlist
alsa-info              iconvconfig                 remove-shell
apparmor_parser        ifconfig                    request-key
apparmor_status        iio-sensor-proxy            resize2fs
applygnupgdefaults     init                        resolvconf
arp                    insmod                      rfkill
arpd                   installkernel               rmmod
aspell-autobuildhash   install-sgmlcatalog         rmt
avahi-daemon           invoke-rc.d                 rmt-tar
badblocks              ip                          route
blkdeactivate          ipmaddr                     rpcbind
blkdiscard             ippevepcl                   rpcctl
blkid                  ippeveprinter               rpcdebug
blkmapd                ippeveps                    rpc.gssd
blkzone                ipp-usb                     rpc.idmapd
blockdev               iptunnel                    rpcinfo
bluetoothd             isosize                     rpc.statd
bridge                 ispell-autobuildhash        rpc.svcgssd
capsh                  iw                          rtacct
cfdisk                 iwconfig                    rtcwake
cgdisk                 iwevent                     rtkitctl
chat                   iwgetid                     rtmon
chcpu                  iwlist                      runlevel
chgpasswd              iwpriv                      runuser
chmem                  iwspy                       saned
chpasswd               kbdrate                     select-default-ispell
chroot                 key.dns_resolver            select-default-wordlist
cifs.idmap             killall5                    service
cifs.upcall            ldattach                    setcap
cpgr                   ldconfig                    setvesablank
cppw                   lightdm                     sfdisk
cron                   lightdm-gtk-greeter         sgdisk
ctrlaltdel             locale-gen                  shadowconfig
cupsaccept             logrotate                   showmount
cups-browsed           logsave                     shutdown
cupsctl                losetup                     slattach
cupsd                  lpadmin                     sm-notify
cupsdisable            lpinfo                      sshd
cupsenable             lpmove                      start-statd
cupsfilter             lsmod                       start-stop-daemon
cupsreject             make-ssl-cert               sudo_logsrvd
dcb                    mii-tool                    sudo_sendlog
debugfs                mkdosfs                     sulogin
delgroup               mke2fs                      swaplabel
deluser                mkfs                        swapoff
depmod                 mkfs.bfs                    swapon
devcoredump            mkfs.cramfs                 switch_root
devlink                mkfs.exfat                  sysctl
dhclient               mkfs.ext2                   tarcat
dhclient-script        mkfs.ext3                   tc
dmidecode              mkfs.ext4                   telinit
dmsetup                mkfs.fat                    th-cmd
dmstats                mkfs.minix                  thd
dnsmasq                mkfs.msdos                  tipc
dosfsck                mkfs.ntfs                   tune2fs
dosfslabel             mkfs.vfat                   tune.exfat
dphys-swapfile         mkhomedir_helper            umount.nfs
dpkg-preconfigure      mkinitramfs                 umount.nfs4
dpkg-reconfigure       mklost+found                umount.udisks2
dumpe2fs               mkntfs                      unix_chkpwd
dump.exfat             mkswap                      unix_update
e2freefrag             ModemManager                update-ca-certificates
e2fsck                 modinfo                     update-catalog
e2image                modprobe                    update-default-aspell
e2label                mount.cifs                  update-default-ispell
e2mmpstatus            mount.fuse                  update-default-wordlist
e2scrub                mount.fuse3                 update-dictcommon-aspell
e2scrub_all            mount.lowntfs-3g            update-dictcommon-hunspell
e2undo                 mount.nfs                   update-fonts-alias
e4crypt                mount.nfs4                  update-fonts-dir
e4defrag               mount.ntfs                  update-fonts-scale
ethtool                mount.ntfs-3g               update-gsfontmap
exfat2img              mount.smb3                  update-icon-caches
exfatlabel             mountstats                  update-inetd
faillock               nameif                      update-initramfs
fake-hwclock           NetworkManager              update-java-alternatives
fatlabel               newusers                    update-locale
fdisk                  nfsconf                     update-mime
filefrag               nfsidmap                    update-passwd
findfs                 nfsiostat                   update-pciids
fixparts               nfsstat                     update-rc.d
flashrom               nft                         update-shells
fsck                   nologin                     update-xmlcatalog
fsck.cramfs            ntfsclone                   usb_modeswitch
fsck.exfat             ntfscp                      usb_modeswitch_dispatcher
fsck.ext2              ntfslabel                   useradd
fsck.ext3              ntfsresize                  userdel
fsck.ext4              ntfsundelete                usermod
fsck.fat               pam-auth-update             v4l2-dbg
fsck.minix             pam_getenv                  validlocale
fsck.msdos             pam_namespace_helper        vcstime
fsck.vfat              pam_timestamp_check         vdpa
fsfreeze               paperconfig                 vigr
fstab-decode           parted                      vipw
fstrim                 partprobe                   visudo
gdisk                  pi-greeter                  wayvnc-control.py
genl                   pivot_root                  wayvnc-generate-keys.sh
getcap                 pkla-admin-identities       wayvnc-run.sh
getpcaps               pkla-check-authorization    wipefs
getty                  plipconfig                  wpa_action
gnome-menus-blacklist  plymouthd                   wpa_cli
groupadd               plymouth-set-default-theme  wpa_supplicant
groupdel               poweroff                    zic
groupmems              pppd                        zramctl
groupmod               pppdump

filogold@raspberrypi:/usr/sbin $ ifconfig

can1: flags=193<UP,RUNNING,NOARP>  mtu 72
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  txqueuelen 65536  (UNSPEC)
        RX packets 6682736  bytes 79431951 (75.7 MiB)
        RX errors 42  dropped 0  overruns 0  frame 0
        TX packets 6682778  bytes 79432515 (75.7 MiB)
        TX errors 0  dropped 0 overruns 96  carrier 0  collisions 0
        device interrupt 187  

can2: flags=193<UP,RUNNING,NOARP>  mtu 72
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  txqueuelen 65536  (UNSPEC)
        RX packets 6906321  bytes 81185955 (77.4 MiB)
        RX errors 6  dropped 0  overruns 0  frame 0
        TX packets 6906327  bytes 81186051 (77.4 MiB)
        TX errors 0  dropped 0 overruns 9529  carrier 0  collisions 0
        device interrupt 188  

can3: flags=193<UP,RUNNING,NOARP>  mtu 72
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  txqueuelen 65536  (UNSPEC)
        RX packets 6733854  bytes 79508518 (75.8 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 6733854  bytes 79508518 (75.8 MiB)
        TX errors 0  dropped 0 overruns 9718  carrier 0  collisions 0
        device interrupt 189  

filogold@raspberrypi:~ $ dmesg | grep can

[    2.765163] mcp251xfd spi0.1 can0: MCP2518FD rev0.0 (-RX_INT -PLL -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD o:40.00MHz c:40.00MHz m:20.00MHz rs:17.00MHz es:16.66MHz rf:17.00MHz ef:16.66MHz) successfully initialized.
[    2.771719] mcp251xfd spi0.0 can1: MCP2518FD rev0.0 (-RX_INT -PLL -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD o:40.00MHz c:40.00MHz m:20.00MHz rs:17.00MHz es:16.66MHz rf:17.00MHz ef:16.66MHz) successfully initialized.
[    2.778191] mcp251xfd spi1.1 can2: MCP2518FD rev0.0 (-RX_INT -PLL -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD o:40.00MHz c:40.00MHz m:20.00MHz rs:17.00MHz es:16.66MHz rf:17.00MHz ef:16.66MHz) successfully initialized.
[    2.783010] mcp251xfd spi1.0 can3: MCP2518FD rev0.0 (-RX_INT -PLL -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD o:40.00MHz c:40.00MHz m:20.00MHz rs:17.00MHz es:16.66MHz rf:17.00MHz ef:16.66MHz) successfully initialized.
[  138.971928] can: controller area network core
[  138.979074] can: raw protocol
[ 2299.725953] mcp251xfd spi1.0 can3: IRQ handler mcp251xfd_handle_tefif() returned -22.
[ 2299.725964] mcp251xfd spi1.0 can3: IRQ handler returned -22 (intf=0xbf1a0014).

@ragazenta
Copy link
Contributor

Did it without success, no file has been generated, but the IRQ handler message is present in dmesg | grep can

Did you make devcoredump executable?

chmod +x /usr/sbin/devcoredump

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants