Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

100% cpu usage! #81

Closed
softdream1981 opened this issue Feb 20, 2019 · 12 comments
Closed

100% cpu usage! #81

softdream1981 opened this issue Feb 20, 2019 · 12 comments
Assignees
Labels

Comments

@softdream1981
Copy link

I check cpu usage with htop, and it always 100%. Any solution?

@mame82
Copy link
Collaborator

mame82 commented Mar 13, 2019

Can't reproduce this, do you have more details? What exact process is causing this load ?

@Brakhma
Copy link

Brakhma commented Apr 19, 2019

Confirm. Process: /usr/local/bin/P4wnP1_service. 100% load only if aloa connected to existing wi-fi network

@naivekun
Copy link

Me too. /usr/local/bin/P4wnP1_service 100% load when wlan0 is working on client mode.(connected to my wifi)

@whackyhack
Copy link

I get extremely high usage as soon as WiFi connects to my WiFi AP, often leading to over-current and shutdown. (The board becomes hot to the touch.) Mostly consumption comes from P4wnP1_service; briefly I saw a few others, like apt-related processes. So I tried to block Internet from my router. But the same thing happens.

In another community, I learned that an older driver for a commonly used WiFi chip was faulty, leading to this kind of behavior. (The driver was fixed in mainstream Debian although not in the custom distro we were using.) But there, the consuming processes are directly related to data transmission in the kernel; extreme CPU consumption occurs only during intensive transmission (e.g., apt update).

@mame82 mame82 added the BUG label Jun 3, 2019
@mame82 mame82 self-assigned this Jun 3, 2019
@mame82
Copy link
Collaborator

mame82 commented Jun 3, 2019

Going to investigate this, thx for reporting

@ca5ua1
Copy link

ca5ua1 commented Sep 18, 2019

Did it fixed? (I have same issue)

@softdream1981
Copy link
Author

Same here.

@ca5ua1
Copy link

ca5ua1 commented Dec 17, 2019

Some tools as wifite or same can't work properly and just kills RPi

@ghost
Copy link

ghost commented Dec 21, 2019

Same here - Nano Pi - with an without WIFI connection - 100% CPU with P4wnP1_service. Using the Nano Pi is close to impossible - very slow SSH... due to the high load...

@mame82
Copy link
Collaborator

mame82 commented Feb 5, 2020

Issue could be reproduced:

  • WiFi in station mode (not failover)
  • valid WiFi credentials (P4wnP1 connects to existing WiFi and receives a valid DHCP lease)

In result P4wnP1_service goes up to full CPU load

The root cause has to be investigated using remote debugging for the P4wnP1_service (over USB, to avoid WiFi interference). This is likely caused by a non blocking loop for sub-process monitoring (wpa_supplicant, dhcpcd, hostapd ...) and maybe could be found using pure code audit.

@mame82
Copy link
Collaborator

mame82 commented Feb 5, 2020

The root cause of the problem was found.

Here are some notes on how it was approached, which are meant to be read by the new maintainer of this repo.

Cause of the error is here:
https://github.com/mame82/P4wnP1_aloa/blob/74a373f895ac0f6457a102e5dc638ce3c9358762/service/wifi.go#L505

func (m *hostapdOutMonitor) Write(p []byte) (n int, err error) {
	// if result already received, the write could exit (early out)
	if m.resultReceived.IsSet() {
		return n, nil
	}

	// check if buffer contains relevant strings (assume write is called line wise by the hosted process
	// otherwise we'd need to utilize an io.Reader
	line := string(p)

	switch {
	case strings.Contains(line, "AP-DISABLED"):
		log.Printf("Starting Access Point failed\n")
		m.Lock()
		defer m.Unlock()
		m.result = false
		m.resultReceived.Set()
	case strings.Contains(line, "AP-ENABLED"):
		log.Printf("Access point is up\n")
		m.Lock()
		defer m.Unlock()
		m.result = true
		m.resultReceived.Set()
	}
	return len(p), nil
}

The code above is a function attached to the structure hostapdOutMonitor.

excursus 1: on object oriented programming in Go:

In Go this "attached function" could be compared to a member function of a class-based object, where the structure hostapdOutMonitor is the class holding the respective attributes, but member functions are declared externally by assigning them to a struct (in this case by adding (m *hostapdOutMonitor) between the func keyword and the function name Write). Prepending the joker sign * in (m *hostapdOutMonitor) means the struct is passed into the function call by reference, whereas a function declaration like func (m hostapdOutMonitor) Write(p []byte) (n int, err error) with the * omitted would pass in the struct by value. In both cases, the function could read the "instance data" of the sruct, which is accessible using the assigned name m. But only when data is passed by reference it could be modified. If a function or structure starts with a lowercase letter this could be compared to a private declaration, while an uppercase letter could be compared to a public declaration.

To sum up: The code snippet above shows the public member-function Write of a private class hostapdOutMonitor. The instance object is not accessed using self or this, but using a name defined during function declaration, which is m in this case.

excursus 2: Interfaces in go

In Go, interfaces could be declared, but a structure (rember, this could roughly be compared to a class) implementing an interface isn't marked with special syntax (like f.e. implements in Java).
Instead a structure just implements all methods which are expected by an interface, and it could automatically be used by other code assuming the struct implements the respective interface.

In this case, the method Write of the struct hostapdOutMonitor does exactly this. It is a function, which fulfills the implementation requirements of the Writer interface, which belongs to Go's io core package.

This interface is pretty common in Go's standard packages and could be compared to a C-style write looking like this:

ssize_t write(void * buf, size_t nbytes);

The C-function would accept a binary buffer buf and a length parameter nbytes. After execution either the count of the bytes written is returned or a negative count in case of an error is returned (thats why the return type is signed).

The Go function is very similar, it accepts a variable length byte array (slice) named p as parameter. A parameter representing the length is not required, as it could be retrieved at runtime using len(p). For an error case, there is a dedicated return value named err of type error (Go supports multiple return values, it is common to use a dedicated err error value to indicate errors). The other return value is n of type int which again represents the bytes written (in case of a successful write the returned error value is nil).

the bug

As highlighted above, the Write function of the Writer interface is widely used in Go. The purpose of the Write implementation of hostapdOutMonitor is to serve as write target for stdout of a running hostapd process and to set a signal once this process either disabled or enabled the Access Point mode of the respective WiFi interface (indicated by "AP-DISABLED" / "AP-ENABLED" on stdout). Once this job is done, the Write method could still be called, if the process provides more data on stdout (which of course is the case). Now the issue here, is that I tried to implement an early out, to safe processing power for successive process output, if it is already clear that the AP has been enabled or disabled. This happens here:

	// if result already received, the write could exit (early out)
	if m.resultReceived.IsSet() {
		return n, nil
	}

The code doesn't check further conditions of the process' output, once it knows that the AP is disabled or enabled (indicated by resultReceived signal). The issue is that it returns n,nil, while n never got a value assigned. This again means, that n has a default value n=0 at this point. As the error value is returned with nil, this means there was no function error, but 0 bytes have been written, no matter what was handed in as parameter p.

As mentioned earlier, this interface is common and in this case it is used to process low-level functionality (writes to stdout of a child process). Returning a successful write with a length 0 to the caller, means the caller still has output data to write in its queue - no matter how often the function is called, the caller never manages to write more than 0 bytes.
In practice this ends up in a kind of endless loop with full CPU load, as the caller tries to write pending data from stdout again and again.

The fix is pretty simple, the return value of the "early out" has to be changed from return n, nil to return len(p), nil. With this modification the function always returns a full length write of the input parameter p.

Note: func (m *wpaSupplicantOutMonitor) Write(p []byte) (n int, err error) suffers from the same issue.

@mame82
Copy link
Collaborator

mame82 commented Feb 5, 2020

How this bug has been debugged

The recommended solution for remote debugging of Go code is to use Delve. Unfortunately Delve only support 64bit architectures at time of this writing, which doesn't get us very far on a Raspberry Pi 0.

So I opted to use gdb locally (details).

Another issue with debugging of P4wnP1 ALOA's systemd service binary P4wnP1_service is that stopping or restarting the binary re-intializes all network connections and the USB stack. A trick is to kill the running service binary, which keeps parts of the configurations up (f.e. the configured USB gadget aren't destroyed in this case), but could leave some child processes running, which aren't intended to run when the service starts again (hostapd, wpa_supplicant etc).

So the idea here was to build a version of P4wnP1_service from current code on an external machine (cross compiling luckily is very easy with Go) push it to a folder on the Pi 0 already running a working P4wnP1 A.L.O.A. image. After this we could SSH in via USB ethernet (using SSH over WiFi makes less sense, while testing the code to spawn APs / join exiting WiFis), kill the running P4wnP1_service (while hopefully the USB connection stays alive) and start the freshly compiled version of P4wnP1_service again (which likely rebuilds the USB gadget and thus could kill the running USB ethernet connection, but usually it comes up fast enough again that it occurs as kind of connection delay to my SSH client). The newly started P4wnP1 _service could be directly attached to gdb.

Step 1 - cross compile P4wnP1_service from github repo source and copy it to P4wnP1 using USB ethernet

I assume that P4wnP1 is accessible on 172.16.0.1 which is the default IPv4 address for USB over ethernet.

# move into P4wnP1 repor main directory

cd P4wnP1_aloa

# cross-compile P4wnP1_service for ARMv6 Linux and save the binary to the ./build folder
# the gcflags disable inlining of function invocations and registerization of variables (see https://golang.org/doc/gdb)

env GOOS=linux GOARCH=arm GOARM=6 go build -gcflags "all=-N -l" -o build/P4wnP1_service cmd/P4wnP1_service/P4wnP1_service.go

# copy the result to /root/P4wnP1/build on the Raspberry Pi 0
scp build/P4wnP1_service 172.16.0.1:~/P4wnP1/build

Step 2 - SSH into P4wnP1, kill the old service and start a gdb session with the new one

After bringing up a SSH connection to P4wnP1 via USB with ssh 172.16.0.1 use the following command:

# kill currently runnning service and hope for USB ethernet to stay alive for SSH access
root@kali~# killall P4wnP1_service

# attach new binary to gdb
root@kali~# cd ~/P4wnP1/build

root@kali:~/P4wnP1/build# gdb ./P4wnP1_service

# from within gdb
GNU gdb (Debian 8.3.1-1) 8.3.1
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "arm-linux-gnueabi".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./P4wnP1_service...
warning: Missing auto-load script at offset 0 in section .debug_gdb_scripts
of file /root/P4wnP1/build/P4wnP1_service.
Use `info auto-load python-scripts [REGEXP]' to list them.
(gdb) source /usr/share/go/src/runtime/runtime-gdb
runtime-gdb.py       runtime-gdb_test.go  
(gdb) source /usr/share/go/src/runtime/runtime-gdb.py 
Loading Go Runtime support.

..snip..
## ^^^ note how Go runtime support is manually sourced in
..snip..

(gdb) run
Starting program: /root/P4wnP1/build/P4wnP1_service 
[New LWP 4928]
[New LWP 4929]
P4wnP1 A.L.O.A. v0.1.0-alpha2
[New LWP 4930]
2020/02/05 21:28:47 
2020/02/05 21:28:47 ... killing dnsmasq
2020/02/05 21:28:47 ... dnsmasq for interface 'usbeth' isn't running, no need to stop it
2020/02/05 21:28:47 Stopping DHCP client for interface 'usbeth'...
[Detaching after fork from child process 4931]
2020/02/05 21:28:47 Setting Interface usbeth to IP 172.16.0.1
[New LWP 4932]
2020/02/05 21:28:47

..snip..
## we haven't defined any breakpoints, so after issuing run, the service runs embedded into our gdb session
..snip..

step 3 - Trigger the bug using the web client

The bug occurs once P4wnP1 successfully connects to an existing WiFi AP in STA (or failover) mode. We configure P4wnP1 to do so, using the webclient which should be reachable on http://172.26.0.1:8000 while our USB ethernet connection is up (SSH session with gdb is still open).

Once this is done, an addtional SSH session could be used to inspect CPU load f.e. with a tool like top. If everything was going "right" we should see P4wnP1_service producing close to 100 percent CPU load.

step 4 - Examining the root cause

Once CPU load is at 100 percent, we could head bake to the gdb session and send a soft interrupt to the attached P4wnP1_service process by issuing CTRL + C into the gdb terminal.

..snip (some P4wnP1_service output)..
2020/02/05 21:30:27 Setting Interface wlan0 to MULTICAST to ON
2020/02/05 21:30:27 Starting DHCP client for interface 'wlan0'...
[Detaching after fork from child process 5062]
Dhcpcd output for wlan0:
forked to background, child pid 5063
[Detaching after fork from child process 5066]
2020/02/05 21:30:27 ... DHCP client for interface 'wlan0' started
... stored settings deployed
[Detaching after fork from child process 5084]
Lease monitor usbeth LEASE: &{false usbeth 172.16.0.2 42:63:66:12:34:56 who-knows}
^C
Thread 1 "P4wnP1_service" received signal SIGINT, Interrupt.
runtime.futex () at /usr/lib/go-1.12/src/runtime/sys_linux_arm.s:334
334	/usr/lib/go-1.12/src/runtime/sys_linux_arm.s: No such file or directory.

## here follows the 'info goroutines' command

(gdb) info goroutines
  1 waiting  runtime.gopark
  2 waiting  runtime.gopark
  3 waiting  runtime.gopark
..snip..
  65 waiting  runtime.gopark
  66 waiting  runtime.gopark
  76 waiting  runtime.gopark
  83 waiting  runtime.gopark
  82 waiting  runtime.gopark
* 91 running  github.com/mame82/P4wnP1_aloa/service.(*hostapdOutMonitor).Write
  98 waiting  runtime.gopark
  148 waiting  runtime.gopark
  151 waiting  runtime.gopark
  146 waiting  runtime.gopark
  147 waiting  runtime.gopark
  243 waiting  runtime.gopark


After the SIGINT was send to the process, we are back in gdb's interface. The info goroutines command is available thanks to the imported Go runtime support. A Go-routine could be compared to a very lightweight thread (roughly, as shown above it isn't much of an issue to run very very very many Go routines at the same time). What ultimately was jumping into my eye here, is this pending call in a go routine:

* 91 running  github.com/mame82/P4wnP1_aloa/service.(*hostapdOutMonitor).Write

This was a "lucky hit" as we did no interactive debugging at all, but it occurs almost everytime if the process is interrupted like this. This alone is indication of this being cause for the high CPU load (under normaly circumstances the `Write call should happen in-frequently and last only for a very short time - which means chances to spot this call here should be very low). Additionally, there should be no processing of hostapd stdout at all, as P4wnP1 is running in STA mode which means the hostapd process gets terminated in this case.

From this point on, the search for the root cause was done in code again, where github.com/mame82/P4wnP1_aloa/service.(*hostapdOutMonitor).Writereads like this:

service indicates the package name (which consists of multiple Go source files in this case)

(*hostapdOutMonitor).Write directly points us to the code described above, namely the Write method of the hostapdOutMonitorstruct, which resides in the file service/wifi.go

nice to know

Before fixing the bug, it was confirmed by adding some test output to the "early out" part of the (*hostapdOutMonitor).Write function (including a typo):

func (m *hostapdOutMonitor) Write(p []byte) (n int, err error) {
	// if result already received, the write could exit (early out)
	if m.resultReceived.IsSet() {
		log.Printf("hostapdOutMonitor afte resultReceived: %s", string(p))
		return len(p), nil
	}

In result the following was printed to stdout by the P4wnP1_service binary, whenever WiFi was set to STATION mode and the hostapd process was terminated:

2020/02/05 21:58:52 hostapdOutMonitor afte resultReceived: hostapd: 21:58:51 wlan0: CTRL-EVENT-TERMINATING 
2020/02/05 21:58:52 hostapdOutMonitor afte resultReceived: hostapd: 21:58:51 wlan0: CTRL-EVENT-TERMINATING 
2020/02/05 21:58:52 hostapdOutMonitor afte resultReceived: hostapd: 21:58:51 wlan0: CTRL-EVENT-TERMINATING 
2020/02/05 21:58:52 hostapdOutMonitor afte resultReceived: hostapd: 21:58:51 wlan0: CTRL-EVENT-TERMINATING 
2020/02/05 21:58:52 hostapdOutMonitor afte resultReceived: hostapd: 21:58:51 wlan0: CTRL-EVENT-TERMINATING 
2020/02/05 21:58:52 hostapdOutMonitor afte resultReceived: hostapd: 21:58:51 wlan0: CTRL-EVENT-TERMINATING 
2020/02/05 21:58:52 hostapdOutMonitor afte resultReceived: hostapd: 21:58:51 wlan0: CTRL-EVENT-TERMINATING 
2020/02/05 21:58:52 hostapdOutMonitor afte resultReceived: hostapd: 21:58:51 wlan0: CTRL-EVENT-TERMINATING 
2020/02/05 21:58:52 hostapdOutMonitor afte resultReceived: hostapd: 21:58:51 wlan0: CTRL-EVENT-TERMINATING 

I assume this is caused by an output queue in Go's low level implementation for sub processes, which tries to flush its output endlessly (but can't because the Write() always returns 0 bytes written) or maybe by some of my own code. In the end it doesn't matter as the simple fix described above brings CPU load back below 2 percent and the debug output CTRL-EVENT-TERMINATING disappeared immediately.

mame82 added a commit that referenced this issue Feb 5, 2020
@mame82 mame82 closed this as completed Jun 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants