Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ESP32S3] v4.4 WIFI losing connectivity temporarily or permanently without apparent reason (IDFGH-12977) #13212

Closed
3 tasks done
KonssnoK opened this issue Feb 19, 2024 · 167 comments
Labels
Resolution: Done Issue is done internally Status: Done Issue is done internally

Comments

@KonssnoK
Copy link
Contributor

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

General issue report

@zhangyanjiaoesp

based on 27ec26d
To reproduce

  • Add ping functionality and periodc wifi dumps to main task instead of letting it die:
    esp_ping_config_t ping_config = ESP_PING_DEFAULT_CONFIG();
    IP4_ADDR(&ping_config.target_addr, 8,8,8,8);          // target IP address
    ping_config.count = ESP_PING_COUNT_INFINITE;    // ping in infinite mode, esp_ping_stop can stop it
    ping_config.timeout_ms = 3000;

    /* set callback functions */
    esp_ping_callbacks_t cbs;
    cbs.on_ping_success = test_on_ping_success;
    cbs.on_ping_timeout = test_on_ping_timeout;
    cbs.on_ping_end = test_on_ping_end;
    cbs.cb_args = "foo";  // arguments that will feed to all callback functions, can be NULL

    esp_ping_handle_t ping;
    esp_ping_new_session(&ping_config, &cbs, &ping);
    esp_ping_start(ping);

#define DUMP_INTERVAL_MS (1000*60*2)
    int64_t last_dump = 0;
    while (true) {
        if((os_time_get_timestamp() - last_dump) >= DUMP_INTERVAL_MS) {
            last_dump = os_time_get_timestamp();
            esp_wifi_statis_dump(0xFFFF);
        }
        os_thread_sleep(1000);
    }


static void test_on_ping_success(esp_ping_handle_t hdl, void *args)
{
    // optionally, get callback arguments
    // const char* str = (const char*) args;
    // printf("%s\r\n", str); // "foo"
    uint8_t ttl;
    uint16_t seqno;
    uint32_t elapsed_time, recv_len;
    ip_addr_t target_addr;
    esp_ping_get_profile(hdl, ESP_PING_PROF_SEQNO, &seqno, sizeof(seqno));
    esp_ping_get_profile(hdl, ESP_PING_PROF_TTL, &ttl, sizeof(ttl));
    esp_ping_get_profile(hdl, ESP_PING_PROF_IPADDR, &target_addr, sizeof(target_addr));
    esp_ping_get_profile(hdl, ESP_PING_PROF_SIZE, &recv_len, sizeof(recv_len));
    esp_ping_get_profile(hdl, ESP_PING_PROF_TIMEGAP, &elapsed_time, sizeof(elapsed_time));
    ESP_LOGW(TAG, "%d bytes from %s icmp_seq=%d ttl=%d time=%d ms",
           recv_len, ipaddr_ntoa(&target_addr), seqno, ttl, elapsed_time);
}

static void test_on_ping_timeout(esp_ping_handle_t hdl, void *args)
{
    uint16_t seqno;
    ip_addr_t target_addr;
    esp_ping_get_profile(hdl, ESP_PING_PROF_SEQNO, &seqno, sizeof(seqno));
    esp_ping_get_profile(hdl, ESP_PING_PROF_IPADDR, &target_addr, sizeof(target_addr));
    ESP_LOGW(TAG, "From %s icmp_seq=%d timeout", ipaddr_ntoa(&target_addr), seqno);
}

static void test_on_ping_end(esp_ping_handle_t hdl, void *args)
{
    uint32_t transmitted;
    uint32_t received;
    uint32_t total_time_ms;

    esp_ping_get_profile(hdl, ESP_PING_PROF_REQUEST, &transmitted, sizeof(transmitted));
    esp_ping_get_profile(hdl, ESP_PING_PROF_REPLY, &received, sizeof(received));
    esp_ping_get_profile(hdl, ESP_PING_PROF_DURATION, &total_time_ms, sizeof(total_time_ms));
    ESP_LOGW(TAG, "%d packets transmitted, %d received, time %dms", transmitted, received, total_time_ms);
}
  • Connect to mobile hostspot wifi

  • Put phone simcard to 2G.

  • Wait

  • The issue happens in the ROOT. Once the root is affected, the issue is propagated to all children.

  • The issue is either temporary or permanent. When permanent, the only way to recover is to reboot.

espressif_wifi_dump_3.txt
espressif_wifi_dump.txt
espressif_wifi_dump_2.txt

Other related issues with similar behavior:
#8953
#10506

@AxelLin
Copy link
Contributor

AxelLin commented Jun 7, 2024

@Espressif-liuuuu @zhangyanjiaoesp @nishanth-radja Do you have any finding in above log?

@espressif-bot espressif-bot added the Status: Opened Issue is new label Jun 7, 2024
@github-actions github-actions bot changed the title [ESP32S3] v4.4 WIFI losing connectivity temporarily or permanently without apparent reason [ESP32S3] v4.4 WIFI losing connectivity temporarily or permanently without apparent reason (IDFGH-12977) Jun 7, 2024
@zhangyanjiaoesp
Copy link
Collaborator

@KonssnoK Are you using the ip_internal_network example? In your logs, it show the ping timeout, but it can't confirm the wifi connection is down. Can you capture packets for this ?

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Jun 7, 2024

@KonssnoK Are you using the ip_internal_network example? In your logs, it show the ping timeout, but it can't confirm the wifi connection is down. Can you capture packets for this ?

@zhangyanjiaoesp yes the base is the ip_internal_network example.
how would you check if the wifi connection is down? apart from seeing no packets are sent/receive.

Also, how would you get the packets ? Wireshark connected to a sniffer?

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Jun 10, 2024

So @zhangyanjiaoesp i was able to generate one strange behavior, even if it's not exacly the one reported in this issue.

With the same code (v4.4 top of c0e0af0 )

you can apply patches 1 2 3, which enable monitoring and pinging

03_ip_internal.patch
02_if_dumps.patch
01_packets_dump.patch

by randomly detaching/attaching the layer 2 device i was able to reach this state, where the L2 device is never able to communicate with L1
timeout_l2.txt

I got an extract of L1 too (i would say MESH_EVENT_CHILD_CONNECTED to track L2 events)

ESP-IDF_test.txt

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Jun 10, 2024

interestingly enough to recover the L2 device i had to reboot both devices, meaning rebooting only the L2 device was not solving the issue, and even rebooting the L1 device while L2 device was stuck (after reboot) did not solve the issue

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Jun 11, 2024

@zhangyanjiaoesp
again by simply resetting the 2 devices in different ways, i was able to trigger another case in which one device does not work anymore until reboot.
to be noted: once this device is failing, rebooting the other device makes it fail too.

timeout_reset1.txt
timeout_reset2.txt

to recover the devices i had to keep them offline enough for the phone to lose the cache of connected devices ( pixel8 )

@espressif-bot espressif-bot added Status: In Progress Work is in progress and removed Status: Opened Issue is new labels Jun 13, 2024
@zhangyanjiaoesp
Copy link
Collaborator

@KonssnoK please provide your sdkconfig file, and you are using PSRAM, right?

@KonssnoK
Copy link
Contributor Author

sdkconfig.txt
@zhangyanjiaoesp here it is

@zhangyanjiaoesp
Copy link
Collaborator

zhangyanjiaoesp commented Jun 13, 2024

So @zhangyanjiaoesp i was able to generate one strange behavior, even if it's not exacly the one reported in this issue.

With the same code (v4.4 top of c0e0af0 )

you can apply patches 1 2 3, which enable monitoring and pinging

03_ip_internal.patch 02_if_dumps.patch 01_packets_dump.patch

by randomly detaching/attaching the layer 2 device i was able to reach this state, where the L2 device is never able to communicate with L1 timeout_l2.txt

I got an extract of L1 too (i would say MESH_EVENT_CHILD_CONNECTED to track L2 events)

ESP-IDF_test.txt

This log show the device didn't get the IP address, which cause the ping timeout.

@zhangyanjiaoesp
again by simply resetting the 2 devices in different ways, i was able to trigger another case in which one device does not work anymore until reboot.
to be noted: once this device is failing, rebooting the other device makes it fail too.
timeout_reset1.txt
timeout_reset2.txt
to recover the devices i had to keep them offline enough for the phone to lose the cache of connected devices ( pixel8 )

And this log show the device can't connect to the router, the reason is auth timeout.

I have tested using the router, and can't reproduce this issue. I will use the mobile hostspot to test again, can you provide the model of your phone? Or any phone can reproduce this issue? @KonssnoK

@KonssnoK
Copy link
Contributor Author

@zhangyanjiaoesp i reproduced with a Google Pixel8.

not getting the IP - strange, would mean the IP service is stuck 🤔

@KonssnoK
Copy link
Contributor Author

@zhangyanjiaoesp i moved to 3 devices and trying to replicate but for now without success..

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Jun 18, 2024

and as soon as i wrote that, something strange happened again:
dev3 is not able to connect

dev2.txt
disconnected_dev3.txt
dev1.txt

I (42466) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
W (43196) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:0
W (44396) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:1
W (45596) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:2
W (46796) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:3
W (47996) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:4
W (49196) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:5

(devices are 30cm apart fom each other)

it seems it goes on forever
W (1983596) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:1617

@KonssnoK
Copy link
Contributor Author

one hour in:
W (6444006) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:5334

@KonssnoK
Copy link
Contributor Author

W (13189216) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:10955

@KonssnoK
Copy link
Contributor Author

this is instead the log of device1 getting stuck and not trying to connect to the mesh anymore

dev1_stuck.txt

@zhangyanjiaoesp
Copy link
Collaborator

and as soon as i wrote that, something strange happened again: dev3 is not able to connect

dev2.txt
disconnected_dev3.txt
dev1.txt

I (42466) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
W (43196) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:0
W (44396) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:1
W (45596) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:2
W (46796) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:3
W (47996) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:4
W (49196) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:5

(devices are 30cm apart fom each other)

it seems it goes on forever W (1983596) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:1617

@KonssnoK In the log, I see that initially communication among the three devices was normal, and then you restarted device2 and device3?
And in the logs of device2 and device3, there are logs showing I (42466) wifi:state: run -> init (2c0), this means the wifi connection is disconnected. The wifi disconnection will cause the W (6444006) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:5334. But I can't find why the wifi disconnect. Can you use wireshark to capture packets of the device1/2/3, and send the logs and captures to me? In the log, please display the absolute time.

And I have using the Google Pixel5 mobile to test, but I didn't reproduce the problem.

@KonssnoK
Copy link
Contributor Author

and as soon as i wrote that, something strange happened again: dev3 is not able to connect
dev2.txt
disconnected_dev3.txt
dev1.txt

I (42466) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
W (43196) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:0
W (44396) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:1
W (45596) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:2
W (46796) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:3
W (47996) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:4
W (49196) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:5

(devices are 30cm apart fom each other)
it seems it goes on forever W (1983596) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:1617

@KonssnoK In the log, I see that initially communication among the three devices was normal, and then you restarted device2 and device3? And in the logs of device2 and device3, there are logs showing I (42466) wifi:state: run -> init (2c0), this means the wifi connection is disconnected. The wifi disconnection will cause the W (6444006) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:5334. But I can't find why the wifi disconnect. Can you use wireshark to capture packets of the device1/2/3, and send the logs and captures to me? In the log, please display the absolute time.

And I have using the Google Pixel5 mobile to test, but I didn't reproduce the problem.

@zhangyanjiaoesp
today i have no way to use wireshark.
the only way to recover that device was to reboot it again.
Yes, to trigger the issue i simply randomly restarted devices.
i'm not sure on how to display absolute time considering the logs are in close sourced files.
I understand the wifi is disconnected, but i would expect it to retry a connection once disconnected 🤔

@zhangyanjiaoesp
Copy link
Collaborator

Yes, to trigger the issue i simply randomly restarted devices

Ok, I will try to restart the device and test again.

@KonssnoK
Copy link
Contributor Author

@zhangyanjiaoesp in my experience slow data rates help achieving the issues.
Please put your phone cell technology to 2G or go to "developer options" networking "network download rate limit" and put the minimum.

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Jun 19, 2024

please note that the logs are more or less synchronized at the end, not the start! (i extract them more or less at the same time)

240619dev3.txt
240619dev1.txt
240619dev2.txt

@zhangyanjiaoesp i rebooted the root device and it went offline without managing to reconnect.

After a while device 3 managed to change status and directly connect as the root.
the other 2 remain disconnected

240619dev3_2.txt
240619dev1_2.txt
240619dev2_2.txt

device 2 at some point manages to recover too.

240619dev3_3.txt
240619dev1_3.txt
240619dev2_3.txt

device one is still disconnected and not able to recover instead.

phone connected in 5G with rate limiter at 128kbps

device 1 dodes not recover

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Jun 19, 2024

@zhangyanjiaoesp for reference this setup seems to trigger the issue in the above message quite often.
once again the root device is stuck after a reboot, device 2 takes the root in this occasion, device 3 follows 2, but device 1 is stuck.

240619dev3_4.txt
240619dev1_4.txt
240619dev2_4.txt

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Jun 20, 2024

@zhangyanjiaoesp i tried also today to replicate, to verify if this is consistent:

  • phone on rate limiter 128kbps
  • keep devices off until dev1 root
  • dev 2 3 as children
  • reboot dev 1
  • dev 1 blocked , dev 2 3 blocked

it's quite easy to create issues in this configuration, please let me know if you manage.

240620dev3.txt
240620dev1.txt
240620dev2.txt

after a while dev 3 recovers and then also dev 2. dev 1 is stuck.

240620dev1_3.txt
240620dev3_3.txt
240620dev2_3.txt

@zhangyanjiaoesp
Copy link
Collaborator

@KonssnoK I'm sorry, I have an urgent task recently. I will test your issue next week.

@KonssnoK
Copy link
Contributor Author

@zhangyanjiaoesp sure, i'll concentrate on another issue meanwhile

@zhangyanjiaoesp
Copy link
Collaborator

and as soon as i wrote that, something strange happened again: dev3 is not able to connect

dev2.txt disconnected_dev3.txt dev1.txt

I (42466) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
W (43196) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:0
W (44396) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:1
W (45596) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:2
W (46796) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:3
W (47996) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:4
W (49196) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:5

(devices are 30cm apart fom each other)

it seems it goes on forever W (1983596) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:1617

@KonssnoK I have reproduced this issue by rebooting the root device, and I have found the root cause, the following wifi libs can solve the problem. Please replace the wifi libs and test again.
wifi_lib_s3_0625.zip (wifi firmware version: f736b07)

For the other issues, I still can't reproduce them although I randomly reboot the device2/3.

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Jul 17, 2024

@zhangyanjiaoesp by adding some logs i got this


	Line 2061: W (801) esp_netif_lwip: +++esp_netif_start_api 0x3fced4dc
	Line 2392: W (27931) esp_netif_lwip: +++esp_netif_start_api 0x3fced6e0
	Line 4976: W (185131) esp_netif_lwip: +++esp_netif_start_api 0x3fced6e0
	Line 5325: W (198931) esp_netif_lwip: +++esp_netif_dhcps_stop_api esp_netif:0x3fca7888
	Line 5326: W (198941) esp_netif_lwip: +++dhcp server status 0
	Line 5327: W (198941) esp_netif_lwip: +++dhcp server stop successfully
	Line 5328: W (198951) esp_netif_lwip: +++esp_netif_dhcps_start_api esp_netif:0x3fca7888
	Line 5329: W (198951) esp_netif_lwip: +++dhcp server re init
	Line 5330: W (198961) esp_netif_lwip: +++esp_netif_start_api 0x3fca7888
	Line 5331: W (198971) esp_netif_lwip: +++esp_netif_start_api Setting the lwip netif0x3fcb8fdc UP
	Line 5340: W (199011) esp_netif_lwip: +++DHCP server started successfully
	Line 6230: W (214561) esp_netif_lwip: +++esp_netif_stop_api 0x3fced6e0
	Line 6232: W (214581) esp_netif_lwip: +++esp_netif_stop_api netif_is_up
	Line 6234: W (214601) esp_netif_lwip: +++esp_netif_start_api 0x3fced6e0
	Line 7654: W (319211) esp_netif_lwip: +++esp_netif_start_api 0x3fced6e0
	Line 8161: W (355591) esp_netif_lwip: +++esp_netif_start_api 0x3fced6e0
	Line 8549: W (373801) esp_netif_lwip: +++esp_netif_dhcps_stop_api esp_netif:0x3fca72b4
	Line 8550: W (373811) esp_netif_lwip: +++dhcp server status 0
	Line 8551: W (373821) esp_netif_lwip: +++dhcp server stop successfully
	Line 8552: W (373821) esp_netif_lwip: +++esp_netif_dhcps_start_api esp_netif:0x3fca72b4
	Line 8553: W (373831) esp_netif_lwip: +++dhcp server re init
	Line 8554: W (373841) esp_netif_lwip: +++esp_netif_start_api 0x3fca72b4
	Line 8555: W (373841) esp_netif_lwip: +++esp_netif_start_api Setting the lwip netif0x3fcbd848 UP
	Line 8564: E (373881) esp_netif_lwip: +++DHCP server cannot be started

240717dev1_10.txt

@KonssnoK
Copy link
Contributor Author

@zhangyanjiaoesp i think i found the underlying issue.

the function in the example


static esp_err_t set_dhcps_dns(esp_netif_t *netif, uint32_t addr)
{
    esp_netif_dns_info_t dns;
    dns.ip.u_addr.ip4.addr = addr;
    dns.ip.type = IPADDR_TYPE_V4;
    dhcps_offer_t dhcps_dns_value = OFFER_DNS;
    ESP_ERROR_CHECK_WITHOUT_ABORT(esp_netif_dhcps_stop(netif));
    ESP_ERROR_CHECK(esp_netif_dhcps_option(netif, ESP_NETIF_OP_SET, ESP_NETIF_DOMAIN_NAME_SERVER, &dhcps_dns_value, sizeof(dhcps_dns_value)));
    ESP_ERROR_CHECK(esp_netif_set_dns_info(netif, ESP_NETIF_DNS_MAIN, &dns));
    ESP_ERROR_CHECK_WITHOUT_ABORT(esp_netif_dhcps_start(netif));
    return ESP_OK;
}

works only if the same interface is used over and over.

But instead

esp_err_t mesh_netif_start_root_ap(bool is_root, uint32_t addr)
{
    if (is_root) {
        if (netif_ap) {
            esp_netif_action_disconnected(netif_ap, NULL, 0, NULL);
            mesh_delete_if_driver(esp_netif_get_io_driver(netif_ap));
            esp_netif_destroy(netif_ap);
            netif_ap = NULL;
        }
        netif_ap = create_mesh_link_ap();
        mesh_netif_driver_t driver = mesh_create_if_driver(true, true);
        if (driver == NULL) {
            ESP_LOGE(TAG, "Failed to create wifi interface handle");
            return ESP_FAIL;
        }
        esp_netif_attach(netif_ap, driver);
        set_dhcps_dns(netif_ap, addr);
        start_mesh_link_ap();
        ip_napt_enable(g_mesh_netif_subnet_ip.ip.addr, 1);
    }
    return ESP_OK;
}

will generate a new interface each time the device has to start as root.
This means that if the device was already running a DHCPS, it won't be stopped, because the stop is sent to the newly created device and not to the previous one.

This is clearly visible from the logs above, where you see esp_netif_dhcps_stop_api being called on the same interface that is then called with esp_netif_dhcps_start_api

I leave you to fix this in your example.
I will now work to fix it in our code.

How is the propagation of the mesh fixes to 4.4 going ? when can we expect a push to 4.4?

Thanks.

@zhangyanjiaoesp
Copy link
Collaborator

@zhangyanjiaoesp by adding some logs i got this


	Line 2061: W (801) esp_netif_lwip: +++esp_netif_start_api 0x3fced4dc
	Line 2392: W (27931) esp_netif_lwip: +++esp_netif_start_api 0x3fced6e0
	Line 4976: W (185131) esp_netif_lwip: +++esp_netif_start_api 0x3fced6e0
	Line 5325: W (198931) esp_netif_lwip: +++esp_netif_dhcps_stop_api esp_netif:0x3fca7888
	Line 5326: W (198941) esp_netif_lwip: +++dhcp server status 0
	Line 5327: W (198941) esp_netif_lwip: +++dhcp server stop successfully
	Line 5328: W (198951) esp_netif_lwip: +++esp_netif_dhcps_start_api esp_netif:0x3fca7888
	Line 5329: W (198951) esp_netif_lwip: +++dhcp server re init
	Line 5330: W (198961) esp_netif_lwip: +++esp_netif_start_api 0x3fca7888
	Line 5331: W (198971) esp_netif_lwip: +++esp_netif_start_api Setting the lwip netif0x3fcb8fdc UP
	Line 5340: W (199011) esp_netif_lwip: +++DHCP server started successfully
	Line 6230: W (214561) esp_netif_lwip: +++esp_netif_stop_api 0x3fced6e0
	Line 6232: W (214581) esp_netif_lwip: +++esp_netif_stop_api netif_is_up
	Line 6234: W (214601) esp_netif_lwip: +++esp_netif_start_api 0x3fced6e0
	Line 7654: W (319211) esp_netif_lwip: +++esp_netif_start_api 0x3fced6e0
	Line 8161: W (355591) esp_netif_lwip: +++esp_netif_start_api 0x3fced6e0
	Line 8549: W (373801) esp_netif_lwip: +++esp_netif_dhcps_stop_api esp_netif:0x3fca72b4
	Line 8550: W (373811) esp_netif_lwip: +++dhcp server status 0
	Line 8551: W (373821) esp_netif_lwip: +++dhcp server stop successfully
	Line 8552: W (373821) esp_netif_lwip: +++esp_netif_dhcps_start_api esp_netif:0x3fca72b4
	Line 8553: W (373831) esp_netif_lwip: +++dhcp server re init
	Line 8554: W (373841) esp_netif_lwip: +++esp_netif_start_api 0x3fca72b4
	Line 8555: W (373841) esp_netif_lwip: +++esp_netif_start_api Setting the lwip netif0x3fcbd848 UP
	Line 8564: E (373881) esp_netif_lwip: +++DHCP server cannot be started

240717dev1_10.txt

@KonssnoK Can you provide the patch of this log ? I need to know what are the code corresponding to logs.

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Jul 17, 2024

@zhangyanjiaoesp see my last message, i don't think those logs are necessary anymore, i'm testing out a small change that should fix the DHCPS issue


static esp_err_t set_dhcps_dns(esp_netif_t *netif, uint32_t addr)
{
    esp_netif_dns_info_t dns;
    dns.ip.u_addr.ip4.addr = addr;
    dns.ip.type = IPADDR_TYPE_V4;
    dhcps_offer_t dhcps_dns_value = OFFER_DNS;
    ESP_ERROR_CHECK(esp_netif_dhcps_option(netif, ESP_NETIF_OP_SET, ESP_NETIF_DOMAIN_NAME_SERVER, &dhcps_dns_value, sizeof(dhcps_dns_value)));
    ESP_ERROR_CHECK(esp_netif_set_dns_info(netif, ESP_NETIF_DNS_MAIN, &dns));
    ESP_ERROR_CHECK_WITHOUT_ABORT(esp_netif_dhcps_start(netif));
    return ESP_OK;
}

esp_err_t mesh_netif_start_root_ap(bool is_root, uint32_t addr)
{
    if (is_root) {
        if (netif_ap) {
            ESP_ERROR_CHECK_WITHOUT_ABORT(esp_netif_dhcps_stop(netif_ap));
            esp_netif_action_disconnected(netif_ap, NULL, 0, NULL);
            mesh_delete_if_driver(esp_netif_get_io_driver(netif_ap));
            esp_netif_destroy(netif_ap);
            netif_ap = NULL;
        }
        netif_ap = create_mesh_link_ap();
        mesh_netif_driver_t driver = mesh_create_if_driver(true, true);
        if (driver == NULL) {
            ESP_LOGE(TAG, "Failed to create wifi interface handle");
            return ESP_FAIL;
        }
        esp_netif_attach(netif_ap, driver);
        set_dhcps_dns(netif_ap, addr);
        start_mesh_link_ap();
        ip_napt_enable(g_mesh_netif_subnet_ip.ip.addr, 1);
    }
    return ESP_OK;
}

@zhangyanjiaoesp
Copy link
Collaborator

@KonssnoK The fix LGTM.

How is the propagation of the mesh fixes to 4.4 going ? when can we expect a push to 4.4?

For this issue, my colleague has contacted your colleague Lorenzo, and communicated with him about the solution of providing fixes. You can confirm with him.

@KonssnoK
Copy link
Contributor Author

@zhangyanjiaoesp the small fixes i put above are not complete

esp_err_t mesh_netifs_stop(void)

should also contain the same


    if (netif_ap) {
        ESP_ERROR_CHECK_WITHOUT_ABORT(esp_netif_dhcps_stop(netif_ap));
        esp_netif_action_disconnected(netif_ap, NULL, 0, NULL);
        mesh_delete_if_driver(esp_netif_get_io_driver(netif_ap));
        esp_netif_destroy(netif_ap);
        netif_ap = NULL;
    }

@KonssnoK
Copy link
Contributor Author

For this issue, my colleague has contacted your colleague Lorenzo, and communicated with him about the solution of providing fixes. You can confirm with him.

@zhangyanjiaoesp mmm, for now we got a shadow repository of v4.4.8 esp-idf on gitlab, but the problem is that the fix you did is on esp_wifi submodule, so the current repo is not enough. I already wrote to Caijin.

@zhangyanjiaoesp
Copy link
Collaborator

zhangyanjiaoesp commented Jul 17, 2024

@KonssnoK we will push the fixes based on the v4.4.8 esp-idf, and then update it to you. The branch will be provided tomorrow.

@espressif-bot espressif-bot added Status: Reviewing Issue is being reviewed and removed Status: In Progress Work is in progress labels Jul 18, 2024
@KonssnoK
Copy link
Contributor Author

hello @zhangyanjiaoesp ,
somehow i was able to retrigger the issue of a device being in a infinite loop of

>>>mesh_xon_process_disconnected,2638, (1, 1, 2)
W (841867) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:64, xseqno:72, no_wnd_count:0, timeout_count:659

240719dev3_1.txt

the way i triggered this is simply to let devices start without wifi, and waiting for the LTE device to connect to the LTE.

how did this issue triggers in the code that you fixed?
Since now we have the 30 secons timeout, it means something else is blocking the tasks... any idea?

@zhangyanjiaoesp
Copy link
Collaborator

@KonssnoK
Can you add the debug patch in the event loop and test again?

(0002-add-debug-log-in-event-loop.zip)

By the way, since this ticket has been closed and there are too many comments under it, can you create a new ticket to trace this new issue?

@KonssnoK
Copy link
Contributor Author

@zhangyanjiaoesp i'm now transitioning to the lib pushed on gitlab, i will retry to trigger the issues and in case open the new ticket.. it seems quite dificult to be reproduced.

@KonssnoK
Copy link
Contributor Author

@zhangyanjiaoesp was the fix for the deauth router pushed in our 4.4.8 branch? thanks!

@zhangyanjiaoesp
Copy link
Collaborator

@zhangyanjiaoesp was the fix for the deauth router pushed in our 4.4.8 branch? thanks!

Yes, the branch contain 3 changes:

  1. send deauth before auth when root restart
  2. send null data periodic
  3. fix the netif bug for dhcp server

@espressif-bot espressif-bot added Status: Done Issue is done internally Resolution: Done Issue is done internally and removed Status: Reviewing Issue is being reviewed labels Aug 1, 2024
espressif-bot pushed a commit that referenced this issue Aug 7, 2024
espressif-bot pushed a commit that referenced this issue Aug 24, 2024
espressif-bot pushed a commit that referenced this issue Aug 24, 2024
espressif-bot pushed a commit that referenced this issue Sep 10, 2024
espressif-bot pushed a commit that referenced this issue Sep 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Resolution: Done Issue is done internally Status: Done Issue is done internally
Projects
None yet
Development

No branches or pull requests

4 participants