ena_com_get_dev_basic_stats spin more than 1 sec #140
Comments
Can you share the instance type and instance ID?
From: Hanoh Haim <[email protected]>
Reply-To: amzn/amzn-drivers <[email protected]>
Date: Monday, August 24, 2020 at 11:13 PM
To: amzn/amzn-drivers <[email protected]>
Cc: Subscribed <[email protected]>
Subject: [amzn/amzn-drivers] ena_com_get_dev_basic_stats spin more than 1 sec (#140)
* DPDK 20.02
0x5584cf3f0faa ./_t-rex-64(+0x194faa) [0x5584cf3f0faa]
2 0x7f0f006668a0 /lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0) [0x7f0f006668a0]
3 0x7f0f00661f85 pthread_cond_timedwait + 821
4 0x5584cf80e393 ena_com_execute_admin_command + 435
5 0x5584cf810659 ena_com_get_dev_basic_stats + 57
6 0x5584cf808129 ./_t-rex-64(+0x5ac129) [0x5584cf808129]
7 0x5584cf6d8020 rte_eth_stats_get + 128
8 0x5584cf490f30 CTRexExtendedDriverBase::get_extended_stats_fixed(CPhyEthIF*, CPhyEthIFStats*, int, int) + 32
9 0x5584cf362922 CPhyEthIF::get_extended_stats() + 28
10 0x5584cf362b67 CPhyEthIF::update_counters() + 17
11 0x5584cf36c73e CGlobalTRex::update_stats() + 56
12 0x5584cf36d657 CGlobalTRex::sync_threads_stats() + 9
13 0x5584cf36e457 CGlobalTRex::port_stats_to_json(Json::Value&, unsigned char) + 17
14 0x5584cf5440e2 TrexRpcCmdGetPortStats::_run(Json::Value const&, Json::Value&) + 66
15 0x5584cf534911 TrexRpcCommand::run(Json::Value const&, Json::Value&) + 81
16 0x5584cf530474 JsonRpcMethod::_execute(Json::Value&) + 52
17 0x5584cf52dbd3 TrexJsonRpcV2ParsedObject::execute(Json::Value&) + 131
18 0x5584cf52c16d TrexRpcServerReqRes::process_request_raw(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) + 541
19 0x5584cf52c90e TrexRpcServerReqRes::process_zipped_request(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) + 462
20 0x5584cf52cce5 TrexRpcServerReqRes::handle_request(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 469
21 0x5584cf52d2fb TrexRpcServerReqRes::_rpc_thread_cb_int() + 1339
22 0x5584cf52da2b TrexRpcServerReqRes::_rpc_thread_cb() + 11
23 0x7f0efff6b27f so/x86_64/libstdc++.so.6(+0xba27f) [0x7f0efff6b27f]
24 0x7f0f0065b6db /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f0f0065b6db]
25 0x7f0eff62ba3f clone + 63
This is a new crash in https://github.com/cisco-system-traffic-generator/trex-core.
It means that ena_com_get_dev_basic_stats spins for more than 1 sec.
Is this expected?
|
@AWSNB I have forwarded the questions. |
@AWSNB One more thing: it happens for some users and not for others. It is a rather new thing, from the past week. |
Happened on t3.xlarge i-089921d082eaf5bc6 |
We’ll check on our side.
In general, though, t3 is a burstable instance type with CPU credits; when an instance runs out of credits, it drops to its baseline performance. I’m not saying this is the root cause yet, because we still need to debug it.
Could you run the same test on an m5.xlarge and compare?
From: jaygmuru <[email protected]>
Reply-To: amzn/amzn-drivers <[email protected]>
Date: Monday, August 24, 2020 at 11:27 PM
To: amzn/amzn-drivers <[email protected]>
Cc: "Bshara, Nafea" <[email protected]>, Mention <[email protected]>
Subject: Re: [amzn/amzn-drivers] ena_com_get_dev_basic_stats spin more than 1 sec (#140)
|
Reference to T2/T3 burst behavior: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-credits-baseline-concepts.html
From: "Bshara, Nafea" <[email protected]>
Date: Monday, August 24, 2020 at 11:54 PM
To: amzn/amzn-drivers <[email protected]>, amzn/amzn-drivers <[email protected]>
Cc: Mention <[email protected]>
Subject: Re: [amzn/amzn-drivers] ena_com_get_dev_basic_stats spin more than 1 sec (#140)
|
(sorry for necrobumping) |
@mpastyl Please share instance-id and time |
@yastreb78 Thank you for your reply. It happened on i-00ffbaea3c3768628 at roughly Sept 16 09:15:24 UTC. I have appended the stack trace below. Basically, we kill the application automatically if it fails to reply before a timeout, hence the signal at frame 4.
|
We are looking into this - will keep this thread updated. |
@mpastyl Thanks for the info, we will review the instance data from our side |
Hey, we reviewed the instance logs on our side but did not observe any errors on the HW side when the issue reproduced.
We would like to try to reproduce this in house in parallel. We would appreciate it if you could share the following information:
|
Hi @shaibran, thank you for taking the time to look at this. We have been trying to reproduce and isolate the issue ourselves. Here are some updates:
To answer your previous questions:
This would indeed stop the application from timing out, because we see that the function gets unstuck after roughly 3 sec. However, a spurious 3 sec freeze is a problem for our use case and we would like to avoid it.
Thank you for your time. |
Thank you @mpastyl for the detailed response, we will review this and update |
Together with @mpastyl we've been looking further into this issue. I believe that we've identified the bug and fixed it. We observe the main thread occasionally waiting on a condition variable in the ENA DPDK driver for 3 seconds. Found with Linux perf, tracing
Checking the source code:
static int ena_com_wait_and_process_admin_cq_interrupts(struct ena_comp_ctx *comp_ctx,
struct ena_com_admin_queue *admin_queue)
{
unsigned long flags = 0;
int ret;
ENA_WAIT_EVENT_WAIT(comp_ctx->wait_event, // <-- ena_com.c:764
admin_queue->completion_timeout);
/* In case the command wasn't completed find out the root cause.
* There might be 2 kinds of errors
* 1) No completion (timeout reached)
* 2) There is completion but the device didn't get any msi-x interrupt.
*/
if (unlikely(comp_ctx->status == ENA_CMD_SUBMITTED)) {
ENA_SPINLOCK_LOCK(admin_queue->q_lock, flags);
ena_com_handle_admin_completion(admin_queue);
admin_queue->stats.no_completion++;
ENA_SPINLOCK_UNLOCK(admin_queue->q_lock, flags);
if (comp_ctx->status == ENA_CMD_COMPLETED) {
ena_trc_err("The ena device sent a completion but the driver didn't receive a MSI-X interrupt (cmd %d), autopolling mode is %s\n",
comp_ctx->cmd_opcode, admin_queue->auto_polling ? "ON" : "OFF");
/* Check if fallback to polling is enabled */
if (admin_queue->auto_polling)
admin_queue->polling = true;
} else {
ena_trc_err("The ena device didn't send a completion for the admin cmd %d status %d\n",
comp_ctx->cmd_opcode, comp_ctx->status);
}
surprisingly, Furthermore, our build had
#define q_waitqueue_t \
struct { \
pthread_cond_t cond; \
pthread_mutex_t mutex; \
}
#define ena_wait_queue_t q_waitqueue_t
#define ENA_WAIT_EVENT_WAIT(waitevent, timeout) \
do { \
struct timespec wait; \
struct timeval now; \
unsigned long timeout_us; \
gettimeofday(&now, NULL); \
wait.tv_sec = now.tv_sec + timeout / 1000000UL; \
timeout_us = timeout % 1000000UL; \
wait.tv_nsec = (now.tv_usec + timeout_us) * 1000UL; \
pthread_mutex_lock(&waitevent.mutex); \
pthread_cond_timedwait(&waitevent.cond, \
&waitevent.mutex, &wait); \
pthread_mutex_unlock(&waitevent.mutex); \
} while (0)
#define ENA_WAIT_EVENT_SIGNAL(waitevent) pthread_cond_signal(&waitevent.cond)
This definitely doesn't stick to the textbook condition variable patterns! We suspect that in extremely rare cases, We implemented a change similar to 072b9f2bbc2 in http://dpdk.org/git/dpdk, and ran for 48 hours without issues. |
@mejedi, @mpastyl and @hhaim, thank you again. Indeed, commit 072b9f2bbc2402a8c86194fe9e11458c1605540a "net/ena: handle spurious wakeups in wait event" resolves the scenario where the interrupt arrives before ENA_WAIT_EVENT_WAIT starts waiting. It was not part of DPDK v21.02 (it is included only in 21.05). We are preparing a troubleshooting document that will also include a recommendation to update ENA or cherry-pick this specific commit. |
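A side note on the ENA_WAIT_EVENT_WAIT macro as quoted earlier in this thread: it computes `tv_nsec` as `(now.tv_usec + timeout_us) * 1000UL`, and since both terms can be close to 1,000,000 the result can reach or exceed 1,000,000,000, which POSIX treats as an invalid `timespec` (typically reported as EINVAL by `pthread_cond_timedwait`). The thread does not establish that this played any part in the stall. The sketch below only shows one way to carry the overflow into `tv_sec`; `deadline_from_now` is a hypothetical helper, not a DPDK function.

```c
#include <sys/time.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/*
 * Hypothetical helper: absolute deadline = now + timeout_us,
 * with tv_nsec kept in the valid range [0, 1e9).
 */
static struct timespec deadline_from_now(struct timeval now,
                                         unsigned long timeout_us)
{
    struct timespec ts;
    /* Sub-second part in nanoseconds; may be >= 1e9 before normalization. */
    long nsec = ((long)now.tv_usec + (long)(timeout_us % 1000000UL)) * 1000L;

    ts.tv_sec = now.tv_sec + (time_t)(timeout_us / 1000000UL);
    if (nsec >= NSEC_PER_SEC) {
        ts.tv_sec += 1;       /* carry one whole second */
        nsec -= NSEC_PER_SEC;
    }
    ts.tv_nsec = nsec;
    return ts;
}
```

In use, `now` would come from `gettimeofday(&now, NULL)` as in the quoted macro, and the returned value would be passed as the absolute timeout to `pthread_cond_timedwait`.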