Commit 9016dbf

Fix: sessiond: assert on empty payload when handling client out event
Observed issue
==============

When servicing a large number of tracer notifications and sending notifications to clients, the session daemon occasionally hits an assertion:

    #4  0x00007fb224d7d116 in __assert_fail () from /usr/lib/libc.so.6
    #5  0x000056038b2fe4d7 in client_flush_outgoing_queue (client=0x7fb21400c3b0) at notification-thread-events.cpp:3586
    #6  0x000056038b2ff819 in handle_notification_thread_client_out (state=0x7fb221974090, socket=77) at notification-thread-events.cpp:4104
    #7  0x000056038b2f3d77 in thread_notification (data=0x56038cc7fe90) at notification-thread.cpp:763
    #8  0x000056038b30ca7d in launch_thread (data=0x56038cc7e220) at thread.cpp:66
    #9  0x00007fb224dcf5c2 in start_thread () from /usr/lib/libc.so.6
    #10 0x00007fb224e54584 in clone () from /usr/lib/libc.so.6

Cause
=====

Under some circumstances, a client "out" event can be received when no payload is left to send.

Many threads can flush a client's outgoing queue and, if they had to queue their message (socket was full), will use the "communication update" command to signal the (e)poll thread to monitor for space being made available in the socket.

Commands are sent over an internal pipe serviced by the same thread as the client sockets.

When space is made available in the socket, there is a race between the (e)poll thread and the other threads that may wish to use the client's socket to flush its outgoing queue. A non-(e)poll thread may attempt to flush the queue (and succeed) before the (e)poll thread gets a chance to service the client's "out" event.

In this situation, the (e)poll thread processing the client "out" event will see an empty payload: there is nothing to do.

Solution
========

The (e)poll thread can simply ignore the client "out" event when an empty payload is seen. There is also no need to update the transmission status, as the other thread has already enqueued a "communication update" command to do so.

Known drawbacks
===============

None.
Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: I8a181bea1e37e8e14cc67b624b76d139b488eded
1 parent 8a880a8 commit 9016dbf
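The interleaving described under "Cause" can be replayed deterministically in a single thread. This is a minimal sketch under simplifying assumptions: `SimClient`, `sim_try_flush`, `RaceOutcome` and `replay_race` are hypothetical names (not sessiond APIs), a string stands in for the outbound payload and an integer for the free space in the client socket.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical, simplified client model (not the sessiond data structures).
struct SimClient {
    std::string outgoing;         // bytes still waiting to be sent
    std::size_t socket_space = 0; // bytes the socket can currently accept
    bool monitor_out = false;     // is the (e)poll thread watching "out"?
};

// Send as much as the socket accepts; returns true once the queue is empty.
bool sim_try_flush(SimClient &c)
{
    const std::size_t n = std::min(c.outgoing.size(), c.socket_space);
    c.outgoing.erase(0, n);
    c.socket_space -= n;
    return c.outgoing.empty();
}

struct RaceOutcome {
    bool out_event_pending; // the (e)poll thread still has an "out" event to service
    bool queue_empty;       // ...but nothing is left to send
};

// Replays the "Cause" interleaving step by step.
RaceOutcome replay_race()
{
    SimClient c;
    c.outgoing = "notification-payload"; // 20 bytes
    c.socket_space = 8;

    // 1. A worker flushes, fills the socket and leaves the rest queued; its
    //    "communication update" makes the (e)poll thread monitor "out".
    if (!sim_try_flush(c)) {
        c.monitor_out = true;
    }

    // 2. The peer drains the socket: space is available, an "out" event fires.
    c.socket_space = 64;

    // 3. Another worker wins the race and flushes the remaining bytes.
    sim_try_flush(c);

    // 4. The (e)poll thread finally services the "out" event.
    return {c.monitor_out, c.outgoing.empty()};
}
```

`replay_race()` reports both a pending "out" event and an empty queue, which is the state the (e)poll thread ran into before this fix.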

3 files changed, +106 -29 lines changed

src/bin/lttng-sessiond/action-executor.cpp

+9
```diff
@@ -233,6 +233,15 @@ static int client_handle_transmission_status(
 	case CLIENT_TRANSMISSION_STATUS_COMPLETE:
 		DBG("Successfully sent full notification to client, client_id = %" PRIu64,
 				client->id);
+		/*
+		 * There is no need to wake the (e)poll thread. If it was waiting for
+		 * "out" events on the client's socket, it will see that no payload
+		 * is queued and will unsubscribe from that event.
+		 *
+		 * In the other cases, we have to wake the (e)poll thread to either
+		 * handle the error on the client or to get it to monitor the client "out"
+		 * events.
+		 */
 		update_communication = false;
 		break;
 	case CLIENT_TRANSMISSION_STATUS_QUEUED:
```
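The rule stated in the new comment (wake the (e)poll thread for every status except a complete send) can be sketched as a small decision function. The enum and function names are hypothetical, and only the three statuses discussed in the comment are modeled.

```cpp
// Hypothetical mirror of the transmission statuses discussed above.
enum class TransmissionStatus { Complete, Queued, Fail };

// Returns true when the (e)poll thread must be woken with a
// "communication update" command.
bool needs_communication_update(TransmissionStatus status)
{
    switch (status) {
    case TransmissionStatus::Complete:
        // Nothing left to send: if the (e)poll thread is watching "out"
        // events, it will find an empty queue and stop watching on its own.
        return false;
    case TransmissionStatus::Queued: // it must start monitoring "out" events
    case TransmissionStatus::Fail:   // it must handle the error on the client
        return true;
    }
    return true; // not reached with a valid enumerator
}
```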

src/bin/lttng-sessiond/notification-thread-events.cpp

+96 -29
```diff
@@ -47,8 +47,8 @@
 #include "lttng-sessiond.hpp"
 #include "kernel.hpp"
 
-#define CLIENT_POLL_MASK_IN (LPOLLIN | LPOLLERR | LPOLLHUP | LPOLLRDHUP)
-#define CLIENT_POLL_MASK_IN_OUT (CLIENT_POLL_MASK_IN | LPOLLOUT)
+#define CLIENT_POLL_EVENTS_IN (LPOLLIN | LPOLLERR | LPOLLHUP | LPOLLRDHUP)
+#define CLIENT_POLL_EVENTS_IN_OUT (CLIENT_POLL_EVENTS_IN | LPOLLOUT)
 
 /* The tracers currently limit the capture size to PIPE_BUF (4kb on linux). */
 #define MAX_CAPTURE_SIZE (PIPE_BUF)
```
```diff
@@ -3394,9 +3394,9 @@ int handle_notification_thread_client_connect(
 		goto error;
 	}
 
+	client->communication.current_poll_events = CLIENT_POLL_EVENTS_IN;
 	ret = lttng_poll_add(&state->events, client->socket,
-			LPOLLIN | LPOLLERR |
-			LPOLLHUP | LPOLLRDHUP);
+			client->communication.current_poll_events);
 	if (ret < 0) {
 		ERR("Failed to add notification channel client socket to poll set");
 		ret = 0;
```
```diff
@@ -3530,6 +3530,18 @@ int handle_notification_thread_trigger_unregister_all(
 	return error_occurred ? -1 : 0;
 }
 
+static
+bool client_has_outbound_data_left(
+		const struct notification_client *client)
+{
+	const struct lttng_payload_view pv = lttng_payload_view_from_payload(
+			&client->communication.outbound.payload, 0, -1);
+	const bool has_data = pv.buffer.size != 0;
+	const bool has_fds = lttng_payload_view_get_fd_handle_count(&pv);
+
+	return has_data || has_fds;
+}
+
 static
 int client_handle_transmission_status(
 		struct notification_client *client,
```
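The moved helper treats the outbound queue as non-empty if either payload bytes or file descriptors remain. A standalone analog, assuming a hypothetical `OutboundPayload` type in place of `lttng_payload_view` (the real helper reads `pv.buffer.size` and `lttng_payload_view_get_fd_handle_count()` instead):

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the client's outbound payload: a byte buffer
// plus file descriptors that are sent along with it.
struct OutboundPayload {
    std::string buffer;
    std::vector<int> fds;
};

// Mirrors the check in the diff: data is "left" if either bytes or
// file descriptors are still pending.
bool has_outbound_data_left(const OutboundPayload &payload)
{
    return !payload.buffer.empty() || !payload.fds.empty();
}
```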
```diff
@@ -3540,24 +3552,51 @@ int client_handle_transmission_status(
 
 	switch (transmission_status) {
 	case CLIENT_TRANSMISSION_STATUS_COMPLETE:
-		ret = lttng_poll_mod(&state->events, client->socket,
-				CLIENT_POLL_MASK_IN);
-		if (ret) {
-			goto end;
-		}
-
-		break;
 	case CLIENT_TRANSMISSION_STATUS_QUEUED:
+	{
+		int current_poll_events;
+		int new_poll_events;
 		/*
 		 * We want to be notified whenever there is buffer space
-		 * available to send the rest of the payload.
+		 * available to send the rest of the payload if we are
+		 * waiting to send data to the client.
+		 *
+		 * The state of the outbound queue being sampled here is
+		 * fine since:
+		 * - it is okay to wake-up "for nothing" in case we see
+		 *   that data is left, but another thread succeeds in
+		 *   flushing it before us when handling the client "out"
+		 *   event. We will simply stop monitoring that event the next
+		 *   time it wakes us up and we see no data left to be sent,
+		 * - if another thread fails to flush the entire client
+		 *   outgoing queue, it will issue a "communication update"
+		 *   command and cause the client's (e)poll mask to be
+		 *   re-evaluated.
+		 *
+		 * The situation we seek to avoid would be to disable the
+		 * monitoring of "out" client events indefinitely when there is
+		 * data to be sent, which can't happen because of the
+		 * aforementioned "communication update" mechanism.
 		 */
-		ret = lttng_poll_mod(&state->events, client->socket,
-				CLIENT_POLL_MASK_IN_OUT);
-		if (ret) {
-			goto end;
+		pthread_mutex_lock(&client->lock);
+		current_poll_events = client->communication.current_poll_events;
+		new_poll_events = client_has_outbound_data_left(client) ?
+				CLIENT_POLL_EVENTS_IN_OUT :
+				CLIENT_POLL_EVENTS_IN;
+		client->communication.current_poll_events = new_poll_events;
+		pthread_mutex_unlock(&client->lock);
+
+		/* Update the monitored event set only if it changed. */
+		if (current_poll_events != new_poll_events) {
+			ret = lttng_poll_mod(&state->events, client->socket,
+					new_poll_events);
+			if (ret) {
+				goto end;
+			}
 		}
+
 		break;
+	}
 	case CLIENT_TRANSMISSION_STATUS_FAIL:
 		ret = notification_thread_client_disconnect(client, state);
 		if (ret) {
```
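The `QUEUED` case (which `COMPLETE` now falls through to) follows a cache-and-compare pattern: sample the queue and swap the cached mask under the client lock, then touch the poll set only when the mask changed. A sketch under stated assumptions: `FakePollSet`, `PollClient`, the `EV_*` bits and `reevaluate_poll_events` are hypothetical stand-ins for the lttng poll wrappers, and the `data_left` flag replaces the real outbound-queue check.

```cpp
#include <mutex>

// Hypothetical event bits standing in for LPOLLIN / LPOLLOUT.
constexpr int EV_IN = 1 << 0;
constexpr int EV_OUT = 1 << 1;
constexpr int EVENTS_IN = EV_IN;
constexpr int EVENTS_IN_OUT = EV_IN | EV_OUT;

// Stand-in for the poll set: records the mask and how often it is modified.
struct FakePollSet {
    int mask = EVENTS_IN;
    int mod_calls = 0;
    void mod(int new_mask) { mask = new_mask; ++mod_calls; }
};

struct PollClient {
    std::mutex lock;
    int current_poll_events = EVENTS_IN; // set when the client connects
    bool data_left = false;              // stand-in for the outbound check
};

// Sample the queue under the client lock, then update the poll set
// outside the lock, and only if the desired mask differs from the cache.
void reevaluate_poll_events(PollClient &client, FakePollSet &poll_set)
{
    int current_events;
    int new_events;
    {
        std::lock_guard<std::mutex> guard(client.lock);
        current_events = client.current_poll_events;
        new_events = client.data_left ? EVENTS_IN_OUT : EVENTS_IN;
        client.current_poll_events = new_events;
    }
    if (current_events != new_events) {
        poll_set.mod(new_events);
    }
}
```

Re-evaluating twice with the same queue state modifies the poll set only once; the second call finds the cached mask already matching.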
```diff
@@ -3697,18 +3736,6 @@ enum client_transmission_status client_flush_outgoing_queue(
 	return CLIENT_TRANSMISSION_STATUS_ERROR;
 }
 
-static
-bool client_has_outbound_data_left(
-		const struct notification_client *client)
-{
-	const struct lttng_payload_view pv = lttng_payload_view_from_payload(
-			&client->communication.outbound.payload, 0, -1);
-	const bool has_data = pv.buffer.size != 0;
-	const bool has_fds = lttng_payload_view_get_fd_handle_count(&pv);
-
-	return has_data || has_fds;
-}
-
 /* Client lock must _not_ be held by the caller. */
 static
 int client_send_command_reply(struct notification_client *client,
```
```diff
@@ -4117,7 +4144,47 @@ int handle_notification_thread_client_out(
 	}
 
 	pthread_mutex_lock(&client->lock);
-	transmission_status = client_flush_outgoing_queue(client);
+	if (!client_has_outbound_data_left(client)) {
+		/*
+		 * A client "out" event can be received when no payload is left
+		 * to send under some circumstances.
+		 *
+		 * Many threads can flush a client's outgoing queue and, if they
+		 * had to queue their message (socket was full), will use the
+		 * "communication update" command to signal the (e)poll thread
+		 * to monitor for space being made available in the socket.
+		 *
+		 * Commands are sent over an internal pipe serviced by the same
+		 * thread as the client sockets.
+		 *
+		 * When space is made available in the socket, there is a race
+		 * between the (e)poll thread and the other threads that may
+		 * wish to use the client's socket to flush its outgoing queue.
+		 *
+		 * A non-(e)poll thread may attempt (and succeed) in flushing
+		 * the queue before the (e)poll thread gets a chance to service
+		 * the client's "out" event.
+		 *
+		 * In this situation, the (e)poll thread processing the client
+		 * out event will see an empty payload: there is nothing to do
+		 * except unsubscribing (e)poll "out" events.
+		 *
+		 * Note that this thread is the (e)poll thread so it can modify
+		 * the (e)poll mask directly without using a communication
+		 * update command. Other threads that flush the outgoing queue
+		 * will use the "communication update" command to wake up this
+		 * thread and force it to monitor "out" events.
+		 *
+		 * When other threads succeed in emptying the outgoing queue,
+		 * they don't need to update the (e)poll mask: if the "out"
+		 * event is monitored, it will fire once and the (e)poll
+		 * thread will reach this condition, causing the event to
+		 * stop being monitored.
+		 */
+		transmission_status = CLIENT_TRANSMISSION_STATUS_COMPLETE;
+	} else {
+		transmission_status = client_flush_outgoing_queue(client);
+	}
 	pthread_mutex_unlock(&client->lock);
 
 	ret = client_handle_transmission_status(
```
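The guard added to the "out" handler can be sketched with a mutex-protected queue. This is a minimal model under assumptions, not the sessiond code: `QueuedClient`, `Status`, `flush_outgoing_queue`, `enqueue_and_flush` and `handle_client_out` are hypothetical, and the `assert` in the flush function only mirrors the kind of non-empty-payload precondition that fired in the reported backtrace.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>

enum class Status { Complete, Queued };

// Hypothetical client: all fields are protected by `lock`.
struct QueuedClient {
    std::mutex lock;
    std::string outgoing;         // bytes still waiting to be sent
    std::size_t socket_space = 0; // bytes the socket can accept right now
};

// Caller holds client.lock. Like the pre-fix flush path, an empty queue
// is treated as a programming error.
Status flush_outgoing_queue(QueuedClient &client)
{
    assert(!client.outgoing.empty());
    const std::size_t n = std::min(client.outgoing.size(), client.socket_space);
    client.outgoing.erase(0, n);
    client.socket_space -= n;
    return client.outgoing.empty() ? Status::Complete : Status::Queued;
}

// Worker thread: enqueue a (non-empty) message, then try to flush it.
Status enqueue_and_flush(QueuedClient &client, const std::string &msg)
{
    std::lock_guard<std::mutex> guard(client.lock);
    client.outgoing += msg;
    return flush_outgoing_queue(client);
}

// (e)poll thread, "out" event: with the fix, an empty queue means another
// thread already sent everything, so report completion instead of flushing.
Status handle_client_out(QueuedClient &client)
{
    std::lock_guard<std::mutex> guard(client.lock);
    if (client.outgoing.empty()) {
        return Status::Complete;
    }
    return flush_outgoing_queue(client);
}
```

Because both paths take `client.lock`, the emptiness check and the flush cannot interleave: whichever thread runs second observes the other's effect, and a worker that wins the race simply leaves the (e)poll thread nothing to do.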

src/bin/lttng-sessiond/notification-thread-internal.hpp

+1
```diff
@@ -165,6 +165,7 @@ struct notification_client {
 		 * clean-up.
 		 */
 		bool active;
+		int current_poll_events;
 		struct {
 			/*
 			 * During the reception of a message, the reception
```
