Commit 733c916
Fix: sessiond: sessiond and agent deadlock on destroy
Observed issue
--------------

While running the out-of-tree java agent tests [1], the session daemon
and agent often end up in a deadlock. Attaching gdb to the session
daemon, we can see that two threads are blocked in an intriguing state.

Thread 13 (Thread 0x7f89027fc700 (LWP 9636)):
#0  0x00007f891e81a4cf in __lll_lock_wait () from /usr/lib/libpthread.so.0
#1  0x00007f891e812e03 in pthread_mutex_lock () from /usr/lib/libpthread.so.0
#2  0x000055637f1fbd92 in session_lock_list () at session.c:156
#3  0x000055637f25dc47 in update_agent_app (app=0x7f88ec003480) at agent-thread.c:56
#4  0x000055637f25ec0a in thread_agent_management (data=0x556380cd2400) at agent-thread.c:426
#5  0x000055637f22fb3a in launch_thread (data=0x556380cd24a0) at thread.c:65
#6  0x00007f891e81046f in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007f891e7203d3 in clone () from /usr/lib/libc.so.6

Thread 8 (Thread 0x7f8919309700 (LWP 9631)):
#0  0x00007f891e81b44d in recvmsg () from /usr/lib/libpthread.so.0
#1  0x000055637f267847 in lttcomm_recvmsg_inet_sock (sock=0x7f88ec0033c0, buf=0x7f89192f5d5c, len=4, flags=0) at inet.c:367
#2  0x000055637f2146c6 in recv_reply (sock=0x7f88ec0033c0, buf=0x7f89192f5d5c, size=4) at agent.c:275
#3  0x000055637f215202 in app_context_op (app=0x7f88ec003400, ctx=0x7f8908020900, cmd=AGENT_CMD_APP_CTX_DISABLE) at agent.c:552
#4  0x000055637f215c2d in disable_context (ctx=0x7f8908020900, domain=LTTNG_DOMAIN_JUL) at agent.c:841
#5  0x000055637f217480 in agent_destroy (agt=0x7f890801dc20) at agent.c:1326
#6  0x000055637f243448 in trace_ust_destroy_session (session=0x7f8908004010) at trace-ust.c:1408
#7  0x000055637f1fd775 in session_release (ref=0x7f8908001e70) at session.c:873
#8  0x000055637f1fb9ac in urcu_ref_put (ref=0x7f8908001e70, release=0x55637f1fd62a <session_release>) at /usr/include/urcu/ref.h:68
#9  0x000055637f1fdad2 in session_put (session=0x7f8908000d10) at session.c:942
#10 0x000055637f2369e6 in process_client_msg (cmd_ctx=0x7f890800e6e0, sock=0x7f8919308560, sock_error=0x7f8919308564) at client.c:2102
#11 0x000055637f2375ab in thread_manage_clients (data=0x556380cd1840) at client.c:2347
#12 0x000055637f22fb3a in launch_thread (data=0x556380cd18b0) at thread.c:65
#13 0x00007f891e81046f in start_thread () from /usr/lib/libpthread.so.0
#14 0x00007f891e7203d3 in clone () from /usr/lib/libc.so.6

T8 is holding the session list lock while the cmd_destroy_session
command is being processed. More specifically, it is attempting to
destroy an "agent_context" by communicating with an "agent"
application. Meanwhile, T13 is still registering that same "agent"
application.

Cause
-----

The deadlock itself is pretty simple to understand.

The "agent thread" (T13) has the responsibility of accepting new agent
application connections. When such a connection occurs, the thread
creates a new `agent_app` instance and sends the current sessions'
configuration (i.e. their event rules and contexts) to the agent
application. When that "update" is complete, a "registration done"
message is sent to the new agent application.

From the stacktrace above, we can see that T13 is attempting to update
the agent application with its initial configuration, but it is blocked
on the acquisition of the session list lock. The application's agent is
also blocked since it is waiting for the "registration done" message
before allowing tracing to proceed (not shown here, but seen in the
test logs). Meanwhile, T8 is holding the session list lock while
destroying a session.
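The wait cycle can be reproduced outside of lttng-tools. Below is a
minimal stand-alone pthread sketch of the same pattern (not
lttng-tools code): the agent application and T13 are collapsed into a
single thread, and a pipe stands in for the agent's TCP socket. T8
blocks on a read whose reply can only be produced by a thread that
needs the very lock T8 holds, so the program hangs just like the
daemon does.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t session_list_lock = PTHREAD_MUTEX_INITIALIZER;
static int agent_pipe[2]; /* stands in for the agent app's TCP socket */

/* Plays T13 (and, implicitly, the agent app): it cannot produce the
 * reply T8 is waiting for until it acquires the session list lock. */
static void *agent_thread(void *arg)
{
	const char reply = 1;

	(void) arg;
	pthread_mutex_lock(&session_list_lock); /* blocks: T8 holds it */
	(void) write(agent_pipe[1], &reply, sizeof(reply));
	pthread_mutex_unlock(&session_list_lock);
	return NULL;
}

int main(void)
{
	pthread_t t13;
	char reply;

	if (pipe(agent_pipe)) {
		return 1;
	}

	/* T8: take the session list lock, as any client command does... */
	pthread_mutex_lock(&session_list_lock);
	pthread_create(&t13, NULL, agent_thread, NULL);
	sleep(1); /* let T13 reach the mutex, for determinism */

	/* ...then block on a reply that only T13 can produce: deadlock. */
	(void) read(agent_pipe[0], &reply, sizeof(reply)); /* never returns */

	pthread_mutex_unlock(&session_list_lock);
	pthread_join(t13, NULL);
	puts("unreachable");
	return 0;
}

(Build with `cc -pthread`.)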
T8 holding the session list lock is expected: all client commands are
executed with this lock held. It is, amongst other reasons, used to
serialize changes to the sessions' configuration and configuration
updates sent to the tracers (i.e. because new apps appear or to keep
existing tracers in sync with the users' session configuration).

The question becomes: why is T8 tearing down an application that is not
yet registered?

First, inspecting `agent_app` immediately shows that this structure has
no built-in synchronization mechanism. Therefore, the fact that two
threads are accessing it at the same time raises a big red flag.

Speculating on the intentions of the original design, my intuition is
that the "agent_management" thread's role is limited to instantiating
an `agent_app` and synchronizing it with the various sessions'
configuration. Once that synchronization is performed, the agent
application should be published and never accessed again by the "agent
thread".

Configuration updates (i.e. new event rules, contexts) are then sent
synchronously as they are requested by a client, in the context of the
client thread. Those updates are performed while holding the session
list lock. Hence, there is only one thread that should manipulate the
agent application at any given time, making an explicit `agent_app`
lock unnecessary.

Overall, this would echo what is done when a 'user space tracer'
application registers to the session daemon (see dispatch.c:368).

Evidently this isn't what is happening here. The agent thread creates
the `agent_app`, publishes it, and then performs an "agent app update"
(sending the configuration) while holding the session list lock. This
means that there is a window where an agent application is visible to
the other threads, yet has not been properly registered.

Solution
--------

The acquisition of the session list lock is moved outside of
update_agent_app() to allow the "agent thread" to hold the session list
lock during the "configuration update" phase of the agent application
registration.

Essentially, the sequence of operations changes from:

- Agent TCP connection established
- call handle_registration()
  - agent version check
  - allocation of agent_app instance
  - the new agent_app is published through the global
    agent_apps_ht_by_sock hashtable
    *** It is now reachable by all other threads without any form of
        exclusivity synchronization. ***
- update_agent_app()
  - acquire session list lock
  - iterate over sessions
    - send configuration
  - release session list lock
- send registration done

to:

- Agent TCP connection established
- call accept_agent_connection()
  - agent version check
- allocation of agent_app instance
- acquire session list lock
- update_agent_app()
  - iterate over sessions
    - send configuration
- send registration done
- the new agent_app is published through the global
  agent_apps_ht_by_sock hashtable
- release session list lock

A condensed sketch of the new sequence is shown below, after the links.

Links
-----

[1] https://github.com/lttng/lttng-ust-java-tests

Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: Ia34c5ad81ed3936acbca756b425423e0cb8dbddf
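For readability, here is the reworked registration path condensed into
a single hypothetical helper (a sketch only: the helper's name and the
trimmed error handling are illustrative, while the functions it calls
are those introduced or reused by the agent-thread.c diff below):

static void register_one_agent_app(struct lttcomm_sock *reg_sock)
{
	struct agent_app_id id;
	struct lttcomm_sock *sock = NULL;
	struct agent_app *app;

	if (accept_agent_connection(reg_sock, &id, &sock) < 0) {
		return; /* errors are already logged */
	}

	/* The agent_app instance takes ownership of the socket. */
	app = agent_create_app(id.pid, id.domain, sock);
	if (!app) {
		sock->ops->close(sock);
		return;
	}

	/*
	 * Hold the session list lock across the whole update so that no
	 * client command can observe, or tear down, a half-registered
	 * application.
	 */
	session_lock_list();
	update_agent_app(app);             /* send event rules and contexts */
	agent_send_registration_done(app); /* agent may now start tracing */
	agent_add_app(app);                /* publish only once fully registered */
	session_unlock_list();
}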
1 parent 85f5bd8 commit 733c916

File tree

4 files changed (+164, -88 lines)

src/bin/lttng-sessiond/agent-thread.c (+144, -75)

@@ -28,6 +28,15 @@ struct thread_notifiers {
 	sem_t ready;
 };
 
+struct agent_app_id {
+	pid_t pid;
+	enum lttng_domain_type domain;
+};
+
+struct agent_protocol_version {
+	unsigned int major, minor;
+};
+
 static int agent_tracing_enabled = -1;
 
 /*
@@ -42,38 +51,37 @@ static const char *default_reg_uri =
  * Update agent application using the given socket. This is done just after
  * registration was successful.
  *
- * This is a quite heavy call in terms of locking since the session list lock
- * AND session lock are acquired.
+ * This will acquire the various sessions' lock; none must be held by the
+ * caller.
+ * The caller must hold the session list lock.
  */
-static void update_agent_app(struct agent_app *app)
+static void update_agent_app(const struct agent_app *app)
 {
 	struct ltt_session *session, *stmp;
 	struct ltt_session_list *list;
 
 	list = session_get_list();
 	assert(list);
 
-	session_lock_list();
 	cds_list_for_each_entry_safe(session, stmp, &list->head, list) {
 		if (!session_get(session)) {
 			continue;
 		}
 
 		session_lock(session);
 		if (session->ust_session) {
-			struct agent *agt;
+			const struct agent *agt;
 
 			rcu_read_lock();
 			agt = trace_ust_find_agent(session->ust_session, app->domain);
 			if (agt) {
-				agent_update(agt, app->sock->fd);
+				agent_update(agt, app);
 			}
 			rcu_read_unlock();
 		}
 		session_unlock(session);
 		session_put(session);
 	}
-	session_unlock_list();
 }
 
 /*
@@ -188,23 +196,56 @@ static void destroy_tcp_socket(struct lttcomm_sock *sock)
 	lttcomm_destroy_sock(sock);
 }
 
+static const char *domain_type_str(enum lttng_domain_type domain_type)
+{
+	switch (domain_type) {
+	case LTTNG_DOMAIN_NONE:
+		return "none";
+	case LTTNG_DOMAIN_KERNEL:
+		return "kernel";
+	case LTTNG_DOMAIN_UST:
+		return "ust";
+	case LTTNG_DOMAIN_JUL:
+		return "jul";
+	case LTTNG_DOMAIN_LOG4J:
+		return "log4j";
+	case LTTNG_DOMAIN_PYTHON:
+		return "python";
+	default:
+		return "unknown";
+	}
+}
+
+static bool is_agent_protocol_version_supported(
+		const struct agent_protocol_version *version)
+{
+	const bool is_supported = version->major == AGENT_MAJOR_VERSION &&
+			version->minor == AGENT_MINOR_VERSION;
+
+	if (!is_supported) {
+		WARN("Refusing agent connection: unsupported protocol version %ui.%ui, expected %i.%i",
+				version->major, version->minor,
+				AGENT_MAJOR_VERSION, AGENT_MINOR_VERSION);
+	}
+
+	return is_supported;
+}
+
 /*
- * Handle a new agent registration using the reg socket. After that, a new
- * agent application is added to the global hash table and attach to an UST app
- * object. If r_app is not NULL, the created app is set to the pointer.
+ * Handle a new agent connection on the registration socket.
  *
- * Return the new FD created upon accept() on success or else a negative errno
- * value.
+ * Returns 0 on success, or else a negative errno value.
+ * On success, the resulting socket is returned through `agent_app_socket`
+ * and the application's reported id is updated through `agent_app_id`.
  */
-static int handle_registration(struct lttcomm_sock *reg_sock,
-		struct agent_app **r_app)
+static int accept_agent_connection(
+		struct lttcomm_sock *reg_sock,
+		struct agent_app_id *agent_app_id,
+		struct lttcomm_sock **agent_app_socket)
 {
 	int ret;
-	pid_t pid;
-	uint32_t major_version, minor_version;
+	struct agent_protocol_version agent_version;
 	ssize_t size;
-	enum lttng_domain_type domain;
-	struct agent_app *app;
 	struct agent_register_msg msg;
 	struct lttcomm_sock *new_sock;
 
@@ -213,60 +254,52 @@ static int handle_registration(struct lttcomm_sock *reg_sock,
 	new_sock = reg_sock->ops->accept(reg_sock);
 	if (!new_sock) {
 		ret = -ENOTCONN;
-		goto error;
+		goto end;
 	}
 
 	size = new_sock->ops->recvmsg(new_sock, &msg, sizeof(msg), 0);
 	if (size < sizeof(msg)) {
+		if (size < 0) {
+			PERROR("Failed to register new agent application");
+		} else if (size != 0) {
+			ERR("Failed to register new agent application: invalid registration message length: expected length = %zu, message length = %zd",
+					sizeof(msg), size);
+		} else {
+			DBG("Failed to register new agent application: connection closed");
+		}
 		ret = -EINVAL;
-		goto error_socket;
-	}
-	domain = be32toh(msg.domain);
-	pid = be32toh(msg.pid);
-	major_version = be32toh(msg.major_version);
-	minor_version = be32toh(msg.minor_version);
-
-	/* Test communication protocol version of the registring agent. */
-	if (major_version != AGENT_MAJOR_VERSION) {
-		ret = -EINVAL;
-		goto error_socket;
-	}
-	if (minor_version != AGENT_MINOR_VERSION) {
-		ret = -EINVAL;
-		goto error_socket;
+		goto error_close_socket;
 	}
 
-	DBG2("[agent-thread] New registration for pid %d domain %d on socket %d",
-			pid, domain, new_sock->fd);
+	agent_version = (struct agent_protocol_version) {
+		be32toh(msg.major_version),
+		be32toh(msg.minor_version),
+	};
 
-	app = agent_create_app(pid, domain, new_sock);
-	if (!app) {
-		ret = -ENOMEM;
-		goto error_socket;
+	/* Test communication protocol version of the registering agent. */
+	if (!is_agent_protocol_version_supported(&agent_version)) {
+		ret = -EINVAL;
+		goto error_close_socket;
 	}
 
-	/*
-	 * Add before assigning the socket value to the UST app so it can be found
-	 * concurrently.
-	 */
-	agent_add_app(app);
-
-	/*
-	 * We don't need to attach the agent app to the app. If we ever do so, we
-	 * should consider both registration order of agent before app and app
-	 * before agent.
-	 */
+	*agent_app_id = (struct agent_app_id) {
+		.domain = (enum lttng_domain_type) be32toh(msg.domain),
+		.pid = (pid_t) be32toh(msg.pid),
+	};
 
-	if (r_app) {
-		*r_app = app;
-	}
+	DBG2("New registration for agent application: pid = %ld, domain = %s, socket fd = %d",
+			(long) agent_app_id->pid,
+			domain_type_str(agent_app_id->domain), new_sock->fd);
 
-	return new_sock->fd;
+	*agent_app_socket = new_sock;
+	new_sock = NULL;
+	ret = 0;
+	goto end;
 
-error_socket:
+error_close_socket:
 	new_sock->ops->close(new_sock);
 	lttcomm_destroy_sock(new_sock);
-error:
+end:
 	return ret;
 }
 
@@ -362,7 +395,7 @@ static void *thread_agent_management(void *data)
 	uatomic_set(&agent_tracing_enabled, 1);
 	mark_thread_as_ready(notifiers);
 
-	/* Add TCP socket to poll set. */
+	/* Add TCP socket to the poll set. */
 	ret = lttng_poll_add(&events, reg_sock->fd,
 			LPOLLIN | LPOLLERR | LPOLLHUP | LPOLLRDHUP);
 	if (ret < 0) {
@@ -399,43 +432,79 @@ static void *thread_agent_management(void *data)
 				goto exit;
 			}
 
+			/* Activity on the registration socket. */
 			if (revents & LPOLLIN) {
-				int new_fd;
-				struct agent_app *app = NULL;
+				struct agent_app_id new_app_id;
+				struct agent_app *new_app = NULL;
+				struct lttcomm_sock *new_app_socket;
+				int new_app_socket_fd;
 
 				assert(pollfd == reg_sock->fd);
-				new_fd = handle_registration(reg_sock, &app);
-				if (new_fd < 0) {
+
+				ret = accept_agent_connection(
+						reg_sock, &new_app_id, &new_app_socket);
+				if (ret < 0) {
+					/* Errors are already logged. */
 					continue;
 				}
-				/* Should not have a NULL app on success. */
-				assert(app);
 
 				/*
-				 * Since this is a command socket (write then read),
-				 * only add poll error event to only detect shutdown.
+				 * new_app_socket's ownership has been
+				 * transferred to the new agent app.
 				 */
-				ret = lttng_poll_add(&events, new_fd,
+				new_app = agent_create_app(new_app_id.pid,
+						new_app_id.domain,
+						new_app_socket);
+				if (!new_app) {
+					new_app_socket->ops->close(
+							new_app_socket);
+					continue;
+				}
+				new_app_socket_fd = new_app_socket->fd;
+				new_app_socket = NULL;
+
+				/*
+				 * Since this is a command socket (write then
+				 * read), only add poll error event to only
+				 * detect shutdown.
+				 */
+				ret = lttng_poll_add(&events, new_app_socket_fd,
						LPOLLERR | LPOLLHUP | LPOLLRDHUP);
 				if (ret < 0) {
-					agent_destroy_app_by_sock(new_fd);
+					agent_destroy_app(new_app);
 					continue;
 				}
 
-				/* Update newly registered app. */
-				update_agent_app(app);
+				/*
+				 * Prevent sessions from being modified while
+				 * the agent application's configuration is
+				 * updated.
+				 */
+				session_lock_list();
+
+				/*
+				 * Update the newly registered applications's
+				 * configuration.
+				 */
+				update_agent_app(new_app);
 
-				/* On failure, the poll will detect it and clean it up. */
-				ret = agent_send_registration_done(app);
+				ret = agent_send_registration_done(new_app);
 				if (ret < 0) {
-					/* Removing from the poll set */
-					ret = lttng_poll_del(&events, new_fd);
+					agent_destroy_app(new_app);
+					/* Removing from the poll set. */
+					ret = lttng_poll_del(&events,
+							new_app_socket_fd);
 					if (ret < 0) {
+						session_unlock_list();
 						goto error;
 					}
-					agent_destroy_app_by_sock(new_fd);
 					continue;
 				}
+
+				/* Publish the new agent app. */
+				agent_add_app(new_app);
+
+				session_unlock_list();
 			} else if (revents & (LPOLLERR | LPOLLHUP | LPOLLRDHUP)) {
 				/* Removing from the poll set */
 				ret = lttng_poll_del(&events, pollfd);
src/bin/lttng-sessiond/agent.c (+8, -10)

@@ -380,7 +380,7 @@ static ssize_t list_events(struct agent_app *app, struct lttng_event **events)
  *
  * Return LTTNG_OK on success or else a LTTNG_ERR* code.
  */
-static int enable_event(struct agent_app *app, struct agent_event *event)
+static int enable_event(const struct agent_app *app, struct agent_event *event)
 {
 	int ret;
 	char *bytes_to_send;
@@ -495,8 +495,8 @@ int send_pstring(struct lttcomm_sock *sock, const char *str, uint32_t len)
  *
 * Return LTTNG_OK on success or else a LTTNG_ERR* code.
 */
-static int app_context_op(struct agent_app *app,
-		struct agent_app_ctx *ctx, enum lttcomm_agent_command cmd)
+static int app_context_op(const struct agent_app *app,
+		const struct agent_app_ctx *ctx, enum lttcomm_agent_command cmd)
 {
 	int ret;
 	uint32_t reply_ret_code;
@@ -946,7 +946,7 @@ struct agent_app *agent_create_app(pid_t pid, enum lttng_domain_type domain,
 
 	app = zmalloc(sizeof(*app));
 	if (!app) {
-		PERROR("zmalloc agent create");
+		PERROR("Failed to allocate agent application instance");
 		goto error;
 	}
 
@@ -1402,26 +1402,24 @@ void agent_app_ht_clean(void)
  * Note that this function is most likely to be used with a tracing session
  * thus the caller should make sure to hold the appropriate lock(s).
  */
-void agent_update(struct agent *agt, int sock)
+void agent_update(const struct agent *agt, const struct agent_app *app)
 {
 	int ret;
-	struct agent_app *app;
 	struct agent_event *event;
 	struct lttng_ht_iter iter;
 	struct agent_app_ctx *ctx;
 
 	assert(agt);
-	assert(sock >= 0);
+	assert(app);
 
-	DBG("Agent updating app socket %d", sock);
+	DBG("Agent updating app: pid = %ld", (long) app->pid);
 
 	rcu_read_lock();
-	app = agent_find_app_by_sock(sock);
 	/*
 	 * We are in the registration path thus if the application is gone,
 	 * there is a serious code flow error.
 	 */
-	assert(app);
+
 	cds_lfht_for_each_entry(agt->events->ht, &iter.iter, event, node.node) {
 		/* Skip event if disabled. */
 		if (!event->enabled) {

src/bin/lttng-sessiond/agent.h (+1, -1)

@@ -163,7 +163,7 @@ int agent_enable_event(struct agent_event *event,
 		enum lttng_domain_type domain);
 int agent_disable_event(struct agent_event *event,
 		enum lttng_domain_type domain);
-void agent_update(struct agent *agt, int sock);
+void agent_update(const struct agent *agt, const struct agent_app *app);
 int agent_list_events(struct lttng_event **events,
 		enum lttng_domain_type domain);
 
