Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: update regression tests involving UST #11

Closed
wants to merge 1 commit into from
Closed

Fix: update regression tests involving UST #11

wants to merge 1 commit into from

Conversation

abusque
Copy link
Contributor

@abusque abusque commented May 27, 2015

This patch fixes previously failing regression tests which involve
either the UST statedump or helper events, following a refactor and
rename.

Signed-off-by: Antoine Busque [email protected]

This patch fixes previously failing regression tests which involve
either the UST statedump or helper events, following a refactor and
rename.

Signed-off-by: Antoine Busque <[email protected]>
@jgalar
Copy link
Member

jgalar commented May 27, 2015

Merged, thanks!

@jgalar jgalar closed this May 27, 2015
@abusque abusque deleted the fix-regression-tests branch May 28, 2015 17:38
jgalar added a commit to jgalar/lttng-tools that referenced this pull request Apr 22, 2020
Observed issue
--------------

While running the out-of-tree java agent tests [1], the session daemon
and agent often end up in a deadlock.

Attaching gdb to the session daemon, we can see that two threads are
blocked in an intriguing state.

Thread 13 (Thread 0x7f89027fc700 (LWP 9636)):
 #0  0x00007f891e81a4cf in __lll_lock_wait () from /usr/lib/libpthread.so.0
 #1  0x00007f891e812e03 in pthread_mutex_lock () from /usr/lib/libpthread.so.0
 #2  0x000055637f1fbd92 in session_lock_list () at session.c:156
 #3  0x000055637f25dc47 in update_agent_app (app=0x7f88ec003480) at agent-thread.c:56
 #4  0x000055637f25ec0a in thread_agent_management (data=0x556380cd2400) at agent-thread.c:426
 #5  0x000055637f22fb3a in launch_thread (data=0x556380cd24a0) at thread.c:65
 #6  0x00007f891e81046f in start_thread () from /usr/lib/libpthread.so.0
 #7  0x00007f891e7203d3 in clone () from /usr/lib/libc.so.6

Thread 8 (Thread 0x7f8919309700 (LWP 9631)):
 #0  0x00007f891e81b44d in recvmsg () from /usr/lib/libpthread.so.0
 #1  0x000055637f267847 in lttcomm_recvmsg_inet_sock (sock=0x7f88ec0033c0, buf=0x7f89192f5d5c, len=4, flags=0) at inet.c:367
 #2  0x000055637f2146c6 in recv_reply (sock=0x7f88ec0033c0, buf=0x7f89192f5d5c, size=4) at agent.c:275
 #3  0x000055637f215202 in app_context_op (app=0x7f88ec003400, ctx=0x7f8908020900, cmd=AGENT_CMD_APP_CTX_DISABLE) at agent.c:552
 #4  0x000055637f215c2d in disable_context (ctx=0x7f8908020900, domain=LTTNG_DOMAIN_JUL) at agent.c:841
 #5  0x000055637f217480 in agent_destroy (agt=0x7f890801dc20) at agent.c:1326
 #6  0x000055637f243448 in trace_ust_destroy_session (session=0x7f8908004010) at trace-ust.c:1408
 #7  0x000055637f1fd775 in session_release (ref=0x7f8908001e70) at session.c:873
 #8  0x000055637f1fb9ac in urcu_ref_put (ref=0x7f8908001e70, release=0x55637f1fd62a <session_release>) at /usr/include/urcu/ref.h:68
 lttng#9  0x000055637f1fdad2 in session_put (session=0x7f8908000d10) at session.c:942
 lttng#10 0x000055637f2369e6 in process_client_msg (cmd_ctx=0x7f890800e6e0, sock=0x7f8919308560, sock_error=0x7f8919308564) at client.c:2102
 lttng#11 0x000055637f2375ab in thread_manage_clients (data=0x556380cd1840) at client.c:2347
 lttng#12 0x000055637f22fb3a in launch_thread (data=0x556380cd18b0) at thread.c:65
 lttng#13 0x00007f891e81046f in start_thread () from /usr/lib/libpthread.so.0
 lttng#14 0x00007f891e7203d3 in clone () from /usr/lib/libc.so.6

T8 is holding session list lock while the cmd_destroy_session
command is being processed. More specifically, it is attempting
to destroy an "agent_context" by communicating with an "agent"
application.

Meanwhile, T13 is still registering that same "agent" application.

Cause
-----

The deadlock itself is pretty simple to understand.

The "agent thread" (T13) has the responsability of accepting new agent
application connections. When such a connection occurs, the thread
creates a new `agent_app` instance and sends the current sessions'
configuration (i.e. their event rules and contexts) to the agent
application. When that "update" is complete, a "registration done"
message is sent to the new agent application.

From the stacktrace above, we can see that T13 is attempting to update
the agent application with its initial configuration, but it is
blocked on the acquisition of the session list lock. The application's
agent is also blocked since it is waiting for the "registration done"
message before allowing tracing to proceed (not shown here, but seen
in the test logs).

Meanwhile, T8 is holding the session list lock while destroying a
session. This is expected as all client commands are executed with
this lock held. It is, amongst other reasons, used to serialize
changes to the sessions' configuration and configuration updates sent
to the tracers (i.e. because new apps appear or to keep existing
tracers in sync with the users' session configuration).

The question becomes: why is T8 tearing down an application that is
not yet registered?

First, inspecting `agent_app` immediately shows that this structure
has no built-in synchronization mechanism. Therefore, the fact that
two threads are accessing it at the same time raises a big red flag.

Speculating on the intentions of the original design, my intuition is
that the "agent_management" thread's role is limited to instantiating
an `agent_app` and synchronizing it with the various sessions'
configuration. Once that synchronization is performed, the agent
application should be published and never accessed again by the "agent
thread".

Configuration updates (i.e. new event rules, contexts) are then sent
synchronously as they are requested by a client in the context of the
client thread. Those updates are performed while holding the session
list lock.

Hence, there is only one thread that should manipulate the agent
application at any given time making an explicit `agent_app` lock
unnecessary.

Overall, this would echo what is done when a 'user space tracer'
application registers to the session daemon (see dispatch.c:368).

Evidently this isn't what is happening here.

The agent thread creates the `agent_app`, publishes it, and then
performs an "agent app update" (sending the configuration) while
holding the session list lock. This means that there is a window where
an agent application is visible to the other threads, yet has not been
properly registered.

Solution
--------

The acquisition of the session list lock is moved outside of
update_agent_app() to allow the "agent thread" to hold the session
list lock during the "configuration update" phase of the agent
application registration.

Essentially, the sequence of operation changes from:

- Agent tcp connection established
- call handle_registration()
  - agent version check
  - allocation of agent_app instance
  - new agent_add is published through the global agent_apps_ht_by_sock
    hashtable
    ***
    it is now reachable by all other threads without any form of
    exclusivity synchronization.
    ***
- update_agent_app
  - acquire session list lock
  - iterate over sessions
    - send configuration
  - release session list lock
- send registration done

to:

- Agent tcp connection established
- call accept_agent_registration()
  - agent version check
- allocation of agent_app instance
- acquire session list lock
- update_agent_app
  - iterate over sessions
    - send configuration
- send registration done
- new agent_add is published through the global agent_apps_ht_by_sock
  hashtable
- release session list lock

Links
-----

[1] https://github.com/lttng/lttng-ust-java-tests

Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: Ia34c5ad81ed3936acbca756b425423e0cb8dbddf
jgalar added a commit that referenced this pull request Apr 22, 2020
Observed issue
--------------

While running the out-of-tree java agent tests [1], the session daemon
and agent often end up in a deadlock.

Attaching gdb to the session daemon, we can see that two threads are
blocked in an intriguing state.

Thread 13 (Thread 0x7f89027fc700 (LWP 9636)):
 #0  0x00007f891e81a4cf in __lll_lock_wait () from /usr/lib/libpthread.so.0
 #1  0x00007f891e812e03 in pthread_mutex_lock () from /usr/lib/libpthread.so.0
 #2  0x000055637f1fbd92 in session_lock_list () at session.c:156
 #3  0x000055637f25dc47 in update_agent_app (app=0x7f88ec003480) at agent-thread.c:56
 #4  0x000055637f25ec0a in thread_agent_management (data=0x556380cd2400) at agent-thread.c:426
 #5  0x000055637f22fb3a in launch_thread (data=0x556380cd24a0) at thread.c:65
 #6  0x00007f891e81046f in start_thread () from /usr/lib/libpthread.so.0
 #7  0x00007f891e7203d3 in clone () from /usr/lib/libc.so.6

Thread 8 (Thread 0x7f8919309700 (LWP 9631)):
 #0  0x00007f891e81b44d in recvmsg () from /usr/lib/libpthread.so.0
 #1  0x000055637f267847 in lttcomm_recvmsg_inet_sock (sock=0x7f88ec0033c0, buf=0x7f89192f5d5c, len=4, flags=0) at inet.c:367
 #2  0x000055637f2146c6 in recv_reply (sock=0x7f88ec0033c0, buf=0x7f89192f5d5c, size=4) at agent.c:275
 #3  0x000055637f215202 in app_context_op (app=0x7f88ec003400, ctx=0x7f8908020900, cmd=AGENT_CMD_APP_CTX_DISABLE) at agent.c:552
 #4  0x000055637f215c2d in disable_context (ctx=0x7f8908020900, domain=LTTNG_DOMAIN_JUL) at agent.c:841
 #5  0x000055637f217480 in agent_destroy (agt=0x7f890801dc20) at agent.c:1326
 #6  0x000055637f243448 in trace_ust_destroy_session (session=0x7f8908004010) at trace-ust.c:1408
 #7  0x000055637f1fd775 in session_release (ref=0x7f8908001e70) at session.c:873
 #8  0x000055637f1fb9ac in urcu_ref_put (ref=0x7f8908001e70, release=0x55637f1fd62a <session_release>) at /usr/include/urcu/ref.h:68
 #9  0x000055637f1fdad2 in session_put (session=0x7f8908000d10) at session.c:942
 #10 0x000055637f2369e6 in process_client_msg (cmd_ctx=0x7f890800e6e0, sock=0x7f8919308560, sock_error=0x7f8919308564) at client.c:2102
 #11 0x000055637f2375ab in thread_manage_clients (data=0x556380cd1840) at client.c:2347
 #12 0x000055637f22fb3a in launch_thread (data=0x556380cd18b0) at thread.c:65
 #13 0x00007f891e81046f in start_thread () from /usr/lib/libpthread.so.0
 #14 0x00007f891e7203d3 in clone () from /usr/lib/libc.so.6

T8 is holding session list lock while the cmd_destroy_session
command is being processed. More specifically, it is attempting
to destroy an "agent_context" by communicating with an "agent"
application.

Meanwhile, T13 is still registering that same "agent" application.

Cause
-----

The deadlock itself is pretty simple to understand.

The "agent thread" (T13) has the responsability of accepting new agent
application connections. When such a connection occurs, the thread
creates a new `agent_app` instance and sends the current sessions'
configuration (i.e. their event rules and contexts) to the agent
application. When that "update" is complete, a "registration done"
message is sent to the new agent application.

From the stacktrace above, we can see that T13 is attempting to update
the agent application with its initial configuration, but it is
blocked on the acquisition of the session list lock. The application's
agent is also blocked since it is waiting for the "registration done"
message before allowing tracing to proceed (not shown here, but seen
in the test logs).

Meanwhile, T8 is holding the session list lock while destroying a
session. This is expected as all client commands are executed with
this lock held. It is, amongst other reasons, used to serialize
changes to the sessions' configuration and configuration updates sent
to the tracers (i.e. because new apps appear or to keep existing
tracers in sync with the users' session configuration).

The question becomes: why is T8 tearing down an application that is
not yet registered?

First, inspecting `agent_app` immediately shows that this structure
has no built-in synchronization mechanism. Therefore, the fact that
two threads are accessing it at the same time raises a big red flag.

Speculating on the intentions of the original design, my intuition is
that the "agent_management" thread's role is limited to instantiating
an `agent_app` and synchronizing it with the various sessions'
configuration. Once that synchronization is performed, the agent
application should be published and never accessed again by the "agent
thread".

Configuration updates (i.e. new event rules, contexts) are then sent
synchronously as they are requested by a client in the context of the
client thread. Those updates are performed while holding the session
list lock.

Hence, there is only one thread that should manipulate the agent
application at any given time making an explicit `agent_app` lock
unnecessary.

Overall, this would echo what is done when a 'user space tracer'
application registers to the session daemon (see dispatch.c:368).

Evidently this isn't what is happening here.

The agent thread creates the `agent_app`, publishes it, and then
performs an "agent app update" (sending the configuration) while
holding the session list lock. This means that there is a window where
an agent application is visible to the other threads, yet has not been
properly registered.

Solution
--------

The acquisition of the session list lock is moved outside of
update_agent_app() to allow the "agent thread" to hold the session
list lock during the "configuration update" phase of the agent
application registration.

Essentially, the sequence of operation changes from:

- Agent tcp connection established
- call handle_registration()
  - agent version check
  - allocation of agent_app instance
  - new agent_add is published through the global agent_apps_ht_by_sock
    hashtable
    ***
    it is now reachable by all other threads without any form of
    exclusivity synchronization.
    ***
- update_agent_app
  - acquire session list lock
  - iterate over sessions
    - send configuration
  - release session list lock
- send registration done

to:

- Agent tcp connection established
- call accept_agent_registration()
  - agent version check
- allocation of agent_app instance
- acquire session list lock
- update_agent_app
  - iterate over sessions
    - send configuration
- send registration done
- new agent_add is published through the global agent_apps_ht_by_sock
  hashtable
- release session list lock

Links
-----

[1] https://github.com/lttng/lttng-ust-java-tests

Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: Ia34c5ad81ed3936acbca756b425423e0cb8dbddf
jgalar pushed a commit that referenced this pull request May 19, 2020
Observed issue
==============
Core dump:

 #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
 #1  0x0000003eb4025548 in __GI_abort () at abort.c:79
 #2  0x0000003eb402542f in __assert_fail_base (fmt=0x3eb4184ae0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x4cdee0 "(trace_chunk->timestamp_close).is_set",
     file=0x4cde78 "../../../lttng-tools-2.11.3/src/common/trace-chunk.c", line=903, function=0x4cf4a0 <__PRETTY_FUNCTION__.6756> "lttng_trace_chunk_move_to_completed")
     at assert.c:92
 #3  0x0000003eb4033af2 in __GI___assert_fail (assertion=assertion@entry=0x4cdee0 "(trace_chunk->timestamp_close).is_set",
     file=file@entry=0x4cde78 "../../../lttng-tools-2.11.3/src/common/trace-chunk.c", line=line@entry=903,
     function=function@entry=0x4cf4a0 <__PRETTY_FUNCTION__.6756> "lttng_trace_chunk_move_to_completed") at assert.c:101
 #4  0x000000000047f37e in lttng_trace_chunk_move_to_completed (trace_chunk=0x7fcb5c00e570) at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:903
 #5  0x0000000000480755 in lttng_trace_chunk_release (ref=0x7fcb5c00e598) at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:1117
 #6  urcu_ref_put (release=<optimized out>, ref=0x7fcb5c00e598) at /usr/include/urcu/ref.h:68
 #7  lttng_trace_chunk_put (chunk=0x7fcb5c00e570) at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:1150
 #8  0x0000000000429c22 in cmd_rotate_session (session=0x7fcb5c003ff0, rotate_return=rotate_return@entry=0x7fcb6b7ed470, quiet_rotation=quiet_rotation@entry=false)
     at ../../../../lttng-tools-2.11.3/src/bin/lttng-sessiond/cmd.c:5037
 #9  0x00000000004451d7 in process_client_msg (cmd_ctx=0x7fcb5c00e760, sock=sock@entry=0x7fcb6b7fd4c0, sock_error=sock_error@entry=0x7fcb6b7fd4c4)
     at ../../../../lttng-tools-2.11.3/src/bin/lttng-sessiond/client.c:1852
 #10 0x00000000004474c6 in thread_manage_clients (data=<optimized out>) at ../../../../lttng-tools-2.11.3/src/bin/lttng-sessiond/client.c:2199
 #11 0x00000000004422f2 in launch_thread (data=0x4f97a0) at ../../../../lttng-tools-2.11.3/src/bin/lttng-sessiond/thread.c:75
 #12 0x0000003eb4408ed4 in start_thread (arg=<optimized out>) at pthread_create.c:479
 #13 0x0000003eb40f8e6f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Reproduction:

 Disable ntp/any time management mechanism.

 lttng create
 lttng enable-event -u 'lttng_ust_tracef:*'
 lttng start
 lttng rotate
 date --set="$(date --date='-1 hour')"
 lttng rotate auto-20200515-142503
    Waiting for rotation to complete
    Error: Failed to query the state of the rotation.

Logs:
 DEBUG1 - 12:25:28.570037987 [2660/2717]: Setting trace chunk close command to "move to completed chunk folder" (in lttng_trace_chunk_set_close_command() at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:1073)
 Error: Failed to set trace chunk close timestamp: close timestamp is before creation timestamp
 Error: Failed to set the close timestamp of the current trace chunk of session "auto-20200515-142503"
 lttng-sessiond: ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:903: lttng_trace_chunk_move_to_completed: Assertion `(trace_chunk->timestamp_close).is_set' failed.

 ...

 Aborted (core dumped)
 root@X10SDV-8C-TLN4F:~# DEBUG1 - 12:25:29.534263017 [2739/2739]: Releasing trace chunk registry to all trace chunks (in lttng_trace_chunk_registry_put_each_chunk() at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:1414)
 DEBUG1 - 12:25:29.534317468 [2739/2739]: Releasing reference to trace chunk: session_id = 0chunk_id = 2, name = "20200515T122528+0000-2", status = closed (in lttng_trace_chunk_registry_put_each_chunk() at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:1435)
 DEBUG1 - 12:25:29.534365653 [2739/2739]: Releasing reference to trace chunk: session_id = 0chunk_id = 1, name = "20200515T142520+0000-1", status = closed (in lttng_trace_chunk_registry_put_each_chunk() at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:1435)
 DEBUG1 - 12:25:29.534400638 [2739/2739]: Released reference to 2 trace chunks in lttng_trace_chunk_registry_put_each_chunk() (in lttng_trace_chunk_registry_put_each_chunk() at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:1447)
 Error: 2 trace chunks are leaked by lttng-consumerd. This can be caused by an internal error of the session daemon.

Cause
=====
The trace_chunk->timestamp_close is not set since the result from time()
is smaller than the creation timestamp.

The close timestamp is smaller because the calendar system time is
modified by an administrator.

time() offers no monotonicity guarantee and hence is exposed to time
modification of the system.

The begin and close timestamps are strictly used in the name generation
of the chunk/archives. Given the current usage of these timestamps
validating monotonicity should not be a fatal error. Name uniqueness is
provided by the chunk name suffix (auto increment).

Solution
========
Do not enforce monotonicity for the begin and close timestamps but warn
on unexpected return (begin > close).

Known drawbacks
=========
None.

Signed-off-by: Jonathan Rajotte <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: Ic4b17285d150358d1569d6821c451c243e64e9a1
jgalar pushed a commit that referenced this pull request May 19, 2020
Observed issue
==============
Core dump:

 #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
 #1  0x0000003eb4025548 in __GI_abort () at abort.c:79
 #2  0x0000003eb402542f in __assert_fail_base (fmt=0x3eb4184ae0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x4cdee0 "(trace_chunk->timestamp_close).is_set",
     file=0x4cde78 "../../../lttng-tools-2.11.3/src/common/trace-chunk.c", line=903, function=0x4cf4a0 <__PRETTY_FUNCTION__.6756> "lttng_trace_chunk_move_to_completed")
     at assert.c:92
 #3  0x0000003eb4033af2 in __GI___assert_fail (assertion=assertion@entry=0x4cdee0 "(trace_chunk->timestamp_close).is_set",
     file=file@entry=0x4cde78 "../../../lttng-tools-2.11.3/src/common/trace-chunk.c", line=line@entry=903,
     function=function@entry=0x4cf4a0 <__PRETTY_FUNCTION__.6756> "lttng_trace_chunk_move_to_completed") at assert.c:101
 #4  0x000000000047f37e in lttng_trace_chunk_move_to_completed (trace_chunk=0x7fcb5c00e570) at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:903
 #5  0x0000000000480755 in lttng_trace_chunk_release (ref=0x7fcb5c00e598) at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:1117
 #6  urcu_ref_put (release=<optimized out>, ref=0x7fcb5c00e598) at /usr/include/urcu/ref.h:68
 #7  lttng_trace_chunk_put (chunk=0x7fcb5c00e570) at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:1150
 #8  0x0000000000429c22 in cmd_rotate_session (session=0x7fcb5c003ff0, rotate_return=rotate_return@entry=0x7fcb6b7ed470, quiet_rotation=quiet_rotation@entry=false)
     at ../../../../lttng-tools-2.11.3/src/bin/lttng-sessiond/cmd.c:5037
 #9  0x00000000004451d7 in process_client_msg (cmd_ctx=0x7fcb5c00e760, sock=sock@entry=0x7fcb6b7fd4c0, sock_error=sock_error@entry=0x7fcb6b7fd4c4)
     at ../../../../lttng-tools-2.11.3/src/bin/lttng-sessiond/client.c:1852
 #10 0x00000000004474c6 in thread_manage_clients (data=<optimized out>) at ../../../../lttng-tools-2.11.3/src/bin/lttng-sessiond/client.c:2199
 #11 0x00000000004422f2 in launch_thread (data=0x4f97a0) at ../../../../lttng-tools-2.11.3/src/bin/lttng-sessiond/thread.c:75
 #12 0x0000003eb4408ed4 in start_thread (arg=<optimized out>) at pthread_create.c:479
 #13 0x0000003eb40f8e6f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Reproduction:

 Disable ntp/any time management mechanism.

 lttng create
 lttng enable-event -u 'lttng_ust_tracef:*'
 lttng start
 lttng rotate
 date --set="$(date --date='-1 hour')"
 lttng rotate auto-20200515-142503
    Waiting for rotation to complete
    Error: Failed to query the state of the rotation.

Logs:
 DEBUG1 - 12:25:28.570037987 [2660/2717]: Setting trace chunk close command to "move to completed chunk folder" (in lttng_trace_chunk_set_close_command() at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:1073)
 Error: Failed to set trace chunk close timestamp: close timestamp is before creation timestamp
 Error: Failed to set the close timestamp of the current trace chunk of session "auto-20200515-142503"
 lttng-sessiond: ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:903: lttng_trace_chunk_move_to_completed: Assertion `(trace_chunk->timestamp_close).is_set' failed.

 ...

 Aborted (core dumped)
 root@X10SDV-8C-TLN4F:~# DEBUG1 - 12:25:29.534263017 [2739/2739]: Releasing trace chunk registry to all trace chunks (in lttng_trace_chunk_registry_put_each_chunk() at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:1414)
 DEBUG1 - 12:25:29.534317468 [2739/2739]: Releasing reference to trace chunk: session_id = 0chunk_id = 2, name = "20200515T122528+0000-2", status = closed (in lttng_trace_chunk_registry_put_each_chunk() at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:1435)
 DEBUG1 - 12:25:29.534365653 [2739/2739]: Releasing reference to trace chunk: session_id = 0chunk_id = 1, name = "20200515T142520+0000-1", status = closed (in lttng_trace_chunk_registry_put_each_chunk() at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:1435)
 DEBUG1 - 12:25:29.534400638 [2739/2739]: Released reference to 2 trace chunks in lttng_trace_chunk_registry_put_each_chunk() (in lttng_trace_chunk_registry_put_each_chunk() at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:1447)
 Error: 2 trace chunks are leaked by lttng-consumerd. This can be caused by an internal error of the session daemon.

Cause
=====
The trace_chunk->timestamp_close is not set since the result from time()
is smaller than the creation timestamp.

The close timestamp is smaller because the calendar system time is
modified by an administrator.

time() offers no monotonicity guarantee and hence is exposed to time
modification of the system.

The begin and close timestamps are strictly used in the name generation
of the chunk/archives. Given the current usage of these timestamps
validating monotonicity should not be a fatal error. Name uniqueness is
provided by the chunk name suffix (auto increment).

Solution
========
Do not enforce monotonicity for the begin and close timestamps but warn
on unexpected return (begin > close).

Known drawbacks
=========
None.

Signed-off-by: Jonathan Rajotte <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: Ic4b17285d150358d1569d6821c451c243e64e9a1
jgalar pushed a commit that referenced this pull request May 19, 2020
Observed issue
==============
Core dump:

 #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
 #1  0x0000003eb4025548 in __GI_abort () at abort.c:79
 #2  0x0000003eb402542f in __assert_fail_base (fmt=0x3eb4184ae0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x4cdee0 "(trace_chunk->timestamp_close).is_set",
     file=0x4cde78 "../../../lttng-tools-2.11.3/src/common/trace-chunk.c", line=903, function=0x4cf4a0 <__PRETTY_FUNCTION__.6756> "lttng_trace_chunk_move_to_completed")
     at assert.c:92
 #3  0x0000003eb4033af2 in __GI___assert_fail (assertion=assertion@entry=0x4cdee0 "(trace_chunk->timestamp_close).is_set",
     file=file@entry=0x4cde78 "../../../lttng-tools-2.11.3/src/common/trace-chunk.c", line=line@entry=903,
     function=function@entry=0x4cf4a0 <__PRETTY_FUNCTION__.6756> "lttng_trace_chunk_move_to_completed") at assert.c:101
 #4  0x000000000047f37e in lttng_trace_chunk_move_to_completed (trace_chunk=0x7fcb5c00e570) at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:903
 #5  0x0000000000480755 in lttng_trace_chunk_release (ref=0x7fcb5c00e598) at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:1117
 #6  urcu_ref_put (release=<optimized out>, ref=0x7fcb5c00e598) at /usr/include/urcu/ref.h:68
 #7  lttng_trace_chunk_put (chunk=0x7fcb5c00e570) at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:1150
 #8  0x0000000000429c22 in cmd_rotate_session (session=0x7fcb5c003ff0, rotate_return=rotate_return@entry=0x7fcb6b7ed470, quiet_rotation=quiet_rotation@entry=false)
     at ../../../../lttng-tools-2.11.3/src/bin/lttng-sessiond/cmd.c:5037
 #9  0x00000000004451d7 in process_client_msg (cmd_ctx=0x7fcb5c00e760, sock=sock@entry=0x7fcb6b7fd4c0, sock_error=sock_error@entry=0x7fcb6b7fd4c4)
     at ../../../../lttng-tools-2.11.3/src/bin/lttng-sessiond/client.c:1852
 #10 0x00000000004474c6 in thread_manage_clients (data=<optimized out>) at ../../../../lttng-tools-2.11.3/src/bin/lttng-sessiond/client.c:2199
 #11 0x00000000004422f2 in launch_thread (data=0x4f97a0) at ../../../../lttng-tools-2.11.3/src/bin/lttng-sessiond/thread.c:75
 #12 0x0000003eb4408ed4 in start_thread (arg=<optimized out>) at pthread_create.c:479
 #13 0x0000003eb40f8e6f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Reproduction:

 Disable ntp/any time management mechanism.

 lttng create
 lttng enable-event -u 'lttng_ust_tracef:*'
 lttng start
 lttng rotate
 date --set="$(date --date='-1 hour')"
 lttng rotate auto-20200515-142503
    Waiting for rotation to complete
    Error: Failed to query the state of the rotation.

Logs:
 DEBUG1 - 12:25:28.570037987 [2660/2717]: Setting trace chunk close command to "move to completed chunk folder" (in lttng_trace_chunk_set_close_command() at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:1073)
 Error: Failed to set trace chunk close timestamp: close timestamp is before creation timestamp
 Error: Failed to set the close timestamp of the current trace chunk of session "auto-20200515-142503"
 lttng-sessiond: ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:903: lttng_trace_chunk_move_to_completed: Assertion `(trace_chunk->timestamp_close).is_set' failed.

 ...

 Aborted (core dumped)
 root@X10SDV-8C-TLN4F:~# DEBUG1 - 12:25:29.534263017 [2739/2739]: Releasing trace chunk registry to all trace chunks (in lttng_trace_chunk_registry_put_each_chunk() at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:1414)
 DEBUG1 - 12:25:29.534317468 [2739/2739]: Releasing reference to trace chunk: session_id = 0chunk_id = 2, name = "20200515T122528+0000-2", status = closed (in lttng_trace_chunk_registry_put_each_chunk() at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:1435)
 DEBUG1 - 12:25:29.534365653 [2739/2739]: Releasing reference to trace chunk: session_id = 0chunk_id = 1, name = "20200515T142520+0000-1", status = closed (in lttng_trace_chunk_registry_put_each_chunk() at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:1435)
 DEBUG1 - 12:25:29.534400638 [2739/2739]: Released reference to 2 trace chunks in lttng_trace_chunk_registry_put_each_chunk() (in lttng_trace_chunk_registry_put_each_chunk() at ../../../lttng-tools-2.11.3/src/common/trace-chunk.c:1447)
 Error: 2 trace chunks are leaked by lttng-consumerd. This can be caused by an internal error of the session daemon.

Cause
=====
The trace_chunk->timestamp_close is not set since the result from time()
is smaller than the creation timestamp.

The close timestamp is smaller because the calendar system time is
modified by an administrator.

time() offers no monotonicity guarantee and hence is exposed to time
modification of the system.

The begin and close timestamps are strictly used in the name generation
of the chunk/archives. Given the current usage of these timestamps
validating monotonicity should not be a fatal error. Name uniqueness is
provided by the chunk name suffix (auto increment).

Solution
========
Do not enforce monotonicity for the begin and close timestamps but warn
on unexpected return (begin > close).

Known drawbacks
=========
None.

Signed-off-by: Jonathan Rajotte <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: Ic4b17285d150358d1569d6821c451c243e64e9a1
frdeso pushed a commit to frdeso/lttng-tools that referenced this pull request May 26, 2020
Observed issue
==============

Deadlock between the notification thread and the action executor thread.

Thread 5 holds cmd_queue.lock and request the client lock.
Thread 6 holds the client lock and request the cmd_queue lock.

Thread 5 have little value in holding the queue lock considering it effectively to a "pop" of the cmd_queue.

Thread 9 is waiting on the cmd_queue lock but does not hold any other
locks and thus not part of the deadlock but is a casualties of this
deadlock and leave a client "hanging".

Other threads are all in their respective waiting state.

Thread 9 (Thread 0x7f76f2ffd700 (LWP 240467)):
 #0  __lll_lock_wait (futex=futex@entry=0x1ad1308, private=0) at lowlevellock.c:52                                                                                                                                                 [1070/1123]
 #1  0x00007f77052c80a3 in __GI___pthread_mutex_lock (mutex=0x1ad1308) at ../nptl/pthread_mutex_lock.c:80
 #2  0x00000000004611dd in run_command_wait (handle=0x1ad12f0, cmd=0x7f76f2fe31e0) at notification-thread-commands.c:31
 lttng#3  0x000000000046143a in notification_thread_command_unregister_trigger (handle=0x1ad12f0, trigger=0x7f76e4000ef0) at notification-thread-commands.c:148
 lttng#4  0x00000000004444af in cmd_unregister_trigger (cmd_ctx=0x7f76e4000d40, sock=68, notification_thread=0x1ad12f0) at cmd.c:4618
 lttng#5  0x0000000000483d23 in process_client_msg (cmd_ctx=0x7f76e4000d40, sock=0x7f76f2ffcba4, sock_error=0x7f76f2ffcb90) at client.c:2001
 lttng#6  0x000000000047f00b in thread_manage_clients (data=0x1ad1a80) at client.c:2402
 lttng#7  0x000000000047b303 in launch_thread (data=0x1ad1af0) at thread.c:66
 lttng#8  0x00007f77052c5609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 lttng#9  0x00007f77051cc103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7f7700fcf700 (LWP 240464)):
 #0  __lll_lock_wait (futex=futex@entry=0x1ad1308, private=0) at lowlevellock.c:52
 #1  0x00007f77052c80a3 in __GI___pthread_mutex_lock (mutex=0x1ad1308) at ../nptl/pthread_mutex_lock.c:80
 #2  0x0000000000461bf2 in run_command_no_wait (handle=0x1ad12f0, in_cmd=0x7f7700fce340) at notification-thread-commands.c:87
 lttng#3  0x0000000000461b93 in notification_thread_client_communication_update (handle=0x1ad12f0, id=1, transmission_status=CLIENT_TRANSMISSION_STATUS_QUEUED) at notification-thread-commands.c:400
 lttng#4  0x0000000000497658 in client_handle_transmission_status (client=0x7f76f8004e30, status=CLIENT_TRANSMISSION_STATUS_QUEUED, user_data=0x7f76f8004a00) at action-executor.c:154
 lttng#5  0x0000000000467be7 in notification_client_list_send_evaluation (client_list=0x7f76f8004fe0, condition=0x7f76e40041a0, evaluation=0x7f76cc000cc0, trigger_creds=0x7f76e4004288, source_object_creds=0x0, client_report=0x4971a0 <client_ha
 ndle_transmission_status>, user_data=0x7f76f8004a00) at notification-thread-events.c:4007
 lttng#6  0x00000000004956bb in action_executor_notify_handler (executor=0x7f76f8004a00, work_item=0x7f76f80062d0, action=0x7f76e4004210) at action-executor.c:199
 lttng#7  0x00000000004953fd in action_executor_generic_handler (executor=0x7f76f8004a00, work_item=0x7f76f80062d0, action=0x7f76e4004210) at action-executor.c:493
 lttng#8  0x0000000000495101 in action_work_item_execute (executor=0x7f76f8004a00, work_item=0x7f76f80062d0) at action-executor.c:506
 lttng#9  0x0000000000493ff5 in action_executor_thread (_data=0x7f76f8004a00) at action-executor.c:559
 lttng#10 0x000000000047b303 in launch_thread (data=0x7f76f8004aa0) at thread.c:66
 lttng#11 0x00007f77052c5609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 lttng#12 0x00007f77051cc103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7f77017d0700 (LWP 240463)):
 #0  __lll_lock_wait (futex=futex@entry=0x7f76f8004e30, private=0) at lowlevellock.c:52
 #1  0x00007f77052c80a3 in __GI___pthread_mutex_lock (mutex=0x7f76f8004e30) at ../nptl/pthread_mutex_lock.c:80
 #2  0x0000000000463080 in handle_notification_thread_command (handle=0x1ad12f0, state=0x7f77017cfb00) at notification-thread-events.c:2936
 lttng#3  0x000000000045e881 in thread_notification (data=0x1ad12f0) at notification-thread.c:705
 lttng#4  0x000000000047b303 in launch_thread (data=0x1ad1420) at thread.c:66
 lttng#5  0x00007f77052c5609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 lttng#6  0x00007f77051cc103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Cause
=====

The action executor holds the client lock across the communication to
prevent simultaneous update to the client state.

The notification thread holds the cmd_queue lock across operation for no
apparent reason (TODO make sure there is no internal add to the queue. if
so we should reacquire the lock only when necessery.)

Solution
========

Reduce the windows for which the cmd_queue lock is held by the
notification thread to only the "pop" action on the queue. As soon as we
have the lock, get the cmd, remove it from the list and release the
lock. This prevent inverted lock acquisition base on the pattern of the
action executor thread.

Signed-off-by: Jonathan Rajotte <[email protected]>
Change-Id: I91d30c134bc1a128c96058f0e0cdd325808c91bc
Depends-on: lttng-ust: I8423c510bf6af2f9bf85256e8d6f931d36f7054b
frdeso pushed a commit to frdeso/lttng-tools that referenced this pull request May 27, 2020
Observed issue
==============

Deadlock between the notification thread and the action executor thread.

Thread 5 holds cmd_queue.lock and request the client lock.
Thread 6 holds the client lock and request the cmd_queue lock.

Thread 5 have little value in holding the queue lock considering it effectively to a "pop" of the cmd_queue.

Thread 9 is waiting on the cmd_queue lock but does not hold any other
locks and thus not part of the deadlock but is a casualties of this
deadlock and leave a client "hanging".

Other threads are all in their respective waiting state.

Thread 9 (Thread 0x7f76f2ffd700 (LWP 240467)):
 #0  __lll_lock_wait (futex=futex@entry=0x1ad1308, private=0) at lowlevellock.c:52                                                                                                                                                 [1070/1123]
 #1  0x00007f77052c80a3 in __GI___pthread_mutex_lock (mutex=0x1ad1308) at ../nptl/pthread_mutex_lock.c:80
 #2  0x00000000004611dd in run_command_wait (handle=0x1ad12f0, cmd=0x7f76f2fe31e0) at notification-thread-commands.c:31
 lttng#3  0x000000000046143a in notification_thread_command_unregister_trigger (handle=0x1ad12f0, trigger=0x7f76e4000ef0) at notification-thread-commands.c:148
 lttng#4  0x00000000004444af in cmd_unregister_trigger (cmd_ctx=0x7f76e4000d40, sock=68, notification_thread=0x1ad12f0) at cmd.c:4618
 lttng#5  0x0000000000483d23 in process_client_msg (cmd_ctx=0x7f76e4000d40, sock=0x7f76f2ffcba4, sock_error=0x7f76f2ffcb90) at client.c:2001
 lttng#6  0x000000000047f00b in thread_manage_clients (data=0x1ad1a80) at client.c:2402
 lttng#7  0x000000000047b303 in launch_thread (data=0x1ad1af0) at thread.c:66
 lttng#8  0x00007f77052c5609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 lttng#9  0x00007f77051cc103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7f7700fcf700 (LWP 240464)):
 #0  __lll_lock_wait (futex=futex@entry=0x1ad1308, private=0) at lowlevellock.c:52
 #1  0x00007f77052c80a3 in __GI___pthread_mutex_lock (mutex=0x1ad1308) at ../nptl/pthread_mutex_lock.c:80
 #2  0x0000000000461bf2 in run_command_no_wait (handle=0x1ad12f0, in_cmd=0x7f7700fce340) at notification-thread-commands.c:87
 lttng#3  0x0000000000461b93 in notification_thread_client_communication_update (handle=0x1ad12f0, id=1, transmission_status=CLIENT_TRANSMISSION_STATUS_QUEUED) at notification-thread-commands.c:400
 lttng#4  0x0000000000497658 in client_handle_transmission_status (client=0x7f76f8004e30, status=CLIENT_TRANSMISSION_STATUS_QUEUED, user_data=0x7f76f8004a00) at action-executor.c:154
 lttng#5  0x0000000000467be7 in notification_client_list_send_evaluation (client_list=0x7f76f8004fe0, condition=0x7f76e40041a0, evaluation=0x7f76cc000cc0, trigger_creds=0x7f76e4004288, source_object_creds=0x0, client_report=0x4971a0 <client_ha
 ndle_transmission_status>, user_data=0x7f76f8004a00) at notification-thread-events.c:4007
 lttng#6  0x00000000004956bb in action_executor_notify_handler (executor=0x7f76f8004a00, work_item=0x7f76f80062d0, action=0x7f76e4004210) at action-executor.c:199
 lttng#7  0x00000000004953fd in action_executor_generic_handler (executor=0x7f76f8004a00, work_item=0x7f76f80062d0, action=0x7f76e4004210) at action-executor.c:493
 lttng#8  0x0000000000495101 in action_work_item_execute (executor=0x7f76f8004a00, work_item=0x7f76f80062d0) at action-executor.c:506
 lttng#9  0x0000000000493ff5 in action_executor_thread (_data=0x7f76f8004a00) at action-executor.c:559
 lttng#10 0x000000000047b303 in launch_thread (data=0x7f76f8004aa0) at thread.c:66
 lttng#11 0x00007f77052c5609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 lttng#12 0x00007f77051cc103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7f77017d0700 (LWP 240463)):
 #0  __lll_lock_wait (futex=futex@entry=0x7f76f8004e30, private=0) at lowlevellock.c:52
 #1  0x00007f77052c80a3 in __GI___pthread_mutex_lock (mutex=0x7f76f8004e30) at ../nptl/pthread_mutex_lock.c:80
 #2  0x0000000000463080 in handle_notification_thread_command (handle=0x1ad12f0, state=0x7f77017cfb00) at notification-thread-events.c:2936
 lttng#3  0x000000000045e881 in thread_notification (data=0x1ad12f0) at notification-thread.c:705
 lttng#4  0x000000000047b303 in launch_thread (data=0x1ad1420) at thread.c:66
 lttng#5  0x00007f77052c5609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 lttng#6  0x00007f77051cc103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Cause
=====

The action executor holds the client lock across the communication to
prevent simultaneous update to the client state.

The notification thread holds the cmd_queue lock across operation for no
apparent reason (TODO make sure there is no internal add to the queue. if
so we should reacquire the lock only when necessery.)

Solution
========

Reduce the windows for which the cmd_queue lock is held by the
notification thread to only the "pop" action on the queue. As soon as we
have the lock, get the cmd, remove it from the list and release the
lock. This prevent inverted lock acquisition base on the pattern of the
action executor thread.

Signed-off-by: Jonathan Rajotte <[email protected]>
Change-Id: I91d30c134bc1a128c96058f0e0cdd325808c91bc
Depends-on: lttng-ust: I8423c510bf6af2f9bf85256e8d6f931d36f7054b
jgalar added a commit to jgalar/lttng-tools that referenced this pull request May 28, 2020
Observed issue
--------------

While running the out-of-tree java agent tests [1], the session daemon
and agent often end up in a deadlock.

Attaching gdb to the session daemon, we can see that two threads are
blocked in an intriguing state.

Thread 13 (Thread 0x7f89027fc700 (LWP 9636)):
 #0  0x00007f891e81a4cf in __lll_lock_wait () from /usr/lib/libpthread.so.0
 #1  0x00007f891e812e03 in pthread_mutex_lock () from /usr/lib/libpthread.so.0
 #2  0x000055637f1fbd92 in session_lock_list () at session.c:156
 #3  0x000055637f25dc47 in update_agent_app (app=0x7f88ec003480) at agent-thread.c:56
 #4  0x000055637f25ec0a in thread_agent_management (data=0x556380cd2400) at agent-thread.c:426
 #5  0x000055637f22fb3a in launch_thread (data=0x556380cd24a0) at thread.c:65
 #6  0x00007f891e81046f in start_thread () from /usr/lib/libpthread.so.0
 #7  0x00007f891e7203d3 in clone () from /usr/lib/libc.so.6

Thread 8 (Thread 0x7f8919309700 (LWP 9631)):
 #0  0x00007f891e81b44d in recvmsg () from /usr/lib/libpthread.so.0
 #1  0x000055637f267847 in lttcomm_recvmsg_inet_sock (sock=0x7f88ec0033c0, buf=0x7f89192f5d5c, len=4, flags=0) at inet.c:367
 #2  0x000055637f2146c6 in recv_reply (sock=0x7f88ec0033c0, buf=0x7f89192f5d5c, size=4) at agent.c:275
 #3  0x000055637f215202 in app_context_op (app=0x7f88ec003400, ctx=0x7f8908020900, cmd=AGENT_CMD_APP_CTX_DISABLE) at agent.c:552
 #4  0x000055637f215c2d in disable_context (ctx=0x7f8908020900, domain=LTTNG_DOMAIN_JUL) at agent.c:841
 #5  0x000055637f217480 in agent_destroy (agt=0x7f890801dc20) at agent.c:1326
 #6  0x000055637f243448 in trace_ust_destroy_session (session=0x7f8908004010) at trace-ust.c:1408
 #7  0x000055637f1fd775 in session_release (ref=0x7f8908001e70) at session.c:873
 #8  0x000055637f1fb9ac in urcu_ref_put (ref=0x7f8908001e70, release=0x55637f1fd62a <session_release>) at /usr/include/urcu/ref.h:68
 lttng#9  0x000055637f1fdad2 in session_put (session=0x7f8908000d10) at session.c:942
 lttng#10 0x000055637f2369e6 in process_client_msg (cmd_ctx=0x7f890800e6e0, sock=0x7f8919308560, sock_error=0x7f8919308564) at client.c:2102
 lttng#11 0x000055637f2375ab in thread_manage_clients (data=0x556380cd1840) at client.c:2347
 lttng#12 0x000055637f22fb3a in launch_thread (data=0x556380cd18b0) at thread.c:65
 lttng#13 0x00007f891e81046f in start_thread () from /usr/lib/libpthread.so.0
 lttng#14 0x00007f891e7203d3 in clone () from /usr/lib/libc.so.6

T8 is holding session list lock while the cmd_destroy_session
command is being processed. More specifically, it is attempting
to destroy an "agent_context" by communicating with an "agent"
application.

Meanwhile, T13 is still registering that same "agent" application.

Cause
-----

The deadlock itself is pretty simple to understand.

The "agent thread" (T13) has the responsability of accepting new agent
application connections. When such a connection occurs, the thread
creates a new `agent_app` instance and sends the current sessions'
configuration (i.e. their event rules and contexts) to the agent
application. When that "update" is complete, a "registration done"
message is sent to the new agent application.

From the stacktrace above, we can see that T13 is attempting to update
the agent application with its initial configuration, but it is
blocked on the acquisition of the session list lock. The application's
agent is also blocked since it is waiting for the "registration done"
message before allowing tracing to proceed (not shown here, but seen
in the test logs).

Meanwhile, T8 is holding the session list lock while destroying a
session. This is expected as all client commands are executed with
this lock held. It is, amongst other reasons, used to serialize
changes to the sessions' configuration and configuration updates sent
to the tracers (i.e. because new apps appear or to keep existing
tracers in sync with the users' session configuration).

The question becomes: why is T8 tearing down an application that is
not yet registered?

First, inspecting `agent_app` immediately shows that this structure
has no built-in synchronization mechanism. Therefore, the fact that
two threads are accessing it at the same time raises a big red flag.

Speculating on the intentions of the original design, my intuition is
that the "agent_management" thread's role is limited to instantiating
an `agent_app` and synchronizing it with the various sessions'
configuration. Once that synchronization is performed, the agent
application should be published and never accessed again by the "agent
thread".

Configuration updates (i.e. new event rules, contexts) are then sent
synchronously as they are requested by a client in the context of the
client thread. Those updates are performed while holding the session
list lock.

Hence, there is only one thread that should manipulate the agent
application at any given time making an explicit `agent_app` lock
unnecessary.

Overall, this would echo what is done when a 'user space tracer'
application registers to the session daemon (see dispatch.c:368).

Evidently this isn't what is happening here.

The agent thread creates the `agent_app`, publishes it, and then
performs an "agent app update" (sending the configuration) while
holding the session list lock. This means that there is a window where
an agent application is visible to the other threads, yet has not been
properly registered.

Solution
--------

The acquisition of the session list lock is moved outside of
update_agent_app() to allow the "agent thread" to hold the session
list lock during the "configuration update" phase of the agent
application registration.

Essentially, the sequence of operation changes from:

- Agent tcp connection established
- call handle_registration()
  - agent version check
  - allocation of agent_app instance
  - new agent_add is published through the global agent_apps_ht_by_sock
    hashtable
    ***
    it is now reachable by all other threads without any form of
    exclusivity synchronization.
    ***
- update_agent_app
  - acquire session list lock
  - iterate over sessions
    - send configuration
  - release session list lock
- send registration done

to:

- Agent tcp connection established
- call accept_agent_registration()
  - agent version check
- allocation of agent_app instance
- acquire session list lock
- update_agent_app
  - iterate over sessions
    - send configuration
- send registration done
- new agent_add is published through the global agent_apps_ht_by_sock
  hashtable
- release session list lock

Links
-----

[1] https://github.com/lttng/lttng-ust-java-tests

Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: Ia34c5ad81ed3936acbca756b425423e0cb8dbddf
jgalar added a commit to jgalar/lttng-tools that referenced this pull request May 28, 2020
Observed issue
--------------

While running the out-of-tree java agent tests [1], the session daemon
and agent often end up in a deadlock.

Attaching gdb to the session daemon, we can see that two threads are
blocked in an intriguing state.

Thread 13 (Thread 0x7f89027fc700 (LWP 9636)):
 #0  0x00007f891e81a4cf in __lll_lock_wait () from /usr/lib/libpthread.so.0
 #1  0x00007f891e812e03 in pthread_mutex_lock () from /usr/lib/libpthread.so.0
 #2  0x000055637f1fbd92 in session_lock_list () at session.c:156
 #3  0x000055637f25dc47 in update_agent_app (app=0x7f88ec003480) at agent-thread.c:56
 #4  0x000055637f25ec0a in thread_agent_management (data=0x556380cd2400) at agent-thread.c:426
 #5  0x000055637f22fb3a in launch_thread (data=0x556380cd24a0) at thread.c:65
 #6  0x00007f891e81046f in start_thread () from /usr/lib/libpthread.so.0
 #7  0x00007f891e7203d3 in clone () from /usr/lib/libc.so.6

Thread 8 (Thread 0x7f8919309700 (LWP 9631)):
 #0  0x00007f891e81b44d in recvmsg () from /usr/lib/libpthread.so.0
 #1  0x000055637f267847 in lttcomm_recvmsg_inet_sock (sock=0x7f88ec0033c0, buf=0x7f89192f5d5c, len=4, flags=0) at inet.c:367
 #2  0x000055637f2146c6 in recv_reply (sock=0x7f88ec0033c0, buf=0x7f89192f5d5c, size=4) at agent.c:275
 #3  0x000055637f215202 in app_context_op (app=0x7f88ec003400, ctx=0x7f8908020900, cmd=AGENT_CMD_APP_CTX_DISABLE) at agent.c:552
 #4  0x000055637f215c2d in disable_context (ctx=0x7f8908020900, domain=LTTNG_DOMAIN_JUL) at agent.c:841
 #5  0x000055637f217480 in agent_destroy (agt=0x7f890801dc20) at agent.c:1326
 #6  0x000055637f243448 in trace_ust_destroy_session (session=0x7f8908004010) at trace-ust.c:1408
 #7  0x000055637f1fd775 in session_release (ref=0x7f8908001e70) at session.c:873
 #8  0x000055637f1fb9ac in urcu_ref_put (ref=0x7f8908001e70, release=0x55637f1fd62a <session_release>) at /usr/include/urcu/ref.h:68
 lttng#9  0x000055637f1fdad2 in session_put (session=0x7f8908000d10) at session.c:942
 lttng#10 0x000055637f2369e6 in process_client_msg (cmd_ctx=0x7f890800e6e0, sock=0x7f8919308560, sock_error=0x7f8919308564) at client.c:2102
 lttng#11 0x000055637f2375ab in thread_manage_clients (data=0x556380cd1840) at client.c:2347
 lttng#12 0x000055637f22fb3a in launch_thread (data=0x556380cd18b0) at thread.c:65
 lttng#13 0x00007f891e81046f in start_thread () from /usr/lib/libpthread.so.0
 lttng#14 0x00007f891e7203d3 in clone () from /usr/lib/libc.so.6

T8 is holding session list lock while the cmd_destroy_session
command is being processed. More specifically, it is attempting
to destroy an "agent_context" by communicating with an "agent"
application.

Meanwhile, T13 is still registering that same "agent" application.

Cause
-----

The deadlock itself is pretty simple to understand.

The "agent thread" (T13) has the responsability of accepting new agent
application connections. When such a connection occurs, the thread
creates a new `agent_app` instance and sends the current sessions'
configuration (i.e. their event rules and contexts) to the agent
application. When that "update" is complete, a "registration done"
message is sent to the new agent application.

From the stacktrace above, we can see that T13 is attempting to update
the agent application with its initial configuration, but it is
blocked on the acquisition of the session list lock. The application's
agent is also blocked since it is waiting for the "registration done"
message before allowing tracing to proceed (not shown here, but seen
in the test logs).

Meanwhile, T8 is holding the session list lock while destroying a
session. This is expected as all client commands are executed with
this lock held. It is, amongst other reasons, used to serialize
changes to the sessions' configuration and configuration updates sent
to the tracers (i.e. because new apps appear or to keep existing
tracers in sync with the users' session configuration).

The question becomes: why is T8 tearing down an application that is
not yet registered?

First, inspecting `agent_app` immediately shows that this structure
has no built-in synchronization mechanism. Therefore, the fact that
two threads are accessing it at the same time raises a big red flag.

Speculating on the intentions of the original design, my intuition is
that the "agent_management" thread's role is limited to instantiating
an `agent_app` and synchronizing it with the various sessions'
configuration. Once that synchronization is performed, the agent
application should be published and never accessed again by the "agent
thread".

Configuration updates (i.e. new event rules, contexts) are then sent
synchronously as they are requested by a client in the context of the
client thread. Those updates are performed while holding the session
list lock.

Hence, there is only one thread that should manipulate the agent
application at any given time making an explicit `agent_app` lock
unnecessary.

Overall, this would echo what is done when a 'user space tracer'
application registers to the session daemon (see dispatch.c:368).

Evidently this isn't what is happening here.

The agent thread creates the `agent_app`, publishes it, and then
performs an "agent app update" (sending the configuration) while
holding the session list lock. This means that there is a window where
an agent application is visible to the other threads, yet has not been
properly registered.

Solution
--------

The acquisition of the session list lock is moved outside of
update_agent_app() to allow the "agent thread" to hold the session
list lock during the "configuration update" phase of the agent
application registration.

Essentially, the sequence of operation changes from:

- Agent tcp connection established
- call handle_registration()
  - agent version check
  - allocation of agent_app instance
  - new agent_add is published through the global agent_apps_ht_by_sock
    hashtable
    ***
    it is now reachable by all other threads without any form of
    exclusivity synchronization.
    ***
- update_agent_app
  - acquire session list lock
  - iterate over sessions
    - send configuration
  - release session list lock
- send registration done

to:

- Agent tcp connection established
- call accept_agent_registration()
  - agent version check
- allocation of agent_app instance
- acquire session list lock
- update_agent_app
  - iterate over sessions
    - send configuration
- send registration done
- new agent_add is published through the global agent_apps_ht_by_sock
  hashtable
- release session list lock

Links
-----

[1] https://github.com/lttng/lttng-ust-java-tests

Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: Ia34c5ad81ed3936acbca756b425423e0cb8dbddf
frdeso pushed a commit to frdeso/lttng-tools that referenced this pull request Jun 3, 2020
Observed issue
==============

Deadlock between the notification thread and the action executor thread.

Thread 5 holds cmd_queue.lock and request the client lock.
Thread 6 holds the client lock and request the cmd_queue lock.

Thread 5 have little value in holding the queue lock considering it effectively to a "pop" of the cmd_queue.

Thread 9 is waiting on the cmd_queue lock but does not hold any other
locks and thus not part of the deadlock but is a casualties of this
deadlock and leave a client "hanging".

Other threads are all in their respective waiting state.

Thread 9 (Thread 0x7f76f2ffd700 (LWP 240467)):
 #0  __lll_lock_wait (futex=futex@entry=0x1ad1308, private=0) at lowlevellock.c:52                                                                                                                                                 [1070/1123]
 #1  0x00007f77052c80a3 in __GI___pthread_mutex_lock (mutex=0x1ad1308) at ../nptl/pthread_mutex_lock.c:80
 #2  0x00000000004611dd in run_command_wait (handle=0x1ad12f0, cmd=0x7f76f2fe31e0) at notification-thread-commands.c:31
 lttng#3  0x000000000046143a in notification_thread_command_unregister_trigger (handle=0x1ad12f0, trigger=0x7f76e4000ef0) at notification-thread-commands.c:148
 lttng#4  0x00000000004444af in cmd_unregister_trigger (cmd_ctx=0x7f76e4000d40, sock=68, notification_thread=0x1ad12f0) at cmd.c:4618
 lttng#5  0x0000000000483d23 in process_client_msg (cmd_ctx=0x7f76e4000d40, sock=0x7f76f2ffcba4, sock_error=0x7f76f2ffcb90) at client.c:2001
 lttng#6  0x000000000047f00b in thread_manage_clients (data=0x1ad1a80) at client.c:2402
 lttng#7  0x000000000047b303 in launch_thread (data=0x1ad1af0) at thread.c:66
 lttng#8  0x00007f77052c5609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 lttng#9  0x00007f77051cc103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7f7700fcf700 (LWP 240464)):
 #0  __lll_lock_wait (futex=futex@entry=0x1ad1308, private=0) at lowlevellock.c:52
 #1  0x00007f77052c80a3 in __GI___pthread_mutex_lock (mutex=0x1ad1308) at ../nptl/pthread_mutex_lock.c:80
 #2  0x0000000000461bf2 in run_command_no_wait (handle=0x1ad12f0, in_cmd=0x7f7700fce340) at notification-thread-commands.c:87
 lttng#3  0x0000000000461b93 in notification_thread_client_communication_update (handle=0x1ad12f0, id=1, transmission_status=CLIENT_TRANSMISSION_STATUS_QUEUED) at notification-thread-commands.c:400
 lttng#4  0x0000000000497658 in client_handle_transmission_status (client=0x7f76f8004e30, status=CLIENT_TRANSMISSION_STATUS_QUEUED, user_data=0x7f76f8004a00) at action-executor.c:154
 lttng#5  0x0000000000467be7 in notification_client_list_send_evaluation (client_list=0x7f76f8004fe0, condition=0x7f76e40041a0, evaluation=0x7f76cc000cc0, trigger_creds=0x7f76e4004288, source_object_creds=0x0, client_report=0x4971a0 <client_ha
 ndle_transmission_status>, user_data=0x7f76f8004a00) at notification-thread-events.c:4007
 lttng#6  0x00000000004956bb in action_executor_notify_handler (executor=0x7f76f8004a00, work_item=0x7f76f80062d0, action=0x7f76e4004210) at action-executor.c:199
 lttng#7  0x00000000004953fd in action_executor_generic_handler (executor=0x7f76f8004a00, work_item=0x7f76f80062d0, action=0x7f76e4004210) at action-executor.c:493
 lttng#8  0x0000000000495101 in action_work_item_execute (executor=0x7f76f8004a00, work_item=0x7f76f80062d0) at action-executor.c:506
 lttng#9  0x0000000000493ff5 in action_executor_thread (_data=0x7f76f8004a00) at action-executor.c:559
 lttng#10 0x000000000047b303 in launch_thread (data=0x7f76f8004aa0) at thread.c:66
 lttng#11 0x00007f77052c5609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 lttng#12 0x00007f77051cc103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7f77017d0700 (LWP 240463)):
 #0  __lll_lock_wait (futex=futex@entry=0x7f76f8004e30, private=0) at lowlevellock.c:52
 #1  0x00007f77052c80a3 in __GI___pthread_mutex_lock (mutex=0x7f76f8004e30) at ../nptl/pthread_mutex_lock.c:80
 #2  0x0000000000463080 in handle_notification_thread_command (handle=0x1ad12f0, state=0x7f77017cfb00) at notification-thread-events.c:2936
 lttng#3  0x000000000045e881 in thread_notification (data=0x1ad12f0) at notification-thread.c:705
 lttng#4  0x000000000047b303 in launch_thread (data=0x1ad1420) at thread.c:66
 lttng#5  0x00007f77052c5609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 lttng#6  0x00007f77051cc103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Cause
=====

The action executor holds the client lock across the communication to
prevent simultaneous update to the client state.

The notification thread holds the cmd_queue lock across operation for no
apparent reason (TODO make sure there is no internal add to the queue. if
so we should reacquire the lock only when necessery.)

Solution
========

Reduce the windows for which the cmd_queue lock is held by the
notification thread to only the "pop" action on the queue. As soon as we
have the lock, get the cmd, remove it from the list and release the
lock. This prevent inverted lock acquisition base on the pattern of the
action executor thread.

Signed-off-by: Jonathan Rajotte <[email protected]>
Change-Id: I91d30c134bc1a128c96058f0e0cdd325808c91bc
Depends-on: lttng-ust: I8423c510bf6af2f9bf85256e8d6f931d36f7054b
jgalar pushed a commit that referenced this pull request Aug 11, 2020
Observed issue
==============

Deadlock between the notification thread and the action executor thread.

Thread 5 holds cmd_queue.lock and request the client lock.
Thread 6 holds the client lock and request the cmd_queue lock.

Thread 5 have little value in holding the queue lock considering it effectively to a "pop" of the cmd_queue.

Thread 9 is waiting on the cmd_queue lock but does not hold any other
locks and thus not part of the deadlock but is a casualties of this
deadlock and leave a client "hanging".

Other threads are all in their respective waiting state.

Thread 9 (Thread 0x7f76f2ffd700 (LWP 240467)):
 #0  __lll_lock_wait (futex=futex@entry=0x1ad1308, private=0) at lowlevellock.c:52                                                                                                                                                 [1070/1123]
 #1  0x00007f77052c80a3 in __GI___pthread_mutex_lock (mutex=0x1ad1308) at ../nptl/pthread_mutex_lock.c:80
 #2  0x00000000004611dd in run_command_wait (handle=0x1ad12f0, cmd=0x7f76f2fe31e0) at notification-thread-commands.c:31
 #3  0x000000000046143a in notification_thread_command_unregister_trigger (handle=0x1ad12f0, trigger=0x7f76e4000ef0) at notification-thread-commands.c:148
 #4  0x00000000004444af in cmd_unregister_trigger (cmd_ctx=0x7f76e4000d40, sock=68, notification_thread=0x1ad12f0) at cmd.c:4618
 #5  0x0000000000483d23 in process_client_msg (cmd_ctx=0x7f76e4000d40, sock=0x7f76f2ffcba4, sock_error=0x7f76f2ffcb90) at client.c:2001
 #6  0x000000000047f00b in thread_manage_clients (data=0x1ad1a80) at client.c:2402
 #7  0x000000000047b303 in launch_thread (data=0x1ad1af0) at thread.c:66
 #8  0x00007f77052c5609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #9  0x00007f77051cc103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7f7700fcf700 (LWP 240464)):
 #0  __lll_lock_wait (futex=futex@entry=0x1ad1308, private=0) at lowlevellock.c:52
 #1  0x00007f77052c80a3 in __GI___pthread_mutex_lock (mutex=0x1ad1308) at ../nptl/pthread_mutex_lock.c:80
 #2  0x0000000000461bf2 in run_command_no_wait (handle=0x1ad12f0, in_cmd=0x7f7700fce340) at notification-thread-commands.c:87
 #3  0x0000000000461b93 in notification_thread_client_communication_update (handle=0x1ad12f0, id=1, transmission_status=CLIENT_TRANSMISSION_STATUS_QUEUED) at notification-thread-commands.c:400
 #4  0x0000000000497658 in client_handle_transmission_status (client=0x7f76f8004e30, status=CLIENT_TRANSMISSION_STATUS_QUEUED, user_data=0x7f76f8004a00) at action-executor.c:154
 #5  0x0000000000467be7 in notification_client_list_send_evaluation (client_list=0x7f76f8004fe0, condition=0x7f76e40041a0, evaluation=0x7f76cc000cc0, trigger_creds=0x7f76e4004288, source_object_creds=0x0, client_report=0x4971a0 <client_ha
 ndle_transmission_status>, user_data=0x7f76f8004a00) at notification-thread-events.c:4007
 #6  0x00000000004956bb in action_executor_notify_handler (executor=0x7f76f8004a00, work_item=0x7f76f80062d0, action=0x7f76e4004210) at action-executor.c:199
 #7  0x00000000004953fd in action_executor_generic_handler (executor=0x7f76f8004a00, work_item=0x7f76f80062d0, action=0x7f76e4004210) at action-executor.c:493
 #8  0x0000000000495101 in action_work_item_execute (executor=0x7f76f8004a00, work_item=0x7f76f80062d0) at action-executor.c:506
 #9  0x0000000000493ff5 in action_executor_thread (_data=0x7f76f8004a00) at action-executor.c:559
 #10 0x000000000047b303 in launch_thread (data=0x7f76f8004aa0) at thread.c:66
 #11 0x00007f77052c5609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #12 0x00007f77051cc103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7f77017d0700 (LWP 240463)):
 #0  __lll_lock_wait (futex=futex@entry=0x7f76f8004e30, private=0) at lowlevellock.c:52
 #1  0x00007f77052c80a3 in __GI___pthread_mutex_lock (mutex=0x7f76f8004e30) at ../nptl/pthread_mutex_lock.c:80
 #2  0x0000000000463080 in handle_notification_thread_command (handle=0x1ad12f0, state=0x7f77017cfb00) at notification-thread-events.c:2936
 #3  0x000000000045e881 in thread_notification (data=0x1ad12f0) at notification-thread.c:705
 #4  0x000000000047b303 in launch_thread (data=0x1ad1420) at thread.c:66
 #5  0x00007f77052c5609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #6  0x00007f77051cc103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Cause
=====

The action executor holds the client lock across the communication to
prevent simultaneous update to the client state.

The notification thread holds the cmd_queue lock across operation for no
apparent reason (TODO make sure there is no internal add to the queue. if
so we should reacquire the lock only when necessery.)

Solution
========

Reduce the windows for which the cmd_queue lock is held by the
notification thread to only the "pop" action on the queue. As soon as we
have the lock, get the cmd, remove it from the list and release the
lock. This prevent inverted lock acquisition base on the pattern of the
action executor thread.

Signed-off-by: Jonathan Rajotte <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: I91d30c134bc1a128c96058f0e0cdd325808c91bc
frdeso pushed a commit to frdeso/lttng-tools that referenced this pull request Aug 17, 2020
Observed issue
==============

Deadlock between the notification thread and the action executor thread.

Thread 5 holds cmd_queue.lock and request the client lock.
Thread 6 holds the client lock and request the cmd_queue lock.

Thread 5 have little value in holding the queue lock considering it effectively to a "pop" of the cmd_queue.

Thread 9 is waiting on the cmd_queue lock but does not hold any other
locks and thus not part of the deadlock but is a casualties of this
deadlock and leave a client "hanging".

Other threads are all in their respective waiting state.

Thread 9 (Thread 0x7f76f2ffd700 (LWP 240467)):
 #0  __lll_lock_wait (futex=futex@entry=0x1ad1308, private=0) at lowlevellock.c:52                                                                                                                                                 [1070/1123]
 #1  0x00007f77052c80a3 in __GI___pthread_mutex_lock (mutex=0x1ad1308) at ../nptl/pthread_mutex_lock.c:80
 #2  0x00000000004611dd in run_command_wait (handle=0x1ad12f0, cmd=0x7f76f2fe31e0) at notification-thread-commands.c:31
 lttng#3  0x000000000046143a in notification_thread_command_unregister_trigger (handle=0x1ad12f0, trigger=0x7f76e4000ef0) at notification-thread-commands.c:148
 lttng#4  0x00000000004444af in cmd_unregister_trigger (cmd_ctx=0x7f76e4000d40, sock=68, notification_thread=0x1ad12f0) at cmd.c:4618
 lttng#5  0x0000000000483d23 in process_client_msg (cmd_ctx=0x7f76e4000d40, sock=0x7f76f2ffcba4, sock_error=0x7f76f2ffcb90) at client.c:2001
 lttng#6  0x000000000047f00b in thread_manage_clients (data=0x1ad1a80) at client.c:2402
 lttng#7  0x000000000047b303 in launch_thread (data=0x1ad1af0) at thread.c:66
 lttng#8  0x00007f77052c5609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 lttng#9  0x00007f77051cc103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7f7700fcf700 (LWP 240464)):
 #0  __lll_lock_wait (futex=futex@entry=0x1ad1308, private=0) at lowlevellock.c:52
 #1  0x00007f77052c80a3 in __GI___pthread_mutex_lock (mutex=0x1ad1308) at ../nptl/pthread_mutex_lock.c:80
 #2  0x0000000000461bf2 in run_command_no_wait (handle=0x1ad12f0, in_cmd=0x7f7700fce340) at notification-thread-commands.c:87
 lttng#3  0x0000000000461b93 in notification_thread_client_communication_update (handle=0x1ad12f0, id=1, transmission_status=CLIENT_TRANSMISSION_STATUS_QUEUED) at notification-thread-commands.c:400
 lttng#4  0x0000000000497658 in client_handle_transmission_status (client=0x7f76f8004e30, status=CLIENT_TRANSMISSION_STATUS_QUEUED, user_data=0x7f76f8004a00) at action-executor.c:154
 lttng#5  0x0000000000467be7 in notification_client_list_send_evaluation (client_list=0x7f76f8004fe0, condition=0x7f76e40041a0, evaluation=0x7f76cc000cc0, trigger_creds=0x7f76e4004288, source_object_creds=0x0, client_report=0x4971a0 <client_ha
 ndle_transmission_status>, user_data=0x7f76f8004a00) at notification-thread-events.c:4007
 lttng#6  0x00000000004956bb in action_executor_notify_handler (executor=0x7f76f8004a00, work_item=0x7f76f80062d0, action=0x7f76e4004210) at action-executor.c:199
 lttng#7  0x00000000004953fd in action_executor_generic_handler (executor=0x7f76f8004a00, work_item=0x7f76f80062d0, action=0x7f76e4004210) at action-executor.c:493
 lttng#8  0x0000000000495101 in action_work_item_execute (executor=0x7f76f8004a00, work_item=0x7f76f80062d0) at action-executor.c:506
 lttng#9  0x0000000000493ff5 in action_executor_thread (_data=0x7f76f8004a00) at action-executor.c:559
 lttng#10 0x000000000047b303 in launch_thread (data=0x7f76f8004aa0) at thread.c:66
 lttng#11 0x00007f77052c5609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 lttng#12 0x00007f77051cc103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7f77017d0700 (LWP 240463)):
 #0  __lll_lock_wait (futex=futex@entry=0x7f76f8004e30, private=0) at lowlevellock.c:52
 #1  0x00007f77052c80a3 in __GI___pthread_mutex_lock (mutex=0x7f76f8004e30) at ../nptl/pthread_mutex_lock.c:80
 #2  0x0000000000463080 in handle_notification_thread_command (handle=0x1ad12f0, state=0x7f77017cfb00) at notification-thread-events.c:2936
 lttng#3  0x000000000045e881 in thread_notification (data=0x1ad12f0) at notification-thread.c:705
 lttng#4  0x000000000047b303 in launch_thread (data=0x1ad1420) at thread.c:66
 lttng#5  0x00007f77052c5609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 lttng#6  0x00007f77051cc103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Cause
=====

The action executor holds the client lock across the communication to
prevent simultaneous update to the client state.

The notification thread holds the cmd_queue lock across operation for no
apparent reason (TODO make sure there is no internal add to the queue. if
so we should reacquire the lock only when necessery.)

Solution
========

Reduce the windows for which the cmd_queue lock is held by the
notification thread to only the "pop" action on the queue. As soon as we
have the lock, get the cmd, remove it from the list and release the
lock. This prevent inverted lock acquisition base on the pattern of the
action executor thread.

Signed-off-by: Jonathan Rajotte <[email protected]>
Change-Id: I91d30c134bc1a128c96058f0e0cdd325808c91bc
jgalar added a commit that referenced this pull request Oct 16, 2020
Observed issue
==============

The clear tests occasionally fail with the following babeltrace error
when a live session is stopped following a "clear". Unfortunately, this
problem only seems to occur on certain machines. In my case, I only
managed to reproduce this on the CI's workers.

  10-07 12:39:48.333  7679  7679 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER [email protected]:1610 [lttng-live] Received get_data_packet response: error
  10-07 12:39:48.333  7679  7679 E PLUGIN/CTF/MSG-ITER [email protected]:563 [lttng-live] User function failed: status=ERROR
  10-07 12:39:48.333  7679  7679 E PLUGIN/CTF/MSG-ITER [email protected]:2899 [lttng-live] Cannot handle state: msg-it-addr=0x5603c28e2830, state=DSCOPE_TRACE_PACKET_HEADER_BEGIN
  10-07 12:39:48.333  7679  7679 E PLUGIN/SRC.CTF.LTTNG-LIVE lttng_live_iterator_next_handle_one_active_data_stream@lttng-live.c:845 [lttng-live] CTF message iterator failed to get next message: msg-iter=0x5603c28e2830, msg-iter-status=ERROR
  10-07 12:39:48.333  7679  7679 E PLUGIN/SRC.CTF.LTTNG-LIVE [email protected]:1665 [lttng-live] Error preparing the next batch of messages: live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR
  10-07 12:39:48.333  7679  7679 W LIB/MSG-ITER [email protected]:864 Component input port message iterator's "next" method failed: iter-addr=0x5603c28cb0f0, iter-upstream-comp-name="lttng-live", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=SOURCE, iter-upstream-comp-class-name="lttng-live", iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
  10-07 12:39:48.333  7679  7679 E PLUGIN/FLT.UTILS.MUXER [email protected]:454 [muxer] Upstream iterator's next method returned an error: status=ERROR
  10-07 12:39:48.333  7679  7679 E PLUGIN/FLT.UTILS.MUXER [email protected]:991 [muxer] Cannot validate muxer's upstream message iterator wrapper: muxer-msg-iter-addr=0x5603c28dbe70, muxer-upstream-msg-iter-wrap-addr=0x5603c28cd0f0
  10-07 12:39:48.333  7679  7679 E PLUGIN/FLT.UTILS.MUXER [email protected]:1415 [muxer] Cannot get next message: comp-addr=0x5603c28dc960, muxer-comp-addr=0x5603c28db0a0, muxer-msg-iter-addr=0x5603c28dbe70, msg-iter-addr=0x5603c28caf80, status=ERROR
  10-07 12:39:48.333  7679  7679 W LIB/MSG-ITER [email protected]:864 Component input port message iterator's "next" method failed: iter-addr=0x5603c28caf80, iter-upstream-comp-name="muxer", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=FILTER, iter-upstream-comp-class-name="muxer", iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
  10-07 12:39:48.333  7679  7679 W LIB/GRAPH [email protected]:473 Component's "consume" method failed: status=ERROR, comp-addr=0x5603c28dcb60, comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK, comp-class-name="pretty", comp-class-partial-descr="Pretty-print messages (`text` fo", comp-class-is-frozen=0, comp-class-so-handle-addr=0x5603c28c8140, comp-class-so-handle-path="/home/jenkins/jgalar-debug/build/usr/lib/babeltrace2/plugins/babeltrace-plugin-text.so", comp-input-port-count=1, comp-output-port-count=0
  10-07 12:39:48.333  7679  7679 E CLI [email protected]:2548 Graph failed to complete successfully
  10-07 12:39:48.333  7679  7679 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER [email protected]:1227 [lttng-live] Unknown detach return code 0

  ERROR:    [Babeltrace CLI] (babeltrace2.c:2548)
    Graph failed to complete successfully
  CAUSED BY [libbabeltrace2] (graph.c:473)
    Component's "consume" method failed: status=ERROR, comp-addr=0x5603c28dcb60,
    comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK,
    comp-class-name="pretty", comp-class-partial-descr="Pretty-print messages
    (`text` fo", comp-class-is-frozen=0, comp-class-so-handle-addr=0x5603c28c8140,
    comp-class-so-handle-path="/home/jenkins/jgalar-debug/build/usr/lib/babeltrace2/plugins/babeltrace-plugin-text.so",
    comp-input-port-count=1, comp-output-port-count=0
  CAUSED BY [libbabeltrace2] (iterator.c:864)
    Component input port message iterator's "next" method failed:
    iter-addr=0x5603c28caf80, iter-upstream-comp-name="muxer",
    iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=FILTER,
    iter-upstream-comp-class-name="muxer",
    iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu",
    iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
  CAUSED BY [muxer: 'filter.utils.muxer'] (muxer.c:991)
    Cannot validate muxer's upstream message iterator wrapper:
    muxer-msg-iter-addr=0x5603c28dbe70,
    muxer-upstream-msg-iter-wrap-addr=0x5603c28cd0f0
  CAUSED BY [muxer: 'filter.utils.muxer'] (muxer.c:454)
    Upstream iterator's next method returned an error: status=ERROR
  CAUSED BY [libbabeltrace2] (iterator.c:864)
    Component input port message iterator's "next" method failed:
    iter-addr=0x5603c28cb0f0, iter-upstream-comp-name="lttng-live",
    iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=SOURCE,
    iter-upstream-comp-class-name="lttng-live",
    iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon",
    iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
  CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (lttng-live.c:1665)
    Error preparing the next batch of messages:
    live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR
  CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (lttng-live.c:845)
    CTF message iterator failed to get next message: msg-iter=0x5603c28e2830,
    msg-iter-status=ERROR
  CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (msg-iter.c:2899)
    Cannot handle state: msg-it-addr=0x5603c28e2830,
    state=DSCOPE_TRACE_PACKET_HEADER_BEGIN
  CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (msg-iter.c:563)
    User function failed: status=ERROR
  CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (viewer-connection.c:1610)
    Received get_data_packet response: error

This occurs immediately following a 'stop' on the session. As the error
indicates, a request to obtain a data packet fails with a generic
error reply.

Moreover, the following LTTNG_VIEWER_DETACH_SESSION appears to fail
with an invalid status code. This is addressed in a different commit.

Reproducing the test's failure without redirecting the relay daemon's
allows us to see the following errors after the first stop:
  PERROR - 14:33:44.929675253 [25108/25115]: Failed to open fs handle to ust/uid/1001/64-bit/index/chan_0.idx, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker.c:550)
  PERROR - 14:33:45.030037417 [25108/25115]: Failed to open fs handle to ust/uid/1001/64-bit/index/chan_0.idx, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker.c:550)
  PERROR - 14:33:45.130429370 [25108/25115]: Failed to open fs handle to ust/uid/1001/64-bit/index/chan_0.idx, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker.c:550)
  PERROR - 14:33:45.230829447 [25108/25115]: Failed to open fs handle to ust/uid/1001/64-bit/index/chan_0.idx, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker.c:550)
  PERROR - 14:33:45.331223320 [25108/25115]: Failed to open fs handle to ust/uid/1001/64-bit/index/chan_0.idx, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker.c:550)

This is produced with the following back-trace:
  (gdb) bt
  #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
  #1  0x00007ffff69648b1 in __GI_abort () at abort.c:79
  #2  0x00005555555b4f1f in fd_tracker_open_fs_handle (tracker=0x55555582c620, directory=0x7fffe8006680,
      path=0x7ffff0a25870 "ust/uid/1001/64-bit/index/chan_1.idx", flags=0, mode=0x7ffff0a24508) at fd-tracker.c:550
  #3  0x0000555555595c34 in _lttng_trace_chunk_open_fs_handle_locked (chunk=0x7fffe0002130, file_path=0x7ffff0a25870 "ust/uid/1001/64-bit/index/chan_1.idx",
      flags=0, mode=432, out_handle=0x7ffff0a24710, expect_no_file=true) at trace-chunk.c:1388
  #4  0x0000555555595eef in lttng_trace_chunk_open_fs_handle (chunk=0x7fffe0002130, file_path=0x7ffff0a25870 "ust/uid/1001/64-bit/index/chan_1.idx", flags=0,
      mode=432, out_handle=0x7ffff0a24710, expect_no_file=true) at trace-chunk.c:1433
  #5  0x00005555555da6c2 in _lttng_index_file_create_from_trace_chunk (chunk=0x7fffe0002130, channel_path=0x7fffe8018c30 "ust/uid/1001/64-bit",
      stream_name=0x7fffe8018c10 "chan_1", stream_file_size=0, stream_file_index=0, index_major=1, index_minor=1, unlink_existing_file=false, flags=0,
      expect_no_file=true, file=0x7fffe0002270) at index.c:97
  #6  0x00005555555dad8a in lttng_index_file_create_from_trace_chunk_read_only (chunk=0x7fffe0002130, channel_path=0x7fffe8018c30 "ust/uid/1001/64-bit",
      stream_name=0x7fffe8018c10 "chan_1", stream_file_size=0, stream_file_index=0, index_major=1, index_minor=1, expect_no_file=true, file=0x7fffe0002270)
      at index.c:186
  #7  0x000055555557640f in try_open_index (vstream=0x7fffe0002250, rstream=0x7fffe8018c50) at live.c:1378
  #8  0x0000555555577155 in viewer_get_next_index (conn=0x7fffd4001440) at live.c:1643
  #9  0x0000555555579a01 in process_control (recv_hdr=0x7ffff0a27c30, conn=0x7fffd4001440) at live.c:2311
  #10 0x000055555557a1db in thread_worker (data=0x0) at live.c:2482
  #11 0x00007ffff6d1c6db in start_thread (arg=0x7ffff0a28700) at pthread_create.c:463
  #12 0x00007ffff6a45a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

That problem is mostly cosmetic in nature (the open can fail
"legitimately") as the PERROR should simply not be printed and is
addressed in a different commit.

This error is also produced after a 'clear' is issued:
  PERROR - 14:33:45.532782268 [25108/25115]: Failed to read from file system handle of viewer stream id 1, offset: 4096: No such file or directory (in viewer_get_packet() at live.c:1849)

Which is produced with the following back-trace:
  #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
  #1  0x00007f53e297c8b1 in __GI_abort () at abort.c:79
  #2  0x000055dd77ccef2c in viewer_get_packet (conn=0x7f53c4001100) at live.c:1850
  #3  0x000055dd77cd0a15 in process_control (recv_hdr=0x7f53dca3fc30, conn=0x7f53c4001100) at live.c:2315
  #4  0x000055dd77cd11db in thread_worker (data=0x0) at live.c:2483
  #5  0x00007f53e2d346db in start_thread (arg=0x7f53dca40700) at pthread_create.c:463
  #6  0x00007f53e2a5da3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

A similar problem occurs, although more rarely, when reading an
index entry in viewer_get_next_index().

Cause
=====

The following situation leads to both failures to get a
packet and failures to get the next index:
  - Viewer connects to an existing session,
  - Viewer consumes a number of packets, alternating the
    GET_NEXT_INDEX and GET_PACKET command,
  - The session's streams are rotated to a new trace chunk
    (as part of a clear),
  - The session is started and stopped, causing new packets
    to be produced and received,
  - The session is stopped and destroyed, causing the session's
    streams to rotate into a "null" trace chunk (no active
    trace files),
  - Viewer issues GET_NEXT_INDEX or GET_PACKET, but the fact
    that a rotation occurred on the receiving end is not detected
    as the relay streams' trace chunk are "null".

The crux of the problem is that lttng_trace_chunk_ids_equal() is
bypassed when the current trace chunk of a relay stream is "null".

The rationale for skipping this check is that it is assumed that the
files currently opened by the live server can can still be used even
if the consumer has rotated the corresponding streams into a 'null'
trace chunk, meaning no trace chunk is 'set' for those streams.

This makes sense in one scenario: the session was destroyed and we wish
to allow a connected live client to finish consuming the trace packets
up to the end of the session's lifetime.

Here, the situation is different. The viewer is reading chunk 'A'.
Meanwhile, a rotation occurs into chunk 'B' and packets are received for
chunk 'B'. Then, a rotation to a 'null' chunk (no active chunk) occurs.

In essence, the live server never sees the rotation between chunk 'A'
and 'B', and simply assumes that a rotation from 'A' to 'null' occurred,
as would happen at the end of a session.

In terms of the code, in viewer_get_next_index(), a call to
check_index_status() is performed to determine if an index is available.
The function checks that `index_received_seqcount` is greater than
`index_sent_seqcount`. In that case, it determines that an index must be
available.

Unfortunately, there is no way for the live server to determine that the
remaining indexes are in a chunk that doesn't exist anymore (chunk 'B').
Thus, viewer_get_next_index() attempts to read an index entry from the
current index file and fails.

Solution
========

1) lttng_trace_chunk_ids_equal() is modified to properly handle
'null' trace chunks:
  - A null and a non-null trace chunk are not equal,
  - Two null trace chunks are equal.

2) Rotation count
  A rotation counter is introduced to track the number of rotations
  that occurred during a relay stream's lifetime. This counter is
  sampled by the matching viewer streams on creation and on rotation
  and is used to determine if all rotations were "seen" by the viewer
  stream.

  Hence, this allows us to handle the special case where a viewer
  is consuming the contents of a relay stream that just transitioned
  into a 'null' trace chunk (see comments in patch).

The rest of the modifications simply allow the live server to handle
null trace chunks in viewer streams. This fixes another unrelated bug
that I observed while investigating this: sessions that don't have an
active trace chunk are not shown when listing sessions with babeltrace.

To reproduce, simply stop, clear a session, and attempt to list the
sessions of the associated relay daemon.

Known drawbacks
===============

None.

Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: Ibb3116990e34b7ec3b477f3482d0c0ff1e848d09
jgalar pushed a commit that referenced this pull request Apr 13, 2021
When running

  $ lttng add-trigger --condition on-event -u ust_tests_demo2:loop --capture intfield --action notify

I get the leaks pasted below. It seems like filter_parser_ctx_free
doesn't free everything in filter_parser_ctx. Add what's missing.
Re-order the frees so that they are in the same order as the members of
the struct, just because it's easier to follow and make sure we didn't
forget anything.

=================================================================
==1073803==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 128 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff767783a in __interceptor_realloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:164
    #1 0x5555556833be in bytecode_reserve /home/simark/src/lttng-tools/src/common/bytecode/bytecode.c:59
    #2 0x55555568360f in bytecode_push /home/simark/src/lttng-tools/src/common/bytecode/bytecode.c:79
    #3 0x5555556a3d61 in filter_visitor_bytecode_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:667
    #4 0x55555569c9b1 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:394
    #5 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #6 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #7 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #8 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #9 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #10 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #11 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #12 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Direct leak of 112 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff767783a in __interceptor_realloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:164
    #1 0x5555556833be in bytecode_reserve /home/simark/src/lttng-tools/src/common/bytecode/bytecode.c:59
    #2 0x55555568360f in bytecode_push /home/simark/src/lttng-tools/src/common/bytecode/bytecode.c:79
    #3 0x5555556a1b94 in visit_node_load_expression_legacy /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:198
    #4 0x5555556a1d18 in visit_node_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:231
    #5 0x5555556a2540 in visit_node_load /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:399
    #6 0x5555556a3a8b in recursive_visit_gen_bytecode /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:622
    #7 0x5555556a12fa in visit_node_root /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:53
    #8 0x5555556a3a76 in recursive_visit_gen_bytecode /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:620
    #9 0x5555556a3c55 in filter_visitor_bytecode_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:661
    #10 0x55555569c9b1 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:394
    #11 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #12 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #13 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #14 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #15 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #16 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #17 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #18 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Direct leak of 40 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x5555556a3dd2 in make_op_root /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:35
    #2 0x5555556a73a5 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:874
    #3 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
    #4 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
    #5 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #6 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #7 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #8 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #9 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #10 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #11 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #12 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Indirect leak of 40 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x5555556a4f1d in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:280
    #2 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
    #3 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
    #4 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
    #5 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
    #6 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
    #7 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #8 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #9 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #10 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #11 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #12 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #13 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #14 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Indirect leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x5555556a484d in create_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:201
    #2 0x5555556a5040 in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:287
    #3 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
    #4 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
    #5 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
    #6 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
    #7 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
    #8 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #9 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #10 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #11 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #12 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #13 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #14 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #15 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Indirect leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x5555556a4e64 in create_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:262
    #2 0x5555556a5040 in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:287
    #3 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
    #4 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
    #5 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
    #6 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
    #7 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
    #8 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #9 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #10 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #11 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #12 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #13 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #14 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #15 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Indirect leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x5555556a4bbc in create_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:233
    #2 0x5555556a5040 in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:287
    #3 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
    #4 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
    #5 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
    #6 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
    #7 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
    #8 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #9 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #10 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #11 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #12 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #13 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #14 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #15 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Indirect leak of 9 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff761fa69 in __interceptor_strdup /build/gcc/src/gcc/libsanitizer/asan/asan_interceptors.cpp:452
    #1 0x5555556a4c41 in create_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:238
    #2 0x5555556a5040 in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:287
    #3 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
    #4 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
    #5 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
    #6 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
    #7 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
    #8 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #9 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #10 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #11 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #12 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #13 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #14 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #15 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Indirect leak of 8 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x5555556a4829 in create_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:196
    #2 0x5555556a5040 in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:287
    #3 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
    #4 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
    #5 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
    #6 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
    #7 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
    #8 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #9 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #10 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #11 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #12 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #13 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #14 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #15 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

SUMMARY: AddressSanitizer: 409 byte(s) leaked in 9 allocation(s).

Change-Id: I04f9eb5ab7b18ae4ffdf7a49842768a6fdae5dbc
Signed-off-by: Simon Marchi <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
jgalar pushed a commit that referenced this pull request Apr 23, 2021
Observed issue
==============

A dead lock is observed during the start-stop test suite for triggers.

Cause
=====

A start session action is executed by the action executor, the
`cmd_start_trace` function is called and effectively holds the
`session_list_lock.`. During `cmd_start_trace` a call to
`notification_thread_command_add_channel` is performed to inform the
notification thread of the new channel presence.

At the same time, a tracer event notification is received by the
notification thread. The actions are queued up and the sample of the
session id take place and a call to `session_lock_list` is performed and
blocks on the lock operation.

The notification thread wait on the `session_list_lock` and the
`session_list_lock` holder, the action executor, waits on the completion
of a command the be run by the notification thread: deadlock.

The backtrace:

 Thread 6 (Thread 0x7f831c8a6700 (LWP 3046458)):
 #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
 #1  0x000000000053b852 in futex (uaddr=0x7f831c8a45e0, op=0, val=0, timeout=0x0, uaddr2=0x0, val3=0) at /home/joraj/lttng/master/install/include/urcu/futex.h:72
 #2  0x000000000053b4f9 in futex_noasync (uaddr=0x7f831c8a45e0, op=0, val=0, timeout=0x0, uaddr2=0x0, val3=0) at /home/joraj/lttng/master/install/include/urcu/futex.h:81
 #3  0x000000000053af10 in lttng_waiter_wait (waiter=0x7f831c8a45d8) at waiter.c:55
 #4  0x000000000046b0f2 in run_command_wait (handle=0xe60520, cmd=0x7f831c8a4588) at notification-thread-commands.c:49
 #5  0x000000000046b270 in notification_thread_command_add_channel (handle=0xe60520, session_name=0x7f8300006c30 "my_triggered_session", uid=1000, gid=1000, channel_name=0x7f82dc00be04 "channel0", key=1, domain=LTTNG_DOMAIN_UST, capacity=2097152) at notification-thread-commands.c:184
 #6  0x00000000004c7f65 in create_channel_per_uid (app=0x7f82d8000bf0, usess=0x7f8300000bb0, ua_sess=0x7f82dc002600, ua_chan=0x7f82dc00bde0) at ust-app.c:3360
 #7  0x00000000004c6f98 in ust_app_channel_send (app=0x7f82d8000bf0, usess=0x7f8300000bb0, ua_sess=0x7f82dc002600, ua_chan=0x7f82dc00bde0) at ust-app.c:3514
 #8  0x00000000004c6bde in ust_app_channel_create (usess=0x7f8300000bb0, ua_sess=0x7f82dc002600, uchan=0x7f8300005a90, app=0x7f82d8000bf0, _ua_chan=0x7f831c8a48b0) at ust-app.c:4771
 #9  0x00000000004c6968 in find_or_create_ust_app_channel (usess=0x7f8300000bb0, ua_sess=0x7f82dc002600, app=0x7f82d8000bf0, uchan=0x7f8300005a90, ua_chan=0x7f831c8a48b0) at ust-app.c:5610
 #10 0x00000000004c4f09 in ust_app_synchronize_all_channels (usess=0x7f8300000bb0, ua_sess=0x7f82dc002600, app=0x7f82d8000bf0) at ust-app.c:5820
 #11 0x00000000004b958c in ust_app_synchronize (usess=0x7f8300000bb0, app=0x7f82d8000bf0) at ust-app.c:5886
 #12 0x00000000004b8500 in ust_app_global_update (usess=0x7f8300000bb0, app=0x7f82d8000bf0) at ust-app.c:5960
 #13 0x00000000004b7ec2 in ust_app_start_trace_all (usess=0x7f8300000bb0) at ust-app.c:5520
 #14 0x0000000000444e86 in cmd_start_trace (session=0x7f8300006c30) at cmd.c:2707
 #15 0x00000000004a5af9 in action_executor_start_session_handler (executor=0x7f8314004410, work_item=0x7f8314005100, item=0x7f83140050b0) at action-executor.c:342
 #16 0x00000000004a537f in action_executor_generic_handler (executor=0x7f8314004410, work_item=0x7f8314005100, item=0x7f83140050b0) at action-executor.c:696
 #17 0x00000000004a4dbc in action_work_item_execute (executor=0x7f8314004410, work_item=0x7f8314005100) at action-executor.c:715
 #18 0x00000000004a37e6 in action_executor_thread (_data=0x7f8314004410) at action-executor.c:797
 #19 0x0000000000486193 in launch_thread (data=0x7f83140044b0) at thread.c:66
 #20 0x00007f8320b60609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #21 0x00007f8320a87293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 5 (Thread 0x7f831d0a7700 (LWP 3046457)):
 #0  __lll_lock_wait (futex=futex@entry=0x5e1c10 <ltt_session_list>, private=0) at lowlevellock.c:52
 #1  0x00007f8320b630a3 in __GI___pthread_mutex_lock (mutex=0x5e1c10 <ltt_session_list>) at ../nptl/pthread_mutex_lock.c:80
 #2  0x00000000004378c3 in session_lock_list () at session.c:156
 #3  0x00000000004a871c in add_action_to_subitem_array (action=0x7f830001a730, subitems=0x7f83140051d0) at action-executor.c:1081
 #4  0x00000000004a8578 in add_action_to_subitem_array (action=0x7f830001a620, subitems=0x7f83140051d0) at action-executor.c:1025
 #5  0x00000000004a4922 in populate_subitem_array_from_trigger (trigger=0x7f830001a950, subitems=0x7f83140051d0) at action-executor.c:1116
 #6  0x00000000004a416e in action_executor_enqueue_trigger (executor=0x7f8314004410, trigger=0x7f830001a950, evaluation=0x7f8314005190, object_creds=0x0, client_list=0x7f8314004980) at action-executor.c:924
 #7  0x0000000000479481 in dispatch_one_event_notifier_notification (state=0x7f831d0a63e8, notification=0x7f8314005160) at notification-thread-events.c:4613
 #8  0x0000000000472324 in handle_one_event_notifier_notification (state=0x7f831d0a63e8, pipe=65, domain=LTTNG_DOMAIN_UST) at notification-thread-events.c:4702
 #9  0x0000000000472271 in handle_notification_thread_event_notification (state=0x7f831d0a63e8, pipe=65, domain=LTTNG_DOMAIN_UST) at notification-thread-events.c:4717
 #10 0x00000000004695a3 in handle_event_notification_pipe (event_source_fd=65, domain=LTTNG_DOMAIN_UST, revents=1, state=0x7f831d0a63e8) at notification-thread.c:591
 #11 0x000000000046849b in thread_notification (data=0xe60520) at notification-thread.c:727
 #12 0x0000000000486193 in launch_thread (data=0xe60610) at thread.c:66
 #13 0x00007f8320b60609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #14 0x00007f8320a87293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Solution
========

Instead of using session_find_by_name() which requires the
`session_list_lock`, we introduce `sample_session_id_by_name` that uses
a urcu backed data structure. This allows the sampling of the session
id without holding the session list lock. We accept the small window
where a session object is still accessible but concretely not valid
since the actual execution context will be validated at the moment of
execution. The execution side already handles the possibility that the
session is removed at that point or is not the same session. The
execution side acquires the session_list_lock for validation.

Known drawbacks
=========

None

Signed-off-by: Jonathan Rajotte <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: I5ad2c57acc0d03d2814dda59f8ecf2d831fd961e
jgalar pushed a commit that referenced this pull request May 11, 2021
When running

  $ lttng add-trigger --condition on-event -u ust_tests_demo2:loop --capture intfield --action notify

I get the leaks pasted below. It seems like filter_parser_ctx_free
doesn't free everything in filter_parser_ctx. Add what's missing.
Re-order the frees so that they are in the same order as the members of
the struct, just because it's easier to follow and make sure we didn't
forget anything.

=================================================================
==1073803==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 128 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff767783a in __interceptor_realloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:164
    #1 0x5555556833be in bytecode_reserve /home/simark/src/lttng-tools/src/common/bytecode/bytecode.c:59
    #2 0x55555568360f in bytecode_push /home/simark/src/lttng-tools/src/common/bytecode/bytecode.c:79
    #3 0x5555556a3d61 in filter_visitor_bytecode_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:667
    #4 0x55555569c9b1 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:394
    #5 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #6 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #7 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #8 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #9 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #10 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #11 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #12 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Direct leak of 112 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff767783a in __interceptor_realloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:164
    #1 0x5555556833be in bytecode_reserve /home/simark/src/lttng-tools/src/common/bytecode/bytecode.c:59
    #2 0x55555568360f in bytecode_push /home/simark/src/lttng-tools/src/common/bytecode/bytecode.c:79
    #3 0x5555556a1b94 in visit_node_load_expression_legacy /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:198
    #4 0x5555556a1d18 in visit_node_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:231
    #5 0x5555556a2540 in visit_node_load /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:399
    #6 0x5555556a3a8b in recursive_visit_gen_bytecode /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:622
    #7 0x5555556a12fa in visit_node_root /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:53
    #8 0x5555556a3a76 in recursive_visit_gen_bytecode /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:620
    #9 0x5555556a3c55 in filter_visitor_bytecode_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:661
    #10 0x55555569c9b1 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:394
    #11 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #12 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #13 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #14 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #15 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #16 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #17 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #18 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Direct leak of 40 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x5555556a3dd2 in make_op_root /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:35
    #2 0x5555556a73a5 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:874
    #3 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
    #4 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
    #5 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #6 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #7 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #8 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #9 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #10 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #11 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #12 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Indirect leak of 40 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x5555556a4f1d in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:280
    #2 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
    #3 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
    #4 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
    #5 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
    #6 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
    #7 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #8 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #9 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #10 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #11 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #12 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #13 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #14 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Indirect leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x5555556a484d in create_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:201
    #2 0x5555556a5040 in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:287
    #3 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
    #4 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
    #5 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
    #6 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
    #7 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
    #8 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #9 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #10 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #11 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #12 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #13 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #14 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #15 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Indirect leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x5555556a4e64 in create_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:262
    #2 0x5555556a5040 in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:287
    #3 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
    #4 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
    #5 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
    #6 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
    #7 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
    #8 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #9 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #10 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #11 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #12 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #13 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #14 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #15 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Indirect leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x5555556a4bbc in create_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:233
    #2 0x5555556a5040 in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:287
    #3 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
    #4 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
    #5 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
    #6 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
    #7 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
    #8 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #9 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #10 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #11 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #12 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #13 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #14 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #15 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Indirect leak of 9 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff761fa69 in __interceptor_strdup /build/gcc/src/gcc/libsanitizer/asan/asan_interceptors.cpp:452
    #1 0x5555556a4c41 in create_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:238
    #2 0x5555556a5040 in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:287
    #3 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
    #4 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
    #5 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
    #6 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
    #7 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
    #8 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #9 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #10 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #11 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #12 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #13 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #14 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #15 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Indirect leak of 8 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x5555556a4829 in create_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:196
    #2 0x5555556a5040 in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:287
    #3 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
    #4 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
    #5 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
    #6 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
    #7 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
    #8 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #9 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #10 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #11 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #12 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #13 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #14 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #15 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

SUMMARY: AddressSanitizer: 409 byte(s) leaked in 9 allocation(s).

Signed-off-by: Simon Marchi <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: Ic338ea1689d3f002bf9cade6d4f23e62d935968b
jgalar pushed a commit that referenced this pull request May 11, 2021
When running

  $ lttng add-trigger --condition on-event -u ust_tests_demo2:loop --capture intfield --action notify

I get the leaks pasted below. It seems like filter_parser_ctx_free
doesn't free everything in filter_parser_ctx. Add what's missing.
Re-order the frees so that they are in the same order as the members of
the struct, just because it's easier to follow and make sure we didn't
forget anything.

=================================================================
==1073803==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 128 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff767783a in __interceptor_realloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:164
    #1 0x5555556833be in bytecode_reserve /home/simark/src/lttng-tools/src/common/bytecode/bytecode.c:59
    #2 0x55555568360f in bytecode_push /home/simark/src/lttng-tools/src/common/bytecode/bytecode.c:79
    #3 0x5555556a3d61 in filter_visitor_bytecode_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:667
    #4 0x55555569c9b1 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:394
    #5 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #6 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #7 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #8 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #9 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #10 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #11 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #12 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Direct leak of 112 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff767783a in __interceptor_realloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:164
    #1 0x5555556833be in bytecode_reserve /home/simark/src/lttng-tools/src/common/bytecode/bytecode.c:59
    #2 0x55555568360f in bytecode_push /home/simark/src/lttng-tools/src/common/bytecode/bytecode.c:79
    #3 0x5555556a1b94 in visit_node_load_expression_legacy /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:198
    #4 0x5555556a1d18 in visit_node_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:231
    #5 0x5555556a2540 in visit_node_load /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:399
    #6 0x5555556a3a8b in recursive_visit_gen_bytecode /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:622
    #7 0x5555556a12fa in visit_node_root /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:53
    #8 0x5555556a3a76 in recursive_visit_gen_bytecode /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:620
    #9 0x5555556a3c55 in filter_visitor_bytecode_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:661
    #10 0x55555569c9b1 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:394
    #11 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #12 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #13 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #14 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #15 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #16 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #17 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #18 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Direct leak of 40 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x5555556a3dd2 in make_op_root /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:35
    #2 0x5555556a73a5 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:874
    #3 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
    #4 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
    #5 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #6 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #7 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #8 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #9 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #10 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #11 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #12 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Indirect leak of 40 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x5555556a4f1d in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:280
    #2 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
    #3 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
    #4 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
    #5 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
    #6 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
    #7 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #8 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #9 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #10 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #11 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #12 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #13 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #14 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Indirect leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x5555556a484d in create_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:201
    #2 0x5555556a5040 in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:287
    #3 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
    #4 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
    #5 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
    #6 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
    #7 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
    #8 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #9 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #10 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #11 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #12 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #13 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #14 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #15 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Indirect leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x5555556a4e64 in create_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:262
    #2 0x5555556a5040 in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:287
    #3 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
    #4 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
    #5 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
    #6 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
    #7 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
    #8 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #9 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #10 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #11 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #12 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #13 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #14 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #15 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Indirect leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x5555556a4bbc in create_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:233
    #2 0x5555556a5040 in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:287
    #3 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
    #4 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
    #5 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
    #6 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
    #7 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
    #8 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #9 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #10 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #11 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #12 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #13 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #14 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #15 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Indirect leak of 9 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff761fa69 in __interceptor_strdup /build/gcc/src/gcc/libsanitizer/asan/asan_interceptors.cpp:452
    #1 0x5555556a4c41 in create_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:238
    #2 0x5555556a5040 in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:287
    #3 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
    #4 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
    #5 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
    #6 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
    #7 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
    #8 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #9 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #10 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #11 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #12 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #13 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #14 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #15 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

Indirect leak of 8 byte(s) in 1 object(s) allocated from:
    #0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x5555556a4829 in create_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:196
    #2 0x5555556a5040 in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:287
    #3 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
    #4 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
    #5 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
    #6 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
    #7 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
    #8 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
    #9 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
    #10 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
    #11 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
    #12 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
    #13 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
    #14 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
    #15 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

SUMMARY: AddressSanitizer: 409 byte(s) leaked in 9 allocation(s).

Signed-off-by: Simon Marchi <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: Ic338ea1689d3f002bf9cade6d4f23e62d935968b
jgalar added a commit that referenced this pull request May 13, 2021
Observed issue
==============

The clear tests occasionally fail with the following babeltrace error
when a live session is stopped following a "clear". Unfortunately, this
problem only seems to occur on certain machines. In my case, I only
managed to reproduce this on the CI's workers.

  10-07 12:39:48.333  7679  7679 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER [email protected]:1610 [lttng-live] Received get_data_packet response: error
  10-07 12:39:48.333  7679  7679 E PLUGIN/CTF/MSG-ITER [email protected]:563 [lttng-live] User function failed: status=ERROR
  10-07 12:39:48.333  7679  7679 E PLUGIN/CTF/MSG-ITER [email protected]:2899 [lttng-live] Cannot handle state: msg-it-addr=0x5603c28e2830, state=DSCOPE_TRACE_PACKET_HEADER_BEGIN
  10-07 12:39:48.333  7679  7679 E PLUGIN/SRC.CTF.LTTNG-LIVE lttng_live_iterator_next_handle_one_active_data_stream@lttng-live.c:845 [lttng-live] CTF message iterator failed to get next message: msg-iter=0x5603c28e2830, msg-iter-status=ERROR
  10-07 12:39:48.333  7679  7679 E PLUGIN/SRC.CTF.LTTNG-LIVE [email protected]:1665 [lttng-live] Error preparing the next batch of messages: live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR
  10-07 12:39:48.333  7679  7679 W LIB/MSG-ITER [email protected]:864 Component input port message iterator's "next" method failed: iter-addr=0x5603c28cb0f0, iter-upstream-comp-name="lttng-live", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=SOURCE, iter-upstream-comp-class-name="lttng-live", iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
  10-07 12:39:48.333  7679  7679 E PLUGIN/FLT.UTILS.MUXER [email protected]:454 [muxer] Upstream iterator's next method returned an error: status=ERROR
  10-07 12:39:48.333  7679  7679 E PLUGIN/FLT.UTILS.MUXER [email protected]:991 [muxer] Cannot validate muxer's upstream message iterator wrapper: muxer-msg-iter-addr=0x5603c28dbe70, muxer-upstream-msg-iter-wrap-addr=0x5603c28cd0f0
  10-07 12:39:48.333  7679  7679 E PLUGIN/FLT.UTILS.MUXER [email protected]:1415 [muxer] Cannot get next message: comp-addr=0x5603c28dc960, muxer-comp-addr=0x5603c28db0a0, muxer-msg-iter-addr=0x5603c28dbe70, msg-iter-addr=0x5603c28caf80, status=ERROR
  10-07 12:39:48.333  7679  7679 W LIB/MSG-ITER [email protected]:864 Component input port message iterator's "next" method failed: iter-addr=0x5603c28caf80, iter-upstream-comp-name="muxer", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=FILTER, iter-upstream-comp-class-name="muxer", iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
  10-07 12:39:48.333  7679  7679 W LIB/GRAPH [email protected]:473 Component's "consume" method failed: status=ERROR, comp-addr=0x5603c28dcb60, comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK, comp-class-name="pretty", comp-class-partial-descr="Pretty-print messages (`text` fo", comp-class-is-frozen=0, comp-class-so-handle-addr=0x5603c28c8140, comp-class-so-handle-path="/home/jenkins/jgalar-debug/build/usr/lib/babeltrace2/plugins/babeltrace-plugin-text.so", comp-input-port-count=1, comp-output-port-count=0
  10-07 12:39:48.333  7679  7679 E CLI [email protected]:2548 Graph failed to complete successfully
  10-07 12:39:48.333  7679  7679 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER [email protected]:1227 [lttng-live] Unknown detach return code 0

  ERROR:    [Babeltrace CLI] (babeltrace2.c:2548)
    Graph failed to complete successfully
  CAUSED BY [libbabeltrace2] (graph.c:473)
    Component's "consume" method failed: status=ERROR, comp-addr=0x5603c28dcb60,
    comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK,
    comp-class-name="pretty", comp-class-partial-descr="Pretty-print messages
    (`text` fo", comp-class-is-frozen=0, comp-class-so-handle-addr=0x5603c28c8140,
    comp-class-so-handle-path="/home/jenkins/jgalar-debug/build/usr/lib/babeltrace2/plugins/babeltrace-plugin-text.so",
    comp-input-port-count=1, comp-output-port-count=0
  CAUSED BY [libbabeltrace2] (iterator.c:864)
    Component input port message iterator's "next" method failed:
    iter-addr=0x5603c28caf80, iter-upstream-comp-name="muxer",
    iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=FILTER,
    iter-upstream-comp-class-name="muxer",
    iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu",
    iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
  CAUSED BY [muxer: 'filter.utils.muxer'] (muxer.c:991)
    Cannot validate muxer's upstream message iterator wrapper:
    muxer-msg-iter-addr=0x5603c28dbe70,
    muxer-upstream-msg-iter-wrap-addr=0x5603c28cd0f0
  CAUSED BY [muxer: 'filter.utils.muxer'] (muxer.c:454)
    Upstream iterator's next method returned an error: status=ERROR
  CAUSED BY [libbabeltrace2] (iterator.c:864)
    Component input port message iterator's "next" method failed:
    iter-addr=0x5603c28cb0f0, iter-upstream-comp-name="lttng-live",
    iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=SOURCE,
    iter-upstream-comp-class-name="lttng-live",
    iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon",
    iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
  CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (lttng-live.c:1665)
    Error preparing the next batch of messages:
    live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR
  CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (lttng-live.c:845)
    CTF message iterator failed to get next message: msg-iter=0x5603c28e2830,
    msg-iter-status=ERROR
  CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (msg-iter.c:2899)
    Cannot handle state: msg-it-addr=0x5603c28e2830,
    state=DSCOPE_TRACE_PACKET_HEADER_BEGIN
  CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (msg-iter.c:563)
    User function failed: status=ERROR
  CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (viewer-connection.c:1610)
    Received get_data_packet response: error

This occurs immediately following a 'stop' on the session. As the error
indicates, a request to obtain a data packet fails with a generic
error reply.

Moreover, the following LTTNG_VIEWER_DETACH_SESSION appears to fail
with an invalid status code. This is addressed in a different commit.

Reproducing the test's failure without redirecting the relay daemon's
allows us to see the following errors after the first stop:
  PERROR - 14:33:44.929675253 [25108/25115]: Failed to open fs handle to ust/uid/1001/64-bit/index/chan_0.idx, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker.c:550)
  PERROR - 14:33:45.030037417 [25108/25115]: Failed to open fs handle to ust/uid/1001/64-bit/index/chan_0.idx, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker.c:550)
  PERROR - 14:33:45.130429370 [25108/25115]: Failed to open fs handle to ust/uid/1001/64-bit/index/chan_0.idx, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker.c:550)
  PERROR - 14:33:45.230829447 [25108/25115]: Failed to open fs handle to ust/uid/1001/64-bit/index/chan_0.idx, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker.c:550)
  PERROR - 14:33:45.331223320 [25108/25115]: Failed to open fs handle to ust/uid/1001/64-bit/index/chan_0.idx, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker.c:550)

This is produced with the following back-trace:
  (gdb) bt
  #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
  #1  0x00007ffff69648b1 in __GI_abort () at abort.c:79
  #2  0x00005555555b4f1f in fd_tracker_open_fs_handle (tracker=0x55555582c620, directory=0x7fffe8006680,
      path=0x7ffff0a25870 "ust/uid/1001/64-bit/index/chan_1.idx", flags=0, mode=0x7ffff0a24508) at fd-tracker.c:550
  #3  0x0000555555595c34 in _lttng_trace_chunk_open_fs_handle_locked (chunk=0x7fffe0002130, file_path=0x7ffff0a25870 "ust/uid/1001/64-bit/index/chan_1.idx",
      flags=0, mode=432, out_handle=0x7ffff0a24710, expect_no_file=true) at trace-chunk.c:1388
  #4  0x0000555555595eef in lttng_trace_chunk_open_fs_handle (chunk=0x7fffe0002130, file_path=0x7ffff0a25870 "ust/uid/1001/64-bit/index/chan_1.idx", flags=0,
      mode=432, out_handle=0x7ffff0a24710, expect_no_file=true) at trace-chunk.c:1433
  #5  0x00005555555da6c2 in _lttng_index_file_create_from_trace_chunk (chunk=0x7fffe0002130, channel_path=0x7fffe8018c30 "ust/uid/1001/64-bit",
      stream_name=0x7fffe8018c10 "chan_1", stream_file_size=0, stream_file_index=0, index_major=1, index_minor=1, unlink_existing_file=false, flags=0,
      expect_no_file=true, file=0x7fffe0002270) at index.c:97
  #6  0x00005555555dad8a in lttng_index_file_create_from_trace_chunk_read_only (chunk=0x7fffe0002130, channel_path=0x7fffe8018c30 "ust/uid/1001/64-bit",
      stream_name=0x7fffe8018c10 "chan_1", stream_file_size=0, stream_file_index=0, index_major=1, index_minor=1, expect_no_file=true, file=0x7fffe0002270)
      at index.c:186
  #7  0x000055555557640f in try_open_index (vstream=0x7fffe0002250, rstream=0x7fffe8018c50) at live.c:1378
  #8  0x0000555555577155 in viewer_get_next_index (conn=0x7fffd4001440) at live.c:1643
  #9  0x0000555555579a01 in process_control (recv_hdr=0x7ffff0a27c30, conn=0x7fffd4001440) at live.c:2311
  #10 0x000055555557a1db in thread_worker (data=0x0) at live.c:2482
  #11 0x00007ffff6d1c6db in start_thread (arg=0x7ffff0a28700) at pthread_create.c:463
  #12 0x00007ffff6a45a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

That problem is mostly cosmetic in nature (the open can fail
"legitimately") as the PERROR should simply not be printed and is
addressed in a different commit.

This error is also produced after a 'clear' is issued:
  PERROR - 14:33:45.532782268 [25108/25115]: Failed to read from file system handle of viewer stream id 1, offset: 4096: No such file or directory (in viewer_get_packet() at live.c:1849)

Which is produced with the following back-trace:
  #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
  #1  0x00007f53e297c8b1 in __GI_abort () at abort.c:79
  #2  0x000055dd77ccef2c in viewer_get_packet (conn=0x7f53c4001100) at live.c:1850
  #3  0x000055dd77cd0a15 in process_control (recv_hdr=0x7f53dca3fc30, conn=0x7f53c4001100) at live.c:2315
  #4  0x000055dd77cd11db in thread_worker (data=0x0) at live.c:2483
  #5  0x00007f53e2d346db in start_thread (arg=0x7f53dca40700) at pthread_create.c:463
  #6  0x00007f53e2a5da3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

A similar problem occurs, although more rarely, when reading an
index entry in viewer_get_next_index().

Cause
=====

The following situation leads to both failures to get a
packet and failures to get the next index:
  - Viewer connects to an existing session,
  - Viewer consumes a number of packets, alternating the
    GET_NEXT_INDEX and GET_PACKET command,
  - The session's streams are rotated to a new trace chunk
    (as part of a clear),
  - The session is started and stopped, causing new packets
    to be produced and received,
  - The session is stopped and destroyed, causing the session's
    streams to rotate into a "null" trace chunk (no active
    trace files),
  - Viewer issues GET_NEXT_INDEX or GET_PACKET, but the fact
    that a rotation occurred on the receiving end is not detected
    as the relay streams' trace chunk are "null".

The crux of the problem is that lttng_trace_chunk_ids_equal() is
bypassed when the current trace chunk of a relay stream is "null".

The rationale for skipping this check is that it is assumed that the
files currently opened by the live server can can still be used even
if the consumer has rotated the corresponding streams into a 'null'
trace chunk, meaning no trace chunk is 'set' for those streams.

This makes sense in one scenario: the session was destroyed and we wish
to allow a connected live client to finish consuming the trace packets
up to the end of the session's lifetime.

Here, the situation is different. The viewer is reading chunk 'A'.
Meanwhile, a rotation occurs into chunk 'B' and packets are received for
chunk 'B'. Then, a rotation to a 'null' chunk (no active chunk) occurs.

In essence, the live server never sees the rotation between chunk 'A'
and 'B', and simply assumes that a rotation from 'A' to 'null' occurred,
as would happen at the end of a session.

In terms of the code, in viewer_get_next_index(), a call to
check_index_status() is performed to determine if an index is available.
The function checks that `index_received_seqcount` is greater than
`index_sent_seqcount`. In that case, it determines that an index must be
available.

Unfortunately, there is no way for the live server to determine that the
remaining indexes are in a chunk that doesn't exist anymore (chunk 'B').
Thus, viewer_get_next_index() attempts to read an index entry from the
current index file and fails.

Solution
========

1) lttng_trace_chunk_ids_equal() is modified to properly handle
'null' trace chunks:
  - A null and a non-null trace chunk are not equal,
  - Two null trace chunks are equal.

2) Rotation count
  A rotation counter is introduced to track the number of rotations
  that occurred during a relay stream's lifetime. This counter is
  sampled by the matching viewer streams on creation and on rotation
  and is used to determine if all rotations were "seen" by the viewer
  stream.

  Hence, this allows us to handle the special case where a viewer
  is consuming the contents of a relay stream that just transitioned
  into a 'null' trace chunk (see comments in patch).

The rest of the modifications simply allow the live server to handle
null trace chunks in viewer streams. This fixes another unrelated bug
that I observed while investigating this: sessions that don't have an
active trace chunk are not shown when listing sessions with babeltrace.

To reproduce, simply stop, clear a session, and attempt to list the
sessions of the associated relay daemon.

Known drawbacks
===============

None.

Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: Ibb3116990e34b7ec3b477f3482d0c0ff1e848d09
jgalar pushed a commit that referenced this pull request Aug 2, 2021
When doing `lttng stop`, I get:

    Direct leak of 656 byte(s) in 1 object(s) allocated from:
        #0 0x7f970719e459 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
        #1 0x7f9706eba29a in zmalloc /home/simark/src/lttng-tools/src/common/macros.h:45
        #2 0x7f9706ebbb9d in recv_sessiond_optional_data /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:494
        #3 0x7f9706ebbf9a in lttng_ctl_ask_sessiond_fds_varlen /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:596
        #4 0x7f9706eba714 in lttng_ctl_ask_sessiond_varlen_no_cmd_header /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:58
        #5 0x7f9706eba747 in lttng_ctl_ask_sessiond /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:78
        #6 0x7f9706ec4604 in lttng_list_channels /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:2262
        #7 0x55837235c4e7 in get_session_stats_str /home/simark/src/lttng-tools/src/bin/lttng/utils.c:499
        #8 0x55837235bf73 in print_session_stats /home/simark/src/lttng-tools/src/bin/lttng/utils.c:445
        #9 0x55837231cc12 in stop_tracing /home/simark/src/lttng-tools/src/bin/lttng/commands/stop.c:138
        #10 0x55837231d062 in cmd_stop /home/simark/src/lttng-tools/src/bin/lttng/commands/stop.c:229
        #11 0x55837235e63e in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
        #12 0x55837235f0f2 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:426
        #13 0x55837235f3f4 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:475
        #14 0x7f9706adcb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

    Direct leak of 308 byte(s) in 1 object(s) allocated from:
        #0 0x7f970719e459 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
        #1 0x7f9706eba29a in zmalloc /home/simark/src/lttng-tools/src/common/macros.h:45
        #2 0x7f9706ebbb9d in recv_sessiond_optional_data /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:494
        #3 0x7f9706ebbf9a in lttng_ctl_ask_sessiond_fds_varlen /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:596
        #4 0x7f9706eba714 in lttng_ctl_ask_sessiond_varlen_no_cmd_header /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:58
        #5 0x7f9706eba747 in lttng_ctl_ask_sessiond /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:78
        #6 0x7f9706ec421c in lttng_list_domains /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:2220
        #7 0x55837235c3d3 in get_session_stats_str /home/simark/src/lttng-tools/src/bin/lttng/utils.c:484
        #8 0x55837235bf73 in print_session_stats /home/simark/src/lttng-tools/src/bin/lttng/utils.c:445
        #9 0x55837231cc12 in stop_tracing /home/simark/src/lttng-tools/src/bin/lttng/commands/stop.c:138
        #10 0x55837231d062 in cmd_stop /home/simark/src/lttng-tools/src/bin/lttng/commands/stop.c:229
        #11 0x55837235e63e in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
        #12 0x55837235f0f2 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:426
        #13 0x55837235f3f4 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:475
        #14 0x7f9706adcb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

This is due to the get_session_stats_str function not free'ing the
results of lttng_list_channels and lttng_list_domains.  Fix that.

Change-Id: I4c200d3df41bf09bdce8eadb000abbff7fe5a751
Signed-off-by: Simon Marchi <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
jgalar pushed a commit that referenced this pull request Aug 2, 2021
When doing `lttng destroy`, I get:

    Direct leak of 4385 byte(s) in 1 object(s) allocated from:
        #0 0x7f74ae025459 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
        #1 0x7f74add4129a in zmalloc /home/simark/src/lttng-tools/src/common/macros.h:45
        #2 0x7f74add42b9d in recv_sessiond_optional_data /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:494
        #3 0x7f74add42f9a in lttng_ctl_ask_sessiond_fds_varlen /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:596
        #4 0x7f74add41714 in lttng_ctl_ask_sessiond_varlen_no_cmd_header /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:58
        #5 0x7f74add41747 in lttng_ctl_ask_sessiond /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:78
        #6 0x7f74add4a922 in lttng_list_sessions /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:2105
        #7 0x56472bcbdf80 in cmd_destroy /home/simark/src/lttng-tools/src/bin/lttng/commands/destroy.c:330
        #8 0x56472bd00764 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
        #9 0x56472bd01218 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:426
        #10 0x56472bd0151a in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:475
        #11 0x7f74ad963b24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

This is due to cmd_destroy not free'ing the result of
lttng_list_sessions. Fix that.

Change-Id: Iff2e75e6ec1cdcd0bdfdbbc3d5099422e592905b
Signed-off-by: Simon Marchi <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
jgalar pushed a commit that referenced this pull request Aug 3, 2021
When doing `lttng stop`, I get:

    Direct leak of 656 byte(s) in 1 object(s) allocated from:
        #0 0x7f970719e459 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
        #1 0x7f9706eba29a in zmalloc /home/simark/src/lttng-tools/src/common/macros.h:45
        #2 0x7f9706ebbb9d in recv_sessiond_optional_data /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:494
        #3 0x7f9706ebbf9a in lttng_ctl_ask_sessiond_fds_varlen /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:596
        #4 0x7f9706eba714 in lttng_ctl_ask_sessiond_varlen_no_cmd_header /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:58
        #5 0x7f9706eba747 in lttng_ctl_ask_sessiond /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:78
        #6 0x7f9706ec4604 in lttng_list_channels /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:2262
        #7 0x55837235c4e7 in get_session_stats_str /home/simark/src/lttng-tools/src/bin/lttng/utils.c:499
        #8 0x55837235bf73 in print_session_stats /home/simark/src/lttng-tools/src/bin/lttng/utils.c:445
        #9 0x55837231cc12 in stop_tracing /home/simark/src/lttng-tools/src/bin/lttng/commands/stop.c:138
        #10 0x55837231d062 in cmd_stop /home/simark/src/lttng-tools/src/bin/lttng/commands/stop.c:229
        #11 0x55837235e63e in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
        #12 0x55837235f0f2 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:426
        #13 0x55837235f3f4 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:475
        #14 0x7f9706adcb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

    Direct leak of 308 byte(s) in 1 object(s) allocated from:
        #0 0x7f970719e459 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
        #1 0x7f9706eba29a in zmalloc /home/simark/src/lttng-tools/src/common/macros.h:45
        #2 0x7f9706ebbb9d in recv_sessiond_optional_data /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:494
        #3 0x7f9706ebbf9a in lttng_ctl_ask_sessiond_fds_varlen /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:596
        #4 0x7f9706eba714 in lttng_ctl_ask_sessiond_varlen_no_cmd_header /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:58
        #5 0x7f9706eba747 in lttng_ctl_ask_sessiond /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:78
        #6 0x7f9706ec421c in lttng_list_domains /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:2220
        #7 0x55837235c3d3 in get_session_stats_str /home/simark/src/lttng-tools/src/bin/lttng/utils.c:484
        #8 0x55837235bf73 in print_session_stats /home/simark/src/lttng-tools/src/bin/lttng/utils.c:445
        #9 0x55837231cc12 in stop_tracing /home/simark/src/lttng-tools/src/bin/lttng/commands/stop.c:138
        #10 0x55837231d062 in cmd_stop /home/simark/src/lttng-tools/src/bin/lttng/commands/stop.c:229
        #11 0x55837235e63e in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
        #12 0x55837235f0f2 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:426
        #13 0x55837235f3f4 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:475
        #14 0x7f9706adcb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

This is due to the get_session_stats_str function not free'ing the
results of lttng_list_channels and lttng_list_domains.  Fix that.

Change-Id: I4c200d3df41bf09bdce8eadb000abbff7fe5a751
Signed-off-by: Simon Marchi <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
jgalar pushed a commit that referenced this pull request Aug 3, 2021
When doing `lttng destroy`, I get:

    Direct leak of 4385 byte(s) in 1 object(s) allocated from:
        #0 0x7f74ae025459 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
        #1 0x7f74add4129a in zmalloc /home/simark/src/lttng-tools/src/common/macros.h:45
        #2 0x7f74add42b9d in recv_sessiond_optional_data /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:494
        #3 0x7f74add42f9a in lttng_ctl_ask_sessiond_fds_varlen /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:596
        #4 0x7f74add41714 in lttng_ctl_ask_sessiond_varlen_no_cmd_header /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:58
        #5 0x7f74add41747 in lttng_ctl_ask_sessiond /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:78
        #6 0x7f74add4a922 in lttng_list_sessions /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:2105
        #7 0x56472bcbdf80 in cmd_destroy /home/simark/src/lttng-tools/src/bin/lttng/commands/destroy.c:330
        #8 0x56472bd00764 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
        #9 0x56472bd01218 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:426
        #10 0x56472bd0151a in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:475
        #11 0x7f74ad963b24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

This is due to cmd_destroy not free'ing the result of
lttng_list_sessions. Fix that.

Change-Id: Iff2e75e6ec1cdcd0bdfdbbc3d5099422e592905b
Signed-off-by: Simon Marchi <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
jgalar pushed a commit that referenced this pull request Aug 6, 2021
When doing `lttng stop`, I get:

    Direct leak of 656 byte(s) in 1 object(s) allocated from:
        #0 0x7f970719e459 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
        #1 0x7f9706eba29a in zmalloc /home/simark/src/lttng-tools/src/common/macros.h:45
        #2 0x7f9706ebbb9d in recv_sessiond_optional_data /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:494
        #3 0x7f9706ebbf9a in lttng_ctl_ask_sessiond_fds_varlen /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:596
        #4 0x7f9706eba714 in lttng_ctl_ask_sessiond_varlen_no_cmd_header /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:58
        #5 0x7f9706eba747 in lttng_ctl_ask_sessiond /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:78
        #6 0x7f9706ec4604 in lttng_list_channels /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:2262
        #7 0x55837235c4e7 in get_session_stats_str /home/simark/src/lttng-tools/src/bin/lttng/utils.c:499
        #8 0x55837235bf73 in print_session_stats /home/simark/src/lttng-tools/src/bin/lttng/utils.c:445
        #9 0x55837231cc12 in stop_tracing /home/simark/src/lttng-tools/src/bin/lttng/commands/stop.c:138
        #10 0x55837231d062 in cmd_stop /home/simark/src/lttng-tools/src/bin/lttng/commands/stop.c:229
        #11 0x55837235e63e in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
        #12 0x55837235f0f2 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:426
        #13 0x55837235f3f4 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:475
        #14 0x7f9706adcb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

    Direct leak of 308 byte(s) in 1 object(s) allocated from:
        #0 0x7f970719e459 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
        #1 0x7f9706eba29a in zmalloc /home/simark/src/lttng-tools/src/common/macros.h:45
        #2 0x7f9706ebbb9d in recv_sessiond_optional_data /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:494
        #3 0x7f9706ebbf9a in lttng_ctl_ask_sessiond_fds_varlen /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:596
        #4 0x7f9706eba714 in lttng_ctl_ask_sessiond_varlen_no_cmd_header /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:58
        #5 0x7f9706eba747 in lttng_ctl_ask_sessiond /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:78
        #6 0x7f9706ec421c in lttng_list_domains /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:2220
        #7 0x55837235c3d3 in get_session_stats_str /home/simark/src/lttng-tools/src/bin/lttng/utils.c:484
        #8 0x55837235bf73 in print_session_stats /home/simark/src/lttng-tools/src/bin/lttng/utils.c:445
        #9 0x55837231cc12 in stop_tracing /home/simark/src/lttng-tools/src/bin/lttng/commands/stop.c:138
        #10 0x55837231d062 in cmd_stop /home/simark/src/lttng-tools/src/bin/lttng/commands/stop.c:229
        #11 0x55837235e63e in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
        #12 0x55837235f0f2 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:426
        #13 0x55837235f3f4 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:475
        #14 0x7f9706adcb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

This is due to the get_session_stats_str function not free'ing the
results of lttng_list_channels and lttng_list_domains.  Fix that.

Change-Id: I4c200d3df41bf09bdce8eadb000abbff7fe5a751
Signed-off-by: Simon Marchi <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
jgalar pushed a commit that referenced this pull request Aug 6, 2021
When doing `lttng destroy`, I get:

    Direct leak of 4385 byte(s) in 1 object(s) allocated from:
        #0 0x7f74ae025459 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
        #1 0x7f74add4129a in zmalloc /home/simark/src/lttng-tools/src/common/macros.h:45
        #2 0x7f74add42b9d in recv_sessiond_optional_data /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:494
        #3 0x7f74add42f9a in lttng_ctl_ask_sessiond_fds_varlen /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:596
        #4 0x7f74add41714 in lttng_ctl_ask_sessiond_varlen_no_cmd_header /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:58
        #5 0x7f74add41747 in lttng_ctl_ask_sessiond /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:78
        #6 0x7f74add4a922 in lttng_list_sessions /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:2105
        #7 0x56472bcbdf80 in cmd_destroy /home/simark/src/lttng-tools/src/bin/lttng/commands/destroy.c:330
        #8 0x56472bd00764 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
        #9 0x56472bd01218 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:426
        #10 0x56472bd0151a in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:475
        #11 0x7f74ad963b24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)

This is due to cmd_destroy not free'ing the result of
lttng_list_sessions. Fix that.

Change-Id: Iff2e75e6ec1cdcd0bdfdbbc3d5099422e592905b
Signed-off-by: Simon Marchi <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
jgalar pushed a commit that referenced this pull request Sep 23, 2021
Observed issue
==============

The following backtrace have been reported [1].

 #0  __GI_raise (sig=sig@entry=6) at /usr/src/debug/glibc/2.31+gitAUTOINC+f84949f1c4-r0/git/sysdeps/unix/sysv/linux/raise.c:50
 #1  0x0000003123025528 in __GI_abort () at /usr/src/debug/glibc/2.31+gitAUTOINC+f84949f1c4-r0/git/stdlib/abort.c:79
 #2  0x0000000000419884 in lttng_trace_archive_location_serialize (location=0x7f1c9c001160, buffer=0x7f1cb961c320) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/common/location.c:230
 #3  0x00000000004c8f06 in lttng_evaluation_session_rotation_serialize (evaluation=0x7f1cb000a7f0, payload=0x7f1cb961c320) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/common/conditions/session-rotation.c:539
 #4  0x00000000004a80fa in lttng_evaluation_serialize (evaluation=0x7f1cb000a7f0, payload=0x7f1cb961c320) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/common/evaluation.c:42
 #5  0x00000000004bc24f in lttng_notification_serialize (notification=0x7f1cb961c310, payload=0x7f1cb961c320) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/common/notification.c:63
 #6  0x0000000000458b7d in notification_client_list_send_evaluation (client_list=0x7f1cb0008f90, trigger=0x7f1ca40113d0, evaluation=<optimized out>, source_object_creds=0x7f1cb000a874, client_report=0x475840 <client_handle_transmission_status>, user_data=0x7f1cb0006010) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/notification-thread-events.c:4379
 #7  0x0000000000476586 in action_executor_generic_handler (item=0x7f1cb0009600, work_item=0x7f1cb000a820, executor=0x7f1cb0006010) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/action-executor.c:696
 #8  action_work_item_execute (work_item=0x7f1cb000a820, executor=0x7f1cb0006010) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/action-executor.c:715
 #9  action_executor_thread (_data=0x7f1cb0006010) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/action-executor.c:797
 #10 0x0000000000462327 in launch_thread (data=0x7f1cb00060b0) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/thread.c:66
 #11 0x0000003123408ea4 in start_thread (arg=<optimized out>) at /usr/src/debug/glibc/2.31+gitAUTOINC+f84949f1c4-r0/git/nptl/pthread_create.c:477
 #12 0x00000031230f8dcf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

This can be easily reproduced with the following session and trigger
configuration:

 lttng create test
 lttng enable-event -u -a
 lttng start
 # Register two similar triggers via a dummy C program since rotation
 # completed condition is not exposed on the CLI for now. Yielding the
 # following triggers:
 lttng list-triggers
 - name: trigger0
   owner uid: 1000
   condition: session rotation completed
     session name: test
     errors: none
  action:notify
   errors: none
 - name: trigger1
   owner uid: 1000
   condition: session rotation completed
     session name: test
     errors: none
  action:notify
   errors: none

  lttng rotate <- abort happens here.

Cause
=====

The problem lies in how the location (`lttng_trace_archive_location`)
object is assigned to the `lttng_evaluation` objects. A single location
object can end up being shared between multiple `lttng_evaluation` objects
since we iterate over all triggers and create an `lttng_evaluation` object
with the location each time as needed.

See `src/bin/lttng-sessiond/notification-thread-events.c:1956`.

The location object is then freed when the first notification is
completely serialized. The second serialization end up having a
reference to a freed `lttng_trace_archive_location` object.

Solution
========

Implement ref counting for the lttng_trace_archive_location object.

Note
=======

This also fixes a leak that was present in `cmd_destroy_session_reply`.

The location is created by `session_get_trace_archive_location` and is
never `destroyed`/`put`.

Known drawbacks
=========

None.

References
==========

[1] https://bugs.lttng.org/issues/1325

Fixes: #1325

Signed-off-by: Jonathan Rajotte <[email protected]>
Change-Id: I99dc595ee5b0288c727b193ed061f5273752bd24
Signed-off-by: Jérémie Galarneau <[email protected]>
jgalar pushed a commit that referenced this pull request Sep 23, 2021
Observed issue
==============

The following backtrace have been reported [1].

 #0  __GI_raise (sig=sig@entry=6) at /usr/src/debug/glibc/2.31+gitAUTOINC+f84949f1c4-r0/git/sysdeps/unix/sysv/linux/raise.c:50
 #1  0x0000003123025528 in __GI_abort () at /usr/src/debug/glibc/2.31+gitAUTOINC+f84949f1c4-r0/git/stdlib/abort.c:79
 #2  0x0000000000419884 in lttng_trace_archive_location_serialize (location=0x7f1c9c001160, buffer=0x7f1cb961c320) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/common/location.c:230
 #3  0x00000000004c8f06 in lttng_evaluation_session_rotation_serialize (evaluation=0x7f1cb000a7f0, payload=0x7f1cb961c320) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/common/conditions/session-rotation.c:539
 #4  0x00000000004a80fa in lttng_evaluation_serialize (evaluation=0x7f1cb000a7f0, payload=0x7f1cb961c320) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/common/evaluation.c:42
 #5  0x00000000004bc24f in lttng_notification_serialize (notification=0x7f1cb961c310, payload=0x7f1cb961c320) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/common/notification.c:63
 #6  0x0000000000458b7d in notification_client_list_send_evaluation (client_list=0x7f1cb0008f90, trigger=0x7f1ca40113d0, evaluation=<optimized out>, source_object_creds=0x7f1cb000a874, client_report=0x475840 <client_handle_transmission_status>, user_data=0x7f1cb0006010) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/notification-thread-events.c:4379
 #7  0x0000000000476586 in action_executor_generic_handler (item=0x7f1cb0009600, work_item=0x7f1cb000a820, executor=0x7f1cb0006010) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/action-executor.c:696
 #8  action_work_item_execute (work_item=0x7f1cb000a820, executor=0x7f1cb0006010) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/action-executor.c:715
 #9  action_executor_thread (_data=0x7f1cb0006010) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/action-executor.c:797
 #10 0x0000000000462327 in launch_thread (data=0x7f1cb00060b0) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/thread.c:66
 #11 0x0000003123408ea4 in start_thread (arg=<optimized out>) at /usr/src/debug/glibc/2.31+gitAUTOINC+f84949f1c4-r0/git/nptl/pthread_create.c:477
 #12 0x00000031230f8dcf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

This can be easily reproduced with the following session and trigger
configuration:

 lttng create test
 lttng enable-event -u -a
 lttng start
 # Register two similar triggers via a dummy C program since rotation
 # completed condition is not exposed on the CLI for now. Yielding the
 # following triggers:
 lttng list-triggers
 - name: trigger0
   owner uid: 1000
   condition: session rotation completed
     session name: test
     errors: none
  action:notify
   errors: none
 - name: trigger1
   owner uid: 1000
   condition: session rotation completed
     session name: test
     errors: none
  action:notify
   errors: none

  lttng rotate <- abort happens here.

Cause
=====

The problem lies in how the location (`lttng_trace_archive_location`)
object is assigned to the `lttng_evaluation` objects. A single location
object can end up being shared between multiple `lttng_evaluation` objects
since we iterate over all triggers and create an `lttng_evaluation` object
with the location each time as needed.

See `src/bin/lttng-sessiond/notification-thread-events.c:1956`.

The location object is then freed when the first notification is
completely serialized. The second serialization end up having a
reference to a freed `lttng_trace_archive_location` object.

Solution
========

Implement ref counting for the lttng_trace_archive_location object.

Note
=======

This also fixes a leak that was present in `cmd_destroy_session_reply`.

The location is created by `session_get_trace_archive_location` and is
never `destroyed`/`put`.

Known drawbacks
=========

None.

References
==========

[1] https://bugs.lttng.org/issues/1325

Fixes: #1325

Signed-off-by: Jonathan Rajotte <[email protected]>
Change-Id: I99dc595ee5b0288c727b193ed061f5273752bd24
Signed-off-by: Jérémie Galarneau <[email protected]>
jgalar pushed a commit that referenced this pull request Sep 23, 2021
Observed issue
==============

The following backtrace have been reported [1].

 #0  __GI_raise (sig=sig@entry=6) at /usr/src/debug/glibc/2.31+gitAUTOINC+f84949f1c4-r0/git/sysdeps/unix/sysv/linux/raise.c:50
 #1  0x0000003123025528 in __GI_abort () at /usr/src/debug/glibc/2.31+gitAUTOINC+f84949f1c4-r0/git/stdlib/abort.c:79
 #2  0x0000000000419884 in lttng_trace_archive_location_serialize (location=0x7f1c9c001160, buffer=0x7f1cb961c320) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/common/location.c:230
 #3  0x00000000004c8f06 in lttng_evaluation_session_rotation_serialize (evaluation=0x7f1cb000a7f0, payload=0x7f1cb961c320) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/common/conditions/session-rotation.c:539
 #4  0x00000000004a80fa in lttng_evaluation_serialize (evaluation=0x7f1cb000a7f0, payload=0x7f1cb961c320) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/common/evaluation.c:42
 #5  0x00000000004bc24f in lttng_notification_serialize (notification=0x7f1cb961c310, payload=0x7f1cb961c320) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/common/notification.c:63
 #6  0x0000000000458b7d in notification_client_list_send_evaluation (client_list=0x7f1cb0008f90, trigger=0x7f1ca40113d0, evaluation=<optimized out>, source_object_creds=0x7f1cb000a874, client_report=0x475840 <client_handle_transmission_status>, user_data=0x7f1cb0006010) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/notification-thread-events.c:4379
 #7  0x0000000000476586 in action_executor_generic_handler (item=0x7f1cb0009600, work_item=0x7f1cb000a820, executor=0x7f1cb0006010) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/action-executor.c:696
 #8  action_work_item_execute (work_item=0x7f1cb000a820, executor=0x7f1cb0006010) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/action-executor.c:715
 #9  action_executor_thread (_data=0x7f1cb0006010) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/action-executor.c:797
 #10 0x0000000000462327 in launch_thread (data=0x7f1cb00060b0) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/thread.c:66
 #11 0x0000003123408ea4 in start_thread (arg=<optimized out>) at /usr/src/debug/glibc/2.31+gitAUTOINC+f84949f1c4-r0/git/nptl/pthread_create.c:477
 #12 0x00000031230f8dcf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

This can be easily reproduced with the following session and trigger
configuration:

 lttng create test
 lttng enable-event -u -a
 lttng start
 # Register two similar triggers via a dummy C program since rotation
 # completed condition is not exposed on the CLI for now. Yielding the
 # following triggers:
 lttng list-triggers
 - name: trigger0
   owner uid: 1000
   condition: session rotation completed
     session name: test
     errors: none
  action:notify
   errors: none
 - name: trigger1
   owner uid: 1000
   condition: session rotation completed
     session name: test
     errors: none
  action:notify
   errors: none

  lttng rotate <- abort happens here.

Cause
=====

The problem lies in how the location (`lttng_trace_archive_location`)
object is assigned to the `lttng_evaluation` objects. A single location
object can end up being shared between multiple `lttng_evaluation` objects
since we iterate over all triggers and create an `lttng_evaluation` object
with the location each time as needed.

See `src/bin/lttng-sessiond/notification-thread-events.c:1956`.

The location object is then freed when the first notification is
completely serialized. The second serialization end up having a
reference to a freed `lttng_trace_archive_location` object.

Solution
========

Implement ref counting for the lttng_trace_archive_location object.

Note
=======

This also fixes a leak that was present in `cmd_destroy_session_reply`.

The location is created by `session_get_trace_archive_location` and is
never `destroyed`/`put`.

Known drawbacks
=========

None.

References
==========

[1] https://bugs.lttng.org/issues/1325

Fixes: #1325

Signed-off-by: Jonathan Rajotte <[email protected]>
Change-Id: I99dc595ee5b0288c727b193ed061f5273752bd24
Signed-off-by: Jérémie Galarneau <[email protected]>
jgalar pushed a commit that referenced this pull request Oct 15, 2021
Observed issue
==============

A segmentation fault is observed for multiple UST timeout scenarios.

Backtrace:

 #0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:384
 #1  0x0000557fe0395df9 in copy_filter_bytecode (orig_f=0x7f9c5802b790) at ust-app.c:1196
 #2  0x0000557fe0397702 in shadow_copy_event (ua_event=0x7f9c58025ff0, uevent=0x7f9c58033560) at ust-app.c:1824
 #3  0x0000557fe039ac46 in create_ust_app_event (ua_sess=0x7f9c5802ec20, ua_chan=0x7f9c58025cc0, uevent=0x7f9c58033560, app=0x7f9c5c001da0) at ust-app.c:3192
 #4  0x0000557fe03a054d in ust_app_channel_synchronize_event (ua_chan=0x7f9c58025cc0, uevent=0x7f9c58033560, ua_sess=0x7f9c5802ec20, app=0x7f9c5c001da0) at ust-app.c:5096
 #5  0x0000557fe03a0772 in ust_app_synchronize (usess=0x7f9c580074a0, app=0x7f9c5c001da0) at ust-app.c:5173
 #6  0x0000557fe03a0a70 in ust_app_global_update (usess=0x7f9c580074a0, app=0x7f9c5c001da0) at ust-app.c:5255
 #7  0x0000557fe03a00e0 in ust_app_start_trace_all (usess=0x7f9c580074a0) at ust-app.c:4987
 #8  0x0000557fe0355c6a in cmd_start_trace (session=0x7f9c5800a190) at cmd.c:2668
 #9  0x0000557fe0382e70 in process_client_msg (cmd_ctx=0x7f9c58003d70, sock=0x7f9c74bf44e0, sock_error=0x7f9c74bf44e4) at client.c:1527
 #10 0x0000557fe03848a2 in thread_manage_clients (data=0x557fe06d9440) at client.c:2200
 #11 0x0000557fe037d1cb in launch_thread (data=0x557fe06d94b0) at thread.c:75
 #12 0x00007f9c796af609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #13 0x00007f9c795b6293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The scenario:

  # Start an instrumented app
  ./app
  gdb lttng-sessiond
  # put a breakpoint on ustctl_set_filter
  lttng create my_session
  lttng enable-event -u tp:tp_test
  lttng start
  lttng enable-event -u __dummy --filter 'my_field == "user34"'
  # The tracepoint should hit. Do not continue.
  kill -s SIGSTOP $(pgrep app)
  # Continue lttng-sessiond.
  # enable-event will return an error. This a bug in itself, still let's
  # continue with the current bug.
  lttng stop
  # Start a new app that will register.
  ./app &
  sleep 1
  lttng start
  # lttng-sessiond should segfault.

Cause
=====

During the "lttng enable-event" command, the timeout error bubbles up
all the way to event_ust_enable_tracepoint and is different from
LTTNG_UST_ERR_EXIST. `trace_ust_destroy_event` is called and frees the
`uevent` object. Note that contrary to the comment `uevent` is added to
the channel event hash table at this point.

On the next `lttng start` command, the event node is still present in
the hash table and is iterated on. lttng-sessiond segfault on the first
data access of the previously freed memory.

The problem was introduced by commit
88e3c2f [1]. Which essentially move the
callsite of `add_unique_ust_event` before `ust_app_*_event_glb` calls.

Solution
========

Go to `end` label to prevent freeing of the uevent object.

Note that app synchronization should not force an error at the channel
level, since a single app can fail but the whole channel should not.

The `error` label is now obsolete.

Known drawbacks
=========

None.

References
==========

[1] 88e3c2f

Signed-off-by: Jonathan Rajotte <[email protected]>
Change-Id: Ifaf3f4c71bb2da869c7b441aaa4b367f8f7cbdd6
Signed-off-by: Jérémie Galarneau <[email protected]>
jgalar pushed a commit that referenced this pull request Oct 18, 2021
Observed issue
==============

A segmentation fault is observed for multiple UST timeout scenarios.

Backtrace:

 #0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:384
 #1  0x0000557fe0395df9 in copy_filter_bytecode (orig_f=0x7f9c5802b790) at ust-app.c:1196
 #2  0x0000557fe0397702 in shadow_copy_event (ua_event=0x7f9c58025ff0, uevent=0x7f9c58033560) at ust-app.c:1824
 #3  0x0000557fe039ac46 in create_ust_app_event (ua_sess=0x7f9c5802ec20, ua_chan=0x7f9c58025cc0, uevent=0x7f9c58033560, app=0x7f9c5c001da0) at ust-app.c:3192
 #4  0x0000557fe03a054d in ust_app_channel_synchronize_event (ua_chan=0x7f9c58025cc0, uevent=0x7f9c58033560, ua_sess=0x7f9c5802ec20, app=0x7f9c5c001da0) at ust-app.c:5096
 #5  0x0000557fe03a0772 in ust_app_synchronize (usess=0x7f9c580074a0, app=0x7f9c5c001da0) at ust-app.c:5173
 #6  0x0000557fe03a0a70 in ust_app_global_update (usess=0x7f9c580074a0, app=0x7f9c5c001da0) at ust-app.c:5255
 #7  0x0000557fe03a00e0 in ust_app_start_trace_all (usess=0x7f9c580074a0) at ust-app.c:4987
 #8  0x0000557fe0355c6a in cmd_start_trace (session=0x7f9c5800a190) at cmd.c:2668
 #9  0x0000557fe0382e70 in process_client_msg (cmd_ctx=0x7f9c58003d70, sock=0x7f9c74bf44e0, sock_error=0x7f9c74bf44e4) at client.c:1527
 #10 0x0000557fe03848a2 in thread_manage_clients (data=0x557fe06d9440) at client.c:2200
 #11 0x0000557fe037d1cb in launch_thread (data=0x557fe06d94b0) at thread.c:75
 #12 0x00007f9c796af609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #13 0x00007f9c795b6293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The scenario:

  # Start an instrumented app
  ./app
  gdb lttng-sessiond
  # put a breakpoint on ustctl_set_filter
  lttng create my_session
  lttng enable-event -u tp:tp_test
  lttng start
  lttng enable-event -u __dummy --filter 'my_field == "user34"'
  # The tracepoint should hit. Do not continue.
  kill -s SIGSTOP $(pgrep app)
  # Continue lttng-sessiond.
  # enable-event will return an error. This a bug in itself, still let's
  # continue with the current bug.
  lttng stop
  # Start a new app that will register.
  ./app &
  sleep 1
  lttng start
  # lttng-sessiond should segfault.

Cause
=====

During the "lttng enable-event" command, the timeout error bubbles up
all the way to event_ust_enable_tracepoint and is different from
LTTNG_UST_ERR_EXIST. `trace_ust_destroy_event` is called and frees the
`uevent` object. Note that contrary to the comment `uevent` is added to
the channel event hash table at this point.

On the next `lttng start` command, the event node is still present in
the hash table and is iterated on. lttng-sessiond segfault on the first
data access of the previously freed memory.

The problem was introduced by commit
88e3c2f [1]. Which essentially move the
callsite of `add_unique_ust_event` before `ust_app_*_event_glb` calls.

Solution
========

Go to `end` label to prevent freeing of the uevent object.

Note that app synchronization should not force an error at the channel
level, since a single app can fail but the whole channel should not.

The `error` label is now obsolete.

Known drawbacks
=========

None.

References
==========

[1] 88e3c2f

Signed-off-by: Jonathan Rajotte <[email protected]>
Change-Id: Ifaf3f4c71bb2da869c7b441aaa4b367f8f7cbdd6
Signed-off-by: Jérémie Galarneau <[email protected]>
jgalar pushed a commit that referenced this pull request Oct 18, 2021
Observed issue
==============

A segmentation fault is observed for multiple UST timeout scenarios.

Backtrace:

 #0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:384
 #1  0x0000557fe0395df9 in copy_filter_bytecode (orig_f=0x7f9c5802b790) at ust-app.c:1196
 #2  0x0000557fe0397702 in shadow_copy_event (ua_event=0x7f9c58025ff0, uevent=0x7f9c58033560) at ust-app.c:1824
 #3  0x0000557fe039ac46 in create_ust_app_event (ua_sess=0x7f9c5802ec20, ua_chan=0x7f9c58025cc0, uevent=0x7f9c58033560, app=0x7f9c5c001da0) at ust-app.c:3192
 #4  0x0000557fe03a054d in ust_app_channel_synchronize_event (ua_chan=0x7f9c58025cc0, uevent=0x7f9c58033560, ua_sess=0x7f9c5802ec20, app=0x7f9c5c001da0) at ust-app.c:5096
 #5  0x0000557fe03a0772 in ust_app_synchronize (usess=0x7f9c580074a0, app=0x7f9c5c001da0) at ust-app.c:5173
 #6  0x0000557fe03a0a70 in ust_app_global_update (usess=0x7f9c580074a0, app=0x7f9c5c001da0) at ust-app.c:5255
 #7  0x0000557fe03a00e0 in ust_app_start_trace_all (usess=0x7f9c580074a0) at ust-app.c:4987
 #8  0x0000557fe0355c6a in cmd_start_trace (session=0x7f9c5800a190) at cmd.c:2668
 #9  0x0000557fe0382e70 in process_client_msg (cmd_ctx=0x7f9c58003d70, sock=0x7f9c74bf44e0, sock_error=0x7f9c74bf44e4) at client.c:1527
 #10 0x0000557fe03848a2 in thread_manage_clients (data=0x557fe06d9440) at client.c:2200
 #11 0x0000557fe037d1cb in launch_thread (data=0x557fe06d94b0) at thread.c:75
 #12 0x00007f9c796af609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #13 0x00007f9c795b6293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The scenario:

  # Start an instrumented app
  ./app
  gdb lttng-sessiond
  # put a breakpoint on ustctl_set_filter
  lttng create my_session
  lttng enable-event -u tp:tp_test
  lttng start
  lttng enable-event -u __dummy --filter 'my_field == "user34"'
  # The tracepoint should hit. Do not continue.
  kill -s SIGSTOP $(pgrep app)
  # Continue lttng-sessiond.
  # enable-event will return an error. This a bug in itself, still let's
  # continue with the current bug.
  lttng stop
  # Start a new app that will register.
  ./app &
  sleep 1
  lttng start
  # lttng-sessiond should segfault.

Cause
=====

During the "lttng enable-event" command, the timeout error bubbles up
all the way to event_ust_enable_tracepoint and is different from
LTTNG_UST_ERR_EXIST. `trace_ust_destroy_event` is called and frees the
`uevent` object. Note that contrary to the comment `uevent` is added to
the channel event hash table at this point.

On the next `lttng start` command, the event node is still present in
the hash table and is iterated on. lttng-sessiond segfault on the first
data access of the previously freed memory.

The problem was introduced by commit
88e3c2f [1]. Which essentially move the
callsite of `add_unique_ust_event` before `ust_app_*_event_glb` calls.

Solution
========

Go to `end` label to prevent freeing of the uevent object.

Note that app synchronization should not force an error at the channel
level, since a single app can fail but the whole channel should not.

The `error` label is now obsolete.

Known drawbacks
=========

None.

References
==========

[1] 88e3c2f

Signed-off-by: Jonathan Rajotte <[email protected]>
Change-Id: Ifaf3f4c71bb2da869c7b441aaa4b367f8f7cbdd6
Signed-off-by: Jérémie Galarneau <[email protected]>
jgalar pushed a commit that referenced this pull request Oct 18, 2021
Observed issue
==============

A segmentation fault is observed for multiple UST timeout scenarios.

Backtrace:

 #0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:384
 #1  0x0000557fe0395df9 in copy_filter_bytecode (orig_f=0x7f9c5802b790) at ust-app.c:1196
 #2  0x0000557fe0397702 in shadow_copy_event (ua_event=0x7f9c58025ff0, uevent=0x7f9c58033560) at ust-app.c:1824
 #3  0x0000557fe039ac46 in create_ust_app_event (ua_sess=0x7f9c5802ec20, ua_chan=0x7f9c58025cc0, uevent=0x7f9c58033560, app=0x7f9c5c001da0) at ust-app.c:3192
 #4  0x0000557fe03a054d in ust_app_channel_synchronize_event (ua_chan=0x7f9c58025cc0, uevent=0x7f9c58033560, ua_sess=0x7f9c5802ec20, app=0x7f9c5c001da0) at ust-app.c:5096
 #5  0x0000557fe03a0772 in ust_app_synchronize (usess=0x7f9c580074a0, app=0x7f9c5c001da0) at ust-app.c:5173
 #6  0x0000557fe03a0a70 in ust_app_global_update (usess=0x7f9c580074a0, app=0x7f9c5c001da0) at ust-app.c:5255
 #7  0x0000557fe03a00e0 in ust_app_start_trace_all (usess=0x7f9c580074a0) at ust-app.c:4987
 #8  0x0000557fe0355c6a in cmd_start_trace (session=0x7f9c5800a190) at cmd.c:2668
 #9  0x0000557fe0382e70 in process_client_msg (cmd_ctx=0x7f9c58003d70, sock=0x7f9c74bf44e0, sock_error=0x7f9c74bf44e4) at client.c:1527
 #10 0x0000557fe03848a2 in thread_manage_clients (data=0x557fe06d9440) at client.c:2200
 #11 0x0000557fe037d1cb in launch_thread (data=0x557fe06d94b0) at thread.c:75
 #12 0x00007f9c796af609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #13 0x00007f9c795b6293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The scenario:

  # Start an instrumented app
  ./app
  gdb lttng-sessiond
  # put a breakpoint on ustctl_set_filter
  lttng create my_session
  lttng enable-event -u tp:tp_test
  lttng start
  lttng enable-event -u __dummy --filter 'my_field == "user34"'
  # The tracepoint should hit. Do not continue.
  kill -s SIGSTOP $(pgrep app)
  # Continue lttng-sessiond.
  # enable-event will return an error. This a bug in itself, still let's
  # continue with the current bug.
  lttng stop
  # Start a new app that will register.
  ./app &
  sleep 1
  lttng start
  # lttng-sessiond should segfault.

Cause
=====

During the "lttng enable-event" command, the timeout error bubbles up
all the way to event_ust_enable_tracepoint and is different from
LTTNG_UST_ERR_EXIST. `trace_ust_destroy_event` is called and frees the
`uevent` object. Note that contrary to the comment `uevent` is added to
the channel event hash table at this point.

On the next `lttng start` command, the event node is still present in
the hash table and is iterated on. lttng-sessiond segfault on the first
data access of the previously freed memory.

The problem was introduced by commit
88e3c2f [1]. Which essentially move the
callsite of `add_unique_ust_event` before `ust_app_*_event_glb` calls.

Solution
========

Go to `end` label to prevent freeing of the uevent object.

Note that app synchronization should not force an error at the channel
level, since a single app can fail but the whole channel should not.

The `error` label is now obsolete.

Known drawbacks
=========

None.

References
==========

[1] 88e3c2f

Signed-off-by: Jonathan Rajotte <[email protected]>
Change-Id: Ifaf3f4c71bb2da869c7b441aaa4b367f8f7cbdd6
Signed-off-by: Jérémie Galarneau <[email protected]>
jgalar pushed a commit that referenced this pull request Dec 10, 2021
…y_name

Observed issue
==============

The lttng-sessiond asserts with the following backtrace on lttng create:

 #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
 #1  0x00007ffff7ab5859 in __GI_abort () at abort.c:79
 #2  0x00007ffff7ab5729 in __assert_fail_base (fmt=0x7ffff7c4b588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5555556ab0a6 "node_ptr == &node->node", file=0x5555556ab085 "hashtable.c", line=298, function=<optimized out>) at a
 #3  0x00007ffff7ac6f36 in __GI___assert_fail (assertion=assertion@entry=0x5555556ab0a6 "node_ptr == &node->node", file=file@entry=0x5555556ab085 "hashtable.c", line=line@entry=298, function=function@entry=0x5555556ab380 <__PRETTY_FUNCTIO
 #4  0x000055555560be44 in lttng_ht_add_unique_str (ht=<optimized out>, node=0x7fffe0026c58) at hashtable.c:298
 #5  0x000055555558fb6a in add_session_ht (ls=0x7fffe0024970) at session.c:372
 #6  session_create (name=<optimized out>, uid=1000, gid=1000, out_session=out_session@entry=0x7fffedfddbd8) at session.c:1308
 #7  0x000055555559b219 in cmd_create_session_from_descriptor (creds=<optimized out>, creds=<optimized out>, home_path=<optimized out>, descriptor=<optimized out>) at cmd.c:3040
 #8  cmd_create_session (cmd_ctx=cmd_ctx@entry=0x7fffedfe5fa0, sock=<optimized out>, return_descriptor=return_descriptor@entry=0x7fffedfddd68) at cmd.c:3176
 #9  0x00005555555cc341 in process_client_msg (sock_error=0x7fffedfddd10, sock=0x7fffedfddd0c, cmd_ctx=0x7fffedfe5fa0) at client.c:2177
 #10 thread_manage_clients (data=<optimized out>) at client.c:2742
 #11 0x00005555555c5fff in launch_thread (data=0x55555571b780) at thread.c:66
 #12 0x00007ffff7c8b609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #13 0x00007ffff7bb2293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The issue can be reproduced with modifications to the rotation thread
code and the following scenario:

 $ lttng create test
 $ lttng enable-event -u -a
 $ lttng start
 run any app just so that we have a complete valid session. (might not be necessary)
 $ lttng destroy --no-wait
 $ lttng create test
 ^ Should assert here.

The diff to be applied:

 diff --git a/src/bin/lttng-sessiond/rotation-thread.cpp b/src/bin/lttng-sessiond/rotation-thread.cpp
 index ac149c845..c11f068ed 100644
 --- a/src/bin/lttng-sessiond/rotation-thread.cpp
 +++ b/src/bin/lttng-sessiond/rotation-thread.cpp
 @@ -565,6 +565,8 @@ int handle_job_queue(struct rotation_thread_handle *handle,
  {
         int ret = 0;

 +       sleep(5);
 +
         for (;;) {
                 struct ltt_session *session;
                 struct rotation_thread_job *job;

Note that the initial report for this issue was on a system under load
for which the `lttng destroy` completion check failed and a `lttng
create` was performed. As of today the exact reason why the completion
check failed is not known. Still we can "fix" the race leading to the
lttng-sessiond assertion considering a user might use the `--no-wait`
variant of `lttng destroy` and could easily end up in this
situation.

Cause
=====

Note: all `lttng create` commands have the same session name passed as
argument.

On `lttng destroy` the ltt_session object is flagged as destroyed
(ltt_session::destroyed). The removal of the object from the hash
table (ltt_sessions_ht_by_name) will be performed during the
`session_release` which is driven by the session refcount.

A reference on the `ltt_session` object is held for the
rotation initiated by the `lttng destroy` command. The rotation is
enqueued by the rotation thread.

At this point the system is busy and the rotation thread does not run.
We simulate this with a `sleep(5)` during the `handle_job_queue`.

The `lttng destroy --no-wait` returns. If the `--no-wait` option is not
passed the `lttng destroy` command will work as expected and wait for
completion. We can SIGINT the `lttng destroy` command and perform a
`lttng create` yielding the same backtrace.

On `lttng create`, `session_create` validates that the name does not
conflict with an existing session using `session_find_by_name`. It is
important to note that `session_find_by_name` discriminates also on the
`session->destroyed` flag (introduced by [1]).

The `ltt_sessions_ht_by_name` hash table was introduced by [2] to remove
the need to lock the session list to sample a session id during the
queueing of actions to be executed related to a trigger. The assumption
was made that, during the creation phase, the session would
always be unique in that hash table based on its name. This is simply
not true since multiple sessions with the same name can coexist as long
as only a single one is marked as "not destroyed". This is an important
concept due to the refcounting of the object and the feature relying on
the lifetime of the object (i.e rotation). This is mostly valid when
talking about the global session list.

Solution
========

Move the hash table removal earlier during the release of the session
object.

Move the removal from `del_session_ht`, which is done during the
`session_release` function, to the `lttng_session_destroy` function.

It is safe to do so since currently the only user of that hash table
(the action executor) does not care much about destroyed session at that
point.

This ensures that we maintain the uniqueness property of the key (name)
for that hash table on insertion.

The alternative was to expose an hash table that could contain
duplicates and force the handling of a set on all lookups.

Known drawbacks
=========

None.

References
==========
[1] https://git.lttng.org/?p=lttng-tools.git;a=commit;h=e32d7f274604b77bcd83c24994e88df3761ed658
[2] https://git.lttng.org/?p=lttng-tools.git;a=commit;h=e1bbf98908a6399f39a9a8bc95bd8b59cecaa816

Signed-off-by: Jonathan Rajotte <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: I2f1d0d6c04ee7210166e9847a850afbe6eaa7609
jgalar pushed a commit that referenced this pull request Dec 16, 2021
Observed issue
==============

lttng-consumerd segfaults for the following scenario:

  $ lttng create test --snapshot
  $ lttng enable-event -u -an
  $ lttng start
  # Run an app just to have some events
  $ lttng regenerate metadata
  $ lttng snapshot record

The following backtrace is obtained:

 (gdb) bt
 #0  __GI___pthread_mutex_lock (mutex=0x130) at ../nptl/pthread_mutex_lock.c:67
 #1  0x000055b383cfaed3 in lttng_ustconsumer_recv_metadata (sock=29, key=4, offset=0, len=12391, version=1, channel=0x7fe90000a760, timer=0, wait=1) at ust-consumer.c:1347
 #2  0x000055b383d00197 in lttng_ustconsumer_request_metadata (ctx=0x55b3855a1e50, channel=0x7fe90000a760, timer=0, wait=1) at ust-consumer.c:3336
 #3  0x000055b383cf9e76 in snapshot_metadata (metadata_channel=0x7fe90000a760, key=4, path=0x7fe911a09944 "uid/1000/64-bit", relayd_id=18446744073709551615, ctx=0x55b3855a1e50) at ust-consum
 #4  0x000055b383cfbe73 in lttng_ustconsumer_recv_cmd (ctx=0x55b3855a1e50, sock=28, consumer_sockpoll=0x7fe911a0dbb0) at ust-consumer.c:1853
 #5  0x000055b383ccf9b7 in lttng_consumer_recv_cmd (ctx=0x55b3855a1e50, sock=28, consumer_sockpoll=0x7fe911a0dbb0) at consumer.c:2097
 #6  0x000055b383cd3bfd in consumer_thread_sessiond_poll (data=0x55b3855a1e50) at consumer.c:3284
 #7  0x00007fe914c22609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #8  0x00007fe914b47293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 (gdb) up
 #1  0x000055b383cfaed3 in lttng_ustconsumer_recv_metadata (sock=29, key=4, offset=0, len=12391, version=1, channel=0x7fe90000a760, timer=0, wait=1) at ust-consumer.c:1347
 1347                    pthread_mutex_lock(&channel->metadata_stream->lock);
 (gdb) print channel->metadata_stream
 $1 = (struct lttng_consumer_stream *) 0x0

Note that the following scenario also leads to a similar backtrace:

  $ lttng create test --snapshot
  $ lttng enable-event -u -an
  $ lttng start
  # Run an app just to have some events with a short duration
  $ lttng regenerate metadata
  # Run a second app just to have some events and a metadata flush while
  # the metadata cache status is set as `invalidated`.
  ^ lttng-consumerd segfault on app termination.

The backtrace:

 (gdb) bt
 #0  __GI___pthread_mutex_lock (mutex=0x130) at ../nptl/pthread_mutex_lock.c:67
 #1  0x00005612a5a13ed3 in lttng_ustconsumer_recv_metadata (sock=28, key=2, offset=0, len=12391, version=1, channel=0x7f3978005400, timer=0, wait=1) at ust-consumer.c:1347
 #2  0x00005612a5a14d0a in lttng_ustconsumer_recv_cmd (ctx=0x5612a6feee50, sock=28, consumer_sockpoll=0x7f3989494bb0) at ust-consumer.c:1818
 #3  0x00005612a59e89b7 in lttng_consumer_recv_cmd (ctx=0x5612a6feee50, sock=28, consumer_sockpoll=0x7f3989494bb0) at consumer.c:2097
 #4  0x00005612a59ecbfd in consumer_thread_sessiond_poll (data=0x5612a6feee50) at consumer.c:3284
 #5  0x00007f398c6a9609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #6  0x00007f398c5ce293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 (gdb) up
 #1  0x00005612a5a13ed3 in lttng_ustconsumer_recv_metadata (sock=28, key=2, offset=0, len=12391, version=1, channel=0x7f3978005400, timer=0, wait=1) at ust-consumer.c:1347
 1347                    pthread_mutex_lock(&channel->metadata_stream->lock);
 (gdb) print channel->metadata_stream
 $1 = (struct lttng_consumer_stream *) 0x0
 (gdb)

Cause
=====

For session configured in snapshot mode the metadata channel
metadata_stream field is NULL except for a "short" window during the
actual snapshot record action (snapshot_metadata).

The `regenerate metadata` effectively flag the metadata cache as invalid
leading to handling the cache invalidation state
(`CONSUMER_METADATA_CACHE_WRITE_STATUS_INVALIDATED`) in
`lttng_ustconsumer_recv_metadata`. This was introduced by
b1316da [1].

At that point the function expects the `channel->metadata_stream` to be
non null. This is simply not true for snapshot session metadata channels.

Note that we cannot simply swap `lttng_ustconsumer_request_metadata` and
`create_ust_streams` in `snapshot_metadata` to ensure that the
`channel->metadata_stream` is non null since
`lttng_ustconsumer_recv_metadata` ends up being called on metadata flush when
an app quit. This sequence of events corresponds to the second scenario
put forward in the `Observed Issue` section.

Solution
========

Null check `channel->metadata_stream` and perform only the operation
when it is non null. This partly mirror what is done in `consumer_metadata_wakeup_pipe`.

I am not sure if the check on `channel->monitor` is required but it
seems irrelevant to the notion of resetting the stream consumed position
when the stream exists.

With this taken care off, we find ourself with another
backtrace:

 #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50                                                                                                             [93/122]
 #1  0x00007f75cf9b3859 in __GI_abort () at abort.c:79
 #2  0x00007f75cf9b3729 in __assert_fail_base (fmt=0x7f75cfb49588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55ab004e9c68 "pthread_mutex_trylock(&stream->lock)", file=0x55ab004
 #3  0x00007f75cf9c4f36 in __GI___assert_fail (assertion=0x55ab004e9c68 "pthread_mutex_trylock(&stream->lock)", file=0x55ab004e8d7a "ust-consumer.c", line=1285, function=0x55ab004eb8a0 <__PR
 #4  0x000055ab004b1b9c in metadata_stream_reset_cache_consumed_position (stream=0x7f75b400a850) at ust-consumer.c:1285
 #5  0x000055ab004b4fef in commit_one_metadata_packet (stream=0x7f75b400a850) at ust-consumer.c:2551
 #6  0x000055ab004b5f46 in get_next_subbuffer_metadata (stream=0x7f75b400a850, subbuffer=0x7f75cc972630) at ust-consumer.c:2927
 #7  0x000055ab0048b6a9 in lttng_consumer_read_subbuffer (stream=0x7f75b400a850, ctx=0x55ab01d4ee50, locked_by_caller=true) at consumer.c:3522
 #8  0x000055ab004b0f66 in snapshot_metadata (metadata_channel=0x7f75b4005400, key=2, path=0x7f75cc972944 "uid/1000/64-bit", relayd_id=18446744073709551615, ctx=0x55ab01d4ee50) at ust-consum
 #9  0x000055ab004b2e86 in lttng_ustconsumer_recv_cmd (ctx=0x55ab01d4ee50, sock=28, consumer_sockpoll=0x7f75cc976bb0) at ust-consumer.c:1861
 #10 0x000055ab004869b7 in lttng_consumer_recv_cmd (ctx=0x55ab01d4ee50, sock=28, consumer_sockpoll=0x7f75cc976bb0) at consumer.c:2097
 #11 0x000055ab0048abfd in consumer_thread_sessiond_poll (data=0x55ab01d4ee50) at consumer.c:3284
 #12 0x00007f75cfb8b609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #13 0x00007f75cfab0293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Which is also caused in part to the invalidation of the cache.
`metadata_stream_reset_cache_consumed_position` expect that the stream
at that point be locked. Which is not the case despite what the last argument
to `lttng_consumer_read_subbuffer` indicates. To alleviate that we hold
the lock during the call to `lttng_consumer_read_subbuffer`.

Known drawbacks
=========

None.

References
=========

[1] https://git.lttng.org/?p=lttng-tools.git;a=commit;h=b1316da1ffbd276fc8271e7a9438e683ad352781

Signed-off-by: Jonathan Rajotte <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: Ib09fdc989b1baae26d0f496afeabc3e7cb27228c
jgalar pushed a commit that referenced this pull request Dec 17, 2021
Observed issue
==============

lttng-consumerd segfaults for the following scenario:

  $ lttng create test --snapshot
  $ lttng enable-event -u -an
  $ lttng start
  # Run an app just to have some events
  $ lttng regenerate metadata
  $ lttng snapshot record

The following backtrace is obtained:

 (gdb) bt
 #0  __GI___pthread_mutex_lock (mutex=0x130) at ../nptl/pthread_mutex_lock.c:67
 #1  0x000055b383cfaed3 in lttng_ustconsumer_recv_metadata (sock=29, key=4, offset=0, len=12391, version=1, channel=0x7fe90000a760, timer=0, wait=1) at ust-consumer.c:1347
 #2  0x000055b383d00197 in lttng_ustconsumer_request_metadata (ctx=0x55b3855a1e50, channel=0x7fe90000a760, timer=0, wait=1) at ust-consumer.c:3336
 #3  0x000055b383cf9e76 in snapshot_metadata (metadata_channel=0x7fe90000a760, key=4, path=0x7fe911a09944 "uid/1000/64-bit", relayd_id=18446744073709551615, ctx=0x55b3855a1e50) at ust-consum
 #4  0x000055b383cfbe73 in lttng_ustconsumer_recv_cmd (ctx=0x55b3855a1e50, sock=28, consumer_sockpoll=0x7fe911a0dbb0) at ust-consumer.c:1853
 #5  0x000055b383ccf9b7 in lttng_consumer_recv_cmd (ctx=0x55b3855a1e50, sock=28, consumer_sockpoll=0x7fe911a0dbb0) at consumer.c:2097
 #6  0x000055b383cd3bfd in consumer_thread_sessiond_poll (data=0x55b3855a1e50) at consumer.c:3284
 #7  0x00007fe914c22609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #8  0x00007fe914b47293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 (gdb) up
 #1  0x000055b383cfaed3 in lttng_ustconsumer_recv_metadata (sock=29, key=4, offset=0, len=12391, version=1, channel=0x7fe90000a760, timer=0, wait=1) at ust-consumer.c:1347
 1347                    pthread_mutex_lock(&channel->metadata_stream->lock);
 (gdb) print channel->metadata_stream
 $1 = (struct lttng_consumer_stream *) 0x0

Note that the following scenario also leads to a similar backtrace:

  $ lttng create test --snapshot
  $ lttng enable-event -u -an
  $ lttng start
  # Run an app just to have some events with a short duration
  $ lttng regenerate metadata
  # Run a second app just to have some events and a metadata flush while
  # the metadata cache status is set as `invalidated`.
  ^ lttng-consumerd segfault on app termination.

The backtrace:

 (gdb) bt
 #0  __GI___pthread_mutex_lock (mutex=0x130) at ../nptl/pthread_mutex_lock.c:67
 #1  0x00005612a5a13ed3 in lttng_ustconsumer_recv_metadata (sock=28, key=2, offset=0, len=12391, version=1, channel=0x7f3978005400, timer=0, wait=1) at ust-consumer.c:1347
 #2  0x00005612a5a14d0a in lttng_ustconsumer_recv_cmd (ctx=0x5612a6feee50, sock=28, consumer_sockpoll=0x7f3989494bb0) at ust-consumer.c:1818
 #3  0x00005612a59e89b7 in lttng_consumer_recv_cmd (ctx=0x5612a6feee50, sock=28, consumer_sockpoll=0x7f3989494bb0) at consumer.c:2097
 #4  0x00005612a59ecbfd in consumer_thread_sessiond_poll (data=0x5612a6feee50) at consumer.c:3284
 #5  0x00007f398c6a9609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #6  0x00007f398c5ce293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 (gdb) up
 #1  0x00005612a5a13ed3 in lttng_ustconsumer_recv_metadata (sock=28, key=2, offset=0, len=12391, version=1, channel=0x7f3978005400, timer=0, wait=1) at ust-consumer.c:1347
 1347                    pthread_mutex_lock(&channel->metadata_stream->lock);
 (gdb) print channel->metadata_stream
 $1 = (struct lttng_consumer_stream *) 0x0
 (gdb)

Cause
=====

For session configured in snapshot mode the metadata channel
metadata_stream field is NULL except for a "short" window during the
actual snapshot record action (snapshot_metadata).

The `regenerate metadata` effectively flag the metadata cache as invalid
leading to handling the cache invalidation state
(`CONSUMER_METADATA_CACHE_WRITE_STATUS_INVALIDATED`) in
`lttng_ustconsumer_recv_metadata`. This was introduced by
b1316da [1].

At that point the function expects the `channel->metadata_stream` to be
non null. This is simply not true for snapshot session metadata channels.

Note that we cannot simply swap `lttng_ustconsumer_request_metadata` and
`create_ust_streams` in `snapshot_metadata` to ensure that the
`channel->metadata_stream` is non null since
`lttng_ustconsumer_recv_metadata` ends up being called on metadata flush when
an app quit. This sequence of events corresponds to the second scenario
put forward in the `Observed Issue` section.

Solution
========

Null check `channel->metadata_stream` and perform only the operation
when it is non null. This partly mirror what is done in `consumer_metadata_wakeup_pipe`.

I am not sure if the check on `channel->monitor` is required but it
seems irrelevant to the notion of resetting the stream consumed position
when the stream exists.

With this taken care off, we find ourself with another
backtrace:

 #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50                                                                                                             [93/122]
 #1  0x00007f75cf9b3859 in __GI_abort () at abort.c:79
 #2  0x00007f75cf9b3729 in __assert_fail_base (fmt=0x7f75cfb49588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55ab004e9c68 "pthread_mutex_trylock(&stream->lock)", file=0x55ab004
 #3  0x00007f75cf9c4f36 in __GI___assert_fail (assertion=0x55ab004e9c68 "pthread_mutex_trylock(&stream->lock)", file=0x55ab004e8d7a "ust-consumer.c", line=1285, function=0x55ab004eb8a0 <__PR
 #4  0x000055ab004b1b9c in metadata_stream_reset_cache_consumed_position (stream=0x7f75b400a850) at ust-consumer.c:1285
 #5  0x000055ab004b4fef in commit_one_metadata_packet (stream=0x7f75b400a850) at ust-consumer.c:2551
 #6  0x000055ab004b5f46 in get_next_subbuffer_metadata (stream=0x7f75b400a850, subbuffer=0x7f75cc972630) at ust-consumer.c:2927
 #7  0x000055ab0048b6a9 in lttng_consumer_read_subbuffer (stream=0x7f75b400a850, ctx=0x55ab01d4ee50, locked_by_caller=true) at consumer.c:3522
 #8  0x000055ab004b0f66 in snapshot_metadata (metadata_channel=0x7f75b4005400, key=2, path=0x7f75cc972944 "uid/1000/64-bit", relayd_id=18446744073709551615, ctx=0x55ab01d4ee50) at ust-consum
 #9  0x000055ab004b2e86 in lttng_ustconsumer_recv_cmd (ctx=0x55ab01d4ee50, sock=28, consumer_sockpoll=0x7f75cc976bb0) at ust-consumer.c:1861
 #10 0x000055ab004869b7 in lttng_consumer_recv_cmd (ctx=0x55ab01d4ee50, sock=28, consumer_sockpoll=0x7f75cc976bb0) at consumer.c:2097
 #11 0x000055ab0048abfd in consumer_thread_sessiond_poll (data=0x55ab01d4ee50) at consumer.c:3284
 #12 0x00007f75cfb8b609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #13 0x00007f75cfab0293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Which is also caused in part to the invalidation of the cache.
`metadata_stream_reset_cache_consumed_position` expect that the stream
at that point be locked. Which is not the case despite what the last argument
to `lttng_consumer_read_subbuffer` indicates. To alleviate that we hold
the lock during the call to `lttng_consumer_read_subbuffer`.

Known drawbacks
=========

None.

References
=========

[1] https://git.lttng.org/?p=lttng-tools.git;a=commit;h=b1316da1ffbd276fc8271e7a9438e683ad352781

Signed-off-by: Jonathan Rajotte <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: Ib09fdc989b1baae26d0f496afeabc3e7cb27228c
jgalar pushed a commit that referenced this pull request Dec 17, 2021
…y_name

Observed issue
==============

The lttng-sessiond asserts with the following backtrace on lttng create:

 #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
 #1  0x00007ffff7ab5859 in __GI_abort () at abort.c:79
 #2  0x00007ffff7ab5729 in __assert_fail_base (fmt=0x7ffff7c4b588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5555556ab0a6 "node_ptr == &node->node", file=0x5555556ab085 "hashtable.c", line=298, function=<optimized out>) at a
 #3  0x00007ffff7ac6f36 in __GI___assert_fail (assertion=assertion@entry=0x5555556ab0a6 "node_ptr == &node->node", file=file@entry=0x5555556ab085 "hashtable.c", line=line@entry=298, function=function@entry=0x5555556ab380 <__PRETTY_FUNCTIO
 #4  0x000055555560be44 in lttng_ht_add_unique_str (ht=<optimized out>, node=0x7fffe0026c58) at hashtable.c:298
 #5  0x000055555558fb6a in add_session_ht (ls=0x7fffe0024970) at session.c:372
 #6  session_create (name=<optimized out>, uid=1000, gid=1000, out_session=out_session@entry=0x7fffedfddbd8) at session.c:1308
 #7  0x000055555559b219 in cmd_create_session_from_descriptor (creds=<optimized out>, creds=<optimized out>, home_path=<optimized out>, descriptor=<optimized out>) at cmd.c:3040
 #8  cmd_create_session (cmd_ctx=cmd_ctx@entry=0x7fffedfe5fa0, sock=<optimized out>, return_descriptor=return_descriptor@entry=0x7fffedfddd68) at cmd.c:3176
 #9  0x00005555555cc341 in process_client_msg (sock_error=0x7fffedfddd10, sock=0x7fffedfddd0c, cmd_ctx=0x7fffedfe5fa0) at client.c:2177
 #10 thread_manage_clients (data=<optimized out>) at client.c:2742
 #11 0x00005555555c5fff in launch_thread (data=0x55555571b780) at thread.c:66
 #12 0x00007ffff7c8b609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #13 0x00007ffff7bb2293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The issue can be reproduced with modifications to the rotation thread
code and the following scenario:

 $ lttng create test
 $ lttng enable-event -u -a
 $ lttng start
 run any app just so that we have a complete valid session. (might not be necessary)
 $ lttng destroy --no-wait
 $ lttng create test
 ^ Should assert here.

The diff to be applied:

 diff --git a/src/bin/lttng-sessiond/rotation-thread.cpp b/src/bin/lttng-sessiond/rotation-thread.cpp
 index ac149c845..c11f068ed 100644
 --- a/src/bin/lttng-sessiond/rotation-thread.cpp
 +++ b/src/bin/lttng-sessiond/rotation-thread.cpp
 @@ -565,6 +565,8 @@ int handle_job_queue(struct rotation_thread_handle *handle,
  {
         int ret = 0;

 +       sleep(5);
 +
         for (;;) {
                 struct ltt_session *session;
                 struct rotation_thread_job *job;

Note that the initial report for this issue was on a system under load
for which the `lttng destroy` completion check failed and a `lttng
create` was performed. As of today the exact reason why the completion
check failed is not known. Still we can "fix" the race leading to the
lttng-sessiond assertion considering a user might use the `--no-wait`
variant of `lttng destroy` and could easily end up in this
situation.

Cause
=====

Note: all `lttng create` commands have the same session name passed as
argument.

On `lttng destroy` the ltt_session object is flagged as destroyed
(ltt_session::destroyed). The removal of the object from the hash
table (ltt_sessions_ht_by_name) will be performed during the
`session_release` which is driven by the session refcount.

A reference on the `ltt_session` object is held for the
rotation initiated by the `lttng destroy` command. The rotation is
enqueued by the rotation thread.

At this point the system is busy and the rotation thread does not run.
We simulate this with a `sleep(5)` during the `handle_job_queue`.

The `lttng destroy --no-wait` returns. If the `--no-wait` option is not
passed the `lttng destroy` command will work as expected and wait for
completion. We can SIGINT the `lttng destroy` command and perform a
`lttng create` yielding the same backtrace.

On `lttng create`, `session_create` validates that the name does not
conflict with an existing session using `session_find_by_name`. It is
important to note that `session_find_by_name` discriminates also on the
`session->destroyed` flag (introduced by [1]).

The `ltt_sessions_ht_by_name` hash table was introduced by [2] to remove
the need to lock the session list to sample a session id during the
queueing of actions to be executed related to a trigger. The assumption
was made that, during the creation phase, the session would
always be unique in that hash table based on its name. This is simply
not true since multiple sessions with the same name can coexist as long
as only a single one is marked as "not destroyed". This is an important
concept due to the refcounting of the object and the feature relying on
the lifetime of the object (i.e rotation). This is mostly valid when
talking about the global session list.

Solution
========

Move the hash table removal earlier during the release of the session
object.

Move the removal from `del_session_ht`, which is done during the
`session_release` function, to the `lttng_session_destroy` function.

It is safe to do so since currently the only user of that hash table
(the action executor) does not care much about destroyed session at that
point.

This ensures that we maintain the uniqueness property of the key (name)
for that hash table on insertion.

The alternative was to expose an hash table that could contain
duplicates and force the handling of a set on all lookups.

Known drawbacks
=========

None.

References
==========
[1] https://git.lttng.org/?p=lttng-tools.git;a=commit;h=e32d7f274604b77bcd83c24994e88df3761ed658
[2] https://git.lttng.org/?p=lttng-tools.git;a=commit;h=e1bbf98908a6399f39a9a8bc95bd8b59cecaa816

Signed-off-by: Jonathan Rajotte <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: I2f1d0d6c04ee7210166e9847a850afbe6eaa7609
jgalar pushed a commit that referenced this pull request Dec 17, 2021
Observed issue
==============

lttng-consumerd segfaults for the following scenario:

  $ lttng create test --snapshot
  $ lttng enable-event -u -an
  $ lttng start
  # Run an app just to have some events
  $ lttng regenerate metadata
  $ lttng snapshot record

The following backtrace is obtained:

 (gdb) bt
 #0  __GI___pthread_mutex_lock (mutex=0x130) at ../nptl/pthread_mutex_lock.c:67
 #1  0x000055b383cfaed3 in lttng_ustconsumer_recv_metadata (sock=29, key=4, offset=0, len=12391, version=1, channel=0x7fe90000a760, timer=0, wait=1) at ust-consumer.c:1347
 #2  0x000055b383d00197 in lttng_ustconsumer_request_metadata (ctx=0x55b3855a1e50, channel=0x7fe90000a760, timer=0, wait=1) at ust-consumer.c:3336
 #3  0x000055b383cf9e76 in snapshot_metadata (metadata_channel=0x7fe90000a760, key=4, path=0x7fe911a09944 "uid/1000/64-bit", relayd_id=18446744073709551615, ctx=0x55b3855a1e50) at ust-consum
 #4  0x000055b383cfbe73 in lttng_ustconsumer_recv_cmd (ctx=0x55b3855a1e50, sock=28, consumer_sockpoll=0x7fe911a0dbb0) at ust-consumer.c:1853
 #5  0x000055b383ccf9b7 in lttng_consumer_recv_cmd (ctx=0x55b3855a1e50, sock=28, consumer_sockpoll=0x7fe911a0dbb0) at consumer.c:2097
 #6  0x000055b383cd3bfd in consumer_thread_sessiond_poll (data=0x55b3855a1e50) at consumer.c:3284
 #7  0x00007fe914c22609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #8  0x00007fe914b47293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 (gdb) up
 #1  0x000055b383cfaed3 in lttng_ustconsumer_recv_metadata (sock=29, key=4, offset=0, len=12391, version=1, channel=0x7fe90000a760, timer=0, wait=1) at ust-consumer.c:1347
 1347                    pthread_mutex_lock(&channel->metadata_stream->lock);
 (gdb) print channel->metadata_stream
 $1 = (struct lttng_consumer_stream *) 0x0

Note that the following scenario also leads to a similar backtrace:

  $ lttng create test --snapshot
  $ lttng enable-event -u -an
  $ lttng start
  # Run an app just to have some events with a short duration
  $ lttng regenerate metadata
  # Run a second app just to have some events and a metadata flush while
  # the metadata cache status is set as `invalidated`.
  ^ lttng-consumerd segfault on app termination.

The backtrace:

 (gdb) bt
 #0  __GI___pthread_mutex_lock (mutex=0x130) at ../nptl/pthread_mutex_lock.c:67
 #1  0x00005612a5a13ed3 in lttng_ustconsumer_recv_metadata (sock=28, key=2, offset=0, len=12391, version=1, channel=0x7f3978005400, timer=0, wait=1) at ust-consumer.c:1347
 #2  0x00005612a5a14d0a in lttng_ustconsumer_recv_cmd (ctx=0x5612a6feee50, sock=28, consumer_sockpoll=0x7f3989494bb0) at ust-consumer.c:1818
 #3  0x00005612a59e89b7 in lttng_consumer_recv_cmd (ctx=0x5612a6feee50, sock=28, consumer_sockpoll=0x7f3989494bb0) at consumer.c:2097
 #4  0x00005612a59ecbfd in consumer_thread_sessiond_poll (data=0x5612a6feee50) at consumer.c:3284
 #5  0x00007f398c6a9609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #6  0x00007f398c5ce293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 (gdb) up
 #1  0x00005612a5a13ed3 in lttng_ustconsumer_recv_metadata (sock=28, key=2, offset=0, len=12391, version=1, channel=0x7f3978005400, timer=0, wait=1) at ust-consumer.c:1347
 1347                    pthread_mutex_lock(&channel->metadata_stream->lock);
 (gdb) print channel->metadata_stream
 $1 = (struct lttng_consumer_stream *) 0x0
 (gdb)

Cause
=====

For session configured in snapshot mode the metadata channel
metadata_stream field is NULL except for a "short" window during the
actual snapshot record action (snapshot_metadata).

The `regenerate metadata` effectively flag the metadata cache as invalid
leading to handling the cache invalidation state
(`CONSUMER_METADATA_CACHE_WRITE_STATUS_INVALIDATED`) in
`lttng_ustconsumer_recv_metadata`. This was introduced by
b1316da [1].

At that point the function expects the `channel->metadata_stream` to be
non null. This is simply not true for snapshot session metadata channels.

Note that we cannot simply swap `lttng_ustconsumer_request_metadata` and
`create_ust_streams` in `snapshot_metadata` to ensure that the
`channel->metadata_stream` is non null since
`lttng_ustconsumer_recv_metadata` ends up being called on metadata flush when
an app quit. This sequence of events corresponds to the second scenario
put forward in the `Observed Issue` section.

Solution
========

Null check `channel->metadata_stream` and perform only the operation
when it is non null. This partly mirror what is done in `consumer_metadata_wakeup_pipe`.

I am not sure if the check on `channel->monitor` is required but it
seems irrelevant to the notion of resetting the stream consumed position
when the stream exists.

With this taken care off, we find ourself with another
backtrace:

 #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50                                                                                                             [93/122]
 #1  0x00007f75cf9b3859 in __GI_abort () at abort.c:79
 #2  0x00007f75cf9b3729 in __assert_fail_base (fmt=0x7f75cfb49588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55ab004e9c68 "pthread_mutex_trylock(&stream->lock)", file=0x55ab004
 #3  0x00007f75cf9c4f36 in __GI___assert_fail (assertion=0x55ab004e9c68 "pthread_mutex_trylock(&stream->lock)", file=0x55ab004e8d7a "ust-consumer.c", line=1285, function=0x55ab004eb8a0 <__PR
 #4  0x000055ab004b1b9c in metadata_stream_reset_cache_consumed_position (stream=0x7f75b400a850) at ust-consumer.c:1285
 #5  0x000055ab004b4fef in commit_one_metadata_packet (stream=0x7f75b400a850) at ust-consumer.c:2551
 #6  0x000055ab004b5f46 in get_next_subbuffer_metadata (stream=0x7f75b400a850, subbuffer=0x7f75cc972630) at ust-consumer.c:2927
 #7  0x000055ab0048b6a9 in lttng_consumer_read_subbuffer (stream=0x7f75b400a850, ctx=0x55ab01d4ee50, locked_by_caller=true) at consumer.c:3522
 #8  0x000055ab004b0f66 in snapshot_metadata (metadata_channel=0x7f75b4005400, key=2, path=0x7f75cc972944 "uid/1000/64-bit", relayd_id=18446744073709551615, ctx=0x55ab01d4ee50) at ust-consum
 #9  0x000055ab004b2e86 in lttng_ustconsumer_recv_cmd (ctx=0x55ab01d4ee50, sock=28, consumer_sockpoll=0x7f75cc976bb0) at ust-consumer.c:1861
 #10 0x000055ab004869b7 in lttng_consumer_recv_cmd (ctx=0x55ab01d4ee50, sock=28, consumer_sockpoll=0x7f75cc976bb0) at consumer.c:2097
 #11 0x000055ab0048abfd in consumer_thread_sessiond_poll (data=0x55ab01d4ee50) at consumer.c:3284
 #12 0x00007f75cfb8b609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #13 0x00007f75cfab0293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Which is also caused in part to the invalidation of the cache.
`metadata_stream_reset_cache_consumed_position` expect that the stream
at that point be locked. Which is not the case despite what the last argument
to `lttng_consumer_read_subbuffer` indicates. To alleviate that we hold
the lock during the call to `lttng_consumer_read_subbuffer`.

Known drawbacks
=========

None.

References
=========

[1] https://git.lttng.org/?p=lttng-tools.git;a=commit;h=b1316da1ffbd276fc8271e7a9438e683ad352781

Signed-off-by: Jonathan Rajotte <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: Ib09fdc989b1baae26d0f496afeabc3e7cb27228c
jgalar added a commit that referenced this pull request Mar 8, 2022
Observed issue
==============

When consumer_stream_destroy() is called from, for example, the error
path in setup_metadata(), consumer_stream_free() can end up being called
twice on the same stream.  Since the stream->metadata_bucket is not set
to NULL after being destroyed, it leads to a use-after-free:

 ERROR: AddressSanitizer: heap-use-after-free on address 0x604000000318
 READ of size 8 at 0x604000000318 thread T7
     #0 in metadata_bucket_destroy
     #1 in consumer_stream_free
     #2 in consumer_stream_destroy
     #3 in setup_metadata
     #4 in lttng_ustconsumer_recv_cmd
     #5 in lttng_consumer_recv_cmd
     #6 in consumer_thread_sessiond_poll
     #7 in start_thread nptl/pthread_create.c:481
     #8 in clone (/lib/x86_64-linux-gnu/libc.so.6+0xfcbde)

 0x604000000318 is located 8 bytes inside of 48-byte region [0x604000000310,0x604000000340)
 freed by thread T7 here:
     #0 in __interceptor_free
     #1 in metadata_bucket_destroy
     #2 in consumer_stream_free
     #3 in consumer_stream_destroy
     #4 in clean_channel_stream_list
     #5 in consumer_del_channel
     #6 in consumer_stream_destroy
     #7 in setup_metadata
     #8 in lttng_ustconsumer_recv_cmd
     #9 in lttng_consumer_recv_cmd
     #10 in consumer_thread_sessiond_poll
     #11 in start_thread nptl/pthread_create.c:481

 previously allocated by thread T7 here:
     #0 in __interceptor_calloc
     #1 in zmalloc
     #2 in metadata_bucket_create
     #3 in consumer_stream_enable_metadata_bucketization
     #4 in lttng_ustconsumer_set_stream_ops
     #5 in lttng_ustconsumer_on_recv_stream
     #6 in lttng_consumer_on_recv_stream
     #7 in create_ust_streams
     #8 in ask_channel
     #9 in lttng_ustconsumer_recv_cmd
     #10 in lttng_consumer_recv_cmd
     #11 in consumer_thread_sessiond_poll
     #12 in start_thread nptl/pthread_create.c:481

 Thread T7 created by T0 here:
     #0 in __interceptor_pthread_create
     #1 in main
     #2 in __libc_start_main ../csu/libc-start.c:332

 SUMMARY: AddressSanitizer: heap-use-after-free in metadata_bucket_destroy

This can be easily reproduced by forcing a failure during the setup
of the metadata reproducible using the following change:

  diff --git a/src/common/ust-consumer/ust-consumer.c b/src/common/ust-consumer/ust-consumer.c
  index fa1c71299..97ed59632 100644

  --- a/src/common/ust-consumer/ust-consumer.c
  +++ b/src/common/ust-consumer/ust-consumer.c
  @@ -908,8 +908,7 @@ static int setup_metadata(struct lttng_consumer_local_data *ctx, uint64_t key)

           /* Send metadata stream to relayd if needed. */
           if (metadata->metadata_stream->net_seq_idx != (uint64_t) -1ULL) {
  -                ret = consumer_send_relayd_stream(metadata->metadata_stream,
  -                                metadata->pathname);
  +                ret = -1;
                   if (ret < 0) {
                           ret = LTTCOMM_CONSUMERD_ERROR_METADATA;
                           goto error;

Cause
=====

Channels have a list of streams that are being "setup" and are not
yet monitored for consumption. During this setup phase, the streams are
owned by the channel. On destruction of the channel, any stream in that
list will thus be cleaned-up.

When destroying a consumer stream, a reference to its channel is 'put'.
This can result in the destruction of the channel.

In the situation described above, the release of the channel's reference
is done before the stream is removed from the channel's stream list.
This causes the channel's clean-up to invoke (again) the current
stream's clean-up, resulting in the double-free of the metadata bucket.

This problem is present in a number of error paths.

Solution
========

Some error paths already manually removed the consumer stream from it's
channel's stream list before invoking consumer_stream_destroy(). The
various error paths that have to deal with this possible situation are
changed to simply invoke consumer_stream_destroy().

consumer_stream_destroy() is modified to always remove the stream from
its channel's list before performing the rest of the clean-up. This
ensures that those double clean-ups can't occur.

Drawbacks
=========

None.

Reported-by: Vincent Whitchurch <[email protected]>
Tested-by: Vincent Whitchurch <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: Ibeca9b675b86fc46be3f57826f7158de4da43df8
jgalar added a commit that referenced this pull request Mar 9, 2022
Observed issue
==============

When consumer_stream_destroy() is called from, for example, the error
path in setup_metadata(), consumer_stream_free() can end up being called
twice on the same stream.  Since the stream->metadata_bucket is not set
to NULL after being destroyed, it leads to a use-after-free:

 ERROR: AddressSanitizer: heap-use-after-free on address 0x604000000318
 READ of size 8 at 0x604000000318 thread T7
     #0 in metadata_bucket_destroy
     #1 in consumer_stream_free
     #2 in consumer_stream_destroy
     #3 in setup_metadata
     #4 in lttng_ustconsumer_recv_cmd
     #5 in lttng_consumer_recv_cmd
     #6 in consumer_thread_sessiond_poll
     #7 in start_thread nptl/pthread_create.c:481
     #8 in clone (/lib/x86_64-linux-gnu/libc.so.6+0xfcbde)

 0x604000000318 is located 8 bytes inside of 48-byte region [0x604000000310,0x604000000340)
 freed by thread T7 here:
     #0 in __interceptor_free
     #1 in metadata_bucket_destroy
     #2 in consumer_stream_free
     #3 in consumer_stream_destroy
     #4 in clean_channel_stream_list
     #5 in consumer_del_channel
     #6 in consumer_stream_destroy
     #7 in setup_metadata
     #8 in lttng_ustconsumer_recv_cmd
     #9 in lttng_consumer_recv_cmd
     #10 in consumer_thread_sessiond_poll
     #11 in start_thread nptl/pthread_create.c:481

 previously allocated by thread T7 here:
     #0 in __interceptor_calloc
     #1 in zmalloc
     #2 in metadata_bucket_create
     #3 in consumer_stream_enable_metadata_bucketization
     #4 in lttng_ustconsumer_set_stream_ops
     #5 in lttng_ustconsumer_on_recv_stream
     #6 in lttng_consumer_on_recv_stream
     #7 in create_ust_streams
     #8 in ask_channel
     #9 in lttng_ustconsumer_recv_cmd
     #10 in lttng_consumer_recv_cmd
     #11 in consumer_thread_sessiond_poll
     #12 in start_thread nptl/pthread_create.c:481

 Thread T7 created by T0 here:
     #0 in __interceptor_pthread_create
     #1 in main
     #2 in __libc_start_main ../csu/libc-start.c:332

 SUMMARY: AddressSanitizer: heap-use-after-free in metadata_bucket_destroy

This can be easily reproduced by forcing a failure during the setup
of the metadata reproducible using the following change:

  diff --git a/src/common/ust-consumer/ust-consumer.c b/src/common/ust-consumer/ust-consumer.c
  index fa1c71299..97ed59632 100644

  --- a/src/common/ust-consumer/ust-consumer.c
  +++ b/src/common/ust-consumer/ust-consumer.c
  @@ -908,8 +908,7 @@ static int setup_metadata(struct lttng_consumer_local_data *ctx, uint64_t key)

           /* Send metadata stream to relayd if needed. */
           if (metadata->metadata_stream->net_seq_idx != (uint64_t) -1ULL) {
  -                ret = consumer_send_relayd_stream(metadata->metadata_stream,
  -                                metadata->pathname);
  +                ret = -1;
                   if (ret < 0) {
                           ret = LTTCOMM_CONSUMERD_ERROR_METADATA;
                           goto error;

Cause
=====

Channels have a list of streams that are being "setup" and are not
yet monitored for consumption. During this setup phase, the streams are
owned by the channel. On destruction of the channel, any stream in that
list will thus be cleaned-up.

When destroying a consumer stream, a reference to its channel is 'put'.
This can result in the destruction of the channel.

In the situation described above, the release of the channel's reference
is done before the stream is removed from the channel's stream list.
This causes the channel's clean-up to invoke (again) the current
stream's clean-up, resulting in the double-free of the metadata bucket.

This problem is present in a number of error paths.

Solution
========

Some error paths already manually removed the consumer stream from it's
channel's stream list before invoking consumer_stream_destroy(). The
various error paths that have to deal with this possible situation are
changed to simply invoke consumer_stream_destroy().

consumer_stream_destroy() is modified to always remove the stream from
its channel's list before performing the rest of the clean-up. This
ensures that those double clean-ups can't occur.

Drawbacks
=========

None.

Reported-by: Vincent Whitchurch <[email protected]>
Tested-by: Vincent Whitchurch <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: Ibeca9b675b86fc46be3f57826f7158de4da43df8
jgalar added a commit that referenced this pull request Mar 9, 2022
Observed issue
==============

When consumer_stream_destroy() is called from, for example, the error
path in setup_metadata(), consumer_stream_free() can end up being called
twice on the same stream.  Since the stream->metadata_bucket is not set
to NULL after being destroyed, it leads to a use-after-free:

 ERROR: AddressSanitizer: heap-use-after-free on address 0x604000000318
 READ of size 8 at 0x604000000318 thread T7
     #0 in metadata_bucket_destroy
     #1 in consumer_stream_free
     #2 in consumer_stream_destroy
     #3 in setup_metadata
     #4 in lttng_ustconsumer_recv_cmd
     #5 in lttng_consumer_recv_cmd
     #6 in consumer_thread_sessiond_poll
     #7 in start_thread nptl/pthread_create.c:481
     #8 in clone (/lib/x86_64-linux-gnu/libc.so.6+0xfcbde)

 0x604000000318 is located 8 bytes inside of 48-byte region [0x604000000310,0x604000000340)
 freed by thread T7 here:
     #0 in __interceptor_free
     #1 in metadata_bucket_destroy
     #2 in consumer_stream_free
     #3 in consumer_stream_destroy
     #4 in clean_channel_stream_list
     #5 in consumer_del_channel
     #6 in consumer_stream_destroy
     #7 in setup_metadata
     #8 in lttng_ustconsumer_recv_cmd
     #9 in lttng_consumer_recv_cmd
     #10 in consumer_thread_sessiond_poll
     #11 in start_thread nptl/pthread_create.c:481

 previously allocated by thread T7 here:
     #0 in __interceptor_calloc
     #1 in zmalloc
     #2 in metadata_bucket_create
     #3 in consumer_stream_enable_metadata_bucketization
     #4 in lttng_ustconsumer_set_stream_ops
     #5 in lttng_ustconsumer_on_recv_stream
     #6 in lttng_consumer_on_recv_stream
     #7 in create_ust_streams
     #8 in ask_channel
     #9 in lttng_ustconsumer_recv_cmd
     #10 in lttng_consumer_recv_cmd
     #11 in consumer_thread_sessiond_poll
     #12 in start_thread nptl/pthread_create.c:481

 Thread T7 created by T0 here:
     #0 in __interceptor_pthread_create
     #1 in main
     #2 in __libc_start_main ../csu/libc-start.c:332

 SUMMARY: AddressSanitizer: heap-use-after-free in metadata_bucket_destroy

This can be easily reproduced by forcing a failure during the setup
of the metadata reproducible using the following change:

  diff --git a/src/common/ust-consumer/ust-consumer.c b/src/common/ust-consumer/ust-consumer.c
  index fa1c71299..97ed59632 100644

  --- a/src/common/ust-consumer/ust-consumer.c
  +++ b/src/common/ust-consumer/ust-consumer.c
  @@ -908,8 +908,7 @@ static int setup_metadata(struct lttng_consumer_local_data *ctx, uint64_t key)

           /* Send metadata stream to relayd if needed. */
           if (metadata->metadata_stream->net_seq_idx != (uint64_t) -1ULL) {
  -                ret = consumer_send_relayd_stream(metadata->metadata_stream,
  -                                metadata->pathname);
  +                ret = -1;
                   if (ret < 0) {
                           ret = LTTCOMM_CONSUMERD_ERROR_METADATA;
                           goto error;

Cause
=====

Channels have a list of streams that are being "setup" and are not
yet monitored for consumption. During this setup phase, the streams are
owned by the channel. On destruction of the channel, any stream in that
list will thus be cleaned-up.

When destroying a consumer stream, a reference to its channel is 'put'.
This can result in the destruction of the channel.

In the situation described above, the release of the channel's reference
is done before the stream is removed from the channel's stream list.
This causes the channel's clean-up to invoke (again) the current
stream's clean-up, resulting in the double-free of the metadata bucket.

This problem is present in a number of error paths.

Solution
========

Some error paths already manually removed the consumer stream from it's
channel's stream list before invoking consumer_stream_destroy(). The
various error paths that have to deal with this possible situation are
changed to simply invoke consumer_stream_destroy().

consumer_stream_destroy() is modified to always remove the stream from
its channel's list before performing the rest of the clean-up. This
ensures that those double clean-ups can't occur.

Drawbacks
=========

None.

Reported-by: Vincent Whitchurch <[email protected]>
Tested-by: Vincent Whitchurch <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: Ibeca9b675b86fc46be3f57826f7158de4da43df8
jgalar pushed a commit that referenced this pull request Apr 6, 2022
Observed issue
==============

Scenario:

gdb lttng-sessiond
  set non-stop
  break rotation-thread.cpp:584
  ^ simulates a slow rotation thread or not scheduled thread.

lttng create test1
lttng enable-event -u -a
lttng start test1
lttng create test2
lttng enable-event -u -a
lttng start test2
lttng destroy test1
   This will hang on rotation pending checks on the CLI side.

In another shell:

lttng destroy test2
   This will hang on rotation pending checks on the CLI side.

Back to gdb
   thread 7
   continue

Results in:

 #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
 #1  0x00007ffff786c859 in __GI_abort () at abort.c:79
 #2  0x00007ffff786c729 in __assert_fail_base (fmt=0x7ffff7a02588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5555556bb148 "count == lttng_ht_get_count(ltt_sessions_ht_by_name)", file=0x5555556bae9f "session.cpp", line=395, function=<optimized out>) at assert.c:92
 #3  0x00007ffff787e006 in __GI___assert_fail (assertion=0x5555556bb148 "count == lttng_ht_get_count(ltt_sessions_ht_by_name)", file=0x5555556bae9f "session.cpp", line=395, function=0x5555556bb129 "int ltt_sessions_ht_empty()") at assert.c:101
 #4  0x0000555555586d59 in ltt_sessions_ht_empty () at session.cpp:395
 #5  0x0000555555586e53 in del_session_ht (ls=0x7fffdc000c30) at session.cpp:418
 #6  0x0000555555588a95 in session_release (ref=0x7fffdc001e50) at session.cpp:999
 #7  0x000055555558620f in urcu_ref_put (ref=0x7fffdc001e50, release=0x5555555886eb <session_release(urcu_ref*)>) at /home/joraj/lttng/master/install/include/urcu/ref.h:68
 #8  0x0000555555588c8f in session_put (session=0x7fffdc000c30) at session.cpp:1048
 #9  0x00005555555bf995 in handle_job_queue (handle=0x55555575d260, state=0x7fffeeffc240, queue=0x555555758960) at rotation-thread.cpp:612
 #10 0x00005555555c05da in thread_rotation (data=0x55555575d260) at rotation-thread.cpp:847
 #11 0x00005555555c3b1c in launch_thread (data=0x55555575d2f0) at thread.cpp:66
 #12 0x00007ffff7a46609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #13 0x00007ffff7969163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Other scenarios can lead to a similar backtrace when using the
`--no-wait` lttng destroy option.

Cause
=====

Since ed41e57 [1], hash table removal
for the session object for the `ltt_sessions_ht_by_name` and
`ltt_sessions_ht_by_name` are "decoupled". Removal from
`ltt_sessions_ht_by_name` is done early in `session_destroy()` while
removal from `ltt_sessions_ht_by_id` is done during `session_release` when
the last reference of a session object is released.

This can leads to `imbalances` between the size of the two hash tables
when multiple sessions are at play.

Solution
========

Rework `ltt_sessions_ht_empty()` to exit early when
`ltt_sessions_ht_by_id` is not empty. Perform a sanity check on
`ltt_sessions_ht_by_name` only when `ltt_sessions_ht_by_id` is empty.

Note
========

Ideally both hash tables' lifetime would be managed separately but it
seems easier in term of initialization to bundle them together for now
considering the limited scope of the `ltt_sessions_ht_by_name` hash
table.

Signed-off-by: Jonathan Rajotte <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: I66c459f80298f929add703ac977cccd1da6dd556
jgalar pushed a commit that referenced this pull request Apr 8, 2022
Observed issue
==============

Scenario:

gdb lttng-sessiond
  set non-stop
  break rotation-thread.cpp:584
  ^ simulates a slow rotation thread or not scheduled thread.

lttng create test1
lttng enable-event -u -a
lttng start test1
lttng create test2
lttng enable-event -u -a
lttng start test2
lttng destroy test1
   This will hang on rotation pending checks on the CLI side.

In another shell:

lttng destroy test2
   This will hang on rotation pending checks on the CLI side.

Back to gdb
   thread 7
   continue

Results in:

 #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
 #1  0x00007ffff786c859 in __GI_abort () at abort.c:79
 #2  0x00007ffff786c729 in __assert_fail_base (fmt=0x7ffff7a02588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5555556bb148 "count == lttng_ht_get_count(ltt_sessions_ht_by_name)", file=0x5555556bae9f "session.cpp", line=395, function=<optimized out>) at assert.c:92
 #3  0x00007ffff787e006 in __GI___assert_fail (assertion=0x5555556bb148 "count == lttng_ht_get_count(ltt_sessions_ht_by_name)", file=0x5555556bae9f "session.cpp", line=395, function=0x5555556bb129 "int ltt_sessions_ht_empty()") at assert.c:101
 #4  0x0000555555586d59 in ltt_sessions_ht_empty () at session.cpp:395
 #5  0x0000555555586e53 in del_session_ht (ls=0x7fffdc000c30) at session.cpp:418
 #6  0x0000555555588a95 in session_release (ref=0x7fffdc001e50) at session.cpp:999
 #7  0x000055555558620f in urcu_ref_put (ref=0x7fffdc001e50, release=0x5555555886eb <session_release(urcu_ref*)>) at /home/joraj/lttng/master/install/include/urcu/ref.h:68
 #8  0x0000555555588c8f in session_put (session=0x7fffdc000c30) at session.cpp:1048
 #9  0x00005555555bf995 in handle_job_queue (handle=0x55555575d260, state=0x7fffeeffc240, queue=0x555555758960) at rotation-thread.cpp:612
 #10 0x00005555555c05da in thread_rotation (data=0x55555575d260) at rotation-thread.cpp:847
 #11 0x00005555555c3b1c in launch_thread (data=0x55555575d2f0) at thread.cpp:66
 #12 0x00007ffff7a46609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #13 0x00007ffff7969163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Other scenarios can lead to a similar backtrace when using the
`--no-wait` lttng destroy option.

Cause
=====

Since ed41e57 [1], hash table removal
for the session object for the `ltt_sessions_ht_by_name` and
`ltt_sessions_ht_by_name` are "decoupled". Removal from
`ltt_sessions_ht_by_name` is done early in `session_destroy()` while
removal from `ltt_sessions_ht_by_id` is done during `session_release` when
the last reference of a session object is released.

This can leads to `imbalances` between the size of the two hash tables
when multiple sessions are at play.

Solution
========

Rework `ltt_sessions_ht_empty()` to exit early when
`ltt_sessions_ht_by_id` is not empty. Perform a sanity check on
`ltt_sessions_ht_by_name` only when `ltt_sessions_ht_by_id` is empty.

Note
========

Ideally both hash tables' lifetime would be managed separately but it
seems easier in term of initialization to bundle them together for now
considering the limited scope of the `ltt_sessions_ht_by_name` hash
table.

Signed-off-by: Jonathan Rajotte <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: I66c459f80298f929add703ac977cccd1da6dd556
jgalar pushed a commit that referenced this pull request Apr 12, 2022
Observed issue
==============

During high throughput event notification generation scenarios the
following deadlock happens:

 Thread 14 (Thread 0x7f74b4ff9700 (LWP 76062)):
 #0  __lll_lock_wait (futex=futex@entry=0x56408765dde8, private=0) at lowlevellock.c:52
 #1  0x00007f74c941a0a3 in __GI___pthread_mutex_lock (mutex=0x56408765dde8) at ../nptl/pthread_mutex_lock.c:80
 #2  0x000056408704b207 in run_command_wait (handle=0x56408765ddd0, cmd=0x7f74b4ff7f70) at notification-thread-commands.cpp:31
 #3  0x000056408704bcef in notification_thread_command_remove_tracer_event_source (handle=0x56408765ddd0, tracer_event_source_fd=54) at notification-thread-commands.cpp:319
 #4  0x000056408708a0c1 in delete_ust_app (app=0x7f749c000bf0) at ust-app.cpp:1059
 #5  0x000056408708a511 in delete_ust_app_rcu (head=0x7f749c000ca0) at ust-app.cpp:1122
 #6  0x00007f74c988b4a7 in call_rcu_thread (arg=0x7f74b8004a80) at ../src/urcu-call-rcu-impl.h:369
 #7  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #8  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 13 (Thread 0x7f74b57fa700 (LWP 76047)):
 #0  0x00007f74c933a49e in epoll_wait (epfd=48, events=0x7f74a4000b60, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
 #1  0x00005640870eafa6 in compat_epoll_wait (events=0x7f74b57f9240, timeout=-1, interruptible=false) at compat/poll.cpp:280
 #2  0x00005640870abb65 in thread_agent_management (data=0x56408765f0b0) at agent-thread.cpp:424
 #3  0x0000564087062b1a in launch_thread (data=0x56408765f150) at thread.cpp:66
 #4  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #5  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 12 (Thread 0x7f74b5ffb700 (LWP 76046)):                                                                                                                                       [630/709]
 #0  0x00007f74c933a49e in epoll_wait (epfd=47, events=0x7f74a0000b60, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
 #1  0x00005640870eafa6 in compat_epoll_wait (events=0x7f74b5ffa170, timeout=-1, interruptible=false) at compat/poll.cpp:280
 #2  0x00005640870a4095 in thread_application_notification (data=0x56408765ee40) at notify-apps.cpp:78
 #3  0x0000564087062b1a in launch_thread (data=0x56408765eed0) at thread.cpp:66
 #4  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #5  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 11 (Thread 0x7f74b67fc700 (LWP 76045)):
 #0  0x00007f74c933a49e in epoll_wait (epfd=44, events=0x7f74ac000b60, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
 #1  0x00005640870eafa6 in compat_epoll_wait (events=0x7f74b67fb170, timeout=-1, interruptible=false) at compat/poll.cpp:280
 #2  0x00005640870723db in thread_application_management (data=0x56408765ebd0) at manage-apps.cpp:93
 #3  0x0000564087062b1a in launch_thread (data=0x56408765ec60) at thread.cpp:66
 #4  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #5  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 10 (Thread 0x7f74b6ffd700 (LWP 76044)):
 #0  0x00007f74c933a49e in epoll_wait (epfd=39, events=0x7f74a8000b60, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
 #1  0x00005640870eafa6 in compat_epoll_wait (events=0x7f74b6ffc130, timeout=-1, interruptible=false) at compat/poll.cpp:280
 #2  0x0000564087070a27 in thread_application_registration (data=0x56408765e940) at register.cpp:214
 #3  0x0000564087062b1a in launch_thread (data=0x56408765e9f0) at thread.cpp:66
 #4  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #5  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 9 (Thread 0x7f74b77fe700 (LWP 76043)):                                                                                                                                        [654/709]
 #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
 #1  0x00005640870c8e25 in futex (uaddr=0x5640871e2800 <ust_cmd_queue>, op=0, val=-1, timeout=0x0, uaddr2=0x0, val3=0) at /home/joraj/lttng/master/install/include/urcu/futex.h:72
 #2  0x00005640870c8e6d in futex_async (uaddr=0x5640871e2800 <ust_cmd_queue>, op=0, val=-1, timeout=0x0, uaddr2=0x0, val3=0) at /home/joraj/lttng/master/install/include/urcu/futex.h:104
 #3  0x00005640870c939a in futex_nto1_wait (futex=0x5640871e2800 <ust_cmd_queue>) at futex.cpp:77
 #4  0x000056408706f2af in thread_dispatch_ust_registration (data=0x56408765e740) at dispatch.cpp:453
 #5  0x0000564087062b1a in launch_thread (data=0x56408765e760) at thread.cpp:66
 #6  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #7  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 8 (Thread 0x7f74b7fff700 (LWP 76042)):
 #0  0x00007f74c933a49e in epoll_wait (epfd=33, events=0x7f74b0000b60, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
 #1  0x00005640870eafa6 in compat_epoll_wait (events=0x7f74b7ffad40, timeout=-1, interruptible=false) at compat/poll.cpp:280
 #2  0x000056408706c424 in thread_manage_clients (data=0x56408765e4f0) at client.cpp:2528
 #3  0x0000564087062b1a in launch_thread (data=0x56408765e560) at thread.cpp:66
 #4  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #5  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 7 (Thread 0x7f74c4b8f700 (LWP 76041)):                                                                                                                                        [672/709]
 #0  0x00007f74c933a49e in epoll_wait (epfd=31, events=0x7f74bc000b60, maxevents=3, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
 #1  0x00005640870eafa6 in compat_epoll_wait (events=0x7f74c4b8e240, timeout=-1, interruptible=false) at compat/poll.cpp:280
 #2  0x000056408705f2b6 in thread_rotation (data=0x56408765e280) at rotation-thread.cpp:804
 #3  0x0000564087062b1a in launch_thread (data=0x56408765e310) at thread.cpp:66
 #4  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #5  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 6 (Thread 0x7f74c5390700 (LWP 76040)):
 #0  0x00007f74c925f1d2 in __GI___sigtimedwait (set=0x7f74c538f090, info=0x7f74c538f110, timeout=0x0) at ../sysdeps/unix/sysv/linux/sigtimedwait.c:29
 #1  0x000056408706138a in thread_timer (data=0x7ffc1fcbe3f0) at timer.cpp:359
 #2  0x0000564087062b1a in launch_thread (data=0x56408765e0a0) at thread.cpp:66
 #3  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #4  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 5 (Thread 0x7f74c5b91700 (LWP 76039)):
 #0  __libc_write (nbytes=8, buf=0x7f74c5b8fc88, fd=24) at ../sysdeps/unix/sysv/linux/write.c:26
 #1  __libc_write (fd=24, buf=0x7f74c5b8fc88, nbytes=8) at ../sysdeps/unix/sysv/linux/write.c:24
 #2  0x00005640870eeb4f in lttng_write (fd=24, buf=0x7f74c5b8fc88, count=8) at readwrite.cpp:77
 #3  0x000056408704b535 in run_command_no_wait (handle=0x56408765ddd0, in_cmd=0x7f74c5b8fdf0) at notification-thread-commands.cpp:92
 #4  0x000056408704bf49 in notification_thread_client_communication_update (handle=0x56408765ddd0, id=2, transmission_status=CLIENT_TRANSMISSION_STATUS_QUEUED) at notification-thread-command
 #5  0x000056408707bc62 in client_handle_transmission_status (client=0x7f74b80050d0, status=CLIENT_TRANSMISSION_STATUS_QUEUED, user_data=0x7f74b8004410) at action-executor.cpp:258
 #6  0x0000564087057525 in notification_client_list_send_evaluation (client_list=0x7f74b8004df0, trigger=0x7f74b0001030, evaluation=0x7f74b815d1d0, source_object_creds=0x0, client_report=0x5
 #7  0x000056408707bce9 in action_executor_notify_handler (executor=0x7f74b8004410, work_item=0x7f74b815d430, item=0x7f74b80e48e0) at action-executor.cpp:269
 #8  0x000056408707dd6d in action_executor_generic_handler (executor=0x7f74b8004410, work_item=0x7f74b815d430, item=0x7f74b80e48e0) at action-executor.cpp:670
 #9  0x000056408707df01 in action_work_item_execute (executor=0x7f74b8004410, work_item=0x7f74b815d430) at action-executor.cpp:689

 #10 0x000056408707e525 in action_executor_thread (_data=0x7f74b8004410) at action-executor.cpp:771                                                                                   [698/709]
 #11 0x0000564087062b1a in launch_thread (data=0x7f74b80044b0) at thread.cpp:66
 #12 0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #13 0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 4 (Thread 0x7f74c6392700 (LWP 76038)):
 #0  __lll_lock_wait (futex=futex@entry=0x56408765dde8, private=0) at lowlevellock.c:52
 #1  0x00007f74c941a0a3 in __GI___pthread_mutex_lock (mutex=0x56408765dde8) at ../nptl/pthread_mutex_lock.c:80
 #2  0x0000564087053c89 in handle_notification_thread_command (handle=0x56408765ddd0, state=0x7f74c63911b0) at notification-thread-events.cpp:3142
 #3  0x000056408704ac81 in thread_notification (data=0x56408765ddd0) at notification-thread.cpp:715
 #4  0x0000564087062b1a in launch_thread (data=0x56408765dec0) at thread.cpp:66
 #5  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #run_command_no_wait6  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 3 (Thread 0x7f74c6b93700 (LWP 76037)):
 #0  0x00007f74c933a49e in epoll_wait (epfd=21, events=0x7f74c0000b60, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
 #1  0x00005640870eafa6 in compat_epoll_wait (events=0x7f74c6b92170, timeout=-1, interruptible=false) at compat/poll.cpp:280
 #2  0x000056408706400a in thread_manage_health (data=0x56408765db50) at health.cpp:140
 #3  0x0000564087062b1a in launch_thread (data=0x56408765dbf0) at thread.cpp:66
 #4  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #5  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 2 (Thread 0x7f74c7394700 (LWP 76036)):
 #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
 #1  0x00007f74c987d238 in futex (uaddr=0x564087659b10, op=0, val=-1, timeout=0x0, uaddr2=0x0, val3=0) at ../include/urcu/futex.h:72
 #2  futex_async (uaddr=0x564087659b10, op=0, val=-1, timeout=0x0, uaddr2=0x0, val3=0) at ../include/urcu/futex.h:104
 #3  futex_wait (futex=0x564087659b10) at workqueue.c:136
 #4  0x00007f74c987ced2 in workqueue_thread (arg=0x564087659ad0) at workqueue.c:237
 #5  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #6  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 1 (Thread 0x7f74c73cd300 (LWP 76034)):
 #0  0x00007f74c933a49e in epoll_wait (epfd=50, events=0x564087666880, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
 #1  0x00005640870eafa6 in compat_epoll_wait (events=0x7ffc1fcbe280, timeout=-1, interruptible=false) at compat/poll.cpp:280
 #2  0x0000564087062244 in sessiond_wait_for_quit_pipe (timeout_ms=-1) at thread-utils.cpp:83
 #3  0x00005640870127dc in main (argc=1, argv=0x7ffc1fcbe668) at main.cpp:1921

Cause
=====

The event_pipe used to notify the notification poll loop is full and the
lttng_write call blocks with the locks for both the client and the
cmd_queue held.

Solution
========

Go back to using eventfd but without the use of EFD_SEMAPHORE (linux
2.6.30) to continue supporting kernel between 2.6.27 and 2.6.29.

The EFD_SEMAPHORE is emulated with a read, decrement, write as explained
by the initial committer of EFD_SEMAPHORE [1].

Known drawbacks
=========

This does not solve the actual block+lock problem but simply push it
back further. The lttng_write on the eventfd can block when reaching
UINT64_MAX. This would represent, at 1 command queued per ns (which is
ridiculous), ~584 years of queueing without a dequeue operation.

Reference
=======
[1] https://lwn.net/Articles/318151/

Signed-off-by: Jonathan Rajotte <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: Ie749c4169708f57463fe3cfab2366f1015bae4e0
jgalar pushed a commit that referenced this pull request Apr 12, 2022
Observed issue
==============

During high throughput event notification generation scenarios the
following deadlock happens:

 Thread 14 (Thread 0x7f74b4ff9700 (LWP 76062)):
 #0  __lll_lock_wait (futex=futex@entry=0x56408765dde8, private=0) at lowlevellock.c:52
 #1  0x00007f74c941a0a3 in __GI___pthread_mutex_lock (mutex=0x56408765dde8) at ../nptl/pthread_mutex_lock.c:80
 #2  0x000056408704b207 in run_command_wait (handle=0x56408765ddd0, cmd=0x7f74b4ff7f70) at notification-thread-commands.cpp:31
 #3  0x000056408704bcef in notification_thread_command_remove_tracer_event_source (handle=0x56408765ddd0, tracer_event_source_fd=54) at notification-thread-commands.cpp:319
 #4  0x000056408708a0c1 in delete_ust_app (app=0x7f749c000bf0) at ust-app.cpp:1059
 #5  0x000056408708a511 in delete_ust_app_rcu (head=0x7f749c000ca0) at ust-app.cpp:1122
 #6  0x00007f74c988b4a7 in call_rcu_thread (arg=0x7f74b8004a80) at ../src/urcu-call-rcu-impl.h:369
 #7  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #8  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 13 (Thread 0x7f74b57fa700 (LWP 76047)):
 #0  0x00007f74c933a49e in epoll_wait (epfd=48, events=0x7f74a4000b60, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
 #1  0x00005640870eafa6 in compat_epoll_wait (events=0x7f74b57f9240, timeout=-1, interruptible=false) at compat/poll.cpp:280
 #2  0x00005640870abb65 in thread_agent_management (data=0x56408765f0b0) at agent-thread.cpp:424
 #3  0x0000564087062b1a in launch_thread (data=0x56408765f150) at thread.cpp:66
 #4  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #5  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 12 (Thread 0x7f74b5ffb700 (LWP 76046)):                                                                                                                                       [630/709]
 #0  0x00007f74c933a49e in epoll_wait (epfd=47, events=0x7f74a0000b60, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
 #1  0x00005640870eafa6 in compat_epoll_wait (events=0x7f74b5ffa170, timeout=-1, interruptible=false) at compat/poll.cpp:280
 #2  0x00005640870a4095 in thread_application_notification (data=0x56408765ee40) at notify-apps.cpp:78
 #3  0x0000564087062b1a in launch_thread (data=0x56408765eed0) at thread.cpp:66
 #4  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #5  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 11 (Thread 0x7f74b67fc700 (LWP 76045)):
 #0  0x00007f74c933a49e in epoll_wait (epfd=44, events=0x7f74ac000b60, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
 #1  0x00005640870eafa6 in compat_epoll_wait (events=0x7f74b67fb170, timeout=-1, interruptible=false) at compat/poll.cpp:280
 #2  0x00005640870723db in thread_application_management (data=0x56408765ebd0) at manage-apps.cpp:93
 #3  0x0000564087062b1a in launch_thread (data=0x56408765ec60) at thread.cpp:66
 #4  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #5  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 10 (Thread 0x7f74b6ffd700 (LWP 76044)):
 #0  0x00007f74c933a49e in epoll_wait (epfd=39, events=0x7f74a8000b60, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
 #1  0x00005640870eafa6 in compat_epoll_wait (events=0x7f74b6ffc130, timeout=-1, interruptible=false) at compat/poll.cpp:280
 #2  0x0000564087070a27 in thread_application_registration (data=0x56408765e940) at register.cpp:214
 #3  0x0000564087062b1a in launch_thread (data=0x56408765e9f0) at thread.cpp:66
 #4  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #5  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 9 (Thread 0x7f74b77fe700 (LWP 76043)):                                                                                                                                        [654/709]
 #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
 #1  0x00005640870c8e25 in futex (uaddr=0x5640871e2800 <ust_cmd_queue>, op=0, val=-1, timeout=0x0, uaddr2=0x0, val3=0) at /home/joraj/lttng/master/install/include/urcu/futex.h:72
 #2  0x00005640870c8e6d in futex_async (uaddr=0x5640871e2800 <ust_cmd_queue>, op=0, val=-1, timeout=0x0, uaddr2=0x0, val3=0) at /home/joraj/lttng/master/install/include/urcu/futex.h:104
 #3  0x00005640870c939a in futex_nto1_wait (futex=0x5640871e2800 <ust_cmd_queue>) at futex.cpp:77
 #4  0x000056408706f2af in thread_dispatch_ust_registration (data=0x56408765e740) at dispatch.cpp:453
 #5  0x0000564087062b1a in launch_thread (data=0x56408765e760) at thread.cpp:66
 #6  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #7  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 8 (Thread 0x7f74b7fff700 (LWP 76042)):
 #0  0x00007f74c933a49e in epoll_wait (epfd=33, events=0x7f74b0000b60, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
 #1  0x00005640870eafa6 in compat_epoll_wait (events=0x7f74b7ffad40, timeout=-1, interruptible=false) at compat/poll.cpp:280
 #2  0x000056408706c424 in thread_manage_clients (data=0x56408765e4f0) at client.cpp:2528
 #3  0x0000564087062b1a in launch_thread (data=0x56408765e560) at thread.cpp:66
 #4  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #5  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 7 (Thread 0x7f74c4b8f700 (LWP 76041)):                                                                                                                                        [672/709]
 #0  0x00007f74c933a49e in epoll_wait (epfd=31, events=0x7f74bc000b60, maxevents=3, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
 #1  0x00005640870eafa6 in compat_epoll_wait (events=0x7f74c4b8e240, timeout=-1, interruptible=false) at compat/poll.cpp:280
 #2  0x000056408705f2b6 in thread_rotation (data=0x56408765e280) at rotation-thread.cpp:804
 #3  0x0000564087062b1a in launch_thread (data=0x56408765e310) at thread.cpp:66
 #4  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #5  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 6 (Thread 0x7f74c5390700 (LWP 76040)):
 #0  0x00007f74c925f1d2 in __GI___sigtimedwait (set=0x7f74c538f090, info=0x7f74c538f110, timeout=0x0) at ../sysdeps/unix/sysv/linux/sigtimedwait.c:29
 #1  0x000056408706138a in thread_timer (data=0x7ffc1fcbe3f0) at timer.cpp:359
 #2  0x0000564087062b1a in launch_thread (data=0x56408765e0a0) at thread.cpp:66
 #3  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #4  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 5 (Thread 0x7f74c5b91700 (LWP 76039)):
 #0  __libc_write (nbytes=8, buf=0x7f74c5b8fc88, fd=24) at ../sysdeps/unix/sysv/linux/write.c:26
 #1  __libc_write (fd=24, buf=0x7f74c5b8fc88, nbytes=8) at ../sysdeps/unix/sysv/linux/write.c:24
 #2  0x00005640870eeb4f in lttng_write (fd=24, buf=0x7f74c5b8fc88, count=8) at readwrite.cpp:77
 #3  0x000056408704b535 in run_command_no_wait (handle=0x56408765ddd0, in_cmd=0x7f74c5b8fdf0) at notification-thread-commands.cpp:92
 #4  0x000056408704bf49 in notification_thread_client_communication_update (handle=0x56408765ddd0, id=2, transmission_status=CLIENT_TRANSMISSION_STATUS_QUEUED) at notification-thread-command
 #5  0x000056408707bc62 in client_handle_transmission_status (client=0x7f74b80050d0, status=CLIENT_TRANSMISSION_STATUS_QUEUED, user_data=0x7f74b8004410) at action-executor.cpp:258
 #6  0x0000564087057525 in notification_client_list_send_evaluation (client_list=0x7f74b8004df0, trigger=0x7f74b0001030, evaluation=0x7f74b815d1d0, source_object_creds=0x0, client_report=0x5
 #7  0x000056408707bce9 in action_executor_notify_handler (executor=0x7f74b8004410, work_item=0x7f74b815d430, item=0x7f74b80e48e0) at action-executor.cpp:269
 #8  0x000056408707dd6d in action_executor_generic_handler (executor=0x7f74b8004410, work_item=0x7f74b815d430, item=0x7f74b80e48e0) at action-executor.cpp:670
 #9  0x000056408707df01 in action_work_item_execute (executor=0x7f74b8004410, work_item=0x7f74b815d430) at action-executor.cpp:689

 #10 0x000056408707e525 in action_executor_thread (_data=0x7f74b8004410) at action-executor.cpp:771                                                                                   [698/709]
 #11 0x0000564087062b1a in launch_thread (data=0x7f74b80044b0) at thread.cpp:66
 #12 0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #13 0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 4 (Thread 0x7f74c6392700 (LWP 76038)):
 #0  __lll_lock_wait (futex=futex@entry=0x56408765dde8, private=0) at lowlevellock.c:52
 #1  0x00007f74c941a0a3 in __GI___pthread_mutex_lock (mutex=0x56408765dde8) at ../nptl/pthread_mutex_lock.c:80
 #2  0x0000564087053c89 in handle_notification_thread_command (handle=0x56408765ddd0, state=0x7f74c63911b0) at notification-thread-events.cpp:3142
 #3  0x000056408704ac81 in thread_notification (data=0x56408765ddd0) at notification-thread.cpp:715
 #4  0x0000564087062b1a in launch_thread (data=0x56408765dec0) at thread.cpp:66
 #5  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #run_command_no_wait6  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 3 (Thread 0x7f74c6b93700 (LWP 76037)):
 #0  0x00007f74c933a49e in epoll_wait (epfd=21, events=0x7f74c0000b60, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
 #1  0x00005640870eafa6 in compat_epoll_wait (events=0x7f74c6b92170, timeout=-1, interruptible=false) at compat/poll.cpp:280
 #2  0x000056408706400a in thread_manage_health (data=0x56408765db50) at health.cpp:140
 #3  0x0000564087062b1a in launch_thread (data=0x56408765dbf0) at thread.cpp:66
 #4  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #5  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 2 (Thread 0x7f74c7394700 (LWP 76036)):
 #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
 #1  0x00007f74c987d238 in futex (uaddr=0x564087659b10, op=0, val=-1, timeout=0x0, uaddr2=0x0, val3=0) at ../include/urcu/futex.h:72
 #2  futex_async (uaddr=0x564087659b10, op=0, val=-1, timeout=0x0, uaddr2=0x0, val3=0) at ../include/urcu/futex.h:104
 #3  futex_wait (futex=0x564087659b10) at workqueue.c:136
 #4  0x00007f74c987ced2 in workqueue_thread (arg=0x564087659ad0) at workqueue.c:237
 #5  0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #6  0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 1 (Thread 0x7f74c73cd300 (LWP 76034)):
 #0  0x00007f74c933a49e in epoll_wait (epfd=50, events=0x564087666880, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
 #1  0x00005640870eafa6 in compat_epoll_wait (events=0x7ffc1fcbe280, timeout=-1, interruptible=false) at compat/poll.cpp:280
 #2  0x0000564087062244 in sessiond_wait_for_quit_pipe (timeout_ms=-1) at thread-utils.cpp:83
 #3  0x00005640870127dc in main (argc=1, argv=0x7ffc1fcbe668) at main.cpp:1921

Cause
=====

The event_pipe used to notify the notification poll loop is full and the
lttng_write call blocks with the locks for both the client and the
cmd_queue held.

Solution
========

Go back to using eventfd but without the use of EFD_SEMAPHORE (linux
2.6.30) to continue supporting kernel between 2.6.27 and 2.6.29.

The EFD_SEMAPHORE is emulated with a read, decrement, write as explained
by the initial committer of EFD_SEMAPHORE [1].

Known drawbacks
=========

This does not solve the actual block+lock problem but simply push it
back further. The lttng_write on the eventfd can block when reaching
UINT64_MAX. This would represent, at 1 command queued per ns (which is
ridiculous), ~584 years of queueing without a dequeue operation.

Reference
=======
[1] https://lwn.net/Articles/318151/

Signed-off-by: Jonathan Rajotte <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: Ie749c4169708f57463fe3cfab2366f1015bae4e0
jgalar added a commit that referenced this pull request Apr 25, 2022
LeakSanitizer reports the following leak:

==974957==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 32 byte(s) in 1 object(s) allocated from:
    #0 0x7fdb86fcd1b2 in __interceptor_realloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:164
    #1 0x7fdb86d7c296 in lttng_dynamic_buffer_set_capacity(lttng_dynamic_buffer*, unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/common/dynamic-buffer.cpp:159
    #2 0x7fdb86d7c060 in lttng_dynamic_buffer_set_size(lttng_dynamic_buffer*, unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/common/dynamic-buffer.cpp:112
    #3 0x7fdb86d2589a in recv_payload_sessiond /home/jgalar/EfficiOS/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.cpp:230
    #4 0x7fdb86d26fa5 in lttng_ctl_ask_sessiond_payload(lttng_payload_view*, lttng_payload*) /home/jgalar/EfficiOS/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.cpp:662
    #5 0x7fdb86d2cd8d in lttng_list_tracepoint_fields /home/jgalar/EfficiOS/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.cpp:1767
    #6 0x56481623cb4c in list_ust_event_fields commands/list.cpp:850
    #7 0x5648162448d9 in cmd_list(int, char const**) commands/list.cpp:2394
    #8 0x56481628fb3e in handle_command /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng/lttng.cpp:238
    #9 0x564816290601 in parse_args /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng/lttng.cpp:427
    #10 0x564816290908 in main /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng/lttng.cpp:476
    #11 0x7fdb8661730f in __libc_start_call_main (/usr/lib/libc.so.6+0x2d30f)

SUMMARY: AddressSanitizer: 32 byte(s) leaked in 1 allocation(s).

The session daemon's reply is indeed never released in
lttng_list_tracepoint_fields.

Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: Idd244b52a69f3b74e5c131c1c36c6ee6d76f4285
jgalar added a commit that referenced this pull request Apr 25, 2022
==1429021==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 8 byte(s) in 1 object(s) allocated from:
    #0 0x7fe305f031b2 in __interceptor_realloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:164
    #1 0x559f1b022238 in lttng_dynamic_buffer_set_capacity(lttng_dynamic_buffer*, unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/common/dynamic-buffer.cpp:159
    #2 0x559f1b021d9f in lttng_dynamic_buffer_append(lttng_dynamic_buffer*, void const*, unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/common/dynamic-buffer.cpp:52
    #3 0x559f1b02144a in lttng_dynamic_array_add_element(lttng_dynamic_array*, void const*) /home/jgalar/EfficiOS/src/lttng-tools/src/common/dynamic-array.cpp:58
    #4 0x559f1b07d07b in lttng_action_path_copy(lttng_action_path const*, lttng_action_path*) actions/path.cpp:116
    #5 0x559f1b02383f in lttng_error_query_action_create /home/jgalar/EfficiOS/src/lttng-tools/src/common/error-query.cpp:232
    #6 0x559f1b02760e in lttng_error_query_create_from_payload(lttng_payload_view*, lttng_error_query**) /home/jgalar/EfficiOS/src/lttng-tools/src/common/error-query.cpp:911
    #7 0x559f1af5c361 in receive_lttng_error_query /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:740
    #8 0x559f1af64eba in process_client_msg /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2336
    #9 0x559f1af67378 in thread_manage_clients /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2624
    #10 0x559f1af50642 in launch_thread /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/thread.cpp:68
    #11 0x7fe3055225c1 in start_thread (/usr/lib/libc.so.6+0x8d5c1)

Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: I7a6f7d2a9746124581eebf30877466f16db67a6b
jgalar added a commit that referenced this pull request Apr 25, 2022
==1501334==ERROR: LeakSanitizer: detected memory leaks

Indirect leak of 16386 byte(s) in 1 object(s) allocated from:
    #0 0x7f95efc3cdd9 in __interceptor_malloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:145
    #1 0x55acb0681ed3 in lttng_filter_yyalloc(unsigned long, void*) filter/filter-lexer.cpp:2511
    #2 0x55acb067f2f2 in lttng_filter_yy_create_buffer(_IO_FILE*, int, void*) filter/filter-lexer.cpp:1895
    #3 0x55acb067ea44 in yyrestart(_IO_FILE*, void*) filter/filter-lexer.cpp:1824
    #4 0x55acb0649a43 in filter_parser_ctx_alloc(_IO_FILE*) filter/filter-parser.ypp:271
    #5 0x55acb0649e7f in filter_parser_ctx_create_from_filter_expression(char const*, filter_parser_ctx**) filter/filter-parser.ypp:332
    #6 0x55acb058ee89 in parse_event_rule commands/add_trigger.cpp:783
    #7 0x55acb05920c0 in handle_condition_event commands/add_trigger.cpp:1361
    #8 0x55acb0592739 in parse_condition commands/add_trigger.cpp:1457
    #9 0x55acb0596b56 in cmd_add_trigger(int, char const**) commands/add_trigger.cpp:2304
    #10 0x55acb05a5b80 in handle_command /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng/lttng.cpp:238
    #11 0x55acb05a6643 in parse_args /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng/lttng.cpp:427
    #12 0x55acb05a694a in main /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng/lttng.cpp:476
    #13 0x7f95ef28730f in __libc_start_call_main (/usr/lib/libc.so.6+0x2d30f)

Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: I6fa21e7d066e0cf48afc3f91ceefbfd19c6b86fd
jgalar added a commit that referenced this pull request Jun 15, 2022
LeakSanitizer reports the following leak:

==974957==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 32 byte(s) in 1 object(s) allocated from:
    #0 0x7fdb86fcd1b2 in __interceptor_realloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:164
    #1 0x7fdb86d7c296 in lttng_dynamic_buffer_set_capacity(lttng_dynamic_buffer*, unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/common/dynamic-buffer.cpp:159
    #2 0x7fdb86d7c060 in lttng_dynamic_buffer_set_size(lttng_dynamic_buffer*, unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/common/dynamic-buffer.cpp:112
    #3 0x7fdb86d2589a in recv_payload_sessiond /home/jgalar/EfficiOS/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.cpp:230
    #4 0x7fdb86d26fa5 in lttng_ctl_ask_sessiond_payload(lttng_payload_view*, lttng_payload*) /home/jgalar/EfficiOS/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.cpp:662
    #5 0x7fdb86d2cd8d in lttng_list_tracepoint_fields /home/jgalar/EfficiOS/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.cpp:1767
    #6 0x56481623cb4c in list_ust_event_fields commands/list.cpp:850
    #7 0x5648162448d9 in cmd_list(int, char const**) commands/list.cpp:2394
    #8 0x56481628fb3e in handle_command /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng/lttng.cpp:238
    #9 0x564816290601 in parse_args /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng/lttng.cpp:427
    #10 0x564816290908 in main /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng/lttng.cpp:476
    #11 0x7fdb8661730f in __libc_start_call_main (/usr/lib/libc.so.6+0x2d30f)

SUMMARY: AddressSanitizer: 32 byte(s) leaked in 1 allocation(s).

The session daemon's reply is indeed never released in
lttng_list_tracepoint_fields.

Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: Idd244b52a69f3b74e5c131c1c36c6ee6d76f4285
jgalar added a commit that referenced this pull request Jun 15, 2022
==1429021==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 8 byte(s) in 1 object(s) allocated from:
    #0 0x7fe305f031b2 in __interceptor_realloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:164
    #1 0x559f1b022238 in lttng_dynamic_buffer_set_capacity(lttng_dynamic_buffer*, unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/common/dynamic-buffer.cpp:159
    #2 0x559f1b021d9f in lttng_dynamic_buffer_append(lttng_dynamic_buffer*, void const*, unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/common/dynamic-buffer.cpp:52
    #3 0x559f1b02144a in lttng_dynamic_array_add_element(lttng_dynamic_array*, void const*) /home/jgalar/EfficiOS/src/lttng-tools/src/common/dynamic-array.cpp:58
    #4 0x559f1b07d07b in lttng_action_path_copy(lttng_action_path const*, lttng_action_path*) actions/path.cpp:116
    #5 0x559f1b02383f in lttng_error_query_action_create /home/jgalar/EfficiOS/src/lttng-tools/src/common/error-query.cpp:232
    #6 0x559f1b02760e in lttng_error_query_create_from_payload(lttng_payload_view*, lttng_error_query**) /home/jgalar/EfficiOS/src/lttng-tools/src/common/error-query.cpp:911
    #7 0x559f1af5c361 in receive_lttng_error_query /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:740
    #8 0x559f1af64eba in process_client_msg /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2336
    #9 0x559f1af67378 in thread_manage_clients /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2624
    #10 0x559f1af50642 in launch_thread /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/thread.cpp:68
    #11 0x7fe3055225c1 in start_thread (/usr/lib/libc.so.6+0x8d5c1)

Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: I7a6f7d2a9746124581eebf30877466f16db67a6b
jgalar added a commit that referenced this pull request Jun 15, 2022
==1501334==ERROR: LeakSanitizer: detected memory leaks

Indirect leak of 16386 byte(s) in 1 object(s) allocated from:
    #0 0x7f95efc3cdd9 in __interceptor_malloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:145
    #1 0x55acb0681ed3 in lttng_filter_yyalloc(unsigned long, void*) filter/filter-lexer.cpp:2511
    #2 0x55acb067f2f2 in lttng_filter_yy_create_buffer(_IO_FILE*, int, void*) filter/filter-lexer.cpp:1895
    #3 0x55acb067ea44 in yyrestart(_IO_FILE*, void*) filter/filter-lexer.cpp:1824
    #4 0x55acb0649a43 in filter_parser_ctx_alloc(_IO_FILE*) filter/filter-parser.ypp:271
    #5 0x55acb0649e7f in filter_parser_ctx_create_from_filter_expression(char const*, filter_parser_ctx**) filter/filter-parser.ypp:332
    #6 0x55acb058ee89 in parse_event_rule commands/add_trigger.cpp:783
    #7 0x55acb05920c0 in handle_condition_event commands/add_trigger.cpp:1361
    #8 0x55acb0592739 in parse_condition commands/add_trigger.cpp:1457
    #9 0x55acb0596b56 in cmd_add_trigger(int, char const**) commands/add_trigger.cpp:2304
    #10 0x55acb05a5b80 in handle_command /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng/lttng.cpp:238
    #11 0x55acb05a6643 in parse_args /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng/lttng.cpp:427
    #12 0x55acb05a694a in main /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng/lttng.cpp:476
    #13 0x7f95ef28730f in __libc_start_call_main (/usr/lib/libc.so.6+0x2d30f)

Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: I6fa21e7d066e0cf48afc3f91ceefbfd19c6b86fd
jgalar pushed a commit that referenced this pull request Nov 14, 2022
When converting msgpack objects to their event_field_value equivalent,
the following assertion fails: LTTNG_ASSERT(val);

 #4  0x00007f1f65349486 in __assert_fail () from /usr/lib/libc.so.6
 #5  0x00007f1f65584da7 in lttng_event_field_value_string_create_with_size (val=0x0, size=0) at event-field-value.cpp:186
 #6  0x00007f1f65576a1a in event_field_value_from_obj (obj=0x557f597ccdb8, field_val=0x7ffcc9675dd0)
     at conditions/event-rule-matches.cpp:1120
 #7  0x00007f1f65577176 in event_field_value_from_capture_payload (condition=0x557f597c8520,
     capture_payload=0x557f597c825b "\221\240", capture_payload_size=2) at conditions/event-rule-matches.cpp:1340
 #8  0x00007f1f655772ea in lttng_evaluation_event_rule_matches_create (condition=0x557f597c8520,
     capture_payload=0x557f597c825b "\221\240", capture_payload_size=2, decode_capture_payload=true)
     at conditions/event-rule-matches.cpp:1398
 #9  0x00007f1f655765fc in lttng_evaluation_event_rule_matches_create_from_payload (condition=0x557f597c8520,
     view=0x7ffcc9675ff0, _evaluation=0x7ffcc9676080) at conditions/event-rule-matches.cpp:990
 #10 0x00007f1f6557f273 in lttng_evaluation_create_from_payload (condition=0x557f597c8520, src_view=0x7ffcc9676100,
     evaluation=0x7ffcc9676080) at evaluation.cpp:120
 #11 0x00007f1f6559ba36 in lttng_notification_create_from_payload (src_view=0x7ffcc9676190, notification=0x7ffcc9676180)
     at notification.cpp:123
 #12 0x00007f1f65552577 in create_notification_from_current_message (channel=0x557f597c8ee0) at channel.cpp:124
 #13 0x00007f1f6555298c in lttng_notification_channel_get_next_notification (channel=0x557f597c8ee0, _notification=0x7ffcc9676280)
     at channel.cpp:292

The msgpack API represents string as p-style while the implementation of
event_field_value relies on null-terminated strings. When an empty
string is captured by a tracer, it is decoded as a msgpack_object with
`str = {size = 0, ptr = 0x0}`.

lttng_event_field_value_string_create_with_size does not require a
null-terminated string since it also receives the length. Hence, this
fix causes lttng_event_field_value_string_create_with_size to accept
null strings when their length is zero. A copy of an empty string is
made to accomodate the null-termination convention used by the rest of
that API.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: I7c3a839dbbeeb95a1b3bf6ddc3205a2f6b4538e3
jgalar pushed a commit that referenced this pull request Jan 6, 2023
When converting msgpack objects to their event_field_value equivalent,
the following assertion fails: LTTNG_ASSERT(val);

 #4  0x00007f1f65349486 in __assert_fail () from /usr/lib/libc.so.6
 #5  0x00007f1f65584da7 in lttng_event_field_value_string_create_with_size (val=0x0, size=0) at event-field-value.cpp:186
 #6  0x00007f1f65576a1a in event_field_value_from_obj (obj=0x557f597ccdb8, field_val=0x7ffcc9675dd0)
     at conditions/event-rule-matches.cpp:1120
 #7  0x00007f1f65577176 in event_field_value_from_capture_payload (condition=0x557f597c8520,
     capture_payload=0x557f597c825b "\221\240", capture_payload_size=2) at conditions/event-rule-matches.cpp:1340
 #8  0x00007f1f655772ea in lttng_evaluation_event_rule_matches_create (condition=0x557f597c8520,
     capture_payload=0x557f597c825b "\221\240", capture_payload_size=2, decode_capture_payload=true)
     at conditions/event-rule-matches.cpp:1398
 #9  0x00007f1f655765fc in lttng_evaluation_event_rule_matches_create_from_payload (condition=0x557f597c8520,
     view=0x7ffcc9675ff0, _evaluation=0x7ffcc9676080) at conditions/event-rule-matches.cpp:990
 #10 0x00007f1f6557f273 in lttng_evaluation_create_from_payload (condition=0x557f597c8520, src_view=0x7ffcc9676100,
     evaluation=0x7ffcc9676080) at evaluation.cpp:120
 #11 0x00007f1f6559ba36 in lttng_notification_create_from_payload (src_view=0x7ffcc9676190, notification=0x7ffcc9676180)
     at notification.cpp:123
 #12 0x00007f1f65552577 in create_notification_from_current_message (channel=0x557f597c8ee0) at channel.cpp:124
 #13 0x00007f1f6555298c in lttng_notification_channel_get_next_notification (channel=0x557f597c8ee0, _notification=0x7ffcc9676280)
     at channel.cpp:292

The msgpack API represents string as p-style while the implementation of
event_field_value relies on null-terminated strings. When an empty
string is captured by a tracer, it is decoded as a msgpack_object with
`str = {size = 0, ptr = 0x0}`.

lttng_event_field_value_string_create_with_size does not require a
null-terminated string since it also receives the length. Hence, this
fix causes lttng_event_field_value_string_create_with_size to accept
null strings when their length is zero. A copy of an empty string is
made to accomodate the null-termination convention used by the rest of
that API.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Jérémie Galarneau <[email protected]>
Change-Id: I7c3a839dbbeeb95a1b3bf6ddc3205a2f6b4538e3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants