Remove tctracer (ip options) #1366

Merged
grcevski merged 10 commits into open-telemetry:main from rafaelroquetto:sock_dir
Feb 26, 2026

Contributor

@rafaelroquetto rafaelroquetto commented Feb 26, 2026

This PR is best reviewed commit by commit.

  • Remove the tctracer component (L4 context propagation via IP options injected into IPv4/IPv6 headers using TC egress/ingress BPF programs)
  • Add a BPF iter/tcp socket iterator to tpinjector that pre-populates the sock_dir sockmap at startup using socket cookies as keys, with full IPv4/IPv6 address/port logging per tracked socket
  • Deduplicate runIterator across tpinjector and generictracer into a shared (*Iter).Run method in pkg/ebpf/common

This ensures sockets established before tpinjector attaches are tracked in sock_dir and are visible to the sk_msg program.
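A shared iterator helper like the one described in the third bullet could look roughly like the sketch below. It relies on the fact that a BPF iterator program runs when its file descriptor is read, so draining the reader to EOF is what walks every existing TCP socket. The names `iterOpener`, `runIter`, and `fakeIter` are hypothetical and exist only for illustration; the real `(*Iter).Run` in `pkg/ebpf/common` is not shown in this conversation and may differ.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// iterOpener is a hypothetical abstraction over an attached BPF iterator.
// Assumption: the real code wraps something with an Open method of this
// shape (cilium/ebpf's *link.Iter exposes one, to my knowledge).
type iterOpener interface {
	Open() (io.ReadCloser, error)
}

// runIter opens the iterator and reads it to EOF, discarding the output.
// Reading is what makes the kernel invoke the iterator program once per
// object (here: once per TCP socket). It returns the number of bytes read.
func runIter(it iterOpener) (int64, error) {
	if it == nil {
		return 0, errors.New("nil iterator")
	}
	r, err := it.Open()
	if err != nil {
		return 0, fmt.Errorf("opening iterator: %w", err)
	}
	defer r.Close()

	n, err := io.Copy(io.Discard, r)
	if err != nil {
		return n, fmt.Errorf("draining iterator: %w", err)
	}
	return n, nil
}

// fakeIter stands in for a real iterator so the sketch runs without BPF.
type fakeIter struct{ payload string }

func (f fakeIter) Open() (io.ReadCloser, error) {
	return io.NopCloser(strings.NewReader(f.payload)), nil
}

func main() {
	n, err := runIter(fakeIter{payload: "sock1\nsock2\n"})
	fmt.Println(n, err)
}
```

With a helper of this shape, both tpinjector and generictracer could call the same function instead of each keeping its own copy of the drain loop.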

Kernel bug affecting versions < 6.4: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9378096e8a65
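As a minimal sketch of how a gate for the kernel bug above might be applied, the snippet below parses a release string and uses the same `major`/`minor` comparison that appears in the diff under review. `parseRelease` and `affectedByIterBug` are hypothetical helpers, not the project's `ebpfcommon.KernelVersion()`, and what the real code does when the check is true (presumably skipping the socket iterator) is an assumption based on the review discussion.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseRelease extracts major and minor from a release string such as
// "5.15.152" or "6.10.6". Hypothetical helper for illustration only.
func parseRelease(release string) (major, minor int, err error) {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("unexpected kernel release %q", release)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, fmt.Errorf("parsing major version: %w", err)
	}
	// Keep only the leading digits of the minor field (e.g. "10-rc1" -> "10").
	digits := parts[1]
	for i, c := range digits {
		if c < '0' || c > '9' {
			digits = digits[:i]
			break
		}
	}
	if minor, err = strconv.Atoi(digits); err != nil {
		return 0, 0, fmt.Errorf("parsing minor version: %w", err)
	}
	return major, minor, nil
}

// affectedByIterBug mirrors the comparison shown in the diff: kernels
// older than 6.4 are treated as affected.
func affectedByIterBug(major, minor int) bool {
	return major < 6 || (major == 6 && minor < 4)
}

func main() {
	// The two kernel versions that appear in the CI flags of this PR.
	for _, rel := range []string{"5.15.152", "6.10.6"} {
		major, minor, err := parseRelease(rel)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s affected=%t\n", rel, affectedByIterBug(major, minor))
	}
}
```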


@rafaelroquetto rafaelroquetto requested a review from a team as a code owner February 26, 2026 00:23

codecov Bot commented Feb 26, 2026

Codecov Report

❌ Patch coverage is 23.40426% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.74%. Comparing base (4793b9b) to head (effb500).
⚠️ Report is 6 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/internal/ebpf/tpinjector/tpinjector.go | 0.00% | 19 Missing ⚠️ |
| pkg/ebpf/common/common.go | 42.85% | 9 Missing and 3 partials ⚠️ |
| pkg/config/ebpf_tracer.go | 33.33% | 4 Missing ⚠️ |
| pkg/internal/ebpf/generictracer/generictracer.go | 0.00% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1366      +/-   ##
==========================================
- Coverage   43.76%   43.74%   -0.02%     
==========================================
  Files         308      310       +2     
  Lines       33495    33772     +277     
==========================================
+ Hits        14658    14775     +117     
- Misses      17894    18047     +153     
- Partials      943      950       +7     
| Flag | Coverage Δ |
|---|---|
| integration-test | 21.56% <25.71%> (-0.12%) ⬇️ |
| integration-test-arm | 0.00% <0.00%> (ø) |
| integration-test-vm-x86_64-5.15.152 | 0.00% <0.00%> (ø) |
| integration-test-vm-x86_64-6.10.6 | 0.00% <0.00%> (ø) |
| k8s-integration-test | 2.32% <0.00%> (-0.01%) ⬇️ |
| oats-test | 0.00% <0.00%> (ø) |
| unittests | 44.63% <5.71%> (+0.03%) ⬆️ |


major, minor := ebpfcommon.KernelVersion()

if major < 6 || (major == 6 && minor < 4) {
Contributor


If I understand correctly, this shouldn't change anything for default configurations, since they would not load tctracer anyway. But in case they did, existing connections are now penalized on old kernels, correct?

Contributor Author


Correct: it means tpinjector does not track existing connections on those kernels. In theory that was already the case whenever tctracer was not enabled, and in practice tracking them never worked, as the code was not working anyway. So this is an improvement for newer kernel versions, which are now able to track existing connections.

Contributor


Contributor Author


So, that's a different one. I hit it locally as well when experimenting with a different approach. Basically, it is not safe to update sockmap/sockhash instances in the network ingress/egress code path, because it can lock up. My computer froze with a similar stack:

Feb 25 20:32:59 crux kernel: Call Trace:
Feb 25 20:32:59 crux kernel:  <IRQ>
Feb 25 20:32:59 crux kernel:  _raw_spin_lock+0x29/0x30
Feb 25 20:32:59 crux kernel:  sock_map_update_elem+0x6f/0x110
Feb 25 20:32:59 crux kernel:  bpf_prog_88f5125eaeff2c01_obi_app_egress+0x1be/0x3143
Feb 25 20:32:59 crux kernel:  ? __ieee80211_subif_start_xmit+0x309/0x3f0 [mac80211 23eb58a1e4cc8c5d279ec699c4ed327d9592eb0d]
Feb 25 20:32:59 crux kernel:  ? __ieee80211_subif_start_xmit+0x207/0x3f0 [mac80211 23eb58a1e4cc8c5d279ec699c4ed327d9592eb0d]
Feb 25 20:32:59 crux kernel:  ? nf_conntrack_tcp_packet+0x9ec/0x17b0 [nf_conntrack ae3ae3f014ce84e5ce96e4960e8682ba63b74bf5]
Feb 25 20:32:59 crux kernel:  __dev_queue_xmit+0x660/0xee0
Feb 25 20:32:59 crux kernel:  ip_finish_output2+0x2b3/0x630
Feb 25 20:32:59 crux kernel:  ? __ip_finish_output+0x47/0x180
Feb 25 20:32:59 crux kernel:  ip_output+0x63/0x110
Feb 25 20:32:59 crux kernel:  ? __pfx_ip_finish_output+0x10/0x10
Feb 25 20:32:59 crux kernel:  __ip_queue_xmit+0x369/0x4f0
Feb 25 20:32:59 crux kernel:  __tcp_transmit_skb+0xa7b/0xe70
Feb 25 20:32:59 crux kernel:  tcp_rcv_established+0xa25/0xc20
Feb 25 20:32:59 crux kernel:  tcp_v4_do_rcv+0x1e7/0x380
Feb 25 20:32:59 crux kernel:  tcp_v4_rcv+0xe45/0x1560
Feb 25 20:32:59 crux kernel:  ? nf_nat_ipv4_local_in+0x58/0x160 [nf_nat f032bf00c70370b798d781e7818997ca2ce355c4]
Feb 25 20:32:59 crux kernel:  ? raw_local_deliver+0xd0/0x2b0
Feb 25 20:32:59 crux kernel:  ip_protocol_deliver_rcu+0x2c/0x170
Feb 25 20:32:59 crux kernel:  ip_local_deliver_finish+0x85/0x100
Feb 25 20:32:59 crux kernel:  ip_sublist_rcv+0x2c9/0x370
Feb 25 20:32:59 crux kernel:  ? __pfx_ip_rcv_finish+0x10/0x10
Feb 25 20:32:59 crux kernel:  ip_list_rcv+0x138/0x170
Feb 25 20:32:59 crux kernel:  __netif_receive_skb_list_core+0x2a3/0x2d0
Feb 25 20:32:59 crux kernel:  netif_receive_skb_list_internal+0x1d5/0x310
Feb 25 20:32:59 crux kernel:  napi_complete_done+0x80/0x1b0

This very PR should fix it. The lockup on the map in the 5.15 iterator program is a different bug involving an RCU deadlock, which is what the kernel patch linked in the PR description solves.

If you want to get your PR going, either merge this one and rebase on top of it, or edit tctracer.c and remove track_sock and related code.

Comment thread bpf/tpinjector/sock_iter.c Outdated
Comment thread bpf/tpinjector/sock_iter.c
Contributor

@mmat11 mmat11 left a comment


🚀

Contributor

@grcevski grcevski left a comment


LGTM! Just one minor comment on guarding the debug code with if(debug..)

Comment thread bpf/tpinjector/sock_iter.c
@grcevski grcevski merged commit 56db240 into open-telemetry:main Feb 26, 2026
75 of 76 checks passed
@rafaelroquetto rafaelroquetto deleted the sock_dir branch February 26, 2026 21:41
NimrodAvni78 pushed a commit to coralogix/opentelemetry-ebpf-instrumentation that referenced this pull request Mar 1, 2026
@MrAlias MrAlias added this to the v0.6.0 milestone Mar 2, 2026
@MrAlias MrAlias mentioned this pull request Mar 5, 2026