Skip to content

feat(go jsonrpc): fix jsonrpc flaky test#298

Closed
titaneric wants to merge 5 commits into
open-telemetry:mainfrom
titaneric:fix/jsonrpc-flaky-test
Closed

feat(go jsonrpc): fix jsonrpc flaky test#298
titaneric wants to merge 5 commits into
open-telemetry:mainfrom
titaneric:fix/jsonrpc-flaky-test

Conversation

@titaneric
Copy link
Copy Markdown
Contributor

@titaneric titaneric commented Jul 12, 2025

This PR intends to fix #273, which is a JSON-RPC flaky test introduced in #161.

@codecov
Copy link
Copy Markdown

codecov Bot commented Jul 12, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.75%. Comparing base (5c7c535) to head (5eb6851).
Report is 12 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #298      +/-   ##
==========================================
- Coverage   76.85%   76.75%   -0.11%     
==========================================
  Files         181      181              
  Lines       19884    19884              
==========================================
- Hits        15282    15261      -21     
- Misses       3810     3829      +19     
- Partials      792      794       +2     
Flag Coverage Δ
integration-test 56.48% <ø> (-0.56%) ⬇️
integration-test-arm 35.32% <ø> (-0.02%) ⬇️
k8s-integration-test 51.71% <ø> (+0.10%) ⬆️
oats-test 35.45% <ø> (+0.01%) ⬆️
unittests 47.07% <ø> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@grcevski grcevski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like fix works! Anything else we need to change here of we should start the review process?

@titaneric
Copy link
Copy Markdown
Contributor Author

Sorry, I think the flaky test still exist, and I have added some comment in #273 .
I think it's hard to re-produce, and to be honest, I am not sure what's root cause.

@titaneric titaneric force-pushed the fix/jsonrpc-flaky-test branch from 5fb3e0b to 84af67d Compare July 15, 2025 17:34
@titaneric
Copy link
Copy Markdown
Contributor Author

titaneric commented Jul 15, 2025

I think the missing log may come from ring buffer size limit of bpf/logger/bpf_dbg.h.

@titaneric
Copy link
Copy Markdown
Contributor Author

I think I find the crucial log content type is not json, skipping inside this debug log and reported log, which cause the jsonrpc method failed to extract from request. I would take a dig dive into the log and try to resolve it in the tomorrow.

@rafaelroquetto rafaelroquetto force-pushed the fix/jsonrpc-flaky-test branch from 84af67d to b773679 Compare July 16, 2025 20:15
@rafaelroquetto
Copy link
Copy Markdown
Contributor

@titaneric I've rebased your branch onto main - I believe one of the tests were failing because of the changes introduced by #276 - in particular, this one: https://github.com/open-telemetry/opentelemetry-ebpf-instrumentation/pull/276/files#diff-9840da89403837f076f6f383e9d645312ed48c02732bda31a1c9abcfcefb8133

hopefully that will do the trick (famous last words)

@rafaelroquetto
Copy link
Copy Markdown
Contributor

Ignore the failing OATs tests, it's not related to your changes and we are looking into it.

@rafaelroquetto rafaelroquetto force-pushed the fix/jsonrpc-flaky-test branch from b773679 to ce1cd4f Compare July 17, 2025 14:51
@titaneric
Copy link
Copy Markdown
Contributor Author

Thank you for helping rebase the main branch. However, the flasky test still exits.
I think my patch do not work at all. In this test, there is no content type is not json, skipping in the log, possibly missing due to ring buffer size limit.

@titaneric
Copy link
Copy Markdown
Contributor Author

titaneric commented Jul 17, 2025

I think I find the root cause. I take a look at original log in Line 9999-10017. For some reason, the ServeHTTP is called multiple times with the same go routine address (c000334e00).

In the first phase, the did detect the application/json content-type, and set invocation.json_content_type = 1, but did not copy header_inv content_type into new invocation. That cause second phase failed to detect content-type correctly.

autoinstrumenter-1      | time=2025-07-07T20:55:47.191Z level=DEBUG msg="=== uprobe/ServeHTTP === " component=BPFLogger pid=42624 comm=testserver
autoinstrumenter-1      | time=2025-07-07T20:55:47.191Z level=DEBUG msg="goroutine_addr c000334e00" component=BPFLogger pid=42624 comm=testserver
autoinstrumenter-1      | time=2025-07-07T20:55:47.191Z level=DEBUG msg="Decoded from existing traceparent" component=BPFLogger pid=42624 comm=testserver
autoinstrumenter-1      | time=2025-07-07T20:55:47.191Z level=DEBUG msg="tp: 00-ea6d20edeed65d3a29e3098306c82fea-2ed898ccf04c8057-01\xff\x10}`\x06ۦ\xff\xff" component=BPFLogger pid=42624 comm=testserver
autoinstrumenter-1      | time=2025-07-07T20:55:47.191Z level=DEBUG msg="Found content type in ongoing request: application/json" component=BPFLogger pid=42624 comm=testserver
autoinstrumenter-1      | time=2025-07-07T20:55:47.191Z level=DEBUG msg="path: /jsonrpc" component=BPFLogger pid=42624 comm=testserver
autoinstrumenter-1      | time=2025-07-07T20:55:47.191Z level=DEBUG msg="=== uprobe/ServeHTTP === " component=BPFLogger pid=42624 comm=testserver
autoinstrumenter-1      | time=2025-07-07T20:55:47.191Z level=DEBUG msg="goroutine_addr c000334e00" component=BPFLogger pid=42624 comm=testserver
autoinstrumenter-1      | time=2025-07-07T20:55:47.191Z level=DEBUG msg="Decoded from existing traceparent" component=BPFLogger pid=42624 comm=testserver
autoinstrumenter-1      | time=2025-07-07T20:55:47.191Z level=DEBUG msg="=== uprobe/proc body read goroutine === " component=BPFLogger pid=42624 comm=testserver
autoinstrumenter-1      | time=2025-07-07T20:55:47.191Z level=DEBUG msg="=== uprobe/proc newproc1 === " component=BPFLogger pid=42624 comm=testserver
autoinstrumenter-1      | time=2025-07-07T20:55:47.191Z level=DEBUG msg="creator_goroutine_addr c00058a540" component=BPFLogger pid=42624 comm=testserver
autoinstrumenter-1      | time=2025-07-07T20:55:47.191Z level=DEBUG msg="=== uprobe/proc newproc1 returns === " component=BPFLogger pid=42624 comm=testserver
autoinstrumenter-1      | time=2025-07-07T20:55:47.191Z level=DEBUG msg="creator_goroutine_addr c00058a540" component=BPFLogger pid=42624 comm=testserver
autoinstrumenter-1      | time=2025-07-07T20:55:47.191Z level=DEBUG msg="parent goroutine_addr c000334e00" component=BPFLogger pid=42624 comm=testserver
autoinstrumenter-1      | time=2025-07-07T20:55:47.191Z level=DEBUG msg="goroutine_addr c0002ce700" component=BPFLogger pid=42624 comm=testserver
autoinstrumenter-1      | time=2025-07-07T20:55:47.191Z level=DEBUG msg="=== uprobe/proc body read returns goroutine === " component=BPFLogger pid=42624 comm=testserver
autoinstrumenter-1      | time=2025-07-07T20:55:47.191Z level=DEBUG msg="n is 74" component=BPFLogger pid=42624 comm=testserver
autoinstrumenter-1      | time=2025-07-07T20:55:47.192Z level=DEBUG msg="content type is not json, skipping" component=BPFLogger pid=42624 comm=testserver

I may simply update code from

        if (header_inv && header_inv->content_type[0]) {
            bpf_dbg_printk("Found content type in ongoing request: %s", header_inv->content_type);
            if (is_json_content_type((void *)header_inv->content_type,
                                     sizeof(header_inv->content_type))) {

                invocation.json_content_type = 1;
            }
        }

into

        if (header_inv && header_inv->content_type[0]) {
            bpf_dbg_printk("Found content type in ongoing request: %s", header_inv->content_type);
            if (header_inv->json_content_type ||
                is_json_content_type((void *)header_inv->content_type,
                                     sizeof(header_inv->content_type))) {

                invocation.json_content_type = 1;
            }

to fix this flaky test.

@titaneric titaneric changed the title feat(go jsonrpc): add bound check in extract_json_string feat(go jsonrpc): fix jsonrpc flaky test Jul 17, 2025
@titaneric
Copy link
Copy Markdown
Contributor Author

titaneric commented Jul 17, 2025

I ran the TestMultiProcessAppCPNoIP integration test locally for at least 10 time, and it all pass. However, I am afriad that this patch would bring some side effect.

In addition, I am not sure whether the different request to the same http server would share the same go routine address. For a complete and robust fix, we may find the parent go routine of the given go routine, to check whether they are born from the same http server.

P.S. I would remove unnessary commits if we all agree this is the root cause.

@rafaelroquetto
Copy link
Copy Markdown
Contributor

@titaneric so I have been looking into the failed tests as well - if you look at the logs (look for HTTP 200 Arith.T /jsonrpc) you will see that things actually work, but for some reasons the test cannot find the trace in the jaeger database.

What I will do is updating the test so to ensure it waits for all traces to be submitted from OBI to the Jaeger DB and then see if that works.

Would that make sense to you? And would we still need this PR?

@titaneric
Copy link
Copy Markdown
Contributor Author

titaneric commented Jul 18, 2025

@rafaelroquetto , thank you for helping check the log. I think 5eb6851 is still needed. If you search HTTP 200 Arith.T /jsonrpc in the this action log. You would find only 9 matches, but it should have 10 traces found.

As this #298 (comment) explains, I think it's nessary to add this patch to resolve the flaky test.

@rafaelroquetto
Copy link
Copy Markdown
Contributor

@titaneric so we have been doing some digging and have managed to reproduce the bug once. When it happens, it appears that we do see the incoming request, but we fail to recognise it as JSON-RPC - either because the uprobe doesn't fire at all times, or most likely, because sometimes (rarely) the body can come chunked and we fail to parse it. It's a bit tricky.

Whilst looking into it, we realised that we may be able to simplify the code, and most importantly, do without attaching a uprobe to net/http.(*body).Read" altogether (which gets called often and could be a bottleneck).

This is what the call stack of handling a JSON request looks like:


 0  0x0000000000726f0e in net/http.(*body).Read
    at /usr/local/go/src/net/http/transfer.go:831
 1  0x0000000000aedd29 in github.com/open-telemetry/opentelemetry-ebpf-instrumentation/test/integration/components/testserver/jsonrpc.(*ReadWriteCloserWrapper).Read
    at <autogenerated>:1
 2  0x0000000000538588 in encoding/json.(*Decoder).refill
    at /usr/local/go/src/encoding/json/stream.go:165
 3  0x0000000000538185 in encoding/json.(*Decoder).readValue
    at /usr/local/go/src/encoding/json/stream.go:140
 4  0x0000000000537e55 in encoding/json.(*Decoder).Decode
    at /usr/local/go/src/encoding/json/stream.go:63
 5  0x0000000000aecc2c in net/rpc/jsonrpc.(*serverCodec).ReadRequestHeader
    at /usr/local/go/src/net/rpc/jsonrpc/server.go:66
 6  0x0000000000839718 in net/rpc.(*Server).readRequestHeader
    at /usr/local/go/src/net/rpc/server.go:587
 7  0x00000000008392cf in net/rpc.(*Server).readRequest
    at /usr/local/go/src/net/rpc/server.go:547
 8  0x00000000008389d1 in net/rpc.(*Server).ServeCodec
    at /usr/local/go/src/net/rpc/server.go:462
 9  0x0000000000839b6d in net/rpc.ServeCodec
    at /usr/local/go/src/net/rpc/server.go:677
10  0x0000000000aedae5 in github.com/open-telemetry/opentelemetry-ebpf-instrumentation/test/integration/components/testserver/jsonrpc.Setup.HTTPHandler.func1
    at ./test/integration/components/testserver/jsonrpc/jsonrpc.go:80
11  0x000000000071a4c9 in net/http.HandlerFunc.ServeHTTP
    at /usr/local/go/src/net/http/server.go:2294
12  0x0000000000739f0e in net/http.serverHandler.ServeHTTP
    at /usr/local/go/src/net/http/server.go:3301
13  0x00000000007189c5 in net/http.(*conn).serve
    at /usr/local/go/src/net/http/server.go:2102
14  0x000000000071e2a8 in net/http.(*Server).Serve.gowrap3
    at /usr/local/go/src/net/http/server.go:3454
15  0x000000000047d521 in runtime.goexit
    at /usr/local/go/src/runtime/asm_amd64.s:1700

Which means the go runtime already does the job of detecting whether we have a jsonrpc request - I believe all we need to do is attaching a uretprobe to net/rpc/jsonrpc.(*serverCodec).ReadRequestHeader() and read the contents of the r parameter - at which stage it will have already decoded the request for us.

Check this out:

 net/rpc/jsonrpc.(*serverCodec).ReadRequestHeader() /usr/lib/go/src/net/rpc/jsonrpc/server.go:74 (PC: 0xaed7c6)
Warning: debugging optimized function
    69:		r.ServiceMethod = c.req.Method
    70:	
    71:		// JSON request id can be any JSON value;
    72:		// RPC package expects uint64.  Translate to
    73:		// internal uint64 and save JSON on the side.
=>  74:		c.mutex.Lock()
    75:		c.seq++
    76:		c.pending[c.seq] = c.req.Id
    77:		c.req.Id = nil
    78:		r.Seq = c.seq
    79:		c.mutex.Unlock()
(dlv) p r
("*net/rpc.Request")(0xc000128500)
*net/rpc.Request {
	ServiceMethod: "Arith.Traceme",
	Seq: 0,
	next: *net/rpc.Request nil,}

and we could remove the probe from read altogether (and all the associated parsing). What do you think? Is this something you'd like to take on, or would you prefer if we did it?

Thanks again!

@titaneric
Copy link
Copy Markdown
Contributor Author

titaneric commented Jul 19, 2025

OMG, it's a huge discovery for me. Could I rewrite the present implementation with your idea?

I think it's much much robust, and I don't have to parse json-rpc method by mysel.

Really thank you for this help, I would raise another PR to re-implement it.

@titaneric
Copy link
Copy Markdown
Contributor Author

Raise #335 here, and I would close this PR.

@titaneric titaneric closed this Jul 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix flaky JSON-RPC tests

3 participants