-
Notifications
You must be signed in to change notification settings - Fork 1.9k
pack: fix token count regression in JSON tokenizer and remove parser reset #10733
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…reset This patch fixes a regression in flb_json_tokenise() where failure to reinitialize the jsmn_parser after reallocating tokens led to invalid or duplicated token counts on repeated parses. Signed-off-by: Eduardo Silva <[email protected]>
Signed-off-by: Eduardo Silva <[email protected]>
WalkthroughUpdates JSON tokenisation in src/flb_pack.c to preserve jsmn parser state across reallocations and derive token count from parser internals. Adds two internal tests validating streaming parsing and token count stability during reallocations. Minor non-functional test cleanups and registry updates. Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant flb_json_tokenise
participant jsmn_parser
Caller->>flb_json_tokenise: tokenise(json, state)
flb_json_tokenise->>jsmn_parser: parse chunk
alt OOM (JSMN_ERROR_NOMEM)
flb_json_tokenise->>flb_json_tokenise: realloc token buffer
note right of flb_json_tokenise: Preserve parser state<br/>(do not reset)
flb_json_tokenise->>jsmn_parser: continue parse
end
flb_json_tokenise->>flb_json_tokenise: tokens_count = parser.toknext
flb_json_tokenise-->>Caller: result (tokens, count)
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Assessment against linked issues
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/internal/pack.c (2)
297-385: Good streaming regression test; consider dynamic buffer sizing and stricter error checks.The test robustly exercises stateful parsing and validates record counts. Minor robustness improvements:
- Allocate the read buffer to match the file size instead of a fixed 4 KiB.
- Check flb_pack_state_init() return values inside the loop.
- Free the dynamically allocated buffer.
Apply this diff within the function to avoid potential overflow and improve checks:
@@ - char *out; - char buf[1024*4]; + char *out; + char *buf = NULL; @@ - ret = stat(JSON_BUG342, &st); + ret = stat(JSON_BUG342, &st); @@ if (ret == -1) { perror("stat"); exit(EXIT_FAILURE); } @@ + buf = flb_malloc(st.st_size); + if (buf == NULL) { + flb_errno(); + close(fd); + exit(EXIT_FAILURE); + } @@ - flb_pack_state_init(&state); + if (flb_pack_state_init(&state) != 0) { + flb_free(buf); + close(fd); + exit(EXIT_FAILURE); + } @@ - flb_pack_state_reset(&state); - flb_pack_state_init(&state); + flb_pack_state_reset(&state); + if (flb_pack_state_init(&state) != 0) { + flb_free(buf); + close(fd); + exit(EXIT_FAILURE); + } @@ - flb_pack_state_reset(&state); - close(fd); + flb_pack_state_reset(&state); + flb_free(buf); + close(fd);
1141-1183: Excellent regression test for token count stability across reallocations.This precisely catches the duplication regression. One small nit: consider replacing printf with TEST_MSG to keep test output consistent.
- printf("\nFirst parse: tokens_count=%d\n", state.tokens_count); + TEST_MSG("First parse: tokens_count=%d", state.tokens_count); @@ - printf("Second parse: tokens_count=%d (BUG if > ~301)\n", state.tokens_count); + TEST_MSG("Second parse: tokens_count=%d (BUG if > ~301)", state.tokens_count);
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these settings in your CodeRabbit configuration.
📒 Files selected for processing (2)
src/flb_pack.c(3 hunks)tests/internal/pack.c(5 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/internal/pack.c (2)
src/flb_pack.c (3)
flb_pack_state_init(354-383)flb_json_tokenise(56-101)flb_pack_state_reset(385-395)src/flb_sds.c (4)
flb_sds_create(78-90)flb_sds_cat_safe(204-214)flb_sds_printf(336-387)flb_sds_destroy(389-399)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-22.04, clang-12)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-22.04, clang-12)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-24.04, clang-14)
- GitHub Check: pr-compile-centos-7
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-24.04, clang-14)
- GitHub Check: PR - fuzzing test
🔇 Additional comments (5)
src/flb_pack.c (3)
67-67: Whitespace-only change — no action needed.No functional impact.
87-87: Whitespace-only change — no action needed.No functional impact.
98-99: Verified — using jsmn_parser.toknext for tokens_count is correctShort: I searched the codebase for mutations to tokens_count. The change (assigning from parser.toknext) is present and fixes the duplication issue; there are no other unsafe updates that would reintroduce the bug.
Pointers:
- src/flb_pack.c — flb_json_tokenise: sets state->tokens_count = state->parser.toknext; (line ~99)
- src/flb_pack.c — flb_pack_json_state: intentionally uses state->tokens_count += delim for partial/multiple handling (line ~454)
- src/flb_pack.c — flb_pack_state_init / flb_pack_state_reset: tokens_count initialized/reset to 0
- plugins/in_* (in_udp, in_unix_socket, in_tcp): pack_state.tokens_count initialized to 0 on conn setup
- tests/internal/pack.c: contains the unit check that would have detected doubling of tokens_count
Relevant snippet (unchanged):
/* always use jsmn_parser.toknext to count tokens */
state->tokens_count = state->parser.toknext;Conclusion: change is correct; only the explicit += in flb_pack_json_state is an intended accumulation for multi/partial parsing. No further fixes required.
tests/internal/pack.c (2)
994-994: Variable initialization moved — OK.The explicit initialization of off is clear and harmless.
1209-1211: Test list updates look good.New tests are properly registered.
|
|
||
| for (i = 0; i < sizeof(test_cases) / sizeof(test_cases[0]); i++) { | ||
| p_in = test_cases[i].json_str; | ||
| p_in = (char *) test_cases[i].json_str; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Avoid casting away const; make p_in const char.*
Casting a const string to char* can hide bugs and violates const-correctness. flb_pack_json accepts const char*, so keep p_in const.
Apply this diff to the assignment:
- p_in = (char *) test_cases[i].json_str;
+ p_in = test_cases[i].json_str;Additionally, adjust the declaration of p_in (outside this hunk) from:
char *p_in;to:
const char *p_in;🤖 Prompt for AI Agents
In tests/internal/pack.c around line 1107, the code casts a const JSON string to
char* with "p_in = (char *) test_cases[i].json_str;"; change the assignment to
keep the const qualifier (p_in = test_cases[i].json_str;) and update p_in's
declaration (earlier in the file) from "char *p_in;" to "const char *p_in;" so
p_in matches flb_pack_json's const char* parameter and preserves
const-correctness.
Fixes #10729 (regression introduced in v4.0.7)
This patch fixes a regression in flb_json_tokenise() where failure to reinitialize the jsmn_parser after reallocating tokens led to invalid or duplicated token counts on repeated parses.
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
Summary by CodeRabbit