feat: perf.rs - token counting analysis #1985
base: main
Conversation
Resolved conflicts between performance recording module implementation and empty file, keeping the full implementation. Also resolved conflicts in HTTP service tests to use proper API base configuration and service readiness checking. -Agent Generated Commit Message
Walkthrough

The changes introduce new utilities for handling bzip2-compressed files, including extraction, validation, and automatic cleanup, along with a builder-based API. Token counting and analysis for streaming responses are implemented, supporting multiple data sources and validation. Dependency configurations are updated, and new modules for utilities and token operations are added. Tests and data files are also updated accordingly.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Test as Test Code
    participant Bzip2ExtractorBuilder
    participant Bzip2Extractor
    participant Bzip2Extraction
    Test->>Bzip2ExtractorBuilder: new().source_path(...).target_filename(...).extract()
    Bzip2ExtractorBuilder->>Bzip2Extractor: build extractor
    Bzip2Extractor->>Bzip2Extractor: perform_extraction()
    Bzip2Extractor->>Bzip2Extraction: return handle to extracted file
    Test->>Bzip2Extraction: read_to_string()/validate_blake3_hash()
    Note right of Bzip2Extraction: Temp dir auto-deleted on drop
```

```mermaid
sequenceDiagram
    participant Perf as perf::record_data_stream
    participant DataStream
    participant RecordedStream
    Perf->>DataStream: consume all items
    DataStream-->>Perf: yield items
    Perf->>Perf: record items with timestamp/sequence
    Perf->>RecordedStream: return completed recording
```

```mermaid
sequenceDiagram
    participant Analyzer as analyze_token_counting
    participant RecordedStream
    participant TokenExtractor
    participant Tokenizer
    Analyzer->>RecordedStream: iterate responses
    loop For each response
        Analyzer->>TokenExtractor: extract_tokens_by_choice(tokenizer)
        TokenExtractor->>Tokenizer: (if needed) count tokens
        Tokenizer-->>TokenExtractor: token count/IDs
        TokenExtractor-->>Analyzer: token data per choice
    end
    Analyzer->>Analyzer: aggregate, validate, summarize
    Analyzer-->>Test: TokenAnalysis result
```
Actionable comments posted: 0
🧹 Nitpick comments (3)
lib/llm/src/tokenizers.rs (1)
150-161: Consider removing redundant trait implementation.

This explicit implementation of `TokenCounter` for `Tokenizer` appears redundant since:

- `Tokenizer` derefs to `Arc<dyn traits::Tokenizer>`
- `traits::Tokenizer` requires `Encoder`
- The blanket implementation already provides `TokenCounter` for all `Encoder` types

Unless there's a specific reason for this explicit implementation (e.g., avoiding deref complications), consider removing it to reduce code duplication.
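For context, a minimal sketch of the shape described above, assuming simplified signatures (the trait and type names follow the review, but the bounds, error type, and the collapsing of `traits::Tokenizer` into `Encoder` are illustrative assumptions):

```rust
// Sketch only: shows why an explicit impl can be redundant when a blanket
// impl already covers the wrapped trait object reachable through Deref.
// In the real crate, Tokenizer wraps Arc<dyn traits::Tokenizer> where
// traits::Tokenizer: Encoder; this sketch collapses that to Encoder.
use std::ops::Deref;
use std::sync::Arc;

type TokenIdType = u32;

trait Encoder {
    fn encode(&self, input: &str) -> anyhow::Result<Vec<TokenIdType>>;
}

trait TokenCounter {
    fn count_tokens(&self, input: &str) -> anyhow::Result<usize>;
}

// Blanket implementation: every Encoder (including trait objects) gets
// TokenCounter for free.
impl<T: Encoder + ?Sized> TokenCounter for T {
    fn count_tokens(&self, input: &str) -> anyhow::Result<usize> {
        Ok(self.encode(input)?.len())
    }
}

struct Tokenizer(Arc<dyn Encoder + Send + Sync>);

impl Deref for Tokenizer {
    type Target = dyn Encoder + Send + Sync;
    fn deref(&self) -> &Self::Target {
        self.0.as_ref()
    }
}

// Because method lookup auto-derefs to the inner trait object, callers can
// already write `tokenizer.count_tokens(...)`; a second, explicit
// `impl TokenCounter for Tokenizer` mostly duplicates that path.
```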
lib/llm/src/perf.rs (1)
392-404: Clean implementation with a minor consideration.

The function correctly handles file reading, SSE parsing, and stream conversion. Note that `read_to_string` loads the entire file into memory, which could be an issue for very large SSE files. For typical SSE replay data this should be fine, but consider documenting this limitation or using streaming file reading for very large files.
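If very large replay files ever become a concern, one possible direction is a buffered, line-by-line read instead of `read_to_string`; a rough, generic sketch (not the module's current API, and the SSE-parsing step into response events is elided):

```rust
// Sketch: walk an SSE replay file line by line without loading the whole
// file into memory. Only the buffered-read side is shown; turning payloads
// into annotated responses is left to the existing parsing utilities.
use std::fs::File;
use std::io::{BufRead, BufReader};

fn for_each_sse_payload(path: &str, mut handle: impl FnMut(&str)) -> std::io::Result<()> {
    let reader = BufReader::new(File::open(path)?);
    for line in reader.lines() {
        let line = line?;
        // SSE data lines start with "data:"; everything else is framing.
        if let Some(payload) = line.strip_prefix("data:") {
            handle(payload.trim());
        }
    }
    Ok(())
}
```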
lib/llm/src/utils/bzip2.rs (1)

476-476: Consider more robust test file path construction.

The hardcoded relative path assumes tests run from the repository root. Consider using `env!("CARGO_MANIFEST_DIR")` or similar to construct a more robust path:

```diff
-let tokenizer_file_path = "dynamo/lib/llm/tests/data/replays/deepseek-r1-distill-llama-8b/tokenizer-deepseek-r1-distill-llama-8b.json.bz2";
+let tokenizer_file_path = std::path::Path::new(env!("CARGO_MANIFEST_DIR"))
+    .join("tests/data/replays/deepseek-r1-distill-llama-8b/tokenizer-deepseek-r1-distill-llama-8b.json.bz2");
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
- `Cargo.lock` is excluded by `!**/*.lock`
- `lib/bindings/python/Cargo.lock` is excluded by `!**/*.lock`
- `lib/llm/tests/data/replays/deepseek-r1-distill-llama-8b/tokenizer-deepseek-r1-distill-llama-8b.json.bz2` is excluded by `!**/*.bz2`
📒 Files selected for processing (11)
- `.gitattributes` (1 hunks)
- `Cargo.toml` (1 hunks)
- `lib/llm/Cargo.toml` (2 hunks)
- `lib/llm/src/lib.rs` (1 hunks)
- `lib/llm/src/perf.rs` (3 hunks)
- `lib/llm/src/perf/logprobs.rs` (2 hunks)
- `lib/llm/src/perf/tokens.rs` (1 hunks)
- `lib/llm/src/tokenizers.rs` (2 hunks)
- `lib/llm/src/utils.rs` (1 hunks)
- `lib/llm/src/utils/bzip2.rs` (1 hunks)
- `lib/llm/tests/data/replays/deepseek-r1-distill-llama-8b/tokenizer.b3sum` (1 hunks)
🧰 Additional context used
🧠 Learnings (6)
Cargo.toml (1)
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
lib/llm/Cargo.toml (1)
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
lib/llm/src/utils.rs (1)
Learnt from: alec-flowers
PR: ai-dynamo/dynamo#1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
lib/llm/src/perf/logprobs.rs (2)
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
lib/llm/src/perf.rs (1)
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.
lib/llm/src/perf/tokens.rs (1)
Learnt from: ishandhanani
PR: ai-dynamo/dynamo#1626
File: lib/llm/src/preprocessor.rs:238-239
Timestamp: 2025-06-24T20:59:35.725Z
Learning: In lib/llm/src/preprocessor.rs, the `sampling_options` call in the `preprocess_request` method is placed in the common section after the match statement on `request.prompt_input_type()`, meaning it applies to both `PromptInput::Tokens` and `PromptInput::Text` request types.
🧬 Code Graph Analysis (4)
lib/llm/src/perf/logprobs.rs (1)
- lib/llm/src/perf.rs (2)
  - `read_annotated_stream_from_file` (392-404)
  - `record_stream_with_context` (292-301)

lib/llm/src/tokenizers.rs (1)
- lib/llm/src/perf/tokens.rs (2)
  - `count_tokens` (494-497)
  - `count_tokens_with_ids` (499-507)

lib/llm/src/perf.rs (3)
- lib/llm/src/protocols/codec.rs (1)
  - `create_message_stream` (296-302)
- lib/llm/src/protocols.rs (1)
  - `convert_sse_stream` (51-68)
- lib/llm/src/perf/logprobs.rs (3)
  - `new` (54-101), `new` (1575-1579), `read_annotated_stream_from_file` (1453-1453)

lib/llm/src/perf/tokens.rs (3)
- lib/llm/src/tokenizers.rs (17)
  - `tokenizer` (398-400), `new` (245-254), `new` (326-333), `new` (553-561), `token_ids` (58-63), `token_ids` (402-404), `len` (339-341), `len` (531-533), `count_tokens` (92-92), `count_tokens` (100-102), `count_tokens` (151-153), `text` (406-411), `count_tokens_with_ids` (95-95), `count_tokens_with_ids` (104-109), `count_tokens_with_ids` (155-160), `builder` (472-474), `from_file` (126-128)
- lib/llm/src/perf.rs (9)
  - `new` (51-57), `new` (105-115), `new` (578-582), `read_annotated_stream_from_file` (392-404), `record_data_stream` (371-389), `response_count` (118-120), `data` (60-62), `responses` (128-130), `start_time` (133-135)
- lib/llm/src/utils/bzip2.rs (1)
  - `builder` (124-126)
⏰ Context from checks skipped due to timeout of 90000ms (4)
You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms).
- GitHub Check: Build and Test - vllm
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (.)
🔇 Additional comments (30)
lib/llm/tests/data/replays/deepseek-r1-distill-llama-8b/tokenizer.b3sum (1)
1-1: LGTM - Checksum file is correctly formatted

The BLAKE3 checksum follows the standard format and will enable integrity verification of the extracted tokenizer.json file from the bzip2 archive.
lib/llm/src/lib.rs (1)
36-36: LGTM - Utils module addition is clean

The new utils module declaration follows the established pattern and appropriately exposes the bzip2 decompression functionality for the crate.
Cargo.toml (1)
53-53: Confirm bzip2 0.6.0 is up-to-date and secure
- `bzip2 = { version = "0.6" }` points to 0.6.0 (released June 17, 2025), which is the latest stable version.
- Transitions to a pure Rust implementation using `libbz2-rs-sys` by default.
- Addresses CVE-2023-22895 (DoS via integer overflow in earlier releases); no known security issues in 0.6.0.
No further changes required—this dependency can be approved as-is.
lib/llm/Cargo.toml (2)
54-54: LGTM - Proper workspace dependency usage

The bzip2 workspace dependency addition is correctly configured.
81-81: Good dependency standardization

Converting blake3 to use workspace dependency improves consistency with the overall dependency management approach.
.gitattributes (1)
13-13: LGTM - Proper Git LFS configuration

The Git LFS configuration for the bzip2 tokenizer file is correctly set up with appropriate attributes for handling large binary files.
lib/llm/src/utils.rs (1)
1-5: LGTM!

Clean module structure for utilities. The dedicated `utils` module with the `bzip2` submodule provides good organization for utility functions.

lib/llm/src/perf/logprobs.rs (2)
571-574: Good refactoring to use higher-level utilities.

The import changes align well with the new `read_annotated_stream_from_file` utility, simplifying the test setup by removing the need for manual SSE stream creation and conversion.
453-459: Excellent simplification of the test setup.

The refactoring effectively replaces the manual SSE parsing and conversion steps with the new `read_annotated_stream_from_file` utility, making the test more concise and maintainable while preserving the same functionality.

lib/llm/src/tokenizers.rs (2)
89-96: Well-designed trait for token counting.

The `TokenCounter` trait provides a clean interface for token counting with appropriate thread safety bounds and error handling. The separation between `count_tokens` (just count) and `count_tokens_with_ids` (IDs + count) offers good flexibility.
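As a rough illustration of that split, a sketch of what such a trait could look like (the `count_tokens_with_ids` return type matches the diff quoted in the next comment; everything else, including the `Send + Sync` bounds, is an assumption):

```rust
// Sketch of the two-method split: one call when only a count is needed,
// another when the token IDs should come back as well. Names follow the
// review; exact signatures are assumptions.
use anyhow::Result;

type TokenIdType = u32;

pub trait TokenCounter: Send + Sync {
    /// Count the tokens in `input` without materializing the IDs for the caller.
    fn count_tokens(&self, input: &str) -> Result<usize>;

    /// Return the token IDs for `input` along with their count.
    fn count_tokens_with_ids(&self, input: &str) -> Result<(Vec<TokenIdType>, usize)>;
}
```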
98-110: Good blanket implementation with a minor optimization opportunity.

The blanket implementation correctly leverages the existing `Encoder` trait. One micro-optimization considered for `count_tokens_with_ids`:

```diff
 fn count_tokens_with_ids(&self, input: &str) -> Result<(Vec<TokenIdType>, usize)> {
     let encoding = self.encode(input)?;
     let token_ids = encoding.token_ids().to_vec();
-    let count = token_ids.len();
-    Ok((token_ids, count))
+    Ok((token_ids, encoding.token_ids().len()))
 }
```

On balance, the current implementation is fine as is: it is clear, and the performance difference is negligible.
lib/llm/src/perf.rs (3)
10-16: LGTM!

The new imports appropriately support the token analysis module and annotated stream functionality.
29-30: Good API design with descriptive aliases.

The re-exports provide cleaner, more descriptive names for the SSE parsing utilities, making the public API more intuitive.
351-389: Well-implemented stream recording function.

The `record_data_stream` function provides a clean, focused API for consuming and recording entire streams. Good documentation and proper async handling.
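A hedged usage sketch of that API (the crate path, the stand-in stream, and the exact signatures are assumptions; only the names `record_data_stream`, `response_count`, and `responses` come from this review):

```rust
// Sketch: record a stand-in data stream and inspect the recording. A real
// caller would pass an engine's streaming response instead of stream::iter.
use dynamo_llm::perf; // crate/module path assumed
use futures::stream;

async fn record_example() {
    let data_stream = stream::iter(vec!["a", "b", "c"]);

    // Consume the whole stream, recording each item with timing and
    // sequence metadata as it arrives.
    let recorded = perf::record_data_stream(data_stream).await;

    assert_eq!(recorded.response_count(), 3);
    assert_eq!(recorded.responses().len(), 3);
}
```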
lib/llm/src/utils/bzip2.rs (5)

1-36: Excellent module documentation and organization.

The module documentation clearly explains the purpose, use cases, and provides helpful examples. Good choice of dependencies with `anyhow` for error handling and `tempfile` for RAII cleanup.
73-120: Well-implemented builder pattern.

The builder provides a clean, fluent API with proper validation. Good error handling for missing required fields and clear documentation of the `extract()` method.
132-188: Robust extraction implementation.

The extraction logic handles edge cases well:
- Pre-validation of source file existence
- Smart filename derivation with `.bz2` stripping
- Comprehensive error contexts for debugging
- Explicit `flush()` to ensure data integrity
190-269: Excellent API design with RAII and validation.

The `Bzip2Extraction` struct provides a clean API with:
- Automatic cleanup via RAII pattern
- Convenient read methods with proper error contexts
- BLAKE3 hash validation for integrity verification
- Good encapsulation with private `_temp_dir`
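Putting those pieces together, a hedged usage sketch based on the builder flow in the first sequence diagram (the crate path, the placeholder hash, and whether `validate_blake3_hash` takes the expected hex string directly are assumptions):

```rust
// Sketch: extract a .bz2 file into a temp dir, verify it, read it, and let
// RAII clean up when the handle is dropped. Method names follow the
// sequence diagram earlier in this review.
use dynamo_llm::utils::bzip2::Bzip2ExtractorBuilder; // crate path assumed

fn extract_tokenizer() -> anyhow::Result<String> {
    let extraction = Bzip2ExtractorBuilder::new()
        .source_path("tests/data/replays/deepseek-r1-distill-llama-8b/tokenizer-deepseek-r1-distill-llama-8b.json.bz2")
        .target_filename("tokenizer.json")
        .extract()?;

    // Integrity check against the committed .b3sum value (placeholder here).
    extraction.validate_blake3_hash("<expected-blake3-hex>")?;

    // Read the extracted file; the backing temp dir is removed on drop.
    let contents = extraction.read_to_string()?;
    Ok(contents)
}
```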
271-514: Comprehensive test coverage.

Excellent test suite covering:
- All major functionality paths
- Error conditions and edge cases
- RAII cleanup verification
- Complex filename patterns
- Integration test with graceful skip if test file is missing
The tests provide confidence in the implementation's correctness and robustness.
lib/llm/src/perf/tokens.rs (11)
1-46: Excellent module documentation!

The documentation clearly explains the module's purpose, provides comprehensive usage examples, and covers both basic and advanced scenarios. The examples are well-structured and demonstrate different token counting methods.
47-59: Good use of type aliases for semantic clarity.

The type aliases `TokenCount` and `ForwardPassDuration` improve code readability and make the intent clear.
60-132: Well-designed enum for representing token data sources.

The `TokenDataSource` enum effectively captures different token counting methods with varying accuracy levels. The implementation methods are clean and the documentation clearly explains the trade-offs of each approach.
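The review does not list the variants, but the idea of a source-plus-accuracy enum might look roughly like this (variant and method names are purely illustrative, not the PR's actual definitions, hence the `Sketch` suffix):

```rust
// Illustrative only: one way to model "where did this token count come from,
// and how much should we trust it". The PR's real TokenDataSource may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenDataSourceSketch {
    /// Counted by running the actual tokenizer over the text (most accurate).
    Tokenizer,
    /// Derived from per-token logprob entries in the response.
    Logprobs,
    /// Approximated (e.g., one token per non-empty delta) when nothing better exists.
    Approximation,
}

impl TokenDataSourceSketch {
    /// Lower is better; used to pick the most accurate available source.
    fn accuracy_rank(self) -> u8 {
        match self {
            Self::Tokenizer => 0,
            Self::Logprobs => 1,
            Self::Approximation => 2,
        }
    }
}
```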
133-151: Well-structured data type for choice-specific token information.

The `ChoiceTokenData` struct comprehensively captures all relevant token information for a single choice, including debugging aids like content storage.
152-220: Robust implementation of TokenExtractor with good error handling.

The implementation correctly handles tokenizer failures by falling back to single-token approximation. The empty content handling is appropriate.

Minor observation: There's a slight inconsistency in `token_ids` when content is empty - it's `Some(Vec::new())` with a tokenizer but `None` without. This might be intentional to distinguish between "no tokens found by tokenizer" vs "no tokenizer available", but consider documenting this distinction if it's important.
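A compact sketch of the fallback behavior being described (the helper name and closure parameter are hypothetical; only the fall-back-to-one-token rule comes from the review):

```rust
// Sketch: prefer a real tokenizer count for a content delta, fall back to a
// single-token approximation if tokenization fails, and count nothing for
// empty content.
fn tokens_for_delta(
    content: &str,
    count_tokens: impl Fn(&str) -> anyhow::Result<usize>,
) -> usize {
    if content.is_empty() {
        return 0;
    }
    match count_tokens(content) {
        Ok(n) => n,
        // Tokenizer failed: approximate the delta as a single token.
        Err(_) => 1,
    }
}
```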
221-256: Well-designed structs for token timeline tracking and analysis.

Both `TokenEvent` and `ChoiceTokenAnalysis` effectively capture the temporal aspects of token generation and provide comprehensive analysis data per choice.
257-365: Excellent implementation of comprehensive token analysis with robust validation.

The `analyze_token_counting` function implements thorough validation including:
- Detection of tokens generated after a choice has finished
- Tracking of incomplete streams
- Selection of the most accurate data source
- Comprehensive timeline tracking
The validation logic is particularly well-designed and will help catch streaming protocol violations.
366-478: Comprehensive and user-friendly analysis API.

The `TokenAnalysis` implementation provides excellent utility methods:
- `print_summary` with clear visual indicators
- `token_rate_for_choice` correctly excluding empty responses from rate calculation
- `validate_data_source_consistency` for detecting potential issues

The API is well-designed for both programmatic access and human-readable output.
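Tying the pieces together, a hedged end-to-end sketch of how these names appear to compose (module paths, argument lists, async-ness, and the replay filename are assumptions; only the function and method names come from this review):

```rust
// Sketch: replay a recorded SSE stream, record it with timing metadata,
// analyze token counting per choice, and print a summary. Signatures are
// assumptions based on names referenced in this review.
use dynamo_llm::perf::{self, tokens}; // crate/module paths assumed

async fn analyze_replay() -> anyhow::Result<()> {
    // Rebuild an annotated response stream from an SSE replay file.
    let stream = perf::read_annotated_stream_from_file("tests/data/replays/example.sse").await?;

    // Consume and record every response.
    let recorded = perf::record_data_stream(stream).await;

    // Count tokens per choice, preferring the most accurate data source.
    // The real analyzer may also take a tokenizer for exact counts.
    let analysis = tokens::analyze_token_counting(&recorded)?;

    analysis.print_summary();
    analysis.validate_data_source_consistency()?;
    println!("choice 0: {:?} tokens/sec", analysis.token_rate_for_choice(0));
    Ok(())
}
```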
479-509: Well-structured test infrastructure with good helper functions.

The test module provides:
- A simple but effective `MockTokenizer` implementation
- Reusable helper functions that reduce test code duplication
- Clear test data creation patterns
Also applies to: 1114-1208
510-974: Excellent test coverage with comprehensive edge case handling.

The test suite thoroughly covers:
- All data source types
- Multiple choice scenarios with varying timelines
- Validation error detection
- Edge cases like empty content and missing choices
- Complex scenarios like choices finishing at different times
The tests are well-structured and clearly document the expected behavior.
975-1112: Excellent integration test demonstrating real-world usage.

This integration test effectively:
- Tests the full pipeline with real tokenizer and stream data
- Validates file integrity with BLAKE3 hash verification
- Demonstrates proper usage of the bzip2 extraction utilities
- Includes comprehensive assertions on the analysis results
The hard-coded expectation of 32 tokens is appropriate for this specific test case with known input data.
…ynamo into ryan/token-counting-analysis
This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days. |
WIP
Signed-off-by: Ryan Olson <[email protected]>
Resolved merge conflicts while preserving token counting improvements:
- Updated dependencies to latest versions from main
- Kept gzip compression utilities and token counting features
- Resolved import conflicts in logprobs.rs and tokens.rs
- Updated workflow files with latest CI improvements
- Removed deprecated tools.rs file as deleted in main
- Regenerated Cargo.lock files with updated dependencies
- Fixed API changes in blake3 and response structures

Token counting analysis features preserved:
- Token counting infrastructure with multiple data sources
- Performance recording framework for streaming responses
- GzipExtractor utility for compressed file handling
- Mock tokenizer tests replacing deepseek dependency

All tests pass successfully after merge.

Signed-off-by: Ryan Olson <[email protected]>
/ok to test
@ryanolson, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
Merged main branch resolving conflicts and preserving token counting features.

**Conflict Resolutions:**
- .gitattributes: Added *.fatbin binary handling
- .gitignore: Preserved SDK binaries ignore entry
- Cargo.toml: Integrated ModelExpress dependencies, used main's rand 0.9.0
- lib/llm/Cargo.toml: Kept flate2/ahash deps, accepted ModelExpress integration
- Cargo.lock: Regenerated for dependency resolution

**Code Fixes:**
- Fixed blake3 API: Replaced deprecated mmap methods with std::fs::read
- Fixed response struct: Removed .inner field access in TokenExtractor
- Fixed utils module conflict: Merged gzip and prefix_matcher modules
- Fixed test imports: Updated to dynamo_async_openai
- Fixed clippy warnings: Collapsed nested if, removed unnecessary to_string()
- Added missing reasoning_content field to ChatCompletionStreamResponseDelta

All tests passing (323), clippy clean, formatted with Rust 1.90.
Summary
Adds comprehensive token counting analysis infrastructure to Dynamo for LLM streaming responses and logprob analysis.
🔄 Updated: Successfully merged main branch (32 commits) preserving all token counting features
Key Features
Token Counting Analysis
Logprob Analysis
GGUF Tokenizer Support
Merge Highlights
✅ Main Branch Integration (32 commits)
Usage
Files Changed
- `lib/llm/src/perf/tokens.rs` - Core token counting analysis
- `lib/llm/src/perf/logprobs.rs` - Logprob analysis utilities
- `lib/llm/src/gguf/` - GGUF tokenizer integration
- `lib/llm/src/utils/gzip.rs` - Compressed model utilities

Ready for production use with full main branch compatibility.