feat: mixed isl/osl distributions #322
Conversation
Walkthrough
Adds sequence-length distribution support: new distribution models and parser, a PromptConfig CLI/config field and validator, BaseDatasetComposer per-turn sampling with a seeded RNG and caching, a Synthetic composer using per-turn lengths, tests, docs, and public exports.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant CLI as CLI/Config
    participant PC as PromptConfig
    participant DP as DistributionParser
    participant BDC as BaseDatasetComposer
    participant SD as SequenceLengthDistribution
    participant SYN as SyntheticComposer
    participant T as Turn
    CLI->>PC: provide `sequence_distribution` string
    PC->>DP: model validator parses string
    alt parse OK
        DP-->>PC: SequenceLengthDistribution
    else parse fails
        DP-->>PC: raise ValueError
    end
    PC-->>BDC: get_sequence_distribution() -> SD or None
    BDC->>BDC: init _seq_rng (seeded?) and _turn_sequence_cache
    loop per turn
        SYN->>BDC: request lengths for turn_id
        BDC->>BDC: _get_turn_sequence_lengths(turn_id)
        alt SD present
            BDC->>SD: sample(random_state=_seq_rng)
            SD-->>BDC: (isl, osl)
            BDC->>BDC: cache (turn_id -> (isl, osl))
        else no SD
            BDC-->>BDC: compute/sample via legacy mean/stddev
        end
        BDC-->>SYN: return (isl, osl)
        SYN->>T: _generate_text_payloads(turn, is_first) uses isl/osl
        SYN->>BDC: finalize turn
        BDC->>BDC: _clear_turn_cache(turn_id)
    end
```
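To make the flow above concrete, here is a minimal sketch of how a composer could implement the seeded, cached per-turn sampling. The class body, constructor arguments, and the exact `sample()` signature are assumptions for illustration; this is not the code under review.

```python
import numpy as np


class _ComposerSketch:
    """Illustrative stand-in for the composer's per-turn sampling flow (not the real class)."""

    def __init__(self, distribution, seed=None):
        self._seq_distribution = distribution          # a SequenceLengthDistribution, or None
        self._seq_rng = np.random.default_rng(seed)    # seeded once, reused for every sample
        self._turn_sequence_cache = {}                 # turn_id -> (isl, osl)

    def _get_turn_sequence_lengths(self, turn_id):
        # Return the cached pair so ISL and OSL stay consistent within one turn.
        if turn_id not in self._turn_sequence_cache:
            isl, osl = self._seq_distribution.sample(random_state=self._seq_rng)
            self._turn_sequence_cache[turn_id] = (isl, osl)
        return self._turn_sequence_cache[turn_id]

    def _clear_turn_cache(self, turn_id):
        # Drop the entry once the turn has been finalized.
        self._turn_sequence_cache.pop(turn_id, None)
```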
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- aiperf/common/config/prompt_config.py (2 hunks)
- aiperf/common/sequence_distribution.py (1 hunks)
- aiperf/dataset/composer/base.py (2 hunks)
- aiperf/dataset/composer/synthetic.py (2 hunks)
- tests/test_sequence_distribution.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
tests/test_sequence_distribution.py (2)
- aiperf/common/sequence_distribution.py (10): DistributionParser (219-348), SequenceLengthDistribution (47-200), SequenceLengthPair (23-44), create_balanced_distribution (365-383), create_uniform_distribution (351-362), sample (93-119), sample_batch (121-151), pairs (154-156), get_statistics (158-192), parse (227-265)
- aiperf/common/config/prompt_config.py (2): PromptConfig (167-258), get_sequence_distribution (234-258)
aiperf/dataset/composer/base.py (4)
- aiperf/common/config/prompt_config.py (1): get_sequence_distribution (234-258)
- aiperf/common/sequence_distribution.py (1): sample (93-119)
- aiperf/common/models/dataset_models.py (1): Turn (43-70)
- aiperf/dataset/utils.py (1): sample_positive_normal_integer (113-124)
aiperf/common/config/prompt_config.py (4)
- aiperf/common/config/cli_parameter.py (1): CLIParameter (10-19)
- aiperf/common/config/groups.py (1): Groups (6-28)
- aiperf/common/config/config_defaults.py (1): InputTokensDefaults (83-86)
- aiperf/common/sequence_distribution.py (3): DistributionParser (219-348), create_uniform_distribution (351-362), parse (227-265)
aiperf/dataset/composer/synthetic.py (2)
- aiperf/dataset/composer/base.py (1): _sample_sequence_lengths (65-80)
- aiperf/dataset/generator/prompt.py (1): generate (96-118)
🪛 Ruff (0.13.2)
aiperf/common/sequence_distribution.py
33-35: Avoid specifying long messages outside the exception class
(TRY003)
37-39: Avoid specifying long messages outside the exception class
(TRY003)
41-41: Avoid specifying long messages outside the exception class
(TRY003)
66-68: Avoid specifying long messages outside the exception class
(TRY003)
82-85: Avoid specifying long messages outside the exception class
(TRY003)
135-135: Avoid specifying long messages outside the exception class
(TRY003)
246-246: Avoid specifying long messages outside the exception class
(TRY003)
263-265: Avoid specifying long messages outside the exception class
(TRY003)
273-273: Avoid specifying long messages outside the exception class
(TRY003)
277-277: Avoid specifying long messages outside the exception class
(TRY003)
285-285: Avoid specifying long messages outside the exception class
(TRY003)
298-298: Avoid specifying long messages outside the exception class
(TRY003)
316-316: Avoid specifying long messages outside the exception class
(TRY003)
332-334: Avoid specifying long messages outside the exception class
(TRY003)
346-346: Avoid specifying long messages outside the exception class
(TRY003)
378-378: Avoid specifying long messages outside the exception class
(TRY003)
aiperf/dataset/composer/synthetic.py
116-116: Unpacked variable osl is never used. Prefix it with an underscore or any other dummy variable pattern. (RUF059)
Codecov Report: ❌ Patch coverage is
Actionable comments posted: 1
♻️ Duplicate comments (2)
aiperf/dataset/composer/base.py (2)
29-30: Gate distribution initialization when prompts are disabled.

If prompts are disabled (e.g., --prompt-input-tokens-mean 0) and no explicit --seq-dist is provided, get_sequence_distribution() may fall back to a default with ISL=0 (via the fallback at lines 68-74), which violates SequenceLengthPair validation (ISL must be positive) and prevents dataset creation even when only image/audio payloads are enabled.

Solution: Only initialize _seq_distribution when prompts are enabled or an explicit distribution is provided:

```diff
-        # Initialize sequence distribution
-        self._seq_distribution = config.input.prompt.get_sequence_distribution()
+        # Initialize sequence distribution only if prompts are enabled or explicitly specified
+        if config.input.prompt.input_tokens.mean > 0 or config.input.prompt.sequence_distribution:
+            self._seq_distribution = config.input.prompt.get_sequence_distribution()
+        else:
+            self._seq_distribution = None
```

This ensures that when prompts are disabled, the distribution is not initialized, and the fallback in _sample_sequence_lengths is never invoked with ISL=0.
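To illustrate the failure mode described above, here is a small self-contained sketch of the kind of positive-ISL validation the comment refers to. It mimics the described behavior with a hypothetical stand-in class rather than reproducing the project's actual SequenceLengthPair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairSketch:
    """Hypothetical stand-in for SequenceLengthPair's positive-ISL check."""
    input_seq_len: int
    output_seq_len: int
    probability: float

    def __post_init__(self):
        if self.input_seq_len <= 0:
            raise ValueError("input_seq_len must be positive")


# A fallback distribution built from a disabled prompt config (mean=0) would hit
# this check and abort dataset creation:
# PairSketch(input_seq_len=0, output_seq_len=128, probability=1.0)  # -> ValueError
```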
76-78: Seed RNG once during initialization, not on every sample.

Passing config.input.random_seed (an int) into sample(random_state=random_seed) on every call recreates a fresh np.random.Generator with the same seed each time. This causes every turn to receive identical (ISL, OSL) pairs when a seed is configured, completely defeating the distribution.

Solution: Create a single np.random.Generator during __init__ and reuse it for all sampling:

```diff
+import numpy as np
 ...
 class BaseDatasetComposer(AIPerfLoggerMixin, ABC):
     def __init__(self, config: UserConfig, tokenizer: Tokenizer, **kwargs):
         ...
         self._seq_distribution = config.input.prompt.get_sequence_distribution()
+        seed = getattr(self.config.input, "random_seed", None)
+        self._seq_rng = np.random.default_rng(seed) if seed is not None else None

     def _sample_sequence_lengths(self) -> tuple[int, int]:
         ...
-        # Use random seed from config if available for reproducible results
-        random_seed = getattr(self.config.input, "random_seed", None)
-        return self._seq_distribution.sample(random_state=random_seed)
+        # Use seeded generator for reproducible sampling
+        return self._seq_distribution.sample(random_state=self._seq_rng)
```

This ensures sampling remains stochastic yet reproducible when a seed is provided.
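The reseeding pitfall is easy to reproduce in isolation. The snippet below is a small illustrative demo (not project code) showing why the generator must persist across calls.

```python
import numpy as np

SEED = 42

# Recreating the generator from the same int seed on every call:
# each call replays the same state, so every draw is identical.
resampled = [np.random.default_rng(SEED).integers(100, 2000) for _ in range(3)]
print(resampled)  # three identical values

# Creating the generator once and reusing it lets the state advance,
# so the draws vary while the whole sequence stays reproducible across runs.
rng = np.random.default_rng(SEED)
persistent = [rng.integers(100, 2000) for _ in range(3)]
print(persistent)  # three (generally) different values, identical on every run
```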
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- aiperf/common/config/prompt_config.py (1 hunks)
- aiperf/dataset/composer/base.py (2 hunks)
- aiperf/dataset/composer/synthetic.py (1 hunks)
- tests/test_sequence_distribution.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- aiperf/common/config/prompt_config.py
🧰 Additional context used
🧬 Code graph analysis (3)
aiperf/dataset/composer/base.py (4)
- aiperf/common/config/prompt_config.py (1): get_sequence_distribution (206-212)
- aiperf/common/sequence_distribution.py (1): sample (93-119)
- aiperf/common/models/dataset_models.py (1): Turn (43-70)
- aiperf/dataset/utils.py (1): sample_positive_normal_integer (113-124)
tests/test_sequence_distribution.py (2)
- aiperf/common/sequence_distribution.py (10): DistributionParser (219-348), SequenceLengthDistribution (47-200), SequenceLengthPair (23-44), create_balanced_distribution (365-383), create_uniform_distribution (351-362), sample (93-119), sample_batch (121-151), pairs (154-156), get_statistics (158-192), parse (227-265)
- aiperf/common/config/prompt_config.py (1): get_sequence_distribution (206-212)
aiperf/dataset/composer/synthetic.py (2)
- aiperf/dataset/composer/base.py (1): _sample_sequence_lengths (62-78)
- aiperf/dataset/generator/prompt.py (1): generate (96-118)
🪛 Ruff (0.13.2)
aiperf/dataset/composer/synthetic.py
110-110: Unpacked variable osl is never used. Prefix it with an underscore or any other dummy variable pattern. (RUF059)
🔇 Additional comments (2)
aiperf/dataset/composer/base.py (1)
80-97: Logic is correct once turn-level caching is implemented.

The implementation correctly uses the sampled OSL for max_tokens when a distribution is present, and falls back to legacy mean/stddev sampling otherwise.

However, this method currently calls _sample_sequence_lengths() a second time per turn (the first call is in synthetic._generate_text_payloads at line 110), which breaks ISL/OSL pairing. Once turn-level caching is implemented (as flagged in the review of synthetic.py lines 108-116), this logic will correctly retrieve the cached OSL associated with the ISL used for prompt generation.

Verify that the turn-level caching solution addresses the double-sampling issue across both _generate_text_payloads and _set_max_tokens.
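The pairing problem and its fix can be shown in a few lines. The following is a runnable, self-contained sketch using hypothetical pairs and helper names; it is not the project's code, only an illustration of why both consumers must read the same cached pair per turn.

```python
import numpy as np

rng = np.random.default_rng(0)
PAIRS = [(128, 256), (2048, 512)]  # hypothetical (ISL, OSL) pairs

def sample_pair():
    # Stand-in for SequenceLengthDistribution.sample(): pick one (isl, osl) pair.
    return PAIRS[rng.integers(len(PAIRS))]

# Broken: each consumer samples independently, so the prompt may be built for
# ISL=2048 while max_tokens comes from the OSL of the other pair.
isl_for_prompt, _ = sample_pair()
_, osl_for_max_tokens = sample_pair()

# Fixed: sample once per turn, cache, and let both consumers read the same pair.
turn_cache = {}

def get_turn_pair(turn_id):
    if turn_id not in turn_cache:
        turn_cache[turn_id] = sample_pair()
    return turn_cache[turn_id]

isl, _ = get_turn_pair("turn-0")   # used when generating the prompt text
_, osl = get_turn_pair("turn-0")   # used when setting max_tokens; same pair as above
```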
tests/test_sequence_distribution.py (1)
1-387: Excellent test coverage for the sequence distribution system.

The test suite comprehensively validates all aspects of the new distribution functionality:
- SequenceLengthPair: validation (invalid ISL/OSL/probability), boundaries, immutability, string representation
- SequenceLengthDistribution: single/multi-pair construction, sampling (deterministic and probabilistic), batch sampling, reproducibility, probability validation (including floating-point tolerance), statistics, string representation
- DistributionParser: all supported formats (JSON, bracket, semicolon), backward compatibility (fractions), error handling (invalid formats, missing fields), decimal probabilities, whitespace handling
- Utility functions: create_uniform_distribution and create_balanced_distribution, including empty input validation
- Integration: end-to-end workflow from parsing to sampling with empirical distribution checks, statistical accuracy validation (expected values match empirical means within 1%)
- PromptConfig integration: retrieving distributions with explicit configuration and fallback behavior

The test design is robust, using appropriate sample sizes (10,000–50,000 samples) and statistical tolerances (±5% for distribution checks, ±1% for expected value accuracy) to minimize flakiness while catching real issues.
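For reference, here is a self-contained sketch of the kind of empirical-mix check described above. It uses a hypothetical two-pair mix and deliberately loose tolerances; the real test file's fixtures, helpers, and exact thresholds are not reproduced here.

```python
import numpy as np


def test_empirical_mix_within_tolerance():
    # Hypothetical stand-in for a parsed two-pair distribution (60% / 40% mix).
    pairs = [((128, 256), 0.60), ((2048, 512), 0.40)]
    rng = np.random.default_rng(123)

    n = 50_000
    probs = [p for _, p in pairs]
    choices = rng.choice(len(pairs), size=n, p=probs)
    samples = [pairs[i][0] for i in choices]

    # Empirical share of the first pair should sit near the configured 60%.
    share_first = sum(1 for s in samples if s == (128, 256)) / n
    assert abs(share_first - 0.60) < 0.05

    # The empirical mean ISL should track the expected value (loose 2% bound here).
    expected_isl = sum(isl * p for (isl, _), p in pairs)
    empirical_isl = np.mean([isl for isl, _ in samples])
    assert abs(empirical_isl - expected_isl) / expected_isl < 0.02
```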
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- aiperf/dataset/composer/base.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
aiperf/dataset/composer/base.py (4)
- aiperf/common/config/prompt_config.py (1): get_sequence_distribution (206-212)
- aiperf/common/sequence_distribution.py (1): sample (93-119)
- aiperf/common/models/dataset_models.py (1): Turn (43-70)
- aiperf/dataset/utils.py (1): sample_positive_normal_integer (113-124)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build (ubuntu-latest, 3.10)
🔇 Additional comments (2)
aiperf/dataset/composer/base.py (2)
7-7: LGTM!

The numpy import is correctly added to support the new RNG initialization and distribution sampling.
31-36: Handle seed=0 correctly when initializing RNG.

The condition if seed is not None on line 36 will fail for a valid seed value of 0, preventing RNG initialization. Zero is a legitimate seed value and should create a seeded generator.

Apply this diff to handle zero seeds correctly:

```diff
 # Initialize RNG for sequence distribution sampling (avoid reseeding on each sample)
 seed = getattr(self.config.input, "random_seed", None)
-self._seq_rng = np.random.default_rng(seed) if seed is not None else None
+self._seq_rng = (
+    np.random.default_rng(seed) if seed is not None else np.random.default_rng()
+)
```

Note: This change also eliminates the None case, ensuring a generator is always available. If you want to preserve the ability to have self._seq_rng = None (to signal "no seed configured"), use this alternative:

```diff
 seed = getattr(self.config.input, "random_seed", None)
-self._seq_rng = np.random.default_rng(seed) if seed is not None else None
+# Explicitly check for None to handle seed=0 correctly
+if seed is None:
+    self._seq_rng = None
+else:
+    self._seq_rng = np.random.default_rng(seed)
```

Likely an incorrect or invalid review comment.
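As a quick standalone sanity check on the claim above (illustrative snippet, not project code): zero is indeed a valid numpy seed, and the `is not None` guard does not reject it.

```python
import numpy as np

seed = 0
# `0 is not None` evaluates to True, so the guarded branch still runs for seed=0...
rng = np.random.default_rng(seed) if seed is not None else None
print(type(rng))  # <class 'numpy.random._generator.Generator'>

# ...and a zero seed is fully reproducible.
print(np.random.default_rng(0).integers(10, size=3))  # same output every run
```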
fix: import
Force-pushed from 98a11a7 to 0fe89fc (Compare)
Force-pushed from 5a536ae to d0187a6 (Compare)
remove unused class; remove excess docs
Force-pushed from d0187a6 to 617092f (Compare)
Actionable comments posted: 4
🧹 Nitpick comments (1)
docs/tutorials/sequence-distributions.md (1)
44-78: Add explicit languages to fenced code blocks.

Markdownlint is flagging these fences (MD040). Annotating them with text, bash, or json fixes the warning and keeps formatting consistent with the rest of the docs.

As per static analysis hints.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- README.md (1 hunks)
- aiperf/common/config/__init__.py (1 hunks)
- aiperf/common/config/prompt_config.py (3 hunks)
- aiperf/common/models/__init__.py (4 hunks)
- aiperf/common/models/sequence_distribution.py (1 hunks)
- aiperf/dataset/composer/base.py (4 hunks)
- aiperf/dataset/composer/synthetic.py (2 hunks)
- docs/tutorials/sequence-distributions.md (1 hunks)
- mkdocs.yml (1 hunks)
- tests/composers/test_base_composer.py (1 hunks)
- tests/composers/test_synthetic_composer.py (4 hunks)
- tests/config/test_prompt_config.py (2 hunks)
- tests/test_sequence_distribution.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/composers/test_synthetic_composer.py
- tests/composers/test_base_composer.py
- tests/config/test_prompt_config.py
🧰 Additional context used
🧬 Code graph analysis (5)
aiperf/dataset/composer/base.py (3)
- aiperf/common/config/prompt_config.py (1): get_sequence_distribution (222-228)
- aiperf/common/models/sequence_distribution.py (1): sample (141-183)
- aiperf/common/models/dataset_models.py (1): Turn (43-70)
aiperf/common/models/__init__.py (2)
- aiperf/common/models/sequence_distribution.py (5): DistributionParser (267-407), SequenceLengthDistribution (95-264), SequenceLengthPair (58-92), create_balanced_distribution (424-442), create_uniform_distribution (410-421)
- tests/logging/test_logging_mixins.py (1): logger (29-31)
aiperf/dataset/composer/synthetic.py (3)
- aiperf/common/models/dataset_models.py (2): Turn (43-70), Text (24-27)
- aiperf/dataset/composer/base.py (2): _get_turn_sequence_lengths (71-99), prefix_prompt_enabled (148-149)
- aiperf/dataset/generator/prompt.py (2): generate (96-118), get_random_prefix_prompt (219-234)
tests/test_sequence_distribution.py (4)
- aiperf/common/models/sequence_distribution.py (11): DistributionParser (267-407), SequenceLengthDistribution (95-264), SequenceLengthPair (58-92), _sample_positive_normal_integer (45-54), create_balanced_distribution (424-442), create_uniform_distribution (410-421), sample (141-183), sample_batch (185-215), pairs (218-220), get_statistics (222-256), parse (279-318)
- aiperf/common/config/prompt_config.py (2): PromptConfig (167-228), get_sequence_distribution (222-228)
- aiperf/common/models/dataset_models.py (1): Turn (43-70)
- aiperf/dataset/composer/base.py (3): BaseDatasetComposer (22-149), _get_turn_sequence_lengths (71-99), _clear_turn_cache (101-107)
aiperf/common/config/prompt_config.py (3)
- aiperf/common/models/sequence_distribution.py (2): DistributionParser (267-407), parse (279-318)
- aiperf/common/config/cli_parameter.py (1): CLIParameter (10-19)
- aiperf/common/config/groups.py (1): Groups (6-28)
🪛 markdownlint-cli2 (0.18.1)
docs/tutorials/sequence-distributions.md
44-44: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
49-49: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
56-56: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
61-61: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.13.3)
aiperf/dataset/composer/synthetic.py
109-109: Unpacked variable osl is never used. Prefix it with an underscore or any other dummy variable pattern. (RUF059)
tests/test_sequence_distribution.py
205-205: Pattern passed to match= contains metacharacters but is neither escaped nor raw (RUF043)
312-312: Pattern passed to match= contains metacharacters but is neither escaped nor raw (RUF043)
aiperf/common/config/prompt_config.py
186-186: Avoid specifying long messages outside the exception class
(TRY003)
aiperf/common/models/sequence_distribution.py
48-48: Value being cast to int is already an integer. Remove unnecessary int call. (RUF046)
54-54: Value being cast to int is already an integer. Remove unnecessary int call. (RUF046)
70-72: Avoid specifying long messages outside the exception class
(TRY003)
74-76: Avoid specifying long messages outside the exception class
(TRY003)
78-78: Avoid specifying long messages outside the exception class
(TRY003)
80-82: Avoid specifying long messages outside the exception class
(TRY003)
84-86: Avoid specifying long messages outside the exception class
(TRY003)
114-116: Avoid specifying long messages outside the exception class
(TRY003)
130-133: Avoid specifying long messages outside the exception class
(TRY003)
199-199: Avoid specifying long messages outside the exception class
(TRY003)
299-299: Avoid specifying long messages outside the exception class
(TRY003)
316-318: Avoid specifying long messages outside the exception class
(TRY003)
326-326: Avoid specifying long messages outside the exception class
(TRY003)
330-330: Avoid specifying long messages outside the exception class
(TRY003)
338-338: Avoid specifying long messages outside the exception class
(TRY003)
353-353: Avoid specifying long messages outside the exception class
(TRY003)
373-373: Avoid specifying long messages outside the exception class
(TRY003)
389-391: Avoid specifying long messages outside the exception class
(TRY003)
405-405: Avoid specifying long messages outside the exception class
(TRY003)
437-437: Avoid specifying long messages outside the exception class
(TRY003)
Actionable comments posted: 0
🧹 Nitpick comments (1)
docs/tutorials/sequence-distributions.md (1)
46-64: Consider adding language identifiers to format example code blocks.

The code blocks showing format syntax (lines 46-53, 58-64) lack language identifiers, which triggers markdown linter warnings. While these are format examples rather than executable code, adding text as the language identifier would satisfy the linter and improve consistency.

Example for line 46:
````diff
-```
+```text
 "ISL1,OSL1:PROB1;ISL2,OSL2:PROB2;..."
````

Apply similar changes to the other format example blocks at lines 51, 58, and 63.

📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 617092f828257cca5e29b4a35f28eae8a5c3aaf9 and a0f4c24fa4bb7ae85f2ac6c5383b9adffad117fb.
📒 Files selected for processing (4)
- aiperf/common/models/sequence_distribution.py (1 hunks)
- aiperf/dataset/composer/synthetic.py (2 hunks)
- docs/tutorials/sequence-distributions.md (1 hunks)
- tests/test_sequence_distribution.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- aiperf/dataset/composer/synthetic.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_sequence_distribution.py (3)
- aiperf/common/models/sequence_distribution.py (11): DistributionParser (283-423), SequenceLengthDistribution (95-280), SequenceLengthPair (58-92), _sample_positive_normal_integer (45-54), create_balanced_distribution (440-458), create_uniform_distribution (426-437), sample (141-183), sample_batch (185-231), pairs (234-236), get_statistics (238-272), parse (295-334)
- aiperf/common/config/prompt_config.py (2): PromptConfig (167-228), get_sequence_distribution (222-228)
- aiperf/dataset/composer/base.py (3): BaseDatasetComposer (22-145), _get_turn_sequence_lengths (71-96), _clear_turn_cache (98-104)
🪛 markdownlint-cli2 (0.18.1)
docs/tutorials/sequence-distributions.md
46-46: Fenced code blocks should have a language specified (MD040, fenced-code-language)
51-51: Fenced code blocks should have a language specified (MD040, fenced-code-language)
58-58: Fenced code blocks should have a language specified (MD040, fenced-code-language)
63-63: Fenced code blocks should have a language specified (MD040, fenced-code-language)
🪛 Ruff (0.13.3)
tests/test_sequence_distribution.py
206-206: Pattern passed to match= contains metacharacters but is neither escaped nor raw (RUF043)
313-313: Pattern passed to match= contains metacharacters but is neither escaped nor raw (RUF043)
aiperf/common/models/sequence_distribution.py
48-48: Value being cast to int is already an integer. Remove unnecessary int call. (RUF046)
54-54: Value being cast to int is already an integer. Remove unnecessary int call. (RUF046)
Avoid specifying long messages outside the exception class (TRY003) at lines 70-72, 74-76, 78-78, 80-82, 84-86, 114-116, 130-133, 199-199, 315-315, 332-334, 342-342, 346-346, 354-354, 369-369, 389-389, 405-407, 421-421, 453-453
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build (ubuntu-latest, 3.10)
🔇 Additional comments (6)
tests/test_sequence_distribution.py (2)
154-167: LGTM - RNG reseeding issue resolved!
The test now correctly creates a single RNG instance once (line 159) and reuses it across all samples (line 160), allowing the RNG state to advance properly. This ensures the stddev test will observe variance as expected.
560-601: Excellent caching test coverage!
The turn-level sequence caching tests thoroughly verify that ISL/OSL pairs are cached per turn for consistency and that cache clearing works correctly. This ensures deterministic behavior within a turn while allowing variation across turns.
aiperf/common/models/sequence_distribution.py (4)
212-231: LGTM - Batch sampling now correctly applies stddev!
The sample_batch method now properly samples with variance by calling _sample_positive_normal_integer for each pair that defines stddev, mirroring the single-sample logic. This ensures batch results respect configured standard deviations.
45-54: Solid implementation of positive normal sampling.
The clamping to a minimum of 1 prevents invalid sequence lengths while the stddev check optimizes the deterministic case. The implementation correctly handles the edge case where normal sampling might produce negative values.
124-133: Good probability validation with floating-point tolerance.
The validation correctly allows small floating-point errors (rtol=1e-6, atol=1e-6) while still enforcing the constraint that probabilities sum to 100%. The error message helpfully includes the actual sum and the pairs for debugging.
283-423: Robust multi-format parser with comprehensive error handling.
The parser supports three distinct formats (JSON, bracket, semicolon) with optional stddev syntax and provides clear error messages for invalid inputs. The regex patterns correctly handle whitespace and optional stddev components.
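To make the semicolon format concrete, here is a small standalone sketch of parsing "ISL,OSL:PROB" entries and validating that the probabilities sum to 100%. It imitates the behavior described in the review comments above; it is not the actual DistributionParser implementation, and the exact probability convention (percentages vs. fractions) is an assumption for illustration.

```python
import numpy as np


def parse_semicolon_format(spec: str):
    """Parse e.g. '128,256:60;2048,512:40' into [(isl, osl, prob), ...] (illustrative only)."""
    pairs = []
    for entry in spec.split(";"):
        lengths, prob = entry.strip().split(":")
        isl, osl = (int(x) for x in lengths.split(","))
        if isl <= 0 or osl <= 0:
            raise ValueError(f"ISL and OSL must be positive, got {isl},{osl}")
        pairs.append((isl, osl, float(prob)))

    total = sum(p for _, _, p in pairs)
    # Allow tiny floating-point error, as the review notes (rtol/atol ~1e-6).
    if not np.isclose(total, 100.0, rtol=1e-6, atol=1e-6):
        raise ValueError(f"Probabilities must sum to 100, got {total}: {pairs}")
    return pairs


print(parse_semicolon_format("128,256:60;2048,512:40"))
# [(128, 256, 60.0), (2048, 512, 40.0)]
```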
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- docs/tutorials/sequence-distributions.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/tutorials/sequence-distributions.md
46-46: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
51-51: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
58-58: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
63-63: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build (ubuntu-latest, 3.10)
LGTM!
Thanks for working on this.
Add support for allowing a combination of ISL/OSL pairings. This can be done now by passing --seq-dist "<ISL_1>,<OSL_1>,<% of requests_1>;<ISL_2>,<OSL_2>,<% of requests_2>;..." for as many pairings as desired, as long as the percentages sum to 100%.
Tested with:
Summary by CodeRabbit
- New Features
- Improvements
- Documentation
- Tests