Conversation


@the-david-oy the-david-oy commented Oct 2, 2025

Add support for allowing a combination of ISL/OSL pairings. This can now be done by passing --seq-dist "<ISL_1>,<OSL_1>:<% of requests_1>;<ISL_2>,<OSL_2>:<% of requests_2>;..." for as many pairings as desired, as long as the percentages sum to 100%.

Tested with:

aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --seq-dist "128,64:60;256,128:40" \
    --output-tokens-mean 100 \
    --concurrency 15 \
    --request-count 100
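The semicolon format used above can be illustrated with a small standalone sketch. The helper names below (`parse_seq_dist`, `sample_pair`) are hypothetical, for illustration only; the PR's actual implementation is `DistributionParser` in `aiperf/common/models/sequence_distribution.py`.

```python
import random

def parse_seq_dist(spec: str) -> list[tuple[int, int, float]]:
    """Parse 'ISL,OSL:PROB;ISL,OSL:PROB;...' into (isl, osl, probability) triples."""
    pairs = []
    for entry in spec.split(";"):
        lengths, prob = entry.split(":")
        isl, osl = (int(x) for x in lengths.split(","))
        pairs.append((isl, osl, float(prob)))
    total = sum(p for _, _, p in pairs)
    if abs(total - 100.0) > 1e-6:
        raise ValueError(f"probabilities must sum to 100, got {total}")
    return pairs

def sample_pair(pairs, rng=random):
    """Pick one (ISL, OSL) pair, weighted by its percentage of requests."""
    (isl, osl, _), = rng.choices(pairs, weights=[p for _, _, p in pairs], k=1)
    return isl, osl
```

For the tested value, `parse_seq_dist("128,64:60;256,128:40")` yields two pairs, with 60% of requests drawn as (128, 64) and 40% as (256, 128).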

Summary by CodeRabbit

  • New Features

    • Configurable sequence length distributions (semicolon/bracket/JSON) with CLI flags and public utilities for parsing and sampling.
  • Improvements

    • Per-turn sequence length sampling with optional seeding, per-turn caching, deterministic reproducibility, and refined first-turn prefix handling; falls back to legacy behavior if unset.
  • Documentation

    • New user guide and navigation entry describing sequence distributions, formats, examples, and usage.
  • Tests

    • Extensive unit and integration tests covering parsing, sampling, statistics, caching, config integration, and end-to-end workflows.

@the-david-oy the-david-oy self-assigned this Oct 2, 2025
coderabbitai bot commented Oct 2, 2025

Walkthrough

Adds sequence-length distribution support: new distribution models and parser, PromptConfig CLI/config field and validator, BaseDatasetComposer per-turn sampling with seeded RNG and caching, Synthetic composer using per-turn lengths, tests, docs, and public exports.

Changes

• Config: Prompt sequence distribution
  Files: aiperf/common/config/prompt_config.py, aiperf/common/config/__init__.py, tests/config/test_prompt_config.py
  Adds PromptConfig.sequence_distribution CLI/Field and get_sequence_distribution() accessor; model-level validator to parse/validate distribution strings; adjusts export order; tests for defaults, valid/invalid parsing, stddevs, and probability-sum validation.
• Sequence distribution models & exports
  Files: aiperf/common/models/sequence_distribution.py, aiperf/common/models/__init__.py, tests/test_sequence_distribution.py
  New module implementing SequenceLengthPair, SequenceLengthDistribution, DistributionParser, sampling and batch APIs, stats, string reprs, and factory helpers; re-exports added to package __init__; comprehensive unit/integration tests.
• Dataset composer core integration
  Files: aiperf/dataset/composer/base.py, tests/composers/test_base_composer.py
  BaseDatasetComposer gains _seq_distribution, a seeded _seq_rng, and _turn_sequence_cache; adds _get_turn_sequence_lengths(turn_id) and _clear_turn_cache(turn_id); uses cached/sampled ISL/OSL for max-tokens and clears the cache on finalize; tests cover seeding, caching, strategies, and legacy fallback.
• Synthetic composer updates
  Files: aiperf/dataset/composer/synthetic.py, tests/composers/test_synthetic_composer.py
  _generate_text_payloads signature changed to (turn, is_first) and call sites updated; uses per-turn sampled ISL/OSL, inlines prefix handling, and appends generated content in place; tests adjusted.
• Docs and site navigation
  Files: docs/tutorials/sequence-distributions.md, README.md, mkdocs.yml
  Adds a tutorial doc and a README entry for Sequence Distributions; adds a nav entry in mkdocs.
• Misc: exports and packaging
  Files: aiperf/common/config/__init__.py, aiperf/common/models/__init__.py
  Reorders an exported name in the config __all__; expands aiperf.common.models.__all__ to expose the new distribution-related symbols.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant CLI as CLI/Config
  participant PC as PromptConfig
  participant DP as DistributionParser
  participant BDC as BaseDatasetComposer
  participant SD as SequenceLengthDistribution
  participant SYN as SyntheticComposer
  participant T as Turn

  CLI->>PC: provide `sequence_distribution` string
  PC->>DP: model validator parses string
  alt parse OK
    DP-->>PC: SequenceLengthDistribution
  else parse fails
    DP-->>PC: raise ValueError
  end

  PC-->>BDC: get_sequence_distribution() -> SD or None
  BDC->>BDC: init _seq_rng (seeded?) and _turn_sequence_cache

  loop per turn
    SYN->>BDC: request lengths for turn_id
    BDC->>BDC: _get_turn_sequence_lengths(turn_id)
    alt SD present
      BDC->>SD: sample(random_state=_seq_rng)
      SD-->>BDC: (isl, osl)
      BDC->>BDC: cache (turn_id -> (isl,osl))
    else no SD
      BDC-->>BDC: compute/sample via legacy mean/stddev
    end
    BDC-->>SYN: return (isl,osl)
    SYN->>T: _generate_text_payloads(turn, is_first) uses isl/osl
    SYN->>BDC: finalize turn
    BDC->>BDC: _clear_turn_cache(turn_id)
  end
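The per-turn sampling and caching flow in the diagram can be sketched as a simplified, hypothetical class (standalone illustration; the real logic lives in BaseDatasetComposer's _get_turn_sequence_lengths and _clear_turn_cache):

```python
import random

class TurnSampler:
    """Hypothetical sketch: sample one (ISL, OSL) pair per turn and cache it
    so repeated lookups for the same turn stay consistent."""

    def __init__(self, pairs, weights, seed=None):
        self._rng = random.Random(seed)  # one RNG, seeded once
        self._pairs = pairs
        self._weights = weights
        self._cache = {}

    def get_turn_sequence_lengths(self, turn_id):
        # First request for a turn samples; later requests hit the cache.
        if turn_id not in self._cache:
            self._cache[turn_id] = self._rng.choices(
                self._pairs, weights=self._weights
            )[0]
        return self._cache[turn_id]

    def clear_turn_cache(self, turn_id):
        # Called on finalize, mirroring the diagram's last step.
        self._cache.pop(turn_id, None)
```

Caching is what keeps the ISL used for prompt generation paired with the OSL used for max_tokens within a single turn.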

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I hop through lengths, both short and long,
Dice-roll tokens, steady or strong.
Per-turn pairs tucked in a neat little stash,
Sampled, cached, then cleared in a flash.
Benchmarks bounce — the rabbit’s dash. 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title succinctly and accurately describes the primary change of introducing mixed input and output sequence length distributions, matching the core feature added by the PR.
  • Docstring Coverage: ✅ Passed. No functions found in the changes; docstring coverage check skipped.


@github-actions github-actions bot added the feat label Oct 2, 2025
@the-david-oy the-david-oy marked this pull request as ready for review October 2, 2025 19:19

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d056b9e and df6d7e3.

📒 Files selected for processing (5)
  • aiperf/common/config/prompt_config.py (2 hunks)
  • aiperf/common/sequence_distribution.py (1 hunks)
  • aiperf/dataset/composer/base.py (2 hunks)
  • aiperf/dataset/composer/synthetic.py (2 hunks)
  • tests/test_sequence_distribution.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
tests/test_sequence_distribution.py (2)
aiperf/common/sequence_distribution.py (10)
  • DistributionParser (219-348)
  • SequenceLengthDistribution (47-200)
  • SequenceLengthPair (23-44)
  • create_balanced_distribution (365-383)
  • create_uniform_distribution (351-362)
  • sample (93-119)
  • sample_batch (121-151)
  • pairs (154-156)
  • get_statistics (158-192)
  • parse (227-265)
aiperf/common/config/prompt_config.py (2)
  • PromptConfig (167-258)
  • get_sequence_distribution (234-258)
aiperf/dataset/composer/base.py (4)
aiperf/common/config/prompt_config.py (1)
  • get_sequence_distribution (234-258)
aiperf/common/sequence_distribution.py (1)
  • sample (93-119)
aiperf/common/models/dataset_models.py (1)
  • Turn (43-70)
aiperf/dataset/utils.py (1)
  • sample_positive_normal_integer (113-124)
aiperf/common/config/prompt_config.py (4)
aiperf/common/config/cli_parameter.py (1)
  • CLIParameter (10-19)
aiperf/common/config/groups.py (1)
  • Groups (6-28)
aiperf/common/config/config_defaults.py (1)
  • InputTokensDefaults (83-86)
aiperf/common/sequence_distribution.py (3)
  • DistributionParser (219-348)
  • create_uniform_distribution (351-362)
  • parse (227-265)
aiperf/dataset/composer/synthetic.py (2)
aiperf/dataset/composer/base.py (1)
  • _sample_sequence_lengths (65-80)
aiperf/dataset/generator/prompt.py (1)
  • generate (96-118)
🪛 Ruff (0.13.2)
aiperf/common/sequence_distribution.py

33-35, 37-39, 41, 66-68, 82-85, 135, 246, 263-265, 273, 277, 285, 298, 316, 332-334, 346, 378: Avoid specifying long messages outside the exception class (TRY003)

aiperf/dataset/composer/synthetic.py

116-116: Unpacked variable osl is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


codecov bot commented Oct 2, 2025

Codecov Report

❌ Patch coverage is 90.99099% with 20 lines in your changes missing coverage. Please review.

Files with missing lines:
  • aiperf/common/models/sequence_distribution.py: 88.16% patch coverage (14 missing, 6 partials) ⚠️



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
aiperf/dataset/composer/base.py (2)

29-30: Gate distribution initialization when prompts are disabled.

If prompts are disabled (e.g., --prompt-input-tokens-mean 0) and no explicit --seq-dist is provided, get_sequence_distribution() may fall back to a default with ISL=0 (via the fallback at lines 68-74), which violates SequenceLengthPair validation (ISL must be positive) and prevents dataset creation even when only image/audio payloads are enabled.

Solution: Only initialize _seq_distribution when prompts are enabled or an explicit distribution is provided:

-        # Initialize sequence distribution
-        self._seq_distribution = config.input.prompt.get_sequence_distribution()
+        # Initialize sequence distribution only if prompts are enabled or explicitly specified
+        if config.input.prompt.input_tokens.mean > 0 or config.input.prompt.sequence_distribution:
+            self._seq_distribution = config.input.prompt.get_sequence_distribution()
+        else:
+            self._seq_distribution = None

This ensures that when prompts are disabled, the distribution is not initialized, and the fallback in _sample_sequence_lengths is never invoked with ISL=0.


76-78: Seed RNG once during initialization, not on every sample.

Passing config.input.random_seed (an int) into sample(random_state=random_seed) on every call recreates a fresh np.random.Generator with the same seed each time. This causes every turn to receive identical (ISL, OSL) pairs when a seed is configured, completely defeating the distribution.

Solution: Create a single np.random.Generator during __init__ and reuse it for all sampling:

+import numpy as np
 ...
 class BaseDatasetComposer(AIPerfLoggerMixin, ABC):
     def __init__(self, config: UserConfig, tokenizer: Tokenizer, **kwargs):
         ...
         self._seq_distribution = config.input.prompt.get_sequence_distribution()
+        seed = getattr(self.config.input, "random_seed", None)
+        self._seq_rng = np.random.default_rng(seed) if seed is not None else None

     def _sample_sequence_lengths(self) -> tuple[int, int]:
         ...
-        # Use random seed from config if available for reproducible results
-        random_seed = getattr(self.config.input, "random_seed", None)
-        return self._seq_distribution.sample(random_state=random_seed)
+        # Use seeded generator for reproducible sampling
+        return self._seq_distribution.sample(random_state=self._seq_rng)

This ensures sampling remains stochastic yet reproducible when a seed is provided.
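The failure mode described above can be demonstrated in isolation with NumPy (a standalone sketch, independent of the PR's code):

```python
import numpy as np

# Reseeding per call: every "sample" restarts the same stream,
# so all draws come out identical -- the bug described above.
reseeded = [int(np.random.default_rng(42).integers(0, 1000)) for _ in range(5)]
assert len(set(reseeded)) == 1

# One generator created once: its state advances between draws,
# so the stream is stochastic yet reproducible for a fixed seed.
rng = np.random.default_rng(42)
persistent = [int(rng.integers(0, 1000)) for _ in range(5)]
assert len(set(persistent)) > 1
```

The first assertion is exactly why a seeded run would produce the same (ISL, OSL) pair for every turn if the seed is passed into each sample call.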

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between df6d7e3 and e057365.

📒 Files selected for processing (4)
  • aiperf/common/config/prompt_config.py (1 hunks)
  • aiperf/dataset/composer/base.py (2 hunks)
  • aiperf/dataset/composer/synthetic.py (1 hunks)
  • tests/test_sequence_distribution.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • aiperf/common/config/prompt_config.py
🧰 Additional context used
🧬 Code graph analysis (3)
aiperf/dataset/composer/base.py (4)
aiperf/common/config/prompt_config.py (1)
  • get_sequence_distribution (206-212)
aiperf/common/sequence_distribution.py (1)
  • sample (93-119)
aiperf/common/models/dataset_models.py (1)
  • Turn (43-70)
aiperf/dataset/utils.py (1)
  • sample_positive_normal_integer (113-124)
tests/test_sequence_distribution.py (2)
aiperf/common/sequence_distribution.py (10)
  • DistributionParser (219-348)
  • SequenceLengthDistribution (47-200)
  • SequenceLengthPair (23-44)
  • create_balanced_distribution (365-383)
  • create_uniform_distribution (351-362)
  • sample (93-119)
  • sample_batch (121-151)
  • pairs (154-156)
  • get_statistics (158-192)
  • parse (227-265)
aiperf/common/config/prompt_config.py (1)
  • get_sequence_distribution (206-212)
aiperf/dataset/composer/synthetic.py (2)
aiperf/dataset/composer/base.py (1)
  • _sample_sequence_lengths (62-78)
aiperf/dataset/generator/prompt.py (1)
  • generate (96-118)
🪛 Ruff (0.13.2)
aiperf/dataset/composer/synthetic.py

110-110: Unpacked variable osl is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🔇 Additional comments (2)
aiperf/dataset/composer/base.py (1)

80-97: Logic is correct once turn-level caching is implemented.

The implementation correctly uses the sampled OSL for max_tokens when a distribution is present, and falls back to legacy mean/stddev sampling otherwise.

However, this method currently calls _sample_sequence_lengths() a second time per turn (the first call is in synthetic._generate_text_payloads at line 110), which breaks ISL/OSL pairing. Once turn-level caching is implemented (as flagged in the review of synthetic.py lines 108-116), this logic will correctly retrieve the cached OSL associated with the ISL used for prompt generation.

Verify that the turn-level caching solution addresses the double-sampling issue across both _generate_text_payloads and _set_max_tokens.

tests/test_sequence_distribution.py (1)

1-387: Excellent test coverage for the sequence distribution system.

The test suite comprehensively validates all aspects of the new distribution functionality:

  • SequenceLengthPair: validation (invalid ISL/OSL/probability), boundaries, immutability, string representation
  • SequenceLengthDistribution: single/multi-pair construction, sampling (deterministic and probabilistic), batch sampling, reproducibility, probability validation (including floating-point tolerance), statistics, string representation
  • DistributionParser: all supported formats (JSON, bracket, semicolon), backward compatibility (fractions), error handling (invalid formats, missing fields), decimal probabilities, whitespace handling
  • Utility functions: create_uniform_distribution and create_balanced_distribution, including empty input validation
  • Integration: end-to-end workflow from parsing to sampling with empirical distribution checks, statistical accuracy validation (expected values match empirical means within 1%)
  • PromptConfig integration: retrieving distributions with explicit configuration and fallback behavior

The test design is robust, using appropriate sample sizes (10,000–50,000 samples) and statistical tolerances (±5% for distribution checks, ±1% for expected value accuracy) to minimize flakiness while catching real issues.
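An empirical distribution check in the spirit described above can be sketched independently of the project's API (hypothetical standalone example; the pair values and tolerances mirror the ones discussed, not the actual test code):

```python
import random
from collections import Counter

# Sample a 60/40 two-pair mix many times and verify the observed
# frequencies land within a +/-5% tolerance of the configured weights.
pairs = [((128, 64), 0.6), ((256, 128), 0.4)]
rng = random.Random(0)  # fixed seed keeps the check deterministic
n = 10_000
counts = Counter(
    rng.choices([p for p, _ in pairs], weights=[w for _, w in pairs])[0]
    for _ in range(n)
)
for pair, weight in pairs:
    observed = counts[pair] / n
    assert abs(observed - weight) < 0.05, (pair, observed)
```

Large sample counts with loose tolerances keep such checks statistically meaningful without making them flaky.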


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e057365 and 58ee44b.

📒 Files selected for processing (1)
  • aiperf/dataset/composer/base.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
aiperf/dataset/composer/base.py (4)
aiperf/common/config/prompt_config.py (1)
  • get_sequence_distribution (206-212)
aiperf/common/sequence_distribution.py (1)
  • sample (93-119)
aiperf/common/models/dataset_models.py (1)
  • Turn (43-70)
aiperf/dataset/utils.py (1)
  • sample_positive_normal_integer (113-124)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build (ubuntu-latest, 3.10)
🔇 Additional comments (2)
aiperf/dataset/composer/base.py (2)

7-7: LGTM!

The numpy import is correctly added to support the new RNG initialization and distribution sampling.


31-36: Handle seed=0 correctly when initializing RNG.

The condition if seed is not None on line 36 will fail for a valid seed value of 0, preventing RNG initialization. Zero is a legitimate seed value and should create a seeded generator.

Apply this diff to handle zero seeds correctly:

         # Initialize RNG for sequence distribution sampling (avoid reseeding on each sample)
         seed = getattr(self.config.input, "random_seed", None)
-        self._seq_rng = np.random.default_rng(seed) if seed is not None else None
+        self._seq_rng = (
+            np.random.default_rng(seed) if seed is not None else np.random.default_rng()
+        )

Note: This change also eliminates the None case, ensuring a generator is always available. If you want to preserve the ability to have self._seq_rng = None (to signal "no seed configured"), use this alternative:

         seed = getattr(self.config.input, "random_seed", None)
-        self._seq_rng = np.random.default_rng(seed) if seed is not None else None
+        # Explicitly check for None to handle seed=0 correctly
+        if seed is None:
+            self._seq_rng = None
+        else:
+            self._seq_rng = np.random.default_rng(seed)

Likely an incorrect or invalid review comment.
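The closing note appears warranted: in Python, `0 is not None` evaluates to True, so the original identity check already handles a zero seed; only a truthiness test (`if seed:`) would mishandle it. A quick standalone check:

```python
import numpy as np

# `seed is not None` is True for seed=0, so a zero seed does
# initialize the generator; the pitfall only bites with `if seed:`,
# which treats 0 as falsy.
seed = 0
assert seed is not None   # identity check: 0 passes
assert not bool(seed)     # truthiness check: 0 would be skipped

rng = np.random.default_rng(seed) if seed is not None else None
assert rng is not None    # a seeded generator was created
```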

@the-david-oy the-david-oy marked this pull request as draft October 2, 2025 21:11
@the-david-oy the-david-oy marked this pull request as ready for review October 6, 2025 19:46
remove unused class

remove excess docs

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
docs/tutorials/sequence-distributions.md (1)

44-78: Add explicit languages to fenced code blocks.

Markdownlint is flagging these fences (MD040). Annotating them with text, bash, or json fixes the warning and keeps formatting consistent with the rest of the docs.

As per static analysis hints.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 58ee44b and 617092f.

📒 Files selected for processing (13)
  • README.md (1 hunks)
  • aiperf/common/config/__init__.py (1 hunks)
  • aiperf/common/config/prompt_config.py (3 hunks)
  • aiperf/common/models/__init__.py (4 hunks)
  • aiperf/common/models/sequence_distribution.py (1 hunks)
  • aiperf/dataset/composer/base.py (4 hunks)
  • aiperf/dataset/composer/synthetic.py (2 hunks)
  • docs/tutorials/sequence-distributions.md (1 hunks)
  • mkdocs.yml (1 hunks)
  • tests/composers/test_base_composer.py (1 hunks)
  • tests/composers/test_synthetic_composer.py (4 hunks)
  • tests/config/test_prompt_config.py (2 hunks)
  • tests/test_sequence_distribution.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/composers/test_synthetic_composer.py
  • tests/composers/test_base_composer.py
  • tests/config/test_prompt_config.py
🧰 Additional context used
🧬 Code graph analysis (5)
aiperf/dataset/composer/base.py (3)
aiperf/common/config/prompt_config.py (1)
  • get_sequence_distribution (222-228)
aiperf/common/models/sequence_distribution.py (1)
  • sample (141-183)
aiperf/common/models/dataset_models.py (1)
  • Turn (43-70)
aiperf/common/models/__init__.py (2)
aiperf/common/models/sequence_distribution.py (5)
  • DistributionParser (267-407)
  • SequenceLengthDistribution (95-264)
  • SequenceLengthPair (58-92)
  • create_balanced_distribution (424-442)
  • create_uniform_distribution (410-421)
tests/logging/test_logging_mixins.py (1)
  • logger (29-31)
aiperf/dataset/composer/synthetic.py (3)
aiperf/common/models/dataset_models.py (2)
  • Turn (43-70)
  • Text (24-27)
aiperf/dataset/composer/base.py (2)
  • _get_turn_sequence_lengths (71-99)
  • prefix_prompt_enabled (148-149)
aiperf/dataset/generator/prompt.py (2)
  • generate (96-118)
  • get_random_prefix_prompt (219-234)
tests/test_sequence_distribution.py (4)
aiperf/common/models/sequence_distribution.py (11)
  • DistributionParser (267-407)
  • SequenceLengthDistribution (95-264)
  • SequenceLengthPair (58-92)
  • _sample_positive_normal_integer (45-54)
  • create_balanced_distribution (424-442)
  • create_uniform_distribution (410-421)
  • sample (141-183)
  • sample_batch (185-215)
  • pairs (218-220)
  • get_statistics (222-256)
  • parse (279-318)
aiperf/common/config/prompt_config.py (2)
  • PromptConfig (167-228)
  • get_sequence_distribution (222-228)
aiperf/common/models/dataset_models.py (1)
  • Turn (43-70)
aiperf/dataset/composer/base.py (3)
  • BaseDatasetComposer (22-149)
  • _get_turn_sequence_lengths (71-99)
  • _clear_turn_cache (101-107)
aiperf/common/config/prompt_config.py (3)
aiperf/common/models/sequence_distribution.py (2)
  • DistributionParser (267-407)
  • parse (279-318)
aiperf/common/config/cli_parameter.py (1)
  • CLIParameter (10-19)
aiperf/common/config/groups.py (1)
  • Groups (6-28)
🪛 markdownlint-cli2 (0.18.1)
docs/tutorials/sequence-distributions.md

44-44: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


49-49: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


56-56: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


61-61: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.13.3)
aiperf/dataset/composer/synthetic.py

109-109: Unpacked variable osl is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

tests/test_sequence_distribution.py

205-205: Pattern passed to match= contains metacharacters but is neither escaped nor raw

(RUF043)


312-312: Pattern passed to match= contains metacharacters but is neither escaped nor raw

(RUF043)

aiperf/common/config/prompt_config.py

186-186: Avoid specifying long messages outside the exception class

(TRY003)

aiperf/common/models/sequence_distribution.py

48, 54: Value being cast to int is already an integer; remove the unnecessary int call (RUF046)

70-72, 74-76, 78, 80-82, 84-86, 114-116, 130-133, 199, 299, 316-318, 326, 330, 338, 353, 373, 389-391, 405, 437: Avoid specifying long messages outside the exception class (TRY003)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
docs/tutorials/sequence-distributions.md (1)

46-64: Consider adding language identifiers to format example code blocks.

The code blocks showing format syntax (lines 46-53, 58-64) lack language identifiers, which triggers markdown linter warnings. While these are format examples rather than executable code, adding text as the language identifier would satisfy the linter and improve consistency.

Example for line 46:

-```
+```text
 "ISL1,OSL1:PROB1;ISL2,OSL2:PROB2;..."

Apply similar changes to the other format example blocks at lines 51, 58, and 63.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 617092f828257cca5e29b4a35f28eae8a5c3aaf9 and a0f4c24fa4bb7ae85f2ac6c5383b9adffad117fb.

📒 Files selected for processing (4)
  • aiperf/common/models/sequence_distribution.py (1 hunks)
  • aiperf/dataset/composer/synthetic.py (2 hunks)
  • docs/tutorials/sequence-distributions.md (1 hunks)
  • tests/test_sequence_distribution.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • aiperf/dataset/composer/synthetic.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_sequence_distribution.py (3)
aiperf/common/models/sequence_distribution.py (11)
  • DistributionParser (283-423)
  • SequenceLengthDistribution (95-280)
  • SequenceLengthPair (58-92)
  • _sample_positive_normal_integer (45-54)
  • create_balanced_distribution (440-458)
  • create_uniform_distribution (426-437)
  • sample (141-183)
  • sample_batch (185-231)
  • pairs (234-236)
  • get_statistics (238-272)
  • parse (295-334)
aiperf/common/config/prompt_config.py (2)
  • PromptConfig (167-228)
  • get_sequence_distribution (222-228)
aiperf/dataset/composer/base.py (3)
  • BaseDatasetComposer (22-145)
  • _get_turn_sequence_lengths (71-96)
  • _clear_turn_cache (98-104)
🪛 markdownlint-cli2 (0.18.1)
docs/tutorials/sequence-distributions.md

46, 51, 58, 63: Fenced code blocks should have a language specified (MD040, fenced-code-language)

🪛 Ruff (0.13.3)
tests/test_sequence_distribution.py

206, 313: Pattern passed to match= contains metacharacters but is neither escaped nor raw (RUF043)

aiperf/common/models/sequence_distribution.py

48, 54: Value being cast to int is already an integer; remove the unnecessary int call (RUF046)

70-72, 74-76, 78, 80-82, 84-86, 114-116, 130-133, 199, 315, 332-334, 342, 346, 354, 369, 389, 405-407, 421, 453: Avoid specifying long messages outside the exception class (TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build (ubuntu-latest, 3.10)
🔇 Additional comments (6)
tests/test_sequence_distribution.py (2)

154-167: LGTM! RNG reseeding issue resolved.

The test now correctly creates a single RNG instance once (line 159) and reuses it across all samples (line 160), allowing the RNG state to advance properly. This ensures the stddev test will observe variance as expected.

560-601: Excellent caching test coverage.

The turn-level sequence caching tests thoroughly verify that ISL/OSL pairs are cached per turn for consistency and that cache clearing works correctly. This ensures deterministic behavior within a turn while allowing variation across turns.

aiperf/common/models/sequence_distribution.py (4)

212-231: LGTM! Batch sampling now correctly applies stddev.

The sample_batch method now properly samples with variance by calling _sample_positive_normal_integer for each pair that defines stddev, mirroring the single-sample logic. This ensures batch results respect configured standard deviations.

45-54: Solid implementation of positive normal sampling.

The clamping to a minimum of 1 prevents invalid sequence lengths, while the stddev check optimizes the deterministic case. The implementation correctly handles the edge case where normal sampling might produce negative values.

124-133: Good probability validation with floating-point tolerance.

The validation correctly allows small floating-point errors (rtol=1e-6, atol=1e-6) while still enforcing the constraint that probabilities sum to 100%. The error message helpfully includes the actual sum and the pairs for debugging.

283-423: Robust multi-format parser with comprehensive error handling.

The parser supports three distinct formats (JSON, bracket, semicolon) with optional stddev syntax and provides clear error messages for invalid inputs. The regex patterns correctly handle whitespace and optional stddev components.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a0f4c24 and d288575.

📒 Files selected for processing (1)
  • docs/tutorials/sequence-distributions.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/tutorials/sequence-distributions.md

46-46: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


51-51: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


58-58: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


63-63: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build (ubuntu-latest, 3.10)

Contributor

@lkomali lkomali left a comment


LGTM!

Thanks for working on this.

@the-david-oy the-david-oy merged commit 0a70989 into main Oct 7, 2025
6 checks passed
@the-david-oy the-david-oy deleted the dyas-isl-dist branch October 7, 2025 17:10