Conversation


@the-david-oy the-david-oy commented Oct 2, 2025

Add support for allowing a combination of ISL/OSL pairings. This can now be done by passing --seq-dist "<ISL_1>,<OSL_1>:<% of requests_1>;<ISL_2>,<OSL_2>:<% of requests_2>;..." for as many pairings as desired, as long as the percentages sum to 100%.

Tested with:

aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --seq-dist "128,64:60;256,128:40" \
    --output-tokens-mean 100 \
    --concurrency 15 \
    --request-count 100
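The semicolon format used above can be illustrated with a small standalone sketch. The helper names below (`parse_seq_dist`, `sample_pair`) are hypothetical, for illustration only; the PR's actual implementation is `DistributionParser` in `aiperf/common/models/sequence_distribution.py`.

```python
import random

def parse_seq_dist(spec: str) -> list[tuple[int, int, float]]:
    """Parse 'ISL,OSL:PROB;ISL,OSL:PROB;...' into (isl, osl, probability) triples."""
    pairs = []
    for entry in spec.split(";"):
        lengths, prob = entry.split(":")
        isl, osl = (int(x) for x in lengths.split(","))
        pairs.append((isl, osl, float(prob)))
    total = sum(p for _, _, p in pairs)
    if abs(total - 100.0) > 1e-6:
        raise ValueError(f"probabilities must sum to 100, got {total}")
    return pairs

def sample_pair(pairs, rng=random):
    """Pick one (ISL, OSL) pair, weighted by its percentage of requests."""
    (isl, osl, _), = rng.choices(pairs, weights=[p for _, _, p in pairs], k=1)
    return isl, osl
```

For the tested value, `parse_seq_dist("128,64:60;256,128:40")` yields two pairs, with 60% of requests drawn as (128, 64) and 40% as (256, 128).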

Summary by CodeRabbit

  • New Features

    • Configurable sequence length distributions (semicolon/bracket/JSON) with CLI flags and public utilities for parsing and sampling.
  • Improvements

    • Per-turn sequence length sampling with optional seeding, per-turn caching, deterministic reproducibility, and refined first-turn prefix handling; falls back to legacy behavior if unset.
  • Documentation

    • New user guide and navigation entry describing sequence distributions, formats, examples, and usage.
  • Tests

    • Extensive unit and integration tests covering parsing, sampling, statistics, caching, config integration, and end-to-end workflows.

@the-david-oy the-david-oy self-assigned this Oct 2, 2025
coderabbitai bot commented Oct 2, 2025

Walkthrough

Adds sequence-length distribution support: new distribution models and parser, PromptConfig CLI/config field and validator, BaseDatasetComposer per-turn sampling with seeded RNG and caching, Synthetic composer using per-turn lengths, tests, docs, and public exports.

Changes

• Config: Prompt sequence distribution
  Files: aiperf/common/config/prompt_config.py, aiperf/common/config/__init__.py, tests/config/test_prompt_config.py
  Adds PromptConfig.sequence_distribution CLI/Field and get_sequence_distribution() accessor; model-level validator to parse/validate distribution strings; adjusts export order; tests for defaults, valid/invalid parsing, stddevs, and probability-sum validation.
• Sequence distribution models & exports
  Files: aiperf/common/models/sequence_distribution.py, aiperf/common/models/__init__.py, tests/test_sequence_distribution.py
  New module implementing SequenceLengthPair, SequenceLengthDistribution, DistributionParser, sampling and batch APIs, stats, string reprs, and factory helpers; re-exports added to package __init__; comprehensive unit/integration tests.
• Dataset composer core integration
  Files: aiperf/dataset/composer/base.py, tests/composers/test_base_composer.py
  BaseDatasetComposer gains _seq_distribution, a seeded _seq_rng, and _turn_sequence_cache; adds _get_turn_sequence_lengths(turn_id) and _clear_turn_cache(turn_id); uses cached/sampled ISL/OSL for max-tokens and clears the cache on finalize; tests cover seeding, caching, strategies, and legacy fallback.
• Synthetic composer updates
  Files: aiperf/dataset/composer/synthetic.py, tests/composers/test_synthetic_composer.py
  _generate_text_payloads signature changed to (turn, is_first) and call sites updated; uses per-turn sampled ISL/OSL, inlines prefix handling, and appends generated content in place; tests adjusted.
• Docs and site navigation
  Files: docs/tutorials/sequence-distributions.md, README.md, mkdocs.yml
  Adds a tutorial doc and a README entry for Sequence Distributions; adds a nav entry in mkdocs.
• Misc: exports and packaging
  Files: aiperf/common/config/__init__.py, aiperf/common/models/__init__.py
  Reorders an exported name in the config __all__; expands aiperf.common.models.__all__ to expose the new distribution-related symbols.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant CLI as CLI/Config
  participant PC as PromptConfig
  participant DP as DistributionParser
  participant BDC as BaseDatasetComposer
  participant SD as SequenceLengthDistribution
  participant SYN as SyntheticComposer
  participant T as Turn

  CLI->>PC: provide `sequence_distribution` string
  PC->>DP: model validator parses string
  alt parse OK
    DP-->>PC: SequenceLengthDistribution
  else parse fails
    DP-->>PC: raise ValueError
  end

  PC-->>BDC: get_sequence_distribution() -> SD or None
  BDC->>BDC: init _seq_rng (seeded?) and _turn_sequence_cache

  loop per turn
    SYN->>BDC: request lengths for turn_id
    BDC->>BDC: _get_turn_sequence_lengths(turn_id)
    alt SD present
      BDC->>SD: sample(random_state=_seq_rng)
      SD-->>BDC: (isl, osl)
      BDC->>BDC: cache (turn_id -> (isl,osl))
    else no SD
      BDC-->>BDC: compute/sample via legacy mean/stddev
    end
    BDC-->>SYN: return (isl,osl)
    SYN->>T: _generate_text_payloads(turn, is_first) uses isl/osl
    SYN->>BDC: finalize turn
    BDC->>BDC: _clear_turn_cache(turn_id)
  end
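The per-turn sampling and caching flow in the diagram can be sketched as a simplified, hypothetical class (standalone illustration; the real logic lives in BaseDatasetComposer's _get_turn_sequence_lengths and _clear_turn_cache):

```python
import random

class TurnSampler:
    """Hypothetical sketch: sample one (ISL, OSL) pair per turn and cache it
    so repeated lookups for the same turn stay consistent."""

    def __init__(self, pairs, weights, seed=None):
        self._rng = random.Random(seed)  # one RNG, seeded once
        self._pairs = pairs
        self._weights = weights
        self._cache = {}

    def get_turn_sequence_lengths(self, turn_id):
        # First request for a turn samples; later requests hit the cache.
        if turn_id not in self._cache:
            self._cache[turn_id] = self._rng.choices(
                self._pairs, weights=self._weights
            )[0]
        return self._cache[turn_id]

    def clear_turn_cache(self, turn_id):
        # Called on finalize, mirroring the diagram's last step.
        self._cache.pop(turn_id, None)
```

Caching is what keeps the ISL used for prompt generation paired with the OSL used for max_tokens within a single turn.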

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I hop through lengths, both short and long,
Dice-roll tokens, steady or strong.
Per-turn pairs tucked in a neat little stash,
Sampled, cached, then cleared in a flash.
Benchmarks bounce — the rabbit’s dash. 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title succinctly and accurately describes the primary change of introducing mixed input and output sequence length distributions, matching the core feature added by the PR.
  • Docstring Coverage: ✅ Passed. No functions found in the changes; docstring coverage check skipped.


@github-actions github-actions bot added the feat label Oct 2, 2025
@the-david-oy the-david-oy marked this pull request as ready for review October 2, 2025 19:19

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d056b9e and df6d7e3.

📒 Files selected for processing (5)
  • aiperf/common/config/prompt_config.py (2 hunks)
  • aiperf/common/sequence_distribution.py (1 hunks)
  • aiperf/dataset/composer/base.py (2 hunks)
  • aiperf/dataset/composer/synthetic.py (2 hunks)
  • tests/test_sequence_distribution.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
tests/test_sequence_distribution.py (2)
aiperf/common/sequence_distribution.py (10)
  • DistributionParser (219-348)
  • SequenceLengthDistribution (47-200)
  • SequenceLengthPair (23-44)
  • create_balanced_distribution (365-383)
  • create_uniform_distribution (351-362)
  • sample (93-119)
  • sample_batch (121-151)
  • pairs (154-156)
  • get_statistics (158-192)
  • parse (227-265)
aiperf/common/config/prompt_config.py (2)
  • PromptConfig (167-258)
  • get_sequence_distribution (234-258)
aiperf/dataset/composer/base.py (4)
aiperf/common/config/prompt_config.py (1)
  • get_sequence_distribution (234-258)
aiperf/common/sequence_distribution.py (1)
  • sample (93-119)
aiperf/common/models/dataset_models.py (1)
  • Turn (43-70)
aiperf/dataset/utils.py (1)
  • sample_positive_normal_integer (113-124)
aiperf/common/config/prompt_config.py (4)
aiperf/common/config/cli_parameter.py (1)
  • CLIParameter (10-19)
aiperf/common/config/groups.py (1)
  • Groups (6-28)
aiperf/common/config/config_defaults.py (1)
  • InputTokensDefaults (83-86)
aiperf/common/sequence_distribution.py (3)
  • DistributionParser (219-348)
  • create_uniform_distribution (351-362)
  • parse (227-265)
aiperf/dataset/composer/synthetic.py (2)
aiperf/dataset/composer/base.py (1)
  • _sample_sequence_lengths (65-80)
aiperf/dataset/generator/prompt.py (1)
  • generate (96-118)
🪛 Ruff (0.13.2)
aiperf/common/sequence_distribution.py

33-35, 37-39, 41, 66-68, 82-85, 135, 246, 263-265, 273, 277, 285, 298, 316, 332-334, 346, 378: Avoid specifying long messages outside the exception class (TRY003)

aiperf/dataset/composer/synthetic.py

116-116: Unpacked variable osl is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


codecov bot commented Oct 2, 2025

Codecov Report

❌ Patch coverage is 90.99099% with 20 lines in your changes missing coverage. Please review.

Files with missing lines:
  • aiperf/common/models/sequence_distribution.py: 88.16% patch coverage (14 missing, 6 partials) ⚠️



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
aiperf/dataset/composer/base.py (2)

29-30: Gate distribution initialization when prompts are disabled.

If prompts are disabled (e.g., --prompt-input-tokens-mean 0) and no explicit --seq-dist is provided, get_sequence_distribution() may fall back to a default with ISL=0 (via the fallback at lines 68-74), which violates SequenceLengthPair validation (ISL must be positive) and prevents dataset creation even when only image/audio payloads are enabled.

Solution: Only initialize _seq_distribution when prompts are enabled or an explicit distribution is provided:

-        # Initialize sequence distribution
-        self._seq_distribution = config.input.prompt.get_sequence_distribution()
+        # Initialize sequence distribution only if prompts are enabled or explicitly specified
+        if config.input.prompt.input_tokens.mean > 0 or config.input.prompt.sequence_distribution:
+            self._seq_distribution = config.input.prompt.get_sequence_distribution()
+        else:
+            self._seq_distribution = None

This ensures that when prompts are disabled, the distribution is not initialized, and the fallback in _sample_sequence_lengths is never invoked with ISL=0.


76-78: Seed RNG once during initialization, not on every sample.

Passing config.input.random_seed (an int) into sample(random_state=random_seed) on every call recreates a fresh np.random.Generator with the same seed each time. This causes every turn to receive identical (ISL, OSL) pairs when a seed is configured, completely defeating the distribution.

Solution: Create a single np.random.Generator during __init__ and reuse it for all sampling:

+import numpy as np
 ...
 class BaseDatasetComposer(AIPerfLoggerMixin, ABC):
     def __init__(self, config: UserConfig, tokenizer: Tokenizer, **kwargs):
         ...
         self._seq_distribution = config.input.prompt.get_sequence_distribution()
+        seed = getattr(self.config.input, "random_seed", None)
+        self._seq_rng = np.random.default_rng(seed) if seed is not None else None

     def _sample_sequence_lengths(self) -> tuple[int, int]:
         ...
-        # Use random seed from config if available for reproducible results
-        random_seed = getattr(self.config.input, "random_seed", None)
-        return self._seq_distribution.sample(random_state=random_seed)
+        # Use seeded generator for reproducible sampling
+        return self._seq_distribution.sample(random_state=self._seq_rng)

This ensures sampling remains stochastic yet reproducible when a seed is provided.
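The failure mode described above can be demonstrated in isolation with NumPy (a standalone sketch, independent of the PR's code):

```python
import numpy as np

# Reseeding per call: every "sample" restarts the same stream,
# so all draws come out identical -- the bug described above.
reseeded = [int(np.random.default_rng(42).integers(0, 1000)) for _ in range(5)]
assert len(set(reseeded)) == 1

# One generator created once: its state advances between draws,
# so the stream is stochastic yet reproducible for a fixed seed.
rng = np.random.default_rng(42)
persistent = [int(rng.integers(0, 1000)) for _ in range(5)]
assert len(set(persistent)) > 1
```

The first assertion is exactly why a seeded run would produce the same (ISL, OSL) pair for every turn if the seed is passed into each sample call.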

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between df6d7e3 and e057365.

📒 Files selected for processing (4)
  • aiperf/common/config/prompt_config.py (1 hunks)
  • aiperf/dataset/composer/base.py (2 hunks)
  • aiperf/dataset/composer/synthetic.py (1 hunks)
  • tests/test_sequence_distribution.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • aiperf/common/config/prompt_config.py
🧰 Additional context used
🧬 Code graph analysis (3)
aiperf/dataset/composer/base.py (4)
aiperf/common/config/prompt_config.py (1)
  • get_sequence_distribution (206-212)
aiperf/common/sequence_distribution.py (1)
  • sample (93-119)
aiperf/common/models/dataset_models.py (1)
  • Turn (43-70)
aiperf/dataset/utils.py (1)
  • sample_positive_normal_integer (113-124)
tests/test_sequence_distribution.py (2)
aiperf/common/sequence_distribution.py (10)
  • DistributionParser (219-348)
  • SequenceLengthDistribution (47-200)
  • SequenceLengthPair (23-44)
  • create_balanced_distribution (365-383)
  • create_uniform_distribution (351-362)
  • sample (93-119)
  • sample_batch (121-151)
  • pairs (154-156)
  • get_statistics (158-192)
  • parse (227-265)
aiperf/common/config/prompt_config.py (1)
  • get_sequence_distribution (206-212)
aiperf/dataset/composer/synthetic.py (2)
aiperf/dataset/composer/base.py (1)
  • _sample_sequence_lengths (62-78)
aiperf/dataset/generator/prompt.py (1)
  • generate (96-118)
🪛 Ruff (0.13.2)
aiperf/dataset/composer/synthetic.py

110-110: Unpacked variable osl is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🔇 Additional comments (2)
aiperf/dataset/composer/base.py (1)

80-97: Logic is correct once turn-level caching is implemented.

The implementation correctly uses the sampled OSL for max_tokens when a distribution is present, and falls back to legacy mean/stddev sampling otherwise.

However, this method currently calls _sample_sequence_lengths() a second time per turn (the first call is in synthetic._generate_text_payloads at line 110), which breaks ISL/OSL pairing. Once turn-level caching is implemented (as flagged in the review of synthetic.py lines 108-116), this logic will correctly retrieve the cached OSL associated with the ISL used for prompt generation.

Verify that the turn-level caching solution addresses the double-sampling issue across both _generate_text_payloads and _set_max_tokens.

tests/test_sequence_distribution.py (1)

1-387: Excellent test coverage for the sequence distribution system.

The test suite comprehensively validates all aspects of the new distribution functionality:

  • SequenceLengthPair: validation (invalid ISL/OSL/probability), boundaries, immutability, string representation
  • SequenceLengthDistribution: single/multi-pair construction, sampling (deterministic and probabilistic), batch sampling, reproducibility, probability validation (including floating-point tolerance), statistics, string representation
  • DistributionParser: all supported formats (JSON, bracket, semicolon), backward compatibility (fractions), error handling (invalid formats, missing fields), decimal probabilities, whitespace handling
  • Utility functions: create_uniform_distribution and create_balanced_distribution, including empty input validation
  • Integration: end-to-end workflow from parsing to sampling with empirical distribution checks, statistical accuracy validation (expected values match empirical means within 1%)
  • PromptConfig integration: retrieving distributions with explicit configuration and fallback behavior

The test design is robust, using appropriate sample sizes (10,000–50,000 samples) and statistical tolerances (±5% for distribution checks, ±1% for expected value accuracy) to minimize flakiness while catching real issues.
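An empirical distribution check in the spirit described above can be sketched independently of the project's API (hypothetical standalone example; the pair values and tolerances mirror the ones discussed, not the actual test code):

```python
import random
from collections import Counter

# Sample a 60/40 two-pair mix many times and verify the observed
# frequencies land within a +/-5% tolerance of the configured weights.
pairs = [((128, 64), 0.6), ((256, 128), 0.4)]
rng = random.Random(0)  # fixed seed keeps the check deterministic
n = 10_000
counts = Counter(
    rng.choices([p for p, _ in pairs], weights=[w for _, w in pairs])[0]
    for _ in range(n)
)
for pair, weight in pairs:
    observed = counts[pair] / n
    assert abs(observed - weight) < 0.05, (pair, observed)
```

Large sample counts with loose tolerances keep such checks statistically meaningful without making them flaky.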


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e057365 and 58ee44b.

📒 Files selected for processing (1)
  • aiperf/dataset/composer/base.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
aiperf/dataset/composer/base.py (4)
aiperf/common/config/prompt_config.py (1)
  • get_sequence_distribution (206-212)
aiperf/common/sequence_distribution.py (1)
  • sample (93-119)
aiperf/common/models/dataset_models.py (1)
  • Turn (43-70)
aiperf/dataset/utils.py (1)
  • sample_positive_normal_integer (113-124)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build (ubuntu-latest, 3.10)
🔇 Additional comments (2)
aiperf/dataset/composer/base.py (2)

7-7: LGTM!

The numpy import is correctly added to support the new RNG initialization and distribution sampling.


31-36: Handle seed=0 correctly when initializing RNG.

The condition if seed is not None on line 36 will fail for a valid seed value of 0, preventing RNG initialization. Zero is a legitimate seed value and should create a seeded generator.

Apply this diff to handle zero seeds correctly:

         # Initialize RNG for sequence distribution sampling (avoid reseeding on each sample)
         seed = getattr(self.config.input, "random_seed", None)
-        self._seq_rng = np.random.default_rng(seed) if seed is not None else None
+        self._seq_rng = (
+            np.random.default_rng(seed) if seed is not None else np.random.default_rng()
+        )

Note: This change also eliminates the None case, ensuring a generator is always available. If you want to preserve the ability to have self._seq_rng = None (to signal "no seed configured"), use this alternative:

         seed = getattr(self.config.input, "random_seed", None)
-        self._seq_rng = np.random.default_rng(seed) if seed is not None else None
+        # Explicitly check for None to handle seed=0 correctly
+        if seed is None:
+            self._seq_rng = None
+        else:
+            self._seq_rng = np.random.default_rng(seed)

Likely an incorrect or invalid review comment.
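The closing note appears warranted: in Python, `0 is not None` evaluates to True, so the original identity check already handles a zero seed; only a truthiness test (`if seed:`) would mishandle it. A quick standalone check:

```python
import numpy as np

# `seed is not None` is True for seed=0, so a zero seed does
# initialize the generator; the pitfall only bites with `if seed:`,
# which treats 0 as falsy.
seed = 0
assert seed is not None   # identity check: 0 passes
assert not bool(seed)     # truthiness check: 0 would be skipped

rng = np.random.default_rng(seed) if seed is not None else None
assert rng is not None    # a seeded generator was created
```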

@the-david-oy the-david-oy marked this pull request as draft October 2, 2025 21:11
@the-david-oy the-david-oy marked this pull request as ready for review October 6, 2025 19:46
remove unused class

remove excess docs

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
docs/tutorials/sequence-distributions.md (1)

44-78: Add explicit languages to fenced code blocks.

Markdownlint is flagging these fences (MD040). Annotating them with text, bash, or json fixes the warning and keeps formatting consistent with the rest of the docs.

As per static analysis hints.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 58ee44b and 617092f.

📒 Files selected for processing (13)
  • README.md (1 hunks)
  • aiperf/common/config/__init__.py (1 hunks)
  • aiperf/common/config/prompt_config.py (3 hunks)
  • aiperf/common/models/__init__.py (4 hunks)
  • aiperf/common/models/sequence_distribution.py (1 hunks)
  • aiperf/dataset/composer/base.py (4 hunks)
  • aiperf/dataset/composer/synthetic.py (2 hunks)
  • docs/tutorials/sequence-distributions.md (1 hunks)
  • mkdocs.yml (1 hunks)
  • tests/composers/test_base_composer.py (1 hunks)
  • tests/composers/test_synthetic_composer.py (4 hunks)
  • tests/config/test_prompt_config.py (2 hunks)
  • tests/test_sequence_distribution.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/composers/test_synthetic_composer.py
  • tests/composers/test_base_composer.py
  • tests/config/test_prompt_config.py
🧰 Additional context used
🧬 Code graph analysis (5)
aiperf/dataset/composer/base.py (3)
aiperf/common/config/prompt_config.py (1)
  • get_sequence_distribution (222-228)
aiperf/common/models/sequence_distribution.py (1)
  • sample (141-183)
aiperf/common/models/dataset_models.py (1)
  • Turn (43-70)
aiperf/common/models/__init__.py (2)
aiperf/common/models/sequence_distribution.py (5)
  • DistributionParser (267-407)
  • SequenceLengthDistribution (95-264)
  • SequenceLengthPair (58-92)
  • create_balanced_distribution (424-442)
  • create_uniform_distribution (410-421)
tests/logging/test_logging_mixins.py (1)
  • logger (29-31)
aiperf/dataset/composer/synthetic.py (3)
aiperf/common/models/dataset_models.py (2)
  • Turn (43-70)
  • Text (24-27)
aiperf/dataset/composer/base.py (2)
  • _get_turn_sequence_lengths (71-99)
  • prefix_prompt_enabled (148-149)
aiperf/dataset/generator/prompt.py (2)
  • generate (96-118)
  • get_random_prefix_prompt (219-234)
tests/test_sequence_distribution.py (4)
aiperf/common/models/sequence_distribution.py (11)
  • DistributionParser (267-407)
  • SequenceLengthDistribution (95-264)
  • SequenceLengthPair (58-92)
  • _sample_positive_normal_integer (45-54)
  • create_balanced_distribution (424-442)
  • create_uniform_distribution (410-421)
  • sample (141-183)
  • sample_batch (185-215)
  • pairs (218-220)
  • get_statistics (222-256)
  • parse (279-318)
aiperf/common/config/prompt_config.py (2)
  • PromptConfig (167-228)
  • get_sequence_distribution (222-228)
aiperf/common/models/dataset_models.py (1)
  • Turn (43-70)
aiperf/dataset/composer/base.py (3)
  • BaseDatasetComposer (22-149)
  • _get_turn_sequence_lengths (71-99)
  • _clear_turn_cache (101-107)
aiperf/common/config/prompt_config.py (3)
aiperf/common/models/sequence_distribution.py (2)
  • DistributionParser (267-407)
  • parse (279-318)
aiperf/common/config/cli_parameter.py (1)
  • CLIParameter (10-19)
aiperf/common/config/groups.py (1)
  • Groups (6-28)
🪛 markdownlint-cli2 (0.18.1)
docs/tutorials/sequence-distributions.md

44-44: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


49-49: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


56-56: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


61-61: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.13.3)
aiperf/dataset/composer/synthetic.py

109-109: Unpacked variable osl is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

tests/test_sequence_distribution.py

205-205: Pattern passed to match= contains metacharacters but is neither escaped nor raw

(RUF043)


312-312: Pattern passed to match= contains metacharacters but is neither escaped nor raw

(RUF043)

aiperf/common/config/prompt_config.py

186-186: Avoid specifying long messages outside the exception class

(TRY003)

aiperf/common/models/sequence_distribution.py

48, 54: Value being cast to int is already an integer; remove the unnecessary int call (RUF046)

70-72, 74-76, 78, 80-82, 84-86, 114-116, 130-133, 199, 299, 316-318, 326, 330, 338, 353, 373, 389-391, 405, 437: Avoid specifying long messages outside the exception class (TRY003)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
docs/tutorials/sequence-distributions.md (1)

46-64: Consider adding language identifiers to format example code blocks.

The code blocks showing format syntax (lines 46-53, 58-64) lack language identifiers, which triggers markdown linter warnings. While these are format examples rather than executable code, adding text as the language identifier would satisfy the linter and improve consistency.

Example for line 46:

-```
+```text
 "ISL1,OSL1:PROB1;ISL2,OSL2:PROB2;..."

Apply similar changes to the other format example blocks at lines 51, 58, and 63.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 617092f828257cca5e29b4a35f28eae8a5c3aaf9 and a0f4c24fa4bb7ae85f2ac6c5383b9adffad117fb.

📒 Files selected for processing (4)
  • aiperf/common/models/sequence_distribution.py (1 hunks)
  • aiperf/dataset/composer/synthetic.py (2 hunks)
  • docs/tutorials/sequence-distributions.md (1 hunks)
  • tests/test_sequence_distribution.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • aiperf/dataset/composer/synthetic.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_sequence_distribution.py (3)
aiperf/common/models/sequence_distribution.py (11)
  • DistributionParser (283-423)
  • SequenceLengthDistribution (95-280)
  • SequenceLengthPair (58-92)
  • _sample_positive_normal_integer (45-54)
  • create_balanced_distribution (440-458)
  • create_uniform_distribution (426-437)
  • sample (141-183)
  • sample_batch (185-231)
  • pairs (234-236)
  • get_statistics (238-272)
  • parse (295-334)
aiperf/common/config/prompt_config.py (2)
  • PromptConfig (167-228)
  • get_sequence_distribution (222-228)
aiperf/dataset/composer/base.py (3)
  • BaseDatasetComposer (22-145)
  • _get_turn_sequence_lengths (71-96)
  • _clear_turn_cache (98-104)
🪛 markdownlint-cli2 (0.18.1)
docs/tutorials/sequence-distributions.md

46, 51, 58, 63: Fenced code blocks should have a language specified (MD040, fenced-code-language)

🪛 Ruff (0.13.3)
tests/test_sequence_distribution.py

206, 313: Pattern passed to match= contains metacharacters but is neither escaped nor raw (RUF043)

aiperf/common/models/sequence_distribution.py

48, 54: Value being cast to int is already an integer; remove the unnecessary int call (RUF046)

70-72, 74-76, 78, 80-82, 84-86, 114-116, 130-133, 199, 315, 332-334, 342, 346, 354, 369, 389, 405-407, 421, 453: Avoid specifying long messages outside the exception class (TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build (ubuntu-latest, 3.10)
🔇 Additional comments (6)
tests/test_sequence_distribution.py (2)

154-167: LGTM! RNG reseeding issue resolved.

The test now correctly creates a single RNG instance once (line 159) and reuses it across all samples (line 160), allowing the RNG state to advance properly. This ensures the stddev test will observe variance as expected.

560-601: Excellent caching test coverage.

The turn-level sequence caching tests thoroughly verify that ISL/OSL pairs are cached per turn for consistency and that cache clearing works correctly. This ensures deterministic behavior within a turn while allowing variation across turns.

aiperf/common/models/sequence_distribution.py (4)

212-231: LGTM! Batch sampling now correctly applies stddev.

The sample_batch method now properly samples with variance by calling _sample_positive_normal_integer for each pair that defines stddev, mirroring the single-sample logic. This ensures batch results respect configured standard deviations.

45-54: Solid implementation of positive normal sampling.

The clamping to a minimum of 1 prevents invalid sequence lengths, while the stddev check optimizes the deterministic case. The implementation correctly handles the edge case where normal sampling might produce negative values.

124-133: Good probability validation with floating-point tolerance.

The validation correctly allows small floating-point errors (rtol=1e-6, atol=1e-6) while still enforcing the constraint that probabilities sum to 100%. The error message helpfully includes the actual sum and the pairs for debugging.

283-423: Robust multi-format parser with comprehensive error handling.

The parser supports three distinct formats (JSON, bracket, semicolon) with optional stddev syntax and provides clear error messages for invalid inputs. The regex patterns correctly handle whitespace and optional stddev components.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a0f4c24 and d288575.

📒 Files selected for processing (1)
  • docs/tutorials/sequence-distributions.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/tutorials/sequence-distributions.md

46-46: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


51-51: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


58-58: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


63-63: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build (ubuntu-latest, 3.10)

Contributor

@lkomali lkomali left a comment


LGTM!

Thanks for working on this.

@the-david-oy the-david-oy merged commit 0a70989 into main Oct 7, 2025
6 checks passed
@the-david-oy the-david-oy deleted the dyas-isl-dist branch October 7, 2025 17:10