
Conversation


@attafosu attafosu commented Feb 20, 2025

  • Previously, LLM.generate() could not be called multiple times with delayed sampling enabled.
  • The same was true for repeated step() calls.
  • The issue occurs when, after the last (batch) request has finished and a new request is started, cached_step_inputs and cached_step_outputs still contain elements saved from the previously served (batch) request. They should be empty at that point.
  • The cleanest solution would be to skip appending to cached_step_inputs/cached_step_outputs when the most recently generated output is the final token for the current (batch) request, but there is no clean way to check for this in the model runner.
  • Instead, we check (in _patch_prev_output) whether the scheduler context's output_queue is empty, which means there are no pending outputs to patch; a sketch of this guard follows below.
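A minimal sketch of the guard described above. Only `_patch_prev_output`, `cached_step_inputs`/`cached_step_outputs`, and the scheduler context's `output_queue` are taken from this PR; the surrounding attribute names and structure are illustrative assumptions, not the actual HPU model runner code.

```python
def _patch_prev_output(self) -> None:
    # Hypothetical handle to the scheduler context; the real attribute name
    # in the HPU model runner may differ.
    ctx = self.scheduler_context

    # An empty output_queue means there are no pending outputs to patch,
    # i.e. the previously served (batch) request has fully finished. Bail out
    # so that stale entries left in cached_step_inputs/cached_step_outputs
    # from that request are never applied to a new generate()/step() call.
    if not ctx.output_queue:
        return

    # ... the existing delayed-sampling patching logic would follow here,
    # consuming the cached step inputs/outputs saved on the previous step.
```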

Tests here: https://github.com/habana-internal/mlperf_inference/pull/158

@attafosu attafosu requested a review from mswiniarsk February 20, 2025 16:31
@tianmu-li tianmu-li merged commit 6eeefdd into HabanaAI:mlperf_features Feb 20, 2025
3 of 22 checks passed
kamil-kaczor added a commit that referenced this pull request Mar 5, 2025
Cherry-pick of: #845, fixing an issue in e.g. static benchmarks