speculative decoding #630

Merged 4 commits into OptimalScale:main on Sep 6, 2023

Conversation

@wheresmyhair (Collaborator) commented:

inferencer now supports speculative decoding via SpeculativeInferencer. Tested with gpt2 (draft model) and gpt2-large (target model); see /tests/pipeline/test_spec_inf.py. Only functionality testing is finished; performance testing is still needed.
I am not sure whether my implementation of STEP 2 in speculative sampling (running the target model in parallel) is correct, so please review and revise. Thanks a lot!

@research4pan (Contributor) left a comment:

Overall, the implementation looks good to me and is well-documented 👍 The quality can be further improved by fixing the following minor problems. I think we are only one question away from approving this PR.

src/lmflow/pipeline/inferencer.py

  • [Feature] line 331: better treat temperature=0.0 as use_argmax=True.
  • ⚠️ [Bug or Question] line 344: I think the denominator is the maximum non-zero cumulative probability, not the sum of those cumulative probabilities?
  • [Bug] line 359: no bug when num_sample=1, but note that torch.multinomial samples without replacement by default, so replacement=True should be specified (see the sketch after this list).
  • [Style] line 455: comment typo "x1,...,γ" -> "x1,...,xγ"
  • [Style] line 458-459, 484: better use logger.debug instead of print.
  • [Feature] line 465: assuming ThreadPoolExecutor(max_worker=num_model_for_verify), it would be better to expose the num_model_for_verify argument (default=1) to users, since for very large models GPU memory can become the bottleneck when multiple large models run in parallel for verification. A better implementation could verify batch by batch and let the user specify the batch size.
  • [Style] line 499, 502: typo: "flnal" -> "final"
  • [Style] line 507-508, 512-513: use logger.debug instead of print.
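
A minimal sketch (not the PR's actual code) of the two sampling points above, assuming a plain PyTorch helper: temperature=0.0 is treated as greedy argmax, and replacement=True is passed to torch.multinomial, whose default is sampling without replacement.

```python
import torch

def sample_from_logits(logits: torch.Tensor, temperature: float = 1.0, num_samples: int = 1) -> torch.Tensor:
    """Sample token ids from a [vocab_size] logits vector."""
    if temperature == 0.0:
        # Zero temperature degenerates to greedy decoding (argmax).
        return torch.argmax(logits).repeat(num_samples)
    probs = torch.softmax(logits / temperature, dim=-1)
    # torch.multinomial samples WITHOUT replacement by default, which silently
    # changes the distribution once num_samples > 1; pass replacement=True.
    return torch.multinomial(probs, num_samples, replacement=True)
```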

tests/pipeline/test_spec_inf.py

  • The tests folder is used for unit tests; it would be better to rewrite this part in the standard unittest format (a skeleton is sketched below), or move it to examples/*.py later.
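
A hypothetical unittest-style skeleton for the speculative decoding test; the class, method, and model choices are illustrative, not the repo's actual test code.

```python
import unittest

class TestSpeculativeInferencer(unittest.TestCase):
    def test_speculative_inference_runs(self):
        # Build the draft model (e.g. gpt2) and target model (e.g. gpt2-large),
        # run SpeculativeInferencer, and assert on the decoded output here.
        self.skipTest("placeholder: wire up gpt2 / gpt2-large and assert on output")

if __name__ == "__main__":
    unittest.main()
```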

@research4pan (Contributor) left a comment:

  • [Bug] line 465-466: we should use one forward pass over the whole sequence (to utilize GPU parallelism) instead of thread-level parallelism for the large model M_p; otherwise there will be no acceleration (see the sketch below).
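
For reference, a hedged sketch of the single-forward verification idea, assuming the target model is a standard Hugging Face causal LM; the names here are illustrative, not the PR's code.

```python
import torch

@torch.no_grad()
def target_probs_for_draft(target_model, prefix_ids: torch.Tensor, draft_ids: torch.Tensor) -> torch.Tensor:
    """One forward pass of the target model over prefix + all gamma draft tokens.

    prefix_ids: [1, prefix_len] prompt token ids
    draft_ids:  [1, gamma] tokens proposed by the draft model
    returns:    [gamma + 1, vocab] probabilities; row i is the target
                distribution for the token following prefix + draft[:i].
    """
    input_ids = torch.cat([prefix_ids, draft_ids], dim=-1)   # [1, prefix_len + gamma]
    logits = target_model(input_ids).logits                  # [1, prefix_len + gamma, vocab]
    start = prefix_ids.shape[-1] - 1
    # The last gamma + 1 prediction positions cover every draft token (plus one
    # bonus position) in a single GPU-parallel forward, so no thread pool is needed.
    return torch.softmax(logits[0, start:, :], dim=-1)
```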

@wheresmyhair (Collaborator, Author) replied:

src/lmflow/pipeline/inferencer.py

  • [Feature] line 331: better treat temperature=0.0 as use_argmax=True.
  • ⚠️ [Bug or Question] line 344: I think the denominator is the maximum non-zero cumulative probability, not the sum of those cumulative probabilities?
  • [Bug] line 359: no bug when num_sample=1, but note that torch.multinomial samples without replacement by default, so replacement=True should be specified.
  • [Style] line 455: comment typo "x1,...,γ" -> "x1,...,xγ"
  • [Style] line 458-459, 484: better use logger.debug instead of print.
  • [Style] line 499, 502: typo: "flnal" -> "final"
  • [Style] line 507-508, 512-513: use logger.debug instead of print.

Fixed.

  • [Feature] line 465: assuming ThreadPoolExecutor(max_worker=num_model_for_verify), it would be better to expose the num_model_for_verify argument (default=1) to users, since for very large models GPU memory can become the bottleneck when multiple large models run in parallel for verification. A better implementation could verify batch by batch and let the user specify the batch size.
  • [Bug] line 465-466: we should use one forward pass over the whole sequence (to utilize GPU parallelism) instead of thread-level parallelism for the large model M_p; otherwise there will be no acceleration.

Instead of using threading + pred_next_token(), SpeculativeInferencer now uses the get_backend_model() method of the HFDecoderModel object to get the logits and do the calculations (rough sketch below).
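
For context, a rough sketch of the acceptance step this enables, following the standard speculative-sampling recipe (accept with probability min(1, p/q)); the function and variable names are illustrative, not taken from the PR.

```python
import torch

def accept_draft_tokens(draft_ids: torch.Tensor,
                        draft_probs: torch.Tensor,
                        target_probs: torch.Tensor) -> int:
    """Return how many of the gamma draft tokens are accepted.

    draft_ids:    [gamma] proposed token ids
    draft_probs:  [gamma, vocab] q(x) at each draft position (cached while drafting)
    target_probs: [gamma, vocab] p(x) at the same positions (from one target forward)
    """
    for i, token in enumerate(draft_ids.tolist()):
        p, q = target_probs[i, token], draft_probs[i, token]
        # Accept draft token i with probability min(1, p/q); stop at the first rejection.
        if torch.rand(()) >= torch.clamp(p / q, max=1.0):
            return i
    return draft_ids.shape[0]
```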

@research4pan (Contributor) left a comment:

LGTM, thanks 👍

@research4pan merged commit 7f2711a into OptimalScale:main on Sep 6, 2023