speculative decoding #630
Conversation
Overall, the implementation looks good to me and is well-documented 👍 The quality can be further improved by fixing the following minor issues. I think we are only one question away from approving this PR.
src/lmflow/pipeline/inferencer.py
- [Feature] line 331: better to treat `temperature=0.0` as `use_argmax=True`.
- ⚠️ [Bug or Question] line 344: I think the denominator is the maximum non-zero cumulative probability, not the sum of those cumulative probabilities?
- [Bug] line 359: no bug when `num_sample=1`, but note that `torch.multinomial` samples without replacement by default (see this link), so `replacement=True` should be specified (a minimal sketch follows this list).
- [Style] line 455: comment typo "x1,...,γ" -> "x1,...,xγ".
- [Style] lines 458-459, 484: better to use `logger.debug` instead of `print`.
- [Feature] line 465: assuming `ThreadPoolExecutor(max_workers=num_model_for_verify)`, better to expose the `num_model_for_verify` argument (default=1) to users, since for very large models GPU memory can become the bottleneck when multiple large models run in parallel for verification. A better implementation could verify batch by batch and let the user specify the batch size.
- [Style] lines 499, 502: typo: "flnal" -> "final".
- [Style] lines 507-508, 512-513: use `logger.debug` instead of `print`.
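To make the two sampling-related points above concrete, here is a minimal sketch (a generic helper, not the PR's code; `sample_next_token` and its arguments are illustrative):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float, num_samples: int = 1) -> torch.Tensor:
    """Illustrative sampler; logits has shape (batch, vocab)."""
    # Treat temperature=0.0 as greedy decoding (argmax), which also avoids dividing by zero.
    if temperature == 0.0:
        return torch.argmax(logits, dim=-1, keepdim=True)
    probs = torch.softmax(logits / temperature, dim=-1)
    # torch.multinomial samples WITHOUT replacement by default,
    # so replacement=True matters whenever num_samples > 1.
    return torch.multinomial(probs, num_samples=num_samples, replacement=True)
```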
tests/pipeline/test_spec_inf.py
- The `tests` folder is used for unit tests; better to rewrite this part in the standard unittest format (a skeleton is sketched below), or move it to `examples/*.py` later.
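For instance, a minimal unittest-style skeleton (class and method names below are suggestions, not existing code):

```python
import unittest

class TestSpeculativeInferencer(unittest.TestCase):
    def test_speculative_decoding_runs(self):
        # Hypothetical smoke test: build the draft/target models and check
        # that speculative decoding returns a non-empty generation.
        ...

if __name__ == "__main__":
    unittest.main()
```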
- [Bug] lines 465-466: we should use a single forward pass over the whole sequence (utilizing GPU parallelism) instead of thread-level parallelism for the large model M_p; otherwise there will be no acceleration (sketched below).
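A rough sketch of what single-forward verification could look like, assuming a HuggingFace-style causal LM whose output exposes `.logits` (function and variable names are illustrative, not the PR's API):

```python
import torch

@torch.no_grad()
def target_probs_single_forward(target_model, prefix_ids: torch.Tensor,
                                draft_ids: torch.Tensor) -> torch.Tensor:
    """Score all gamma draft tokens with ONE forward pass of the target model.

    prefix_ids: (1, n)      tokens generated so far
    draft_ids:  (1, gamma)  tokens proposed by the draft model
    returns:    (gamma, vocab) target-model probabilities at each draft position
    """
    input_ids = torch.cat([prefix_ids, draft_ids], dim=-1)      # (1, n + gamma)
    logits = target_model(input_ids).logits                     # (1, n + gamma, vocab)
    # Logits at positions n-1 .. n+gamma-2 predict the gamma draft tokens.
    step_logits = logits[0, prefix_ids.size(-1) - 1 : -1, :]    # (gamma, vocab)
    return torch.softmax(step_logits, dim=-1)
```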
LGTM, thanks 👍
inferencer now supports speculative decoding via `SpeculativeInferencer`. Tested with gpt2 (draft model) and gpt2-large (target model); see /tests/pipeline/test_spec_inf.py. Only functionality testing is finished; performance testing is still needed. I am not sure whether my implementation of STEP 2 in speculative sampling (running the target model in parallel) is correct, please review & revise. Thanks a lot!
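For reference on STEP 2, the standard accept/reject rule from the speculative sampling literature (a generic sketch of the published algorithm, not the PR's code): each draft token x is accepted with probability min(1, p(x)/q(x)), and on the first rejection a replacement token is drawn from the normalized residual max(0, p - q).

```python
import torch

def accept_reject(draft_tokens: torch.Tensor, q_probs: torch.Tensor, p_probs: torch.Tensor):
    """Generic speculative-sampling acceptance step (not the PR's implementation).

    draft_tokens: (gamma,)        tokens proposed by the draft model
    q_probs:      (gamma, vocab)  draft-model probabilities at each position
    p_probs:      (gamma, vocab)  target-model probabilities at each position
    """
    accepted = []
    for i, x in enumerate(draft_tokens.tolist()):
        p_x, q_x = p_probs[i, x], q_probs[i, x]
        # Accept draft token x with probability min(1, p(x) / q(x)).
        if torch.rand(()) < torch.clamp(p_x / q_x, max=1.0):
            accepted.append(x)
        else:
            # First rejection: resample from the normalized residual max(0, p - q)
            # and stop consuming further draft tokens.
            residual = torch.clamp(p_probs[i] - q_probs[i], min=0.0)
            accepted.append(torch.multinomial(residual / residual.sum(), 1).item())
            break
    # If all gamma tokens were accepted, the full algorithm samples one extra
    # token from the target distribution at the next position (omitted here).
    return accepted
```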