Releases: InternLM/lmdeploy

LMDeploy Release v0.6.2

29 Oct 06:42
522108c

Highlights

  • PyTorch engine supports graph mode on the Ascend platform, doubling the inference speed (see the sketch after this list)
  • Support llama3.2-vision models in PyTorch engine
  • Support Mixtral in TurboMind engine, achieving 20+ RPS on the ShareGPT dataset with 2 A100-80G GPUs
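
A minimal sketch of running the PyTorch engine on the Ascend platform, assuming PytorchEngineConfig exposes device_type and eager_mode and that graph mode corresponds to eager_mode=False; the model name is illustrative:

from lmdeploy import pipeline, PytorchEngineConfig

# device_type='ascend' selects the Ascend backend; eager_mode=False
# (assumed default) enables the graph mode highlighted above
pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=PytorchEngineConfig(device_type='ascend',
                                                   eager_mode=False))
print(pipe(['Please introduce yourself']))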

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.6.1...v0.6.2

LMDeploy Release v0.6.1

28 Sep 11:34
2e49fc3

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

🌐 Other

New Contributors

Full Changelog: v0.6.0...v0.6.1

LMDeploy Release v0.6.0

13 Sep 03:12
e2aa4bd

Highlights

  • Optimize W4A16 quantized model inference by implementing GEMM in TurboMind Engine
    • Add GPTQ-INT4 inference
    • Support CUDA architectures SM70 and above, i.e., V100 and newer GPUs
  • Refactor PyTorchEngine
    • Employ CUDA graphs to boost inference performance by ~30%
    • Support more models on the Huawei Ascend platform
  • Upgrade GenerationConfig (see the sketch after this list)
    • Support min_p sampling
    • Add do_sample, which defaults to False
    • Remove EngineGenerationConfig and merge it into GenerationConfig
  • Support guided decoding
  • Distinguish between the name of the deployed model and the name of the model's chat template
    Before:
lmdeploy serve api_server /the/path/of/your/awesome/model \
    --model-name customized_chat_template.json

After:

lmdeploy serve api_server /the/path/of/your/awesome/model \
    --model-name "the served model name" \
    --chat-template customized_chat_template.json
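
A minimal sketch of the upgraded GenerationConfig, assuming the pipeline accepts it via the gen_config argument; the model path is illustrative:

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('/the/path/of/your/awesome/model')

# do_sample defaults to False (greedy decoding); set it to True to
# activate the sampling parameters, including the new min_p
gen_config = GenerationConfig(do_sample=True,
                              temperature=0.8,
                              top_p=0.95,
                              min_p=0.05,
                              max_new_tokens=512)
print(pipe(['Please introduce yourself'], gen_config=gen_config))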

Breaking Changes

  • TurboMind model converter. Please re-convert the models if you use this feature
  • EngineGenerationConfig is removed. Please use GenerationConfig instead
  • Chat template. Please use --chat-template to specify it

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

  • enable run vlm with pytorch engine in gradio by @RunningLeon in #2256
  • fix side-effect: failed to update tm model config with tm engine config by @lvhan028 in #2275
  • Fix internvl2 template and update docs by @irexyc in #2292
  • fix the issue of missing dependencies in the Dockerfile and pip by @ColorfulDick in #2240
  • Fix the way to get "quantization_config" from model's configuration by @lvhan028 in #2325
  • fix(ascend): fix import error of pt engine in cli by @CyCle1024 in #2328
  • Default rope_scaling_factor of TurbomindEngineConfig to None by @lvhan028 in #2358
  • Fix the logic of updating engine_config to TurbomindModelConfig for both tm model and hf model by @lvhan028 in #2362
  • fix cache position for pytorch engine by @RunningLeon in #2388
  • Fix /v1/completions batch order wrong by @AllentDan in #2395
  • Fix some issues encountered by modelscope and community by @irexyc in #2428
  • fix llama3 rotary in pytorch engine by @grimoire in #2444
  • fix tensors on different devices when deploying MiniCPM-V-2_6 with tensor parallelism by @irexyc in #2454
  • fix MultinomialSampling operator builder by @grimoire in #2460
  • Fix initialization of runtime_min_p by @irexyc in #2461
  • fix Windows compile error by @zhyncs in #2303
  • fix: follow up #2303 by @zhyncs in #2307

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.5.3...v0.6.0

LMDeploy Release v0.6.0a0

26 Aug 09:12
97b880b

Highlights

  • Optimize W4A16 quantized model inference by implementing GEMM in TurboMind Engine (see the sketch below)
    • Add GPTQ-INT4 inference
    • Support CUDA architectures SM70 and above, i.e., V100 and newer GPUs
  • Optimize the prefilling inference stage of PyTorchEngine
  • Distinguish between the name of the deployed model and the name of the model's chat template

Before:

lmdeploy serve api_server /the/path/of/your/awesome/model \
    --model-name customized_chat_template.json 

After:

lmdeploy serve api_server /the/path/of/your/awesome/model \
    --model-name "the served model name" \
    --chat-template customized_chat_template.json
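
A minimal sketch of GPTQ-INT4 inference with the TurboMind engine, assuming model_format accepts 'gptq' as the GPTQ-INT4 highlight describes; the model path is illustrative:

from lmdeploy import pipeline, TurbomindEngineConfig

# model_format='gptq' loads GPTQ-INT4 weights; the optimized W4A16
# GEMM kernels apply on SM70 (V100) and newer GPUs
pipe = pipeline('/the/path/of/your/gptq-int4/model',
                backend_config=TurbomindEngineConfig(model_format='gptq'))
print(pipe(['Please introduce yourself']))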

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

  • enable run vlm with pytorch engine in gradio by @RunningLeon in #2256
  • fix side-effect: failed to update tm model config with tm engine config by @lvhan028 in #2275
  • Fix internvl2 template and update docs by @irexyc in #2292
  • fix the issue of missing dependencies in the Dockerfile and pip by @ColorfulDick in #2240
  • Fix the way to get "quantization_config" from model's configuration by @lvhan028 in #2325
  • fix(ascend): fix import error of pt engine in cli by @CyCle1024 in #2328
  • Default rope_scaling_factor of TurbomindEngineConfig to None by @lvhan028 in #2358
  • Fix the logic of updating engine_config to TurbomindModelConfig for both tm model and hf model by @lvhan028 in #2362

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.5.3...v0.6.0a0

LMDeploy Release v0.5.3

07 Aug 03:38
a129a14

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.5.2...v0.5.3

LMDeploy Release v0.5.2.post1

26 Jul 12:22
fb6f8ea

What's Changed

🐞 Bug fixes

  • [Hotfix] missing parentheses when calculating the coefficient of llama3 rope, which caused the needle-in-a-haystack experiment to fail, by @lvhan028 in #2157

🌐 Other

Full Changelog: v0.5.2...v0.5.2.post1

LMDeploy Release v0.5.2

26 Jul 08:07
7199b4e

Highlights

  • LMDeploy supports Llama 3.1 and its tool calling. An example of calling "Wolfram Alpha" to perform complex mathematical calculations can be found here (see the sketch after this list)
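
A minimal sketch of the tool-calling flow through the OpenAI-compatible server, assuming an api_server is running locally on the default port; the wolfram_alpha tool schema and the served model name are illustrative, not part of the release:

from openai import OpenAI

# assumes `lmdeploy serve api_server <llama3.1 model>` is running locally
client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='none')

# hypothetical tool schema for illustration
tools = [{
    'type': 'function',
    'function': {
        'name': 'wolfram_alpha',
        'description': 'Query Wolfram Alpha to perform a calculation',
        'parameters': {
            'type': 'object',
            'properties': {'query': {'type': 'string'}},
            'required': ['query'],
        },
    },
}]

resp = client.chat.completions.create(
    model='llama3.1',  # the served model name
    messages=[{'role': 'user', 'content': 'Integrate x^2 from 0 to 1'}],
    tools=tools)
print(resp.choices[0].message.tool_calls)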

What's Changed

🚀 Features

💥 Improvements

  • Remove the triton inference server backend "turbomind_backend" by @lvhan028 in #1986
  • Remove kv cache offline quantization by @AllentDan in #2097
  • Remove session_len and deprecated short names of the chat templates by @lvhan028 in #2105
  • clarify "n>1" in GenerationConfig hasn't been supported yet by @lvhan028 in #2108

🐞 Bug fixes

🌐 Other

Full Changelog: v0.5.1...v0.5.2

LMDeploy Release v0.5.1

16 Jul 10:05
9cdce39

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.5.0...v0.5.1

LMDeploy Release v0.5.0

01 Jul 07:22
4cb3854

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.4.2...v0.5.0

LMDeploy Release v0.4.2

27 May 08:56
54b7230

Highlights

  • Support 4-bit weight-only quantization and inference on VLMs, such as InternVL v1.5, LLaVA, InternLM-XComposer2

Quantization

lmdeploy lite auto_awq OpenGVLab/InternVL-Chat-V1-5 --work-dir ./InternVL-Chat-V1-5-AWQ

Inference with quantized model

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('./InternVL-Chat-V1-5-AWQ', backend_config=TurbomindEngineConfig(tp=1, model_format='awq'))

img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
out = pipe(('describe this image', img))
print(out)

  • Balance the vision model across multiple GPUs when deploying VLMs with tensor parallelism

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5', backend_config=TurbomindEngineConfig(tp=2))

img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
out = pipe(('describe this image', img))
print(out)

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.4.1...v0.4.2