
feat: add llm_vllm_deepseek_ocr and llm_office_vllm_deepseek_ocr #205

Merged

weedge merged 8 commits into main from feat/vision-ocr on Oct 24, 2025
Conversation

@weedge (Collaborator) commented Oct 24, 2025

Note

  • vLLM deepseek_ocr only supports the Gundam mode: base_size = 1024, image_size = 640, crop_mode = True
  • although the vLLM deepseek_ocr implementation generates faster, it is not as stable as DeepSeek OCR with transformers and has a bug: repeated output during recognition
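The Gundam settings above can be captured in a small Python sketch (the dataclass and its field names are illustrative, not the actual achatbot API):

```python
from dataclasses import dataclass

# Hypothetical sketch of the single resolution mode vLLM DeepSeek-OCR
# supports ("Gundam"); values come from the note above, names are illustrative.
@dataclass(frozen=True)
class GundamMode:
    base_size: int = 1024   # global-view resolution
    image_size: int = 640   # per-crop resolution
    crop_mode: bool = True  # dynamic cropping enabled

mode = GundamMode()
print(mode)  # GundamMode(base_size=1024, image_size=640, crop_mode=True)
```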

feat:

  • add llm_vllm_deepseek_ocr and the officially vLLM-supported llm_office_vllm_deepseek_ocr
  • add llm_vllm_deepseek_ocr / llm_office_vllm_deepseek_ocr tests on Modal
IMAGE_GPU=L40s modal run src/llm/vllm/vlm/ocr_deepseek.py --task stream_infer
IMAGE_GPU=L40s OCR_TAG=llm_office_vllm_deepseek_ocr modal run src/llm/vllm/vlm/ocr_deepseek.py --task offline_infer
APP_NAME=achatbot IMAGE_GPU=L40s OCR_TAG=llm_vllm_deepseek_ocr modal run src/llm/vllm/vlm/ocr_deepseek.py --task achatbot_stream_infer
APP_NAME=achatbot IMAGE_GPU=L40s OCR_TAG=llm_office_vllm_deepseek_ocr modal run src/llm/vllm/vlm/ocr_deepseek.py --task achatbot_stream_infer
  • add the DeepSeek OCR vLLM implementation (registers DeepseekOCRForCausalLM (deepencoder + decoder) and DeepseekVLV2Processor)
  • deploy the deepseek_ocr vision OCR bot with fastapi_webrtc_vision_bot_serve
# deepseek single room bot
modal run src/download_models.py --repo-ids "FunAudioLLM/SenseVoiceSmall"
modal run src/download_models.py --repo-ids "deepseek-ai/DeepSeek-OCR" --revision "refs/pr/23"
modal volume put config ./config/bots/daily_ocr_vllm_vision_bot.json /bots/ -f
EXTRA_INDEX_URL=https://pypi.org/simple/ \
    SERVE_TYPE=room_bot \
    CONFIG_FILE=/root/.achatbot/config/bots/daily_ocr_vllm_vision_bot.json \
    ACHATBOT_VERSION=0.0.28.post1 \
    IMAGE_NAME=deepseek_ocr_vllm IMAGE_CONCURRENT_CN=1 IMAGE_GPU=L40s \
    modal serve src/fastapi_webrtc_vision_bot_serve.py
EXTRA_INDEX_URL=https://pypi.org/simple/ \
    SERVE_TYPE=room_bot \
    CONFIG_FILE=/root/.achatbot/config/bots/daily_ocr_vllm_vision_bot.json \
    ACHATBOT_VERSION=0.0.28.post2 \
    IMAGE_NAME=deepseek_ocr_office_vllm IMAGE_CONCURRENT_CN=1 IMAGE_GPU=L40s \
    modal serve src/fastapi_webrtc_vision_bot_serve.py

# run DailyOCRVisionBot with config
curl -XPOST "https://weedge--fastapi-webrtc-vision-deepseek-ocr-vllm-bot-srv-app-dev.modal.run/bot_join/chat-room/DailyOCRVisionBot"
curl -XPOST "https://weedge--fastapi-webrtc-vision-deepseek-ocr-office-vllm-bot-srv-app-dev.modal.run/bot_join/chat-room/DailyOCRVisionBot"
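The two curl calls above just POST to a bot_join route. A minimal Python sketch that builds the same URL (the helper name is hypothetical; the base URL and path segments are taken from the commands above):

```python
# Hypothetical helper mirroring the bot_join route used by the curl examples.
def bot_join_url(base: str, room: str, bot: str) -> str:
    """Build the bot_join endpoint URL for a given room and bot name."""
    return f"{base}/bot_join/{room}/{bot}"

url = bot_join_url(
    "https://weedge--fastapi-webrtc-vision-deepseek-ocr-vllm-bot-srv-app-dev.modal.run",
    "chat-room",
    "DailyOCRVisionBot",
)
print(url)
```

In practice the POST would be issued with any HTTP client (e.g. `requests.post(url)`); the sketch only constructs the URL.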

daily_ocr_vllm_vision_bot.json

{
  "chat_bot_name": "DailyOCRVisionBot",
  "handle_sigint": true,
  "is_background": false,
  "room_name": "chat-room",
  "room_url": "",
  "token": "",
  "services": {},
  "config": {
    "vad": {
      "tag": "silero_vad_analyzer",
      "args": {
        "start_secs": 0.0,
        "stop_secs": 0.0,
        "confidence": 0.7,
        "min_volume": 0.6,
        "onnx": true
      }
    },
    "asr": {
      "tag": "sense_voice_asr",
      "args": {
        "language": "zn",
        "model_name_or_path": "/root/.achatbot/models/FunAudioLLM/SenseVoiceSmall"
      }
    },
    "vision_ocr": {
      "tag": "llm_vllm_deepseek_ocr",
      "args": {
        "serv_args": {
          "model": "/root/.achatbot/models/deepseek-ai/DeepSeek-OCR",
          "enforce_eager": false,
          "trust_remote_code": true,
          "max_model_len": 8192,
          "tensor_parallel_size": 1,
          "enable_prefix_caching": false,
          "gpu_memory_utilization": 0.75
        }
      }
    },
    "tts": {
      "tag": "tts_edge",
      "args": {
        "voice_name": "zh-CN-YunxiNeural",
        "language": "zh",
        "gender": "Male"
      }
    }
  },
  "config_list": []
}
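As a quick sanity check, the pipeline sections this config declares (vad, asr, vision_ocr, tts) can be validated with a few lines of Python; this is a sketch over a trimmed copy of the JSON above, not the bot's actual config loader:

```python
import json

# Trimmed copy of daily_ocr_vllm_vision_bot.json (tags from the config above).
cfg = json.loads("""
{
  "chat_bot_name": "DailyOCRVisionBot",
  "config": {
    "vad": {"tag": "silero_vad_analyzer"},
    "asr": {"tag": "sense_voice_asr"},
    "vision_ocr": {"tag": "llm_vllm_deepseek_ocr"},
    "tts": {"tag": "tts_edge"}
  }
}
""")

# Every stage of the vision bot pipeline must be present.
required = {"vad", "asr", "vision_ocr", "tts"}
missing = required - cfg["config"].keys()
assert not missing, f"missing sections: {missing}"
print(cfg["config"]["vision_ocr"]["tag"])  # llm_vllm_deepseek_ocr
```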


Signed-off-by: weedge <weege007@gmail.com>
@gemini-code-assist (Contributor)
Summary of Changes

Hello @weedge, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's optical character recognition (OCR) capabilities by integrating Deepseek OCR models, leveraging vLLM for efficient inference. It introduces both a custom and an officially supported vLLM variant of Deepseek OCR, complete with robust vision encoding and image preprocessing pipelines. The changes also include a new testing task for the vision bot and updates to core interfaces to support asynchronous operations, ensuring a more performant and feature-rich multimodal experience.
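The dynamic cropping mentioned here pairs each oversized input with a grid of fixed-size crops plus a resized global view. An illustrative sketch of the tiling arithmetic (an assumption for clarity, not the PR's actual preprocessing code):

```python
import math

# Illustrative sketch: compute crop origins for an image larger than the
# per-crop size (image_size = 640 in Gundam mode). The tiling math is an
# assumption; the real preprocessor lives in src/thirdparty/deepseek_ocr_vllm.
def crop_origins(width: int, height: int, image_size: int = 640) -> list[tuple[int, int]]:
    cols = math.ceil(width / image_size)
    rows = math.ceil(height / image_size)
    return [(c * image_size, r * image_size) for r in range(rows) for c in range(cols)]

print(len(crop_origins(1280, 960)))  # 2 cols x 2 rows -> 4
```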

Highlights

  • Deepseek OCR Integration: Introduced two new vLLM-based Deepseek OCR models: llm_vllm_deepseek_ocr for a custom implementation and llm_office_vllm_deepseek_ocr for the officially vLLM-supported version. This significantly expands the OCR capabilities within the system.
  • New Vision Bot Test Task: Added a new achatbot_stream_infer task to ocr_deepseek.py for comprehensive testing of the Deepseek OCR vision bot within the achatbot framework, including session management and image processing.
  • Vision Encoder and Image Processing Modules: Incorporated a suite of new modules (deepencoder, model, process within src/thirdparty/deepseek_ocr_vllm) that implement the Deepseek OCR's vision encoder (utilizing SAM and CLIP-based components), an MLP projector, and advanced image processing logic, including dynamic cropping and tokenization for multimodal input.
  • Asynchronous Interface Update: Refactored the IVisionOCR interface to use async_generate instead of generate, ensuring better compatibility with asynchronous operations and improving overall system responsiveness for vision-related tasks.
  • Dependency and Version Management: Updated the achatbot project version to 0.0.28.post1 and introduced conditional transformers versioning (4.57.1 for llm_office_vllm_deepseek_ocr and 4.47.1 otherwise) to align with specific vLLM requirements.
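The async refactor described above (generate → async_generate) can be sketched as an async generator; the method name follows the PR summary, while the class body is purely illustrative:

```python
import asyncio
from typing import AsyncGenerator

# Hypothetical sketch of the IVisionOCR change: async_generate is an async
# generator that yields text chunks as inference streams them back.
class EchoOCR:
    async def async_generate(self, session=None, **kwargs) -> AsyncGenerator[str, None]:
        for chunk in ["Deep", "Seek-", "OCR"]:
            await asyncio.sleep(0)  # stand-in for awaiting model inference
            yield chunk

async def main() -> str:
    ocr = EchoOCR()
    return "".join([c async for c in ocr.async_generate()])

print(asyncio.run(main()))  # DeepSeek-OCR
```

Consumers iterate with `async for`, which lets the bot forward partial OCR text downstream without blocking the event loop.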

@gemini-code-assist bot left a comment
Code Review

This pull request adds support for DeepSeek OCR models using vLLM, including both a custom implementation and the officially supported version. The changes are well-structured, primarily introducing new files for the model implementation and testing scripts. My review focuses on improving type safety and reducing code duplication for better maintainability. Overall, the changes look good and the new feature is a valuable addition.

@weedge added labels: vllm, VLM (vision language model), ViT (vision transformer), CLIP (Contrastive Language-Image Pre-training) on Oct 24, 2025
@weedge weedge merged commit 1a96549 into main Oct 24, 2025
@weedge changed the title from "feat: add add llm_vllm_deepseek_ocr and llm_office_vllm_deepseek_ocr" to "feat: add llm_vllm_deepseek_ocr and llm_office_vllm_deepseek_ocr" on Oct 24, 2025