
feat: add llm_transformers_manual_vision_deepseek_ocr and deploy deepseek_ocr vision bot with fastapi_webrtc_vision_bot_serve #203

Merged
weedge merged 9 commits into main from feat/vision-ocr on Oct 22, 2025

Conversation

weedge (Collaborator) commented Oct 22, 2025

  • DeepEncoder ( ImageEncoderViT (sam) + a 16× token compressor + VitModel(clip) ) (An idea: could this component also serve as the embedding model for RAG, compressing text-memory content through the image encoder? The loss should stay within an acceptable range, since long-form memory already suffers some information loss anyway.)
  • DeepSeek-3B-MoE decoder (DeepseekV2Decoder MOE-A570M)
(image: DeepSeek-OCR architecture diagram)
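
How the 16× compression pencils out, as a minimal PyTorch sketch: the conv shapes mirror the net_2/net_3 layers in the module dump further down this thread, while the input grid size is illustrative.

import torch
from torch import nn

# Two stride-2 convolutions (cf. net_2 / net_3 in the module dump below) halve
# each spatial axis twice, so the token grid shrinks 16x before the decoder:
# 4x fewer rows * 4x fewer columns.
compressor = nn.Sequential(
    nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1, bias=False),   # net_2
    nn.Conv2d(512, 1024, kernel_size=3, stride=2, padding=1, bias=False),  # net_3
)

feats = torch.randn(1, 256, 64, 64)  # hypothetical SAM patch-feature grid
out = compressor(feats)              # -> (1, 1024, 16, 16)
print(64 * 64, "->", out.shape[-2] * out.shape[-1])  # 4096 -> 256 tokens (16x)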

AI Podcast: (embedded audio)

feat:

  • add DeepSeek-OCR transformers/vLLM cases on Modal
modal run src/download_models.py --repo-ids "deepseek-ai/DeepSeek-OCR" --revision "refs/pr/23"

# HF transformers
IMAGE_GPU=L4 modal run src/llm/transformers/vlm/ocr_deepseek.py --task dump_model
IMAGE_GPU=L4 modal run src/llm/transformers/vlm/ocr_deepseek.py --task infer
IMAGE_GPU=L4 modal run src/llm/transformers/vlm/ocr_deepseek.py --task infer_filter
BACKEND=achatbot IMAGE_GPU=L4 modal run src/llm/transformers/vlm/ocr_deepseek.py --task achatbot_infer

# vllm
IMAGE_GPU=L40s modal run src/llm/vllm/vlm/ocr_deepseek.py --task stream_infer
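
For orientation, the Hugging Face path these tasks wrap roughly follows the DeepSeek-OCR model card; a hedged sketch, assuming the model was downloaded to the path below (infer() is provided by the model's trust_remote_code implementation):

import torch
from transformers import AutoModel, AutoTokenizer

model_path = "/root/.achatbot/models/deepseek-ai/DeepSeek-OCR"  # assumed local path
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    use_safetensors=True,
    _attn_implementation="flash_attention_2",
)
model = model.eval().cuda().to(torch.bfloat16)

# Prompt grammar follows the model card; base_size/image_size/crop_mode are the
# same knobs exposed as the ocr_* args in the bot config below.
result = model.infer(
    tokenizer,
    prompt="<image>\n<|grounding|>Convert the document to markdown.",
    image_file="doc_page.png",   # illustrative input
    output_path="./out",
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
)
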
  • add llm_transformers_manual_vision_deepseek_ocr
  • deploy deepseek_ocr vision ocr bot with fastapi_webrtc_vision_bot_serve
# deepseek single room bot
modal run src/download_models.py --repo-ids "FunAudioLLM/SenseVoiceSmall"
modal run src/download_models.py --repo-ids "deepseek-ai/DeepSeek-OCR" --revision "refs/pr/23"
modal volume put config ./config/bots/daily_ocr_vision_bot.json /bots/ -f
EXTRA_INDEX_URL=https://pypi.org/simple/ \
    SERVE_TYPE=room_bot \
    CONFIG_FILE=/root/.achatbot/config/bots/daily_ocr_vision_bot.json \
    ACHATBOT_VERSION=0.0.28 \
    IMAGE_NAME=deepseek_ocr IMAGE_CONCURRENT_CN=1 IMAGE_GPU=L4 \
    modal serve src/fastapi_webrtc_vision_bot_serve.py

# run DailyOCRVisionBot with config
curl -XPOST "https://weedge--fastapi-webrtc-vision-deepseek-ocr-bot-srv-app-dev.modal.run/bot_join/chat-room/DailyOCRVisionBot"

daily_ocr_vision_bot.json

{
  "chat_bot_name": "DailyOCRVisionBot",
  "room_name": "chat-room",
  "room_url": "",
  "token": "",
  "services": {},
  "config": {
    "vad": {
      "tag": "silero_vad_analyzer",
      "args": {
        "start_secs": 0.0,
        "stop_secs": 0.0,
        "confidence": 0.7,
        "min_volume": 0.6,
        "onnx": true
      }
    },
    "asr": {
      "tag": "sense_voice_asr",
      "args": {
        "language": "zn",
        "model_name_or_path": "/root/.achatbot/models/FunAudioLLM/SenseVoiceSmall"
      }
    },
    "vision_ocr": {
      "tag": "llm_transformers_manual_vision_deepseek_ocr",
      "args": {
        "lm_model_name_or_path": "/root/.achatbot/models/deepseek-ai/DeepSeek-OCR",
        "lm_device": "cuda",
        "ocr_base_size":1024,
        "ocr_image_size":640,
        "ocr_crop_mode":true
      }
    },
    "tts": {
      "tag": "tts_edge",
      "args": {
        "voice_name": "zh-CN-YunjianNeural",
        "language": "zh",
        "gender": "Male"
      }
    }
  },
  "config_list": []
}
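
The three ocr_* args above line up with the base_size / image_size / crop_mode knobs of DeepSeek-OCR's infer() call. A small illustrative snippet of that mapping (the JSON keys come from the config above; the wiring itself is an assumption about the processor, not its actual code):

import json

with open("config/bots/daily_ocr_vision_bot.json") as f:
    cfg = json.load(f)

ocr_cfg = cfg["config"]["vision_ocr"]
assert ocr_cfg["tag"] == "llm_transformers_manual_vision_deepseek_ocr"

args = ocr_cfg["args"]
infer_kwargs = {
    # config key -> DeepSeek-OCR infer() kwarg
    "base_size": args["ocr_base_size"],    # 1024
    "image_size": args["ocr_image_size"],  # 640
    "crop_mode": args["ocr_crop_mode"],    # true
}
print(args["lm_model_name_or_path"], infer_kwargs)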

Reference

Signed-off-by: weedge <weege007@gmail.com>
gemini-code-assist bot commented
Summary of Changes

Hello @weedge, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the platform's optical character recognition capabilities by integrating the DeepSeek-OCR model. It provides both Hugging Face Transformers and vLLM backends for the model and deploys a new vision bot that utilizes this advanced OCR technology. The changes also include necessary dependency updates and a refactoring of the core OCR interface to streamline functionality.

Highlights

  • DeepSeek-OCR Integration: Added comprehensive support for the DeepSeek-OCR model, including both Hugging Face Transformers and vLLM implementations, enabling advanced Optical Character Recognition capabilities.
  • New Vision Bot Deployment: Introduced a new vision bot that leverages DeepSeek-OCR, deployed via fastapi_webrtc_vision_bot_serve, expanding the platform's multimodal bot offerings.
  • Dependency and Version Updates: Updated the achatbot project version to 0.0.28 and refined various dependencies, including accelerate and onnxruntime, to support the new OCR functionalities.
  • Refactored OCR Interface: The IVisionOCR interface was updated to consolidate streaming and generation logic into a single generate method, improving consistency and simplifying future implementations.
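
For context on the last highlight, here is a minimal sketch of what a consolidated interface might look like; the signature is illustrative, not the actual achatbot definition:

from abc import ABC, abstractmethod
from typing import AsyncGenerator


class IVisionOCR(ABC):
    """Sketch: a single generate() entry point instead of separate stream and generate methods."""

    @abstractmethod
    def generate(self, session) -> AsyncGenerator[str, None]:
        """Yield OCR text chunks for the image carried by the session."""
        ...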

gemini-code-assist bot left a comment
Code Review

This pull request introduces support for the DeepSeek-OCR model, adding implementations for both standard transformers and vLLM, along with Modal deployment configurations. The changes are substantial, including new Modal applications, a new OCR processor, and updates to dependencies and configurations. My review has identified a critical security vulnerability related to the use of eval() on model outputs, which must be addressed. I've also highlighted several areas for improvement in correctness, robustness, and adherence to best practices, such as unsafe path manipulations, potential runtime errors, and insecure use of subprocess. Addressing this feedback will significantly enhance the security and maintainability of the new functionality.
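
On the eval() finding specifically: the usual mitigation when a model emits Python-literal-shaped output is ast.literal_eval, which parses literals without executing code. A generic sketch of the pattern, not the exact patch in this PR:

import ast

def parse_model_output(raw: str):
    """Parse a model-emitted literal (list/dict/str/number) without executing code."""
    try:
        # Unlike eval(), literal_eval rejects function calls, imports,
        # and attribute access in the input.
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return None  # malformed output: fail closed instead of executing it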

weedge and others added 6 commits October 22, 2025 18:02
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@weedge weedge added the vision label Oct 22, 2025
weedge (Collaborator, Author) commented Oct 22, 2025

DeepSeek-OCR 3336.10624 M parameters

DeepseekOCRForCausalLM(
  (model): DeepseekOCRModel(
    (embed_tokens): Embedding(129280, 1280)
    (layers): ModuleList(
      (0): DeepseekV2DecoderLayer(
        (self_attn): LlamaFlashAttention2(
          (q_proj): Linear(in_features=1280, out_features=1280, bias=False)
          (k_proj): Linear(in_features=1280, out_features=1280, bias=False)
          (v_proj): Linear(in_features=1280, out_features=1280, bias=False)
          (o_proj): Linear(in_features=1280, out_features=1280, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): DeepseekV2MLP(
          (gate_proj): Linear(in_features=1280, out_features=6848, bias=False)
          (up_proj): Linear(in_features=1280, out_features=6848, bias=False)
          (down_proj): Linear(in_features=6848, out_features=1280, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): DeepseekV2RMSNorm()
        (post_attention_layernorm): DeepseekV2RMSNorm()
      )
      (1-11): 11 x DeepseekV2DecoderLayer(
        (self_attn): LlamaFlashAttention2(
          (q_proj): Linear(in_features=1280, out_features=1280, bias=False)
          (k_proj): Linear(in_features=1280, out_features=1280, bias=False)
          (v_proj): Linear(in_features=1280, out_features=1280, bias=False)
          (o_proj): Linear(in_features=1280, out_features=1280, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): DeepseekV2MoE(
          (experts): ModuleList(
            (0-63): 64 x DeepseekV2MLP(
              (gate_proj): Linear(in_features=1280, out_features=896, bias=False)
              (up_proj): Linear(in_features=1280, out_features=896, bias=False)
              (down_proj): Linear(in_features=896, out_features=1280, bias=False)
              (act_fn): SiLU()
            )
          )
          (gate): MoEGate()
          (shared_experts): DeepseekV2MLP(
            (gate_proj): Linear(in_features=1280, out_features=1792, bias=False)
            (up_proj): Linear(in_features=1280, out_features=1792, bias=False)
            (down_proj): Linear(in_features=1792, out_features=1280, bias=False)
            (act_fn): SiLU()
          )
        )
        (input_layernorm): DeepseekV2RMSNorm()
        (post_attention_layernorm): DeepseekV2RMSNorm()
      )
    )
    (norm): DeepseekV2RMSNorm()
    (sam_model): ImageEncoderViT(
      (patch_embed): PatchEmbed(
        (proj): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
      )
      (blocks): ModuleList(
        (0-11): 12 x Block(
          (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (attn): Attention(
            (qkv): Linear(in_features=768, out_features=2304, bias=True)
            (proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (mlp): MLPBlock(
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (act): GELU(approximate='none')
          )
        )
      )
      (neck): Sequential(
        (0): Conv2d(768, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): LayerNorm2d()
        (2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (3): LayerNorm2d()
      )
      (net_2): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (net_3): Conv2d(512, 1024, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
    )
    (vision_model): VitModel(
      (embeddings): CLIPVisionEmbeddings(
        (patch_embedding): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
        (position_embedding): Embedding(257, 1024)
      )
      (transformer): NoTPTransformer(
        (layers): ModuleList(
          (0-23): 24 x NoTPTransformerBlock(
            (self_attn): NoTPAttention(
              (qkv_proj): Linear(in_features=1024, out_features=3072, bias=True)
              (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
            )
            (mlp): NoTPFeedForward(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
            )
            (layer_norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            (layer_norm2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          )
        )
      )
      (pre_layrnorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    )
    (projector): MlpProjector(
      (layers): Linear(in_features=2048, out_features=1280, bias=True)
    )
  )
  (lm_head): Linear(in_features=1280, out_features=129280, bias=False)
)
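
The 3336.10624 M figure can be reproduced with a generic PyTorch parameter count over the loaded model (a standard one-liner, not project-specific code):

import torch

def count_params_m(model: torch.nn.Module) -> float:
    """Total parameter count in millions."""
    return sum(p.numel() for p in model.parameters()) / 1e6

# print(f"{count_params_m(model):.5f} M parameters")  # -> 3336.10624 M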

@weedge weedge added OCR Optical Character Recognition CLIP Contrastive Language-Image Pre-training labels Oct 22, 2025
Labels: CLIP (Contrastive Language-Image Pre-training), OCR (Optical Character Recognition), transformers, vision, ViT (vision transformer)