[CI] XPU Deepseek CI Test by Godmook · Pull Request #21874 · sgl-project/sglang

Godmook · 2026-04-01T19:59:00Z

Motivation

The test_deepseek_ocr.py test in the XPU CI suite (per-commit-xpu) is consistently failing with a Non-base64 digit found error. The server's image loader cannot resolve the relative file path ../../examples/assets/example_image.png passed as image_data, causing it to fall through to base64 decoding and fail.
Related Link: https://github.com/sgl-project/sglang/actions/runs/23862150941/job/69572094739?pr=20501
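
To make the failure mode concrete, here is a minimal sketch of a URL -> local file -> base64 fallback chain. It is not the actual sglang loader: the helper name `classify_image_input` is hypothetical, and the standard-library `base64` module stands in for `pybase64`. It only illustrates why a relative path that does not exist relative to the server's working directory ends up in strict base64 decoding and raises `binascii.Error`.

```python
import base64
import binascii
import os


def classify_image_input(data: str) -> str:
    """Hypothetical helper mimicking a URL -> file -> base64 fallback chain."""
    if data.startswith(("http://", "https://")):
        return "url"
    # Relative paths are resolved against the *current process's* working
    # directory, which for a server is usually not the test file's directory.
    if os.path.isfile(data):
        return "file"
    # Everything else is assumed to be base64. Strict validation rejects
    # characters such as '.' and '_' that appear in file paths.
    base64.b64decode(data, validate=True)
    return "base64"


if __name__ == "__main__":
    for candidate in (
        "https://example.com/example_image.png",   # placeholder URL
        "../../no_such_dir/example_image.png",     # path the process cannot see
    ):
        try:
            print(candidate, "->", classify_image_input(candidate))
        except binascii.Error as exc:
            print(candidate, "-> base64 fallback failed:", exc)
```

Under these assumptions, a test and server that do not share a working directory will hit the last branch for any relative path, which matches the traceback below.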

[2026-04-01 17:57:05] Prefill batch, #new-seq: 1, #new-token: 384, #cached-token: 0, token usage: 0.02, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 0.00
[2026-04-01 17:57:05] INFO:     127.0.0.1:60624 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-01 17:57:05] The server is fired up and ready to roll!
[2026-04-01 17:57:09] Prefill batch, #new-seq: 1, #new-token: 128, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 109.89
[2026-04-01 17:57:10] INFO:     127.0.0.1:47006 - "GET /health_generate HTTP/1.1" 200 OK
command=sglang serve --model-path deepseek-ai/DeepSeek-OCR --device xpu --attention-backend intel_xpu --device xpu --host 127.0.0.1 --port 21000
CI_OFFLINE: Launching server HF_HUB_OFFLINE=0 model=deepseek-ai/DeepSeek-OCR
[CI Test Method] TestDeepSeekOCR.test_moe
[2026-04-01 17:57:10] [load_mm_data(simple)] error loading IMAGE data at index=0
Traceback (most recent call last):
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/multimodal/processors/base_processor.py", line 502, in _load_single_item
    img, _ = load_image(data, cls.gpu_image_decode)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/utils/common.py", line 837, in load_image
    image = _load_image(image_file=image_file, gpu_image_decode=gpu_image_decode)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/utils/common.py", line 796, in _load_image
    image_bytes = get_image_bytes(image_file)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/utils/common.py", line 869, in get_image_bytes
    return pybase64.b64decode(image_file, validate=True)
binascii.Error: Non-base64 digit found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/multimodal/processors/base_processor.py", line 821, in fast_load_mm_data
    result = future.result()
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/multimodal/processors/base_processor.py", line 517, in _load_single_item
    raise RuntimeError(f"Error while loading data {data}: {e}")
RuntimeError: Error while loading data ../../examples/assets/example_image.png: Non-base64 digit found
[2026-04-01 17:57:10] INFO:     127.0.0.1:47018 - "POST /generate HTTP/1.1" 500 Internal Server Error
[2026-04-01 17:57:10] ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/multimodal/processors/base_processor.py", line 502, in _load_single_item
    img, _ = load_image(data, cls.gpu_image_decode)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/utils/common.py", line 837, in load_image
    image = _load_image(image_file=image_file, gpu_image_decode=gpu_image_decode)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/utils/common.py", line 796, in _load_image
    image_bytes = get_image_bytes(image_file)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/utils/common.py", line 869, in get_image_bytes
    return pybase64.b64decode(image_file, validate=True)
binascii.Error: Non-base64 digit found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/multimodal/processors/base_processor.py", line 821, in fast_load_mm_data
    result = future.result()
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/multimodal/processors/base_processor.py", line 517, in _load_single_item
    raise RuntimeError(f"Error while loading data {data}: {e}")
RuntimeError: Error while loading data ../../examples/assets/example_image.png: Non-base64 digit found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 410, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/fastapi/applications.py", line 1163, in __call__
    await super().__call__(scope, receive, send)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/applications.py", line 90, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/middleware/cors.py", line 88, in __call__
    await self.app(scope, receive, send)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/routing.py", line 660, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/routing.py", line 680, in app
    await route.handle(scope, receive, send)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/fastapi/routing.py", line 134, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/fastapi/routing.py", line 120, in app
    response = await f(request)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/fastapi/routing.py", line 674, in app
    raw_response = await run_endpoint_function(
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/fastapi/routing.py", line 328, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/entrypoints/http_server.py", line 687, in generate_request
    ret = await _global_state.tokenizer_manager.generate_request(
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/managers/tokenizer_manager.py", line 520, in generate_request
    tokenized_obj = await self._tokenize_one_request(obj)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/managers/tokenizer_manager.py", line 732, in _tokenize_one_request
    mm_inputs: Dict = await self.mm_data_processor.process(
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/managers/async_mm_data_processor.py", line 99, in process
    return await asyncio.wait_for(_invoke(), timeout=self.timeout_s)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/managers/async_mm_data_processor.py", line 70, in _invoke
    return await self._proc_async(
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/multimodal/processors/deepseek_ocr.py", line 31, in process_mm_data_async
    base_output = self.load_mm_data(
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/multimodal/processors/base_processor.py", line 758, in load_mm_data
    return self.fast_load_mm_data(
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/multimodal/processors/base_processor.py", line 828, in fast_load_mm_data
    raise RuntimeError(
RuntimeError: An exception occurred while loading IMAGE data at index 0: Error while loading data ../../examples/assets/example_image.png: Non-base64 digit found
Traceback (most recent call last):
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/requests/models.py", line 978, in json
    return complexjson.loads(self.text, **kwargs)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/utils/common.py", line 2571, in retry
    return fn()
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/test/test_utils.py", line 2095, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/home/sdp/sglang/test/srt/xpu/test_deepseek_ocr.py", line 94, in test_moe
    self.run_decode()
  File "/home/sdp/sglang/test/srt/xpu/test_deepseek_ocr.py", line 68, in run_decode
    ret = self.get_request_json(max_new_tokens=max_new_tokens, n=n)
  File "/home/sdp/sglang/test/srt/xpu/test_deepseek_ocr.py", line 60, in get_request_json
    return response.json()
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/requests/models.py", line 982, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
E
======================================================================
ERROR: test_moe (__main__.TestDeepSeekOCR)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/requests/models.py", line 978, in json
    return complexjson.loads(self.text, **kwargs)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/utils/common.py", line 2571, in retry
    return fn()
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/test/test_utils.py", line 2095, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/home/sdp/sglang/test/srt/xpu/test_deepseek_ocr.py", line 94, in test_moe
    self.run_decode()
  File "/home/sdp/sglang/test/srt/xpu/test_deepseek_ocr.py", line 68, in run_decode
    ret = self.get_request_json(max_new_tokens=max_new_tokens, n=n)
  File "/home/sdp/sglang/test/srt/xpu/test_deepseek_ocr.py", line 60, in get_request_json
    return response.json()
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/requests/models.py", line 982, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/test/test_utils.py", line 2094, in _callTestMethod
    retry(
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/utils/common.py", line 2579, in retry
    raise Exception(f"retry() exceed maximum number of retries.")
Exception: retry() exceed maximum number of retries.

----------------------------------------------------------------------
Ran 1 test in 102.839s

FAILED (errors=1)
.
.
End (0/1):
filename='xpu/test_deepseek_ocr.py', elapsed=108, estimated_time=60
.
.


✗ FAILED: xpu/test_deepseek_ocr.py returned exit code 1

Fail. Time elapsed: 108.25s

============================================================
Test Summary: 0/2 passed
============================================================

✗ FAILED:
  xpu/test_deepseek_ocr.py (exit code 1)
============================================================


+----------------+-------------+
| Suite          | Partition   |
|----------------+-------------|
| per-commit-xpu | full        |
+----------------+-------------+
✅ Executed 2 test(s) (est total 120.0s):
  - xpu/test_deepseek_ocr.py (est_time=60)
  - xpu/test_intel_xpu_backend.py (est_time=60)

Error: Process completed with exit code 255.

Modifications

Replaced the relative file path ../../examples/assets/example_image.png with DEFAULT_IMAGE_URL (a GitHub raw URL) in test/srt/xpu/test_deepseek_ocr.py, consistent with how other VLM tests handle image input.
Added DEFAULT_IMAGE_URL to the imports from sglang.test.test_utils.
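
The shape of the change can be sketched as follows. This is an illustration only, not the diff itself: `PLACEHOLDER_IMAGE_URL` stands in for `DEFAULT_IMAGE_URL` (whose value is not shown in this thread), the helper names are hypothetical, and the prompt text and sampling parameters are assumptions.

```python
# Placeholder standing in for DEFAULT_IMAGE_URL from sglang.test.test_utils;
# the real constant's value is not shown in this thread.
PLACEHOLDER_IMAGE_URL = "https://example.com/example_image.png"

# The path the test used before the change (taken from the error message).
OLD_RELATIVE_PATH = "../../examples/assets/example_image.png"


def build_generate_payload(image_source: str, max_new_tokens: int = 32) -> dict:
    """Hypothetical helper: build a /generate request body with one image."""
    return {
        "text": "<image>\nDescribe the image.",  # illustrative prompt
        "image_data": image_source,
        "sampling_params": {"temperature": 0, "max_new_tokens": max_new_tokens},
    }


def is_location_independent(image_source: str) -> bool:
    """True when the source does not depend on any process's working directory."""
    return image_source.startswith(("http://", "https://"))


if __name__ == "__main__":
    before = build_generate_payload(OLD_RELATIVE_PATH)
    after = build_generate_payload(PLACEHOLDER_IMAGE_URL)
    print("before:", is_location_independent(before["image_data"]))
    print("after: ", is_location_independent(after["image_data"]))
```

A URL is resolved the same way regardless of where the server process was launched, which is why it avoids the path-resolution problem described in Motivation.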

Accuracy Tests

N/A — This change only affects the test file, not model outputs.

Speed Tests and Profiling

N/A — No impact on inference speed.

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

gemini-code-assist · 2026-04-01T19:59:04Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Kangyan-Zhou · 2026-04-01T20:04:03Z

/tag-and-rerun-ci

Godmook · 2026-04-01T20:35:17Z

@Kangyan-Zhou

test_intel_xpu_backend.py failure (separate from this PR)

The OCR test now passes. The XPU backend benchmark fails with UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY while loading meta-llama/Llama-3.2-1B after test_deepseek_ocr.py has already used ~6GB on the GPU. That looks like XPU memory not being fully reclaimed between test files, not a bug in the image URL change. Fixing it probably needs CI or harness changes (ordering, delay, or explicit cache cleanup), not this diff.

So after this PR is merged, I'll try to solve the test_intel_xpu_backend.py issue. I don't think it will be as easy to solve as the OCR test. Does that sound like a good approach?
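
As one possible form of the "explicit cache cleanup" mentioned above, a test teardown could call something like the sketch below. The helper name `release_device_memory` is hypothetical. The sketch assumes a PyTorch build that exposes `torch.xpu`, and it only affects memory cached by the process that calls it; if the memory is held by a server subprocess that has not fully exited, this would not help.

```python
import gc


def release_device_memory() -> bool:
    """Hypothetical teardown helper: best-effort release of cached XPU memory.

    Returns True if an XPU cache flush was attempted, False otherwise.
    """
    # Drop unreachable Python objects first so their tensors can be freed.
    gc.collect()
    try:
        import torch  # optional dependency in this sketch
    except ImportError:
        return False
    xpu = getattr(torch, "xpu", None)
    if xpu is None or not xpu.is_available():
        return False
    # Return cached, currently unused blocks held by the allocator.
    xpu.empty_cache()
    return True


if __name__ == "__main__":
    print("XPU cache flushed:", release_device_memory())
```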

✗ FAILED: xpu/test_intel_xpu_backend.py returned exit code 1

Fail. Time elapsed: 256.75s

============================================================
Test Summary: 1/2 passed
============================================================
✓ PASSED:
  xpu/test_deepseek_ocr.py

✗ FAILED:
  xpu/test_intel_xpu_backend.py (exit code 1)
============================================================


+----------------+-------------+
| Suite          | Partition   |
|----------------+-------------|
| per-commit-xpu | full        |
+----------------+-------------+
✅ Executed 2 test(s) (est total 120.0s):
  - xpu/test_deepseek_ocr.py (est_time=60)
  - xpu/test_intel_xpu_backend.py (est_time=60)

Error: Process completed with exit code 255.

Godmook · 2026-04-01T21:49:57Z

@airMeng @mingfeima @Kangyan-Zhou

I reordered the XPU suite so lighter tests run first. If that fixes CI, it’s the simplest fix. Long term, if we still want strict alphabetical order (as the comment suggests), we’ll need another approach—e.g. isolating runs so each test starts from a clean device memory state.

 # Add Intel XPU tests
-# NOTE: please sort the test cases alphabetically by the test file name
+# NOTE: Intentionally NOT alphabetical. Lighter benchmarks run first because
+# heavy models (e.g. DeepSeek-OCR ~6GB) can leave XPU device memory unreclaimed,
+# causing OOM for subsequent tests on memory-constrained devices.

See: run_suite.py in PR
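
The ordering idea can be sketched like this. It is not the actual run_suite.py code: `SuiteEntry` and `order_light_first` are hypothetical names, and the memory figure for the backend test is an illustrative guess (only the ~6GB figure for DeepSeek-OCR comes from this thread).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteEntry:
    """Hypothetical suite entry; the real suite may track different fields."""
    name: str
    est_time_s: int
    approx_device_mem_gb: float  # assumed field, not part of the real suite


def order_light_first(entries: list[SuiteEntry]) -> list[SuiteEntry]:
    """Run lighter tests first; break ties alphabetically for determinism."""
    return sorted(entries, key=lambda e: (e.approx_device_mem_gb, e.name))


XPU_SUITE = [
    SuiteEntry("xpu/test_deepseek_ocr.py", 60, 6.0),       # ~6GB per the thread
    SuiteEntry("xpu/test_intel_xpu_backend.py", 60, 3.0),  # illustrative guess
]

if __name__ == "__main__":
    for entry in order_light_first(XPU_SUITE):
        print(entry.name)
```

Sorting by an explicit key keeps the order reproducible, but it only hides the underlying problem: if device memory is not reclaimed between files, the last test in any ordering still starts from a partially used device.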

Godmook · 2026-04-01T22:02:48Z

/rerun-failed-ci

airMeng · 2026-04-02T13:28:03Z

@Godmook thank you for your help! Unfortunately, we are seeing some stability issues in the current CI, as you can see from the failures. My colleague is working on #21735 to solve them.

Godmook · 2026-04-02T15:32:49Z

@Godmook thank you for your help! Unfortunately, we are seeing some stability issues in the current CI, as you can see from the failures. My colleague is working on #21735 to solve them.

@airMeng
No worries. 😁 Could you also take a look at #21916 and use it to fix the CI? I think this modification is important because #21916 makes all non-CUDA CI Stopped.

FIX Deepseek-OCR-Test

9dcce42

github-actions bot added the deepseek label Apr 1, 2026

github-actions bot added the run-ci label Apr 1, 2026

Merge origin/main

c29f89f

Kangyan-Zhou requested review from airMeng and mingfeima April 1, 2026 21:41

FIX OOM

7de8847

Merge branch 'main' into fix/deepseek-ocr-test

c1eadb3

Godmook closed this Apr 3, 2026

Godmook deleted the fix/deepseek-ocr-test branch April 3, 2026 03:10

[CI] XPU Deepseek CI Test #21874
Godmook wants to merge 4 commits into sgl-project:main from Godmook:fix/deepseek-ocr-test

3 participants
