[CI] XPU Deepseek CI Test #21874

Closed
Godmook wants to merge 4 commits into sgl-project:main from Godmook:fix/deepseek-ocr-test

@Godmook
Contributor

Godmook commented Apr 1, 2026

Motivation

The test_deepseek_ocr.py test in the XPU CI suite (per-commit-xpu) is consistently failing with a Non-base64 digit found error. The server's image loader cannot resolve the relative file path ../../examples/assets/example_image.png passed as image_data, causing it to fall through to base64 decoding and fail.
Related Link: https://github.com/sgl-project/sglang/actions/runs/23862150941/job/69572094739?pr=20501

[2026-04-01 17:57:05] Prefill batch, #new-seq: 1, #new-token: 384, #cached-token: 0, token usage: 0.02, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 0.00
[2026-04-01 17:57:05] INFO:     127.0.0.1:60624 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-01 17:57:05] The server is fired up and ready to roll!
[2026-04-01 17:57:09] Prefill batch, #new-seq: 1, #new-token: 128, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 109.89
[2026-04-01 17:57:10] INFO:     127.0.0.1:47006 - "GET /health_generate HTTP/1.1" 200 OK
command=sglang serve --model-path deepseek-ai/DeepSeek-OCR --device xpu --attention-backend intel_xpu --device xpu --host 127.0.0.1 --port 21000
CI_OFFLINE: Launching server HF_HUB_OFFLINE=0 model=deepseek-ai/DeepSeek-OCR
[CI Test Method] TestDeepSeekOCR.test_moe
[2026-04-01 17:57:10] [load_mm_data(simple)] error loading IMAGE data at index=0
Traceback (most recent call last):
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/multimodal/processors/base_processor.py", line 502, in _load_single_item
    img, _ = load_image(data, cls.gpu_image_decode)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/utils/common.py", line 837, in load_image
    image = _load_image(image_file=image_file, gpu_image_decode=gpu_image_decode)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/utils/common.py", line 796, in _load_image
    image_bytes = get_image_bytes(image_file)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/utils/common.py", line 869, in get_image_bytes
    return pybase64.b64decode(image_file, validate=True)
binascii.Error: Non-base64 digit found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/multimodal/processors/base_processor.py", line 821, in fast_load_mm_data
    result = future.result()
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/multimodal/processors/base_processor.py", line 517, in _load_single_item
    raise RuntimeError(f"Error while loading data {data}: {e}")
RuntimeError: Error while loading data ../../examples/assets/example_image.png: Non-base64 digit found
[2026-04-01 17:57:10] INFO:     127.0.0.1:47018 - "POST /generate HTTP/1.1" 500 Internal Server Error
[2026-04-01 17:57:10] ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/multimodal/processors/base_processor.py", line 502, in _load_single_item
    img, _ = load_image(data, cls.gpu_image_decode)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/utils/common.py", line 837, in load_image
    image = _load_image(image_file=image_file, gpu_image_decode=gpu_image_decode)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/utils/common.py", line 796, in _load_image
    image_bytes = get_image_bytes(image_file)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/utils/common.py", line 869, in get_image_bytes
    return pybase64.b64decode(image_file, validate=True)
binascii.Error: Non-base64 digit found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/multimodal/processors/base_processor.py", line 821, in fast_load_mm_data
    result = future.result()
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/multimodal/processors/base_processor.py", line 517, in _load_single_item
    raise RuntimeError(f"Error while loading data {data}: {e}")
RuntimeError: Error while loading data ../../examples/assets/example_image.png: Non-base64 digit found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 410, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/fastapi/applications.py", line 1163, in __call__
    await super().__call__(scope, receive, send)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/applications.py", line 90, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/middleware/cors.py", line 88, in __call__
    await self.app(scope, receive, send)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/routing.py", line 660, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/routing.py", line 680, in app
    await route.handle(scope, receive, send)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/fastapi/routing.py", line 134, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/fastapi/routing.py", line 120, in app
    response = await f(request)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/fastapi/routing.py", line 674, in app
    raw_response = await run_endpoint_function(
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/fastapi/routing.py", line 328, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/entrypoints/http_server.py", line 687, in generate_request
    ret = await _global_state.tokenizer_manager.generate_request(
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/managers/tokenizer_manager.py", line 520, in generate_request
    tokenized_obj = await self._tokenize_one_request(obj)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/managers/tokenizer_manager.py", line 732, in _tokenize_one_request
    mm_inputs: Dict = await self.mm_data_processor.process(
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/managers/async_mm_data_processor.py", line 99, in process
    return await asyncio.wait_for(_invoke(), timeout=self.timeout_s)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/managers/async_mm_data_processor.py", line 70, in _invoke
    return await self._proc_async(
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/multimodal/processors/deepseek_ocr.py", line 31, in process_mm_data_async
    base_output = self.load_mm_data(
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/multimodal/processors/base_processor.py", line 758, in load_mm_data
    return self.fast_load_mm_data(
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/multimodal/processors/base_processor.py", line 828, in fast_load_mm_data
    raise RuntimeError(
RuntimeError: An exception occurred while loading IMAGE data at index 0: Error while loading data ../../examples/assets/example_image.png: Non-base64 digit found
Traceback (most recent call last):
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/requests/models.py", line 978, in json
    return complexjson.loads(self.text, **kwargs)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/utils/common.py", line 2571, in retry
    return fn()
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/test/test_utils.py", line 2095, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/home/sdp/sglang/test/srt/xpu/test_deepseek_ocr.py", line 94, in test_moe
    self.run_decode()
  File "/home/sdp/sglang/test/srt/xpu/test_deepseek_ocr.py", line 68, in run_decode
    ret = self.get_request_json(max_new_tokens=max_new_tokens, n=n)
  File "/home/sdp/sglang/test/srt/xpu/test_deepseek_ocr.py", line 60, in get_request_json
    return response.json()
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/requests/models.py", line 982, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
E
======================================================================
ERROR: test_moe (__main__.TestDeepSeekOCR)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/requests/models.py", line 978, in json
    return complexjson.loads(self.text, **kwargs)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/utils/common.py", line 2571, in retry
    return fn()
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/test/test_utils.py", line 2095, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/home/sdp/sglang/test/srt/xpu/test_deepseek_ocr.py", line 94, in test_moe
    self.run_decode()
  File "/home/sdp/sglang/test/srt/xpu/test_deepseek_ocr.py", line 68, in run_decode
    ret = self.get_request_json(max_new_tokens=max_new_tokens, n=n)
  File "/home/sdp/sglang/test/srt/xpu/test_deepseek_ocr.py", line 60, in get_request_json
    return response.json()
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/requests/models.py", line 982, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/test/test_utils.py", line 2094, in _callTestMethod
    retry(
  File "/home/sdp/miniforge3/envs/py3.10/lib/python3.10/site-packages/sglang/srt/utils/common.py", line 2579, in retry
    raise Exception(f"retry() exceed maximum number of retries.")
Exception: retry() exceed maximum number of retries.

----------------------------------------------------------------------
Ran 1 test in 102.839s

FAILED (errors=1)
.
.
End (0/1):
filename='xpu/test_deepseek_ocr.py', elapsed=108, estimated_time=60
.
.


✗ FAILED: xpu/test_deepseek_ocr.py returned exit code 1

Fail. Time elapsed: 108.25s

============================================================
Test Summary: 0/2 passed
============================================================

✗ FAILED:
  xpu/test_deepseek_ocr.py (exit code 1)
============================================================


+----------------+-------------+
| Suite          | Partition   |
|----------------+-------------|
| per-commit-xpu | full        |
+----------------+-------------+
✅ Executed 2 test(s) (est total 120.0s):
  - xpu/test_deepseek_ocr.py (est_time=60)
  - xpu/test_intel_xpu_backend.py (est_time=60)

Error: Process completed with exit code 255.
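
The fall-through in the traceback above can be reproduced without a server. The sketch below is not SGLang's loader: `resolve_image_source` is a hypothetical helper that assumes a URL, then file, then base64 order of checks, and it uses the standard-library `base64` module in place of `pybase64`. It only illustrates why a relative path that does not exist from the server's working directory ends up rejected by strict base64 validation.

```python
import base64
import binascii
import os


def resolve_image_source(image_data: str) -> str:
    """Hypothetical classifier: URL first, then local file, then base64."""
    if image_data.startswith(("http://", "https://")):
        return "url"
    if os.path.isfile(image_data):
        return "file"
    # Everything else is assumed to be base64. Strict validation rejects
    # characters such as '.' that are outside the base64 alphabet.
    base64.b64decode(image_data, validate=True)
    return "base64"


relative_path = "../../examples/assets/example_image.png"
try:
    kind = resolve_image_source(relative_path)
except binascii.Error as exc:
    # Reached when the path does not exist relative to the current directory.
    kind = f"rejected ({type(exc).__name__})"
print(kind)
```

A relative path is resolved against the working directory of the process that opens it, which is the server here and not the test file, so the same string can be valid in one process and invalid in the other.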

Modifications

  • Replaced the relative file path ../../examples/assets/example_image.png with DEFAULT_IMAGE_URL (a GitHub raw URL) in test/srt/xpu/test_deepseek_ocr.py, consistent with how other VLM tests handle image input.
  • Added DEFAULT_IMAGE_URL to the imports from sglang.test.test_utils.
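
As a rough illustration of the change, the request body can carry a URL in `image_data` so that it no longer depends on the server's working directory. This is a sketch under assumptions: `build_generate_payload` is a hypothetical helper, `EXAMPLE_IMAGE_URL` is a placeholder standing in for `DEFAULT_IMAGE_URL`, the prompt is illustrative, and the field layout (`text`, `image_data`, `sampling_params`) is assumed from the `/generate` call seen in the log rather than copied from the test file.

```python
# Placeholder only; the PR itself uses DEFAULT_IMAGE_URL from sglang.test.test_utils.
EXAMPLE_IMAGE_URL = "https://example.com/example_image.png"


def build_generate_payload(prompt: str, image_url: str, max_new_tokens: int = 32) -> dict:
    """Build a request body that references the image by URL (hypothetical helper)."""
    if not image_url.startswith(("http://", "https://")):
        # Refuse paths here so a relative path cannot reach the server again.
        raise ValueError(f"expected an http(s) URL, got: {image_url!r}")
    return {
        "text": prompt,
        "image_data": image_url,
        "sampling_params": {"temperature": 0, "max_new_tokens": max_new_tokens},
    }


payload = build_generate_payload("<image>\nDescribe the image.", EXAMPLE_IMAGE_URL)
print(payload["image_data"])
```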

Accuracy Tests

N/A — This change only affects the test file, not model outputs.

Speed Tests and Profiling

N/A — No impact on inference speed.

Checklist

@Kangyan-Zhou
Collaborator

/tag-and-rerun-ci

github-actions bot added the run-ci label Apr 1, 2026
@Godmook
Contributor Author

Godmook commented Apr 1, 2026

@Kangyan-Zhou

test_intel_xpu_backend.py failure (separate from this PR)

The OCR test now passes. The XPU backend benchmark fails with UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY while loading meta-llama/Llama-3.2-1B after test_deepseek_ocr.py has already used ~6GB on the GPU. That looks like XPU memory not being fully reclaimed between test files, not a bug in the image URL change. Fixing it probably needs CI or harness changes (ordering, delay, or explicit cache cleanup), not this diff.

So once this PR is merged, I'll try to fix the test_intel_xpu_backend.py issue. I don't think it will be as easy to solve as the OCR test. Does that sound like a good approach?

✗ FAILED: xpu/test_intel_xpu_backend.py returned exit code 1

Fail. Time elapsed: 256.75s

============================================================
Test Summary: 1/2 passed
============================================================
✓ PASSED:
  xpu/test_deepseek_ocr.py

✗ FAILED:
  xpu/test_intel_xpu_backend.py (exit code 1)
============================================================


+----------------+-------------+
| Suite          | Partition   |
|----------------+-------------|
| per-commit-xpu | full        |
+----------------+-------------+
✅ Executed 2 test(s) (est total 120.0s):
  - xpu/test_deepseek_ocr.py (est_time=60)
  - xpu/test_intel_xpu_backend.py (est_time=60)

Error: Process completed with exit code 255.
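
One of the options named above, explicit cache cleanup, could look roughly like the following. It is a sketch and not part of this PR: `release_xpu_cache` is a hypothetical helper, and it assumes a PyTorch build that exposes `torch.xpu`. It can only release memory cached by the process that calls it, so a server process that still holds device memory would have to exit first.

```python
import gc


def release_xpu_cache() -> bool:
    """Best-effort cleanup; returns True only if an XPU cache flush was issued."""
    # Drop unreachable Python objects first so their tensors can be freed.
    gc.collect()
    try:
        import torch
    except ImportError:
        return False
    xpu = getattr(torch, "xpu", None)
    if xpu is None or not xpu.is_available():
        return False
    # Return cached, currently unused device memory to the driver.
    xpu.empty_cache()
    return True


print(release_xpu_cache())
```

Whether this is enough depends on where the memory is held. If the OOM comes from a previous server process that has not fully shut down, ordering or a wait between test files would matter more than a cache flush in the harness.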

@Godmook
Contributor Author

Godmook commented Apr 1, 2026

@airMeng @mingfeima @Kangyan-Zhou

I reordered the XPU suite so lighter tests run first. If that fixes CI, it’s the simplest fix. Long term, if we still want strict alphabetical order (as the comment suggests), we’ll need another approach—e.g. isolating runs so each test starts from a clean device memory state.

 # Add Intel XPU tests
-# NOTE: please sort the test cases alphabetically by the test file name
+# NOTE: Intentionally NOT alphabetical. Lighter benchmarks run first because
+# heavy models (e.g. DeepSeek-OCR ~6GB) can leave XPU device memory unreclaimed,
+# causing OOM for subsequent tests on memory-constrained devices.

See: run_suite.py in PR
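
The reordering idea can be sketched independently of run_suite.py. Everything below is hypothetical: `SuiteEntry` and its `est_memory_gb` field are not part of the suite definition, and the figure for test_intel_xpu_backend.py is an illustrative guess. Only the roughly 6 GB figure for DeepSeek-OCR comes from this thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SuiteEntry:
    name: str
    estimated_time: int   # seconds, as in the CI summary
    est_memory_gb: float  # hypothetical field, not in the real suite


def order_light_first(entries: List[SuiteEntry]) -> List[SuiteEntry]:
    """Sort by estimated device memory, breaking ties alphabetically by name."""
    return sorted(entries, key=lambda e: (e.est_memory_gb, e.name))


suite = [
    SuiteEntry("xpu/test_deepseek_ocr.py", 60, 6.0),
    SuiteEntry("xpu/test_intel_xpu_backend.py", 60, 2.5),  # assumed footprint
]

for entry in order_light_first(suite):
    print(entry.name)
```

Keeping the name as a secondary sort key preserves alphabetical order among tests with similar footprints, which stays close to the original convention in the comment.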

@Godmook
Contributor Author

Godmook commented Apr 1, 2026

/rerun-failed-ci

@airMeng
Collaborator

airMeng commented Apr 2, 2026

@Godmook thank you for your help! Unfortunately, we are seeing some stability issues in the current CI, as you can see from the failures. My colleague is working on #21735 to solve them.

@Godmook
Contributor Author

Godmook commented Apr 2, 2026

@Godmook thank you for your help! Unfortunately, we are seeing some stability issues in the current CI, as you can see from the failures. My colleague is working on #21735 to solve them.

@airMeng
No worries. 😁 Could you also take a look at #21916 and use it to fix the CI? I think this change is important because #21916 makes all non-CUDA CI Stopped.

Godmook closed this Apr 3, 2026
Godmook deleted the fix/deepseek-ocr-test branch April 3, 2026 03:10

3 participants