rapidocr_api queue implementation code #338

nzm001 · 2025-02-12T10:09:27Z

Problem Description

It was mentioned earlier that the worker parameter of rapidocr_api currently has no real effect.

Runtime Environment

Ubuntu 24.04
Python 3.12.3

Code

Server

from flask import Flask, request, jsonify
import base64
from queue import Queue
from rapidocr_onnxruntime import RapidOCR
import sys
import logging

# Configure the log format
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

app = Flask(__name__)

# Initialize the OCR engine pool
def init_engine_pool(pool_size):
    engine_queue = Queue(maxsize=pool_size)
    for _ in range(pool_size):
        engine = RapidOCR()
        engine_queue.put(engine)
    return engine_queue

# Command-line argument handling
if len(sys.argv) != 2:
    print("Usage: python ocr_server.py N")
    sys.exit(1)

try:
    POOL_SIZE = int(sys.argv[1])
    # A pool of size 0 would never hand out an engine, so reject it
    if POOL_SIZE < 1:
        raise ValueError
except ValueError:
    print("Error: N must be a positive integer")
    sys.exit(1)

engine_pool = init_engine_pool(POOL_SIZE)

@app.route('/ocr', methods=['POST'])
def ocr_service():
    # Validate parameters (silent=True returns None instead of raising on a non-JSON body)
    data = request.get_json(silent=True)
    if not isinstance(data, dict) or 'image' not in data:
        return jsonify({"error": "Missing image parameter"}), 400

    # Base64 decoding (a data URL prefix, if present, is dropped at the last comma)
    try:
        img_b64 = data['image'].split(',')[-1]
        img_bytes = base64.b64decode(img_b64)
    except Exception as e:
        logging.error(f"Base64 decoding failed: {e}")
        return jsonify({"error": f"Invalid image data: {e}"}), 400

    # Borrow an OCR engine; this blocks while all engines are busy
    engine = engine_pool.get()
    try:
        # Pass the raw bytes straight to the engine
        result, elapse = engine(img_bytes)
    except Exception as e:
        logging.error(f"OCR failed: {e}")
        return jsonify({"error": f"OCR failed: {e}"}), 500
    finally:
        engine_pool.put(engine)

    # Format the result (adjust to the actual data structure);
    # result may be None when no text is detected
    formatted = []
    for item in result or []:
        formatted.append({
            "coordinates": item[0],  # keep the original coordinate structure
            "text": item[1],
            "confidence": float(item[2])
        })

    return jsonify({
        "result": formatted,
        "processing_time": elapse,
        "engine_count": POOL_SIZE
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=9003, threaded=True)

Run the server (starts 5 engine instances)

python ocr_server.py 5
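For anyone who wants to see the pooling idea in isolation, here is a minimal sketch of the same queue pattern without Flask or RapidOCR. `FakeEngine`, `make_pool`, `borrow` and `run_jobs` are names made up for this illustration (they are not part of rapidocr_api), and the fake engine only sleeps. It shows that the number of engines in use at once never exceeds the pool size.

```python
import threading
import time
from contextlib import contextmanager
from queue import Queue


class FakeEngine:
    """Stand-in for an OCR engine (assumption); sleeps instead of running inference."""

    def __call__(self, data):
        time.sleep(0.01)
        return len(data)


def make_pool(size, factory=FakeEngine):
    """Pre-create `size` engines and park them in a bounded queue."""
    pool = Queue(maxsize=size)
    for _ in range(size):
        pool.put(factory())
    return pool


@contextmanager
def borrow(pool):
    """Block until an engine is free, and always hand it back afterwards."""
    engine = pool.get()
    try:
        yield engine
    finally:
        pool.put(engine)


def run_jobs(pool, n_jobs):
    """Run n_jobs threads against the pool; return the peak number of engines in use."""
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def job():
        with borrow(pool) as engine:
            with lock:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            engine(b"fake image bytes")
            with lock:
                state["active"] -= 1

    threads = [threading.Thread(target=job) for _ in range(n_jobs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]
```

Calling `run_jobs(make_pool(3), 10)` starts ten threads, but at most three of them hold an engine at any moment; the rest wait inside `pool.get()`.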

Client

import requests
import base64
import time

def test_ocr(image_path, url='http://localhost:9003/ocr'):
    with open(image_path, 'rb') as f:
        img_b64 = base64.b64encode(f.read()).decode()

    start = time.time()
    response = requests.post(url, json={'image': img_b64})
    latency = time.time() - start

    if response.status_code == 200:
        print(f"OCR succeeded | latency: {latency:.2f}s")
        return response.json()
    else:
        print(f"OCR failed | status code: {response.status_code}")
        return response.text

# Concurrency test example
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(test_ocr, 'test.jpg') for _ in range(10)]
    results = [f.result() for f in futures]
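A side note on the payload: because the server splits the image field at the last comma, it should accept either a bare base64 string (what the client above sends) or a data URL. The helper below is a small sketch of that decoding step on its own. `decode_image_field` is a name invented here, and `validate=True` is a stricter choice than the server code makes.

```python
import base64


def decode_image_field(value):
    """Decode a bare base64 string or a data URL into raw bytes."""
    # Everything after the last comma is the payload; a bare string has no comma.
    payload = value.split(",")[-1]
    # validate=True rejects characters outside the base64 alphabet instead of
    # silently skipping them (b64decode is lenient by default).
    return base64.b64decode(payload, validate=True)
```

With the lenient default, junk characters are dropped before decoding, so a malformed request can slip through to the OCR engine; strict validation turns that into a 400 earlier.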

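Notes

The API response now matches what the runtime returns, so the API also reports processing time and confidence. The three processing_time values returned by the server can be used to tune performance.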

{'engine_count': 4,
 'processing_time': [3.3301830849995895,
  0.4641904830932617,
  22.541740894317627],
 'result': [{'confidence': 0.9690895353044783,
   'coordinates': [[95.0, 16.0], [502.0, 16.0], [502.0, 35.0], [95.0, 35.0]],
   'text': '广州软件院开箱测评12款开源OCR，Rapid0CR表现优异→link'},
  {'confidence': 0.5223508477210999,
   'coordinates': [[1264.0, 17.0],
    [1283.0, 17.0],
    [1283.0, 35.0],
    [1264.0, 35.0]],
   'text': '×'},
...
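To make the three processing_time values easier to read, something like the sketch below could be used on the client side. The stage names `det`, `cls` and `rec` are an assumption about the order in which the engine reports its timings, and `stage_breakdown` is a helper invented for this example.

```python
def stage_breakdown(processing_time, names=("det", "cls", "rec")):
    """Pair each elapsed time with a stage name and its share of the total.

    The default names are an assumption about the order of the values.
    """
    total = sum(processing_time)
    return [
        {"stage": name, "seconds": sec, "share": sec / total if total else 0.0}
        for name, sec in zip(names, processing_time)
    ]


# The processing_time list from the sample response above
sample = [3.3301830849995895, 0.4641904830932617, 22.541740894317627]
breakdown = stage_breakdown(sample)
slowest = max(breakdown, key=lambda row: row["seconds"])
print(slowest["stage"], f"{slowest['share']:.0%}")
```

For the sample response, the third value dominates the total, so that is the stage to look at first.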

Running with openvino prints the warning below; the openvino.runtime import in utils/infer_engine.py probably needs to be changed.

<frozen importlib.util>:208: DeprecationWarning: The `openvino.runtime` module is deprecated and will be removed in the 2026.0 release. Please replace `openvino.runtime` with `openvino`.
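One possible shape for that change is an import that prefers the top-level `openvino` package and only falls back to `openvino.runtime` on older installs. This is a sketch, not the actual contents of utils/infer_engine.py; `load_openvino_core` is a made-up helper, and it returns None when OpenVINO is not installed at all.

```python
import importlib


def load_openvino_core():
    """Return the OpenVINO Core class, or None if OpenVINO is not installed."""
    # Try the new location first so the deprecated module is never imported
    # on releases that already expose Core at the top level.
    for module_name in ("openvino", "openvino.runtime"):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue
        core = getattr(module, "Core", None)
        if core is not None:
            return core
    return None
```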

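This is not a bug, but it is worth stating: on a CPU with, say, 6 cores and 12 threads, running 6 instances that recognize 6 images at the same time, RapidOCR() takes about twice as long as RapidOCR(intra_op_num_threads=1). The instances get in each other's way perfectly. (I don't know what inter_op_num_threads is for; in my tests its effect on the CPU was within the margin of error. Maybe it only matters on dual-CPU machines?)
My guess is that the following is an openvino issue: when testing openvino, whether with the code here or by running several rapidocr_api (openvino version) processes at once, 5 instances only used one core, or rather half of the cores. The test CPU has 2 cores and 4 threads; in htop, of logical cores 0-3 only 2 and 3 were at 100% usage, while 0 and 1 were at only 10-15%.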
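Based on the intra_op_num_threads observation above, a simple rule of thumb is to split the physical cores across the pool so the engines do not oversubscribe the CPU. The helper below is only a sketch of that arithmetic; `threads_per_engine` is a hypothetical name, and whether an even split is the best setting on a given machine would need measuring.

```python
def threads_per_engine(physical_cores, pool_size):
    """Split physical cores evenly across engines, with at least one thread each."""
    if physical_cores < 1 or pool_size < 1:
        raise ValueError("physical_cores and pool_size must both be >= 1")
    return max(1, physical_cores // pool_size)


# Hypothetical usage with the keyword argument quoted in this issue:
#   n = threads_per_engine(6, 6)
#   engine = RapidOCR(intra_op_num_threads=n)
```

For the 6-core example with 6 engines this gives 1, which matches the faster configuration reported above.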


SWHL · 2025-02-12T10:35:25Z

Great question and observations. I'll test this when I have time.

nzm001 · 2025-02-12T17:35:09Z

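If you use gunicorn, the sys.argv command-line handling above has to be replaced with an environment variable. With gunicorn I saw no benefit; it actually ran slower and adds one more thing to maintain, so I don't recommend it for personal use.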
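The switch from sys.argv to an environment variable can be sketched with a bit of validation, so that a typo or a value of 0 fails at startup instead of hanging the first request. `read_pool_size` is a helper invented for this example; the variable name `OCR_ENGINE_POOL_SIZE` is the one used in the server code in this comment.

```python
import os


def read_pool_size(env=None, name="OCR_ENGINE_POOL_SIZE", default=1):
    """Read the pool size from the environment, falling back to `default`."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        size = int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None
    # A pool of size 0 would never hand out an engine, so reject it early.
    if size < 1:
        raise ValueError(f"{name} must be >= 1, got {size}")
    return size
```

Passing a plain dict as `env` makes the function easy to check without touching the real environment.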

Server

from flask import Flask, request, jsonify
import base64
from queue import Queue
from rapidocr_onnxruntime import RapidOCR
import logging
import os

# Configure the log format
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

app = Flask(__name__)

# Initialize the OCR engine pool
def init_engine_pool(pool_size):
    engine_queue = Queue(maxsize=pool_size)
    for _ in range(pool_size):
        engine = RapidOCR()
        engine_queue.put(engine)
    logging.info(f"Initialized OCR engine pool with size: {pool_size}")
    return engine_queue

# Read the configuration from an environment variable
POOL_SIZE = int(os.environ.get('OCR_ENGINE_POOL_SIZE', 1))  # defaults to 1 instance
engine_pool = init_engine_pool(POOL_SIZE)

@app.route('/ocr', methods=['POST'])
def ocr_service():
    # Validate parameters (silent=True returns None instead of raising on a non-JSON body)
    data = request.get_json(silent=True)
    if not isinstance(data, dict) or 'image' not in data:
        return jsonify({"error": "Missing image parameter"}), 400

    # Base64 decoding (a data URL prefix, if present, is dropped at the last comma)
    try:
        img_b64 = data['image'].split(',')[-1]
        img_bytes = base64.b64decode(img_b64)
    except Exception as e:
        logging.error(f"Base64 decoding failed: {e}")
        return jsonify({"error": f"Invalid image data: {e}"}), 400

    # Borrow an OCR engine; this blocks while all engines are busy
    engine = engine_pool.get()
    try:
        # Pass the raw bytes straight to the engine
        result, elapse = engine(img_bytes)
    except Exception as e:
        logging.error(f"OCR failed: {e}")
        return jsonify({"error": f"OCR failed: {e}"}), 500
    finally:
        engine_pool.put(engine)

    # Format the result (adjust to the actual data structure);
    # result may be None when no text is detected
    formatted = []
    for item in result or []:
        formatted.append({
            "coordinates": item[0],  # keep the original coordinate structure
            "text": item[1],
            "confidence": float(item[2])
        })

    return jsonify({
        "result": formatted,
        "processing_time": elapse,
        "engine_count": POOL_SIZE
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=9003, threaded=True)

Run the server (4 gunicorn workers; each worker process imports the module and builds its own pool of 4 engines)

OCR_ENGINE_POOL_SIZE=4 gunicorn -w 4 -b 0.0.0.0:9003 ocr_server:app
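Since each gunicorn worker is a separate process with its own engine pool, and the default sync worker handles one request at a time, a setup that may fit the queue design better is a single worker with several threads. The config below is an untested sketch under that assumption; the file name `gunicorn.conf.py` and the specific numbers are illustrative, and `OCR_ENGINE_POOL_SIZE=4` would still be set on the command line.

```python
# gunicorn.conf.py (illustrative; load with: gunicorn -c gunicorn.conf.py ocr_server:app)

# Listen on the same address as the command above
bind = "0.0.0.0:9003"

# One process, so there is exactly one engine pool
workers = 1

# Threaded worker, so several requests can wait on the pool concurrently
worker_class = "gthread"

# Match the thread count to the pool size so every engine can be in use
threads = 4
```

This keeps the total number of engines equal to the pool size instead of workers times pool size. Whether it is actually faster than the plain Flask server would need to be measured.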
