Add IOBinding support to ONNX Runtime module #421
Conversation
lewtun
left a comment
Thanks for working on this killer feature @JingyaHuang 🔥 !!
I've left a few nits, but the API design looks great to me. Would you mind sharing a small code example in the PR description once you have everything ready for a final review?
I'm especially interested to know if quantization / optimization play nicely with the current implementation :)
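(For context, here is a hedged sketch of how quantization could be combined with the new `use_io_binding` flag; the model id, quantization config and file names are illustrative only, and the benchmark snippet further below only exercises graph optimization.)

```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative model

# Export to ONNX, apply dynamic quantization, then reload with IO binding enabled.
model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)
quantizer = ORTQuantizer.from_pretrained(model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="quantized_model", quantization_config=qconfig)

q_model = ORTModelForSequenceClassification.from_pretrained(
    "quantized_model", file_name="model_quantized.onnx", use_io_binding=True
)
```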
JingyaHuang
left a comment
Hi @lewtun, thanks for the review!!!
I will apply the helper to the other ORT models and update the PR description with a snippet once it is finished.
philschmid
left a comment
Awesome work @JingyaHuang! I added some first comments to the io_binding_helper.py and first model.
Most of my comments are about performance. I am not sure we need to keep things that dynamic: since we have dedicated classes for each "task", we can have more "static/defined" code in the forward method to improve latency.
It would be great if you could take a look at my comments and try to evaluate the performance of replacing those "dynamic loops" with a more "static" approach.
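For illustration, a minimal sketch of the difference being discussed (the helper names are hypothetical, not the actual code in io_binding_helper.py):

```python
# "Dynamic" style: introspect the session on every forward call.
def bind_outputs_dynamic(session, io_binding, device_id=0):
    for output in session.get_outputs():
        io_binding.bind_output(output.name, "cuda", device_id)

# "Static" style: a task-specific class knows its output names up front,
# so the per-call loop and name lookups disappear.
def bind_outputs_static_for_classification(io_binding, device_id=0):
    io_binding.bind_output("logits", "cuda", device_id)
```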
JingyaHuang
left a comment
@philschmid Thanks for reviewing! I have made some modifications according to your comments, and the tests are also added.
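For reference, a rough sketch of the kind of equivalence check such tests typically perform (the model id, tolerances, and exact class are illustrative, not the actual test code in this PR):

```python
import torch
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

device = torch.device("cuda:0")
model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative model

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("IO binding should not change the outputs.", return_tensors="pt").to(device)

# Same model loaded with and without IO binding; the logits should match.
model_io = ORTModelForSequenceClassification.from_pretrained(
    model_id, from_transformers=True, use_io_binding=True
)
model_io.to(device)
model_no_io = ORTModelForSequenceClassification.from_pretrained(
    model_id, from_transformers=True, use_io_binding=False
)
model_no_io.to(device)

torch.testing.assert_close(
    model_io(**inputs).logits.cpu(), model_no_io(**inputs).logits.cpu(), atol=1e-4, rtol=1e-4
)
```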
Hi folks, thanks for helping out and reviewing! I think the PR is ready for a final review. The IO binding is now applied directly to all ORTModels except for [...]. Also, since our last discussion, the buffers' size for [...].

@lewtun Here is a (rough) snippet that I use for benchmarking:

```python
from pathlib import Path
import numpy as np
import pandas as pd
from time import perf_counter
import torch
from transformers import AutoConfig, AutoModelForSeq2SeqLM, AutoTokenizer
from optimum.onnxruntime.modeling_seq2seq import ORTModelForSeq2SeqLM
from optimum.onnxruntime import ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig, OptimizationConfig
model_id = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
onnx_path = Path("results_seq2seq/")
seq_lengths = [8, 16, 32, 64, 128, 256, 512]
# Load vanilla onnx model
model = ORTModelForSeq2SeqLM.from_pretrained(model_id, from_transformers=True)
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)
# Graph optimization
optimizer = ORTOptimizer.from_pretrained(model)
optimization_config = OptimizationConfig(optimization_level=2) # enable all optimizations
optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)
def benchmark(seq_len, model, tokenizer, device, iterations=200):
    # prepare data
    seq_len = "l " * (seq_len - 2)
    payload = tokenizer(seq_len, return_tensors="pt")
    payload = {key: val.to(device) for key, val in payload.items()}
    latencies = []
    # warm up
    for _ in range(10):
        _ = model.generate(**payload)
    # timed run
    for _ in range(iterations):
        start_time = perf_counter()
        _ = model.generate(**payload)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_p95_ms = 1000 * np.percentile(latencies, 95)
    return {"seq_len": payload["input_ids"].shape[1], "time_avg_ms": time_avg_ms, "time_p95_ms": time_p95_ms}
device = torch.device("cuda:0")
# Baseline: PyTorch
config = AutoConfig.from_pretrained(model_id, use_cache=True)
pt_model = AutoModelForSeq2SeqLM.from_config(config)
pt_model.to(device)
# Case 1: Vanilla onnx with IO binding
v_onnx_model = ORTModelForSeq2SeqLM.from_pretrained(
    model_id, from_transformers=True, use_cache=True, use_io_binding=True
)
v_onnx_model.to(device)
# Case 2: graph optimized onnx with IOBinding
optim_onnx_model = ORTModelForSeq2SeqLM.from_pretrained(
model_id="results_seq2seq",
encoder_file_name="encoder_model_optimized.onnx",
decoder_file_name="decoder_model_optimized.onnx",
decoder_with_past_file_name="decoder_with_past_model_optimized.onnx",
)
optim_onnx_model.to(device)
# Benchmark
res = []
for seq_len in seq_lengths:
print("seq_len: ", seq_len)
pt = benchmark(seq_len, pt_model, tokenizer, device, iterations=500)
res.append({**pt, "model": "pt"})
v_onnx = benchmark(seq_len, v_onnx_model, tokenizer, device, iterations=500)
res.append({**v_onnx, "model": "v_onnx"})
optim_onnx = benchmark(seq_len, optim_onnx_model, tokenizer, device, iterations=500)
res.append({**optim_onnx, "model": "optim_onnx"})
df = pd.DataFrame(res)
print(df)
chart_df = pd.merge(
    df[df.model == "pt"][["seq_len", "time_p95_ms"]],
    df[df.model == "v_onnx"][["seq_len", "time_p95_ms"]],
    on="seq_len",
)
chart_df = chart_df.rename(
    columns={
        "time_p95_ms_x": "pt_p95",
        "time_p95_ms_y": "v_onnx_p95",
    }
)
chart_df = pd.merge(
    chart_df,
    df[df.model == "optim_onnx"][["seq_len", "time_p95_ms"]],
    on="seq_len",
)
chart_df = chart_df.rename(
    columns={
        "time_p95_ms": "optim_onnx_p95",
    }
)
chart_df["io_improvement/pt"] = f"{round((chart_df['pt_p95'] - chart_df['v_onnx_p95']) / chart_df['pt_p95'] * 100,2)}%"
chart_df["io+optim/pt"] = f"{round((chart_df['pt_p95'] - chart_df['optim_onnx_p95']) / chart_df['pt_p95'] * 100,2)}%"
plt = chart_df.plot(x="seq_len", y=["pt_p95", "v_onnx_p95", "optim_onnx_p95"], kind="line")
plt.figure.savefig("gpu_res_iobinding_seq2seq.png", dpi=900)
print(chart_df.head(10))
chart_df.to_csv("gpu_res_iobinding_seq2seq.csv")
```

@echarlaix I have added some seq2seq models to [...]. Also gently pinging @philschmid, @michaelbenayoun and @fxmarty.
regisss
left a comment
Huge PR @JingyaHuang 🔥 🚀
I just left a couple of minor comments.
philschmid
left a comment
Awesome work @JingyaHuang 🚀✅ Looks good to me. Everything else can be a follow-up PR.
I left two minor comments.
Context
As reported by users, there is sometimes a significant performance drop when using devices for acceleration, and the slowdown is especially pronounced for decoders (and therefore also for seq2seq models). This is due to the large overhead of copying data between the host and the device.
This PR introduces ONNX Runtime's IOBinding to place the inputs and pre-allocate the outputs directly on the device.
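For readers unfamiliar with IOBinding, here is a minimal sketch of what it looks like at the raw ONNX Runtime level (the model path and tensor names are placeholders, not the actual Optimum internals):

```python
import numpy as np
import onnxruntime as ort

# Run the model on GPU with IOBinding: the input stays on the device and the
# output buffer is allocated there too, so the run does not copy tensors
# back and forth between host and device.
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

input_ids = np.ones((1, 128), dtype=np.int64)
input_on_gpu = ort.OrtValue.ortvalue_from_numpy(input_ids, "cuda", 0)

io_binding = session.io_binding()
io_binding.bind_ortvalue_input("input_ids", input_on_gpu)  # input name assumed for illustration
io_binding.bind_output("logits", "cuda", 0)                # let ORT allocate the output on GPU

session.run_with_iobinding(io_binding)
logits_on_gpu = io_binding.get_outputs()[0]  # OrtValue that still lives on the device
```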
Associated issues:
#362 #365 #404 #414