Add IOBinding support to ONNX Runtime module #421
Conversation
lewtun
left a comment
Thanks for working on this killer feature @JingyaHuang 🔥 !!
I've left a few nits, but the API design looks great to me. Would you mind sharing a small code example in the PR description once you have everything ready for a final review?
I'm especially interested to know if quantization / optimization play nicely with the current implementation :)
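(For context, here is a hedged sketch of how quantization could be combined with the new `use_io_binding` flag; the model id, quantization config and file names are illustrative only, and the benchmark snippet further below only exercises graph optimization.)

```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative model

# Export to ONNX, apply dynamic quantization, then reload with IO binding enabled.
model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)
quantizer = ORTQuantizer.from_pretrained(model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="quantized_model", quantization_config=qconfig)

q_model = ORTModelForSequenceClassification.from_pretrained(
    "quantized_model", file_name="model_quantized.onnx", use_io_binding=True
)
```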
JingyaHuang
left a comment
Hi @lewtun, thanks for the review!!!
I will apply the helper to the other ORT models and update the PR description with a snippet once it is finished.
philschmid
left a comment
Awesome work @JingyaHuang! I added some first comments to the io_binding_helper.py and first model.
Most of my comments are about performance. I am not sure we need to keep things that dynamic: since we have dedicated classes for each "task", we can have more "static/defined" code in the forward method to improve latency.
It would be great if you could take a look at my comments and try to evaluate the performance of replacing those "dynamic loops" with a more "static" approach.
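For illustration, a minimal sketch of the difference being discussed (the helper names are hypothetical, not the actual code in io_binding_helper.py):

```python
# "Dynamic" style: introspect the session on every forward call.
def bind_outputs_dynamic(session, io_binding, device_id=0):
    for output in session.get_outputs():
        io_binding.bind_output(output.name, "cuda", device_id)

# "Static" style: a task-specific class knows its output names up front,
# so the per-call loop and name lookups disappear.
def bind_outputs_static_for_classification(io_binding, device_id=0):
    io_binding.bind_output("logits", "cuda", device_id)
```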
JingyaHuang
left a comment
@philschmid Thanks for reviewing! I have made some modifications according to your comments, and the tests are also added.
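For reference, a rough sketch of the kind of equivalence check such tests typically perform (the model id, tolerances, and exact class are illustrative, not the actual test code in this PR):

```python
import torch
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

device = torch.device("cuda:0")
model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative model

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("IO binding should not change the outputs.", return_tensors="pt").to(device)

# Same model loaded with and without IO binding; the logits should match.
model_io = ORTModelForSequenceClassification.from_pretrained(
    model_id, from_transformers=True, use_io_binding=True
)
model_io.to(device)
model_no_io = ORTModelForSequenceClassification.from_pretrained(
    model_id, from_transformers=True, use_io_binding=False
)
model_no_io.to(device)

torch.testing.assert_close(
    model_io(**inputs).logits.cpu(), model_no_io(**inputs).logits.cpu(), atol=1e-4, rtol=1e-4
)
```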
Hi folks, thanks for helping out and reviewing! I think the PR is ready for a final review. The IO binding is now applied directly to all ORTModels except for [...]. Also, since our last discussion, the buffers' size for [...].

@lewtun Here is a (rough) snippet that I use for benchmarking:

```python
from pathlib import Path
import numpy as np
import pandas as pd
from time import perf_counter
import torch
from transformers import AutoConfig, AutoModelForSeq2SeqLM, AutoTokenizer
from optimum.onnxruntime.modeling_seq2seq import ORTModelForSeq2SeqLM
from optimum.onnxruntime import ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig, OptimizationConfig
model_id = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
onnx_path = Path("results_seq2seq/")
seq_lengths = [8, 16, 32, 64, 128, 256, 512]
# Load vanilla onnx model
model = ORTModelForSeq2SeqLM.from_pretrained(model_id, from_transformers=True)
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)
# Graph optimization
optimizer = ORTOptimizer.from_pretrained(model)
optimization_config = OptimizationConfig(optimization_level=2) # enable all optimizations
optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)
def benchmark(seq_len, model, tokenizer, device, iterations=200):
    # prepare data
    seq_len = "l " * (seq_len - 2)
    payload = tokenizer(seq_len, return_tensors="pt")
    payload = {key: val.to(device) for key, val in payload.items()}
    latencies = []
    # warm up
    for _ in range(10):
        _ = model.generate(**payload)
    # timed run
    for _ in range(iterations):
        start_time = perf_counter()
        _ = model.generate(**payload)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_p95_ms = 1000 * np.percentile(latencies, 95)
    return {"seq_len": payload["input_ids"].shape[1], "time_avg_ms": time_avg_ms, "time_p95_ms": time_p95_ms}
device = torch.device("cuda:0")
# Baseline: PyTorch
config = AutoConfig.from_pretrained(model_id, use_cache=True)
pt_model = AutoModelForSeq2SeqLM.from_config(config)
pt_model.to(device)
# Case 1: Vanilla onnx with IO binding
v_onnx_model = ORTModelForSeq2SeqLM.from_pretrained(
    model_id, from_transformers=True, use_cache=True, use_io_binding=True
)
v_onnx_model.to(device)
# Case 2: graph optimized onnx with IOBinding
optim_onnx_model = ORTModelForSeq2SeqLM.from_pretrained(
model_id="results_seq2seq",
encoder_file_name="encoder_model_optimized.onnx",
decoder_file_name="decoder_model_optimized.onnx",
decoder_with_past_file_name="decoder_with_past_model_optimized.onnx",
)
optim_onnx_model.to(device)
# Benchmark
res = []
for seq_len in seq_lengths:
print("seq_len: ", seq_len)
pt = benchmark(seq_len, pt_model, tokenizer, device, iterations=500)
res.append({**pt, "model": "pt"})
v_onnx = benchmark(seq_len, v_onnx_model, tokenizer, device, iterations=500)
res.append({**v_onnx, "model": "v_onnx"})
optim_onnx = benchmark(seq_len, optim_onnx_model, tokenizer, device, iterations=500)
res.append({**optim_onnx, "model": "optim_onnx"})
df = pd.DataFrame(res)
print(df)
chart_df = pd.merge(
    df[df.model == "pt"][["seq_len", "time_p95_ms"]],
    df[df.model == "v_onnx"][["seq_len", "time_p95_ms"]],
    on="seq_len",
)
chart_df = chart_df.rename(
    columns={
        "time_p95_ms_x": "pt_p95",
        "time_p95_ms_y": "v_onnx_p95",
    }
)
chart_df = pd.merge(
    chart_df,
    df[df.model == "optim_onnx"][["seq_len", "time_p95_ms"]],
    on="seq_len",
)
chart_df = chart_df.rename(
    columns={
        "time_p95_ms": "optim_onnx_p95",
    }
)
chart_df["io_improvement/pt"] = f"{round((chart_df['pt_p95'] - chart_df['v_onnx_p95']) / chart_df['pt_p95'] * 100,2)}%"
chart_df["io+optim/pt"] = f"{round((chart_df['pt_p95'] - chart_df['optim_onnx_p95']) / chart_df['pt_p95'] * 100,2)}%"
plt = chart_df.plot(x="seq_len", y=["pt_p95", "v_onnx_p95", "optim_onnx_p95"], kind="line")
plt.figure.savefig("gpu_res_iobinding_seq2seq.png", dpi=900)
print(chart_df.head(10))
chart_df.to_csv("gpu_res_iobinding_seq2seq.csv")
```

@echarlaix I have added some seq2seq models to [...]. Also gently pinging @philschmid, @michaelbenayoun and @fxmarty.
regisss
left a comment
Huge PR @JingyaHuang 🔥 🚀
I just left a couple of minor comments.
philschmid
left a comment
Awesome work @JingyaHuang 🚀✅ Looks good to me. Everything else can be a follow-up PR.
I left two minor comments.
Context
As reported by users, there is sometimes a significant performance drop when using devices for acceleration, and the slowdown is especially pronounced for decoders (and therefore also for seq2seq models). This is due to the large overhead of copying data between the host and the device.
This PR introduces ONNX Runtime's IOBinding to place the inputs and pre-allocate the outputs directly on the device.
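For readers unfamiliar with IOBinding, here is a minimal sketch of what it looks like at the raw ONNX Runtime level (the model path and tensor names are placeholders, not the actual Optimum internals):

```python
import numpy as np
import onnxruntime as ort

# Run the model on GPU with IOBinding: the input stays on the device and the
# output buffer is allocated there too, so the run does not copy tensors
# back and forth between host and device.
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

input_ids = np.ones((1, 128), dtype=np.int64)
input_on_gpu = ort.OrtValue.ortvalue_from_numpy(input_ids, "cuda", 0)

io_binding = session.io_binding()
io_binding.bind_ortvalue_input("input_ids", input_on_gpu)  # input name assumed for illustration
io_binding.bind_output("logits", "cuda", 0)                # let ORT allocate the output on GPU

session.run_with_iobinding(io_binding)
logits_on_gpu = io_binding.get_outputs()[0]  # OrtValue that still lives on the device
```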
Associated issues:
#362 #365 #404 #414