Engine can NOT be shared if model uses dynamic shape. #4127
By default, TensorRT optimizes the model for the input shapes (batch size, image size, and so on) at which it was defined. However, the builder can be configured to allow the input dimensions to be adjusted at runtime. For a fixed-shape model, multiple contexts can be associated with a single engine and run in parallel across threads. For a dynamic-shape model, engine attributes (such as the I/O tensor shapes) are modified according to the particular input shape.

# at init phase
context = engine.create_execution_context()

# at runtime, assuming the model has only one input:
# a dynamic-shape model must set the input shape here; a fixed-shape model can skip this step
context.set_input_shape(lTensorName[0], shape)
for i in range(engine.num_io_tensors):
    print("[%2d]%s->" % (i, "Input " if i < nInput else "Output"),
          engine.get_tensor_dtype(lTensorName[i]),
          engine.get_tensor_shape(lTensorName[i]),
          context.get_tensor_shape(lTensorName[i]), lTensorName[i])

So, the CUDA engine can NOT be shared across contexts. @poor1017
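For the fixed-shape case, a minimal sketch of running several contexts of one engine in parallel might look like the following. This is only an illustration: engine deserialization is omitted, and input_addrs / output_addrs are hypothetical pre-allocated device buffers, one set per context.

import tensorrt as trt
from cuda import cudart

nContext = 2
contexts = [engine.create_execution_context() for _ in range(nContext)]
streams = [cudart.cudaStreamCreate()[1] for _ in range(nContext)]
names = [engine.get_tensor_name(i) for i in range(engine.num_io_tensors)]

for k in range(nContext):
    ctx = contexts[k]
    # bind every I/O tensor of this context to its own device buffer
    for name in names:
        if engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
            ctx.set_tensor_address(name, input_addrs[k][name])   # hypothetical per-context input buffers
        else:
            ctx.set_tensor_address(name, output_addrs[k][name])  # hypothetical per-context output buffers
    # each context is enqueued on its own stream, so the inferences can overlap
    ctx.execute_async_v3(streams[k])

for k in range(nContext):
    cudart.cudaStreamSynchronize(streams[k])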
Hi @lix19937, thank you for your reply. From your code I can infer the following:

But in my C++ test, I found the following:

Did I not handle something correctly? Thank you!
The Engine represents an optimized model. You can query an engine for information about the input and output tensors of the network: the expected dimensions, data type, data format, and so on. The ExecutionContext, created from the engine, is the main interface for invoking inference. The execution context contains all of the state associated with a particular invocation; thus, you can have multiple contexts associated with a single engine and run them in parallel. You must set up the input and output buffers in the appropriate locations when invoking inference. Depending on the nature of the data, this may be in either CPU or GPU memory. If it is not obvious based on your model, you can query the engine to determine in which memory space to provide the buffer.
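For example, to check which memory space each I/O tensor expects, you can query the engine directly. A minimal sketch, assuming engine has already been deserialized:

import tensorrt as trt

for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    mode = engine.get_tensor_mode(name)           # INPUT or OUTPUT
    location = engine.get_tensor_location(name)   # DEVICE (GPU memory) or HOST (CPU memory)
    print(name, mode, engine.get_tensor_dtype(name), engine.get_tensor_shape(name), location)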
I understand the main principles. Now consider the following scenario:

When the program is running, is it safe to let the two threads infer two batches at the same time? Thank you!
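(For concreteness, here is a minimal sketch of the pattern I have in mind: one shared engine, and each thread owning its own context, stream, and buffers. The worker function and tensor_addrs are placeholders.)

import threading
from cuda import cudart

def worker(context, stream, tensor_addrs):
    # tensor_addrs: hypothetical dict mapping tensor name -> device pointer for this thread
    for name, addr in tensor_addrs.items():
        context.set_tensor_address(name, addr)
    context.execute_async_v3(stream)
    cudart.cudaStreamSynchronize(stream)

contexts = [engine.create_execution_context() for _ in range(2)]
streams = [cudart.cudaStreamCreate()[1] for _ in range(2)]
threads = [threading.Thread(target=worker, args=(contexts[k], streams[k], tensor_addrs[k])) for k in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()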
Sorry, my Python example was not very appropriate. At runtime, asking the engine for binding dimensions returns the same dimensions used to build the network, meaning you get a -1 for each runtime dimension. From https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_execution_context.html#details

from cuda import cudart
import numpy as np
import tensorrt as trt
nB, nC, nH, nW = 2, 3, 4, 5
nContext = 2 # num of Context
np.random.seed(19937)
np.set_printoptions(precision=8, linewidth=200, suppress=True)
cudart.cudaDeviceSynchronize()
logger = trt.Logger(trt.Logger.ERROR)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
profileList = [builder.create_optimization_profile() for index in range(nContext)]
config = builder.create_builder_config()
inputT0 = network.add_input("inputT0", trt.float32, [-1, -1, -1, -1])
inputT1 = network.add_input("inputT1", trt.float32, [-1, -1, -1, -1])
layer = network.add_elementwise(inputT0, inputT1, trt.ElementWiseOperation.SUM)
network.mark_output(layer.get_output(0))
for profile in profileList:
    profile.set_shape(inputT0.name, (nB, nC, nH, nW), (nB, nC, nH, nW), (nB * nContext, nC * nContext, nH * nContext, nW * nContext))  # depend on your model shape
    profile.set_shape(inputT1.name, (nB, nC, nH, nW), (nB, nC, nH, nW), (nB * nContext, nC * nContext, nH * nContext, nW * nContext))
    config.add_optimization_profile(profile)
engineString = builder.build_serialized_network(network, config)
engine = trt.Runtime(logger).deserialize_cuda_engine(engineString)
streamList = [cudart.cudaStreamCreate()[1] for index in range(nContext)]
contextList = [engine.create_execution_context() for index in range(nContext)]
nInput = np.sum([engine.binding_is_input(i) for i in range(engine.num_bindings)])
nOutput = engine.num_bindings - nInput
nInput = nInput // nContext
nOutput = nOutput // nContext
bufferH = []
for index in range(nContext):
    stream = streamList[index]
    context = contextList[index]
    context.set_optimization_profile_async(index, stream)
    bindingPad = (nInput + nOutput) * index  # offset of this context's bindings in the flattened binding list
    bindingShape = (np.array([nB, nC, nH, nW]) * (index + 1)).tolist()  # Context0 uses [nB, nC, nH, nW], Context1 uses [nB*2, nC*2, nH*2, nW*2]
    context.set_binding_shape(bindingPad + 0, bindingShape)
    context.set_binding_shape(bindingPad + 1, bindingShape)
    print("Context%d binding all? %s" % (index, "Yes" if context.all_binding_shapes_specified else "No"))
    for i in range(engine.num_bindings):
        print(i, "Input " if engine.binding_is_input(i) else "Output", engine.get_binding_shape(i), context.get_binding_shape(i))
    for i in range(nInput):
        bufferH.append(np.arange(np.prod(bindingShape)).astype(np.float32).reshape(bindingShape))
    for i in range(nOutput):
        bufferH.append(np.empty(context.get_binding_shape(bindingPad + nInput + i), dtype=trt.nptype(engine.get_binding_dtype(bindingPad + nInput + i))))
bufferD = []
for i in range(engine.num_bindings):
    bufferD.append(cudart.cudaMalloc(bufferH[i].nbytes)[1])
for index in range(nContext):
    bindingPad = (nInput + nOutput) * index
    for i in range(nInput):
        cudart.cudaMemcpyAsync(bufferD[bindingPad + i], bufferH[bindingPad + i].ctypes.data, bufferH[bindingPad + i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, streamList[index])

for index in range(nContext):
    bindingPad = (nInput + nOutput) * index
    bufferList = [int(0) for b in bufferD[:bindingPad]] + [int(b) for b in bufferD[bindingPad:(bindingPad + nInput + nOutput)]] + [int(0) for b in bufferD[(bindingPad + nInput + nOutput):]]
    contextList[index].execute_async_v2(bufferList, streamList[index])

for index in range(nContext):
    bindingPad = (nInput + nOutput) * index
    for i in range(nOutput):
        cudart.cudaMemcpyAsync(bufferH[bindingPad + nInput + i].ctypes.data, bufferD[bindingPad + nInput + i], bufferH[bindingPad + nInput + i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost, streamList[index])

for index in range(nContext):
    cudart.cudaStreamSynchronize(streamList[index])

for index in range(nContext):
    bindingPad = (nInput + nOutput) * index
    print("check result of context %d: %s" % (index, np.all(bufferH[bindingPad + 2] == bufferH[bindingPad + 0] + bufferH[bindingPad + 1])))

for b in bufferD:
    cudart.cudaFree(b)

print("done!")
From https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_execution_context.html#details

I noticed that there is this statement in your code: context.set_optimization_profile_async(index, stream). In this case, if an engine has only one default optimization profile, then the engine can indeed only have one context. There is one thing that makes me curious: why should we call this function? Thank you!
Yes. In the previous Python snippet, 2 contexts are used, and the two profile.set_shape calls use the same shapes.
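In other words, the pattern is one optimization profile per context, all covering the same shape range. A minimal sketch of just that pairing, reusing the names from the snippet above (minShape/optShape/maxShape stand in for the shapes used there; builder setup and buffer handling omitted):

# at build time: add one profile per planned context, each with the same min/opt/max shapes
for k in range(nContext):
    profile = builder.create_optimization_profile()
    profile.set_shape("inputT0", minShape, optShape, maxShape)
    profile.set_shape("inputT1", minShape, optShape, maxShape)
    config.add_optimization_profile(profile)

# at runtime: context k selects profile k before setting its input shapes
for k in range(nContext):
    contextList[k].set_optimization_profile_async(k, streamList[k])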
Hi @lix19937 Here is what I discovered. From https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#opt_profiles
We can get:
So, an engine can NOT be shared.
Yes, you can see the code
and
Hi guys,
In the tensorrt_backend repo's code:
https://github.com/triton-inference-server/tensorrt_backend/blob/main/src/instance_state.h#L377
It means: if the model uses dynamic shape, then the CUDA engine can NOT be shared across contexts.
Can anyone tell me why this is?
Thank you!
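For context, whether a deserialized engine uses dynamic shapes can be seen from its I/O tensor shapes: any runtime dimension is reported as -1. A minimal sketch of such a check (an illustration only, not the backend's actual code):

import tensorrt as trt

def engine_has_dynamic_shapes(engine: trt.ICudaEngine) -> bool:
    # a dimension of -1 in an engine tensor shape marks a runtime (dynamic) dimension
    for i in range(engine.num_io_tensors):
        name = engine.get_tensor_name(i)
        if -1 in tuple(engine.get_tensor_shape(name)):
            return True
    return False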