
Engine can NOT be shared if model uses dynamic shape. #4127

Closed
poor1017 opened this issue Sep 14, 2024 · 9 comments
Labels
Topic: Dynamic Shape, triaged

Comments

@poor1017

Hi guys

In the tensorrt_backend repo's code:
https://github.com/triton-inference-server/tensorrt_backend/blob/main/src/instance_state.h#L377

This check means that if the model uses dynamic shapes, the CUDA engine can NOT be shared across contexts.

Can anyone tell me why this is?

Thank you!

@lix19937

By default, TensorRT optimizes the model based on the input shapes (batch size, image size, and so on) at which it was defined. However, the builder can be configured to adjust the input dimensions at runtime.

For a fixed-shape model, multiple contexts can be associated with a single engine and run in parallel across threads.

For a dynamic-shape model, engine attributes (like the I/O tensor shapes) are specialized at runtime by the specific input shape.

# at init phase
lTensorName = [engine.get_tensor_name(i) for i in range(engine.num_io_tensors)]
nInput = [engine.get_tensor_mode(n) for n in lTensorName].count(trt.TensorIOMode.INPUT)
context = engine.create_execution_context()

# at runtime (assuming the model has only one input):
# a dynamic-shape model must set the input shape; a fixed-shape model need not.
context.set_input_shape(lTensorName[0], shape)

for i in range(engine.num_io_tensors):
    print("[%2d]%s->" % (i, "Input " if i < nInput else "Output"),
          engine.get_tensor_dtype(lTensorName[i]),
          engine.get_tensor_shape(lTensorName[i]),
          context.get_tensor_shape(lTensorName[i]), lTensorName[i])

So, the CUDA engine can NOT be shared across contexts. @poor1017

@moraxu added the triaged and Topic: Dynamic Shape labels Sep 16, 2024
@poor1017

Hi @lix19937

Thank you for your reply.

From your code I can infer the following:

  1. The dynamic input shape is set on the context;
  2. I can get the running shape from engine.get_tensor_shape(lTensorName[i]);
  3. So, the engine can NOT be shared;

But, in my C++ test, I found the following:

  1. Call nvinfer1::Dims dims = engine->getTensorShape("x");, I get d = {-1, -1, 80, 0, 0, 0, 0, 0};
  2. Call context->setInputShape("x", {1, 200, 80, 0, 0, 0, 0, 0});, then call nvinfer1::Dims dims = engine->getTensorShape("x");, I get d = {-1, -1, 80, 0, 0, 0, 0, 0}, unchanged;
  3. Call context->setInputShape("x", {1, 200, 80, 0, 0, 0, 0, 0});, then call context->getTensorShape("x");, I get d = {1, 200, 80, 0, 0, 0, 0, 0};
  4. So, I think the engine can be shared, while the context can NOT be shared (see the Python sketch below);
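
Here is the same observation as a minimal Python sketch (assuming TensorRT >= 8.5, a hypothetical serialized dynamic-shape engine at model.plan, and an input tensor named "x" with build-time shape (-1, -1, 80)):

import tensorrt as trt

logger = trt.Logger(trt.Logger.ERROR)
with open("model.plan", "rb") as f:  # hypothetical engine file
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

print(engine.get_tensor_shape("x"))    # build-time dims: (-1, -1, 80)

context = engine.create_execution_context()
context.set_input_shape("x", (1, 200, 80))
print(engine.get_tensor_shape("x"))    # still (-1, -1, 80); the engine is untouched
print(context.get_tensor_shape("x"))   # (1, 200, 80); the shape lives on the context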

Did I not handle something correctly?

Thank you!

@lix19937

The Engine represents an optimized model. You can query an engine for information about the input and output tensors of the network - the expected dimensions, data type, data format, and so on.

The ExecutionContext, created from the engine, is the main interface for invoking inference. The execution context contains all of the states associated with a particular invocation - thus, you can have multiple contexts associated with a single engine and run them in parallel.

You must set up the input and output buffers in the appropriate locations when invoking inference. Depending on the nature of the data, this may be in either CPU or GPU memory. If it is not obvious from your model, you can query the engine to determine which memory space to provide the buffer in.
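
As a rough illustration of that last point, a sketch assuming TensorRT >= 8.5 and an already-deserialized engine (the loop only queries per-tensor metadata; nothing here is specific to your model):

import tensorrt as trt

# `engine` is assumed to be an already-deserialized trt.ICudaEngine
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    print(name,
          engine.get_tensor_mode(name),      # TensorIOMode.INPUT or OUTPUT
          engine.get_tensor_dtype(name),     # expected data type
          engine.get_tensor_shape(name),     # build-time dims (-1 = dynamic)
          engine.get_tensor_location(name))  # TensorLocation.DEVICE or HOST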

@poor1017

I understand the main principles.

Now there is such a scenario:

  1. The model uses dynamic shape;
  2. The program has two worker threads;
  3. Two contexts are created from one engine;
  4. Each thread is bound to a context;

When the program is running, is it safe to let the two threads infer two batches at the same time?
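
Concretely, the pattern I have in mind is roughly this (a hedged sketch: engine is assumed to have been built with one optimization profile per thread, and jobs is a hypothetical list of (bindings, shape) pairs whose device buffers were allocated beforehand):

import threading
from cuda import cudart

def worker(profile_index, bindings, shape):
    # one stream and one execution context per thread
    _, stream = cudart.cudaStreamCreate()
    context = engine.create_execution_context()
    # with dynamic shapes, each concurrently used context needs its own profile index
    context.set_optimization_profile_async(profile_index, stream)
    # set the runtime shape of this profile's input binding
    nPerProfile = engine.num_bindings // engine.num_optimization_profiles
    context.set_binding_shape(nPerProfile * profile_index, shape)
    context.execute_async_v2(bindings, stream)
    cudart.cudaStreamSynchronize(stream)
    cudart.cudaStreamDestroy(stream)

threads = [threading.Thread(target=worker, args=(i, b, s))
           for i, (b, s) in enumerate(jobs)]
for t in threads:
    t.start()
for t in threads:
    t.join()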

Thank you!

@lix19937

lix19937 commented Sep 18, 2024

Sorry, my Python example was not very appropriate. At runtime, asking the engine for binding dimensions returns the same dimensions used to build the network, meaning you get a -1 for each runtime dimension.

From https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_execution_context.html#details
Multiple execution contexts may exist for one instance, allowing the same engine to be used for the execution of multiple batches simultaneously. If the engine supports dynamic shapes, each execution context in concurrent use must use a separate optimization profile.
See the following snippet: @poor1017

from cuda import cudart
import numpy as np
import tensorrt as trt

nB, nC, nH, nW = 2, 3, 4, 5
nContext = 2  # num of Context 
np.random.seed(19937)
np.set_printoptions(precision=8, linewidth=200, suppress=True)
cudart.cudaDeviceSynchronize()

logger = trt.Logger(trt.Logger.ERROR)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
profileList = [builder.create_optimization_profile() for index in range(nContext)]
config = builder.create_builder_config()

inputT0 = network.add_input("inputT0", trt.float32, [-1, -1, -1, -1])
inputT1 = network.add_input("inputT1", trt.float32, [-1, -1, -1, -1])
layer = network.add_elementwise(inputT0, inputT1, trt.ElementWiseOperation.SUM)
network.mark_output(layer.get_output(0))

for profile in profileList:
    profile.set_shape(inputT0.name, (nB, nC, nH, nW), (nB, nC, nH, nW), (nB * nContext, nC * nContext, nH * nContext, nW * nContext))  # min/opt/max shapes; adapt to your model
    profile.set_shape(inputT1.name, (nB, nC, nH, nW), (nB, nC, nH, nW), (nB * nContext, nC * nContext, nH * nContext, nW * nContext))
    config.add_optimization_profile(profile)

engineString = builder.build_serialized_network(network, config)
engine = trt.Runtime(logger).deserialize_cuda_engine(engineString)

streamList = [cudart.cudaStreamCreate()[1] for index in range(nContext)]
contextList = [engine.create_execution_context() for index in range(nContext)]
nInput = np.sum([engine.binding_is_input(i) for i in range(engine.num_bindings)])
nOutput = engine.num_bindings - nInput
nInput = nInput // nContext
nOutput = nOutput // nContext

bufferH = []
for index in range(nContext):
    stream = streamList[index]
    context = contextList[index]
    context.set_optimization_profile_async(index, stream)
    bindingPad = (nInput + nOutput) * index  # offset of this context's profile bindings
    bindingShape = (np.array([nB, nC, nH, nW]) * (index + 1)).tolist()  # context 0 uses [nB, nC, nH, nW], context 1 uses [nB*2, nC*2, nH*2, nW*2]
    context.set_binding_shape(bindingPad + 0, bindingShape)
    context.set_binding_shape(bindingPad + 1, bindingShape)
    print("Context%d binding all? %s" % (index, "Yes" if context.all_binding_shapes_specified else "No"))
    for i in range(engine.num_bindings):
        print(i, "Input " if engine.binding_is_input(i) else "Output", engine.get_binding_shape(i), context.get_binding_shape(i))

    for i in range(nInput):
        bufferH.append(np.arange(np.prod(bindingShape)).astype(np.float32).reshape(bindingShape))
    for i in range(nOutput):
        bufferH.append(np.empty(context.get_binding_shape(bindingPad + nInput + i), dtype=trt.nptype(engine.get_binding_dtype(bindingPad + nInput + i))))

bufferD = []
for i in range(engine.num_bindings):
    bufferD.append(cudart.cudaMalloc(bufferH[i].nbytes)[1])

for index in range(nContext):
    bindingPad = (nInput + nOutput) * index
    for i in range(nInput):
        cudart.cudaMemcpyAsync(bufferD[bindingPad + i], bufferH[bindingPad + i].ctypes.data, bufferH[bindingPad + i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, streamList[index])

for index in range(nContext):
    bindingPad = (nInput + nOutput) * index
    bufferList = [int(0) for b in bufferD[:bindingPad]] + [int(b) for b in bufferD[bindingPad:(bindingPad + nInput + nOutput)]] + [int(0) for b in bufferD[(bindingPad + nInput + nOutput):]]
    contextList[index].execute_async_v2(bufferList, streamList[index])

for index in range(nContext):
    bindingPad = (nInput + nOutput) * index
    for i in range(nOutput):
        cudart.cudaMemcpyAsync(bufferH[bindingPad + nInput + i].ctypes.data, bufferD[bindingPad + nInput + i], bufferH[bindingPad + nInput + i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost, streamList[index])

for index in range(nContext):
    cudart.cudaStreamSynchronize(streamList[index])

for index in range(nContext):
    bindingPad = (nInput + nOutput) * index
    print("check result of context %d: %s" % (index, np.all(bufferH[bindingPad + 2] == bufferH[bindingPad + 0] + bufferH[bindingPad + 1])))

for b in bufferD:
    cudart.cudaFree(b)

print("done !");

@poor1017

From https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_execution_context.html#details
One optimization profile matches only one context.

I noticed that there is this statement in your code:

context.set_optimization_profile_async(index, stream)

In this case, if an engine has only one default optimization profile, then the engine can indeed have only one context.

There is one thing that makes me curious: why should we call context.set_optimization_profile_async(index, stream)?
If we do NOT call this method, can one engine have multiple contexts?

Thank you!

@lix19937

lix19937 commented Sep 18, 2024

set_optimization_profile_async / set_optimization_profile just handles the dynamic-shape case, to get better performance for certain (nearby) shape ranges.

If we do NOT call this method, can one engine have multiple contexts?

Yes. In the previous Python snippet, two contexts are used, and the two profile.set_shape calls use the same shape ranges.

@poor1017

Hi @lix19937

Here is what I discovered.

From https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#opt_profiles

At runtime, you must set an optimization profile before setting input dimensions. Profiles are numbered in the order they were added, starting at 0. Note that each execution context must use a separate optimization profile.

If the associated CUDA engine has dynamic inputs, the optimization profile must be set at least once with a unique profile index that is not used by other execution contexts, and that is not destroyed. For the first execution context created for an engine, profile 0 is implicitly chosen.

We can get:

  1. An engine has a default optimization profile which has index 0;
  2. With dynamic shapes, one profile (by default, profile 0) is bound to one context;

So, an engine with only the default profile can NOT be shared across contexts.

@lix19937

lix19937 commented Sep 19, 2024

Yes, you can see it in the code:

for profile in profileList:
    profile.set_shape(inputT0.name, (nB, nC, nH, nW), (nB, nC, nH, nW), (nB * nContext, nC * nContext, nH * nContext, nW * nContext))  # min/opt/max shapes; adapt to your model
    profile.set_shape(inputT1.name, (nB, nC, nH, nW), (nB, nC, nH, nW), (nB * nContext, nC * nContext, nH * nContext, nW * nContext))
    config.add_optimization_profile(profile)

and

for index in range(nContext):
    stream = streamList[index]
    context = contextList[index]
    context.set_optimization_profile_async(index, stream)
    bindingPad = (nInput + nOutput) * index  # offset of this context's profile bindings
    bindingShape = (np.array([nB, nC, nH, nW]) * (index + 1)).tolist()  # context 0 uses [nB, nC, nH, nW], context 1 uses [nB*2, nC*2, nH*2, nW*2]
    context.set_binding_shape(bindingPad + 0, bindingShape)
    context.set_binding_shape(bindingPad + 1, bindingShape)
