Engine can NOT be shared if model uses dynamic shape. #4127
By default, TensorRT optimizes the model for the input shapes (batch size, image size, and so on) at which it was defined. However, the builder can be configured to allow the input dimensions to be adjusted at runtime. For a fixed-shape model, multiple contexts can be associated with a single engine and run in parallel across threads. For a dynamic-shape model, engine attributes (such as the I/O tensor shapes) are modified according to the particular input shape.

# at init phase
context = engine.create_execution_context()

# at runtime, assuming the model has only one input:
# a dynamic-shape model must set the input shape here; a fixed-shape model can skip this step
context.set_input_shape(lTensorName[0], shape)
for i in range(engine.num_io_tensors):
    print("[%2d]%s->" % (i, "Input " if i < nInput else "Output"),
          engine.get_tensor_dtype(lTensorName[i]),
          engine.get_tensor_shape(lTensorName[i]),
          context.get_tensor_shape(lTensorName[i]), lTensorName[i])

So, the CUDA engine can NOT be shared across contexts. @poor1017
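For the fixed-shape case, a minimal sketch of running several contexts of one engine in parallel might look like the following. This is only an illustration: engine deserialization is omitted, and input_addrs / output_addrs are hypothetical pre-allocated device buffers, one set per context.

import tensorrt as trt
from cuda import cudart

nContext = 2
contexts = [engine.create_execution_context() for _ in range(nContext)]
streams = [cudart.cudaStreamCreate()[1] for _ in range(nContext)]
names = [engine.get_tensor_name(i) for i in range(engine.num_io_tensors)]

for k in range(nContext):
    ctx = contexts[k]
    # bind every I/O tensor of this context to its own device buffer
    for name in names:
        if engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
            ctx.set_tensor_address(name, input_addrs[k][name])   # hypothetical per-context input buffers
        else:
            ctx.set_tensor_address(name, output_addrs[k][name])  # hypothetical per-context output buffers
    # each context is enqueued on its own stream, so the inferences can overlap
    ctx.execute_async_v3(streams[k])

for k in range(nContext):
    cudart.cudaStreamSynchronize(streams[k])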
Hi @lix19937, thank you for your reply. From your code I can infer the following:

But in my C++ test, I found the following:

Did I not handle something correctly? Thank you!
The Engine represents an optimized model. You can query an engine for information about the input and output tensors of the network: the expected dimensions, data type, data format, and so on. The ExecutionContext, created from the engine, is the main interface for invoking inference. The execution context contains all of the state associated with a particular invocation; thus, you can have multiple contexts associated with a single engine and run them in parallel. You must set up the input and output buffers in the appropriate locations when invoking inference. Depending on the nature of the data, this may be in either CPU or GPU memory. If it is not obvious based on your model, you can query the engine to determine in which memory space to provide the buffer.
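For example, to check which memory space each I/O tensor expects, you can query the engine directly. A minimal sketch, assuming engine has already been deserialized:

import tensorrt as trt

for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    mode = engine.get_tensor_mode(name)           # INPUT or OUTPUT
    location = engine.get_tensor_location(name)   # DEVICE (GPU memory) or HOST (CPU memory)
    print(name, mode, engine.get_tensor_dtype(name), engine.get_tensor_shape(name), location)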
I understand the main principles. Now consider the following scenario:

When the program is running, is it safe to let the two threads infer two batches at the same time? Thank you!
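(For concreteness, here is a minimal sketch of the pattern I have in mind: one shared engine, and each thread owning its own context, stream, and buffers. The worker function and tensor_addrs are placeholders.)

import threading
from cuda import cudart

def worker(context, stream, tensor_addrs):
    # tensor_addrs: hypothetical dict mapping tensor name -> device pointer for this thread
    for name, addr in tensor_addrs.items():
        context.set_tensor_address(name, addr)
    context.execute_async_v3(stream)
    cudart.cudaStreamSynchronize(stream)

contexts = [engine.create_execution_context() for _ in range(2)]
streams = [cudart.cudaStreamCreate()[1] for _ in range(2)]
threads = [threading.Thread(target=worker, args=(contexts[k], streams[k], tensor_addrs[k])) for k in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()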
Sorry, my Python example was not very appropriate. At runtime, asking the engine for binding dimensions returns the same dimensions used to build the network, meaning you get a -1 for each runtime dimension. From https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_execution_context.html#details

from cuda import cudart
import numpy as np
import tensorrt as trt
nB, nC, nH, nW = 2, 3, 4, 5
nContext = 2 # num of Context
np.random.seed(19937)
np.set_printoptions(precision=8, linewidth=200, suppress=True)
cudart.cudaDeviceSynchronize()
logger = trt.Logger(trt.Logger.ERROR)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
profileList = [builder.create_optimization_profile() for index in range(nContext)]
config = builder.create_builder_config()
inputT0 = network.add_input("inputT0", trt.float32, [-1, -1, -1, -1])
inputT1 = network.add_input("inputT1", trt.float32, [-1, -1, -1, -1])
layer = network.add_elementwise(inputT0, inputT1, trt.ElementWiseOperation.SUM)
network.mark_output(layer.get_output(0))
for profile in profileList:
    profile.set_shape(inputT0.name, (nB, nC, nH, nW), (nB, nC, nH, nW), (nB * nContext, nC * nContext, nH * nContext, nW * nContext))  # depend on your model shape
    profile.set_shape(inputT1.name, (nB, nC, nH, nW), (nB, nC, nH, nW), (nB * nContext, nC * nContext, nH * nContext, nW * nContext))
    config.add_optimization_profile(profile)
engineString = builder.build_serialized_network(network, config)
engine = trt.Runtime(logger).deserialize_cuda_engine(engineString)
streamList = [cudart.cudaStreamCreate()[1] for index in range(nContext)]
contextList = [engine.create_execution_context() for index in range(nContext)]
nInput = np.sum([engine.binding_is_input(i) for i in range(engine.num_bindings)])
nOutput = engine.num_bindings - nInput
nInput = nInput // nContext
nOutput = nOutput // nContext
bufferH = []
for index in range(nContext):
    stream = streamList[index]
    context = contextList[index]
    context.set_optimization_profile_async(index, stream)
    bindingPad = (nInput + nOutput) * index  # offset of this context's bindings in the flattened binding list
    bindingShape = (np.array([nB, nC, nH, nW]) * (index + 1)).tolist()  # Context0 uses [nB, nC, nH, nW], Context1 uses [nB*2, nC*2, nH*2, nW*2]
    context.set_binding_shape(bindingPad + 0, bindingShape)
    context.set_binding_shape(bindingPad + 1, bindingShape)
    print("Context%d binding all? %s" % (index, "Yes" if context.all_binding_shapes_specified else "No"))
    for i in range(engine.num_bindings):
        print(i, "Input " if engine.binding_is_input(i) else "Output", engine.get_binding_shape(i), context.get_binding_shape(i))
    for i in range(nInput):
        bufferH.append(np.arange(np.prod(bindingShape)).astype(np.float32).reshape(bindingShape))
    for i in range(nOutput):
        bufferH.append(np.empty(context.get_binding_shape(bindingPad + nInput + i), dtype=trt.nptype(engine.get_binding_dtype(bindingPad + nInput + i))))
bufferD = []
for i in range(engine.num_bindings):
    bufferD.append(cudart.cudaMalloc(bufferH[i].nbytes)[1])
for index in range(nContext):
    bindingPad = (nInput + nOutput) * index
    for i in range(nInput):
        cudart.cudaMemcpyAsync(bufferD[bindingPad + i], bufferH[bindingPad + i].ctypes.data, bufferH[bindingPad + i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, streamList[index])

for index in range(nContext):
    bindingPad = (nInput + nOutput) * index
    bufferList = [int(0) for b in bufferD[:bindingPad]] + [int(b) for b in bufferD[bindingPad:(bindingPad + nInput + nOutput)]] + [int(0) for b in bufferD[(bindingPad + nInput + nOutput):]]
    contextList[index].execute_async_v2(bufferList, streamList[index])

for index in range(nContext):
    bindingPad = (nInput + nOutput) * index
    for i in range(nOutput):
        cudart.cudaMemcpyAsync(bufferH[bindingPad + nInput + i].ctypes.data, bufferD[bindingPad + nInput + i], bufferH[bindingPad + nInput + i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost, streamList[index])

for index in range(nContext):
    cudart.cudaStreamSynchronize(streamList[index])

for index in range(nContext):
    bindingPad = (nInput + nOutput) * index
    print("check result of context %d: %s" % (index, np.all(bufferH[bindingPad + 2] == bufferH[bindingPad + 0] + bufferH[bindingPad + 1])))

for b in bufferD:
    cudart.cudaFree(b)

print("done!")
From https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_execution_context.html#details

I noticed that there is this statement in your code: context.set_optimization_profile_async(index, stream). In this case, if an engine has only one default optimization profile, then the engine can indeed only have one context. There is one thing that makes me curious: why should we call this function? Thank you!
Yes. In the previous Python snippet, 2 contexts are used, and the two profile.set_shape calls use the same shapes.
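In other words, the pattern is one optimization profile per context, all covering the same shape range. A minimal sketch of just that pairing, reusing the names from the snippet above (minShape/optShape/maxShape stand in for the shapes used there; builder setup and buffer handling omitted):

# at build time: add one profile per planned context, each with the same min/opt/max shapes
for k in range(nContext):
    profile = builder.create_optimization_profile()
    profile.set_shape("inputT0", minShape, optShape, maxShape)
    profile.set_shape("inputT1", minShape, optShape, maxShape)
    config.add_optimization_profile(profile)

# at runtime: context k selects profile k before setting its input shapes
for k in range(nContext):
    contextList[k].set_optimization_profile_async(k, streamList[k])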
Hi @lix19937 Here is what I discovered. From https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#opt_profiles
We can get:
So, an engine can NOT be shared.
Yes, you can see the code
and
Hi guys,
In the tensorrt_backend repo's code:
https://github.com/triton-inference-server/tensorrt_backend/blob/main/src/instance_state.h#L377
It means: if the model uses dynamic shape, then the CUDA engine can NOT be shared across contexts.
Can anyone tell me why this is?
Thank you!
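For context, whether a deserialized engine uses dynamic shapes can be seen from its I/O tensor shapes: any runtime dimension is reported as -1. A minimal sketch of such a check (an illustration only, not the backend's actual code):

import tensorrt as trt

def engine_has_dynamic_shapes(engine: trt.ICudaEngine) -> bool:
    # a dimension of -1 in an engine tensor shape marks a runtime (dynamic) dimension
    for i in range(engine.num_io_tensors):
        name = engine.get_tensor_name(i)
        if -1 in tuple(engine.get_tensor_shape(name)):
            return True
    return False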