A Beginner's Guide to TensorRT Model Deployment (Python)

Table of Contents

  • 1. TensorRT Installation
  • 1.1 CUDA/cuDNN and virtual environment setup
  • 1.2 Installing the TensorRT version that matches your CUDA version
  • 2. Model Conversion
  • 2.1 Converting .pth to ONNX
  • 2.2 Converting ONNX to an engine
  • 3. TensorRT Deployment
  • TensorRT Inference (Python API)
  • TensorRT Inference (C++ API)
  • Possible Issues

    1. TensorRT Installation

    1.1 CUDA/cuDNN and virtual environment setup

    CUDA download link: https://developer.nvidia.com/cuda-toolkit-archive
    cuDNN download link: https://docs.nvidia.com/deeplearning/cudnn/latest/installation/windows.html
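
    Once CUDA, cuDNN, and the virtual environment are set up, a quick sanity check from Python confirms that the GPU is visible. This is a minimal sketch and assumes PyTorch is already installed in the active environment:

    import torch

    print(torch.__version__)               # PyTorch build
    print(torch.version.cuda)              # CUDA version PyTorch was compiled against
    print(torch.cuda.is_available())       # True if the driver/toolkit setup is working
    print(torch.backends.cudnn.version())  # cuDNN version visible to PyTorch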

    1.2 Installing the TensorRT version that matches your CUDA version

    TensorRT download link: https://developer.nvidia.com/tensorrt/download

    TensorRT installation guide:
    After downloading, extract the archive and add its lib folder to your environment variables (PATH).
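
    To verify the setup from Python, the tensorrt wheel shipped in the package's python folder can be pip-installed into the virtual environment and imported. A minimal sanity check, assuming that wheel is installed:

    import tensorrt as trt

    print(trt.__version__)                 # should match the TensorRT release you downloaded
    builder = trt.Builder(trt.Logger(trt.Logger.WARNING))
    print(builder.platform_has_fast_fp16)  # True if the GPU has fast native FP16 support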

    2. Model Conversion

    2.1 Converting .pth to ONNX

    Install the onnx module for Python: pip install onnx

    import torch

    # `model` is the loaded PyTorch model (in eval mode) and `x` is a dummy input
    # tensor with a representative shape, e.g. x = torch.randn(1, 3, 416, 416).
    input_name = 'input'
    output_name = 'output'
    torch.onnx.export(model,                        # model being run
                      x,                            # model input
                      "model.onnx",                 # where to save the model (can be a file or file-like object)
                      opset_version=11,             # the ONNX opset version to export the model to
                      input_names=[input_name],     # the model's input names
                      output_names=[output_name],   # the model's output names
                      dynamic_axes={
                          input_name: {0: 'batch_size', 2: 'in_width', 3: 'in_height'},
                          output_name: {0: 'batch_size', 2: 'out_width', 3: 'out_height'}
                      })
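
    Before handing the exported model to TensorRT, it is worth validating it with the onnx package. A minimal check, assuming the file was saved as model.onnx as in the snippet above:

    import onnx

    onnx_model = onnx.load("model.onnx")
    onnx.checker.check_model(onnx_model)                  # raises an exception if the graph is invalid
    print(onnx.helper.printable_graph(onnx_model.graph))  # prints a human-readable summary of the graph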
    

    2.2 Converting ONNX to an engine

    Note: TensorRT's ONNX parser is built against specific ONNX/opset versions; if the version used when exporting from PyTorch does not match what your TensorRT release supports, errors may occur during conversion.
    1. Using the command-line tool

    This mainly involves running the trtexec executable from the bin folder.

    trtexec.exe --onnx=model.onnx --saveEngine=model.engine --workspace=6000
    
    # Build an engine with a static batch size
    ./trtexec 	--onnx=<onnx_file> \ 						# path to the ONNX model file
            	--explicitBatch \ 							# build the engine with an explicit batch size (default = implicit batch)
            	--saveEngine=<tensorRT_engine_file> \ 		# output engine file
            	--workspace=<size_in_megabytes> \ 			# workspace size in MB (default = 16 MB)
            	--fp16 										# enable fp16 precision in addition to fp32 (default = disabled)

    # Build an engine with a dynamic batch size
    ./trtexec 	--onnx=<onnx_file> \						# path to the ONNX model file
            	--minShapes=input:<shape_of_min_batch> \ 	# minimum NCHW input shape
            	--optShapes=input:<shape_of_opt_batch> \  	# optimal input shape; using the same value as maxShapes is fine
            	--maxShapes=input:<shape_of_max_batch> \ 	# maximum NCHW input shape
            	--workspace=<size_in_megabytes> \ 			# workspace size in MB (default = 16 MB)
            	--saveEngine=<engine_file> \   				# output engine file
            	--fp16   									# enable fp16 precision in addition to fp32 (default = disabled)
    
    
    # Smaller images can use a larger batch size, e.g. 8x3x416x416
    /home/zxl/TensorRT-7.2.3.4/bin/trtexec  --onnx=yolov4_-1_3_416_416_dynamic.onnx \
                                            --minShapes=input:1x3x416x416 \
                                            --optShapes=input:8x3x416x416 \
                                            --maxShapes=input:8x3x416x416 \
                                            --workspace=4096 \
                                            --saveEngine=yolov4_-1_3_416_416_dynamic_b8_fp16.engine \
                                            --fp16
    
    # Reduced to 4x3x608x608 because of insufficient GPU memory
    /home/zxl/TensorRT-7.2.3.4/bin/trtexec  --onnx=yolov4_-1_3_608_608_dynamic.onnx \
                                            --minShapes=input:1x3x608x608 \
                                            --optShapes=input:4x3x608x608 \
                                            --maxShapes=input:4x3x608x608 \
                                            --workspace=4096 \
                                            --saveEngine=yolov4_-1_3_608_608_dynamic_b4_fp16.engine \
                                            --fp16           
                                            
    

    You can also run trtexec.exe --help to see what each trtexec option means.

    D:\Work\cuda_gpu\sdk\TensorRT-8.5.1.7\bin>trtexec.exe --help
    &&&& RUNNING TensorRT.trtexec [TensorRT v8501] # trtexec.exe --help
    === Model Options ===
      --uff=<file>                UFF model
      --onnx=<file>               ONNX model
      --model=<file>              Caffe model (default = no model, random weights used)
      --deploy=<file>             Caffe prototxt file
      --output=<name>[,<name>]*   Output names (it can be specified multiple times); at least one output is required for UFF and Caffe
      --uffInput=<name>,X,Y,Z     Input blob name and its dimensions (X,Y,Z=C,H,W), it can be specified multiple times; at least one is required for UFF models
      --uffNHWC                   Set if inputs are in the NHWC layout instead of NCHW (use X,Y,Z=H,W,C order in --uffInput)
    
    === Build Options ===
      --maxBatch                  Set max batch size and build an implicit batch engine (default = same size as --batch)
                                  This option should not be used when the input model is ONNX or when dynamic shapes are provided.
      --minShapes=spec            Build with dynamic shapes using a profile with the min shapes provided
      --optShapes=spec            Build with dynamic shapes using a profile with the opt shapes provided
      --maxShapes=spec            Build with dynamic shapes using a profile with the max shapes provided
      --minShapesCalib=spec       Calibrate with dynamic shapes using a profile with the min shapes provided
      --optShapesCalib=spec       Calibrate with dynamic shapes using a profile with the opt shapes provided
      --maxShapesCalib=spec       Calibrate with dynamic shapes using a profile with the max shapes provided
                                  Note: All three of min, opt and max shapes must be supplied.
                                        However, if only opt shapes is supplied then it will be expanded so
                                        that min shapes and max shapes are set to the same values as opt shapes.
                                        Input names can be wrapped with escaped single quotes (ex: \'Input:0\').
                                  Example input shapes spec: input0:1x3x256x256,input1:1x3x128x128
                                  Each input shape is supplied as a key-value pair where key is the input name and
                                  value is the dimensions (including the batch dimension) to be used for that input.
                                  Each key-value pair has the key and value separated using a colon (:).
                                  Multiple input shapes can be provided via comma-separated key-value pairs.
      --inputIOFormats=spec       Type and format of each of the input tensors (default = all inputs in fp32:chw)
                                  See --outputIOFormats help for the grammar of type and format list.
                                  Note: If this option is specified, please set comma-separated types and formats for all
                                        inputs following the same order as network inputs ID (even if only one input
                                        needs specifying IO format) or set the type and format once for broadcasting.
      --outputIOFormats=spec      Type and format of each of the output tensors (default = all outputs in fp32:chw)
                                  Note: If this option is specified, please set comma-separated types and formats for all
                                        outputs following the same order as network outputs ID (even if only one output
                                        needs specifying IO format) or set the type and format once for broadcasting.
                                  IO Formats: spec  ::= IOfmt[","spec]
                                              IOfmt ::= type:fmt
                                              type  ::= "fp32"|"fp16"|"int32"|"int8"
                                              fmt   ::= ("chw"|"chw2"|"chw4"|"hwc8"|"chw16"|"chw32"|"dhwc8"|
                                                         "cdhw32"|"hwc"|"dla_linear"|"dla_hwc4")["+"fmt]
      --workspace=N               Set workspace size in MiB.
      --memPoolSize=poolspec      Specify the size constraints of the designated memory pool(s) in MiB.
                                  Note: Also accepts decimal sizes, e.g. 0.25MiB. Will be rounded down to the nearest integer bytes.
                                  Pool constraint: poolspec ::= poolfmt[","poolspec]
                                                   poolfmt ::= pool:sizeInMiB
                                                   pool ::= "workspace"|"dlaSRAM"|"dlaLocalDRAM"|"dlaGlobalDRAM"
      --profilingVerbosity=mode   Specify profiling verbosity. mode ::= layer_names_only|detailed|none (default = layer_names_only)
      --minTiming=M               Set the minimum number of iterations used in kernel selection (default = 1)
      --avgTiming=M               Set the number of times averaged in each iteration for kernel selection (default = 8)
      --refit                     Mark the engine as refittable. This will allow the inspection of refittable layers
                                  and weights within the engine.
      --sparsity=spec             Control sparsity (default = disabled).
                                  Sparsity: spec ::= "disable", "enable", "force"
                                  Note: Description about each of these options is as below
                                        disable = do not enable sparse tactics in the builder (this is the default)
                                        enable  = enable sparse tactics in the builder (but these tactics will only be
                                                  considered if the weights have the right sparsity pattern)
                                        force   = enable sparse tactics in the builder and force-overwrite the weights to have
                                                  a sparsity pattern (even if you loaded a model yourself)
      --noTF32                    Disable tf32 precision (default is to enable tf32, in addition to fp32)
      --fp16                      Enable fp16 precision, in addition to fp32 (default = disabled)
      --int8                      Enable int8 precision, in addition to fp32 (default = disabled)
      --best                      Enable all precisions to achieve the best performance (default = disabled)
      --directIO                  Avoid reformatting at network boundaries. (default = disabled)
      --precisionConstraints=spec Control precision constraint setting. (default = none)
                                      Precision Constaints: spec ::= "none" | "obey" | "prefer"
                                      none = no constraints
                                      prefer = meet precision constraints set by --layerPrecisions/--layerOutputTypes if possible
                                      obey = meet precision constraints set by --layerPrecisions/--layerOutputTypes or fail
                                             otherwise
      --layerPrecisions=spec      Control per-layer precision constraints. Effective only when precisionConstraints is set to
                                  "obey" or "prefer". (default = none)
                                  The specs are read left-to-right, and later ones override earlier ones. "*" can be used as a
                                  layerName to specify the default precision for all the unspecified layers.
                                  Per-layer precision spec ::= layerPrecision[","spec]
                                                      layerPrecision ::= layerName":"precision
                                                      precision ::= "fp32"|"fp16"|"int32"|"int8"
      --layerOutputTypes=spec     Control per-layer output type constraints. Effective only when precisionConstraints is set to
                                  "obey" or "prefer". (default = none)
                                  The specs are read left-to-right, and later ones override earlier ones. "*" can be used as a
                                  layerName to specify the default precision for all the unspecified layers. If a layer has more than
                                  one output, then multiple types separated by "+" can be provided for this layer.
                                  Per-layer output type spec ::= layerOutputTypes[","spec]
                                                        layerOutputTypes ::= layerName":"type
                                                        type ::= "fp32"|"fp16"|"int32"|"int8"["+"type]
      --calib=<file>              Read INT8 calibration cache file
      --safe                      Enable build safety certified engine
      --consistency               Perform consistency checking on safety certified engine
      --restricted                Enable safety scope checking with kSAFETY_SCOPE build flag
      --saveEngine=<file>         Save the serialized engine
      --loadEngine=<file>         Load a serialized engine
      --tacticSources=tactics     Specify the tactics to be used by adding (+) or removing (-) tactics from the default
                                  tactic sources (default = all available tactics).
                                  Note: Currently only cuDNN, cuBLAS, cuBLAS-LT, and edge mask convolutions are listed as optional
                                        tactics.
                                  Tactic Sources: tactics ::= [","tactic]
                                                  tactic  ::= (+|-)lib
                                                  lib     ::= "CUBLAS"|"CUBLAS_LT"|"CUDNN"|"EDGE_MASK_CONVOLUTIONS"
                                                              |"JIT_CONVOLUTIONS"
                                  For example, to disable cudnn and enable cublas: --tacticSources=-CUDNN,+CUBLAS
      --noBuilderCache            Disable timing cache in builder (default is to enable timing cache)
      --heuristic                 Enable tactic selection heuristic in builder (default is to disable the heuristic)
      --timingCacheFile=<file>    Save/load the serialized global timing cache
      --preview=features          Specify preview feature to be used by adding (+) or removing (-) preview features from the default
                                  Preview Features: features ::= [","feature]
                                                    feature  ::= (+|-)flag
                                                    flag     ::= "fasterDynamicShapes0805"
                                                                 |"disableExternalTacticSourcesForCore0805"
    
    === Inference Options ===
      --batch=N                   Set batch size for implicit batch engines (default = 1)
                                  This option should not be used when the engine is built from an ONNX model or when dynamic
                                  shapes are provided when the engine is built.
      --shapes=spec               Set input shapes for dynamic shapes inference inputs.
                                  Note: Input names can be wrapped with escaped single quotes (ex: \'Input:0\').
                                  Example input shapes spec: input0:1x3x256x256, input1:1x3x128x128
                                  Each input shape is supplied as a key-value pair where key is the input name and
                                  value is the dimensions (including the batch dimension) to be used for that input.
                                  Each key-value pair has the key and value separated using a colon (:).
                                  Multiple input shapes can be provided via comma-separated key-value pairs.
      --loadInputs=spec           Load input values from files (default = generate random inputs). Input names can be wrapped with single quotes (ex: 'Input:0')
                                  Input values spec ::= Ival[","spec]
                                               Ival ::= name":"file
      --iterations=N              Run at least N inference iterations (default = 10)
      --warmUp=N                  Run for N milliseconds to warmup before measuring performance (default = 200)
      --duration=N                Run performance measurements for at least N seconds wallclock time (default = 3)
      --sleepTime=N               Delay inference start with a gap of N milliseconds between launch and compute (default = 0)
      --idleTime=N                Sleep N milliseconds between two continuous iterations(default = 0)
      --streams=N                 Instantiate N engines to use concurrently (default = 1)
      --exposeDMA                 Serialize DMA transfers to and from device (default = disabled).
      --noDataTransfers           Disable DMA transfers to and from device (default = enabled).
      --useManagedMemory          Use managed memory instead of separate host and device allocations (default = disabled).
      --useSpinWait               Actively synchronize on GPU events. This option may decrease synchronization time but increase CPU usage and power (default = disabled)
      --threads                   Enable multithreading to drive engines with independent threads or speed up refitting (default = disabled)
      --useCudaGraph              Use CUDA graph to capture engine execution and then launch inference (default = disabled).
                                  This flag may be ignored if the graph capture fails.
      --timeDeserialize           Time the amount of time it takes to deserialize the network and exit.
      --timeRefit                 Time the amount of time it takes to refit the engine before inference.
      --separateProfileRun        Do not attach the profiler in the benchmark run; if profiling is enabled, a second profile run will be executed (default = disabled)
      --buildOnly                 Exit after the engine has been built and skip inference perf measurement (default = disabled)
      --persistentCacheRatio      Set the persistentCacheLimit in ratio, 0.5 represent half of max persistent L2 size (default = 0)
    
    === Build and Inference Batch Options ===
                                  When using implicit batch, the max batch size of the engine, if not given,
                                  is set to the inference batch size;
                                  when using explicit batch, if shapes are specified only for inference, they
                                  will be used also as min/opt/max in the build profile; if shapes are
                                  specified only for the build, the opt shapes will be used also for inference;
                                  if both are specified, they must be compatible; and if explicit batch is
                                  enabled but neither is specified, the model must provide complete static
                                  dimensions, including batch size, for all inputs
                                  Using ONNX models automatically forces explicit batch.
    
    === Reporting Options ===
      --verbose                   Use verbose logging (default = false)
      --avgRuns=N                 Report performance measurements averaged over N consecutive iterations (default = 10)
      --percentile=P1,P2,P3,...   Report performance for the P1,P2,P3,... percentages (0<=P_i<=100, 0 representing max perf, and 100 representing min perf; (default = 90,95,99%)
      --dumpRefit                 Print the refittable layers and weights from a refittable engine
      --dumpOutput                Print the output tensor(s) of the last inference iteration (default = disabled)
      --dumpProfile               Print profile information per layer (default = disabled)
      --dumpLayerInfo             Print layer information of the engine to console (default = disabled)
      --exportTimes=<file>        Write the timing results in a json file (default = disabled)
      --exportOutput=<file>       Write the output tensors to a json file (default = disabled)
      --exportProfile=<file>      Write the profile information per layer in a json file (default = disabled)
      --exportLayerInfo=<file>    Write the layer information of the engine in a json file (default = disabled)
    
    === System Options ===
      --device=N                  Select cuda device N (default = 0)
      --useDLACore=N              Select DLA core N for layers that support DLA (default = none)
      --allowGPUFallback          When DLA is enabled, allow GPU fallback for unsupported layers (default = disabled)
      --plugins                   Plugin library (.so) to load (can be specified multiple times)
    
    === Help ===
      --help, -h                  Print this message
    

    2. Using the TensorRT API to build a TensorRT engine

    import tensorrt as trt

    def generate_engine(onnx_path, engine_path):
        # 1. Create the TensorRT logger
        logger = trt.Logger(trt.Logger.WARNING)
        # Initialize the built-in plugins
        trt.init_libnvinfer_plugins(logger, namespace="")

        # 2. Create a builder with the logger
        builder = trt.Builder(logger)

        # 3. Create a builder config that controls how TensorRT optimizes the model
        config = builder.create_builder_config()
        # Set the workspace memory limit
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 20)  # 1 MiB
        # Set the precision
        config.set_flag(trt.BuilderFlag.FP16)
        # INT8 would additionally require calibration

        # 4. Create a network. EXPLICIT_BATCH: the batch dimension is explicit (can be dynamic)
        network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
        # Create the ONNX parser
        parser = trt.OnnxParser(network, logger)
        # Parse the ONNX model and populate the network
        success = parser.parse_from_file(onnx_path)
        # Report parsing errors
        for idx in range(parser.num_errors):
            print(parser.get_error(idx))
        if not success:
            pass  # Error handling code here

        # 5. Serialize the engine, i.e. generate the .engine model
        serialized_engine = builder.build_serialized_network(network, config)
        # Save the serialized engine for later use. The engine is not portable:
        # it is tied to the TensorRT version and the GPU type it was built on.
        with open(engine_path, "wb") as f:
            f.write(serialized_engine)

        # 6. To deserialize the engine for inference, use the runtime interface:
        # runtime = trt.Runtime(logger)
        # engine = runtime.deserialize_cuda_engine(serialized_engine)
        # with open("sample.engine", "rb") as f:
        #     serialized_engine = f.read()
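
    Note that if the ONNX model was exported with dynamic axes as in section 2.1, the builder config also needs an optimization profile so TensorRT knows the allowed range of input shapes. A minimal sketch of what could be added before build_serialized_network; the input name "input" and the shapes are illustrative assumptions:

    # Hypothetical addition inside generate_engine(), before building the engine.
    profile = builder.create_optimization_profile()
    profile.set_shape("input",             # must match the ONNX input name
                      (1, 3, 416, 416),    # minimum shape (assumed)
                      (4, 3, 416, 416),    # optimal shape (assumed)
                      (8, 3, 416, 416))    # maximum shape (assumed)
    config.add_optimization_profile(profile)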
    

    After completing the steps above, you will have a model file converted to TensorRT format (e.g. model.trt or model.engine). This file can then be used for TensorRT inference and deployment.

    3. TensorRT Deployment

    TensorRT deployment can be done through either the Python API or the C++ API.

    TensorRT Inference (Python API)

    Once the TensorRT environment is installed, you can try converting pretrained weights and deploying them by running the following code.

    import numpy as np
    import torch
    import tensorrt as trt
    from collections import OrderedDict, namedtuple

    def infer(img_data, engine_path):
        device = torch.device('cuda:0')  # device on which the binding buffers are allocated
        # 1. Logger
        logger = trt.Logger(trt.Logger.INFO)
        # 2. Use the runtime to deserialize the engine file
        runtime = trt.Runtime(logger)
        trt.init_libnvinfer_plugins(logger, '')  # initialize TensorRT plugins
        with open(engine_path, "rb") as f:
            serialized_engine = f.read()
        engine = runtime.deserialize_cuda_engine(serialized_engine)
    
        # 3. Set up the input/output bindings
        bindings = OrderedDict()
        Binding = namedtuple('Binding', ('name', 'dtype', 'shape', 'data', 'ptr'))
        fp16 = False
        for index in range(engine.num_bindings):
            name = engine.get_binding_name(index)
            dtype = trt.nptype(engine.get_binding_dtype(index))
            shape = tuple(engine.get_binding_shape(index))
            data = torch.from_numpy(np.empty(shape, dtype=np.dtype(dtype))).to(device)
            # Tensor.data_ptr() returns the address of the tensor's first element as an int
            bindings[name] = Binding(name, dtype, shape, data, int(data.data_ptr()))
            if engine.binding_is_input(index) and dtype == np.float16:
                fp16 = True
        # Record the device pointer address of each binding
        binding_addrs = OrderedDict((n, d.ptr) for n, d in bindings.items())
    
        # 4. Create an execution context, bind the input data, and run inference
        context = engine.create_execution_context()
        binding_addrs['images'] = int(img_data.data_ptr())  # 'images' is the input binding name of this engine
        context.execute_v2(list(binding_addrs.values()))
    
        # 5. Retrieve the results (using the input/output names set when the ONNX model was exported)
        nums = bindings['num'].data[0]
        boxes = bindings['boxes'].data[0]
        scores = bindings['scores'].data[0]
        classes = bindings['classes'].data[0]
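
    A usage sketch for the function above. The input tensor must already live on the GPU, because its data_ptr() is handed straight to the engine; the shape, the engine path, and the binding names ('images', 'num', 'boxes', 'scores', 'classes') are assumptions that have to match your exported model:

    import torch

    device = torch.device('cuda:0')
    # Dummy batch standing in for a preprocessed image (the shape is illustrative).
    img_data = torch.zeros((1, 3, 640, 640), dtype=torch.float32, device=device)
    infer(img_data, "model.engine")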
    

    TensorRT Inference (C++ API)

    In the project properties, go to Linker → Input and add the following to Additional Dependencies:

    cudnn.lib
    cublas.lib
    cudart.lib
    nvinfer.lib
    nvparsers.lib
    nvonnxparser.lib
    nvinfer_plugin.lib
    opencv_world460d.lib
    

    Complete inference code:

    #include <cassert>
    #include <cfloat>
    #include <fstream>
    #include <iostream>
    #include <memory>
    #include <sstream>
    
    #include <cuda_runtime_api.h>
    #include "NvInfer.h"
    #include "NvOnnxParser.h"
    #include "logger.h"
    
    
    using sample::gLogError;
    using sample::gLogInfo;
    
    using namespace nvinfer1;
    
    // The logger controls which log messages get printed
    // TRTLogger derives from nvinfer1::ILogger
    class TRTLogger : public nvinfer1::ILogger
    {
    	void log(Severity severity, const char *msg) noexcept override
    	{
    		// Suppress INFO-level log messages
    		if (severity != Severity::kINFO)
    			std::cout << msg << std::endl;
    	}
    } gLogger;
    int ReadEngineData(char* enginePath, char *&engineData)
    {
    	// Read the serialized engine file
    	std::ifstream engineFile(enginePath, std::ios::binary);
    	if (engineFile.fail())
    	{
    		std::cerr << "Failed to open file!" << std::endl;
    		return -1;
    	}
    
    	engineFile.seekg(0, std::ifstream::end);
    	auto fsize = engineFile.tellg();
    	engineFile.seekg(0, std::ifstream::beg);
    
    	if (nullptr == engineData)
    	{
    		engineData = new char[fsize];
    	}
    
    	engineFile.read(engineData, fsize);
    	engineFile.close();
    	return fsize;
    }
    size_t getMemorySize(nvinfer1::Dims32 input_dims, int typeSize)
    {
    	size_t psize = input_dims.d[0] * input_dims.d[1] * input_dims.d[2] * input_dims.d[3] * typeSize;
    	return psize;
    }
    bool inferDemo(float* input_buffer, int* tensorSize)
    {
    	int batchsize = tensorSize[0];
    	int channel = tensorSize[1];
    	int width = tensorSize[2];
    	int height = tensorSize[3];
    
    	size_t dataSize = width * height*channel*batchsize;
    
    	// Read the engine file
    	char* enginePath = "net_model.engine";
    	char* engineData = nullptr;
    	int fsize = ReadEngineData(enginePath, engineData);
    	printf("fsize=%d\n", fsize);
    
    	// Create the runtime & load the engine
    	// TRTLogger glogger; // could be used here instead of sample::gLogger.getTRTLogger()
    	std::unique_ptr<nvinfer1::IRuntime> runtime{ nvinfer1::createInferRuntime(sample::gLogger.getTRTLogger()) };
    	std::unique_ptr<nvinfer1::ICudaEngine> mEngine(runtime->deserializeCudaEngine(engineData, fsize, nullptr));
    	assert(mEngine.get() != nullptr);
    	delete[] engineData;  // the deserialized engine keeps its own copy of the data
    	engineData = nullptr;
    
    	// Create the execution context
    	std::unique_ptr<nvinfer1::IExecutionContext> context(mEngine->createExecutionContext());
    	const char* name0 = mEngine->getBindingName(0);
    	const char* name1 = mEngine->getBindingName(1);
    	const char* name2 = mEngine->getBindingName(2);
    	const char* name3 = mEngine->getBindingName(3);
    
    	printf("name0=%s\nname1=%s\nname2=%s\nname3=%s\n", name0, name1, name2, name3);
    	// Get the input size
    	auto input_idx = mEngine->getBindingIndex("input");
     	if (input_idx == -1)
    	{
    		return false;
    	}
    	assert(mEngine->getBindingDataType(input_idx) == nvinfer1::DataType::kFLOAT);
    	auto input_dims = context->getBindingDimensions(input_idx);
    	context->setBindingDimensions(input_idx, input_dims);
    	auto input_size = getMemorySize(input_dims, sizeof(float_t));
    
    	// Get the output sizes; device memory must be allocated for every output
    	auto output1_idx = mEngine->getBindingIndex("output1");
    	if (output1_idx == -1)
    	{
    		return false;
    	}
    	assert(mEngine->getBindingDataType(output1_idx) == nvinfer1::DataType::kFLOAT);
    	auto output1_dims = context->getBindingDimensions(output1_idx);
    	auto output1_size = getMemorySize(output1_dims, sizeof(float_t));
    
    	auto output2_idx = mEngine->getBindingIndex("output2");
    	if (output2_idx == -1)
    	{
    		return false;
    	}
    	assert(mEngine->getBindingDataType(output2_idx) == nvinfer1::DataType::kFLOAT);
    	auto output2_dims = context->getBindingDimensions(output2_idx);
    	auto output2_size = getMemorySize(output2_dims, sizeof(float_t));
    
    	auto output3_idx = mEngine->getBindingIndex("output3");
    	if (output3_idx == -1)
    	{
    		return false;
    	}
    	assert(mEngine->getBindingDataType(output3_idx) == nvinfer1::DataType::kFLOAT);
    	auto output3_dims = context->getBindingDimensions(output3_idx);
    	auto output3_size = getMemorySize(output3_dims, sizeof(float_t));
    
    	// Prepare for inference
    	// Allocate CUDA memory
    	void* input_mem{ nullptr };
    	if (cudaMalloc(&input_mem, input_size) != cudaSuccess)
    	{
    		gLogError << "ERROR: input cuda memory allocation failed, size = " << input_size << " bytes" << std::endl;
    		return false;
    	}
    
    	void* output1_mem{ nullptr };
    	if (cudaMalloc(&output1_mem, output1_size) != cudaSuccess)
    	{
    		gLogError << "ERROR: output cuda memory allocation failed, size = " << output1_size << " bytes" << std::endl;
    		return false;
    	}
    	void* output2_mem{ nullptr };
    	if (cudaMalloc(&output2_mem, output2_size) != cudaSuccess)
    	{
    		gLogError << "ERROR: output cuda memory allocation failed, size = " << output2_size << " bytes" << std::endl;
    		return false;
    	}
    	void* output3_mem{ nullptr };
    	if (cudaMalloc(&output3_mem, output3_size) != cudaSuccess)
    	{
    		gLogError << "ERROR: output cuda memory allocation failed, size = " << output3_size << " bytes" << std::endl;
    		return false;
    	}
    
    	// Copy the input data to the device
    	cudaMemcpy(input_mem, input_buffer, input_size, cudaMemcpyHostToDevice); // cudaMemcpyHostToDevice: host to device, i.e. system memory to GPU memory
    
    	// Bind the input/output device buffers and submit them together for inference
    	void* bindings[4];
    	bindings[input_idx] = input_mem;
    	bindings[output1_idx] = output1_mem;
    	bindings[output2_idx] = output2_mem;
    	bindings[output3_idx] = output3_mem;
    
    	// Run inference
    	bool status = context->executeV2(bindings);
    
    	if (!status)
    	{
    		gLogError << "ERROR: inference failed" << std::endl;
    		cudaFree(input_mem);
    		cudaFree(output1_mem);
    		cudaFree(output2_mem);
    		cudaFree(output3_mem);
    		return false;
    	}
    
    	// Retrieve the results
    	float* output3_buffer = new float[dataSize];
    	cudaMemcpy(output3_buffer, output3_mem, output3_size, cudaMemcpyDeviceToHost);
    
    	// Free the CUDA memory
    	cudaFree(input_mem);
    	cudaFree(output1_mem);
    	cudaFree(output2_mem);
    	cudaFree(output3_mem);
    
    	cudaError_t err = cudaGetLastError();
    	if (err != cudaSuccess) {
    		gLogError << "ERROR: failed to free CUDA memory: " << cudaGetErrorString(err) << std::endl;
    		return false;
    	}
    
    	// save the results
    
    
    
    	delete[] output3_buffer;
    	output3_buffer = nullptr;
    
    	return true;
    }
    int main()
    {
    	int batchsize = 1;
    	int channel = 3;
    	int width = 256;
    	int height = 256;
    	size_t dataSize = width * height*channel*batchsize;
    	int tensorSize[4] = { batchsize, channel, width, height };
    	float* input_buffer = new float[dataSize];
    	for (int i = 0; i < dataSize; i++)
    		input_buffer[i] = 0.1;
    
    	inferDemo(input_buffer, tensorSize);
    
    	delete[] input_buffer;
    	input_buffer = nullptr;
    
    	system("pause");
    	return 0;
    }
    

    Possible Issues

    cuDNN library errors during TRT conversion:

    Install the correct cuDNN version, and also copy the DLLs from the TensorRT lib folder into CUDA's bin folder.

    Missing zlibwapi.dll during TRT conversion:
    Could not locate zlibwapi.dll. Please make sure it is in your library path!

    Solutions found online include downloading it from the NVIDIA website (though it seems to have been taken down since 2023) or building it from source.
    zlib source: https://github.com/madler/zlib
    In my case, I searched my machine for an existing zlibwapi.dll (found under the installed PyTorch path, DingTalk, and Origin) and copied it into CUDA's bin folder, which solved the problem.

    Size errors during TensorRT inference:
    If the model was exported with dynamic shapes, you need to set the actual input shape yourself before running inference.
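
    With the Python API, for example, the actual shape can be set on the execution context before the buffers are bound. A minimal sketch; the binding index 0 and the shape are assumptions:

    # Assumes `context` was created from an engine built with dynamic shapes.
    context.set_binding_shape(0, (1, 3, 416, 416))  # binding 0 is the input; the shape is illustrative
    assert context.all_binding_shapes_specified     # all dynamic inputs must be set before inference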


    Author: 痛&快乐着

