A Beginner's Guide to TensorRT Model Deployment (Python)

Table of Contents

  • 1. TensorRT Installation
  • 1.1 CUDA/cuDNN and virtual environment setup
  • 1.2 Installing the TensorRT version that matches your CUDA version
  • 2. Model Conversion
  • 2.1 Converting .pth to ONNX
  • 2.2 Converting ONNX to an engine
  • 3. TensorRT Deployment
  • TensorRT Inference (Python API)
  • TensorRT Inference (C++ API)
  • Possible Issues

    1. TensorRT Installation

    1.1 CUDA/cuDNN and virtual environment setup

    CUDA download link: https://developer.nvidia.com/cuda-toolkit-archive
    cuDNN download link: https://docs.nvidia.com/deeplearning/cudnn/latest/installation/windows.html
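
    Once CUDA, cuDNN, and the virtual environment are set up, a quick sanity check from Python confirms that the GPU is visible. This is a minimal sketch and assumes PyTorch is already installed in the active environment:

    import torch

    print(torch.__version__)               # PyTorch build
    print(torch.version.cuda)              # CUDA version PyTorch was compiled against
    print(torch.cuda.is_available())       # True if the driver/toolkit setup is working
    print(torch.backends.cudnn.version())  # cuDNN version visible to PyTorch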

    1.2 Installing the TensorRT version that matches your CUDA version

    TensorRT download link: https://developer.nvidia.com/tensorrt/download

    TensorRT installation guide:
    After downloading, extract the archive and add its lib folder to your environment variables (PATH).
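
    To verify the setup from Python, the tensorrt wheel shipped in the package's python folder can be pip-installed into the virtual environment and imported. A minimal sanity check, assuming that wheel is installed:

    import tensorrt as trt

    print(trt.__version__)                 # should match the TensorRT release you downloaded
    builder = trt.Builder(trt.Logger(trt.Logger.WARNING))
    print(builder.platform_has_fast_fp16)  # True if the GPU has fast native FP16 support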

    2. Model Conversion

    2.1 Converting .pth to ONNX

    Install the onnx module for Python: pip install onnx

    import torch

    # `model` is the loaded PyTorch model (in eval mode) and `x` is a dummy input
    # tensor with a representative shape, e.g. x = torch.randn(1, 3, 416, 416).
    input_name = 'input'
    output_name = 'output'
    torch.onnx.export(model,                        # model being run
                      x,                            # model input
                      "model.onnx",                 # where to save the model (can be a file or file-like object)
                      opset_version=11,             # the ONNX opset version to export the model to
                      input_names=[input_name],     # the model's input names
                      output_names=[output_name],   # the model's output names
                      dynamic_axes={
                          input_name: {0: 'batch_size', 2: 'in_width', 3: 'in_height'},
                          output_name: {0: 'batch_size', 2: 'out_width', 3: 'out_height'}
                      })
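
    Before handing the exported model to TensorRT, it is worth validating it with the onnx package. A minimal check, assuming the file was saved as model.onnx as in the snippet above:

    import onnx

    onnx_model = onnx.load("model.onnx")
    onnx.checker.check_model(onnx_model)                  # raises an exception if the graph is invalid
    print(onnx.helper.printable_graph(onnx_model.graph))  # prints a human-readable summary of the graph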
    

    2.2 Converting ONNX to an engine

    Note: TensorRT's ONNX parser is built against specific ONNX/opset versions; if the version used when exporting from PyTorch does not match what your TensorRT release supports, errors may occur during conversion.
    1. Using the command-line tool

    This mainly involves running the trtexec executable from the bin folder.

    trtexec.exe --onnx=model.onnx --saveEngine=model.engine --workspace=6000
    
    # Build an engine with a static batch size
    ./trtexec 	--onnx=<onnx_file> \ 						# path to the ONNX model file
            	--explicitBatch \ 							# build the engine with an explicit batch size (default = implicit batch)
            	--saveEngine=<tensorRT_engine_file> \ 		# output engine file
            	--workspace=<size_in_megabytes> \ 			# workspace size in MB (default = 16 MB)
            	--fp16 										# enable fp16 precision in addition to fp32 (default = disabled)

    # Build an engine with a dynamic batch size
    ./trtexec 	--onnx=<onnx_file> \						# path to the ONNX model file
            	--minShapes=input:<shape_of_min_batch> \ 	# minimum NCHW input shape
            	--optShapes=input:<shape_of_opt_batch> \  	# optimal input shape; using the same value as maxShapes is fine
            	--maxShapes=input:<shape_of_max_batch> \ 	# maximum NCHW input shape
            	--workspace=<size_in_megabytes> \ 			# workspace size in MB (default = 16 MB)
            	--saveEngine=<engine_file> \   				# output engine file
            	--fp16   									# enable fp16 precision in addition to fp32 (default = disabled)
    
    
    # Smaller images can use a larger batch size, e.g. 8x3x416x416
    /home/zxl/TensorRT-7.2.3.4/bin/trtexec  --onnx=yolov4_-1_3_416_416_dynamic.onnx \
                                            --minShapes=input:1x3x416x416 \
                                            --optShapes=input:8x3x416x416 \
                                            --maxShapes=input:8x3x416x416 \
                                            --workspace=4096 \
                                            --saveEngine=yolov4_-1_3_416_416_dynamic_b8_fp16.engine \
                                            --fp16
    
    # Reduced to 4x3x608x608 because of insufficient GPU memory
    /home/zxl/TensorRT-7.2.3.4/bin/trtexec  --onnx=yolov4_-1_3_608_608_dynamic.onnx \
                                            --minShapes=input:1x3x608x608 \
                                            --optShapes=input:4x3x608x608 \
                                            --maxShapes=input:4x3x608x608 \
                                            --workspace=4096 \
                                            --saveEngine=yolov4_-1_3_608_608_dynamic_b4_fp16.engine \
                                            --fp16           
                                            
    

    You can also run trtexec.exe --help to see what each trtexec option means.

    D:\Work\cuda_gpu\sdk\TensorRT-8.5.1.7\bin>trtexec.exe --help
    &&&& RUNNING TensorRT.trtexec [TensorRT v8501] # trtexec.exe --help
    === Model Options ===
      --uff=<file>                UFF model
      --onnx=<file>               ONNX model
      --model=<file>              Caffe model (default = no model, random weights used)
      --deploy=<file>             Caffe prototxt file
      --output=<name>[,<name>]*   Output names (it can be specified multiple times); at least one output is required for UFF and Caffe
      --uffInput=<name>,X,Y,Z     Input blob name and its dimensions (X,Y,Z=C,H,W), it can be specified multiple times; at least one is required for UFF models
      --uffNHWC                   Set if inputs are in the NHWC layout instead of NCHW (use X,Y,Z=H,W,C order in --uffInput)
    
    === Build Options ===
      --maxBatch                  Set max batch size and build an implicit batch engine (default = same size as --batch)
                                  This option should not be used when the input model is ONNX or when dynamic shapes are provided.
      --minShapes=spec            Build with dynamic shapes using a profile with the min shapes provided
      --optShapes=spec            Build with dynamic shapes using a profile with the opt shapes provided
      --maxShapes=spec            Build with dynamic shapes using a profile with the max shapes provided
      --minShapesCalib=spec       Calibrate with dynamic shapes using a profile with the min shapes provided
      --optShapesCalib=spec       Calibrate with dynamic shapes using a profile with the opt shapes provided
      --maxShapesCalib=spec       Calibrate with dynamic shapes using a profile with the max shapes provided
                                  Note: All three of min, opt and max shapes must be supplied.
                                        However, if only opt shapes is supplied then it will be expanded so
                                        that min shapes and max shapes are set to the same values as opt shapes.
                                        Input names can be wrapped with escaped single quotes (ex: \'Input:0\').
                                  Example input shapes spec: input0:1x3x256x256,input1:1x3x128x128
                                  Each input shape is supplied as a key-value pair where key is the input name and
                                  value is the dimensions (including the batch dimension) to be used for that input.
                                  Each key-value pair has the key and value separated using a colon (:).
                                  Multiple input shapes can be provided via comma-separated key-value pairs.
      --inputIOFormats=spec       Type and format of each of the input tensors (default = all inputs in fp32:chw)
                                  See --outputIOFormats help for the grammar of type and format list.
                                  Note: If this option is specified, please set comma-separated types and formats for all
                                        inputs following the same order as network inputs ID (even if only one input
                                        needs specifying IO format) or set the type and format once for broadcasting.
      --outputIOFormats=spec      Type and format of each of the output tensors (default = all outputs in fp32:chw)
                                  Note: If this option is specified, please set comma-separated types and formats for all
                                        outputs following the same order as network outputs ID (even if only one output
                                        needs specifying IO format) or set the type and format once for broadcasting.
                                  IO Formats: spec  ::= IOfmt[","spec]
                                              IOfmt ::= type:fmt
                                              type  ::= "fp32"|"fp16"|"int32"|"int8"
                                              fmt   ::= ("chw"|"chw2"|"chw4"|"hwc8"|"chw16"|"chw32"|"dhwc8"|
                                                         "cdhw32"|"hwc"|"dla_linear"|"dla_hwc4")["+"fmt]
      --workspace=N               Set workspace size in MiB.
      --memPoolSize=poolspec      Specify the size constraints of the designated memory pool(s) in MiB.
                                  Note: Also accepts decimal sizes, e.g. 0.25MiB. Will be rounded down to the nearest integer bytes.
                                  Pool constraint: poolspec ::= poolfmt[","poolspec]
                                                   poolfmt ::= pool:sizeInMiB
                                                   pool ::= "workspace"|"dlaSRAM"|"dlaLocalDRAM"|"dlaGlobalDRAM"
      --profilingVerbosity=mode   Specify profiling verbosity. mode ::= layer_names_only|detailed|none (default = layer_names_only)
      --minTiming=M               Set the minimum number of iterations used in kernel selection (default = 1)
      --avgTiming=M               Set the number of times averaged in each iteration for kernel selection (default = 8)
      --refit                     Mark the engine as refittable. This will allow the inspection of refittable layers
                                  and weights within the engine.
      --sparsity=spec             Control sparsity (default = disabled).
                                  Sparsity: spec ::= "disable", "enable", "force"
                                  Note: Description about each of these options is as below
                                        disable = do not enable sparse tactics in the builder (this is the default)
                                        enable  = enable sparse tactics in the builder (but these tactics will only be
                                                  considered if the weights have the right sparsity pattern)
                                        force   = enable sparse tactics in the builder and force-overwrite the weights to have
                                                  a sparsity pattern (even if you loaded a model yourself)
      --noTF32                    Disable tf32 precision (default is to enable tf32, in addition to fp32)
      --fp16                      Enable fp16 precision, in addition to fp32 (default = disabled)
      --int8                      Enable int8 precision, in addition to fp32 (default = disabled)
      --best                      Enable all precisions to achieve the best performance (default = disabled)
      --directIO                  Avoid reformatting at network boundaries. (default = disabled)
      --precisionConstraints=spec Control precision constraint setting. (default = none)
                                      Precision Constaints: spec ::= "none" | "obey" | "prefer"
                                      none = no constraints
                                      prefer = meet precision constraints set by --layerPrecisions/--layerOutputTypes if possible
                                      obey = meet precision constraints set by --layerPrecisions/--layerOutputTypes or fail
                                             otherwise
      --layerPrecisions=spec      Control per-layer precision constraints. Effective only when precisionConstraints is set to
                                  "obey" or "prefer". (default = none)
                                  The specs are read left-to-right, and later ones override earlier ones. "*" can be used as a
                                  layerName to specify the default precision for all the unspecified layers.
                                  Per-layer precision spec ::= layerPrecision[","spec]
                                                      layerPrecision ::= layerName":"precision
                                                      precision ::= "fp32"|"fp16"|"int32"|"int8"
      --layerOutputTypes=spec     Control per-layer output type constraints. Effective only when precisionConstraints is set to
                                  "obey" or "prefer". (default = none)
                                  The specs are read left-to-right, and later ones override earlier ones. "*" can be used as a
                                  layerName to specify the default precision for all the unspecified layers. If a layer has more than
                                  one output, then multiple types separated by "+" can be provided for this layer.
                                  Per-layer output type spec ::= layerOutputTypes[","spec]
                                                        layerOutputTypes ::= layerName":"type
                                                        type ::= "fp32"|"fp16"|"int32"|"int8"["+"type]
      --calib=<file>              Read INT8 calibration cache file
      --safe                      Enable build safety certified engine
      --consistency               Perform consistency checking on safety certified engine
      --restricted                Enable safety scope checking with kSAFETY_SCOPE build flag
      --saveEngine=<file>         Save the serialized engine
      --loadEngine=<file>         Load a serialized engine
      --tacticSources=tactics     Specify the tactics to be used by adding (+) or removing (-) tactics from the default
                                  tactic sources (default = all available tactics).
                                  Note: Currently only cuDNN, cuBLAS, cuBLAS-LT, and edge mask convolutions are listed as optional
                                        tactics.
                                  Tactic Sources: tactics ::= [","tactic]
                                                  tactic  ::= (+|-)lib
                                                  lib     ::= "CUBLAS"|"CUBLAS_LT"|"CUDNN"|"EDGE_MASK_CONVOLUTIONS"
                                                              |"JIT_CONVOLUTIONS"
                                  For example, to disable cudnn and enable cublas: --tacticSources=-CUDNN,+CUBLAS
      --noBuilderCache            Disable timing cache in builder (default is to enable timing cache)
      --heuristic                 Enable tactic selection heuristic in builder (default is to disable the heuristic)
      --timingCacheFile=<file>    Save/load the serialized global timing cache
      --preview=features          Specify preview feature to be used by adding (+) or removing (-) preview features from the default
                                  Preview Features: features ::= [","feature]
                                                    feature  ::= (+|-)flag
                                                    flag     ::= "fasterDynamicShapes0805"
                                                                 |"disableExternalTacticSourcesForCore0805"
    
    === Inference Options ===
      --batch=N                   Set batch size for implicit batch engines (default = 1)
                                  This option should not be used when the engine is built from an ONNX model or when dynamic
                                  shapes are provided when the engine is built.
      --shapes=spec               Set input shapes for dynamic shapes inference inputs.
                                  Note: Input names can be wrapped with escaped single quotes (ex: \'Input:0\').
                                  Example input shapes spec: input0:1x3x256x256, input1:1x3x128x128
                                  Each input shape is supplied as a key-value pair where key is the input name and
                                  value is the dimensions (including the batch dimension) to be used for that input.
                                  Each key-value pair has the key and value separated using a colon (:).
                                  Multiple input shapes can be provided via comma-separated key-value pairs.
      --loadInputs=spec           Load input values from files (default = generate random inputs). Input names can be wrapped with single quotes (ex: 'Input:0')
                                  Input values spec ::= Ival[","spec]
                                               Ival ::= name":"file
      --iterations=N              Run at least N inference iterations (default = 10)
      --warmUp=N                  Run for N milliseconds to warmup before measuring performance (default = 200)
      --duration=N                Run performance measurements for at least N seconds wallclock time (default = 3)
      --sleepTime=N               Delay inference start with a gap of N milliseconds between launch and compute (default = 0)
      --idleTime=N                Sleep N milliseconds between two continuous iterations(default = 0)
      --streams=N                 Instantiate N engines to use concurrently (default = 1)
      --exposeDMA                 Serialize DMA transfers to and from device (default = disabled).
      --noDataTransfers           Disable DMA transfers to and from device (default = enabled).
      --useManagedMemory          Use managed memory instead of separate host and device allocations (default = disabled).
      --useSpinWait               Actively synchronize on GPU events. This option may decrease synchronization time but increase CPU usage and power (default = disabled)
      --threads                   Enable multithreading to drive engines with independent threads or speed up refitting (default = disabled)
      --useCudaGraph              Use CUDA graph to capture engine execution and then launch inference (default = disabled).
                                  This flag may be ignored if the graph capture fails.
      --timeDeserialize           Time the amount of time it takes to deserialize the network and exit.
      --timeRefit                 Time the amount of time it takes to refit the engine before inference.
      --separateProfileRun        Do not attach the profiler in the benchmark run; if profiling is enabled, a second profile run will be executed (default = disabled)
      --buildOnly                 Exit after the engine has been built and skip inference perf measurement (default = disabled)
      --persistentCacheRatio      Set the persistentCacheLimit in ratio, 0.5 represent half of max persistent L2 size (default = 0)
    
    === Build and Inference Batch Options ===
                                  When using implicit batch, the max batch size of the engine, if not given,
                                  is set to the inference batch size;
                                  when using explicit batch, if shapes are specified only for inference, they
                                  will be used also as min/opt/max in the build profile; if shapes are
                                  specified only for the build, the opt shapes will be used also for inference;
                                  if both are specified, they must be compatible; and if explicit batch is
                                  enabled but neither is specified, the model must provide complete static
                                  dimensions, including batch size, for all inputs
                                  Using ONNX models automatically forces explicit batch.
    
    === Reporting Options ===
      --verbose                   Use verbose logging (default = false)
      --avgRuns=N                 Report performance measurements averaged over N consecutive iterations (default = 10)
      --percentile=P1,P2,P3,...   Report performance for the P1,P2,P3,... percentages (0<=P_i<=100, 0 representing max perf, and 100 representing min perf; (default = 90,95,99%)
      --dumpRefit                 Print the refittable layers and weights from a refittable engine
      --dumpOutput                Print the output tensor(s) of the last inference iteration (default = disabled)
      --dumpProfile               Print profile information per layer (default = disabled)
      --dumpLayerInfo             Print layer information of the engine to console (default = disabled)
      --exportTimes=<file>        Write the timing results in a json file (default = disabled)
      --exportOutput=<file>       Write the output tensors to a json file (default = disabled)
      --exportProfile=<file>      Write the profile information per layer in a json file (default = disabled)
      --exportLayerInfo=<file>    Write the layer information of the engine in a json file (default = disabled)
    
    === System Options ===
      --device=N                  Select cuda device N (default = 0)
      --useDLACore=N              Select DLA core N for layers that support DLA (default = none)
      --allowGPUFallback          When DLA is enabled, allow GPU fallback for unsupported layers (default = disabled)
      --plugins                   Plugin library (.so) to load (can be specified multiple times)
    
    === Help ===
      --help, -h                  Print this message
    

    2. Using the TensorRT API to build a TensorRT engine

    import tensorrt as trt

    def generate_engine(onnx_path, engine_path):
        # 1. Create the TensorRT logger
        logger = trt.Logger(trt.Logger.WARNING)
        # Initialize the built-in plugins
        trt.init_libnvinfer_plugins(logger, namespace="")

        # 2. Create a builder with the logger
        builder = trt.Builder(logger)

        # 3. Create a builder config that controls how TensorRT optimizes the model
        config = builder.create_builder_config()
        # Set the workspace memory limit
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 20)  # 1 MiB
        # Set the precision
        config.set_flag(trt.BuilderFlag.FP16)
        # INT8 would additionally require calibration

        # 4. Create a network. EXPLICIT_BATCH: the batch dimension is explicit (can be dynamic)
        network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
        # Create the ONNX parser
        parser = trt.OnnxParser(network, logger)
        # Parse the ONNX model and populate the network
        success = parser.parse_from_file(onnx_path)
        # Report parsing errors
        for idx in range(parser.num_errors):
            print(parser.get_error(idx))
        if not success:
            pass  # Error handling code here

        # 5. Serialize the engine, i.e. generate the .engine model
        serialized_engine = builder.build_serialized_network(network, config)
        # Save the serialized engine for later use. The engine is not portable:
        # it is tied to the TensorRT version and the GPU type it was built on.
        with open(engine_path, "wb") as f:
            f.write(serialized_engine)

        # 6. To deserialize the engine for inference, use the runtime interface:
        # runtime = trt.Runtime(logger)
        # engine = runtime.deserialize_cuda_engine(serialized_engine)
        # with open("sample.engine", "rb") as f:
        #     serialized_engine = f.read()
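
    Note that if the ONNX model was exported with dynamic axes as in section 2.1, the builder config also needs an optimization profile so TensorRT knows the allowed range of input shapes. A minimal sketch of what could be added before build_serialized_network; the input name "input" and the shapes are illustrative assumptions:

    # Hypothetical addition inside generate_engine(), before building the engine.
    profile = builder.create_optimization_profile()
    profile.set_shape("input",             # must match the ONNX input name
                      (1, 3, 416, 416),    # minimum shape (assumed)
                      (4, 3, 416, 416),    # optimal shape (assumed)
                      (8, 3, 416, 416))    # maximum shape (assumed)
    config.add_optimization_profile(profile)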
    

    After completing the steps above, you will have a model file converted to TensorRT format (e.g. model.trt or model.engine). This file can then be used for TensorRT inference and deployment.

    3. TensorRT Deployment

    TensorRT deployment can be done through either the Python API or the C++ API.

    TensorRT Inference (Python API)

    Once the TensorRT environment is installed, you can try converting pretrained weights and deploying them by running the following code.

    import numpy as np
    import torch
    import tensorrt as trt
    from collections import OrderedDict, namedtuple

    def infer(img_data, engine_path):
        device = torch.device('cuda:0')  # device on which the binding buffers are allocated
        # 1. Logger
        logger = trt.Logger(trt.Logger.INFO)
        # 2. Use the runtime to deserialize the engine file
        runtime = trt.Runtime(logger)
        trt.init_libnvinfer_plugins(logger, '')  # initialize TensorRT plugins
        with open(engine_path, "rb") as f:
            serialized_engine = f.read()
        engine = runtime.deserialize_cuda_engine(serialized_engine)
    
        # 3. Set up the input/output bindings
        bindings = OrderedDict()
        Binding = namedtuple('Binding', ('name', 'dtype', 'shape', 'data', 'ptr'))
        fp16 = False
        for index in range(engine.num_bindings):
            name = engine.get_binding_name(index)
            dtype = trt.nptype(engine.get_binding_dtype(index))
            shape = tuple(engine.get_binding_shape(index))
            data = torch.from_numpy(np.empty(shape, dtype=np.dtype(dtype))).to(device)
            # Tensor.data_ptr() returns the address of the tensor's first element as an int
            bindings[name] = Binding(name, dtype, shape, data, int(data.data_ptr()))
            if engine.binding_is_input(index) and dtype == np.float16:
                fp16 = True
        # Record the device pointer address of each binding
        binding_addrs = OrderedDict((n, d.ptr) for n, d in bindings.items())
    
        # 4. Create an execution context, bind the input data, and run inference
        context = engine.create_execution_context()
        binding_addrs['images'] = int(img_data.data_ptr())  # 'images' is the input binding name of this engine
        context.execute_v2(list(binding_addrs.values()))
    
        # 5. Retrieve the results (using the input/output names set when the ONNX model was exported)
        nums = bindings['num'].data[0]
        boxes = bindings['boxes'].data[0]
        scores = bindings['scores'].data[0]
        classes = bindings['classes'].data[0]
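
    A usage sketch for the function above. The input tensor must already live on the GPU, because its data_ptr() is handed straight to the engine; the shape, the engine path, and the binding names ('images', 'num', 'boxes', 'scores', 'classes') are assumptions that have to match your exported model:

    import torch

    device = torch.device('cuda:0')
    # Dummy batch standing in for a preprocessed image (the shape is illustrative).
    img_data = torch.zeros((1, 3, 640, 640), dtype=torch.float32, device=device)
    infer(img_data, "model.engine")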
    

    TensorRT Inference (C++ API)

    In the project properties, go to Linker → Input and add the following to Additional Dependencies:

    cudnn.lib
    cublas.lib
    cudart.lib
    nvinfer.lib
    nvparsers.lib
    nvonnxparser.lib
    nvinfer_plugin.lib
    opencv_world460d.lib
    

    Complete inference code:

    #include <cassert>
    #include <cfloat>
    #include <fstream>
    #include <iostream>
    #include <memory>
    #include <sstream>
    
    #include <cuda_runtime_api.h>
    #include "NvInfer.h"
    #include "NvOnnxParser.h"
    #include "logger.h"
    
    
    using sample::gLogError;
    using sample::gLogInfo;
    
    using namespace nvinfer1;
    
    // The logger controls which log messages get printed
    // TRTLogger derives from nvinfer1::ILogger
    class TRTLogger : public nvinfer1::ILogger
    {
    	void log(Severity severity, const char *msg) noexcept override
    	{
    		// Suppress INFO-level log messages
    		if (severity != Severity::kINFO)
    			std::cout << msg << std::endl;
    	}
    } gLogger;
    int ReadEngineData(char* enginePath, char *&engineData)
    {
    	// Read the serialized engine file
    	std::ifstream engineFile(enginePath, std::ios::binary);
    	if (engineFile.fail())
    	{
    		std::cerr << "Failed to open file!" << std::endl;
    		return -1;
    	}
    
    	engineFile.seekg(0, std::ifstream::end);
    	auto fsize = engineFile.tellg();
    	engineFile.seekg(0, std::ifstream::beg);
    
    	if (nullptr == engineData)
    	{
    		engineData = new char[fsize];
    	}
    
    	engineFile.read(engineData, fsize);
    	engineFile.close();
    	return fsize;
    }
    size_t getMemorySize(nvinfer1::Dims32 input_dims, int typeSize)
    {
    	size_t psize = input_dims.d[0] * input_dims.d[1] * input_dims.d[2] * input_dims.d[3] * typeSize;
    	return psize;
    }
    bool inferDemo(float* input_buffer, int* tensorSize)
    {
    	int batchsize = tensorSize[0];
    	int channel = tensorSize[1];
    	int width = tensorSize[2];
    	int height = tensorSize[3];
    
    	size_t dataSize = width * height*channel*batchsize;
    
    	// Read the engine file
    	char* enginePath = "net_model.engine";
    	char* engineData = nullptr;
    	int fsize = ReadEngineData(enginePath, engineData);
    	printf("fsize=%d\n", fsize);
    
    	// Create the runtime & load the engine
    	// TRTLogger glogger; // could be used here instead of sample::gLogger.getTRTLogger()
    	std::unique_ptr<nvinfer1::IRuntime> runtime{ nvinfer1::createInferRuntime(sample::gLogger.getTRTLogger()) };
    	std::unique_ptr<nvinfer1::ICudaEngine> mEngine(runtime->deserializeCudaEngine(engineData, fsize, nullptr));
    	assert(mEngine.get() != nullptr);
    	delete[] engineData;  // the deserialized engine keeps its own copy of the data
    	engineData = nullptr;
    
    	// Create the execution context
    	std::unique_ptr<nvinfer1::IExecutionContext> context(mEngine->createExecutionContext());
    	const char* name0 = mEngine->getBindingName(0);
    	const char* name1 = mEngine->getBindingName(1);
    	const char* name2 = mEngine->getBindingName(2);
    	const char* name3 = mEngine->getBindingName(3);
    
    	printf("name0=%s\nname1=%s\nname2=%s\nname3=%s\n", name0, name1, name2, name3);
    	// Get the input size
    	auto input_idx = mEngine->getBindingIndex("input");
     	if (input_idx == -1)
    	{
    		return false;
    	}
    	assert(mEngine->getBindingDataType(input_idx) == nvinfer1::DataType::kFLOAT);
    	auto input_dims = context->getBindingDimensions(input_idx);
    	context->setBindingDimensions(input_idx, input_dims);
    	auto input_size = getMemorySize(input_dims, sizeof(float_t));
    
    	// Get the output sizes; device memory must be allocated for every output
    	auto output1_idx = mEngine->getBindingIndex("output1");
    	if (output1_idx == -1)
    	{
    		return false;
    	}
    	assert(mEngine->getBindingDataType(output1_idx) == nvinfer1::DataType::kFLOAT);
    	auto output1_dims = context->getBindingDimensions(output1_idx);
    	auto output1_size = getMemorySize(output1_dims, sizeof(float_t));
    
    	auto output2_idx = mEngine->getBindingIndex("output2");
    	if (output2_idx == -1)
    	{
    		return false;
    	}
    	assert(mEngine->getBindingDataType(output2_idx) == nvinfer1::DataType::kFLOAT);
    	auto output2_dims = context->getBindingDimensions(output2_idx);
    	auto output2_size = getMemorySize(output2_dims, sizeof(float_t));
    
    	auto output3_idx = mEngine->getBindingIndex("output3");
    	if (output3_idx == -1)
    	{
    		return false;
    	}
    	assert(mEngine->getBindingDataType(output3_idx) == nvinfer1::DataType::kFLOAT);
    	auto output3_dims = context->getBindingDimensions(output3_idx);
    	auto output3_size = getMemorySize(output3_dims, sizeof(float_t));
    
    	// Prepare for inference
    	// Allocate CUDA memory
    	void* input_mem{ nullptr };
    	if (cudaMalloc(&input_mem, input_size) != cudaSuccess)
    	{
    		gLogError << "ERROR: input cuda memory allocation failed, size = " << input_size << " bytes" << std::endl;
    		return false;
    	}
    
    	void* output1_mem{ nullptr };
    	if (cudaMalloc(&output1_mem, output1_size) != cudaSuccess)
    	{
    		gLogError << "ERROR: output cuda memory allocation failed, size = " << output1_size << " bytes" << std::endl;
    		return false;
    	}
    	void* output2_mem{ nullptr };
    	if (cudaMalloc(&output2_mem, output2_size) != cudaSuccess)
    	{
    		gLogError << "ERROR: output cuda memory allocation failed, size = " << output2_size << " bytes" << std::endl;
    		return false;
    	}
    	void* output3_mem{ nullptr };
    	if (cudaMalloc(&output3_mem, output3_size) != cudaSuccess)
    	{
    		gLogError << "ERROR: output cuda memory allocation failed, size = " << output3_size << " bytes" << std::endl;
    		return false;
    	}
    
    	// Copy the input data to the device
    	cudaMemcpy(input_mem, input_buffer, input_size, cudaMemcpyHostToDevice); // cudaMemcpyHostToDevice: host to device, i.e. system memory to GPU memory
    
    	// Bind the input/output device buffers and submit them together for inference
    	void* bindings[4];
    	bindings[input_idx] = input_mem;
    	bindings[output1_idx] = output1_mem;
    	bindings[output2_idx] = output2_mem;
    	bindings[output3_idx] = output3_mem;
    
    	// Run inference
    	bool status = context->executeV2(bindings);
    
    	if (!status)
    	{
    		gLogError << "ERROR: inference failed" << std::endl;
    		cudaFree(input_mem);
    		cudaFree(output1_mem);
    		cudaFree(output2_mem);
    		cudaFree(output3_mem);
    		return false;
    	}
    
    	// Retrieve the results
    	float* output3_buffer = new float[dataSize];
    	cudaMemcpy(output3_buffer, output3_mem, output3_size, cudaMemcpyDeviceToHost);
    
    	// Free the CUDA memory
    	cudaFree(input_mem);
    	cudaFree(output1_mem);
    	cudaFree(output2_mem);
    	cudaFree(output3_mem);
    
    	cudaError_t err = cudaGetLastError();
    	if (err != cudaSuccess) {
    		gLogError << "ERROR: failed to free CUDA memory: " << cudaGetErrorString(err) << std::endl;
    		return false;
    	}
    
    	// save the results
    
    
    
    	delete[] output3_buffer;
    	output3_buffer = nullptr;
    
    	return true;
    }
    int main()
    {
    	int batchsize = 1;
    	int channel = 3;
    	int width = 256;
    	int height = 256;
    	size_t dataSize = width * height*channel*batchsize;
    	int tensorSize[4] = { batchsize, channel, width, height };
    	float* input_buffer = new float[dataSize];
    	for (int i = 0; i < dataSize; i++)
    		input_buffer[i] = 0.1;
    
    	inferDemo(input_buffer, tensorSize);
    
    	delete[] input_buffer;
    	input_buffer = nullptr;
    
    	system("pause");
    	return 0;
    }
    

    Possible Issues

    cuDNN library errors during TRT conversion:

    Install the correct cuDNN version, and also copy the DLLs from the TensorRT lib folder into CUDA's bin folder.

    Missing zlibwapi.dll during TRT conversion:
    Could not locate zlibwapi.dll. Please make sure it is in your library path!

    Solutions found online include downloading it from the NVIDIA website (though it seems to have been taken down since 2023) or building it from source.
    zlib source: https://github.com/madler/zlib
    In my case, I searched my machine for an existing zlibwapi.dll (found under the installed PyTorch path, DingTalk, and Origin) and copied it into CUDA's bin folder, which solved the problem.

    Size errors during TensorRT inference:
    If the model was exported with dynamic shapes, you need to set the actual input shape yourself before running inference.
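
    With the Python API, for example, the actual shape can be set on the execution context before the buffers are bound. A minimal sketch; the binding index 0 and the shape are assumptions:

    # Assumes `context` was created from an engine built with dynamic shapes.
    context.set_binding_shape(0, (1, 3, 416, 416))  # binding 0 is the input; the shape is illustrative
    assert context.all_binding_shapes_specified     # all dynamic inputs must be set before inference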


    Author: 痛&快乐着

