Jetson Series: A First Look at YOLOv5 Inference with TensorRT and Python (Part 1)

Contents

1. ONNX Model Export

2. Local Serialization of the TensorRT Engine

3. Overall Pipeline Architecture

4. Overall Pipeline Implementation


1. ONNX Model Export

        Before accelerating with TensorRT, you first need to convert your PyTorch model to ONNX format. This step is routine, so it is not covered in detail here; choose the batch size, inference size, and opset that suit your own task and deployment device.

Official YOLOv5 ONNX export script:

Example:
```python
from pathlib import Path

import torch

from export import export_onnx  # export_onnx is defined in the yolov5 repo's export.py
from models.experimental import attempt_load
from utils.torch_utils import select_device

# Load model
weights = 'yolov5s.pt'
device = select_device('')
model = attempt_load(weights, map_location=device)

# Example input tensor
im = torch.zeros(1, 3, 640, 640).to(device)

# Export model
file_path = Path('yolov5s.onnx')
export_onnx(model, im, file_path, opset=12, dynamic=True, simplify=True)
```
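Before handing the file to TensorRT, it can be worth a quick sanity check that the export actually runs. A minimal sketch, assuming onnxruntime is installed (file name as above):

```python
import numpy as np
import onnxruntime as ort

# run one dummy inference on CPU to confirm the exported graph is valid
session = ort.InferenceSession('yolov5s.onnx', providers=['CPUExecutionProvider'])
dummy = np.zeros((1, 3, 640, 640), dtype=np.float32)
outputs = session.run(None, {session.get_inputs()[0].name: dummy})
print([o.shape for o in outputs])  # e.g. (1, 25200, 85) for yolov5s at 640x640
```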

2. Local Serialization of the TensorRT Engine

        1. If TensorRT is installed but you do not know where the trtexec binary lives, the following command will find it:

root@ubuntu:/datas/xk/02code/yolov5-tensorrt-python/weights# find / -name trtexec
find: ‘/proc/15048’: No such file or directory
/usr/src/tensorrt/bin/trtexec
/usr/src/tensorrt/samples/trtexec

        2. Use the trtexec tool to serialize the model into a binary file that the TensorRT framework can load. The file extension is arbitrary; *.trt and *.engine are the common choices. Below, FP32 and FP16 engines are serialized locally (note in the transcript that the correct flag is --fp16, not --f16; the mistyped flag is what made the first FP16 attempt fail):

root@ubuntu:/datas/xk/02code/yolov5-tensorrt-python/weights# /usr/src/tensorrt/bin/trtexec --onnx=yolov5s.onnx  --saveEngine=yolov5s_fp32.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s_fp32.engine
[03/06/2025-19:08:36] [I] === Model Options ===
[03/06/2025-19:08:36] [I] Format: ONNX
[03/06/2025-19:08:36] [I] Model: yolov5s.onnx
[03/06/2025-19:08:36] [I] Output:
[03/06/2025-19:08:36] [I] === Build Options ===
[03/06/2025-19:08:36] [I] Max batch: explicit batch
[03/06/2025-19:08:36] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[03/06/2025-19:08:36] [I] minTiming: 1
[03/06/2025-19:08:36] [I] avgTiming: 8
[03/06/2025-19:08:36] [I] Precision: FP32
&&&& FAILED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s_fp16.engine --f16
root@ubuntu:/datas/xk/02code/yolov5-tensorrt-python/weights# /usr/src/tensorrt/bin/trtexec --onnx=yolov5s.onnx  --saveEngine=yolov5s_fp16.engine --fp16
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s_fp16.engine --fp16
[03/06/2025-19:10:51] [I] === Model Options ===
[03/06/2025-19:10:51] [I] Format: ONNX
[03/06/2025-19:10:51] [I] Model: yolov5s.onnx
[03/06/2025-19:10:51] [I] Output:
[03/06/2025-19:10:51] [I] === Build Options ===
[03/06/2025-19:10:51] [I] Max batch: explicit batch
[03/06/2025-19:10:51] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[03/06/2025-19:10:51] [I] minTiming: 1
[03/06/2025-19:10:51] [I] avgTiming: 8
[03/06/2025-19:10:51] [I] Precision: FP32+FP16

         3. A successful engine serialization ends with a performance summary and a PASSED line:

[03/06/2025-19:28:20] [I] === Performance summary ===
[03/06/2025-19:28:20] [I] Throughput: 264.685 qps
[03/06/2025-19:28:20] [I] Latency: min = 4.6189 ms, max = 5.85681 ms, mean = 5.09798 ms, median = 4.97119 ms, percentile(90%) = 5.7226 ms, percentile(95%) = 5.73151 ms, percentile(99%) = 5.75586 ms
[03/06/2025-19:28:20] [I] Enqueue Time: min = 0.671509 ms, max = 1.82977 ms, mean = 1.17656 ms, median = 1.27228 ms, percentile(90%) = 1.70499 ms, percentile(95%) = 1.74487 ms, percentile(99%) = 1.80884 ms
[03/06/2025-19:28:20] [I] H2D Latency: min = 0.203186 ms, max = 0.252548 ms, mean = 0.221109 ms, median = 0.21875 ms, percentile(90%) = 0.237503 ms, percentile(95%) = 0.239532 ms, percentile(99%) = 0.244751 ms
[03/06/2025-19:28:20] [I] GPU Compute Time: min = 3.6452 ms, max = 4.22778 ms, mean = 3.77244 ms, median = 3.70071 ms, percentile(90%) = 4.13855 ms, percentile(95%) = 4.14554 ms, percentile(99%) = 4.17288 ms
[03/06/2025-19:28:20] [I] D2H Latency: min = 0.664551 ms, max = 1.42108 ms, mean = 1.10443 ms, median = 1.05444 ms, percentile(90%) = 1.34631 ms, percentile(95%) = 1.35052 ms, percentile(99%) = 1.35684 ms
[03/06/2025-19:28:20] [I] Total Host Walltime: 3.00735 s
[03/06/2025-19:28:20] [I] Total GPU Compute Time: 3.00287 s
[03/06/2025-19:28:20] [W] * GPU compute time is unstable, with coefficient of variance = 4.54448%.
[03/06/2025-19:28:20] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[03/06/2025-19:28:20] [I] Explanations of the performance metrics are printed in the verbose logs.
[03/06/2025-19:28:20] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s_fp16.engine --fp16
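
As an aside, trtexec is essentially a thin wrapper around the builder, so the same serialization can also be done from Python with the TensorRT API. A minimal sketch written against the TensorRT 8.x API shown in the logs above (file names are illustrative):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
# explicit-batch network, as required for ONNX parsing
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open('yolov5s.onnx', 'rb') as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError('failed to parse the ONNX model')

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB workspace
config.set_flag(trt.BuilderFlag.FP16)  # comment out for an FP32 engine

# build and write the serialized engine to disk
engine_bytes = builder.build_serialized_network(network, config)
with open('yolov5s_fp16.engine', 'wb') as f:
    f.write(engine_bytes)
```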

3. Overall Pipeline Architecture

  1. Preprocessing
    1. Aspect-ratio-preserving resize + padding (letterbox)
    2. HWC to CHW
    3. Expand the batch dimension to NCHW
    4. Normalize
  2. Model inference
    1. Load the serialized TensorRT engine binary
    2. Initialize the TensorRT logger
    3. Read the binary file, initialize the TensorRT runtime with the logger, and deserialize the binary into a TensorRT engine
    4. Initialize the TensorRT bindings from the loaded engine's metadata
    5. Create an inference execution context from the engine
    6. Load the input/output data pointers into the bindings and run execute_v2 on the context
    7. Read the inference results back out of the bindings
  3. Postprocessing
    1. memcpy from device to host
    2. NMS (non-maximum suppression)
    3. scale_coords coordinate mapping, i.e. inverting the letterbox transform (see the sketch below)
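
For clarity, step 3.3 simply inverts the letterbox transform applied in preprocessing. Below is a self-contained sketch of that mapping, equivalent in spirit to yolov5's scale_coords, assuming centered padding as in the preprocessing code of section 4 (unletterbox is a name introduced here for illustration):

```python
import numpy as np

def unletterbox(boxes, infer_size, orig_hw):
    # map [x1, y1, x2, y2] boxes from the padded infer_size x infer_size
    # canvas back into the original image's coordinate system
    h, w = orig_hw
    scale = min(infer_size / w, infer_size / h)
    pad_x = (infer_size - w * scale) / 2  # horizontal padding on each side
    pad_y = (infer_size - h * scale) / 2  # vertical padding on each side
    boxes = np.asarray(boxes, dtype=np.float32)
    boxes[:, [0, 2]] = (boxes[:, [0, 2]] - pad_x) / scale
    boxes[:, [1, 3]] = (boxes[:, [1, 3]] - pad_y) / scale
    # clamp to the original image bounds
    boxes[:, [0, 2]] = boxes[:, [0, 2]].clip(0, w)
    boxes[:, [1, 3]] = boxes[:, [1, 3]].clip(0, h)
    return boxes
```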

4. Overall Pipeline Implementation

  1. Preprocessing
     # assumes: import cv2, numpy as np, torch at module level
     def preprocess(self, image):
            h, w, c = image.shape
            scale = min(float(self.infer_size) / w, float(self.infer_size) / h)
            w1 = int(w * scale)
            h1 = int(h * scale)
            image_resize = cv2.resize(image, (w1, h1), interpolation=cv2.INTER_LINEAR)
            # BGR to RGB
            image_rgb = image_resize[:, :, ::-1]
            # paste the resized image centered on a gray canvas (letterbox padding)
            image_pad = np.ones((self.infer_size, self.infer_size, 3), dtype=np.uint8) * 127
            image_pad[(self.infer_size - h1) // 2:(self.infer_size - h1) // 2 + h1,
                      (self.infer_size - w1) // 2:(self.infer_size - w1) // 2 + w1, :] = image_rgb
            # transpose HWC to CHW
            image_trans = np.transpose(image_pad, (2, 0, 1))
            # CHW --> [1, C, H, W]
            image_batch = np.expand_dims(image_trans, axis=0)
            # normalize to [0, 1]
            image_norm = image_batch / 255.0
            # make the array C-contiguous, otherwise downstream calls raise an exception
            image_norm = np.ascontiguousarray(image_norm)
            # move to the inference device as float32
            input_tensor = torch.from_numpy(image_norm).to(self.device).to(torch.float32)

            return input_tensor

  2. Model inference
     # assumes: import numpy as np, torch, tensorrt as trt
     #          from collections import namedtuple, OrderedDict
     def load_trtmodel(self, w):
            print(f'Loading {w} for TensorRT inference...')
            Binding = namedtuple('Binding', ('name', 'dtype', 'shape', 'data', 'ptr'))
            logger = trt.Logger(trt.Logger.INFO)
            with open(w, 'rb') as f, trt.Runtime(logger) as runtime:
                model = runtime.deserialize_cuda_engine(f.read())
            bindings = OrderedDict()
            fp16 = False  # default, updated below
            # note: this assumes a static-shape engine; a dynamic engine would
            # report -1 dims here and need context.set_binding_shape at runtime
            for index in range(model.num_bindings):
                name = model.get_binding_name(index)
                dtype = trt.nptype(model.get_binding_dtype(index))
                shape = tuple(model.get_binding_shape(index))
                # pre-allocate a device buffer for each binding
                data = torch.from_numpy(np.empty(shape, dtype=np.dtype(dtype))).to(self.device)
                bindings[name] = Binding(name, dtype, shape, data, int(data.data_ptr()))
                if model.binding_is_input(index) and dtype == np.float16:
                    fp16 = True
            binding_addrs = OrderedDict((n, d.ptr) for n, d in bindings.items())
            context = model.create_execution_context()
            batch_size = bindings['images'].shape[0]
            self.__dict__.update(locals())  # assign all local variables to self

     def trt_infer(self, im):
            assert im.shape == self.bindings['images'].shape, (im.shape, self.bindings['images'].shape)
            # point the input binding at the tensor produced by preprocess
            self.binding_addrs['images'] = int(im.data_ptr())
            self.context.execute_v2(list(self.binding_addrs.values()))
            # the output buffer was pre-allocated in load_trtmodel
            y = self.bindings['output0'].data
            return y

  3. Postprocessing
     # assumes: from utils.general import non_max_suppression, scale_coords (yolov5 repo)
     def postprocess(self, out, infer_size, orin_size):
            # clone so NMS does not modify the TensorRT output buffer in place
            out = out.clone().detach()
            # NMS
            pred = non_max_suppression(out, self.conf_thresh, self.iou_thresh, self.classes, self.agnostic_nms, self.max_det)
            # map coordinates back to the original image
            res = []
            for i, det in enumerate(pred):
                if len(det):
                    det[:, :4] = scale_coords(infer_size, det[:, :4], orin_size).round()
                    # memcpy from device to host
                    res.append(det.detach().cpu().numpy())

            return res

  4. OSD (drawing the detections)
    def plot_image(out, image, color_cover):
        # out: [N, 6] array of (x1, y1, x2, y2, conf, cls); color_cover maps a
        # class id to a BGR color, e.g. yolov5's utils.plots.colors
        for idx in range(out.shape[0]):
            x, y, x1, y1, conf, cls = out[idx]
            cv2.rectangle(image, (int(x), int(y)), (int(x1), int(y1)), color_cover(int(cls), True), 2)
        return image
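
Putting the pieces together, a hypothetical usage sketch. YoloV5TRT is an assumed wrapper class holding the methods above and the attributes they reference (device, infer_size, the NMS thresholds, and so on); file names are illustrative:

```python
import cv2
from utils.plots import colors  # yolov5's class-id -> BGR color helper

detector = YoloV5TRT()  # hypothetical constructor
detector.load_trtmodel('weights/yolov5s_fp16.engine')

image = cv2.imread('bus.jpg')
input_tensor = detector.preprocess(image)       # letterbox + NCHW float32
out = detector.trt_infer(input_tensor)          # raw predictions, e.g. [1, 25200, 85]
dets = detector.postprocess(out, input_tensor.shape[2:], image.shape[:2])
for det in dets:                                # one array per image in the batch
    image = plot_image(det, image, colors)
cv2.imwrite('result.jpg', image)
```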

 

Author: weixin_55083979
