Jetson Series: A First Look at YOLOv5 Inference with TensorRT and Python (Part 1)

Contents

1. ONNX Model Export

2. Local Serialization of the TensorRT Engine

3. Overall Pipeline Architecture

4. Overall Pipeline Implementation


1. ONNX Model Export

        Before accelerating with TensorRT, you first need to convert your PyTorch model to ONNX format. This step is routine, so it is not covered in detail here; choose the batch size, inference size, and opset that suit your own task and deployment device.

Official YOLOv5 ONNX export script:

Example:
```python
from pathlib import Path

import torch

from export import export_onnx  # export_onnx is defined in the yolov5 repo's export.py
from models.experimental import attempt_load
from utils.torch_utils import select_device

# Load model
weights = 'yolov5s.pt'
device = select_device('')
model = attempt_load(weights, map_location=device)

# Example input tensor
im = torch.zeros(1, 3, 640, 640).to(device)

# Export model
file_path = Path('yolov5s.onnx')
export_onnx(model, im, file_path, opset=12, dynamic=True, simplify=True)
```
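Before handing the file to TensorRT, it can be worth a quick sanity check that the export actually runs. A minimal sketch, assuming onnxruntime is installed (file name as above):

```python
import numpy as np
import onnxruntime as ort

# run one dummy inference on CPU to confirm the exported graph is valid
session = ort.InferenceSession('yolov5s.onnx', providers=['CPUExecutionProvider'])
dummy = np.zeros((1, 3, 640, 640), dtype=np.float32)
outputs = session.run(None, {session.get_inputs()[0].name: dummy})
print([o.shape for o in outputs])  # e.g. (1, 25200, 85) for yolov5s at 640x640
```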

2. Local Serialization of the TensorRT Engine

        1. If TensorRT is installed but you do not know where the trtexec binary lives, the following command will find it:

root@ubuntu:/datas/xk/02code/yolov5-tensorrt-python/weights# find / -name trtexec
find: ‘/proc/15048’: No such file or directory
/usr/src/tensorrt/bin/trtexec
/usr/src/tensorrt/samples/trtexec

        2. Use the trtexec tool to serialize the model into a binary file that the TensorRT framework can load. The file extension is arbitrary; *.trt and *.engine are the common choices. Below, FP32 and FP16 engines are serialized locally (note in the transcript that the correct flag is --fp16, not --f16; the mistyped flag is what made the first FP16 attempt fail):

root@ubuntu:/datas/xk/02code/yolov5-tensorrt-python/weights# /usr/src/tensorrt/bin/trtexec --onnx=yolov5s.onnx  --saveEngine=yolov5s_fp32.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s_fp32.engine
[03/06/2025-19:08:36] [I] === Model Options ===
[03/06/2025-19:08:36] [I] Format: ONNX
[03/06/2025-19:08:36] [I] Model: yolov5s.onnx
[03/06/2025-19:08:36] [I] Output:
[03/06/2025-19:08:36] [I] === Build Options ===
[03/06/2025-19:08:36] [I] Max batch: explicit batch
[03/06/2025-19:08:36] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[03/06/2025-19:08:36] [I] minTiming: 1
[03/06/2025-19:08:36] [I] avgTiming: 8
[03/06/2025-19:08:36] [I] Precision: FP32
&&&& FAILED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s_fp16.engine --f16
root@ubuntu:/datas/xk/02code/yolov5-tensorrt-python/weights# /usr/src/tensorrt/bin/trtexec --onnx=yolov5s.onnx  --saveEngine=yolov5s_fp16.engine --fp16
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s_fp16.engine --fp16
[03/06/2025-19:10:51] [I] === Model Options ===
[03/06/2025-19:10:51] [I] Format: ONNX
[03/06/2025-19:10:51] [I] Model: yolov5s.onnx
[03/06/2025-19:10:51] [I] Output:
[03/06/2025-19:10:51] [I] === Build Options ===
[03/06/2025-19:10:51] [I] Max batch: explicit batch
[03/06/2025-19:10:51] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[03/06/2025-19:10:51] [I] minTiming: 1
[03/06/2025-19:10:51] [I] avgTiming: 8
[03/06/2025-19:10:51] [I] Precision: FP32+FP16

         3. A successful engine serialization ends with a performance summary and a PASSED line:

[03/06/2025-19:28:20] [I] === Performance summary ===
[03/06/2025-19:28:20] [I] Throughput: 264.685 qps
[03/06/2025-19:28:20] [I] Latency: min = 4.6189 ms, max = 5.85681 ms, mean = 5.09798 ms, median = 4.97119 ms, percentile(90%) = 5.7226 ms, percentile(95%) = 5.73151 ms, percentile(99%) = 5.75586 ms
[03/06/2025-19:28:20] [I] Enqueue Time: min = 0.671509 ms, max = 1.82977 ms, mean = 1.17656 ms, median = 1.27228 ms, percentile(90%) = 1.70499 ms, percentile(95%) = 1.74487 ms, percentile(99%) = 1.80884 ms
[03/06/2025-19:28:20] [I] H2D Latency: min = 0.203186 ms, max = 0.252548 ms, mean = 0.221109 ms, median = 0.21875 ms, percentile(90%) = 0.237503 ms, percentile(95%) = 0.239532 ms, percentile(99%) = 0.244751 ms
[03/06/2025-19:28:20] [I] GPU Compute Time: min = 3.6452 ms, max = 4.22778 ms, mean = 3.77244 ms, median = 3.70071 ms, percentile(90%) = 4.13855 ms, percentile(95%) = 4.14554 ms, percentile(99%) = 4.17288 ms
[03/06/2025-19:28:20] [I] D2H Latency: min = 0.664551 ms, max = 1.42108 ms, mean = 1.10443 ms, median = 1.05444 ms, percentile(90%) = 1.34631 ms, percentile(95%) = 1.35052 ms, percentile(99%) = 1.35684 ms
[03/06/2025-19:28:20] [I] Total Host Walltime: 3.00735 s
[03/06/2025-19:28:20] [I] Total GPU Compute Time: 3.00287 s
[03/06/2025-19:28:20] [W] * GPU compute time is unstable, with coefficient of variance = 4.54448%.
[03/06/2025-19:28:20] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[03/06/2025-19:28:20] [I] Explanations of the performance metrics are printed in the verbose logs.
[03/06/2025-19:28:20] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s_fp16.engine --fp16
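
As an aside, trtexec is essentially a thin wrapper around the builder, so the same serialization can also be done from Python with the TensorRT API. A minimal sketch written against the TensorRT 8.x API shown in the logs above (file names are illustrative):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
# explicit-batch network, as required for ONNX parsing
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open('yolov5s.onnx', 'rb') as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError('failed to parse the ONNX model')

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB workspace
config.set_flag(trt.BuilderFlag.FP16)  # comment out for an FP32 engine

# build and write the serialized engine to disk
engine_bytes = builder.build_serialized_network(network, config)
with open('yolov5s_fp16.engine', 'wb') as f:
    f.write(engine_bytes)
```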

3. Overall Pipeline Architecture

  1. Preprocessing
    1. Aspect-ratio-preserving resize + padding (letterbox)
    2. HWC to CHW
    3. Expand the batch dimension to NCHW
    4. Normalize
  2. Model inference
    1. Load the serialized TensorRT engine binary
    2. Initialize the TensorRT logger
    3. Read the binary file, initialize the TensorRT runtime with the logger, and deserialize the binary into a TensorRT engine
    4. Initialize the TensorRT bindings from the loaded engine's metadata
    5. Create an inference execution context from the engine
    6. Load the input/output data pointers into the bindings and run execute_v2 on the context
    7. Read the inference results back out of the bindings
  3. Postprocessing
    1. memcpy from device to host
    2. NMS (non-maximum suppression)
    3. scale_coords coordinate mapping, i.e. inverting the letterbox transform (see the sketch below)
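
For clarity, step 3.3 simply inverts the letterbox transform applied in preprocessing. Below is a self-contained sketch of that mapping, equivalent in spirit to yolov5's scale_coords, assuming centered padding as in the preprocessing code of section 4 (unletterbox is a name introduced here for illustration):

```python
import numpy as np

def unletterbox(boxes, infer_size, orig_hw):
    # map [x1, y1, x2, y2] boxes from the padded infer_size x infer_size
    # canvas back into the original image's coordinate system
    h, w = orig_hw
    scale = min(infer_size / w, infer_size / h)
    pad_x = (infer_size - w * scale) / 2  # horizontal padding on each side
    pad_y = (infer_size - h * scale) / 2  # vertical padding on each side
    boxes = np.asarray(boxes, dtype=np.float32)
    boxes[:, [0, 2]] = (boxes[:, [0, 2]] - pad_x) / scale
    boxes[:, [1, 3]] = (boxes[:, [1, 3]] - pad_y) / scale
    # clamp to the original image bounds
    boxes[:, [0, 2]] = boxes[:, [0, 2]].clip(0, w)
    boxes[:, [1, 3]] = boxes[:, [1, 3]].clip(0, h)
    return boxes
```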

4. Overall Pipeline Implementation

  1. Preprocessing
     # assumes: import cv2, numpy as np, torch at module level
     def preprocess(self, image):
            h, w, c = image.shape
            scale = min(float(self.infer_size) / w, float(self.infer_size) / h)
            w1 = int(w * scale)
            h1 = int(h * scale)
            image_resize = cv2.resize(image, (w1, h1), interpolation=cv2.INTER_LINEAR)
            # BGR to RGB
            image_rgb = image_resize[:, :, ::-1]
            # paste the resized image centered on a gray canvas (letterbox padding)
            image_pad = np.ones((self.infer_size, self.infer_size, 3), dtype=np.uint8) * 127
            image_pad[(self.infer_size - h1) // 2:(self.infer_size - h1) // 2 + h1,
                      (self.infer_size - w1) // 2:(self.infer_size - w1) // 2 + w1, :] = image_rgb
            # transpose HWC to CHW
            image_trans = np.transpose(image_pad, (2, 0, 1))
            # CHW --> [1, C, H, W]
            image_batch = np.expand_dims(image_trans, axis=0)
            # normalize to [0, 1]
            image_norm = image_batch / 255.0
            # make the array C-contiguous, otherwise downstream calls raise an exception
            image_norm = np.ascontiguousarray(image_norm)
            # move to the inference device as float32
            input_tensor = torch.from_numpy(image_norm).to(self.device).to(torch.float32)

            return input_tensor

  2. Model inference
     # assumes: import numpy as np, torch, tensorrt as trt
     #          from collections import namedtuple, OrderedDict
     def load_trtmodel(self, w):
            print(f'Loading {w} for TensorRT inference...')
            Binding = namedtuple('Binding', ('name', 'dtype', 'shape', 'data', 'ptr'))
            logger = trt.Logger(trt.Logger.INFO)
            with open(w, 'rb') as f, trt.Runtime(logger) as runtime:
                model = runtime.deserialize_cuda_engine(f.read())
            bindings = OrderedDict()
            fp16 = False  # default, updated below
            # note: this assumes a static-shape engine; a dynamic engine would
            # report -1 dims here and need context.set_binding_shape at runtime
            for index in range(model.num_bindings):
                name = model.get_binding_name(index)
                dtype = trt.nptype(model.get_binding_dtype(index))
                shape = tuple(model.get_binding_shape(index))
                # pre-allocate a device buffer for each binding
                data = torch.from_numpy(np.empty(shape, dtype=np.dtype(dtype))).to(self.device)
                bindings[name] = Binding(name, dtype, shape, data, int(data.data_ptr()))
                if model.binding_is_input(index) and dtype == np.float16:
                    fp16 = True
            binding_addrs = OrderedDict((n, d.ptr) for n, d in bindings.items())
            context = model.create_execution_context()
            batch_size = bindings['images'].shape[0]
            self.__dict__.update(locals())  # assign all local variables to self

     def trt_infer(self, im):
            assert im.shape == self.bindings['images'].shape, (im.shape, self.bindings['images'].shape)
            # point the input binding at the tensor produced by preprocess
            self.binding_addrs['images'] = int(im.data_ptr())
            self.context.execute_v2(list(self.binding_addrs.values()))
            # the output buffer was pre-allocated in load_trtmodel
            y = self.bindings['output0'].data
            return y

  3. Postprocessing
     # assumes: from utils.general import non_max_suppression, scale_coords (yolov5 repo)
     def postprocess(self, out, infer_size, orin_size):
            # clone so NMS does not modify the TensorRT output buffer in place
            out = out.clone().detach()
            # NMS
            pred = non_max_suppression(out, self.conf_thresh, self.iou_thresh, self.classes, self.agnostic_nms, self.max_det)
            # map coordinates back to the original image
            res = []
            for i, det in enumerate(pred):
                if len(det):
                    det[:, :4] = scale_coords(infer_size, det[:, :4], orin_size).round()
                    # memcpy from device to host
                    res.append(det.detach().cpu().numpy())

            return res

  4. OSD (drawing the detections)
    def plot_image(out, image, color_cover):
        # out: [N, 6] array of (x1, y1, x2, y2, conf, cls); color_cover maps a
        # class id to a BGR color, e.g. yolov5's utils.plots.colors
        for idx in range(out.shape[0]):
            x, y, x1, y1, conf, cls = out[idx]
            cv2.rectangle(image, (int(x), int(y)), (int(x1), int(y1)), color_cover(int(cls), True), 2)
        return image
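
Putting the pieces together, a hypothetical usage sketch. YoloV5TRT is an assumed wrapper class holding the methods above and the attributes they reference (device, infer_size, the NMS thresholds, and so on); file names are illustrative:

```python
import cv2
from utils.plots import colors  # yolov5's class-id -> BGR color helper

detector = YoloV5TRT()  # hypothetical constructor
detector.load_trtmodel('weights/yolov5s_fp16.engine')

image = cv2.imread('bus.jpg')
input_tensor = detector.preprocess(image)       # letterbox + NCHW float32
out = detector.trt_infer(input_tensor)          # raw predictions, e.g. [1, 25200, 85]
dets = detector.postprocess(out, input_tensor.shape[2:], image.shape[:2])
for det in dets:                                # one array per image in the batch
    image = plot_image(det, image, colors)
cv2.imwrite('result.jpg', image)
```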

 

Author: weixin_55083979
