Jetson Series: A First Look at YOLOv5 Inference with TensorRT and Python (Part 1)
Contents
1. ONNX Model Export
2. Local Serialization of the TensorRT Model
3. Overall Pipeline Architecture
4. Overall Pipeline Implementation
1. ONNX Model Export
Before accelerating with TensorRT, you need to convert your torch model to ONNX format. This step is basic, so I won't dwell on it; choose the batch size, inference size, and opset that suit your task and deployment device.
The example below is based on yolov5's official ONNX export script.
Example:
```python
from pathlib import Path

import torch

from models.experimental import attempt_load
from utils.torch_utils import select_device
from export import export_onnx  # yolov5's export.py; run this from the repo root

# Load model
weights = 'yolov5s.pt'
device = select_device('')
model = attempt_load(weights, map_location=device)

# Example input tensor
im = torch.zeros(1, 3, 640, 640).to(device)

# Export model
file_path = Path('yolov5s.onnx')
export_onnx(model, im, file_path, opset=12, dynamic=True, simplify=True)
```
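Equivalently, yolov5's own command-line interface performs the same export in one step: `python export.py --weights yolov5s.pt --include onnx --opset 12 --dynamic --simplify` (run from the repo root).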
2. Local Serialization of the TensorRT Model
1. If TensorRT is installed but you don't know where the trtexec binary lives, the following command will find it:
```
root@ubuntu:/datas/xk/02code/yolov5-tensorrt-python/weights# find / -name trtexec
find: ‘/proc/15048’: No such file or directory
/usr/src/tensorrt/bin/trtexec
/usr/src/tensorrt/samples/trtexec
```
2. Use the trtexec tool to serialize the model into a binary file that the TensorRT framework can load. The file extension is arbitrary; *.trt and *.engine are the common choices. Below, the engine is serialized locally in both FP32 and FP16.
```
root@ubuntu:/datas/xk/02code/yolov5-tensorrt-python/weights# /usr/src/tensorrt/bin/trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s_fp32.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s_fp32.engine
[03/06/2025-19:08:36] [I] === Model Options ===
[03/06/2025-19:08:36] [I] Format: ONNX
[03/06/2025-19:08:36] [I] Model: yolov5s.onnx
[03/06/2025-19:08:36] [I] Output:
[03/06/2025-19:08:36] [I] === Build Options ===
[03/06/2025-19:08:36] [I] Max batch: explicit batch
[03/06/2025-19:08:36] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[03/06/2025-19:08:36] [I] minTiming: 1
[03/06/2025-19:08:36] [I] avgTiming: 8
[03/06/2025-19:08:36] [I] Precision: FP32
```
The first FP16 attempt fails immediately, apparently because the precision flag was mistyped as --f16 (the correct flag is --fp16):
```
&&&& FAILED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s_fp16.engine --f16
```
With the corrected flag the build runs:
```
root@ubuntu:/datas/xk/02code/yolov5-tensorrt-python/weights# /usr/src/tensorrt/bin/trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s_fp16.engine --fp16
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s_fp16.engine --fp16
[03/06/2025-19:10:51] [I] === Model Options ===
[03/06/2025-19:10:51] [I] Format: ONNX
[03/06/2025-19:10:51] [I] Model: yolov5s.onnx
[03/06/2025-19:10:51] [I] Output:
[03/06/2025-19:10:51] [I] === Build Options ===
[03/06/2025-19:10:51] [I] Max batch: explicit batch
[03/06/2025-19:10:51] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[03/06/2025-19:10:51] [I] minTiming: 1
[03/06/2025-19:10:51] [I] avgTiming: 8
[03/06/2025-19:10:51] [I] Precision: FP32+FP16
```
3. A successful engine serialization ends with a performance summary followed by PASSED:
```
[03/06/2025-19:28:20] [I] === Performance summary ===
[03/06/2025-19:28:20] [I] Throughput: 264.685 qps
[03/06/2025-19:28:20] [I] Latency: min = 4.6189 ms, max = 5.85681 ms, mean = 5.09798 ms, median = 4.97119 ms, percentile(90%) = 5.7226 ms, percentile(95%) = 5.73151 ms, percentile(99%) = 5.75586 ms
[03/06/2025-19:28:20] [I] Enqueue Time: min = 0.671509 ms, max = 1.82977 ms, mean = 1.17656 ms, median = 1.27228 ms, percentile(90%) = 1.70499 ms, percentile(95%) = 1.74487 ms, percentile(99%) = 1.80884 ms
[03/06/2025-19:28:20] [I] H2D Latency: min = 0.203186 ms, max = 0.252548 ms, mean = 0.221109 ms, median = 0.21875 ms, percentile(90%) = 0.237503 ms, percentile(95%) = 0.239532 ms, percentile(99%) = 0.244751 ms
[03/06/2025-19:28:20] [I] GPU Compute Time: min = 3.6452 ms, max = 4.22778 ms, mean = 3.77244 ms, median = 3.70071 ms, percentile(90%) = 4.13855 ms, percentile(95%) = 4.14554 ms, percentile(99%) = 4.17288 ms
[03/06/2025-19:28:20] [I] D2H Latency: min = 0.664551 ms, max = 1.42108 ms, mean = 1.10443 ms, median = 1.05444 ms, percentile(90%) = 1.34631 ms, percentile(95%) = 1.35052 ms, percentile(99%) = 1.35684 ms
[03/06/2025-19:28:20] [I] Total Host Walltime: 3.00735 s
[03/06/2025-19:28:20] [I] Total GPU Compute Time: 3.00287 s
[03/06/2025-19:28:20] [W] * GPU compute time is unstable, with coefficient of variance = 4.54448%.
[03/06/2025-19:28:20] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[03/06/2025-19:28:20] [I] Explanations of the performance metrics are printed in the verbose logs.
[03/06/2025-19:28:20] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s_fp16.engine --fp16
```
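Before wiring the engine into the full pipeline, a minimal sketch like the following (assuming the TensorRT 8.x Python bindings and the yolov5s_fp16.engine file produced above) can confirm that the file deserializes and show every binding's name, direction, dtype, and shape:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
# Deserialize the engine that trtexec just wrote to disk
with open('yolov5s_fp16.engine', 'rb') as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# List every binding: name, direction, dtype, and shape
for i in range(engine.num_bindings):
    print(engine.get_binding_name(i),
          'input' if engine.binding_is_input(i) else 'output',
          engine.get_binding_dtype(i),
          engine.get_binding_shape(i))
```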
3. Overall Pipeline Architecture
- Preprocessing
  - Aspect-ratio-preserving resize + padding (letterbox)
  - HWC to CHW
  - Expand the batch dimension to get NCHW
  - Normalize
- Inference
  - Load the serialized TensorRT binary
    - Initialize the TensorRT logger
    - Read the binary file, initialize a TensorRT runtime with the logger, and deserialize the binary into a TensorRT engine
  - Initialize the TensorRT bindings from the loaded engine's metadata
  - Create an inference execution context from the loaded engine
  - Load the input/output data pointers into the bindings and run execute_v2 on the context
  - Read the inference results back from the bindings
- Postprocessing
  - memcpy from device to host
  - NMS (non-maximum suppression)
  - scale_coords coordinate transform (warpAffine; see the sketch after this list)
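The coordinate transform behind scale_coords is simply the inverse of the letterbox affine mapping. As a standalone sketch (the function name `scale_coords_affine` is mine, and it assumes the centered letterbox used by the preprocessing code in the next section):

```python
import numpy as np

def scale_coords_affine(infer_size, coords, orin_size):
    """Map boxes from letterboxed network-input coordinates back to the original image.

    infer_size: (h, w) of the network input; orin_size: (h, w) of the original frame.
    coords: array of [x1, y1, x2, y2] rows in network-input coordinates.
    """
    gain = min(infer_size[0] / orin_size[0], infer_size[1] / orin_size[1])
    pad_x = (infer_size[1] - orin_size[1] * gain) / 2  # horizontal padding
    pad_y = (infer_size[0] - orin_size[0] * gain) / 2  # vertical padding
    coords = coords.copy().astype(np.float32)
    coords[:, [0, 2]] = (coords[:, [0, 2]] - pad_x) / gain  # undo pad, undo scale (x)
    coords[:, [1, 3]] = (coords[:, [1, 3]] - pad_y) / gain  # undo pad, undo scale (y)
    # clip to the original image bounds
    coords[:, [0, 2]] = coords[:, [0, 2]].clip(0, orin_size[1])
    coords[:, [1, 3]] = coords[:, [1, 3]].clip(0, orin_size[0])
    return coords
```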
4. Overall Pipeline Implementation
- Preprocessing
```python
# requires: import cv2; import numpy as np; import torch
def prepocess(self, image):
    h, w, c = image.shape
    # aspect-ratio-preserving scale factor
    scale = min(float(self.infer_size) / w, float(self.infer_size) / h)
    w1 = int(w * scale)
    h1 = int(h * scale)
    # note: interpolation must be passed by keyword; as the third positional
    # argument it would be interpreted as cv2.resize's `dst` parameter
    image_resize = cv2.resize(image, (w1, h1), interpolation=cv2.INTER_LINEAR)
    # BGR -> RGB
    image_rgb = image_resize[:, :, ::-1]
    # letterbox: paste the resized image centered on a gray canvas
    image_pad = np.ones((self.infer_size, self.infer_size, 3), dtype=np.uint8) * 127
    image_pad[(self.infer_size - h1) // 2:(self.infer_size - h1) // 2 + h1,
              (self.infer_size - w1) // 2:(self.infer_size - w1) // 2 + w1, :] = image_rgb
    # transpose HWC -> CHW
    image_trans = np.transpose(image_pad, (2, 0, 1))
    # CHW -> [1, C, H, W]
    image_batch = np.expand_dims(image_trans, axis=0)
    # normalize
    image_norm = image_batch / 255.0
    # make the array contiguous in memory, otherwise downstream calls raise errors
    image_norm = np.ascontiguousarray(image_norm)
    input_tensor = torch.from_numpy(image_norm).to(self.device).to(torch.float32)
    return input_tensor
```
- Inference
```python
# requires: import tensorrt as trt; import numpy as np; import torch;
#           from collections import namedtuple, OrderedDict
def load_trtmodel(self, w):
    print(f'Loading {w} for TensorRT inference...')
    Binding = namedtuple('Binding', ('name', 'dtype', 'shape', 'data', 'ptr'))
    logger = trt.Logger(trt.Logger.INFO)
    # deserialize the engine file written by trtexec
    with open(w, 'rb') as f, trt.Runtime(logger) as runtime:
        model = runtime.deserialize_cuda_engine(f.read())
    bindings = OrderedDict()
    fp16 = False  # default, updated below
    for index in range(model.num_bindings):
        name = model.get_binding_name(index)
        dtype = trt.nptype(model.get_binding_dtype(index))
        shape = tuple(model.get_binding_shape(index))
        # pre-allocate a device tensor for every binding (inputs and outputs)
        data = torch.from_numpy(np.empty(shape, dtype=np.dtype(dtype))).to(self.device)
        bindings[name] = Binding(name, dtype, shape, data, int(data.data_ptr()))
        if model.binding_is_input(index) and dtype == np.float16:
            fp16 = True
    binding_addrs = OrderedDict((n, d.ptr) for n, d in bindings.items())
    context = model.create_execution_context()
    batch_size = bindings['images'].shape[0]
    self.__dict__.update(locals())  # assign all local variables to self

def trt_infer(self, im):
    # the input shape must match the static engine input exactly
    assert im.shape == self.bindings['images'].shape, (im.shape, self.bindings['images'].shape)
    self.binding_addrs['images'] = int(im.data_ptr())
    self.context.execute_v2(list(self.binding_addrs.values()))
    y = self.bindings['output0'].data
    return y
```
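One detail worth calling out: `execute_v2` takes a flat list of device pointers ordered by binding index, and because `binding_addrs` is an `OrderedDict` populated in that same index order, `list(self.binding_addrs.values())` hands the pointers over in exactly the order TensorRT expects.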
- Post-processing
```python
# requires: non_max_suppression and scale_coords from yolov5's utils
def postprepocess(self, out, infer_size, orin_size):
    # detach from the graph; the actual device-to-host copy happens in .cpu() below
    out = out.clone().detach()
    # NMS
    pred = non_max_suppression(out, self.conf_thresh, self.iou_thresh,
                               self.classes, self.agnostic_nms, max_det=self.max_det)
    # map boxes from network-input coordinates back to the original image
    res = []
    for i, det in enumerate(pred):
        if len(det):
            det[:, :4] = scale_coords(infer_size, det[:, :4], orin_size).round()
            res.append(det.detach().cpu().numpy())
    return res
```
- OSD
```python
def plot_image(out, image, color_cover):
    # draw one rectangle per detection row: [x1, y1, x2, y2, conf, cls]
    for idx in range(out.shape[0]):
        x, y, x1, y1, conf, cls = out[idx]
        cv2.rectangle(image, (int(x), int(y)), (int(x1), int(y1)),
                      color_cover(int(cls), True), 2)
    return image
```
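To tie the pieces together, here is a minimal end-to-end sketch. The wrapper class name `YoloV5TRT` and its constructor are assumptions (the original post never shows the class definition); `colors` is yolov5's per-class palette from `utils.plots`, which matches the `color_cover(int(cls), True)` call signature above:

```python
import cv2
from utils.plots import colors  # yolov5's per-class color palette

detector = YoloV5TRT()  # hypothetical wrapper exposing the methods above
detector.load_trtmodel('yolov5s_fp16.engine')

image = cv2.imread('test.jpg')                 # original BGR frame
input_tensor = detector.prepocess(image)       # letterbox + NCHW + normalize
if detector.fp16:                              # flag set by load_trtmodel
    input_tensor = input_tensor.half()         # fp16 engines expect half inputs
out = detector.trt_infer(input_tensor)         # raw predictions from the engine
dets = detector.postprepocess(out, input_tensor.shape[2:], image.shape[:2])
for det in dets:                               # one array of boxes per image
    image = plot_image(det, image, colors)
cv2.imwrite('result.jpg', image)
```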
Author: weixin_55083979