An In-Depth Guide to Deploying YOLOv8 on RK3588 (Part 1): Converting YOLOv8 and YOLOv8-pose to ONNX, with Python Post-Processing in Practice

Table of Contents

Preface

1. Official code used [1][2][3]

2. Exporting the model to ONNX with the code from [2]

3. Comparing the ONNX model structures exported by [1] and [2]

4. Post-processing code: references and Python implementation

4.1 Reference post-processing code

4.2 Python implementation of the post-processing

4.2.1 Directory structure

4.2.2 Helper script ops_rk3588.py

4.2.3 YOLOv8 post-processing file yolov8_det_rk3588.py

4.2.4 YOLOv8-pose post-processing file yolov8_pose_rk3588.py

5. Displaying Python inference results

5.1 YOLOv8 results

5.2 YOLOv8-pose results

6. Summary


Preface

For various reasons, the original YOLOv8-family models are difficult to deploy on the RK3588: when converting from .pt to .onnx, the post-processing layers (chiefly the DFL layer) have to be removed. The post-processing therefore has to be implemented by hand.

Building on the official Rockchip source code (the Python export code and the C++ deployment code), this article walks through exporting YOLOv8 to the .onnx format and implements a Python version of the post-processing.

Both YOLOv8 (object detection) and YOLOv8-pose (keypoint detection) are covered: model export and post-processing code for each.


1. Official code used [1][2][3]

[1] Training code, the official YOLOv8/YOLOv11 repository: https://github.com/ultralytics/ultralytics

Version used: main branch, 8.3.67

[2] Export code, airockchip's official export repository: https://github.com/airockchip/ultralytics_yolo11

Version used: main branch, latest as of 2025-01-20

[3] Deployment code, airockchip's official C++ inference code: https://github.com/airockchip/rknn_model_zoo

Version used: main branch, V2.1


2. Exporting the model to ONNX with the code from [2]

The directory layout and export code for [2] are shown below (set the model's input image size to whatever your application needs; it must be a multiple of 32, since the model downsamples by 8/16/32):

from ultralytics import YOLO


model = YOLO(r'det_hand_s_250110.pt')

model.export(format="rknn",     # (str) format to export to, choices at https://docs.ultralytics.com/modes/export/#export-formats
             imgsz=[448, 800],  # (int | list) input images size as int for train and val modes, or list[h,w] for predict and export modes
             opset=11,          # (int, optional) ONNX: opset version
             dynamic=False,     # (bool) ONNX/TF/TensorRT: dynamic axes
             simplify=False,    # (bool) ONNX: simplify model using `onnxslim`
             keras=False,       # (bool) use Keras
             optimize=False,    # (bool) TorchScript: optimize for mobile
             int8=False,        # (bool) CoreML/TF INT8 quantization
             nms=False,         # (bool) CoreML: add NMS
             workspace=4,       # (int) TensorRT: workspace size (GB)
             )

# PyTorch: starting from 'best.pt' with input shape (1, 3, 640, 640) BCHW and output shape(s)
# ((1, 64, 80, 80), (1, 1, 80, 80), (1, 1, 80, 80),
# (1, 64, 40, 40), (1, 1, 40, 40), (1, 1, 40, 40),
# (1, 64, 20, 20), (1, 1, 20, 20), (1, 1, 20, 20)) (21.5 MB)

Detection, keypoints, oriented bounding boxes, and segmentation can all be exported this way (the .yaml is stored inside the model, so the task information is already there). Since I only use detection and keypoints, post-processing and inference code is provided for those two only.
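For example, exporting a pose model looks exactly the same; a minimal sketch (the weight filename here is a placeholder for any YOLOv8-pose .pt trained with [1]):

from ultralytics import YOLO

# Placeholder pose weights; the export call is identical to the detection one above.
model = YOLO(r'yolov8s-pose.pt')
model.export(format="rknn", imgsz=[640, 640], opset=11)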


3. Comparing the ONNX model structures exported by [1] and [2]

[1] concatenates the detection-head outputs at the different resolutions and emits a single tensor:

The output dimension is [1, 4+class_num, 21×input_imgsz_h×input_imgsz_w÷1024],

where 21/1024 = 1/8×1/8 + 1/16×1/16 + 1/32×1/32.
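A quick sanity check of that count for the default 640×640 input (plain arithmetic, nothing model-specific):

# One anchor per grid cell at strides 8/16/32.
h = w = 640
anchors = (h // 8) * (w // 8) + (h // 16) * (w // 16) + (h // 32) * (w // 32)
print(anchors)             # 6400 + 1600 + 400 = 8400
print(21 * h * w // 1024)  # 8400, the same value via the 21/1024 shortcut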

[2] exports 9 output heads in total (only the three stride-8 heads were shown in the original screenshot; the input image there was (384, 640)):

318: [1, 64, input_imgsz_h // 8,  input_imgsz_w // 8]

326: [1, class_num, input_imgsz_h // 8,  input_imgsz_w // 8]

331: [1, 1, input_imgsz_h // 8,  input_imgsz_w // 8]

338: [1, 64, input_imgsz_h // 16,  input_imgsz_w // 16]

346: [1, class_num, input_imgsz_h // 16,  input_imgsz_w // 16]

350: [1, 1, input_imgsz_h // 16,  input_imgsz_w // 16]

357: [1, 64, input_imgsz_h // 32,  input_imgsz_w // 32]

365: [1, class_num, input_imgsz_h // 32,  input_imgsz_w // 32]

369: [1, 1, input_imgsz_h // 32,  input_imgsz_w // 32]
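To confirm that you exported the nine-head layout of [2] rather than the single fused output of [1], you can list the output shapes with onnxruntime. A minimal sketch (the model filename is a placeholder):

import onnxruntime

sess = onnxruntime.InferenceSession("det_hand_s_384x640_250110.onnx", providers=["CPUExecutionProvider"])
for out in sess.get_outputs():
    print(out.name, out.shape)
# For the detection export, expect 9 outputs: 3 scales x (box [1, 64, h, w],
# score [1, class_num, h, w], score_sum [1, 1, h, w]).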


4. Post-processing code: references and Python implementation

4.1 Reference post-processing code

In [3], RK's official C++ post-processing code is written like 💩; I seriously suspect it was handed to an intern, it is that bad. That said, this article only reimplements the inference logic and does not attempt any performance tuning.

Paths of the referenced code:

rknn_model_zoo/examples/yolov8/cpp/postprocess.cc

rknn_model_zoo/examples/yolov8_pose/cpp/postprocess.cc

Referenced functions:

inline static int clamp(…);

static float CalculateOverlap(…);

static int nms(…);  // optimization: a YOLO post-processing trick that reduces the number of NMS computations, comparisons, and memory use

static int nms_pose(…);

void softmax(…);

static float sigmoid(…);

static float unsigmoid(…);

static void compute_dfl(…);

static int process_fp32(…);

static int process_fp32_v8(…);

static int process_fp32_pose(…);

int post_process_det(…);

int post_process_pose(…);

4.2 Python implementation of the post-processing

4.2.1 Directory structure

4.2.2 Helper script ops_rk3588.py

The C++ structs and variables were split apart and streamlined.

import cv2
import numpy as np


def letterbox(im, new_shape=(640, 640), color=(114, 114, 114), auto=True, scalefill=False, scaleup=True, stride=32):
    """
    Resize an image and pad it to fit the target shape.
    :param im: input image.
    :param new_shape: target shape, default (640, 640).
    :param color: padding color, default (114, 114, 114).
    :param auto: pad to the minimal rectangle. If True, the padded width/height become the smallest multiples of stride (e.g. 32), which suits the convolutions.
    :param scalefill: stretch to fill. If auto is False and this is True, the image is stretched (distorted) to the target shape.
    :param scaleup: allow upscaling. If False, the image can only shrink.
    :param stride: stride, default 32.
    :return: the resized image, the (w, h) scale ratios, and the (dw, dh) padding.
    """
    shape = im.shape[:2]  # current shape [height, width]
    if isinstance(new_shape, int):
        new_shape = (new_shape, new_shape)

    r = min(new_shape[0] / shape[0], new_shape[1] / shape[1])  # scale ratio (new / old)
    if not scaleup:  # only scale down, never up (improves val mAP)
        r = min(r, 1.0)

    ratio = r, r  # width and height scale ratios
    new_unpad = int(round(shape[1] * r)), int(round(shape[0] * r))  # new unpadded size (width, height)
    dw, dh = new_shape[1] - new_unpad[0], new_shape[0] - new_unpad[1]  # padding along width and height
    if auto:  # keep the minimal rectangle
        dw, dh = np.mod(dw, stride), np.mod(dh, stride)  # make the padding a multiple of stride
    elif scalefill:  # stretch to fill completely
        dw, dh = 0.0, 0.0  # no padding
        new_unpad = (new_shape[1], new_shape[0])  # the unpadded size is the target size
        ratio = new_shape[1] / shape[1], new_shape[0] / shape[0]  # width and height scale ratios

    dw /= 2  # split the padding between left and right
    dh /= 2  # split the padding between top and bottom

    if shape[::-1] != new_unpad:  # resize if the current shape differs from the unpadded target
        im = cv2.resize(im, new_unpad, interpolation=cv2.INTER_LINEAR)
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))  # top/bottom padding in pixels
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))  # left/right padding in pixels
    im = cv2.copyMakeBorder(im, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)  # pad with the border color

    return im, ratio, (dw, dh)  # resized image, scale ratios, padding


def clamp(val, min_val, max_val):
    return max(min_val, min(val, max_val))


def calculate_overlap(xmin0, ymin0, xmax0, ymax0, xmin1, ymin1, xmax1, ymax1):
    w = max(0.0, min(xmax0, xmax1) - max(xmin0, xmin1) + 1.0)
    h = max(0.0, min(ymax0, ymax1) - max(ymin0, ymin1) + 1.0)
    i = w * h  # intersection area
    u = ((xmax0 - xmin0 + 1.0) * (ymax0 - ymin0 + 1.0) +
         (xmax1 - xmin1 + 1.0) * (ymax1 - ymin1 + 1.0) - i)  # union area

    return 0.0 if u <= 0.0 else (i / u)  # IoU


def nms(validCount, outputLocations, objProbs, classIds, order, filterId, threshold):
    """
    Non-maximum suppression (NMS) with a two-pointer, sliding-window pre-filter.

    Parameters:
        validCount (int): number of valid boxes
        outputLocations (list of float): per-box location (x, y, w, h)
        objProbs (list of float): per-box confidence
        classIds (list of int): per-box class id
        order (list): indices of the valid boxes; the callers pass the identity
            permutation, so objProbs[i] == objProbs[order[i]] while order[i] != -1
        filterId (int): class id to process
        threshold (float): IoU threshold

    Returns:
        int: always 0, matching the C++ code (order is modified in place)
    """
    i, j = 0, 1

    while i < validCount and j < validCount:
        while i < validCount and (order[i] == -1 or classIds[order[i]] != filterId):
            i += 1  # advance to the next valid box
        while j < validCount and (order[j] == -1 or classIds[order[j]] != filterId):
            j += 1  # advance to the next valid box
        if j >= validCount:
            break

        n = order[i]

        while j < validCount and order[j] != -1 and classIds[order[j]] == filterId:
            m = order[j]

            # If the two boxes nearly coincide, drop the lower-confidence one
            if (
                    abs(outputLocations[n][0] - outputLocations[m][0]) < 1.5 and  # x
                    abs(outputLocations[n][1] - outputLocations[m][1]) < 1.5 and  # y
                    abs(outputLocations[n][2] - outputLocations[m][2]) < 2.0 and  # w
                    abs(outputLocations[n][3] - outputLocations[m][3]) < 2.0  # h
            ):
                if objProbs[i] >= objProbs[j]:
                    order[j] = -1  # invalidate the lower-confidence box
                    j += 1
                else:
                    order[i] = -1
                    i = j  # the anchor box lost; move the anchor forward
                    j = i + 1
                    break
            else:
                i = j  # the next box differs from the anchor; move the anchor forward
                j = i + 1
                break

    # standard NMS pass
    for i in range(validCount):
        n = order[i]
        if n == -1 or classIds[n] != filterId:
            continue
        for j in range(i + 1, validCount):
            m = order[j]
            if m == -1 or classIds[m] != filterId:
                continue

            xmin0, ymin0, xmax0, ymax0 = (
                outputLocations[n][0],
                outputLocations[n][1],
                outputLocations[n][0] + outputLocations[n][2],
                outputLocations[n][1] + outputLocations[n][3]
            )
            xmin1, ymin1, xmax1, ymax1 = (
                outputLocations[m][0],
                outputLocations[m][1],
                outputLocations[m][0] + outputLocations[m][2],
                outputLocations[m][1] + outputLocations[m][3]
            )

            # skip if the two boxes do not intersect at all
            if xmin0 > xmax1 or xmax0 < xmin1 or ymin0 > ymax1 or ymax0 < ymin1:
                continue

            iou = calculate_overlap(xmin0, ymin0, xmax0, ymax0, xmin1, ymin1, xmax1, ymax1)

            # suppress by IoU
            if iou > threshold:
                if objProbs[i] >= objProbs[j]:
                    order[j] = -1
                else:
                    order[i] = -1
                    break

    return 0  # return 0 to match the C++ code


def nms_pose(validCount, outputLocations, objProbs, classIds, order, filterId, threshold):
    """
    Non-maximum suppression (NMS) for pose detections.

    Currently identical to nms(); kept as a separate function to mirror the C++ code.

    Parameters:
        validCount (int): number of valid boxes
        outputLocations (list): per-box location (x, y, w, h, keypoints_index)
        objProbs (list): per-box confidence
        classIds (list): per-box class id
        order (list): indices of the valid boxes
        filterId (int): class id to process
        threshold (float): IoU threshold

    Returns:
        int: always 0, matching the C++ code (order is modified in place)
    """

    i, j = 0, 1

    while i < validCount and j < validCount:
        while i < validCount and (order[i] == -1 or classIds[order[i]] != filterId):
            i += 1  # advance to the next valid box
        while j < validCount and (order[j] == -1 or classIds[order[j]] != filterId):
            j += 1  # advance to the next valid box
        if j >= validCount:
            break

        n = order[i]

        while j < validCount and order[j] != -1 and classIds[order[j]] == filterId:
            m = order[j]

            # If the two boxes nearly coincide, drop the lower-confidence one
            if (
                    abs(outputLocations[n][0] - outputLocations[m][0]) < 1.5 and  # x
                    abs(outputLocations[n][1] - outputLocations[m][1]) < 1.5 and  # y
                    abs(outputLocations[n][2] - outputLocations[m][2]) < 2.0 and  # w
                    abs(outputLocations[n][3] - outputLocations[m][3]) < 2.0  # h
            ):
                if objProbs[i] >= objProbs[j]:
                    order[j] = -1  # invalidate the lower-confidence box
                    j += 1
                else:
                    order[i] = -1
                    i = j  # the anchor box lost; move the anchor forward
                    j = i + 1
                    break
            else:
                i = j  # the next box differs from the anchor; move the anchor forward
                j = i + 1
                break

    # standard NMS pass
    for i in range(validCount):
        n = order[i]
        if n == -1 or classIds[n] != filterId:
            continue
        for j in range(i + 1, validCount):
            m = order[j]
            if m == -1 or classIds[m] != filterId:
                continue

            xmin0, ymin0, xmax0, ymax0 = (
                outputLocations[n][0],
                outputLocations[n][1],
                outputLocations[n][0] + outputLocations[n][2],
                outputLocations[n][1] + outputLocations[n][3]
            )
            xmin1, ymin1, xmax1, ymax1 = (
                outputLocations[m][0],
                outputLocations[m][1],
                outputLocations[m][0] + outputLocations[m][2],
                outputLocations[m][1] + outputLocations[m][3]
            )

            # skip if the two boxes do not intersect at all
            if xmin0 > xmax1 or xmax0 < xmin1 or ymin0 > ymax1 or ymax0 < ymin1:
                continue

            iou = calculate_overlap(xmin0, ymin0, xmax0, ymax0, xmin1, ymin1, xmax1, ymax1)

            # suppress by IoU
            if iou > threshold:
                if objProbs[i] >= objProbs[j]:
                    order[j] = -1
                else:
                    order[i] = -1
                    break

    return 0  # return 0 to match the C++ code


def sigmoid(x):
    """ Sigmoid function """
    return 1.0 / (1.0 + np.exp(-x))


def unsigmoid(y):
    """ Inverse of the sigmoid (logit function) """
    return -1.0 * np.log((1.0 / y) - 1.0)


def softmax(input_array):
    """
    Numerically stable softmax

    Parameters:
        input_array (numpy.ndarray): 1D NumPy array

    Returns:
        numpy.ndarray: normalized softmax values
    """
    max_val = np.max(input_array)  # subtract the max for numerical stability
    exp_values = np.exp(input_array - max_val)  # exponentiate after shifting
    return exp_values / np.sum(exp_values)  # normalize


def compute_dfl(tensor, dfl_len):
    box = np.zeros(4, dtype=np.float32)
    for b in range(4):
        exp_t = np.exp(tensor[b * dfl_len: (b + 1) * dfl_len])
        exp_sum = np.sum(exp_t)
        acc_sum = np.sum(exp_t / exp_sum * np.arange(dfl_len))
        box[b] = acc_sum
    return box


def process_fp32_v8(box_tensor, score_tensor, score_sum_tensor, stride, dfl_len, threshold=0.25):
    print(f"box_tensor.shape: {box_tensor.shape}")
    print(f"score_tensor.shape: {score_tensor.shape}")

    assert (box_tensor.shape[2] == score_tensor.shape[2] and
            box_tensor.shape[3] == score_tensor.shape[3]), "box and score sizes do not match"
    if score_sum_tensor is not None:  # the score-sum branch is optional (see post_process_det)
        print(f"score_sum_tensor.shape: {score_sum_tensor.shape}")
        assert (score_tensor.shape[2] == score_sum_tensor.shape[2] and
                score_tensor.shape[3] == score_sum_tensor.shape[3]), "score and score_sum sizes do not match"

    grid_h = box_tensor.shape[2]  # input height divided by 8/16/32
    grid_w = box_tensor.shape[3]  # input width divided by 8/16/32
    obj_class_num = score_tensor.shape[1]  # dim 1 of the score tensor is the number of classes

    boxes = []
    objProbs = []
    classId = []
    validCount = 0

    for i in range(grid_h):
        for j in range(grid_w):
            max_class_id = -1

            # fast pre-filter using the score-sum tensor
            if score_sum_tensor is not None and score_sum_tensor[0][0][i][j] < threshold:
                continue

            max_score = 0
            for c in range(obj_class_num):
                if score_tensor[0][c][i][j] > threshold and score_tensor[0][c][i][j] > max_score:
                    max_score = score_tensor[0][c][i][j]
                    max_class_id = c

            # decode the box
            if max_score > threshold:
                before_dfl = np.zeros(dfl_len * 4, dtype=np.float32)
                for k in range(dfl_len * 4):
                    before_dfl[k] = box_tensor[0][k][i][j]  # gather all 4*dfl_len channels at this grid cell

                box = compute_dfl(before_dfl, dfl_len)  # DFL decode (see compute_dfl above)

                x1 = (-box[0] + j + 0.5) * stride
                y1 = (-box[1] + i + 0.5) * stride
                x2 = (box[2] + j + 0.5) * stride
                y2 = (box[3] + i + 0.5) * stride
                w = x2 - x1
                h = y2 - y1

                boxes.append([x1, y1, w, h])
                objProbs.append(max_score)
                classId.append(max_class_id)
                validCount += 1

    return validCount, boxes, objProbs, classId


def process_fp32_pose(input_tensor, stride, index, threshold=0.25):
    """
    Process the FP32 pose branch: parse the fused YOLO layout (64 DFL channels + class scores) and decode the boxes.
    """
    print(f"input_tensor.shape: {input_tensor.shape}")

    input_loc_len = 64
    filterBoxes = []
    boxScores = []
    classId = []
    validCount = 0
    thres_fp = unsigmoid(threshold)

    grid_h = input_tensor.shape[2]
    grid_w = input_tensor.shape[3]
    obj_class_num = input_tensor.shape[1] - input_loc_len

    for h in range(grid_h):
        for w in range(grid_w):
            for a in range(obj_class_num):
                if input_tensor[0][input_loc_len + a][h][w] >= thres_fp:  # confidence filter in logit space (avoids a per-cell sigmoid)
                    box_conf_f32 = sigmoid(input_tensor[0][input_loc_len + a][h][w])

                    # extract the 64 DFL location channels
                    loc = np.zeros(input_loc_len, dtype=np.float32)
                    for i in range(input_loc_len):
                        loc[i] = input_tensor[0][i][h][w]

                    # softmax over each group of 16 DFL bins
                    for i in range(input_loc_len // 16):
                        loc[i * 16: (i + 1) * 16] = softmax(loc[i * 16: (i + 1) * 16])

                    # DFL expectation: distances to the four box sides
                    xywh_ = np.zeros(4, dtype=np.float32)
                    xywh = np.zeros(4, dtype=np.float32)
                    for dfl in range(16):
                        xywh_[0] += loc[dfl + 0 * 16] * dfl
                        xywh_[1] += loc[dfl + 1 * 16] * dfl
                        xywh_[2] += loc[dfl + 2 * 16] * dfl
                        xywh_[3] += loc[dfl + 3 * 16] * dfl

                    xywh_[0] = (w + 0.5) - xywh_[0]
                    xywh_[1] = (h + 0.5) - xywh_[1]
                    xywh_[2] = (w + 0.5) + xywh_[2]
                    xywh_[3] = (h + 0.5) + xywh_[3]

                    # convert grid-space (x1, y1, x2, y2) to pixel-space (x, y, w, h)
                    xywh[0] = ((xywh_[0] + xywh_[2]) / 2) * stride
                    xywh[1] = ((xywh_[1] + xywh_[3]) / 2) * stride
                    xywh[2] = (xywh_[2] - xywh_[0]) * stride
                    xywh[3] = (xywh_[3] - xywh_[1]) * stride
                    xywh[0] = xywh[0] - xywh[2] / 2
                    xywh[1] = xywh[1] - xywh[3] / 2

                    # store the detection
                    filterBoxes.append([xywh[0], xywh[1], xywh[2], xywh[3], index + h * grid_w + w])  # x, y, w, h, keypoint index
                    boxScores.append(box_conf_f32)
                    classId.append(a)
                    validCount += 1

    return validCount, filterBoxes, boxScores, classId


def post_process_pose(model_in_h, model_in_w, outputs, letter_box, conf_threshold=0.25, nms_threshold=0.6):
    """
    Post-processing for the pose model
    """
    validCount = 0
    filterBoxes = []
    objProbs = []
    classId = []

    index = 0
    for i in range(3):
        grid_h = outputs[i].shape[2]
        grid_w = outputs[i].shape[3]
        stride = model_in_h // grid_h

        # validCount, filterBoxes, objProbs, classId
        vboc = process_fp32_pose(outputs[i], stride, index, conf_threshold)

        validCount += vboc[0]
        filterBoxes += vboc[1]
        objProbs += vboc[2]
        classId += vboc[3]

        index += grid_h * grid_w    # start index of the next resolution's head

    print(f"input_tensor.shape(kpt_tensor.shape): {outputs[-1].shape}")
    # no detections
    if validCount <= 0:
        return [[], [], [], []]

    indexArray = list(range(validCount))

    # run NMS per class
    unique_classes = set(classId)
    for c in unique_classes:
        nms_pose(validCount, filterBoxes, objProbs, classId, indexArray, c, nms_threshold)

    print(f"validCount: {validCount}")
    print(f"filterBoxes: {filterBoxes}")
    print(f"boxScores: {objProbs}")
    print(f"classId: {classId}")
    print(f"indexArray: {indexArray}")

    # assemble the final detections
    ret_bbox = []  # boxes to return
    ret_cls = []  # classes to return
    ret_conf = []  # confidences to return
    ret_kpt = []    # keypoints to return
    last_count = 0

    kpt_num = outputs[-1].shape[1]
    kpt_dim = outputs[-1].shape[2]
    print(f"kpt_num: {kpt_num}, kpt_dim: {kpt_dim}")

    for i in range(validCount):
        if indexArray[i] == -1 or last_count >= 300:
            continue
        n = indexArray[i]

        x1 = filterBoxes[n][0] - letter_box[2]
        y1 = filterBoxes[n][1] - letter_box[3]
        x2 = x1 + filterBoxes[n][2]
        y2 = y1 + filterBoxes[n][3]

        w = filterBoxes[n][2]
        h = filterBoxes[n][3]
        keypoints_index = int(filterBoxes[n][4])

        keypoints = np.zeros((kpt_num, kpt_dim), dtype=np.float32)

        for j in range(kpt_num):
            keypoints[j][0] = (((outputs[3][0][j][0][keypoints_index]) - letter_box[2]) / letter_box[0])
            keypoints[j][1] = (((outputs[3][0][j][1][keypoints_index]) - letter_box[3]) / letter_box[1])
            if kpt_dim == 3:
                keypoints[j][2] = outputs[3][0][j][2][keypoints_index]

        obj_cls = classId[n]
        obj_conf = objProbs[i]

        ret_bbox.append([int(clamp(x1, 0, model_in_w) / letter_box[0]),
                         int(clamp(y1, 0, model_in_h) / letter_box[1]),
                         int(clamp(x2, 0, model_in_w) / letter_box[0]),
                         int(clamp(y2, 0, model_in_h) / letter_box[1])])
        ret_conf.append(obj_conf)
        ret_cls.append(obj_cls)
        ret_kpt.append(keypoints)

        last_count += 1

    return ret_bbox, ret_conf, ret_cls, ret_kpt


def post_process_det(model_in_h, model_in_w, outputs, letter_box, conf_threshold=0.25, nms_threshold=0.6):
    filterBoxes = []
    objProbs = []
    classId = []
    validCount = 0

    # 3 output branches by default
    dfl_len = outputs[0].shape[1] // 4
    output_per_branch = len(outputs) // 3

    for i in range(3):
        score_sum = None

        if output_per_branch == 3:
            score_sum = outputs[i * output_per_branch + 2]

        box_idx = i * output_per_branch
        score_idx = i * output_per_branch + 1

        grid_h = outputs[box_idx].shape[2]
        grid_w = outputs[box_idx].shape[3]
        stride = model_in_h // grid_h

        # validCount, filterBoxes, objProbs, classId
        vboc = process_fp32_v8(outputs[box_idx], outputs[score_idx], score_sum, stride, dfl_len, conf_threshold)

        validCount += vboc[0]
        filterBoxes += vboc[1]
        objProbs += vboc[2]
        classId += vboc[3]

    # no objects detected
    if validCount <= 0:
        return [[], [], []]

    # build the index array
    indexArray = list(range(validCount))

    # run NMS per class
    unique_classes = set(classId)
    for c in unique_classes:
        nms(validCount, filterBoxes, objProbs, classId, indexArray, c, nms_threshold)

    print(f"validCount: {validCount}")
    print(f"filterBoxes: {filterBoxes}")
    print(f"objProbs: {objProbs}")
    print(f"classId: {classId}")
    print(f"indexArray: {indexArray}")

    # assemble the final detections
    ret_bbox = []  # boxes to return
    ret_cls = []  # classes to return
    ret_conf = []  # confidences to return
    last_count = 0

    for i in range(validCount):
        # cap the number of detections at 300
        if indexArray[i] == -1 or last_count >= 300:
            continue
        n = indexArray[i]

        x1 = filterBoxes[n][0] - letter_box[2]
        y1 = filterBoxes[n][1] - letter_box[3]
        x2 = x1 + filterBoxes[n][2]
        y2 = y1 + filterBoxes[n][3]
        obj_cls = classId[n]
        obj_conf = objProbs[i]

        ret_bbox.append([int(clamp(x1, 0, model_in_w) / letter_box[0]),
                         int(clamp(y1, 0, model_in_h) / letter_box[1]),
                         int(clamp(x2, 0, model_in_w) / letter_box[0]),
                         int(clamp(y2, 0, model_in_h) / letter_box[1])])
        ret_conf.append(obj_conf)
        ret_cls.append(obj_cls)

        last_count += 1

    return ret_bbox, ret_conf, ret_cls

4.2.3 YOLOv8 post-processing file yolov8_det_rk3588.py

Wrapped in a class: YOLOv8

Instantiation: yolov8_detector = YOLOv8(model_path, img_size=(384, 640), conf_thres=0.25, iou_thres=0.5)

Inference: bbox, conf, cls, cost = yolov8_detector(img_rgb)

import cv2
import onnxruntime
import time
import numpy as np
from ops_rk3588 import letterbox, post_process_det


class YOLOv8:
    def __init__(self, model_path, img_size=(640, 640), conf_thres=0.25, iou_thres=0.7):
        self.input_height = img_size[0]
        self.input_width = img_size[1]
        self.conf_threshold = conf_thres
        self.iou_threshold = iou_thres
        self.initialize_model(model_path)  # Initialize model

    def __call__(self, image):
        return self.pipeline(image)

    def pipeline(self, image):
        t0 = time.perf_counter()  # start time
        input_tensor = self.prepare_input(image)

        t1 = time.perf_counter()  # preprocess time
        outputs = self.inference(input_tensor)

        t2 = time.perf_counter()  # model infer time
        outputs = post_process_det(self.input_height, self.input_width, outputs, letter_box=[*self.ratio, *self.dwdh],
                                   conf_threshold=self.conf_threshold, nms_threshold=self.iou_threshold)
        print(outputs)
        self.boxes, self.scores, self.class_ids = outputs

        t3 = time.perf_counter()  # total time cost, and postprocess time

        return self.boxes, self.scores, self.class_ids, (t3 - t0, t1 - t0, t2 - t1, t3 - t2)

    def initialize_model(self, model_path):
        # TODO: exclude TensorRT
        providers = []
        if "CUDAExecutionProvider" in onnxruntime.get_available_providers():
            providers.append("CUDAExecutionProvider")

        providers.append("CPUExecutionProvider")
        self.session = onnxruntime.InferenceSession(model_path, providers=providers)
        self.get_model_details()

    def prepare_input(self, image):
        # Step 1: Get image dimensions
        self.img_height, self.img_width = image.shape[:2]

        # Step 2: Resize to input size, convert to float32 and scale pixel values to 0 to 1
        im, self.ratio, self.dwdh = letterbox(image, new_shape=(self.input_height, self.input_width),
                                              color=(114, 114, 114), auto=False)
        im = np.ascontiguousarray(im)  # contiguous
        im = (im.astype(np.float32) / 255.0)

        # Step 3: Transpose
        input_tensor = im.transpose((2, 0, 1))[np.newaxis, :, :, :]

        return input_tensor

    def inference(self, input_tensor):
        return self.session.run(self.output_names, {self.input_names[0]: input_tensor})

    def get_model_details(self):
        model_inputs = self.session.get_inputs()
        model_outputs = self.session.get_outputs()

        self.input_names = [model_inputs[i].name for i in range(len(model_inputs))]
        self.output_names = [model_outputs[i].name for i in range(len(model_outputs))]


if __name__ == '__main__':
    from ops_image import img_to_base64, get_image, resize_image
    from ops_draw import draw_detections_pipeline

    # load the model
    # model_path = "../det_hand/det_hand_s_384x640_250110.onnx"
    model_path = "../det_kx/det_kx_yolov8sd_x0.25_384x640_250213.onnx"
    yolov8_detector = YOLOv8(model_path, img_size=(384, 640), conf_thres=0.25, iou_thres=0.5)

    # load an image
    # img_1 = "https://pic4.zhimg.com/80/v2-81b33cc28e4ba869b7c2790366708e97_1440w.webp"  # read from a URL
    img_1 = "./test_data_2.jpg"

    # run inference and draw the results
    for img in [img_1, img_1]:
        img_rgb = get_image(img)
        img_bgr = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2BGR)

        # cv2.namedWindow("Output", cv2.WINDOW_NORMAL)
        cv2.imshow("Output", resize_image(img_bgr, 480, 480)[0])
        cv2.waitKey(0)

        bbox, conf, cls, cost = yolov8_detector(img_rgb)

        cost_time = [round(x * 1000, 2) for x in cost]
        info = (f"RUN SUCCESS: Total time: {cost_time[0]} ms, Preprocess time: {cost_time[1]} ms, "
                f"Inference time: {cost_time[2]} ms, Postprocess time: {cost_time[3]} ms. ")
        print(info)
        img_plot = draw_detections_pipeline(img_bgr, bbox, conf, cls)

        # cv2.namedWindow("Output", cv2.WINDOW_NORMAL)
        cv2.imshow("Output", resize_image(img_plot, 480, 480)[0])
        cv2.waitKey(0)

4.2.4 YOLOv8-pose post-processing file yolov8_pose_rk3588.py

Wrapped in a class (inherits from YOLOv8): YOLOv8KPT

Instantiation: pose_predictor = YOLOv8KPT(model_path, kpt_shape=(17, 3), img_size=(640, 640), conf_thres=0.3, iou_thres=0.5)

Inference (a single class by default, so no class ids are returned): bbox, conf, kpts, cost = pose_predictor(img_rgb)

import cv2
import time
from yolov8_det_rk3588 import YOLOv8
from ops_rk3588 import post_process_pose


class YOLOv8KPT(YOLOv8):
    def __init__(self, model_path, kpt_shape=(17, 3), img_size=(640, 640), conf_thres=0.6, iou_thres=0.7):
        super().__init__(model_path, img_size, conf_thres, iou_thres)

        self.kpt_shape = kpt_shape  # keypoint shape

    def pipeline(self, image):
        t0 = time.perf_counter()  # start time
        input_tensor = self.prepare_input(image)

        t1 = time.perf_counter()  # preprocess time
        outputs = self.inference(input_tensor)

        t2 = time.perf_counter()  # model infer time
        outputs = post_process_pose(self.input_height, self.input_width, outputs, letter_box=[*self.ratio, *self.dwdh],
                                    conf_threshold=self.conf_threshold, nms_threshold=self.iou_threshold)
        print(outputs)
        t3 = time.perf_counter()  # total time cost, and postprocess time
        self.boxes, self.scores, self.cls, self.kpt = outputs

        return self.boxes, self.scores, self.kpt, (t3 - t0, t1 - t0, t2 - t1, t3 - t2)


if __name__ == '__main__':
    from ops_image import get_image
    from ops_draw import draw_bboxes_and_keypoints
    from visual_config import pose_hand_cfg, pose_person_cfg

    # load the model
    model_path = r'../pose_ren/yolov8s-pose.onnx'
    # model_path = r'../pose_hand/yolov8s-pose-hand.onnx'
    pose_predictor = YOLOv8KPT(model_path, kpt_shape=(17, 3), img_size=(640, 640), conf_thres=0.3, iou_thres=0.5)
    # pose_predictor = YOLOv8KPT(model_path, kpt_shape=(21, 2), img_size=(480, 480), conf_thres=0.3, iou_thres=0.5)

    # load images
    img_1 = "./img/bus.jpg"
    img_2 = "./img/zidane.jpg"
    # img_1 = "./img/pose_hand_1.jpg"
    # img_2 = "./img/pose_hand_2.jpg"

    # run inference and draw the results
    for img in [img_1, img_2]:
        img_rgb = get_image(img)
        img_bgr = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2BGR)

        # cv2.namedWindow("Output", cv2.WINDOW_NORMAL)
        cv2.imshow("Output", img_bgr)
        cv2.waitKey(0)

        bbox, conf, kpts, cost = pose_predictor(img_rgb)

        # flatten each 17x3 array and collect them into a new list
        kpts = [kpt.flatten().tolist() for kpt in kpts]
        print(len(kpts))

        cost_time = [round(x * 1000, 2) for x in cost]
        info = (f"RUN SUCCESS: Total time: {cost_time[0]} ms, Preprocess time: {cost_time[1]} ms, "
                f"Inference time: {cost_time[2]} ms, Postprocess time: {cost_time[3]} ms. ")
        print(info)
        img_plot = draw_bboxes_and_keypoints(img_bgr, bbox, conf, kpts=kpts, **pose_person_cfg)
        # img_plot = draw_bboxes_and_keypoints(img_bgr, bbox, conf, kpts=kpts, **pose_hand_cfg)

        # cv2.namedWindow("Output", cv2.WINDOW_NORMAL)
        cv2.imshow("Output", img_plot)
        cv2.waitKey(0)

5. Displaying Python inference results

5.1 YOLOv8 results

Using hand detection as an example:

Analysis of the intermediate results (the numbers below come from the detection on the leftmost image, the one with the girl):

The model input size is 384×640, so the first box_tensor has shape [1, 64, 384 // 8, 640 // 8] == [1, 64, 48, 80]; the remaining 8 tensors follow the same pattern.

Filtering against the reduce_sum branch tensors and the confidence threshold leaves a total of 20 candidate boxes to run through NMS.

After NMS, the surviving detections are exactly those whose index values are not -1.

Finally, the inverse of the letterbox (pad-and-resize) operation maps the boxes back onto the original image.

A direct line-by-line translation of the C++ inference code performs dismally: the post-processing alone takes about 6 ms. Rewriting it with NumPy array operations gives a huge speedup (typically under 0.5 ms), as sketched below.
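As an illustration of that speedup, here is a rough vectorized sketch equivalent to process_fp32_v8 above (same arguments and return values; a sketch under the stated shapes, not a validated drop-in replacement):

import numpy as np

def process_fp32_v8_np(box_tensor, score_tensor, score_sum_tensor, stride, dfl_len, threshold=0.25):
    scores = score_tensor[0]                           # (class_num, h, w)
    max_score = scores.max(axis=0)                     # (h, w)
    mask = max_score > threshold
    if score_sum_tensor is not None:                   # same fast pre-filter as the loop version
        mask &= score_sum_tensor[0, 0] >= threshold
    ii, jj = np.nonzero(mask)                          # grid coordinates of the kept cells
    cls_id = scores[:, ii, jj].argmax(axis=0)
    # DFL decode: softmax over the dfl_len bins, then take the expectation.
    dist = box_tensor[0][:, ii, jj].reshape(4, dfl_len, -1)        # (4, dfl_len, n)
    dist = np.exp(dist - dist.max(axis=1, keepdims=True))
    dist /= dist.sum(axis=1, keepdims=True)
    ltrb = (dist * np.arange(dfl_len)[None, :, None]).sum(axis=1)  # (4, n): left, top, right, bottom
    x1 = (jj + 0.5 - ltrb[0]) * stride
    y1 = (ii + 0.5 - ltrb[1]) * stride
    w = (ltrb[0] + ltrb[2]) * stride
    h = (ltrb[1] + ltrb[3]) * stride
    boxes = np.stack([x1, y1, w, h], axis=1).tolist()
    return len(boxes), boxes, max_score[ii, jj].tolist(), cls_id.tolist()

On a 48×80 grid this replaces nearly 4,000 Python-loop iterations per branch with a handful of array operations, which is where the sub-0.5 ms figure comes from.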

5.2 YOLOv8-pose results

Human keypoint detection (input size 640×640; keypoints carry a visible/occluded flag):

Printed intermediate results:

Hand keypoint detection (input size 480×480; no visible/occluded flag):

Printed intermediate results:

Because the hand keypoints were auto-labeled, there is no visible/occluded distinction, so the final output carries no per-keypoint confidence.


6. Summary

Exporting to ONNX with the official RKNN code is aimed squarely at deployment on the RK3588: the resulting ONNX converts cleanly to RKNN, and RK's official code can then handle conversion and deployment in one step. The converted RKNN model keeps the same multi-head output structure described in Section 3.
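For reference, the follow-up ONNX→RKNN conversion is typically done with rknn-toolkit2 along the following lines. A minimal sketch, assuming rknn-toolkit2 is installed and that dataset.txt lists the quantization calibration images (all paths are placeholders):

from rknn.api import RKNN

rknn = RKNN()
# Normalization must match the /255.0 preprocessing in prepare_input above.
rknn.config(mean_values=[[0, 0, 0]], std_values=[[255, 255, 255]], target_platform='rk3588')
rknn.load_onnx(model='det_hand_s_384x640_250110.onnx')
rknn.build(do_quantization=True, dataset='./dataset.txt')
rknn.export_rknn('det_hand_s_384x640_250110.rknn')
rknn.release()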

Author: paradoxjun
