Reading the YOLO v7 Code (Part 3): The Loss Function

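In the previous two posts I analyzed the YOLO v7 model structure and the preparation of the training data. This post walks through the loss function code.

In train.py, the following code computes the loss: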

                if 'loss_ota' not in hyp or hyp['loss_ota'] == 1:
                    loss, loss_items = compute_loss_ota(pred, targets.to(device), imgs)  # loss scaled by batch_size
                else:
                    loss, loss_items = compute_loss(pred, targets.to(device))  # loss scaled by batch_size

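Because loss_ota is defined in the hyperparameters (the first branch is taken when the key is absent or equal to 1), compute_loss_ota is used for the computation. It corresponds to the ComputeLossOTA class defined in loss.py.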
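
To make the branch condition concrete, here is a minimal sketch that isolates it in a helper. The function name uses_ota_loss is my own and does not exist in the repository; only the condition itself is taken from the snippet above.

```python
def uses_ota_loss(hyp):
    """Return True when the training loop would call compute_loss_ota.

    hyp is the hyperparameter dict; the helper name is hypothetical.
    """
    return 'loss_ota' not in hyp or hyp['loss_ota'] == 1


print(uses_ota_loss({}))                # key absent: OTA loss is used
print(uses_ota_loss({'loss_ota': 1}))   # explicitly enabled
print(uses_ota_loss({'loss_ota': 0}))   # disabled: falls back to compute_loss
```

Note that a missing key behaves the same as loss_ota set to 1, so the OTA loss is the default.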

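Looking at the __call__ function of this class, its inputs are the model's predictions p, the targets contained in the training images, and the training images imgs.

For example, here is one input training image:

The targets contained in this image are: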

tensor([[0.0000e+00, 4.5000e+01, 5.1015e-01, 8.6860e-02, 9.6169e-01, 1.7279e-01],
        [0.0000e+00, 5.0000e+01, 8.6156e-01, 8.8347e-02, 2.5677e-01, 1.7359e-01],
        [0.0000e+00, 1.5000e+01, 9.0563e-01, 8.2660e-01, 1.8675e-01, 3.4358e-01],
        [0.0000e+00, 3.9000e+01, 7.5925e-02, 1.2316e-01, 1.5037e-01, 2.3427e-01],
        [0.0000e+00, 8.0000e+00, 5.1819e-01, 1.0259e-01, 4.6240e-01, 2.0505e-01],
        [0.0000e+00, 7.0000e+00, 7.8777e-01, 4.4500e-02, 2.7215e-02, 2.3584e-02],
        [0.0000e+00, 6.0000e+00, 5.3230e-01, 4.1727e-01, 5.3604e-01, 3.4526e-01],
        [0.0000e+00, 1.6000e+01, 5.0707e-01, 3.8664e-01, 3.7291e-02, 3.2220e-02]],
       device='cuda:0')

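The targets tensor holds 8 targets, each described by a 6-dimensional vector. The first element is the index of the image within the batch, the second is the object class, and the last four are the center xy and the wh of the box. The xywh values are normalized, i.e. expressed as a fraction of the image size.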

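The model output p is a list holding the predictions at three scales, with shapes (batch_size, 3, 80, 80, 85), (batch_size, 3, 40, 40, 85) and (batch_size, 3, 20, 20, 85). The 3 is the number of anchors per scale. 80*80 is the finest detection level, which divides the image into 80*80 grid cells. The 85 is the prediction vector of each cell and anchor: the xywh of the box, the objectness score, and the scores of the 80 classes, 4 + 1 + 80 = 85 values in total. These are raw outputs; the loss code applies the sigmoid itself.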
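
The three shapes follow from simple arithmetic. The sketch below assumes a 640*640 input and strides of 8, 16 and 32; neither value is stated in this post, they are just the combination that reproduces the 80, 40 and 20 grids. The helper name head_shapes is hypothetical.

```python
def head_shapes(batch_size, img_size=640, nc=80, na=3, strides=(8, 16, 32)):
    """Shapes of the three prediction tensors.

    img_size and strides are assumptions for illustration; nc is the number
    of classes and na the number of anchors per scale.
    """
    channels = 4 + 1 + nc  # xywh + objectness + class scores
    return [(batch_size, na, img_size // s, img_size // s, channels)
            for s in strides]


for shape in head_shapes(batch_size=2):
    print(shape)
```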

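Let's mark the objects from the targets tensor on the image:

Before the loss is computed, the input targets are processed first. ComputeLossOTA has a build_targets method that processes the original targets and generates more positive samples.

First, for each target, candidate positive samples are pre-selected from the different anchors and the neighboring grid cells, as in the following code: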

indices, anch = self.find_3_positive(p, targets)

The function is defined as follows; I have added comments:

def find_3_positive(self, p, targets):
    # Build targets for compute_loss(), input targets(image,class,x,y,w,h)
    na, nt = self.na, targets.shape[0]  # number of anchors, targets
    indices, anch = [], []
    gain = torch.ones(7, device=targets.device).long()  # normalized to gridspace gain
    ai = torch.arange(na, device=targets.device).float().view(na, 1).repeat(1, nt)  # same as .repeat_interleave(nt)
    # Repeat the targets na times along a new first dimension and append the anchor index to each copy
    # e.g. targets of shape [8,6] become a tensor of shape [3,8,7]
    targets = torch.cat((targets.repeat(na, 1, 1), ai[:, :, None]), 2)  # append anchor indices

    g = 0.5  # bias
    # off holds the offsets: the cell containing the target plus the 4 adjacent cells (left, up, right, down)
    off = torch.tensor([[0, 0],
                        [1, 0], [0, 1], [-1, 0], [0, -1],  # j,k,l,m
                        # [1, 1], [1, -1], [-1, 1], [-1, -1],  # jk,jm,lk,lm
                        ], device=targets.device).float() * g  # offsets

    # Process the targets for each of the three detection scales
    for i in range(self.nl):
        anchors = self.anchors[i]
        # Taking the finest scale (the 80*80 grid) as an example, gain is [1, 1, 80, 80, 80, 80, 1]
        gain[2:6] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]  # xyxy gain

        # Match targets to anchors: scale the normalized xywh to grid units
        t = targets * gain
        if nt:
            # Matches
            r = t[:, :, 4:6] / anchors[:, None]  # wh ratio between each target's width/height and each anchor's
            j = torch.max(r, 1. / r).max(2)[0] < self.hyp['anchor_t']  # compare: if the largest ratio is below anchor_t (4.0), this anchor is responsible for predicting the target
            # j = wh_iou(anchors, t[:, 4:6]) > model.hyp['iou_t']  # iou(3,n)=wh_iou(anchors(3,2), gwh(n,2))
            t = t[j]  # filter

            # Offsets
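            gxy = t[:, 2:4]  # grid xy: the target's xy in grid coordinates
            # Decide whether neighboring cells also become positives. E.g. for a target at (63.23, 21.75), the cells (62, 21) and (63, 22) are also positive samples for it
            gxi = gain[[2, 3]] - gxy  # inverse: grid size minus the target's grid xy
            j, k = ((gxy % 1. < g) & (gxy > 1.)).T  # if the grid coordinate is greater than 1 and its fractional part is below 0.5, the cell to the left (x) or above (y) also predicts this target
            l, m = ((gxi % 1. < g) & (gxi > 1.)).T  # likewise, the cell to the right (x) or below (y) also predicts this target
            j = torch.stack((torch.ones_like(j), j, k, l, m))   # shape [5, x]: the 5 rows are the decisions for the target's own cell and its four neighbors; x is the total number of samples across anchors. E.g. with 8 initial targets, of which 2 have a maximum wh ratio below 4 against all 3 anchors of the 80*80 scale, x=2*3=6
            t = t.repeat((5, 1, 1))[j]
            offsets = (torch.zeros_like(gxy)[None] + off[:, None])[j]    # offset of each selected sample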
        else:
            t = targets[0]
            offsets = 0

        # Define
        b, c = t[:, :2].long().T  # image, class
        gxy = t[:, 2:4]  # grid xy
        gwh = t[:, 4:6]  # grid wh
        gij = (gxy - offsets).long()      # grid cell of each positive sample, derived from its offset
        gi, gj = gij.T  # grid xy indices

        # Append
        a = t[:, 6].long()  # anchor indices
        indices.append((b, a, gj.clamp_(0, gain[3] - 1), gi.clamp_(0, gain[2] - 1)))  # image, anchor, grid indices
        anch.append(anchors[a])  # anchors

    return indices, anch
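
The anchor test in the "Matches" step can be tried on single numbers. The following is a plain-Python sketch, not the repository code: the helper name anchor_matches is mine, and the three anchor sizes (in grid units of the 80*80 level) are illustrative values that I assume here rather than read from the model config.

```python
def anchor_matches(gw, gh, anchors, anchor_t=4.0):
    """For one target of size (gw, gh) in grid units, return one bool per anchor.

    An anchor matches when neither side differs from the target's side by a
    factor of anchor_t or more, which is what max(r, 1/r).max(2) < anchor_t checks.
    """
    matches = []
    for aw, ah in anchors:
        rw, rh = gw / aw, gh / ah
        worst_ratio = max(rw, 1.0 / rw, rh, 1.0 / rh)
        matches.append(worst_ratio < anchor_t)
    return matches


# Hypothetical anchors (width, height) in grid units for the 80*80 level.
ANCHORS_80 = [(1.5, 2.0), (2.375, 4.5), (5.0, 3.5)]

small_target = (0.027215 * 80, 0.023584 * 80)  # wh of the class-7 target above
large_target = (0.96169 * 80, 0.17279 * 80)    # wh of the class-45 target above

print(anchor_matches(*small_target, ANCHORS_80))
print(anchor_matches(*large_target, ANCHORS_80))
```

Under these assumed anchors the small target passes the test for all three anchors, while the very wide target fails for all of them and is left to a coarser scale.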

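Continuing with the example targets: at the 80*80 scale, only two of the 8 targets have a maximum wh ratio below 4 against all three anchors of that scale. They are the following two targets, the truck and the dog in the image above: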

tensor([[0.0000e+00, 7.0000e+00, 7.8777e-01, 4.4500e-02, 2.7215e-02, 2.3584e-02],
        [0.0000e+00, 1.6000e+01, 5.0707e-01, 3.8664e-01, 3.7291e-02, 3.2220e-02]],
       device='cuda:0')

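The grid xy coordinates of these two targets are (63.0216, 3.56) and (40.5656, 30.9312). The neighboring cells (62, 3), (63, 4), (41, 30) and (40, 31) are therefore positives as well. Together with the cells containing the targets, (63, 3) and (40, 30), that makes 6 cells, and with 3 anchors the total number of samples is 3*6=18.

With these candidate positives in hand, the next step is to filter them further.

The predictions at the grid cells of these candidates are fetched, and their IOU with every target is computed: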
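
The neighbor rule can be checked by hand with a small plain-Python sketch that mirrors the j, k, l, m tests of find_3_positive for a single target. The helper name positive_cells is hypothetical; the grid coordinates are computed from the normalized xy of the two targets listed above.

```python
def positive_cells(gx, gy, grid_w, grid_h, g=0.5):
    """Cells (column, row) that become positives for a target at grid position (gx, gy)."""
    cells = [(int(gx), int(gy))]             # the cell containing the target
    if gx % 1.0 < g and gx > 1.0:            # j: target lies in the left half of its cell
        cells.append((int(gx - g), int(gy)))
    if gy % 1.0 < g and gy > 1.0:            # k: target lies in the upper half
        cells.append((int(gx), int(gy - g)))
    gxi, gyi = grid_w - gx, grid_h - gy      # distances from the right and bottom borders
    if gxi % 1.0 < g and gxi > 1.0:          # l: target lies in the right half
        cells.append((int(gx + g), int(gy)))
    if gyi % 1.0 < g and gyi > 1.0:          # m: target lies in the lower half
        cells.append((int(gx), int(gy + g)))
    # clamp to the grid, like gi.clamp_ / gj.clamp_ in the original
    return [(min(max(x, 0), grid_w - 1), min(max(y, 0), grid_h - 1))
            for x, y in cells]


truck_cells = positive_cells(0.78777 * 80, 0.0445 * 80, 80, 80)
dog_cells = positive_cells(0.50707 * 80, 0.38664 * 80, 80, 80)
print(truck_cells)
print(dog_cells)
```

Each target gets at most three cells: its own, one horizontal neighbor and one vertical neighbor, always the ones closest to the target center.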

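for i, pi in enumerate(p):     # p is the list of predictions at the three scales; p[0] has shape [batch_size, 3, 80, 80, 85]
    # Fetch the candidate positives found earlier: b is the image index within the batch, a the anchor, and gj, gi the cell coordinates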
    b, a, gj, gi = indices[i]    
    idx = (b == batch_idx)
    b, a, gj, gi = b[idx], a[idx], gj[idx], gi[idx]                
    all_b.append(b)
    all_a.append(a)
    all_gj.append(gj)
    all_gi.append(gi)
    all_anch.append(anch[i][idx])
    from_which_layer.append(torch.ones(size=(len(b),)) * i)
    
    # Fetch the predictions at the positive samples
    fg_pred = pi[b, a, gj, gi]                
    p_obj.append(fg_pred[:, 4:5])
    p_cls.append(fg_pred[:, 5:])
    
    grid = torch.stack([gi, gj], dim=1)
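    # Decode the predicted xy and wh. The sigmoid output is mapped with *2-0.5 because the pre-selection above lets the cells around the target's own cell predict it as well, so the xy offset has to cover the range [-0.5, 1.5]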
    pxy = (fg_pred[:, :2].sigmoid() * 2. - 0.5 + grid) * self.stride[i] #/ 8.
    #pxy = (fg_pred[:, :2].sigmoid() * 3. - 1. + grid) * self.stride[i]
    pwh = (fg_pred[:, 2:4].sigmoid() * 2) ** 2 * anch[i][idx] * self.stride[i] #/ 8.
    pxywh = torch.cat([pxy, pwh], dim=-1)
    pxyxy = xywh2xyxy(pxywh)
    pxyxys.append(pxyxy)

pxyxys = torch.cat(pxyxys, dim=0)
if pxyxys.shape[0] == 0:
    continue
p_obj = torch.cat(p_obj, dim=0)
p_cls = torch.cat(p_cls, dim=0)
from_which_layer = torch.cat(from_which_layer, dim=0)
all_b = torch.cat(all_b, dim=0)
all_a = torch.cat(all_a, dim=0)
all_gj = torch.cat(all_gj, dim=0)
all_gi = torch.cat(all_gi, dim=0)
all_anch = torch.cat(all_anch, dim=0)

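# IOU between the predictions and the targets. E.g. with 99 candidate positives from the pre-selection and 8 targets, pair_wise_iou has shape [8,99]: every target is compared against all 99 candidates
pair_wise_iou = box_iou(txyxy, pxyxys)
# IOU loss: the larger the IOU, the smaller this loss
pair_wise_iou_loss = -torch.log(pair_wise_iou + 1e-8)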
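
To see the decoding and the IOU cost on concrete numbers, here is a scalar sketch in plain Python. It follows the same formulas as the excerpt, but all function names are my own, the anchor size is an illustrative value in grid units, and the IOU is the plain box IOU that box_iou computes.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, gi, gj, anchor_w, anchor_h, stride):
    """Turn raw outputs of one cell into a pixel-space (cx, cy, w, h) box."""
    cx = (sigmoid(tx) * 2.0 - 0.5 + gi) * stride   # offset range (-0.5, 1.5) around the cell
    cy = (sigmoid(ty) * 2.0 - 0.5 + gj) * stride
    w = (sigmoid(tw) * 2.0) ** 2 * anchor_w * stride  # between 0 and 4 times the anchor size
    h = (sigmoid(th) * 2.0) ** 2 * anchor_h * stride
    return cx, cy, w, h


def xywh_to_xyxy(cx, cy, w, h):
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def iou_xyxy(a, b):
    """Plain IOU of two (x1, y1, x2, y2) boxes with positive area."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def iou_cost(iou, eps=1e-8):
    return -math.log(iou + eps)


# All-zero raw outputs decode to the cell center with exactly the anchor size.
box = decode_box(0, 0, 0, 0, gi=62, gj=3, anchor_w=1.5, anchor_h=2.0, stride=8)
print(box)
print(iou_cost(0.9), iou_cost(0.5))
```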

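Next, the IOUs are sorted and the 10 largest are taken for each target (or all of them if there are fewer than 10). These are summed and the result is truncated to an integer, with a minimum of 1. This value is the number of positive samples assigned to that target.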

top_k, _ = torch.topk(pair_wise_iou, min(10, pair_wise_iou.shape[1]), dim=1)
dynamic_ks = torch.clamp(top_k.sum(1).int(), min=1)

For example, after computing the IOU of the 8 targets against the 99 candidates and taking the 10 largest per target, the values are:

tensor([[0.33164, 0.31670, 0.31670, 0.27168, 0.27168, 0.26650, 0.26650, 0.24327, 0.23278, 0.23109],
        [0.59865, 0.57707, 0.46241, 0.41670, 0.38554, 0.37295, 0.36225, 0.33483, 0.33090, 0.22409],
        [0.74503, 0.71696, 0.58363, 0.37660, 0.36653, 0.36114, 0.34860, 0.32036, 0.29622, 0.16013],
        [0.46875, 0.46193, 0.39073, 0.30311, 0.27263, 0.17448, 0.14117, 0.11435, 0.09316, 0.07584],
        [0.53946, 0.53946, 0.44517, 0.44517, 0.38395, 0.36339, 0.36339, 0.31876, 0.31780, 0.31460],
        [0.46457, 0.30825, 0.29158, 0.20381, 0.20071, 0.14967, 0.14524, 0.13727, 0.12137, 0.02018],
        [0.66553, 0.52719, 0.30917, 0.26019, 0.20343, 0.19609, 0.17714, 0.17714, 0.15379, 0.14393],
        [0.56813, 0.54395, 0.38423, 0.29507, 0.23172, 0.21839, 0.16346, 0.16189, 0.12244, 0.05927]], device='cuda:0', grad_fn=<TopkBackward>)

After the summation, dynamic_ks is:

tensor([2, 4, 4, 2, 4, 2, 2, 2], device='cuda:0', dtype=torch.int32)

So target 1 gets 2 positive samples and target 2 gets 4.

Next, the classification loss of every candidate positive against each target is computed:
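
The dynamic k rule is easy to reproduce in plain Python. The sketch below (the helper name dynamic_k is mine) applies it to the first two rows of the top-10 table above.

```python
def dynamic_k(ious, max_candidates=10):
    """Number of positives for one target: truncated sum of its top IOUs, at least 1."""
    top = sorted(ious, reverse=True)[:max_candidates]
    return max(int(sum(top)), 1)


row_1 = [0.33164, 0.31670, 0.31670, 0.27168, 0.27168,
         0.26650, 0.26650, 0.24327, 0.23278, 0.23109]
row_2 = [0.59865, 0.57707, 0.46241, 0.41670, 0.38554,
         0.37295, 0.36225, 0.33483, 0.33090, 0.22409]

print(dynamic_k(row_1), dynamic_k(row_2))
```

The sums are about 2.75 and 4.07, so truncation gives 2 and 4. A target whose candidates all overlap poorly still keeps one positive because of the minimum of 1.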

gt_cls_per_image = (
    F.one_hot(this_target[:, 1].to(torch.int64), self.nc)
    .float()
    .unsqueeze(1)
    .repeat(1, pxyxys.shape[0], 1)
)

num_gt = this_target.shape[0]
cls_preds_ = (
    p_cls.float().unsqueeze(0).repeat(num_gt, 1, 1).sigmoid_()
    * p_obj.unsqueeze(0).repeat(num_gt, 1, 1).sigmoid_()
)

y = cls_preds_.sqrt_()
pair_wise_cls_loss = F.binary_cross_entropy_with_logits(
    torch.log(y/(1-y)) , gt_cls_per_image, reduction="none"
).sum(-1)
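
The detour through torch.log(y/(1-y)) can look odd at first. As far as I can tell, it simply converts the fused score y = sqrt(class probability * objectness probability) back into a logit, so that binary_cross_entropy_with_logits ends up computing an ordinary binary cross entropy on y. The scalar sketch below checks this for one (target, candidate) pair; all function names and the example logits are hypothetical, and it does not guard against y reaching exactly 0 or 1.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(logit, target):
    """Numerically stable binary cross entropy on a logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def plain_bce(y, target):
    """Binary cross entropy on a probability y."""
    return -(target * math.log(y) + (1.0 - target) * math.log(1.0 - y))


def fused_scores(cls_logits, obj_logit):
    """y = sqrt(sigmoid(class logit) * sigmoid(objectness logit)) for every class."""
    return [math.sqrt(sigmoid(c) * sigmoid(obj_logit)) for c in cls_logits]


def pair_cls_cost(cls_logits, obj_logit, gt_class):
    """Class cost of one candidate for one target, summed over the classes."""
    cost = 0.0
    for c, y in enumerate(fused_scores(cls_logits, obj_logit)):
        target = 1.0 if c == gt_class else 0.0
        cost += bce_with_logits(math.log(y / (1.0 - y)), target)
    return cost


logits, obj = [2.0, -1.0, 0.5], 1.0
print(pair_cls_cost(logits, obj, gt_class=0))  # candidate agrees with the target class
print(pair_cls_cost(logits, obj, gt_class=1))  # candidate disagrees: higher cost
```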

The result pair_wise_cls_loss has shape [8,99]. It is added to the IOU loss computed earlier, which is weighted by 3.0, to give the cost:

    cost = (
        pair_wise_cls_loss
        + 3.0 * pair_wise_iou_loss
    )

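The cost is then used for a top-K selection in ascending order, where K is the dynamic_ks value computed above. This gives the indices of the K candidates with the lowest cost: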

matching_matrix = torch.zeros_like(cost)
for gt_idx in range(num_gt):
    _, pos_idx = torch.topk(
        cost[gt_idx], k=dynamic_ks[gt_idx].item(), largest=False
    )
    matching_matrix[gt_idx][pos_idx] = 1.0

For example, dynamic_ks is 2 for the first target, so the indices of the 2 lowest-cost candidates are taken. Here the resulting pos_idx is:

tensor([74, 62], device='cuda:0')

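That is, candidates 74 and 62 are selected for the first target.

Note that several positive samples can be selected for one target, i.e. several positives may predict the same target. A single positive sample, however, cannot predict more than one target. The following code checks for this case and, if it occurs, keeps only the target with the lowest cost: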
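
The selection step amounts to "take the k cheapest candidates in each row". A plain-Python sketch with a made-up 2x4 cost matrix follows; the helper names and all numbers are hypothetical.

```python
def lowest_cost_indices(cost_row, k):
    """Indices of the k candidates with the smallest cost, cheapest first."""
    order = sorted(range(len(cost_row)), key=lambda idx: cost_row[idx])
    return order[:k]


def build_matching(cost, ks):
    """Matching matrix with a 1.0 wherever a target selects a candidate."""
    matching = [[0.0] * len(row) for row in cost]
    for gt_idx, row in enumerate(cost):
        for cand_idx in lowest_cost_indices(row, ks[gt_idx]):
            matching[gt_idx][cand_idx] = 1.0
    return matching


cost = [[5.0, 1.2, 3.3, 0.7],
        [0.9, 2.5, 0.4, 6.0]]
matching = build_matching(cost, ks=[2, 1])
print(matching)
```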

anchor_matching_gt = matching_matrix.sum(0)
if (anchor_matching_gt > 1).sum() > 0:
    _, cost_argmin = torch.min(cost[:, anchor_matching_gt > 1], dim=0)
    matching_matrix[:, anchor_matching_gt > 1] *= 0.0
    matching_matrix[cost_argmin, anchor_matching_gt > 1] = 1.0

Then work out which of the candidates were selected to predict a target, and the index of the target each one predicts:

fg_mask_inboxes = matching_matrix.sum(0) > 0.0
matched_gt_inds = matching_matrix[:, fg_mask_inboxes].argmax(0)
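
Here is the conflict handling and the final lookup as a plain-Python sketch on a tiny hand-made example. The names are hypothetical. Like the tensor code, it picks the cheapest target over all targets for a contested candidate, not only over the targets that selected it.

```python
def resolve_conflicts(matching, cost):
    """Make sure every candidate (column) is assigned to at most one target (row)."""
    num_gt, num_cand = len(matching), len(matching[0])
    for cand in range(num_cand):
        claims = sum(matching[gt][cand] for gt in range(num_gt))
        if claims > 1:
            best_gt = min(range(num_gt), key=lambda gt: cost[gt][cand])
            for gt in range(num_gt):
                matching[gt][cand] = 1.0 if gt == best_gt else 0.0
    return matching


def matched_targets(matching):
    """Return (fg_mask, matched_gt_inds) in the spirit of the two lines above."""
    num_gt, num_cand = len(matching), len(matching[0])
    fg_mask = [sum(matching[gt][c] for gt in range(num_gt)) > 0 for c in range(num_cand)]
    matched = [max(range(num_gt), key=lambda gt: matching[gt][c])
               for c in range(num_cand) if fg_mask[c]]
    return fg_mask, matched


matching = [[1.0, 1.0, 0.0, 0.0],
            [0.0, 1.0, 1.0, 0.0]]
cost = [[0.5, 2.0, 9.0, 7.0],
        [4.0, 1.0, 0.3, 8.0]]
matching = resolve_conflicts(matching, cost)
fg_mask, matched = matched_targets(matching)
print(matching)
print(fg_mask, matched)
```

Candidate 1 was claimed by both targets and goes to target 1, whose cost for it is lower. Candidate 3 was never selected, so it drops out of the positives.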

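The anchors, grid xy coordinates, target indices and so on of the selected positives are saved and returned. These are the final positive samples after filtering, and they serve as the inputs to the loss computation used for training.

Take the truck target from the example above: at the fine-grained 80*80 detection layer it had three candidate positives, [63,3], [62,3] and [63,4]. After the further filtering, two of them, [63,3] and [62,3], are kept.

Finally comes the loss computation itself. This part is fairly simple and consists of three components: the IOU loss between the positives and their targets, the objectness loss, and the classification loss.

The IOU loss is computed as follows: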

pxy = ps[:, :2].sigmoid() * 2. - 0.5
pwh = (ps[:, 2:4].sigmoid() * 2) ** 2 * anchors[i]
pbox = torch.cat((pxy, pwh), 1)  # predicted box
selected_tbox = targets[i][:, 2:6] * pre_gen_gains[i]
selected_tbox[:, :2] -= grid
iou = bbox_iou(pbox.T, selected_tbox, x1y1x2y2=False, CIoU=True)  # iou(prediction, target)
lbox += (1.0 - iou).mean()  # iou loss

The target for the objectness loss is built as follows:

tobj[b, a, gj, gi] = (1.0 - self.gr) + self.gr * iou.detach().clamp(0).type(tobj.dtype)
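
For a single positive sample the objectness target reduces to one line. In this sketch gr plays the role of self.gr; the default of 1.0 is an assumption on my part, and the helper name is hypothetical.

```python
def objectness_target(iou, gr=1.0):
    """Blend between a hard target of 1 (gr=0) and the clamped IOU (gr=1)."""
    return (1.0 - gr) + gr * max(iou, 0.0)


print(objectness_target(0.6))           # gr=1: the target is the IOU itself
print(objectness_target(0.6, gr=0.5))   # halfway between 1.0 and the IOU
print(objectness_target(-0.2))          # a negative CIoU value is clamped to 0
```

The clamp matters because the CIoU used for the box loss can be negative, while an objectness target has to stay within [0, 1].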

The classification loss is computed as follows:

selected_tcls = targets[i][:, 1].long()
if self.nc > 1:  # cls loss (only if multiple classes)
    t = torch.full_like(ps[:, 5:], self.cn, device=device)  # targets
    t[range(n), selected_tcls] = self.cp
    lcls += self.BCEcls(ps[:, 5:], t)  # BCE
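
The target matrix t can be pictured as a one-hot row per positive, with self.cp at the true class and self.cn everywhere else. The sketch below builds one such row in plain Python. How cp and cn are derived is not shown in this post; the label-smoothing helper here is my assumption of the usual scheme, and both function names are hypothetical.

```python
def smoothed_values(eps=0.0):
    """Assumed label-smoothing scheme: returns (cp, cn) for a smoothing factor eps."""
    return 1.0 - 0.5 * eps, 0.5 * eps


def class_target_row(true_class, nc, cp=1.0, cn=0.0):
    """Target vector of one positive sample: cn everywhere, cp at the true class."""
    row = [cn] * nc
    row[true_class] = cp
    return row


print(class_target_row(2, nc=4))
cp, cn = smoothed_values(eps=0.1)
print(class_target_row(2, nc=4, cp=cp, cn=cn))
```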

Finally, the three losses are scaled by their hyperparameter weights and added together:

lbox *= self.hyp['box']
lobj *= self.hyp['obj']
lcls *= self.hyp['cls']
bs = tobj.shape[0]  # batch size

loss = (lbox + lobj + lcls)*bs
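
As a quick check of the final arithmetic, here is the same weighted sum on plain numbers. The weights in the example are made up so the result is easy to verify by hand; they are not the values from the hyperparameter file, and the helper name is hypothetical.

```python
def total_loss(lbox, lobj, lcls, hyp, batch_size):
    """Weighted sum of the three components, scaled by the batch size."""
    return (lbox * hyp['box'] + lobj * hyp['obj'] + lcls * hyp['cls']) * batch_size


# Hypothetical weights chosen for easy mental arithmetic.
example_hyp = {'box': 0.5, 'obj': 0.25, 'cls': 0.125}
print(total_loss(1.0, 2.0, 4.0, example_hyp, batch_size=8))
```

Each component contributes 0.5 after weighting, so the sum is 1.5, and scaling by the batch size of 8 gives 12.0.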

In the next post, I will continue with an analysis of the YOLOv7 training process.
