[Code Study] Extracting Video Features with C3D

青葱年少 • May 9, 2022, 11:07 AM • Technical article

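Learning Goals

Theory: characteristics of the C3D network
Hands-on skills: use C3D to extract sliding-window features (windows of 64, 128, 256, and 512 frames, with 80% overlap between adjacent windows) for the target datasets (Charades-STA, ActivityNet Captions, TVR)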
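
The window layout in the second goal can be sketched in a few lines. This is only an illustration: the function name `sliding_windows` is hypothetical, and because 20% of a window is not always a whole number of frames, the stride is rounded (so the overlap is approximately, not exactly, 80%).

```python
def sliding_windows(num_frames, window, overlap=0.8):
    """Return (start, end) frame index pairs for one window size; end is exclusive."""
    # Stride is the non-overlapping part of the window, rounded to whole frames.
    stride = max(1, int(round(window * (1.0 - overlap))))
    if num_frames < window:
        return []  # the video is too short for this window size
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]


# Window sizes taken from the learning goal; 1000 frames is an arbitrary example length.
for w in (64, 128, 256, 512):
    wins = sliding_windows(1000, w)
    print(w, len(wins), wins[:2])
```

Windows that would run past the last frame are dropped here; padding the tail instead is an equally reasonable choice.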

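Answers

Characteristic 1: C3D uses a 3D convolutional network, so temporal information is taken into account during convolution, which enables motion modeling. This is achieved by convolving along the frame axis as an additional depth dimension: for example, with a 3 × 3 × 3 kernel the receptive field along depth is 3, which means each convolution relates 3 consecutive frames. Characteristic 2: it uses homogeneous 3 × 3 × 3 kernels throughout, which the C3D paper reports as the best-performing kernel size among those it compared.
See the Implementation section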
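
The temporal receptive field in characteristic 1 can be checked with a small NumPy experiment. This is a simplified sketch, not the C3D layer itself: it uses a single channel, no padding, and a hypothetical helper `conv3d_valid`. Only one input frame is non-zero, and the output shows which output time steps "see" that frame.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv3d_valid(clip, kernel):
    """Naive single-channel 3D cross-correlation, stride 1, no padding.

    clip: (frames, height, width); kernel: (kd, kh, kw).
    """
    # windows has shape (T', H', W', kd, kh, kw): one 3D patch per output position.
    windows = sliding_window_view(clip, kernel.shape)
    return np.einsum('thwijk,ijk->thw', windows, kernel)


clip = np.zeros((8, 6, 6), dtype=np.float32)
clip[4] = 1.0  # only frame 4 carries any signal
kernel = np.ones((3, 3, 3), dtype=np.float32)

out = conv3d_valid(clip, kernel)
print(out.shape)
# Output time steps whose 3-frame window includes frame 4.
print(np.nonzero(out.sum(axis=(1, 2)))[0])
```

Each output time step t covers input frames t to t+2, so a single frame influences exactly three neighbouring outputs.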

Network Architecture

Layer	Kernel	padding (conv) / stride (pool)	Input	Output	Activation
conv1					ReLU
pool1
conv2					ReLU
pool2
conv3a					ReLU
conv3b					ReLU
pool3
conv4a					ReLU
conv4b					ReLU
pool4
conv5a					ReLU
conv5b					ReLU
pool5		, (padding)
view
fc6			ReLU+dropout
fc7			ReLU+dropout
fc8			softmax
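
To make the shapes in the table concrete, the sketch below traces a (3, 16, 112, 112) clip through the layers in pure Python. The kernel, stride, and padding values are assumptions taken from the C3D paper and the commonly used PyTorch port, not from the table cells above; check them against C3D_model.py before relying on them.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling layer along one axis (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumed settings: every conv is 3x3x3 with padding 1; most pools are 2x2x2 with stride 2.
CONV = dict(kernel=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1))
POOL = dict(kernel=(2, 2, 2), stride=(2, 2, 2), padding=(0, 0, 0))

# (layer name, output channels or None for pooling, layer settings)
LAYERS = [
    ("conv1", 64, CONV),
    ("pool1", None, dict(kernel=(1, 2, 2), stride=(1, 2, 2), padding=(0, 0, 0))),
    ("conv2", 128, CONV),
    ("pool2", None, POOL),
    ("conv3a", 256, CONV),
    ("conv3b", 256, CONV),
    ("pool3", None, POOL),
    ("conv4a", 512, CONV),
    ("conv4b", 512, CONV),
    ("pool4", None, POOL),
    ("conv5a", 512, CONV),
    ("conv5b", 512, CONV),
    ("pool5", None, dict(kernel=(2, 2, 2), stride=(2, 2, 2), padding=(0, 1, 1))),
]


def trace_shapes(input_shape=(3, 16, 112, 112)):
    """Return [(layer name, (channels, frames, height, width)), ...]."""
    channels, dims = input_shape[0], list(input_shape[1:])
    rows = []
    for name, out_channels, cfg in LAYERS:
        dims = [out_size(n, k, s, p)
                for n, k, s, p in zip(dims, cfg["kernel"], cfg["stride"], cfg["padding"])]
        if out_channels is not None:
            channels = out_channels
        rows.append((name, (channels, *dims)))
    return rows


rows = trace_shapes()
for name, shape in rows:
    print(f"{name:7s} {shape}")

channels, depth, height, width = rows[-1][1]
flat_features = channels * depth * height * width  # length of the vector that view passes to fc6
print("view   ", flat_features)
```

Under these assumptions pool1 leaves the frame axis untouched, and the spatial padding in pool5 is what turns a 7 × 7 map into 4 × 4 before flattening.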

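Project Setup

Download the existing implementation: c3d-pytorch[0]
Download the pretrained weights (they can also be obtained through the project)[0]
Modify the code in predict.py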

Line-by-Line Code Comments

""" How to use C3D network. """
import numpy as np

import torch
from torch.autograd import Variable

from os.path import join
# glob is a standard-library module for finding files and directories that match a wildcard pattern
from glob import glob

# Used to load and process video frames, so the project's input is frames already extracted from the video
import skimage.io as io
from skimage.transform import resize

from C3D_model import C3D


def get_sport_clip(clip_name, verbose=True):
    """
    Loads a clip to be fed to C3D for classification.
    TODO: should I remove mean here?
    
    Parameters
    ----------
    clip_name: str
        the name of the clip (subfolder in 'data'). Here it is 'roger'.
    verbose: bool
        if True, shows the unrolled clip (default is True).
    Returns
    -------
    Tensor
        (batch_size, channels, frames, height, width)
        a pytorch batch (n, ch, fr, h, w).
    """

    # glob returns the paths of all files matching the pattern, as a list
    # here the pattern is ./data/roger/*.png
    clip = sorted(glob(join('data', clip_name, '*.png')))
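    # Resize each frame to shape (112, 200, 3) and stack the frames into an np.array
    # of shape (frames, 112, 200, 3). The paper sets frames to 16, i.e. features are extracted in units of 16 frames,
    # so to extract sliding-window features the input images should be prepared in advance,
    # with each feature-extraction unit placed in its own folder (such as 'roger' here)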
    clip = np.array([resize(io.imread(frame), output_shape=(112, 200), preserve_range=True) for frame in clip])
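    # Keep the central 112 columns along the width (a center crop, which takes the place of the random-crop jittering the paper uses during training)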
    clip = clip[:, :, 44:44+112, :]  # crop centrally

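    # This step displays the whole clip as a strip of frames laid side by side
    # If there are many videos to process, or no visualization is needed, set verbose to False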
    if verbose:
        clip_img = np.reshape(clip.transpose(1, 0, 2, 3), (112, 16 * 112, 3))
        io.imshow(clip_img.astype(np.uint8))
        io.show()

    # Move channels first, add the batch dimension, and convert to a float32 tensor
    clip = clip.transpose(3, 0, 1, 2)  # ch, fr, h, w
    clip = np.expand_dims(clip, axis=0)  # batch axis
    clip = np.float32(clip)

    return torch.from_numpy(clip)


def read_labels_from_file(filepath):
    """
    Reads Sport1M labels from file
    
    Parameters
    ----------
    filepath: str
        the file.
        
    Returns
    -------
    list
        list of sport names.
    """
    with open(filepath, 'r') as f:
        labels = [line.strip() for line in f.readlines()]
    return labels


def main():
    """
    Main function.
    """

    # load a clip to be predicted
    X = get_sport_clip('roger')
    # X: batch_size, channels, frames, height, width
    # X: 1, 3, 16, 112, 112
    X = Variable(X)
    X = X.cuda()

    # get network pretrained model
    net = C3D()
    net.load_state_dict(torch.load('c3d.pickle'))
    net.cuda()
    net.eval()

    # perform prediction
    # prediction: 1, 487 (there are 487 classes)
    # For feature extraction, none of the steps below are needed; just take the fc6 output
    prediction = net(X)
    prediction = prediction.data.cpu().numpy()

    # read labels
    labels = read_labels_from_file('labels.txt')

    # print top predictions
    top_inds = prediction[0].argsort()[::-1][:5]  # reverse sort and take five largest items
    print('\nTop 5:')
    for i in top_inds:
        print('{:.5f} {}'.format(prediction[0][i], labels[i]))


# entry point
if __name__ == '__main__':
    main()
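
The comment in main() notes that only the fc6 output is needed for feature extraction. One way to get it without editing the model class is a PyTorch forward hook. The sketch below is an assumption-laden illustration: `TinyHead` is a hypothetical stand-in for the fully connected part of the network (with deliberately small layer sizes), and it assumes the real model exposes a layer attribute named `fc6`, which should be checked in C3D_model.py. The hook captures the raw output of the linear layer, before ReLU and dropout.

```python
import torch
import torch.nn as nn


class TinyHead(nn.Module):
    """Hypothetical stand-in for the fc6 -> fc7 -> fc8 head of C3D."""

    def __init__(self, in_dim, hidden, n_classes):
        super().__init__()
        self.fc6 = nn.Linear(in_dim, hidden)
        self.fc7 = nn.Linear(hidden, hidden)
        self.fc8 = nn.Linear(hidden, n_classes)
        self.relu = nn.ReLU()

    def forward(self, x):
        h = self.relu(self.fc6(x))
        h = self.relu(self.fc7(h))
        return self.fc8(h)


def extract_fc6(net, x):
    """Run net on x and return the output of net.fc6 captured by a forward hook."""
    captured = {}

    def hook(module, inputs, output):
        captured["fc6"] = output.detach().cpu()

    handle = net.fc6.register_forward_hook(hook)
    try:
        net.eval()  # disable dropout for deterministic features
        with torch.no_grad():  # no gradients are needed for feature extraction
            net(x)
    finally:
        handle.remove()  # always detach the hook, even if the forward pass fails
    return captured["fc6"]


net = TinyHead(in_dim=32, hidden=16, n_classes=5)  # small sizes keep the demo fast
x = torch.randn(2, 32)
feat = extract_fc6(net, x)
print(feat.shape)
```

With the real network, `x` would be the (1, 3, 16, 112, 112) clip tensor and the captured feature would have 4096 dimensions per clip. Applying ReLU to the captured tensor afterwards is a common variant.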

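Implementation

Prepare the input frames for each video according to the video name, window size, and stride
Modify the network output so that only the fc6 output is kept as the video feature
Write the output to .npy files that store the features together with other attributes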
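
The last step can be sketched with NumPy as follows. The record layout is an assumption for illustration: the field names (`video_id`, `window`, `stride`, `starts`, `features`) and the helper names are hypothetical, and the random array stands in for real fc6 outputs.

```python
import os
import tempfile

import numpy as np


def save_window_features(path, features, video_id, window, stride, starts):
    """Store features and their metadata in one .npy file as a pickled dict."""
    record = {
        "video_id": video_id,
        "window": window,    # window length in frames
        "stride": stride,    # step between window starts, in frames
        "starts": np.asarray(starts, dtype=np.int64),
        "features": np.asarray(features, dtype=np.float32),
    }
    np.save(path, record, allow_pickle=True)


def load_window_features(path):
    """Load a record written by save_window_features."""
    # np.save wraps the dict in a 0-d object array; .item() unwraps it.
    return np.load(path, allow_pickle=True).item()


out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "demo_video_w64.npy")
fake_features = np.random.rand(3, 4096).astype(np.float32)  # placeholder for real fc6 outputs
save_window_features(out_path, fake_features, "demo_video",
                     window=64, stride=13, starts=[0, 13, 26])

loaded = load_window_features(out_path)
print(loaded["video_id"], loaded["features"].shape)
```

Because the dict is stored via pickle, such files should only be loaded from trusted sources. `np.savez` with one array per field is an alternative that avoids pickling.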
