[Code Study] Extracting Video Features with C3D

青葱年少

2 years ago

Learning objectives

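Theory: what characterizes the C3D network
Hands-on: use C3D to extract sliding-window features for the target datasets (Charades-STA, ActivityNet Captions, TVR), with windows of 64, 128, 256, and 512 frames and 80% overlap between consecutive windows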
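
To make the window arithmetic concrete, here is a minimal sketch of how the window boundaries could be computed. The function names are hypothetical, and two details are assumptions rather than something stated above: the stride is the window length times (1 - overlap), rounded to the nearest frame, and a trailing partial window is dropped.

```python
def window_stride(window, overlap=0.8):
    """Step between window starts, in frames (assumption: rounded to nearest frame)."""
    return max(1, int(round(window * (1 - overlap))))


def sliding_windows(num_frames, window, overlap=0.8):
    """Return [start, end) frame spans of fixed length `window`.

    Windows that would run past the end of the video are dropped, so a video
    shorter than `window` yields an empty list.
    """
    stride = window_stride(window, overlap)
    spans = []
    start = 0
    while start + window <= num_frames:
        spans.append((start, start + window))
        start += stride
    return spans


for w in (64, 128, 256, 512):
    print(w, window_stride(w), len(sliding_windows(1000, w)))
```

With this rounding, a 64-frame window advances 13 frames at a time, so the effective overlap is only approximately 80%.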

Answers

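Feature 1: C3D uses 3D convolutions, so temporal information is taken into account during the convolution itself, which makes motion modeling possible. This works by treating the frame axis as an extra depth dimension that the kernel slides over: with a 3×3×3 kernel, the receptive field along the depth (time) axis is 3, so each convolution relates 3 consecutive frames. Feature 2: it uses homogeneous 3×3×3 kernels in every convolution layer, the kernel size that the C3D paper found to work best.
See the project implementation section below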
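
The temporal receptive field can be checked numerically. The sketch below is an illustration with arbitrary small sizes, not code from the c3d-pytorch project. It runs one `nn.Conv3d` layer with a 3×3×3 kernel and uses autograd to find which input frames influence a single output position.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One 3D convolution with a 3x3x3 kernel; padding=1 keeps T, H, W unchanged.
conv = nn.Conv3d(in_channels=3, out_channels=4, kernel_size=3, padding=1)

# Toy clip: (batch, channels, frames, height, width). Sizes are arbitrary.
clip = torch.randn(1, 3, 16, 8, 8, requires_grad=True)
out = conv(clip)

# Backpropagate from one output value at frame t=5 and see which input
# frames receive a non-zero gradient, i.e. which frames it depends on.
out[0, 0, 5, 4, 4].backward()
grad_per_frame = clip.grad.abs().sum(dim=(0, 1, 3, 4))
frames_seen = torch.nonzero(grad_per_frame).flatten().tolist()

print(tuple(out.shape))
print(frames_seen)
```

Only frames 4, 5, and 6 contribute to the output at frame 5, which is the "3 frames at a time" behavior described above. Stacking more such layers widens the temporal span.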

Network architecture

Layer	padding (conv) / stride (pool)	Input	Activation
conv1			ReLU
pool1
conv2			ReLU
pool2
conv3a			ReLU
conv3b			ReLU
pool3
conv4a			ReLU
conv4b			ReLU
pool4
conv5a			ReLU
conv5b			ReLU
pool5	, (padding)
view
fc6		ReLU+dropout
fc7		ReLU+dropout
fc8		softmax
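
The input shape of every layer can be recomputed with the standard output-size formula, floor((n + 2p - k) / s) + 1 per axis. The sketch below does this in plain Python. The kernel, stride, padding, and channel values are assumptions taken from the commonly used C3D configuration (3×3×3 convolutions with padding 1, 2×2×2 pooling except a 1×2×2 pool1, and spatial padding on pool5); they are not recovered from the table above.

```python
def out_size(n, kernel, stride, padding):
    """Output length along one axis of a conv/pool layer (floor mode)."""
    return (n + 2 * padding - kernel) // stride + 1


# Each entry: (name, out_channels or None for pooling, kernel, stride, padding),
# with kernel/stride/padding given as (T, H, W). Assumed values, see lead-in.
CONV = ((3, 3, 3), (1, 1, 1), (1, 1, 1))
POOL = ((2, 2, 2), (2, 2, 2), (0, 0, 0))
LAYERS = [
    ('conv1', 64, *CONV),
    ('pool1', None, (1, 2, 2), (1, 2, 2), (0, 0, 0)),
    ('conv2', 128, *CONV),
    ('pool2', None, *POOL),
    ('conv3a', 256, *CONV),
    ('conv3b', 256, *CONV),
    ('pool3', None, *POOL),
    ('conv4a', 512, *CONV),
    ('conv4b', 512, *CONV),
    ('pool4', None, *POOL),
    ('conv5a', 512, *CONV),
    ('conv5b', 512, *CONV),
    ('pool5', None, (2, 2, 2), (2, 2, 2), (0, 1, 1)),
]


def trace_shapes(channels=3, size=(16, 112, 112)):
    """Return (name, input_shape, output_shape) per layer, shapes as (C, T, H, W)."""
    rows = []
    for name, out_ch, kernel, stride, padding in LAYERS:
        in_shape = (channels, *size)
        size = tuple(out_size(n, k, s, p)
                     for n, k, s, p in zip(size, kernel, stride, padding))
        if out_ch is not None:
            channels = out_ch
        rows.append((name, in_shape, (channels, *size)))
    return rows


rows = trace_shapes()
for name, in_shape, out_shape in rows:
    print(f'{name:7s} in={in_shape} out={out_shape}')

flat = 1
for d in rows[-1][2]:
    flat *= d
print('flattened size after view:', flat)
```

Under these assumptions pool5 outputs (512, 1, 4, 4), so `view` flattens each clip to 8192 values before fc6.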

Project setup

Download an existing implementation: c3d-pytorch[0]
Download the pretrained weights (also available from the project)[0]
Modify the code in predict.py

Line-by-line code annotations

""" How to use C3D network. """
import numpy as np

import torch
from torch.autograd import Variable

from os.path import join
# glob is a Python standard-library module for finding directories and files that match a wildcard pattern
from glob import glob

# Used to load and process video frames, so the project expects frames already extracted from the video as input
import skimage.io as io
from skimage.transform import resize

from C3D_model import C3D


def get_sport_clip(clip_name, verbose=True):
    """
    Loads a clip to be fed to C3D for classification.
    TODO: should I remove mean here?
    
    Parameters
    ----------
    clip_name: str
        the name of the clip (subfolder in 'data'). Here it is 'roger'.
    verbose: bool
        if True, shows the unrolled clip (default is True).
    Returns
    -------
    Tensor
        (batch_size, channels, frames, height, width)
        a pytorch batch (n, ch, fr, h, w).
    """

    # glob returns the paths of all files matching the pattern, as a list
    # ./data/roger/*.png
    clip = sorted(glob(join('data', clip_name, '*.png')))
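    # Resize every frame to shape (112, 200, 3) and stack the frames into an np.array
    # of shape (frames, 112, 200, 3). The paper sets frames to 16, i.e. features are extracted in units of 16 frames.
    # So to extract sliding-window features, the input images have to be prepared in advance,
    # with each feature-extraction unit placed in its own folder (like 'roger' here).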
    clip = np.array([resize(io.imread(frame), output_shape=(112, 200), preserve_range=True) for frame in clip])
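    # Keep the central 112 columns along the width. This is a deterministic center crop;
    # the jittering in the paper refers to random 112x112 crops used during training.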
    clip = clip[:, :, 44:44+112, :]  # crop centrally

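    # This step displays all frames of the clip side by side as one image strip
    # (the reshape below assumes the clip has exactly 16 frames).
    # If there are many videos to process, or visualization is not needed, set verbose to False.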
    if verbose:
        clip_img = np.reshape(clip.transpose(1, 0, 2, 3), (112, 16 * 112, 3))
        io.imshow(clip_img.astype(np.uint8))
        io.show()

    # Move channels first, add the batch dimension, and convert to a tensor
    clip = clip.transpose(3, 0, 1, 2)  # ch, fr, h, w
    clip = np.expand_dims(clip, axis=0)  # batch axis
    clip = np.float32(clip)

    return torch.from_numpy(clip)


def read_labels_from_file(filepath):
    """
    Reads Sport1M labels from file
    
    Parameters
    ----------
    filepath: str
        the file.
        
    Returns
    -------
    list
        list of sport names.
    """
    with open(filepath, 'r') as f:
        labels = [line.strip() for line in f.readlines()]
    return labels


def main():
    """
    Main function.
    """

    # load a clip to be predicted
    X = get_sport_clip('roger')
    # X: batch_size, channels, frames, height, width
    # X: 1, 3, 16, 112, 112
    X = Variable(X)
    X = X.cuda()

    # get network pretrained model
    net = C3D()
    net.load_state_dict(torch.load('c3d.pickle'))
    net.cuda()
    net.eval()

    # perform prediction
    # prediction: 1, 487 (i.e. there are 487 classes)
    # For feature extraction, everything after this point can be dropped; only the fc6 output is needed
    prediction = net(X)
    prediction = prediction.data.cpu().numpy()

    # read labels
    labels = read_labels_from_file('labels.txt')

    # print top predictions
    top_inds = prediction[0].argsort()[::-1][:5]  # reverse sort and take five largest items
    print('\nTop 5:')
    for i in top_inds:
        print('{:.5f} {}'.format(prediction[0][i], labels[i]))


# entry point
if __name__ == '__main__':
    main()
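
As the comment before the prediction step notes, feature extraction only needs the fc6 output. One way to get it without editing the model's `forward` is a forward hook. The sketch below assumes the network exposes the layer as an attribute named `fc6`; check this against the actual `C3D` class in C3D_model.py. To stay self-contained it uses a tiny stand-in module (`TinyHead`, hypothetical) instead of the real network.

```python
import torch
import torch.nn as nn


class TinyHead(nn.Module):
    """Hypothetical stand-in for the fc6 -> fc7 -> fc8 part of C3D (toy sizes)."""

    def __init__(self, in_dim=32, hidden=16, n_classes=5):
        super().__init__()
        self.fc6 = nn.Linear(in_dim, hidden)
        self.fc7 = nn.Linear(hidden, hidden)
        self.fc8 = nn.Linear(hidden, n_classes)
        self.relu = nn.ReLU()

    def forward(self, x):
        h = self.relu(self.fc6(x))
        h = self.relu(self.fc7(h))
        return torch.softmax(self.fc8(h), dim=1)


def extract_fc6(net, x):
    """Run `net` on `x` and return the raw output of `net.fc6` (before ReLU)."""
    captured = {}

    def hook(module, inputs, output):
        captured['fc6'] = output.detach().cpu()

    handle = net.fc6.register_forward_hook(hook)
    try:
        with torch.no_grad():
            net(x)
    finally:
        handle.remove()  # always detach the hook, even if the forward pass fails
    return captured['fc6']


torch.manual_seed(0)
net = TinyHead().eval()
x = torch.randn(2, 32)
feat = extract_fc6(net, x)
print(tuple(feat.shape))
```

The hook captures the linear layer's output before the ReLU; apply `torch.relu` to the result if the activated feature is preferred. Calling `net.eval()` first matters for the real model because it disables dropout.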

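Project implementation

Prepare the input frames for each video from the video name, the window size, and the sliding step
Modify the network output so that only the fc6 output is kept as the video feature
Write the output as .npy files that store the features together with other attributes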
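
The last step can be sketched as follows. Everything here is an illustration under stated assumptions: the record layout and function names are hypothetical, the features are random placeholders rather than real fc6 outputs, and averaging the 16-frame clip features inside a window is one common choice that the steps above do not prescribe.

```python
import os
import tempfile

import numpy as np


def pool_window_feature(clip_features):
    """Average per-clip features of shape (n_clips, dim) into one (dim,) vector."""
    return np.asarray(clip_features, dtype=np.float32).mean(axis=0)


def save_window_features(path, video_name, window, spans, features):
    """Store the features plus their attributes as one dict inside a .npy file."""
    record = {
        'video': video_name,
        'window': window,
        'spans': np.asarray(spans, dtype=np.int64),          # (n_windows, 2), [start, end)
        'features': np.asarray(features, dtype=np.float32),  # (n_windows, dim)
    }
    np.save(path, record)  # a dict is pickled into a 0-d object array


# Placeholder data: three 64-frame windows, each holding four 16-frame clips.
rng = np.random.default_rng(0)
spans = [(0, 64), (13, 77), (26, 90)]
features = [pool_window_feature(rng.standard_normal((4, 4096))) for _ in spans]

out_path = os.path.join(tempfile.mkdtemp(), 'demo_video_w64.npy')
save_window_features(out_path, 'demo_video', 64, spans, features)

# Loading a pickled dict requires allow_pickle=True; .item() unwraps it.
loaded = np.load(out_path, allow_pickle=True).item()
print(loaded['video'], loaded['features'].shape)
```

Because the file holds a pickled dict, it should only be loaded from trusted sources. If that is a concern, `np.savez` with one array per key avoids pickling.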
