Spatio-Temporal Action Detection

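Spatio-temporal action detection: given an input video, the task is not only to identify the temporal interval in which each action occurs and its class, but also to mark the spatial location of the person performing it with a bounding box.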
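To make the expected output concrete, here is a minimal Python sketch that models one detection as an "action tube": a class label plus one box per frame over a temporal interval. The names `Box`, `ActionTube`, and their fields are hypothetical, chosen only for illustration; they are not taken from any of the methods listed in this article.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# (x1, y1, x2, y2) in pixels; a hypothetical convention for this sketch.
Box = Tuple[float, float, float, float]


@dataclass
class ActionTube:
    label: str    # action class
    score: float  # detection confidence
    boxes: Dict[int, Box] = field(default_factory=dict)  # frame index -> box

    @property
    def interval(self) -> Tuple[int, int]:
        """Temporal extent as (first_frame, last_frame)."""
        frames = sorted(self.boxes)
        return frames[0], frames[-1]


# One person running in frames 12 and 13.
tube = ActionTube(label="run", score=0.9)
tube.boxes[12] = (40.0, 30.0, 120.0, 200.0)
tube.boxes[13] = (44.0, 30.0, 124.0, 200.0)
print(tube.interval)  # (12, 13)
```

The temporal interval is derived from the frames that carry a box, so the "when" and the "where" of the action stay in a single structure.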

1. Algorithms

ACT ("Action Tubelet Detector for Spatio-Temporal Action Localization")
YOWO ("You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization")
MOC ("Actions as Moving Points")
ACAR-Net ("Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization")

2. Common Datasets

2.1 AVA

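The dataset's full name is Atomic Visual Actions. It was published at CVPR 2018 and targets spatio-temporal action detection: detecting the location of every person in a video together with their action classes.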

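Data format:

The videos are movies sourced from YouTube.
Only minutes 15 to 30 of each movie are annotated.
There are 430 videos in total: 235 for training, 64 for validation, and 131 for testing.
Annotations include each person's bounding box (bbox) and action classes; one person can have multiple actions at the same time.
Annotations also include an entity ID for each person: if a person appearing in adjacent keyframes is the same individual, they share the same entity ID. In other words, the entity ID is effectively an object-tracking label.
Only keyframes are annotated; one keyframe is taken per second.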
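The annotation properties above (one box per person per keyframe, several actions per person, and entity IDs that link a person across keyframes) can be sketched as follows. This is an illustration under stated assumptions: the column order `video_id, timestamp, x1, y1, x2, y2, action_id, person_id` and the normalized box coordinates reflect how the AVA CSV files are commonly described, the sample rows and action IDs are made up, and the helper names `load_annotations` and `person_track` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Assumed column order: video_id, timestamp (s), x1, y1, x2, y2, action_id, person_id.
# Box coordinates are assumed to be normalized to [0, 1]. All rows are made up.
SAMPLE = """\
vid_a,0902,0.10,0.20,0.50,0.90,12,0
vid_a,0902,0.10,0.20,0.50,0.90,79,0
vid_a,0902,0.60,0.25,0.95,0.88,11,1
vid_a,0903,0.12,0.20,0.52,0.90,12,0
"""


def load_annotations(text):
    """Group rows by (video, timestamp, person): one box, several action labels."""
    people = defaultdict(lambda: {"box": None, "actions": []})
    for vid, ts, x1, y1, x2, y2, action, person in csv.reader(io.StringIO(text)):
        key = (vid, int(ts), int(person))
        people[key]["box"] = (float(x1), float(y1), float(x2), float(y2))
        people[key]["actions"].append(int(action))
    return dict(people)


def person_track(annotations, video_id, person_id):
    """Keyframe timestamps at which a given entity ID appears."""
    return sorted(ts for (vid, ts, pid) in annotations
                  if vid == video_id and pid == person_id)


ann = load_annotations(SAMPLE)
print(ann[("vid_a", 902, 0)]["actions"])  # [12, 79]: two actions, same person, same keyframe
print(person_track(ann, "vid_a", 0))      # [902, 903]: same entity across keyframes
```

Grouping by `(video, timestamp, person)` is what makes the multi-label nature visible: the same box appears on several rows, once per action.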

There are 80 action classes (only 60 of them are used in evaluation), grouped into three categories: person movement, object manipulation, and person interaction. The classes are:

* person movement
	bend/bow (at the waist), crawl, crouch/kneel, dance, fall down, get up, jump/leap, lie/sleep, martial art, run/jog, sit, stand, swim, walk

* object manipulation
	answer phone, brush teeth, carry/hold (an object), catch (an object), chop, climb (e.g., a mountain), clink glass, close (e.g., a door, a box), cook, cut, dig, dress/put on clothing, drink, drive (e.g., a car, a truck), eat, enter, exit, extract, fishing, hit (an object), kick (an object), lift/pick up, listen (e.g., to music), open (e.g., a window, a car door), paint, play board game, play musical instrument, play with pets, point to (an object), press, pull (an object), push (an object), put down, read, ride (e.g., a bike, a car, a horse), row boat, sail boat, shoot, shovel, smoke, stir, take a photo, text on/look at a cellphone, throw, touch (an object), turn (e.g., a screwdriver), watch (e.g., TV), work on a computer, write

* person interaction
	fight/hit (a person), give/serve (an object) to (a person), grab (a person), hand clap, hand shake, hand wave, hug (a person), kick (a person), kiss (a person), lift (a person), listen to (a person), play with kids, push (another person), sing to (e.g., self, a person, a group), take (an object) from (a person), talk to (e.g., self, a person, a group), watch (a person)

2.2 J-HMDB-21

The dataset's full name is Joint-annotated Human Motion Data Base. It is a subset of the HMDB-51 dataset.

2.2.1 HMDB-51

The dataset was introduced in the paper "HMDB: A Large Video Database for Human Motion Recognition", published at ICCV 2011, and is used for action recognition.

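Data format:

The videos were collected from multiple sources: most come from movies, and a small portion from YouTube, Google Videos, and similar sites.
The dataset contains 6849 trimmed clips, divided into 51 action classes, with at least 101 clips per class.

The classes fall into the following five groups: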

* General facial actions
	smile, laugh, chew, talk

* Facial actions with object manipulation
	smoke, eat, drink

* General body movements
	cartwheel, clap hands, climb, climb stairs, dive, fall on the floor, backhand flip, handstand, jump, pull up, push up, run, sit down, sit up, somersault, stand up, turn, walk, wave.
	
* Body movements with object interaction
	brush hair, catch, draw sword, dribble, golf, hit something, kick ball, pick, pour, push something, ride bike, ride horse, shoot ball, shoot bow, shoot gun, swing baseball bat, sword exercise, throw.
	
* Body movements for human interaction
	fencing, hug, kick someone, kiss, punch, shake hands, sword fight.

2.2.2 J-HMDB-21

The dataset was introduced in the paper "Towards understanding action recognition", published at ICCV 2013.

Data format:

The dataset consists of 928 video clips selected from HMDB-51, covering 21 distinct classes.
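Every frame is annotated.
Each video contains at most one target action class, and bounding boxes are provided only for the people performing that action.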
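Because each clip carries a single action class and every frame is annotated, frame-level ground truth can be obtained by attaching the clip's label to each annotated box. The sketch below illustrates that idea only; the function name `frame_level_ground_truth`, the pixel box convention, and the toy clip are assumptions, not the dataset's actual file format.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels, an assumed convention


def frame_level_ground_truth(video_label: str,
                             boxes_per_frame: Dict[int, List[Box]]):
    """Attach the video's single action class to every annotated box."""
    return [
        (frame, box, video_label)
        for frame in sorted(boxes_per_frame)
        for box in boxes_per_frame[frame]
    ]


# Toy clip: two annotated frames, one actor, video-level class "golf".
clip = {1: [(50, 40, 110, 220)], 2: [(52, 40, 112, 220)]}
gt = frame_level_ground_truth("golf", clip)
print(len(gt), gt[0][2])
```

People in the frame who do not perform the target action have no box, so they never appear in the ground truth.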

The 21 classes in the dataset are:

sit, run, pullup, walk, shoot_gun, brush_hair, jump, pour, pick, kick_ball, golf, shoot_bow, catch, clap, swing_baseball, climb_stairs, throw, wave, shoot_ball, push, stand

2.3 UCF101-24

UCF101-24 is a subset of the UCF101 dataset.

2.3.1 UCF101

The dataset was introduced in the paper "UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild". It is an extension of UCF50 and is mainly used for video action recognition.

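Data format:

The dataset contains 13320 trimmed video clips.
All videos come from YouTube, with a frame rate of 25 fps (the resolution value is missing here).
The classes fall into 5 groups: Body motion, Human-human interactions, Human-object interactions, Playing musical instruments, and Sports.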

The dataset has 101 classes in total (by the way, the name UCF50 likewise means that dataset has 50 classes). The classes listed here are:

Apply Eye Makeup, Apply Lipstick, Archery, Baby Crawling, Balance Beam, Band Marching, Basketball Dunk, Blow Drying Hair, Blowing Candles, Body Weight Squats, Bowling, Boxing-Punching Bag, Boxing-Speed Bag, Brushing Teeth, Cliff Diving, Cricket Bowling, Cricket Shot, Cutting In Kitchen, Field Hockey Penalty, Floor Gymnastics, Frisbee Catch, Front Crawl, Hair cut, Hammering, Hammer Throw, Handstand Pushups, Handstand Walking, Head Massage, Ice Dancing, Knitting, Long Jump, Mopping Floor, Parallel Bars, Playing Cello, Playing Daf, Playing Dhol, Playing Flute, Playing Sitar, Rafting, Shaving Beard, Shot put, Sky Diving, Soccer Penalty, Still Rings, Sumo Wrestling, Surfing, Table Tennis Shot, Typing, Uneven Bars, Wall Pushups, Writing On Board

2.3.2 UCF101-24

The dataset annotations can be downloaded here.

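Data format:

Every frame is annotated.
Each video contains at most one action class, and bounding boxes are provided only for the people performing the target action.

The 24 classes in the dataset are: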

WalkingWithDog, Diving, PoleVault, SkateBoarding, CricketBowling, GolfSwing, Skijet, RopeClimbing, FloorGymnastics, Basketball, Biking, VolleyballSpiking, Fencing, CliffDiving, HorseRiding, SoccerJuggling, TennisSwing, LongJump, SalsaSpin, TrampolineJumping, IceDancing, Skiing, Surfing, BasketballDunk

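For an introduction to other video tasks, see the article "Introduction to Mainstream Video Action Algorithm Tasks" (主流的视频动作类算法任务介绍).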
