Deep Learning for Human Action Recognition: A Comprehensive Review

[3] Ahsan, Madhok, and Essa, “Video Jigsaw: Unsupervised Learning of Spatiotemporal Context for Video Action Recognition,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2019, 179–189

[4] Angelova, Toshev, and M. S. Ryoo, “Evolving Spacetime Neural Architectures for Videos,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, 1793–1802

[5] Alwassel, Mahajan, Korbar, Torresani, Ghanem, and Tran, “Self-supervised Learning by Cross-Modal Audio-Video Clustering,” Advances in Neural Information Processing Systems, 2020

[6] Antti et al., “Mean Teachers are Better Role Models: Weight Averaged Consistency Targets Improve Semi-supervised Deep Learning Results,” in NeurIPS, 2017, 1195–1204

[7] and Caruana, “Do Deep Nets Really Need to be Deep?” in NIPS, 2013

[8] Buchler, Brattoli, and Ommer, “Improving Spatio-temporal Self-supervision by Deep Reinforcement Learning,” in ECCV, 2018, 770–786

[9] Caba Heilbron, Escorcia, Ghanem, and Carlos Niebles, “Activitynet: A Large-scale Video Benchmark for Human Activity Understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, 961–970

[10] Caron, Bojanowski, Joulin, and Douze, “Deep Clustering for Unsupervised Learning of Visual Features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, 132–149

[11] Carreira, Noland, Banki-Horvath, Hillier, and Zisserman, “A Short Note about Kinetics-600,” arXiv preprint arXiv:1808.01340, 2018

[12] Carreira, Noland, Hillier, and Zisserman, “A Short Note on the Kinetics-700 Human Action Dataset,” arXiv preprint arXiv:1907.06987, 2019

[13] Chakraborty, M. B. Holte, T. B. Moeslund, and Gonzalez, “Selective Spatio-temporal Interest Points,” Computer Vision and Image Understanding, 116, 2012, 396–410

[14] Chen, Zhang, Yao, Guo, and Liu, “Deep Learning for Sensor-based Human Activity Recognition: Overview, Challenges, and Opportunities,” ACM Computing Surveys (CSUR), 2021

[15] Chi, Tian, and Tian, “Two-stream Video Classification with Cross-modality Attention,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019, 451120

[16] Christoph, Fan, Malik, and , “SlowFast Networks for Video Recognition,” in The IEEE 2019 International Conference on Computer Vision (ICCV), IEEE, 2019, 6201–6210

[17] Crasto, Weinzaepfel, Alahari, and Schmid, “Mars: Motion-Augmented RGB Stream for Action Recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, 7882–7891

[18] Dalal, Triggs, and Schmid, “Human Detection Using Oriented Histograms of Flow and Appearance,” in ECCV, Springer, 2006, 428–441

[19] Diba, Fayyaz, Sharma, A. H. Karami, M. M. Arzani, Yousefzadeh, and Van Gool, “Temporal 3D Convnets: New Architecture and Transfer Learning for Video Classification,” arXiv preprint arXiv:1711.08200, 2017

[20] Diba, Fayyaz, Sharma, Mahdi Arzani, Yousefzadeh, Gall, and Van Gool, “Spatio-temporal Channel Correlation Networks for Action Classification,” in ECCV, 2018, 284–299

[21] Diba, Fayyaz, Sharma, Paluri, Gall, Stiefelhagen, and Van Gool, “Holistic Large Scale Video Understanding,” arXiv preprint arXiv:1904.11451, 2019

[22] Diba, Sharma, and L. V. Gool, “Deep Temporal Linear Encoding Networks,” in The 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, 2329–2338

[23] Donahue, L. A. Hendricks, Rohrbach, Venugopalan, Guadarrama, Saenko, and Darrell, “Long-Term Recurrent Convolutional Networks for Visual Recognition and Description,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 677–691

[24] Duan, Zhao, Xiong, Liu, and Lin, “Omni-sourced Webly-supervised Learning for Video Recognition,” arXiv preprint arXiv:2003.13042, 2020

[25] Q. V. Duc, Phung, Nguyen, B. Y. Nguyen, and T. H. Nguyen, “Self-knowledge Distillation: An Efficient Approach for Falling Detection,” in International Conference on Artificial Intelligence and Big Data in Digital Era, Springer, 2022, 369–380

[26] Zhai, G. W. Taylor, and J. M. Susskind, “Skip-Clip: Self-Supervised Spatiotemporal Representation Learning by Future Clip Order Ranking,” arXiv preprint arXiv:1910.12770, 2019

[27] Feichtenhofer, Pinz, and Zisserman, “Convolutional Two-Stream Network Fusion for Video Action Recognition,” in The 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, 1933–41

[28] Fernando, Bilen, Gavves, and Gould, “Self-supervised Video Representation Learning with Odd-One-Out Networks,” in CVPR, 2017, 3636–3645

[29] Gan, Gong, Liu, and L. J. Guibas, “Geometry Guided Convolutional Neural Networks for Self-supervised Video Representation Learning,” in CVPR, 2018, 5589–5597

[30] Ghadiyaram, Tran, and Mahajan, “Large-scale Weakly Supervised Pre-training for Video Action Recognition,” in CVPR, 2019, 12046–

[31] Girdhar, Tran, Torresani, and Ramanan, “Distinit: Learning Video Representations without a Single Labeled Video,” in ICCV, 2019, 852–861

[32] Goyal, S. E. Kahou, Michalski, Materzynska, Westphal, Kim, Haenel, Fruend, Yianilos, Mueller-Freitag, et al., “The ‘Something Something’ Video Database for Learning and Evaluating Visual Common Sense,” in ICCV, 2017

[33] Han, Xie, and Zisserman, “Video Representation Learning by Dense Predictive Coding,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019, 1483–1492

[34] Hara, Kataoka, and Satoh, “Can Spatio-temporal 3D CNNs Retrace the History of 2D CNNs and Imagenet?” in CVPR, 2018, 654655

[35] Zhang, Ren, and Sun, “Deep Residual Learning for Image Recognition,” in CVPR, 2016, 770–778

[36] Hinton, Vinyals, and Dean, “Distilling the Knowledge in a Neural Network,” arXiv preprint arXiv:1503.02531, 2015

[37] A. G. Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, and Adam, “Mobilenets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” arXiv preprint arXiv:1704.04861, 2017

[38] Shen and Sun, “Squeeze-and-Excitation Networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, 7132–7141

[39] Huang and A. G. Bors, “Busy-Quiet Video Disentangling for Video Classification,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, 1341–1350

[40] Huang and A. G. Bors, “Region-based Non-local Operation for Video Classification,” arXiv preprint arXiv:2007.09033, 2020

[41] Ionescu, Papava, Olaru, and Sminchisescu, “Human3.6m: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 1325–1339

[42] Iosifidis, Tefas, and Pitas, “Semi-supervised Classification of Human Actions based on Neural Networks,” in 2014 22nd International Conference on Pattern Recognition, IEEE, 2014, 1336–1341

[43] Jegham, A. B. Khalifa, Alouani, and M. A. Mahjoub, “Vision-based Human Action Recognition: An Overview and Real World Challenges,” Forensic Science International: Digital Investigation, 2020, 200901

[44] Yang, and , “3D Convolutional Neural Networks for Human Action Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 221–231

[45] Jiang, Wang, Gan, and Yan, “STM: Spatiotemporal and Motion Encoding for Action Recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, 2000–2009

[46] Jing, Parag, Tian, and Wang, “Videossl: Semi-supervised Learning for Video Classification,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, 1110–1119

[47] Jing, Yang, Liu, and Tian, “Self-supervised Spatiotemporal Feature Learning via Video Rotation Prediction,” arXiv preprint arXiv:1811.11387, 2018

[48] Joao and Andrew, “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset,” in CVPR, 2017, 6299–6308

[49] Kalfaoglu, Kalkan, and A. A. Alatan, “Late Temporal Modeling in 3D CNN Architectures with Bert for Action Recognition,” arXiv preprint arXiv:2008.01232, 2020

[50] Karpathy, Toderici, Shetty, Leung, Sukthankar, and Fei-Fei, “Large-scale Video Classification with Convolutional Neural Networks,” in CVPR, 2014, 1725–1732

[51] Kay, Carreira, Simonyan, Zhang, Hillier, Vijaya-Narasimhan, Viola, Green, Back, Natsev, et al., “The Kinetics Human Action Video Dataset,” arXiv preprint arXiv:1705.06950, 2017

[52] Kazakos, Nagrani, Zisserman, and Damen, “Epicfusion: Audio-Visual Temporal Binding for Egocentric Action Recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, 5492–5501

[53] Kim, Cho, and I. S. Kweon, “Self-supervised Video Representation Learning with Space-Time Cubic Puzzles,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 8545–

[54] Klaser, Marszalek, and Schmid, “A Spatio-temporal Descriptor based on 3D-gradients,” in BMVC 2008-19th British Machine Vision Conference, British Machine Vision Association, 2008, 275–271

[55] Kliper-Gross, Hassner, and Wolf, “The Action Similarity Labeling Challenge,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 615–621

[56] Knights, Vanderkop, Ward, Mackenzie-Ross, and Moghadam, “Temporally Coherent Embeddings for Self-Supervised Video Representation Learning,” arXiv preprint arXiv:2004.02753, 2020

[57] Komkov, Dzabraev, and Petiushko, “Mutual Modality Learning for Video Action Classification,” arXiv preprint arXiv:2011.02543, 2020

[58] Krizhevsky, Sutskever, and G. E. Hinton, “Imagenet Classification with Deep Convolutional Neural Networks,” Communications of the ACM, 2017

[59] Kuehne, Jhuang, Garrote, Poggio, and Serre, “HMDB: A Large Video Database for Human Motion Recognition,” in Proceedings of the International Conference on Computer Vision (ICCV), 2011

[60] Kwon, Kim, Kwak, and Cho, “MotionSqueeze: Neural Motion Feature Learning for Video Understanding,” in European Conference on Computer Vision, Springer, 2020, 345–362

[61] Laptev, “On Space-time Interest Points,” International Journal of Computer Vision (2-3), 2005, 107–

[62] D.-H. Lee et al., “Pseudo-label: The Simple and Efficient Semi-supervised Learning Method for Deep Neural Networks,” in Workshop on Challenges in Representation Learning, ICML, 2013

[63] Zhong, Xie, and , “Collaborative Spatiotemporal Feature Learning for Video Action Recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, 7872–7881

[64] Zhang and Liu, “Action Recognition Based on a Bag of 3D Points,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, IEEE, 2010

[65] , and , “Weakly-Supervised Temporal Action Detection for Fine-Grained Videos with Hierarchical Atomic Actions,” in European Conference on Computer Vision, Springer, 2022

[66] Lin, Guo, and , “Self-supervised Video Representation Learning with Meta-contrastive Network,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, 8239–8249

[67] Luo, Liu, Zhou, Yang, and Wang, “Video Cloze Procedure for Self-supervised Spatio-Temporal Learning,” arXiv preprint arXiv:2001.00294, 2020

[68] D. C. Luvizon, Picard, and Tabia, “2D/3D Pose Estimation and Action Recognition using Multitask Deep Learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, 5137–5146

[69] Zhang, H.-T. Zheng, and Sun, “Shufflenet v2: Practical Guidelines for Efficient CNN Architecture Design,” in European Conference on Computer Vision, 2018, 116–131

[70] Mandal, Narayan, S. K. Dwivedi, Gupta, Ahmed, F. S. Khan, and Shao, “Out-of-Distribution Detection for Generalized Zero-shot Action Recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, 9985–9993

[71] Mao, Xue, and Zhang, “Hierarchical Video Frame Sequence Representation with Deep Convolutional Graph Network,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, 262–270

[72] Marszalek, Laptev, and Schmid, “Actions in Context,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, 2929–2936

[73] Martinez, Modolo, Xiong, and Tighe, “Action Recognition with Spatial-temporal Discriminative Filter Banks,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, 5482–5491

[74] Materzynska, Berger, Bax, and Memisevic, “The Jester Dataset: A Large-scale Video Dataset of Human Gestures,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019, 2874–2882

[75] Mazzia, Angarano, Salvetti, Angelini, and Chiaberge, “Action Transformer: A Self-attention Model for Short-time Pose-based Human Action Recognition,” Pattern Recognition, 124, 2022, 108487

[76] Misra, C. L. Zitnick, and Hebert, “Shuffle and Learn: Unsupervised Learning using Temporal Order Verification,” in ECCV, Springer, 2016, 527–544

[77] Monfort, Andonian, Zhou, Ramakrishnan, S. A. Bargal, Yan, Brown, Fan, Gutfreund, Vondrick, et al., “Moments in Time Dataset: One Million Videos for Event Understanding,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 502–508

[78] Morgado, Vasconcelos, and Misra, “Audio-visual Instance Discrimination with Cross-modal Agreement,” arXiv preprint arXiv:2004.12943, 2020

[79] J. Y.-H., Hausknecht, Vijayanarasimhan, Vinyals, Monga, and Toderici, “Beyond Short Snippets: Deep Networks for Video Classification,” in Computer Vision and Pattern Recognition, 2015

[80] Wang and Moulin, “RGBD-HuDaAct: A Color-depth Video Database for Human Daily Activity Recognition,” in 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), IEEE, 2011, 1147–1153

[81] S. N. Paul and Y. J. Singh, “Survey on Video Analysis of Human Walking Motion,” International Journal of Signal Processing, Image Processing and Pattern Recognition, 2014, –122

[82] Paul, Roy, and A. K. Roy-Chowdhury, “W-talc: Weakly supervised Temporal Activity Localization and Classification,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, 563–579

[83] L. L. Presti and La Cascia, “3D Skeleton-based Human Action Classification: A Survey,” Pattern Recognition, 2016, 130–147

[84] Qian, Meng, Gong, M.-H. Yang, Wang, Belongie, and Cui, “Spatiotemporal Contrastive Video Representation Learning,” arXiv preprint arXiv:2008.03800, 2020

[85] Qiu, Yao, and Mei, “Learning Spatio-temporal Representation with Pseudo-3D Residual Networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, 5533–5541

[86] Qiu, Yao, C.-W. Ngo, Tian, and Mei, “Learning Spatio-temporal Representation with Pseudo-3d Residual Networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, 12056–

[87] Recasens, Luc, J.-B. Alayrac, Wang, Strub, Tallec, Malinowski, Päträucean, Altché, Valko, et al., “Broaden your Views for Self-supervised Video Learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, 1255–1265

[88] K. K. Reddy and Shah, “Recognizing 50 Human Action Categories of Web Videos,” Machine Vision and Applications, 2013, 971–981

[89] Romero, Ballas, S. E. Kahou, Chassang, Gatta, and Bengio, “FitNets: Hints for Thin Deep Nets,” in ICLR, 2015

[90] Sayed, Brattoli, and Ommer, “Cross and Learn: Cross-Modal Self-supervision,” in Pattern Recognition, Cham, Springer International Publishing, 2019, 228–

[91] Scovanner, Ali, and Shah, “A 3-dimensional Sift Descriptor and Its Application to Action Recognition,” in Proceedings of the 15th ACM International Conference on Multimedia, 2007, 357–360

[92] Sharir, Noy, and Zelnik-Manor, “An Image is Worth 16 × 16 Words, What is a Video Worth?” arXiv preprint arXiv:2103.13915, 2021

[93] G. A. Sigurdsson, Varol, Wang, Farhadi, Laptev, and Gupta, “Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding,” in European Conference on Computer Vision, Springer, 2016, 510–526

[94] Simonyan and Zisserman, “Two-stream Convolutional Networks for Action Recognition in Videos,” in NIPS, 2014, 568–576

[95] Singh, Chakraborty, Varshney, Panda, Feris, Saenko, and Das, “Semi-Supervised Action Recognition with Temporal Contrastive Learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, 10389–

[96] Smaira, Carreira, Noland, Clancy, and Zisserman, “A Short Note on the Kinetics-700-2020 Human Action Dataset,” arXiv preprint arXiv:2010.10864, 2020

[97] Sohn, Berthelot, Carlini, Zhang, C. A. Raffel, E. D. Cubuk, Kurakin, and C.-L., “FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence,” Advances in Neural Information Processing Systems, 2020

[98] Soomro, A. R. Zamir, and Shah, “UCF101: A Dataset of 101 Human Actions Classes from Videos in the Wild,” arXiv preprint arXiv:1212.0402, 2012

[99] Sun, Rahmani, Bennamoun, Wang, and Liu, “Human Action Recognition from Various Data Modalities: A Review,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022

[100] Tao, Wang, and Yamasaki, “Self-supervised Video Representation Learning using Inter-Intra Contrastive Framework,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, 2193201

[101] Tao, Wang, and Yamasaki, “Self-Supervised Video Representation Using Pretext-Contrastive Learning,” arXiv preprint arXiv:2010.15464, 2020

[102] Tian, Krishnan, and Isola, “Contrastive Multiview Coding,” arXiv preprint arXiv:1906.05849, 2019

[103] Tong, Song, Wang, and Wang, “Videomae: Masked Autoencoders are Data-efficient Learners for Self-supervised Video Pretraining,” arXiv preprint arXiv:2203.12602, 2022

[104] Tran, Bourdev, Fergus, Torresani, and Paluri, “Learning Spatiotemporal Features with 3D Convolutional Networks,” in The 2015 IEEE International Conference on Computer Vision (ICCV), 2015, 4489–4497

[105] Tran, Wang, Torresani, and Feiszli, “Video Classification with Channel-separated Convolutional Networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, 5552–5561

[106] Tran, Wang, Torresani, Ray, LeCun, and Paluri, “A Closer Look at Spatiotemporal Convolutions for Action Recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, 6450–6459

[107] Varol, Laptev, and Schmid, “Long-Term Temporal Convolutions for Action Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 1510–1517

[108] D.-Q., and J.-C. Wang, “(2+1)D Distilled ShuffleNet: A Lightweight Unsupervised Distillation Network for Human Action Recognition,” in Proceedings of the IEEE International Conference on Pattern Recognition (ICPR), 2022

[109] D.-Q., and J.-C. Wang, “Teaching Yourself: A Self Knowledge Distillation Approach to Action Recognition,” IEEE Access, 2021, 105711–

[110] D.-Q., J.-C. Wang, et al., “A Novel Self-knowledge Distillation Approach with Siamese Representation Learning for Action Recognition,” in 2021 International Conference on Visual Communications and Image Processing (VCIP), IEEE, 2021

[111] Wang and Schmid, “Action Recognition with Improved Trajectories,” in ICCV, 2013, 3551–3558

[112] Wang, Liu, and Yuan, “Mining Actionlet Ensemble for Action Recognition with Depth Cameras,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, 1290–1297

[113] Wang, Jiao, Bao, Liu, and Liu, “Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, 4006–4015

[114] Wang, Tran, and Feiszli, “What Makes Training MultiModal Classification Networks Hard?” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, 12695–705

[115] Wei, J. J. Lim, Zisserman, and W. T. Freeman, “Learning and Using the Arrow of Time,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, 8052–8060

[116] Willems, Tuytelaars, and Van Gool, “An Efficient Dense and Scale-invariant Spatio-temporal Interest Point Detector,” in ECCV, Springer, 2008, 650–663

[117] Xian, Schiele, and Akata, “Zero-shot Learning-the Good, the Bad and the Ugly,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, 4582–4591

[118] Xiao, Zhao, Shao, Xie, and Zhuang, “Self-supervised Spatiotemporal Learning via Video Clip Order Prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, 10334–

[119] Yao, Liu, Luo, Zhou, and , “Video Playback Rate Perception for Self-Supervised Spatio-Temporal Representation Learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, 6548–6557

[120] Yin, Vahdat, J. M. Alvarez, Mallya, Kautz, and Molchanov, “A-ViT: Adaptive Tokens for Efficient Vision Transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, 10809–

[121] Zhai, Oliver, Kolesnikov, and Beyer, “S4l: Self-supervised Semi-supervised Learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, 1476–1485

[122] Zhang, Zou, Chen, and Gan, “PAN: Towards Fast Action Recognition via Learning Persistence of Appearance,” arXiv preprint arXiv:2008.03462, 2020

[123] Zhang, Weng, Chen, Y.-G. Jiang, and L. S. Davis, “Videolt: Large-scale Long-tailed Video Recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, 7960–7969

[124] Zhang, Liu, Shuai, Zhu, Brattoli, Chen, Marsic, and Tighe, “Vidtr: Video Transformer without Convolutions,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, 13577–

[125] Zhao, Yan, Torresani, and Torralba, “HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization,” arXiv preprint arXiv:1712.09374, 2019

[126] Zhu, Sevilla-Lara, Tran, Feiszli, Yang, and Wang, “Faster Recurrent Networks for Video Classification,” arXiv preprint arXiv:1906.04226, 2019

2023

D.Q. Vu, T.P.T. Thu, N. Le and J.C. Wang

Figure 1

An overview diagram of the action recognition system. The system has two phases: training and testing. The training phase learns a recognition model that can distinguish the human actions defined in the training dataset. The testing phase uses the trained model to recognize the action in a given video.

Figure 2

The two main challenges in the action recognition problem. (a) The lack of temporal information when a clip of only a few frames represents the entire video. (b) A comparison of the number of parameters and the computational cost of the 2D ResNet-50 network, usually applied to images, and the 3D ResNet-50 network, usually adopted for videos.
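
The gap described in panel (b) can be traced back to a single convolution layer: a k×k kernel becomes k×k×k, and the layer is evaluated at every output frame as well as at every spatial position. The sketch below counts weights and multiply-accumulates for one such layer. The layer width (64 channels), the 56×56 feature map, and the 8-frame clip are illustrative assumptions, not figures taken from the paper, and the helper names are hypothetical.

```python
def conv2d_params(c_in: int, c_out: int, k: int) -> int:
    """Number of weights (no bias) in a 2D convolution with a k x k kernel."""
    return c_out * c_in * k * k


def conv3d_params(c_in: int, c_out: int, k: int) -> int:
    """Number of weights (no bias) in a 3D convolution with a k x k x k kernel."""
    return c_out * c_in * k * k * k


def conv_macs(num_weights: int, output_positions: int) -> int:
    """Multiply-accumulates: every weight is applied once per output position."""
    return num_weights * output_positions


if __name__ == "__main__":
    c_in, c_out, k = 64, 64, 3        # assumed layer width and kernel size
    height, width, frames = 56, 56, 8  # assumed output feature-map size and clip length

    p2d = conv2d_params(c_in, c_out, k)
    p3d = conv3d_params(c_in, c_out, k)
    macs2d = conv_macs(p2d, height * width)
    macs3d = conv_macs(p3d, frames * height * width)

    print("weights 2D / 3D:", p2d, p3d)
    print("MACs    2D / 3D:", macs2d, macs3d)
```

With these assumed sizes the 3D layer holds 3 times the weights of the 2D layer and performs 24 times the multiply-accumulates: a factor of 3 from the extra kernel dimension and a factor of 8 from the frames.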

Figure 3

Comparison of three deep network architectures for action recognition: (a) recurrent neural networks (e.g., LSTM); (b) convolutional networks (e.g., 3D CNN); (c) two-stream convolutional networks (e.g., RGB and optical flow).

Figure 4

A block diagram shows the standard process for training a deep classification model.

Overview of the supervised learning methods for action recognition. The black line denotes the forward path and the blue line the backward path, i.e., the backpropagation step.
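
The forward and backward paths in Figure 4 can be illustrated with a deliberately tiny stand-in for a deep network: a logistic-regression classifier trained by gradient descent. The two-dimensional toy features, the learning rate, the epoch count, and all names below are assumptions made for this sketch; a real action recognition model would replace `forward` with a video network and rely on automatic differentiation for the backward step.

```python
import math
import random

# Hypothetical toy data: (feature vector, binary label). Not from the paper.
TOY_DATA = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.9, 0.8], 1), ([0.8, 0.9], 1)]


def forward(w, b, x):
    """Forward path: linear score followed by a sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train(data, epochs=500, lr=0.5, seed=0):
    """Full-batch gradient descent on mean binary cross-entropy.

    Returns the learned weights, the bias, and the loss after each epoch.
    """
    rng = random.Random(seed)
    dim = len(data[0][0])
    w = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    b = 0.0
    history = []
    n = len(data)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        loss = 0.0
        for x, y in data:
            p = forward(w, b, x)                  # forward path
            p = min(max(p, 1e-12), 1.0 - 1e-12)   # keep log() finite
            loss += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            err = p - y                           # backward path: dLoss/dz
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        # Parameter update using the averaged gradients.
        w = [wi - lr * gi / n for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / n
        history.append(loss / n)
    return w, b, history


if __name__ == "__main__":
    w, b, history = train(TOY_DATA)
    print("first / last loss:", history[0], history[-1])
```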

Figure 5

A diagram shows a machine learning model training process using a Labeled Dataset and an Unlabeled Dataset.

Overview of the semi-supervised learning methods for action recognition. The black line denotes the forward path and the blue line the backward path, i.e., the backpropagation step.
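
One common way to use the unlabeled dataset in Figure 5 is pseudo-labeling, as in [62] and, with a confidence threshold, [97]: the current model predicts on unlabeled clips, and only confident predictions are kept as training targets. The sketch below shows just that selection step. The threshold value and the function name are assumptions for illustration, and the surveyed methods differ in how they produce and weight such targets.

```python
def select_pseudo_labels(probabilities, threshold=0.95):
    """Pick confident predictions on unlabeled samples.

    probabilities: one list of class probabilities per unlabeled sample.
    Returns (sample_index, predicted_class) pairs for samples whose highest
    class probability reaches the threshold; other samples are skipped.
    """
    selected = []
    for idx, probs in enumerate(probabilities):
        best_class = max(range(len(probs)), key=probs.__getitem__)
        if probs[best_class] >= threshold:
            selected.append((idx, best_class))
    return selected


if __name__ == "__main__":
    # Hypothetical model outputs for three unlabeled clips and three classes.
    outputs = [
        [0.97, 0.02, 0.01],
        [0.40, 0.35, 0.25],
        [0.03, 0.96, 0.01],
    ]
    print(select_pseudo_labels(outputs))
```

The second clip is left out because no class reaches the threshold; it can be revisited in a later round once the model is more confident.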

Figure 6

A diagram illustrates Pretext Task Training and Downstream Task Training.

Overview of self-supervised learning-based methods for action recognition. Self-supervised learning involves two tasks: the pretext task and the downstream task. For the pretext task, the network is trained with pseudo-labels generated without human labor. For the downstream task, the network is transferred to the target task and trained with labeled data.
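
A pretext task needs labels that come for free from the video itself. As one example, temporal order verification (the idea behind [76]) asks whether a sampled frame sequence is in its original order. The sketch below generates such pseudo-labels from frame indices alone. The function name, the 50/50 split between positive and negative samples, and the use of plain indices in place of decoded frames are assumptions for illustration.

```python
import random


def make_order_sample(frame_indices, rng):
    """Build one pretext-task sample for temporal order verification.

    Returns (frames, label): label 1 if the original order is kept,
    label 0 if the frames have been shuffled out of order.
    """
    frames = list(frame_indices)
    if len(set(frames)) < 2:
        raise ValueError("need at least two distinct frames to shuffle")
    if rng.random() < 0.5:
        return frames, 1
    shuffled = frames[:]
    while shuffled == frames:   # make sure the negative really is out of order
        rng.shuffle(shuffled)
    return shuffled, 0


if __name__ == "__main__":
    rng = random.Random(0)
    clip = [10, 20, 30, 40]     # hypothetical frame indices sampled from a video
    for _ in range(4):
        print(make_order_sample(clip, rng))
```

No human annotation is involved: the label is determined entirely by whether the sample was shuffled.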

Table 1

A summary of supervised learning methods for action recognition. The “Performance” column reports the top-1 accuracy of the best model for each method. The “Model size” column gives the number of parameters and FLOPs of each model. Where the authors did not report the model size in their paper, we write —. Moments denotes the Moments in Time dataset and SS the Something Something dataset.

Method	Description	Network	Model size	Performance	Code
BQN [39]	- Focusing on busy motion in the input videos. - Separating busy features from quiet features. - Proposing two networks for the two feature types.	BQN	92M, 241 GFLOPs	77.3 (Kinetics400), 97.6 (UCF101), 77.6 (HMDB51)	Link
STAM [92]	- Proposing two types of transformer: a temporal transformer and a spatial transformer.	Transformer	96M, 270 GFLOPs	79.3 (Kinetics400), 97.0 (UCF101), 39.7 (Charades)	Link
En-VidTr [124]	- Proposing two types of transformer: a temporal transformer and a spatial transformer.	VidTr-M	98.1M, 220 GFLOPs	79.7 (Kinetics400), 96.7 (UCF101), 74.4 (HMDB51)	None
Omni-sourced [24]	- Leveraging crawled data. - Adopting pre-trained models as teachers. - Training students with the teacher’s labels.	irCSN-152	—	83.6 (Kinetics400), 96.0 (UCF101), 71.1 (HMDB51)	Link
G-Blend [114]	- Identifying causes of the performance drop in multi-modal networks. - Proposing a technique to avoid overfitting in these networks.	ipCSN-152	32.8M, 110.1 GFLOPs	83.3 (Kinetics400)	Link
irCSN-152 [105]	- Designing an architecture named Channel-Separated Convolutional Network. - Using group convolution to save computation.	irCSN-152	29.6M, 96.7 GFLOPs	82.6 (Kinetics400)	Link
ipCSN-152 [105]	- Designing an architecture named Channel-Separated Convolutional Network. - Using group convolution to save computation.	ipCSN-152	32.8M, 108.8 GFLOPs	79.2 (Kinetics400)	Link
GB+DF+LB [73]	- Focusing on improving the last layers. - Proposing three classification branches instead of global average pooling alone.	ResNet-152	—	53.4 (SS V1), 78.8 (Kinetics400)	None
HATNet [21]	- Fusing 2D and 3D architectures into one. - Training on the HVU dataset.	ResNet-50	—	77.6 (Kinetics400), 97.8 (UCF101), 76.5 (HMDB51)	None
CoST [63]	- Proposing a novel operation to learn features using 2D Conv with a weight-sharing constraint.	ResNet-101	—	31.5 (Moments), 77.5 (Kinetics400)	None
RNL-TSM [40]	- Presenting region-based non-local operations as a form of self-attention.	ResNet-50	35.95M, 41.16 GFLOPs	49.47 (SS V1), 77.2 (Kinetics400)	Link
MSNet [60]	- Learning correspondences across frames and converting them into motion features.	ResNet-50	49.2M, 67.6 GFLOPs	55.1 (SS V1), 67.1 (SS V2), 76.4 (Kinetics400), 77.4 (HMDB51)	Link
CMA [15]	- Proposing a cross-modality attention operation.	ResNet-152	—	75.98 (Kinetics400), 96.5 (UCF101)	None
FASTER32 [126]	- Leveraging the video’s redundancy to reduce FLOPs. - Combining an expensive model that captures actions and a lightweight model that captures scene changes.	ResNet-50	67.7 GFLOPs	75.3 (Kinetics400), 96.9 (UCF101), 75.7 (HMDB51)	None
MARS [17]	- Knowledge distillation from the flow network to the RGB network.	ResNeXt-101	—	74.9 (Kinetics400), 53 (SS V1), 98.1 (UCF101), 80.9 (HMDB51)	None
STM [45]	- Encoding features in a 2D framework. - The Channel-wise Spatio Temporal Module presents the spatiotemporal features. - The Channel-wise Motion Module efficiently encodes motion features.	ResNet-50	23.88M, 32.93 GFLOPs	73.7 (Kinetics400), 50.5 (SS V1), 64.2 (SS V2), 96.7 (Jester), 96.2 (UCF101), 72.2 (HMDB51)	None
SlowFastNet [16]	- Two streams: one at a low frame rate and the other at a high frame rate.	ResNet-101	234 GFLOPs	79.8 (Kinetics400), 81.8 (Kinetics600)	Link
EvaNet [4]	- Finding video CNN architectures with an evolutionary algorithm.	Inception Net	—	77.4 (Kinetics400), 82.3 (HMDB51), 31.8 (Moments)	None
R(2+1)D [106]	- Explicitly factorizing 3D Conv into two operations, a 2D Conv and a 1D Conv.	ResNet-34	—	75.4 (Kinetics400), 73.3 (Sports1M), 97.3 (UCF101), 78.7 (HMDB51)	Link
P3D [85]	- (2+1)D Conv uses ReLU between the 2D and 1D Conv in each block. - Using separate spatial and temporal components makes the optimization easier.	ResNet-152	—	77.4 (Kinetics400), 93.7 (UCF101), 66.4 (Sports1M), 75.12 (ActivityNet), 80.8 (ASLAN)	Link
I3D [48]	- Repeating 2D filters in the pre-trained Inception Net.	Inception-V1	25M	74.2 (Kinetics400), 93.4 (UCF101), 66.4 (HMDB51)	Link
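
The R(2+1)D row describes factorizing a 3D convolution into a 2D spatial convolution followed by a 1D temporal one. A rough way to see the effect on model size is to count weights, as in the sketch below. The channel counts are illustrative, the function names are hypothetical, and the rule for choosing the intermediate width (picked so that the factorized block has about as many weights as the full 3D kernel) is an assumption of this sketch rather than something stated in the table.

```python
def conv3d_weights(c_in: int, c_out: int, t: int, d: int) -> int:
    """Weights (no bias) of a full 3D convolution with a t x d x d kernel."""
    return c_in * c_out * t * d * d


def factorized_weights(c_in: int, c_out: int, t: int, d: int, c_mid: int) -> int:
    """Weights of a 1 x d x d spatial conv (c_in -> c_mid) followed by a
    t x 1 x 1 temporal conv (c_mid -> c_out)."""
    spatial = c_in * c_mid * d * d
    temporal = c_mid * c_out * t
    return spatial + temporal


def matching_mid_channels(c_in: int, c_out: int, t: int, d: int) -> int:
    """Largest intermediate width whose factorized block does not exceed
    the weight count of the full 3D convolution."""
    return (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)


if __name__ == "__main__":
    c_in, c_out, t, d = 64, 64, 3, 3   # assumed layer sizes
    c_mid = matching_mid_channels(c_in, c_out, t, d)
    print("intermediate channels:", c_mid)
    print("full 3D weights:      ", conv3d_weights(c_in, c_out, t, d))
    print("factorized weights:   ", factorized_weights(c_in, c_out, t, d, c_mid))
```

Under this assumption the factorized block keeps roughly the same capacity while inserting an extra nonlinearity between the spatial and temporal steps, which is consistent with the ReLU mentioned in the description column.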

Table 2

A summary of semi-supervised learning methods for Action Recognition. The column “Performance” presents the top-1 accuracy of the best model in each method. The percentage (%) after each dataset denotes the share of labeled data used for training. * denotes methods that were re-implemented for the video domain by [46].

Method	Description	Network	Performance	Code
VideoSSL [46]	Utilizing a pre-trained network on ImageNet to guide the training of the 3D CNN.	3D ResNet-18	47.6 (Kinetics100 - 5%) 32.4 (UCF101 - 5%) 32.7 (HMDB51 - 40%)	None
TCL [95]	Proposing two types of loss including Maximize Instance Agreement and Maximize Group Agreement.	TSM ResNet-18	29.81 (SS-V2 - 5%) 30.28 (Kinetics400 - 5%) 93.29 (Jester - 5%)	Link
FixMatch* [97]	The pseudo-labels from weakly-augmented data are used to guide the training on a strongly-augmented version of the same data.	3D ResNet-18	40.5 (Kinetics100 - 5%) 27.1 (UCF101 - 5%) 32.9 (HMDB51 - 40%)	None
S4L* [121]	The combination of the self-supervised and semi-supervised learning method.	3D ResNet-18	33.0 (Kinetics100 - 5%) 22.7 (UCF101 - 5%) 29.8 (HMDB51 - 40%)	None
MT* [6]	Averaging model weights over training steps, which yields a more robust model than using the final weights directly.	3D ResNet-18	27.8 (Kinetics100 - 5%) 17.5 (UCF101 - 5%) 27.2 (HMDB51 - 40%)	None
PL* [62]	The model's own prediction on an unlabeled sample is reused as the training label for that sample.	3D ResNet-18	27.8 (Kinetics100 - 5%) 17.6 (UCF101 - 5%) 27.3 (HMDB51 - 40%)	None
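Two of the mechanisms summarized in this table are simple enough to sketch directly: the weight averaging behind MT [6] and the reuse of a model's own predictions as labels behind PL [62]. The code below is a minimal illustration under stated assumptions, not the implementation evaluated in [46]. Weights are plain Python lists, the function names are hypothetical, and the confidence threshold is an assumption borrowed from threshold-based pseudo-labeling variants.

```python
def ema_update(teacher, student, alpha=0.99):
    """Move each teacher weight toward the student weight.

    teacher <- alpha * teacher + (1 - alpha) * student
    Repeating this every step makes the teacher an exponential moving
    average of the student weights over training.
    """
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def pseudo_label(probs, threshold=0.95):
    """Return the predicted class index if the model is confident enough.

    Assumption: predictions below `threshold` are discarded (returns None),
    so the unlabeled sample contributes no loss at this step.
    """
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None


if __name__ == "__main__":
    teacher = ema_update([1.0, 2.0], [0.0, 0.0], alpha=0.9)
    print("teacher weights:", teacher)
    print("confident sample:", pseudo_label([0.1, 0.85, 0.05], threshold=0.8))
    print("uncertain sample:", pseudo_label([0.4, 0.35, 0.25], threshold=0.8))
```

A larger `alpha` makes the averaged weights change more slowly, while a higher `threshold` trades the number of pseudo-labeled samples for label quality.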

Table 3

A summary of Self-Supervised Learning methods with Action Recognition as the downstream task. We record the results on two standard datasets, UCF101 and HMDB51. All results are top-1 accuracy obtained with the backbone architecture listed in the column “Network.”

Method	Description	Network	Pre-training Dataset	Performance	Code
VideoMAE [103]	Proposing data-efficient learning via video reconstruction using autoencoders as the pretext task.	ViT-L	Kinetics700	96.1 (UCF101) 61.1 (HMDB51)	Link
BraVe [87]	Training the network to generalize from a narrow view to the general content of the input clip.	TSM-50x2	Kinetics600	93.1 (UCF101) 77.8 (HMDB51)	Link
MCN [66]	Proposing a multi-task process that combines contrastive learning and meta-learning.	3D ResNet-18	UCF101	84.8 (UCF101) 54.8 (HMDB51)	Link
CVRL [84]	Contrastive learning based on the SimCLR method.	ResNet-50	Kinetics400	92.1 (UCF101) 65.4 (HMDB51)	None
AVID+CMA [78]	Contrastive learning for cross-modal discrimination of video from audio and vice versa.	R2+1D-18	Audioset	91.5 (UCF101) 64.7 (HMDB51)	Link
XDC [5]	- Based on Deep Clustering. - Leverages unsupervised clustering in audio as a supervisory signal for video and vice versa. - The first self-supervised method to outperform large-scale fully-supervised pretraining.	R2+1D-18	Kinetics Audioset IG-65M	91.5 (UCF101) 63.1 (HMDB51)	None
PCL [101]	- Combine pretext tasks with contrastive learning, referred to as Pretext-Contrastive Learning.	ResNet-18	UCF101	82.3 (UCF101) 43.2 (HMDB51)	None
PRP [119]	- Capture temporal resolution characteristics within the video domain in a self-supervised manner. - Introduce a motion attention mechanism to focus on meaningful foreground regions.	R2+1D-18	UCF101	72.1 (UCF101) 35.0 (HMDB51)	Link
DPC [33]	- Learning spatiotemporal features by recurrently predicting future representations. - Predicting further into the future with progressively less temporal context.	ResNet-34	Kinetics400	75.7 (UCF101) 35.7 (HMDB51)	Link
IIC [100]	- Uses positive-negative pairs to train with contrastive learning. - Different modalities of the same video are treated as positives, while clips with broken temporal relations and clips from other videos are treated as negatives.	ResNet-18	UCF101	74.4 (UCF101) 38.8 (HMDB51)	Link
TCE [56]	- Encoding videos such that adjacent frames exist close to each other and videos are separated from one another.	ResNet-50	Kinetics400	71.2 (UCF101) 36.6 (HMDB51)	Link
VCP [67]	- Randomly choose one of 4 transformations or keep the original clip. - Predict which transformation was applied to the input clip.	ResNet-18	UCF101	66.0 (UCF101) 31.5 (HMDB51)	None
3D Cubic Puzzles [53]	- The time direction is ambiguous in shuffled frames, where a “catch” is hard to distinguish from a “throw” action. - Introducing a pretext task based on solving Space-Time Cubic Puzzles.	ResNet-18	Kinetics400	65.8 (UCF101) 33.7 (HMDB51)	None
Video Clip Ordering [118]	Learning the spatiotemporal representation of the video by predicting the order of shuffled clips from the video.	ResNet-18	UCF101	64.9 (UCF101) 29.5 (HMDB51)	None
Skip-Clip [26]	- Training a deep model for future clip order ranking based on a context clip.	ResNet-18	UCF101	64.4 (UCF101)	None
3D RotNet [47]	- A set of rotations is applied to all videos as a pretext task and a model is trained to predict these rotations.	ResNet-18	Kinetics400	62.9 (UCF101) 33.7 (HMDB51)	None
CMC [102]	- Presenting a set of sensory views of a video clip. - Based on contrastive learning, a model is built to maximize the mutual information between different views of the same scene.	CaffeNet	UCF101	59.1 (UCF101) 26.7 (HMDB51)	Link
M&A [113]	- Based on regressing both motion and appearance statistics along spatial and temporal dimensions. - Predicting several numerical labels generated through the characteristics of video such as the region with the largest motion and its direction, etc.	C3D	UCF101	58.8 (UCF101) 20.3 (HMDB51)	Link
Arrow of Time [115]	- Learning to see the arrow of time — to tell whether a video sequence is playing forward or backward. - Focusing on the motion cues in videos and using the arrow of time to pretrain action recognition models.	AlexNet	UCF101	55.3 (UCF101)	None
Cross & Learn [90]	- Information shared across modalities has a much higher semantic meaning compared to modality-specific information. - Present a self-supervised method for representation learning utilizing two different modalities (RGB and flow).	CaffeNet	UCF101	58.7 (UCF101) 27.2 (HMDB51)	Link
Geometry [29]	- Extracting pixel-wise geometry information as flow fields and disparity maps from synthetic imagery and real 3D movies. - Introducing a new type of auxiliary supervision based on exploring geometry.	CaffeNet	UCF101	55.1 (UCF101) 23.3 (HMDB51)	None
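Several rows above (for example CVRL [84], IIC [100], and CMC [102]) rely on contrastive learning, where an anchor clip is pulled toward a positive view and pushed away from negatives. The sketch below shows an InfoNCE-style loss for a single anchor as one representative form. It is an assumption that this form is illustrative of the family; it is not the exact loss of any listed method, and the function names and the temperature value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor embedding.

    loss = -log( exp(s_pos / T) / sum_k exp(s_k / T) ),
    where the sum runs over the positive and all negatives.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    print("aligned positive:", info_nce(anchor, [1.0, 0.0], [[0.0, 1.0]]))
    print("uninformative positive:", info_nce(anchor, [0.0, 1.0], [[0.0, 1.0]]))
```

The loss approaches zero when the positive is much closer to the anchor than every negative, and it equals the log of the number of candidates when the positive is indistinguishable from the negatives.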

Table 4

A summary of common small-scale datasets from 2011 to now used for action recognition.

Dataset	Description	#classes	Samples	Download
HMDB51 [59]	- At least 1s / video. - Single activity / video.	51	6,849	Link
UCF50 [88]	- Realistic videos from YouTube. - Single activity / video.	50	6,676	Link
UCF101 [98]	- At least 1.06s/video. - Single activity / video.	101	13,320	Link
ActivityNet [9]	- Large-scale video. - 1.41 activity instance / video.	203	27,811	Link
Hollywood2 [72]	- 19.7s/video on average action videos and scene videos.	22	3,669	Link
MSR-Action3D [64]	An action dataset of depth sequences captured by a depth camera.	20	—	Link
MSR-Daily Activity 3D [112]	- A daily activity dataset captured by a Kinect device camera. - An activity is performed in either “sitting on sofa” or “standing” pose.	12	320	Link
ASLAN [55]	- Focus on action similarity.	432	3,697	Link
RGBD-HuDaAct [80]	- Synchronized color-depth video streams 30s-150s/video.	16	1,189	Link
Charades [93]	- Video action classification performance 6.8 actions/video.	157	9,848	Link

Table 5

A summary of common large-scale datasets from 2011 to now used for action recognition.

Dataset	Description	#classes	Samples	Download
Kinetics400 [51]	- Last around 10s /video. - Single activity / video.	400	273K	Link
Kinetics600 [11]	- Last around 10s /video. - Single activity / video.	600	435K	Link
Kinetics700 [12]	- Last around 10s /video. - Single activity / video.	700	643K	Link
Kinetics700-2020 [96]	- Last around 10s /video. - Single activity / video.	700	648K	Link
Human3.6M Dataset [41]	- 3D human poses.	17	3.6M	Link
Sports-1M [50]	- Single action/video. - YouTube videos contain 6 different types of bowling, 7 different types of American football, and 23 types of billiards.	487	1.1M	Link
YouTube-8M [1]	- Provide pre-computed and compressed features based on a Deep CNN pre-trained on ImageNet.	3,862	6.1M	Link
Something-Something [32]	- Video prediction tasks. - 6.8 actions/video.	174	220K	Link
HACS [125]	- 2-second clip annotations.	200	890K	Link
Moments in Time [77]	- 3s/video.	339	1M	Link
HVU-Dataset [21]	- Holistic video understanding (multi-label & multi-task video).	3,142	572K	Link
Jester [74]	- 3s/video on average.	27	148K	Link
IG65M [49]	- Weakly supervised dataset.	400	65M	None
VideoLT [123]	- Large-scale long-tailed video recognition.	1,004	256K	Link

[1] Abu-El-Haija, Kothari, Lee, Natsev, Toderici, Varadarajan, and Vijayanarasimhan, “Youtube-8m: A Large-scale Video Classification Benchmark,” arXiv preprint arXiv:1609.08675, 2016.

[2] J. K. Aggarwal and M. S. Ryoo, “Human Activity Analysis: A Review,” ACM Computing Surveys (CSUR), 2011.

[3] Ahsan, Madhok, and Essa, “Video Jigsaw: Unsupervised Learning of Spatiotemporal Context for Video Action Recognition,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2019, 179–189.

[4] Angelova, Toshev, and M. S. Ryoo, “Evolving Spacetime Neural Architectures for Videos,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, 1793–1802.

[5] Alwassel, Mahajan, Korbar, Torresani, Ghanem, and Tran, “Self-supervised Learning by Cross-Modal Audio-Video Clustering,” Advances in Neural Information Processing Systems, 2020.

[6] Antti et al., “Mean Teachers are Better Role Models: Weight Averaged Consistency Targets Improve Semi-supervised Deep Learning Results,” in NeurIPS, 2017, 1195–1204.

[7] Caruana, “Do Deep Nets Really Need to be Deep?” in NIPS, 2013.

[8] Buchler, Brattoli, and Ommer, “Improving Spatio-temporal Self-supervision by Deep Reinforcement Learning,” in ECCV, 2018, 770–786.

[9] Caba Heilbron, Escorcia, Ghanem, and Carlos Niebles, “Activitynet: A Large-scale Video Benchmark for Human Activity Understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, 961–970.

[10] Caron, Bojanowski, Joulin, and Douze, “Deep Clustering for Unsupervised Learning of Visual Features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, 132–149.

[11] Carreira, Noland, Banki-Horvath, Hillier, and Zisserman, “A Short Note about Kinetics-600,” arXiv preprint arXiv:1808.01340, 2018.

[12] Carreira, Noland, Hillier, and Zisserman, “A Short Note on the Kinetics-700 Human Action Dataset,” arXiv preprint arXiv:1907.06987, 2019.

[13] Chakraborty, M. B. Holte, T. B. Moeslund, and Gonzalez, “Selective Spatio-temporal Interest Points,” Computer Vision and Image Understanding, 116, 2012, 396–410.

[14] Chen, Zhang, Yao, Guo, and Liu, “Deep Learning for Sensor-based Human Activity Recognition: Overview, Challenges, and Opportunities,” ACM Computing Surveys (CSUR), 2021.

[15] Chi, Tian, and Tian, “Two-stream Video Classification with Cross-modality Attention,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019, 451120.

[16] Christoph, Fan, Malik, “SlowFast Networks for Video Recognition,” in The IEEE 2019 International Conference on Computer Vision (ICCV), IEEE, 2019, 6201–6210.

[17] Crasto, Weinzaepfel, Alahari, and Schmid, “Mars: Motion-Augmented RGB Stream for Action Recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, 7882–7891.

[18] Dalal, Triggs, and Schmid, “Human Detection Using Oriented Histograms of Flow and Appearance,” in ECCV, Springer, 2006, 428–441.

[19] Diba, Fayyaz, Sharma, A. H. Karami, M. M. Arzani, Yousefzadeh, and Van Gool, “Temporal 3D Convnets: New Architecture and Transfer Learning for Video Classification,” arXiv preprint arXiv:1711.08200, 2017.

[20] Diba, Fayyaz, Sharma, Mahdi Arzani, Yousefzadeh, Gall, and Van Gool, “Spatio-temporal Channel Correlation Networks for Action Classification,” in ECCV, 2018, 284–299.

[21] Diba, Fayyaz, Sharma, Paluri, Gall, Stiefelhagen, and Van Gool, “Holistic Large Scale Video Understanding,” arXiv preprint arXiv:1904.11451, 2019.

[22] Diba, Sharma, and L. V. Gool, “Deep Temporal Linear Encoding Networks,” in The 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, 2329–2338.

[23] Donahue, L. A. Hendricks, Rohrbach, Venugopalan, Guadarrama, Saenko, and Darrell, “Long-Term Recurrent Convolutional Networks for Visual Recognition and Description,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 677–691.

[24] Duan, Zhao, Xiong, Liu, and Lin, “Omni-sourced Webly-supervised Learning for Video Recognition,” arXiv preprint arXiv:2003.13042, 2020.

[25] Q. V. Duc, Phung, Nguyen, B. Y. Nguyen, and T. H. Nguyen, “Self-knowledge Distillation: An Efficient Approach for Falling Detection,” in International Conference on Artificial Intelligence and Big Data in Digital Era, Springer, 2022, 369–380.

[26] Zhai, G. W. Taylor, and J. M. Susskind, “Skip-Clip: Self-Supervised Spatiotemporal Representation Learning by Future Clip Order Ranking,” arXiv preprint arXiv:1910.12770, 2019.

[27] Feichtenhofer, Pinz, and Zisserman, “Convolutional Two-Stream Network Fusion for Video Action Recognition,” in The 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, 1933–41.

[28] Fernando, Bilen, Gavves, and Gould, “Self-supervised Video Representation Learning with Odd-One-Out Networks,” in CVPR, 2017, 3636–3645.

[29] Gan, Gong, Liu, and L. J. Guibas, “Geometry Guided Convolutional Neural Networks for Self-supervised Video Representation Learning,” in CVPR, 2018, 5589–5597.

[30] Ghadiyaram, Tran, and Mahajan, “Large-scale Weakly Supervised Pre-training for Video Action Recognition,” in CVPR, 2019, 12046–

[31] Girdhar, Tran, Torresani, and Ramanan, “Distinit: Learning Video Representations without a Single Labeled Video,” in ICCV, 2019, 852–861.

[32] Goyal, S. E. Kahou, Michalski, Materzynska, Westphal, Kim, Haenel, Fruend, Yianilos, Mueller-Freitag, et al., “The ‘Something Something’ Video Database for Learning and Evaluating Visual Common Sense,” in ICCV, 2017.

[33] Han, Xie, and Zisserman, “Video Representation Learning by Dense Predictive Coding,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019, 1483–1492.

[34] Hara, Kataoka, and Satoh, “Can Spatio-temporal 3D CNNs Retrace the History of 2D CNNs and Imagenet?” in CVPR, 2018, 654655.

[35] Zhang, Ren, and Sun, “Deep residual learning for image recognition,” in CVPR, 2016, 770–778.

[36] Hinton, Vinyals, and Dean, “Distilling the Knowledge in a Neural Network,” arXiv preprint arXiv:1503.02531, 2015.

[37] A. G. Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, and Adam, “Mobilenets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” arXiv preprint arXiv:1704.04861, 2017.

[38] Shen and Sun, “Squeeze-and-Excitation Networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, 7132–7141.

[39] Huang and A. G. Bors, “Busy-Quiet Video Disentangling for Video Classification,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, 1341–1350.

[40] Huang and A. G. Bors, “Region-based Non-local Operation for Video Classification,” arXiv preprint arXiv:2007.09033, 2020.

[41] Ionescu, Papava, Olaru, and Sminchisescu, “Human3.6m: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 1325–1339.

[42] Iosifidis, Tefas, and Pitas, “Semi-supervised Classification of Human Actions based on Neural Networks,” in 2014 22nd International Conference on Pattern Recognition, IEEE, 2014, 1336–1341.

[43] Jegham, A. B. Khalifa, Alouani, and M. A. Mahjoub, “Vision-based Human Action Recognition: An Overview and Real World Challenges,” Forensic Science International: Digital Investigation, 2020, 200901.

[44] Yang, “3D Convolutional Neural Networks for Human Action Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 221–231.

[45] Jiang, Wang, Gan, and Yan, “STM: Spatiotemporal and Motion Encoding for Action Recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, 2000–2009.

[46] Jing, Parag, Tian, and Wang, “Videossl: Semi-supervised Learning for Video Classification,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, 1110–1119.

[47] Jing, Yang, Liu, and Tian, “Self-supervised Spatiotemporal Feature Learning via Video Rotation Prediction,” arXiv preprint arXiv:1811.11387, 2018.

[48] Joao and Andrew, “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset,” in CVPR, 2017, 6299–6308.

[49] Kalfaoglu, Kalkan, and A. A. Alatan, “Late Temporal Modeling in 3D CNN Architectures with Bert for Action Recognition,” arXiv preprint arXiv:2008.01232, 2020.

[50] Karpathy, Toderici, Shetty, Leung, Sukthankar, and Fei-Fei, “Large-scale Video Classification with Convolutional Neural Networks,” in CVPR, 2014, 1725–1732.

[51] Kay, Carreira, Simonyan, Zhang, Hillier, Vijayanarasimhan, Viola, Green, Back, Natsev, et al., “The Kinetics Human Action Video Dataset,” arXiv preprint arXiv:1705.06950, 2017.

[52] Kazakos, Nagrani, Zisserman, and Damen, “Epicfusion: Audio-Visual Temporal Binding for Egocentric Action Recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, 5492–5501.

[53] Kim, Cho, and I. S. Kweon, “Self-supervised Video Representation Learning with Space-Time Cubic Puzzles,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 8545–

[54] Klaser, Marszalek, and Schmid, “A Spatio-temporal Descriptor based on 3D-gradients,” in BMVC 2008-19th British Machine Vision Conference, British Machine Vision Association, 2008, 275–271.

[55] Kliper-Gross, Hassner, and Wolf, “The Action Similarity Labeling Challenge,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 615–621.

[56] Knights, Vanderkop, Ward, Mackenzie-Ross, and Moghadam, “Temporally Coherent Embeddings for Self-Supervised Video Representation Learning,” arXiv preprint arXiv:2004.02753, 2020.

[57] Komkov, Dzabraev, and Petiushko, “Mutual Modality Learning for Video Action Classification,” arXiv preprint arXiv:2011.02543, 2020.

[58] Krizhevsky, Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Communications of the ACM, 2017.

[59] Kuehne, Jhuang, Garrote, Poggio, and Serre, “HMDB: A Large Video Database for Human Motion Recognition,” in Proceedings of the International Conference on Computer Vision (ICCV), 2011.

[60] Kwon, Kim, Kwak, and Cho, “MotionSqueeze: Neural Motion Feature Learning for Video Understanding,” in European Conference on Computer Vision, Springer, 2020, 345–362.

[61] Laptev, “On Space-time Interest Points,” International Journal of Computer Vision, (2-3), 2005, 107–

[62] D.-H. Lee et al., “Pseudo-label: The Simple and Efficient Semi-supervised Learning Method for Deep Neural Networks,” in Workshop on Challenges in Representation Learning, ICML, 2013.

[63] Zhong, Xie, “Collaborative Spatiotemporal Feature Learning for Video Action Recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, 7872–7881.

[64] Zhang and Liu, “Action Recognition Based on a Bag of 3D Points,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, IEEE, 2010.

[65] “Weakly-Supervised Temporal Action Detection for Fine-Grained Videos with Hierarchical Atomic Actions,” in European Conference on Computer Vision, Springer, 2022.

[66] Lin, Guo, “Self-supervised Video Representation Learning with Meta-contrastive Network,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, 8239–8249.

[67] Luo, Liu, Zhou, Yang, and Wang, “Video Cloze Procedure for Self-supervised Spatio-Temporal Learning,” arXiv preprint arXiv:2001.00294, 2020.

[68] D. C. Luvizon, Picard, and Tabia, “2D/3D Pose Estimation and Action Recognition using Multitask Deep Learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, 5137–5146.

[69] Zhang, H.-T. Zheng, and Sun, “Shufflenet v2: Practical Guidelines for Efficient CNN Architecture Design,” in European conference on computer vision, 2018, 116–131.

[70] Mandal, Narayan, S. K. Dwivedi, Gupta, Ahmed, F. S. Khan, and Shao, “Out-of-Distribution Detection for Generalized Zero-shot Action Recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, 9985–9993.

[71] Mao, Xue, and Zhang, “Hierarchical Video Frame Sequence Representation with Deep Convolutional Graph Network,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, 262–270.

[72] Marszalek, Laptev, and Schmid, “Actions in Context,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 2929–2936.

[73] Martinez, Modolo, Xiong, and Tighe, “Action Recognition with Spatial-temporal Discriminative Filter Banks,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 5482–5491.

[74] Materzynska, Berger, Bax, and Memisevic, “The Jester Dataset: A Large-scale Video Dataset of Human Gestures,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019, pp. 2874–2882.

[75] Mazzia, Angarano, Salvetti, Angelini, and Chiaberge, “Action Transformer: A Self-attention Model for Short-time Pose-based Human Action Recognition,” Pattern Recognition, vol. 124, 2022, 108487.

[76] Misra, C. L. Zitnick, and Hebert, “Shuffle and Learn: Unsupervised Learning using Temporal Order Verification,” in ECCV, Springer, 2016, pp. 527–544.

[77] Monfort, Andonian, Zhou, Ramakrishnan, S. A. Bargal, Yan, Brown, Fan, Gutfreund, Vondrick, et al., “Moments in Time Dataset: One Million Videos for Event Understanding,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, pp. 502–508.

[78] Morgado, Vasconcelos, and Misra, “Audio-visual Instance Discrimination with Cross-modal Agreement,” arXiv preprint arXiv:2004.12943, 2020.

[79] J. Y.-H., Hausknecht, Vijayanarasimhan, Vinyals, Monga, and Toderici, “Beyond Short Snippets: Deep Networks for Video Classification,” in Computer Vision and Pattern Recognition, 2015.

[80] Wang, and Moulin, “RGBD-HuDaAct: A Color-depth Video Database for Human Daily Activity Recognition,” in 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), IEEE, 2011, pp. 1147–1153.

[81] S. N. Paul and Y. J. Singh, “Survey on Video Analysis of Human Walking Motion,” International Journal of Signal Processing, Image Processing and Pattern Recognition, 2014, pp. –122.

[82] Paul, Roy, and A. K. Roy-Chowdhury, “W-TALC: Weakly-supervised Temporal Activity Localization and Classification,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 563–579.

[83] L. L. Presti and La Cascia, “3D Skeleton-based Human Action Classification: A Survey,” Pattern Recognition, 2016, pp. 130–147.

[84] Qian, Meng, Gong, M.-H. Yang, Wang, Belongie, and Cui, “Spatiotemporal Contrastive Video Representation Learning,” arXiv preprint arXiv:2008.03800, 2020.

[85] Qiu, Yao, and Mei, “Learning Spatio-temporal Representation with Pseudo-3D Residual Networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5533–5541.

[86] Qiu, Yao, C.-W. Ngo, Tian, and Mei, “Learning Spatio-temporal Representation with Local and Global Diffusion,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12056–.

[87] Recasens, Luc, J.-B. Alayrac, Wang, Strub, Tallec, Malinowski, Pătrăucean, Altché, Valko, et al., “Broaden your Views for Self-supervised Video Learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1255–1265.

[88] K. K. Reddy and Shah, “Recognizing 50 Human Action Categories of Web Videos,” Machine Vision and Applications, 2013, pp. 971–981.

[89] Romero, Ballas, S. E. Kahou, Chassang, Gatta, and Bengio, “FitNets: Hints for Thin Deep Nets,” in ICLR, 2015.

[90] Sayed, Brattoli, and Ommer, “Cross and Learn: Cross-Modal Self-supervision,” in Pattern Recognition, Cham: Springer International Publishing, 2019, pp. 228–.

[91] Scovanner, Ali, and Shah, “A 3-dimensional SIFT Descriptor and Its Application to Action Recognition,” in Proceedings of the 15th ACM International Conference on Multimedia, 2007, pp. 357–360.

[92] Sharir, Noy, and Zelnik-Manor, “An Image is Worth 16 × 16 Words, What is a Video Worth?” arXiv preprint arXiv:2103.13915, 2021.

[93] G. A. Sigurdsson, Varol, Wang, Farhadi, Laptev, and Gupta, “Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding,” in European Conference on Computer Vision, Springer, 2016, pp. 510–526.

[94] Simonyan and Zisserman, “Two-stream Convolutional Networks for Action Recognition in Videos,” in NIPS, 2014, pp. 568–576.

[95] Singh, Chakraborty, Varshney, Panda, Feris, Saenko, and Das, “Semi-Supervised Action Recognition with Temporal Contrastive Learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10389–.

[96] Smaira, Carreira, Noland, Clancy, and Zisserman, “A Short Note on the Kinetics-700-2020 Human Action Dataset,” arXiv preprint arXiv:2010.10864, 2020.

[97] Sohn, Berthelot, Carlini, Zhang, C. A. Raffel, E. D. Cubuk, Kurakin, and C.-L., “FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence,” Advances in Neural Information Processing Systems, 2020.

[98] Soomro, A. R. Zamir, and Shah, “UCF101: A Dataset of 101 Human Actions Classes from Videos in the Wild,” arXiv preprint arXiv:1212.0402, 2012.

[99] Sun, Rahmani, Bennamoun, Wang, and Liu, “Human Action Recognition from Various Data Modalities: A Review,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.

[100] Tao, Wang, and Yamasaki, “Self-supervised Video Representation Learning using Inter-Intra Contrastive Framework,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 2193–2201.

[101] Tao, Wang, and Yamasaki, “Self-Supervised Video Representation Using Pretext-Contrastive Learning,” arXiv preprint arXiv:2010.15464, 2020.

[102] Tian, Krishnan, and Isola, “Contrastive Multiview Coding,” arXiv preprint arXiv:1906.05849, 2019.

[103] Tong, Song, Wang, and Wang, “VideoMAE: Masked Autoencoders are Data-efficient Learners for Self-supervised Video Pretraining,” arXiv preprint arXiv:2203.12602, 2022.

[104] Tran, Bourdev, Fergus, Torresani, and Paluri, “Learning Spatiotemporal Features with 3D Convolutional Networks,” in The 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4489–4497.

[105] Tran, Wang, Torresani, and Feiszli, “Video Classification with Channel-separated Convolutional Networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 5552–5561.

[106] Tran, Wang, Torresani, Ray, LeCun, and Paluri, “A Closer Look at Spatiotemporal Convolutions for Action Recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6450–6459.

[107] Varol, Laptev, and Schmid, “Long-Term Temporal Convolutions for Action Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, pp. 1510–1517.

[108] D.-Q., and J.-C. Wang, “(2+1)D Distilled ShuffleNet: A Lightweight Unsupervised Distillation Network for Human Action Recognition,” in Proceedings of the IEEE International Conference on Pattern Recognition (ICPR), 2022.

[109] D.-Q., and J.-C. Wang, “Teaching Yourself: A Self Knowledge Distillation Approach to Action Recognition,” IEEE Access, 2021, pp. 105711–.

[110] D.-Q., J.-C. Wang, et al., “A Novel Self-knowledge Distillation Approach with Siamese Representation Learning for Action Recognition,” in 2021 International Conference on Visual Communications and Image Processing (VCIP), IEEE, 2021.

[111] Wang and Schmid, “Action Recognition with Improved Trajectories,” in ICCV, 2013, pp. 3551–3558.

[112] Wang, Liu, and Yuan, “Mining Actionlet Ensemble for Action Recognition with Depth Cameras,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, pp. 1290–1297.

[113] Wang, Jiao, Bao, Liu, and Liu, “Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4006–4015.

[114] Wang, Tran, and Feiszli, “What Makes Training MultiModal Classification Networks Hard?” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12695–12705.

[115] Wei, J. J. Lim, Zisserman, and W. T. Freeman, “Learning and Using the Arrow of Time,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8052–8060.

[116] Willems, Tuytelaars, and Van Gool, “An Efficient Dense and Scale-invariant Spatio-temporal Interest Point Detector,” in ECCV, Springer, 2008, pp. 650–663.