Table 1

A summary of supervised learning methods for action recognition. The “Performance” column reports the top-1 accuracy of the best model for each method. The “Model size” column gives the number of parameters and FLOPs of each model. Where the authors did not report the model size in their paper, the entry is marked —. Moments denotes the Moments in Time dataset and SS denotes the Something-Something dataset.

Method	Description	Network	Model size	Performance	Code
BQN [39]	- Focusing on busy motion in the input videos. - Separating busy features from quiet features. - Two networks are proposed for the two feature types.	BQN	92M 241GFLOPs	77.3 (Kinetics400) 97.6 (UCF101) 77.6 (HMDB51)	Link
STAM [92]	- Proposing two types of transformer: a temporal transformer and a spatial transformer.	Transformer	96M 270GFLOPs	79.3 (Kinetics400) 97.0 (UCF101) 39.7 (Charades)	Link
En-VidTr [124]	- Proposing two types of transformer: a temporal transformer and a spatial transformer.	VidTr-M	98.1M 220GFLOPs	79.7 (Kinetics400) 96.7 (UCF101) 74.4 (HMDB51)	None
Omni-sourced [24]	- Leveraging crawled data. - Adopting pre-trained models as teachers. - Training students with the teachers’ labels.	irCSN-152	—	83.6 (Kinetics400) 96.0 (UCF101) 71.1 (HMDB51)	Link
G-Blend [114]	- Identifying causes for performance drop on multi-modal networks. - Proposing a technique to avoid overfitting on these networks.	ipCSN-152	32.8M 110.1GFLOPs	83.3 (Kinetics400)	Link
irCSN-152 [105]	- Design an architecture named Channel-Separated Convolutional Network. - Utilize group convolution to offer computational savings.	irCSN-152	29.6M 96.7GFLOPs	82.6 (Kinetics400)	Link
ipCSN-152 [105]	- Design an architecture named Channel-Separated Convolutional Network. - Utilize group convolution to offer computational savings.	ipCSN-152	32.8M 108.8GFLOPs	79.2 (Kinetics400)	Link
GB+DF+LB [73]	- Focusing on improving the last layers. - Propose three classification branches instead of using global average pooling alone.	ResNet-152	—	53.4 (SS V1) 78.8 (Kinetics400)	None
HATNet [21]	- Fusing 2D and 3D architectures into one. - Training on HVU dataset.	ResNet-50	—	77.6 (Kinetics400) 97.8 (UCF101) 76.5 (HMDB51)	None
CoST [63]	- Proposing a novel operation to learn features using 2D Conv with a weight-sharing constraint.	ResNet-101	—	31.5 (Moments) 77.5 (Kinetics400)	None
RNL-TSM [40]	- Present region-based non-local operations as a form of self-attention.	ResNet-50	35.95M 41.16GFLOPs	49.47 (SS V1) 77.2 (Kinetics400)	Link
MSNet [60]	- Learn correspondences across frames and convert them into motion features.	ResNet-50	49.2M 67.6GFLOPs	55.1 (SS V1) 67.1 (SS V2) 76.4 (Kinetics400) 77.4 (HMDB51)	Link
CMA [15]	- Propose a cross-modality attention operation.	ResNet-152	—	75.98 (Kinetics400) 96.5 (UCF101)	None
FASTER32 [126]	- Leverage the video’s redundancy to reduce FLOPs. - Combine an expensive model that captures actions with a lightweight model that captures scene changes.	ResNet-50	67.7GFLOPs	75.3 (Kinetics400) 96.9 (UCF101) 75.7 (HMDB51)	None
MARS [17]	- Knowledge distillation from the flow network to the RGB network.	ResNeXt-101	—	74.9 (Kinetics400) 53 (SS V1) 98.1 (UCF101) 80.9 (HMDB51)	None
STM [45]	- Encode features in a 2D framework. - The Channel-wise Spatio Temporal Module presents the spatiotemporal features. - The Channel-wise Motion Module efficiently encodes motion features.	ResNet-50	23.88M 32.93GFLOPs	73.7 (Kinetics400) 50.5 (SS V1) 64.2 (SS V2) 96.7 (Jester) 96.2 (UCF101) 72.2 (HMDB51)	None
SlowFastNet [16]	- Two streams, one operating at a low frame rate and the other at a high frame rate.	ResNet-101	234GFLOPs	79.8 (Kinetics400) 81.8 (Kinetics600)	Link
EvaNet [4]	- Finding video CNN architectures based on an evolutionary algorithm.	Inception Net	—	77.4 (Kinetics400) 82.3 (HMDB51) 31.8 (Moments)	None
R(2+1)D [106]	- Explicitly factorize 3D Conv into two operations, a 2D Conv and a 1D Conv.	ResNet-34	—	75.4 (Kinetics400) 73.3 (Sports1M) 97.3 (UCF101) 78.7 (HMDB51)	Link
P3D [85]	- (2+1)D Conv uses ReLU between the 2D and 1D Conv in each block. - Using separate spatial and temporal components renders the optimization easier.	ResNet-152	—	77.4 (Kinetics400) 93.7 (UCF101) 66.4 (Sports1M) 75.12 (ActivityNet) 80.8 (ASLAN)	Link
I3D [48]	- Inflate the 2D filters of the pre-trained Inception Net into 3D by repeating them along the temporal dimension.	Inception-V1	25M	74.2 (Kinetics400) 93.4 (UCF101) 66.4 (HMDB51)	Link
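
Several rows above (R(2+1)D, P3D, irCSN-152, ipCSN-152) reduce model size by factorizing or grouping 3D convolutions. The sketch below is a rough illustration of why that lowers the parameter count for a single layer. The helper names, the channel width of 64, and the simplified layer layouts are assumptions made for illustration; they are not taken from the surveyed papers and do not reproduce the model sizes listed in the table.

```python
def conv3d_params(c_in, c_out, kt, kh, kw, groups=1):
    """Weight count of a 3D convolution (bias ignored)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * kt * kh * kw


def factorized_2plus1d_params(c_in, c_mid, c_out, kt, k):
    """A 1 x k x k spatial conv followed by a kt x 1 x 1 temporal conv."""
    spatial = conv3d_params(c_in, c_mid, 1, k, k)
    temporal = conv3d_params(c_mid, c_out, kt, 1, 1)
    return spatial + temporal


def channel_separated_params(c_in, c_out, kt, k):
    """A 1 x 1 x 1 pointwise conv followed by a depthwise kt x k x k conv.

    Depthwise means a group convolution with one group per channel.
    """
    pointwise = conv3d_params(c_in, c_out, 1, 1, 1)
    depthwise = conv3d_params(c_out, c_out, kt, k, k, groups=c_out)
    return pointwise + depthwise


# Hypothetical layer: 64 input channels, 64 output channels, 3 x 3 x 3 kernel.
full = conv3d_params(64, 64, 3, 3, 3)
factorized = factorized_2plus1d_params(64, 64, 64, 3, 3)
separated = channel_separated_params(64, 64, 3, 3)

print(f"full 3D conv:       {full}")
print(f"(2+1)D factorized:  {factorized}")
print(f"channel-separated:  {separated}")
```

With these assumed sizes the full 3D layer has 110,592 weights, the factorized version 49,152, and the channel-separated version 5,824. Real networks choose the intermediate channel count and block layout differently, so the savings in practice depend on the architecture.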