Table 1

A summary of supervised learning methods for action recognition. The “Performance” column reports the top-1 accuracy of the best model for each method. The “Model size” column gives the number of parameters and FLOPs of each model; where the authors did not report model size in their paper, the entry is marked —. Moments denotes the Moments in Time dataset and SS denotes the Something-Something dataset.

| Method | Description | Network | Model size | Performance | Code |
| --- | --- | --- | --- | --- | --- |
| BQN [39] | Focusing on busy motion in the input videos; separating busy features from quiet features; two networks have been proposed for the two feature types. | BQN | 92M, 241 GFLOPs | 77.3 (Kinetics400); 97.6 (UCF101); 77.6 (HMDB51) | Link |
| STAM [92] | Proposing two types of transformer: a temporal transformer and a spatial transformer. | Transformer | 96M, 270 GFLOPs | 79.3 (Kinetics400); 97.0 (UCF101); 39.7 (Charades) | Link |
| En-VidTr [124] | Proposing two types of transformer: a temporal transformer and a spatial transformer. | VidTr-M | 98.1M, 220 GFLOPs | 79.7 (Kinetics400); 96.7 (UCF101); 74.4 (HMDB51) | None |
| Omni-sourced [24] | Leveraging crawled data; adopting pre-trained models as a teacher; training students with the teacher’s labels. | irCSN-152 | — | 83.6 (Kinetics400); 96.0 (UCF101); 71.1 (HMDB51) | Link |
| G-Blend [114] | Identifying causes for the performance drop on multi-modal networks; proposing a technique to avoid overfitting on these networks. | ipCSN-152 | 32.8M, 110.1 GFLOPs | 83.3 (Kinetics400) | Link |
| irCSN-152 [105] | Designing an architecture named Channel-Separated Convolutional Network; utilizing group convolution to offer computational savings. | irCSN-152 | 29.6M, 96.7 GFLOPs | 82.6 (Kinetics400) | Link |
| ipCSN-152 [105] | Designing an architecture named Channel-Separated Convolutional Network; utilizing group convolution to offer computational savings. | ipCSN-152 | 32.8M, 108.8 GFLOPs | 79.2 (Kinetics400) | Link |
| GB+DF+LB [73] | Focusing on improving the last layers; proposing 3 classification branches instead of using global average pooling alone. | ResNet-152 | — | 53.4 (SS V1); 78.8 (Kinetics400) | None |
| HATNet [21] | Fusing 2D and 3D architectures into one; training on the HVU dataset. | ResNet-50 | — | 77.6 (Kinetics400); 97.8 (UCF101); 76.5 (HMDB51) | None |
| CoST [63] | Proposing a novel operation to learn features using 2D Conv with a weight-sharing constraint. | ResNet-101 | — | 31.5 (Moments); 77.5 (Kinetics400) | None |
| RNL-TSM [40] | Presenting region-based non-local operations as a form of self-attention. | ResNet-50 | 35.95M, 41.16 GFLOPs | 49.47 (SS V1); 77.2 (Kinetics400) | Link |
| MSNet [60] | Learning correspondences across frames and converting them into motion features. | ResNet-50 | 49.2M, 67.6 GFLOPs | 55.1 (SS V1); 67.1 (SS V2); 76.4 (Kinetics400); 77.4 (HMDB51) | Link |
| CMA [15] | Proposing a cross-modality attention operation. | ResNet-152 | — | 75.98 (Kinetics400); 96.5 (UCF101) | None |
| FASTER32 [126] | Leveraging the video’s redundancy to reduce FLOPs; combining an expensive model that captures actions with a lightweight model that captures scene changes. | ResNet-50 | 67.7 GFLOPs | 75.3 (Kinetics400); 96.9 (UCF101); 75.7 (HMDB51) | None |
| MARS [17] | Knowledge distillation from the flow network to the RGB network. | ResNeXt-101 | — | 74.9 (Kinetics400); 53 (SS V1); 98.1 (UCF101); 80.9 (HMDB51) | None |
| STM [45] | Encoding features in a 2D framework; the Channel-wise Spatio-Temporal Module presents the spatiotemporal features; the Channel-wise Motion Module efficiently encodes motion features. | ResNet-50 | 23.88M, 32.93 GFLOPs | 73.7 (Kinetics400); 50.5 (SS V1); 64.2 (SS V2); 96.7 (Jester); 96.2 (UCF101); 72.2 (HMDB51) | None |
| SlowFastNet [16] | Two streams, one at a low frame rate and the other at a high frame rate. | ResNet-101 | 234 GFLOPs | 79.8 (Kinetics400); 81.8 (Kinetics600) | Link |
| EvaNet [4] | Finding video CNN architectures based on an evolutionary algorithm. | Inception Net | — | 77.4 (Kinetics400); 82.3 (HMDB51); 31.8 (Moments) | None |
| R(2+1)D [106] | Explicitly factorizing 3D Conv into two operations, a 2D Conv and a 1D Conv. | ResNet-34 | — | 75.4 (Kinetics400); 73.3 (Sports1M); 97.3 (UCF101); 78.7 (HMDB51) | Link |
| P3D [85] | (2+1)D Conv uses ReLU between the 2D and 1D Conv in each block; using separate spatial and temporal components renders the optimization easier. | ResNet-152 | — | 77.4 (Kinetics400); 93.7 (UCF101); 66.4 (Sports1M); 75.12 (ActivityNet); 80.8 (ASLAN) | Link |
| I3D [48] | Repeating 2D filters in the pre-trained Inception Net. | Inception-V1 | 25M | 74.2 (Kinetics400); 93.4 (UCF101); 66.4 (HMDB51) | Link |
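
For reference, the top-1 accuracy reported in the “Performance” column can be computed as in the minimal sketch below. It assumes per-clip class scores are already available as plain lists; the function name `top1_accuracy` and the toy scores are illustrative and are not taken from any of the surveyed papers. Published numbers also typically depend on each paper's clip and crop sampling protocol, which this sketch does not model.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Return the percentage of samples whose highest-scoring class equals the label."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    correct = 0
    for sample_scores, label in zip(scores, labels):
        # Index of the class with the largest score (first one wins on ties).
        predicted = max(range(len(sample_scores)), key=sample_scores.__getitem__)
        correct += int(predicted == label)
    return 100.0 * correct / len(labels)


# Toy example (hypothetical scores for 4 videos over 3 classes).
toy_scores = [
    [0.1, 0.7, 0.2],  # predicted class 1
    [0.6, 0.3, 0.1],  # predicted class 0
    [0.2, 0.2, 0.6],  # predicted class 2
    [0.5, 0.4, 0.1],  # predicted class 0
]
toy_labels = [1, 0, 2, 1]
print(top1_accuracy(toy_scores, toy_labels))  # 75.0
```

Three of the four toy predictions match their labels, so the sketch reports 75.0, on the same percentage scale as the table entries.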
