Summary of supervised learning methods for action recognition. The “Performance” column reports the top-1 accuracy of the best model for each method. The “Model size” column gives the number of parameters and FLOPs of each model; when the authors did not report model size in their paper, the cell contains a dash. “Moments” denotes the Moments in Time dataset and “SS” denotes the Something-Something dataset.
| Method | Description | Network | Model size | Performance | Code |
|---|---|---|---|---|---|
| BQN [39] | - Focusing on busy motion in the input videos. - Separating busy features from quiet features. - Proposing two networks, one for each feature type. | BQN | 92M 241GFLOPs | 77.3 (Kinetics400) 97.6 (UCF101) 77.6 (HMDB51) | Link |
| STAM [92] | - Proposing two types of transformer: a temporal transformer and a spatial transformer. | Transformer | 96M 270GFLOPs | 79.3 (Kinetics400) 97.0 (UCF101) 39.7 (Charades) | Link |
| En-VidTr [124] | - Proposing two types of transformer: a temporal transformer and a spatial transformer. | VidTr-M | 98.1M 220GFLOPs | 79.7 (Kinetics400) 96.7 (UCF101) 74.4 (HMDB51) | None |
| Omni-sourced [24] | - Leveraging crawled data. - Adopt pre-trained models as a teacher. - Training students with teacher’s labels. | irCSN-152 | — | 83.6 (Kinetics400) 96.0 (UCF101) 71.1 (HMDB51) | Link |
| G-Blend [114] | - Identifying causes for performance drop on multi-modal networks. - Proposing a technique to avoid overfitting on these networks. | ipCSN-152 | 32.8M 110.1GFLOPs | 83.3 (Kinetics400) | Link |
| irCSN-152 [105] | - Design an architecture named Channel-Separated Convolutional Network. - Utilize Group convolution to offer computational savings. | irCSN-152 | 29.6M 96.7GFLOPs | 82.6 (Kinetics400) | Link |
| ipCSN-152 [105] | - Design an architecture named Channel-Separated Convolutional Network. - Utilize Group convolution to offer computational savings. | ipCSN-152 | 32.8M 108.8GFLOPs | 79.2 (Kinetics400) | Link |
| GB+DF+LB [73] | - Focusing on improving the last layers. - Propose 3 classification branches instead of using the global average pooling alone. | ResNet-152 | — | 53.4 (SS V1) 78.8 (Kinetics400) | None |
| HATNet [21] | - Fusing 2D and 3D architectures into one. - Training on HVU dataset. | ResNet-50 | — | 77.6 (Kinetics400) 97.8 (UCF101) 76.5 (HMDB51) | None |
| CoST [63] | - Proposing a novel operation to learn features using 2D Conv with a weight-sharing constraint. | ResNet-101 | — | 31.5 (Moments) 77.5 (Kinetics400) | None |
| RNL-TSM [40] | - Present region-based non-local operations as a self-attention. | ResNet-50 | 35.95M 41.16GFLOPs | 49.47 (SS V1) 77.2 (Kinetics400) | Link |
| MSNet [60] | - Learn correspondences across frames and convert them into motion features. | ResNet-50 | 49.2M 67.6GFLOPs | 55.1 (SS V1) 67.1 (SS V2) 76.4 (Kinetics400) 77.4 (HMDB51) | Link |
| CMA [15] | - Propose a cross-modality attention operation. | ResNet-152 | — | 75.98 (Kinetics400) 96.5 (UCF101) | None |
| FASTER32 [126] | - Leverage the video’s redundancy to reduce FLOPs. - Combine an expensive model that captures actions with a lightweight model that captures scene changes. | ResNet-50 | 67.7GFLOPs | 75.3 (Kinetics400) 96.9 (UCF101) 75.7 (HMDB51) | None |
| MARS [17] | - Knowledge distillation from the flow network to the RGB network. | ResNeXt-101 | — | 74.9 (Kinetics400) 53 (SS V1) 98.1 (UCF101) 80.9 (HMDB51) | None |
| STM [45] | - Encode features in a 2D framework. - The Channel-wise SpatioTemporal Module represents the spatiotemporal features. - The Channel-wise Motion Module efficiently encodes motion features. | ResNet-50 | 23.88M 32.93GFLOPs | 73.7 (Kinetics400) 50.5 (SS V1) 64.2 (SS V2) 96.7 (Jester) 96.2 (UCF101) 72.2 (HMDB51) | None |
| SlowFastNet [16] | - Two streams, one at a low frame rate and the other at a high frame rate. | ResNet-101 | 234GFLOPs | 79.8 (Kinetics400) 81.8 (Kinetics600) | Link |
| EvaNet [4] | - Finding video CNN architectures based on an evolutionary algorithm. | Inception Net | — | 77.4 (Kinetics400) 82.3 (HMDB51) 31.8 (Moments) | None |
| R(2+1)D [106] | - Explicitly factorize 3D Conv into two operations, a 2D Conv and a 1D Conv. | ResNet-34 | — | 75.4 (Kinetics400) 73.3 (Sports1M) 97.3 (UCF101) 78.7 (HMDB51) | Link |
| P3D [85] | - (2+1)D Conv uses ReLU between the 2D and 1D Conv in each block. - Using separate spatial and temporal components renders the optimization easier. | ResNet-152 | — | 77.4 (Kinetics400) 93.7 (UCF101) 66.4 (Sports1M) 75.12 (ActivityNet) 80.8 (ASLAN) | Link |
| I3D [48] | - Inflate the 2D filters of the pre-trained Inception Net into 3D by repeating them along the temporal dimension. | Inception-V1 | 25M | 74.2 (Kinetics400) 93.4 (UCF101) 66.4 (HMDB51) | Link |
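
The R(2+1)D row describes factorizing a 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution. The following is a minimal sketch of how the weight count of such a factorized block compares with a full 3D convolution. The function names are hypothetical, biases are ignored, and the rule used to pick the intermediate channel count (chosen so that both variants have roughly the same number of weights) is an assumption for illustration, not something stated in the table.

```python
def conv3d_params(c_in, c_out, t, d):
    """Weights in a full t x d x d 3D convolution (bias ignored)."""
    return t * d * d * c_in * c_out


def conv2plus1d_params(c_in, c_out, t, d, c_mid):
    """Weights in a 1 x d x d spatial conv (c_in -> c_mid)
    followed by a t x 1 x 1 temporal conv (c_mid -> c_out)."""
    spatial = d * d * c_in * c_mid
    temporal = t * c_mid * c_out
    return spatial + temporal


def matched_mid_channels(c_in, c_out, t, d):
    """Assumed rule: pick c_mid so the factorized block has about
    as many weights as the full 3D convolution."""
    return (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)


if __name__ == "__main__":
    # Hypothetical layer: 64 -> 64 channels, 3 x 3 x 3 kernel.
    c_in, c_out, t, d = 64, 64, 3, 3
    c_mid = matched_mid_channels(c_in, c_out, t, d)
    print("full 3D conv weights:", conv3d_params(c_in, c_out, t, d))
    print("intermediate channels:", c_mid)
    print("(2+1)D weights:", conv2plus1d_params(c_in, c_out, t, d, c_mid))
```

With a matched intermediate width the two variants have comparable capacity, so any accuracy difference can be attributed to the factorization (and the extra nonlinearity between the two convolutions) rather than to model size.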