
Table 8 Comparison of performance and model details, including the number of parameters (# Params), FLOPS, and inference time (seconds/iteration), across different backbones and methods

From: Matching Compound Prototypes for Few-Shot Action Recognition

| Method | Backbone | Object | # Params | FLOPS-b | FLOPS-m | FLOPS-o | Inference time (s/it) | SSv2-Small | SSv2-Full | Kinetics |
|---|---|---|---|---|---|---|---|---|---|---|
| MatchNet | ResNet-50 | / | 24.6M | 33.0G | 0 | 0 | 0.4 | 34.9 | 35.1 | 54.6 |
| TRX (Perrett et al., 2021) | ResNet-50 | / | 27.2M | 33.0G | 10.57G | 0 | 0.8 | 37.1 | 41.5 | 64.6 |
| ITA-Net (Zhang et al., 2021b) | ResNet-50 | / | 30.9M | 33.0G | 11.3G | 0 | 0.9 | 38.4 | 46.1 | 72.6 |
| Ours | ResNet-50 | / | 32.0M | 33.0G | 2.2G | 0 | 0.6 | 38.9 | 49.3 | 73.3 |
| Ours-ms | ResNet-50 | / | 39.8M | 33.0G | 8.82G | 0 | 0.8 | 42.6 | 52.3 | 74.0 |
| Ours-ms | ResNet-18 | / | 26.7M | 15.6G | 8.82G | 0 | 0.6 | 40.8 | 50.2 | 71.4 |
| Ours-ms | DenseNet | / | 23.2M | 26.0G | 8.82G | 0 | 0.7 | 41.0 | 50.7 | 71.7 |
| Ours-obj | ResNet-50 | 41.8M | 37.2M | 33.0G | 8.06G | 3T | 3.3 | 57.1 | 59.6 | 81.0 |
| Ours-obj | ResNet-18 | 41.8M | 24.1M | 15.6G | 8.06G | 3T | 3.2 | 53.4 | 56.2 | 77.3 |
| Ours-obj | DenseNet | 41.8M | 20.6M | 26.0G | 8.06G | 3T | 3.2 | 53.7 | 56.5 | 77.6 |

  1. FLOPS-b, FLOPS-m, and FLOPS-o denote the computation cost of the backbone, the few-shot learning module, and the object detector, respectively. "Ours-ms" indicates our method with multi-scale features, and "Ours-obj" denotes our method with an additional object detector. All experiments are conducted on a single NVIDIA V100 GPU.
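
As a rough illustration of how the cost columns relate, the sketch below is not code from the paper: it assumes the backbone is the standard torchvision ResNet-50, that parameter and FLOP counts are measured with fvcore, that a clip consists of 8 sampled frames at 224×224, and that FLOPS-b, FLOPS-m, and FLOPS-o simply add up to the total cost of one forward pass. The variant names and numbers are copied from Table 8.

```python
import torch
from torchvision.models import resnet50
from fvcore.nn import FlopCountAnalysis, parameter_count

# Backbone-side figures (# Params, FLOPS-b): one possible way to measure them.
backbone = resnet50()
clip = torch.randn(8, 3, 224, 224)            # assumed: 8 sampled frames per video at 224x224

params = parameter_count(backbone)[""]        # total parameter count of the backbone
flops_b = FlopCountAnalysis(backbone, clip).total()
print(f"ResNet-50 backbone: {params / 1e6:.1f}M params, "
      f"~{flops_b / 1e9:.1f} GFLOPs per clip")

# Combining the three FLOPS columns of Table 8 into a rough per-clip total,
# assuming they are additive.
GIGA, TERA = 1e9, 1e12
variants = {
    # name: (FLOPS-b, FLOPS-m, FLOPS-o), values taken from Table 8
    "Ours-ms (ResNet-50)":  (33.0 * GIGA, 8.82 * GIGA, 0.0),
    "Ours-obj (ResNet-50)": (33.0 * GIGA, 8.06 * GIGA, 3 * TERA),
}
for name, (b, m, o) in variants.items():
    total = b + m + o
    print(f"{name}: ~{total / GIGA:.1f} GFLOPs total "
          f"(object detector share {o / total:.0%})")
```

Under these assumptions, the breakdown makes the table's trend explicit: for the "Ours-obj" variants the object detector dominates the computation by roughly two orders of magnitude, which is consistent with their much longer inference times.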