Abstract
Objective  Few-shot learning (FSL) aims to recognize novel visual categories from only a few labeled samples. In a typical FSL setting, a model is trained with a classification strategy in the meta-training phase and is then required to recognize previously unseen classes from few labeled examples in the meta-testing phase. Current few-shot image classification methods focus on learning a robust global representation. Although such methods handle generic few-shot classification, they struggle with fine-grained few-shot classification, because a global representation cannot capture the local and subtle features that are critical for fine-grained recognition. Fine-grained image datasets contain few samples per class due to the high cost of labeling, which makes fine-grained recognition a natural few-shot scenario short of annotated data. Fine-grained recognition relies on locating the most discriminative regions and exploiting the discriminative features within them. However, many fine-grained recognition methods cannot be transferred directly to the fine-grained few-shot task because the extra annotations they require (e.g., bounding boxes) are unavailable. It is therefore necessary to address the general few-shot task and the fine-grained few-shot task jointly.

Method  Weakly supervised object localization (WSOL) is beneficial to fine-grained few-shot classification: most fine-grained few-shot datasets provide only image-level labels because pixel-level annotation is expensive, and WSOL can directly expose the most discriminative regions, which is critical for both general and fine-grained image classification. However, many existing WSOL methods cannot localize objects completely. For instance, the class activation map (CAM) approach only updates the last few layers of the classification network (global max pooling and fully connected layers), so the resulting activation map covers only the most discriminative region of the object. To tackle these issues, we propose a self-attention based complementary module (SACM) to perform WSOL. SACM consists of a channel-based attention module (CBAM) and a classifier module. Based on the spatial attention computed from the feature maps, CBAM directly generates a saliency mask, and a complementary non-saliency mask is obtained from the same attention map by thresholding. The saliency mask and the complementary non-saliency mask are multiplied spatially with the feature maps to obtain the saliency and non-saliency feature maps, respectively. The classifier assigns both the saliency and the non-saliency feature maps to the same category, which yields a more complete class activation map. The class activation map is then used to filter the local feature descriptors and keep only those useful for classification, which forms the descriptor-based representation. In addition, metric methods designed for common-image few-shot classification cannot be applied directly to fine-grained few-shot classification. We therefore use a semantic alignment distance that measures the distance between two fine-grained images with the retained feature descriptors and the naive Bayes nearest neighbor (NBNN) algorithm. First, for each query feature descriptor we find its nearest descriptor in the support set under the cosine distance, denoted as the nearest-neighbor cosine distance. Then, we accumulate the nearest-neighbor cosine distances of all retained descriptors to obtain the semantic alignment distance. These two steps constitute the semantic alignment module (SAM). Because each query descriptor is aligned to its nearest support descriptor, the content of the query image and the support image is semantically aligned. Moreover, each image is represented by a set of feature descriptors rather than a single high-dimensional feature vector, which provides a larger matching space and is equivalent to classification in a relatively "high-data" regime, thereby improving the tolerance of the metric to noise.
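A minimal PyTorch-style sketch of the complementary masking step described above is given below. It is an illustration under our own assumptions rather than the authors' implementation: the function names, the threshold tau, and the classifier/criterion objects are hypothetical.

```python
import torch

def complementary_masks(attention, features, tau=0.5):
    """Split feature maps into saliency / non-saliency parts (illustrative sketch).

    attention: (B, 1, H, W) spatial attention in [0, 1], e.g., produced by CBAM
    features:  (B, C, H, W) backbone feature maps
    tau:       assumed threshold separating salient from non-salient locations
    """
    saliency_mask = (attention >= tau).float()       # most discriminative regions
    non_saliency_mask = 1.0 - saliency_mask          # complementary regions
    salient_feats = features * saliency_mask         # spatial-wise multiplication
    non_salient_feats = features * non_saliency_mask
    return salient_feats, non_salient_feats

def sacm_style_loss(classifier, salient_feats, non_salient_feats, labels, criterion):
    """Both branches are assigned the same label so that non-salient but
    class-relevant regions also become activated, giving a more complete CAM."""
    logits_salient = classifier(salient_feats)
    logits_non_salient = classifier(non_salient_feats)
    return criterion(logits_salient, labels) + criterion(logits_non_salient, labels)
```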
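The semantic alignment distance can likewise be sketched as follows, assuming each image has already been reduced to the set of local descriptors retained after CAM-based filtering. The function name, the descriptor shapes, and the argmin-based prediction are illustrative assumptions, not the exact NBNN formulation used in the paper.

```python
import torch
import torch.nn.functional as F

def semantic_alignment_distance(query_desc, support_desc):
    """Accumulated nearest-neighbor cosine distance between descriptor sets.

    query_desc:   (Nq, D) retained local descriptors of the query image
    support_desc: (Ns, D) pooled descriptors of one support class
    Returns a scalar distance; smaller means the query is closer to the class.
    """
    q = F.normalize(query_desc, dim=1)    # unit-length descriptors for cosine similarity
    s = F.normalize(support_desc, dim=1)
    cos_sim = q @ s.t()                   # (Nq, Ns) pairwise cosine similarities
    cos_dist = 1.0 - cos_sim              # cosine distance
    nearest = cos_dist.min(dim=1).values  # best-aligned support descriptor per query descriptor
    return nearest.sum()                  # accumulate over all query descriptors

# Usage sketch: the query is assigned to the class with the smallest distance.
# distances = torch.stack([semantic_alignment_distance(q_desc, class_desc[c]) for c in range(n_way)])
# prediction = int(distances.argmin())
```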
Result  We conduct extensive experiments to verify the performance. On the miniImageNet dataset, the proposed method outperforms the second-best method by 0.56% and 5.02% under the 1-shot and 5-shot settings, respectively. On the fine-grained Stanford Dogs and Stanford Cars datasets, our method improves by 4.18% and 7.49% under the 1-shot setting and by 16.13% and 5.17% under the 5-shot setting, respectively. On CUB-200-2011, our method also improves by 1.82% under the 5-shot setting. Our approach is applicable to both general few-shot learning and fine-grained few-shot learning. The ablation study demonstrates that filtering the feature descriptors with the SACM-based class activation map improves fine-grained few-shot recognition. Meanwhile, under the same conditions, the proposed semantic alignment distance improves few-shot classification performance compared with the Euclidean distance. Additional visualization illustrates that the proposed SACM can localize the key objects using only image-level label annotations.

Conclusion  Our WSOL-based fine-grained few-shot learning method shows advantages on both general and fine-grained few-shot learning. © 2022 Journal of Image and Graphics.