Abstract

Visual question answering (VQA) is a developing multi-modal learning task that bridges the comprehension of visual content and a textual question to generate a corresponding answer. It has attracted considerable attention from the community and involves the interaction of different modalities, requiring both image perception and textual semantic learning. However, training a VQA model places heavy demands on the dataset: it requires a wide variety of question patterns and a large number of question-answer annotations, with different answers for similar scenes, to ensure the robustness of the model and its generalization ability across modalities. Labeling a VQA dataset is therefore time-consuming and expensive, which has become a bottleneck for the development of VQA. In view of these problems, this paper proposes a contrastive cross-modal representation learning based active learning (CCRL) method for VQA. The key idea of CCRL is to cover more question patterns and to make the distribution of answers more balanced. It consists of a visual question matching evaluation (VQME) module and a visual answer uncertainty estimation (VAUE) module. The VQME module uses mutual information and contrastive predictive coding as constraints to learn the alignment between visual content and question patterns. The VAUE module introduces a label-state learning model: it selects matched question patterns for each image and learns the semantic relationship between cross-modal questions and answers. The model then estimates the uncertainty of an answer from its probability distribution, by which CCRL selects the most informative samples for labeling. In the experiments, this work implements recent active learning algorithms on the VQA task and evaluates them on the VQA-v2 dataset. The results demonstrate that CCRL outperforms previous methods on all question patterns and improves accuracy by 1.65% on average over the state-of-the-art active learning method. With 30% of the samples labeled, CCRL achieves 96% of the performance obtained with 100% labeled data; with 40% labeled, it achieves 97%. This indicates that CCRL selects informative and diverse samples, which greatly cuts down annotation cost while maximizing VQA performance.
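The abstract does not spell out the exact alignment objective, but contrastive predictive coding is typically trained with an InfoNCE loss, which maximizes a lower bound on the mutual information between the two modalities. The PyTorch sketch below illustrates such a loss for image-question alignment; the function name, embedding inputs, and temperature value are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(image_emb: torch.Tensor, question_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings (a sketch,
    not the paper's VQME module).

    Matched (image, question) pairs on the diagonal are positives; every
    other pairing in the batch serves as a negative. Minimizing this loss
    maximizes a lower bound on the mutual information between modalities,
    which is the CPC objective.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    question_emb = F.normalize(question_emb, dim=-1)
    logits = image_emb @ question_emb.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the image-to-question and question-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```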
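The abstract says only that uncertainty is estimated from the answer's probability distribution; Shannon entropy is one common way to turn such a distribution into an acquisition score, so the sketch below uses it as an assumed stand-in. The function names and the `budget` parameter are hypothetical, and the paper's VAUE module may score uncertainty differently.

```python
import torch

def answer_entropy(answer_probs: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Shannon entropy of each predicted answer distribution, shape (N, A) -> (N,).

    A flat distribution (high entropy) means the model is unsure which
    answer is correct, so the sample is informative to label.
    """
    p = answer_probs.clamp_min(eps)  # guard against log(0)
    return -(p * p.log()).sum(dim=-1)

def select_for_labeling(answer_probs: torch.Tensor, budget: int) -> torch.Tensor:
    """Return indices of the `budget` most uncertain unlabeled samples."""
    scores = answer_entropy(answer_probs)
    return scores.topk(budget).indices
```

Selecting the top-scoring samples each round is the standard uncertainty-sampling loop in active learning: label the chosen samples, retrain, and re-score the remaining pool.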

Full text