Abstract

Video question answering (VideoQA) is a typical cross-modal understanding task. Its challenge lies in how to learn appropriate multimodal representations and cross-modal correlations for answer inference. Most existing VideoQA methods focus on the latter, e.g., relationship learning between each video frame or clip and each word. In this work, we devote our attention to advanced feature embedding of both the video and the query. We develop a clustering-based VLAD technique for VideoQA. The novelty of our work is the joint exploitation of temporal aggregation and multimodal correlation. We propose an end-to-end trainable Transformed VLAD embedding network, named TVLAD-Net. TVLAD-Net constructs a differentiable aggregation network module (i.e., a convolutional Residual-less VLAD Block) to generate compact VLAD descriptors (transforming N frames, clips, or words into K compact descriptors, where K < N), and applies multi-head attention to correlate the multimodal RVLAD descriptors. These characteristics eliminate redundant and invalid clues in the feature sequence and ensure diversity through multiple to-be-learned descriptors (corresponding to multiple clustering cells). Specifically, we first argue that a suitable representation should effectively exhibit the potential core semantic clues of sequence data. Based on this principle, we focus on the temporal aggregation of each modality to extract core descriptors of the data. For both videos and questions, we develop a learnable clustering-based Residual VLAD encoder to summarize each entire feature sequence into compact descriptors. Each descriptor can be regarded as a weighted aggregation over the entire feature sequence (a global view of a single modality). Multiple descriptors correspond to viewing the global sequence several times, which ensures rich perspectives of semantic summarization. In this work, we consider the summarization of frame features, clip features, and combined frame & clip features of the video, as well as word features of the question. Second, we construct a unified Transformer-based module to realize multimodal descriptor interaction. To avoid irrelevant or redundant semantics in both the visual and textual descriptors, we leverage the multi-head attention of the Transformer architecture to control the information flow from these descriptors. The proposed transformed VLAD embedding module performs context correlation of both inter-modality and intra-modality. Finally, an answer inference decoder is constructed for each specific question type. The questions in VideoQA can be divided into three types: 1) the multi-choice task, 2) the open counting task, and 3) the open word task. We use the corresponding decoder for each specific question type to infer the final answer. We evaluate TVLAD-Net on three VideoQA benchmark datasets: TGIF-QA, MSVD-QA, and MSRVTT-QA. The experimental results show that the proposed method achieves high answer reasoning accuracy, with a performance improvement of 2% to 5% over existing methods. The main contributions are summarized as follows: 1) by introducing clustering-based VLAD aggregation into a differentiable convolutional network, we refine the original features and generate advanced multimodal descriptors for VideoQA; 2) the multi-head operation in the transformed VLAD embedding ensures context correlation of both inter-modality and intra-modality, so that descriptors with similar or consistent semantics, whether visual or textual, gather together; 3) extensive experiments demonstrate the effectiveness of TVLAD-Net over other approaches on three benchmark datasets. © 2023 Science Press.
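
To make the aggregation step concrete, the following is a minimal PyTorch sketch of a NetVLAD-style soft-assignment block that omits the residual subtraction, in the spirit of the convolutional Residual-less VLAD Block described above. The class name, layer choices, and dimensions are illustrative assumptions, not the authors' released implementation.

```python
# A sketch of residual-less VLAD aggregation: N sequence features are softly
# assigned to K clustering cells and aggregated into K compact descriptors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualLessVLAD(nn.Module):
    def __init__(self, feat_dim: int, num_clusters: int):
        super().__init__()
        # 1x1 convolution predicts a soft assignment of each of the N features
        # to each of the K clustering cells.
        self.assign = nn.Conv1d(feat_dim, num_clusters, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) sequence of frame/clip/word features
        a = self.assign(x.transpose(1, 2))   # (B, K, N) assignment logits
        a = F.softmax(a, dim=1)              # soft cluster membership per feature
        v = torch.bmm(a, x)                  # (B, K, D) weighted aggregation, no residual term
        v = F.normalize(v, p=2, dim=2)       # normalize each descriptor
        return v                             # K compact descriptors, K < N

# Example: summarize N = 64 frame features (D = 512) into K = 8 descriptors.
vlad = ResidualLessVLAD(feat_dim=512, num_clusters=8)
frames = torch.randn(2, 64, 512)
descriptors = vlad(frames)                   # (2, 8, 512)
```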
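The cross-modal correlation step can be illustrated in a similar way: concatenating the visual and textual descriptors and passing them through a standard multi-head self-attention (Transformer encoder) layer lets each head attend both within a modality (intra-modality) and across modalities (inter-modality). This is only a sketch under assumed dimensions, not the exact transformed VLAD embedding module.

```python
# Multi-head attention over the joint set of visual and textual descriptors.
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)

video_desc = torch.randn(2, 8, d_model)   # K_v visual descriptors from the VLAD block
text_desc = torch.randn(2, 4, d_model)    # K_q textual descriptors from the VLAD block

tokens = torch.cat([video_desc, text_desc], dim=1)   # (B, K_v + K_q, d_model)
fused = layer(tokens)                                # context-correlated descriptors

# Split back into modality-specific, context-aware descriptors if needed.
video_ctx, text_ctx = fused[:, :8], fused[:, 8:]
```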
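The three question types are typically handled with separate lightweight heads. The sketch below shows one common choice for each (candidate scoring for multi-choice, regression for counting, vocabulary classification for open-word answers); it is an assumption about plausible decoders, not the paper's exact design.

```python
# Hypothetical decoder heads for the three VideoQA question types.
import torch
import torch.nn as nn

d_model, vocab_size = 512, 4000                     # hypothetical sizes
pooled = torch.randn(2, d_model)                    # pooled joint video-question representation

# 1) Multi-choice: score each candidate answer and pick the highest.
choice_head = nn.Linear(d_model, 1)
candidate_feats = torch.randn(2, 5, d_model)        # per-candidate joint features
scores = choice_head(candidate_feats).squeeze(-1)   # (B, 5)
answer_idx = scores.argmax(dim=1)

# 2) Open counting: regress a number, then round and clip to an assumed valid range.
count_head = nn.Linear(d_model, 1)
count = count_head(pooled).round().clamp(1, 10)

# 3) Open word: classify over a fixed answer vocabulary.
word_head = nn.Linear(d_model, vocab_size)
word_logits = word_head(pooled)                     # (B, vocab_size)
```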

Full text