Abstract

With the rapid development of multimedia technology, large amounts of multimedia data such as text, images, video, and audio have emerged. These multi-source data are heterogeneous in form but interrelated in semantics. By exploiting the rich information in massive heterogeneous multimedia data, multimedia social relation understanding aims to learn the social relations present in multimedia, and thereby to support intelligent business services such as multimedia content understanding, character tracking, and knowledge graph construction. Images and video are important parts of multimedia information, and research on social relation understanding based on image and video information has attracted increasing attention from both academia and industry. In this paper, we summarize recent studies of social relation understanding based on image and video information. We first briefly introduce the research background and the general organization of the paper: relevant definitions, a formal description of the problem, research methods, and the process of social relation understanding. For the relevant concepts, we introduce nine definitions covering nodes, edges, features, networks, and so on. The problem is formalized from two aspects: relation existence judgment and relation type judgment. We then look into the studies of social relation understanding based on image and video information. The process of social relation understanding comprises four parts: data preprocessing, feature extraction, social relation extraction, and research application. We also summarize the similarities and differences between studies of social relation understanding in the image and video areas. Afterwards, we introduce in detail the existing methods for social relation understanding based on both image and video information.
We then analyze the experiments in three parts: evaluation metrics, datasets, and comparison methods for image and video data. Finally, we conclude with the problems and challenges of social relation understanding based on image and video information. In particular, following the development of the technology, we divide existing methods of social relation understanding into seven categories: co-occurrence-based methods, traditional graph-based methods, supervision-based methods, machine learning-based methods, deep learning-based methods, multimodal information-based methods, and GNN-based methods. For both relation existence judgment and relation type judgment, we further classify the methods into two categories by the number of relations: single relation and multiple relations. In the experiments part, we summarize five evaluation metrics: accuracy, precision, recall, F1, and mAP. We then introduce the image and video datasets related to social relation understanding from recent years. For the experiments based on image information, we choose the PISC and PIPA datasets for method comparison; for those based on video information, we choose the SRIV and ViSR datasets. Moreover, we analyze the advantages and disadvantages of the existing methods based on the experimental results. Finally, we summarize the problems and challenges from seven aspects: small sample learning, multi-source data fusion, unsupervised social relation understanding, multi-role different relation recognition, efficient relation understanding algorithms, real-time data feedback, and multimedia knowledge graphs. The aim of this paper is to outline the research scope of social relation understanding based on image and video information, to help researchers gain a quick understanding of the field, and to promote further development in this area.
© 2021, Science Press. All rights reserved.
