Abstract
Objective: With the development of deep learning, image and video manipulation is becoming easier to use and harder to distinguish. Deepfake, a family of face manipulation techniques, poses a serious threat to social security and individual rights. Researchers have proposed various detection models and frameworks, which can be divided into three categories according to their inputs: frame level, clip level, and video level. Frame-level detection models focus on a single frame and ignore temporal information entirely, which can lead to low confidence when detecting videos. Although clip-level detection models use a sequence of frames simultaneously, the sequence is much shorter than the full video, so a clip cannot represent a video well. Moreover, video clips are fragmented, which may harm video-level detection, and consecutive frames within a short clip differ little from one another, producing redundant information that can degrade detection performance. Video-level detection methods take frames sampled at large intervals as input and capture more key features to represent the whole video. However, existing methods ignore the cost of the sampling procedure itself, in particular the expensive computation of decoding the video stream. To address this problem and provide a more efficient detection method for face-swap manipulation videos, we present a detection framework based on the interaction of key-frame features.

Method: The proposed framework consists of two parts: key frame extraction with face region cropping, and the detection model. First, a number of key frames are extracted from the video stream and validated. Because key frames can be decoded independently, inter-frame decoding is avoided and computation time is reduced. Next, a multi-task cascaded convolutional neural network (MTCNN) locates the face region in each extracted frame, and face images are cropped with a margin of 80 pixels; MTCNN is then applied a second time to these crops to obtain compact face images. The face images are mapped into a high-dimensional embedding space by Inception-ResNet-V1, a convolutional neural network initialized with parameters pre-trained on a face recognition task and updated end-to-end during training. Finally, the key-frame features are fed into an interaction learning module composed of several self-attention-based encoders. In this module, each key-frame feature can attend to every other key-frame feature and update itself, and the linear and non-linear transformations extract the distinctive abnormal features of manipulated images. A global classification vector, concatenated in front of the key-frame features and updated along with them, makes the final decision.
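A minimal sketch of this preprocessing pipeline is given below, assuming PyAV for key-frame-only decoding and the facenet-pytorch implementations of MTCNN and Inception-ResNet-V1. Helper names, crop sizes, and the frozen embedder are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch under stated assumptions: PyAV decodes only key frames (I-frames);
# facenet-pytorch supplies MTCNN and Inception-ResNet-V1.
import av
import torch
from PIL import Image
from facenet_pytorch import MTCNN, InceptionResnetV1

def extract_key_frames(video_path, max_frames=16):
    """Decode I-frames only; skipping P/B frames avoids inter-frame decoding."""
    container = av.open(video_path)
    stream = container.streams.video[0]
    stream.codec_context.skip_frame = "NONKEY"   # decoder discards non-key frames
    frames = []
    for frame in container.decode(stream):
        frames.append(frame.to_image())          # PIL.Image
        if len(frames) == max_frames:
            break
    container.close()
    return frames

device = "cuda" if torch.cuda.is_available() else "cpu"
# Pass 1: loose crop with an 80-pixel margin around the detected box.
mtcnn_loose = MTCNN(image_size=256, margin=80, post_process=False, device=device)
# Pass 2: re-detect inside the loose crop to obtain a compact face image.
mtcnn_tight = MTCNN(image_size=160, margin=0, device=device)
# Embedding network initialized from a face recognition task; the paper
# fine-tunes it end-to-end, whereas this sketch keeps it frozen.
embedder = InceptionResnetV1(pretrained="vggface2").eval().to(device)

def embed_key_frames(video_path):
    feats = []
    for img in extract_key_frames(video_path):
        loose = mtcnn_loose(img)                 # float tensor (3, 256, 256) or None
        if loose is None:
            continue
        loose_img = Image.fromarray(loose.permute(1, 2, 0).byte().numpy())
        tight = mtcnn_tight(loose_img)           # normalized (3, 160, 160) or None
        if tight is None:
            continue
        with torch.no_grad():
            feats.append(embedder(tight.unsqueeze(0).to(device)))  # (1, 512)
    return torch.cat(feats) if feats else None   # (num_valid_frames, 512)
```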
Result: The detection framework is evaluated on five mainstream datasets: Deepfakes, FaceSwap, FaceShifter, DeepFakeDetection, and Celeb-DF, where the first three are from FaceForensics++. It achieves accuracies of 97.50%, 97.14%, 96.79%, 97.09%, and 98.64%, respectively, with a small number of key frames. With 16 key frames as input, the proposed model is compared on Celeb-DF with original 3D convolution models and LSTM-based models, as well as a lightweight 3D model for deepfake detection (L3D). Because the sample size is smaller than in existing work, R3D, C3D, I3D, and L3D show poor detection performance, while the LSTM-based model achieves an accuracy of 98.06%; the proposed model performs considerably better, reaching 99.61%. When the input is changed to consecutive frames, the proposed model still performs well, at 98.64%. In terms of time cost, our framework detects a video in an average of 3.17 s, faster than most compared models and than the same model with consecutive frames as input. These results show that the key frame extraction strategy and the proposed framework are efficient. A realistic scenario is also considered, in which the number of key frames in a test video differs from that used in training. Slightly more frames than in training can yield higher accuracy, because the detection model has learned the relations among frames well and generalizes well, whereas fewer frames provide insufficient information and degrade performance. In general, the proposed model achieves good and stable detection performance when trained with 16 key frames.

Conclusion: An efficient detection framework for face-swap manipulation videos is presented. By exploiting key frame extraction, it skips inter-frame decoding and saves time in the preprocessing step. Face region images are cropped from valid key frames, and Inception-ResNet-V1 maps them into a standardized embedding space, followed by several layers of self-attention-based encoders and linear or non-linear transformations. Because every frame feature can learn from the others, more meaningful and discriminative information is captured. Experiments on the Celeb-DF dataset demonstrate that the proposed model outperforms sequential models and 3D convolutional neural networks, while the time cost is reduced and the efficiency of the framework is improved.
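The interaction learning module can be sketched with standard transformer encoder layers, where a learnable classification vector prepended to the key-frame features plays the role of the global decision vector. The layer count, head count, and feed-forward width below are assumptions, not the paper's reported configuration.

```python
# Illustrative interaction-learning module: frame embeddings attend to one
# another through stacked self-attention encoders; a learnable classification
# vector prepended to the sequence drives the real/fake decision.
import torch
import torch.nn as nn

class FrameInteraction(nn.Module):
    def __init__(self, dim=512, heads=8, depth=4):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))  # global vector
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2048, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 2)             # real vs. manipulated

    def forward(self, frame_feats):               # (B, N, 512) key-frame features
        b = frame_feats.size(0)
        cls = self.cls_token.expand(b, -1, -1)    # one class vector per video
        x = torch.cat([cls, frame_feats], dim=1)  # (B, N+1, 512)
        x = self.encoder(x)                       # every frame attends to all others
        return self.head(x[:, 0])                 # decision from the class vector

# Usage: logits = FrameInteraction()(torch.randn(2, 16, 512))  # (2, 2)
```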