Chinese Lipreading Based on Vision Transformer

Authors: Xue Feng*; Hong Zikun; Li Shujie; Li Yu; Xie Yincen
Source: Pattern Recognition and Artificial Intelligence, 2022, 35(12): 1111-1121.
DOI: 10.16451/j.cnki.issn1003-6059.202212006

Abstract

Lipreading is a multimodal task that converts videos of a speaker's lips into text, aiming to understand what the speaker expresses in the absence of sound. Existing lipreading methods adopt convolutional neural networks to extract visual features of the lips, capturing only short-distance pixel relationships and therefore struggling to distinguish the lip shapes of similarly pronounced characters. To capture long-distance relationships between pixels in the lip region of the video frames, an end-to-end Chinese sentence-level lipreading model based on Vision Transformer (ViT) is proposed. Fusing ViT with the Gated Recurrent Unit (GRU) improves the model's ability to extract visual spatio-temporal features from lip videos. Firstly, the global spatial features of lip images are extracted by the self-attention module of ViT. Then, GRU models the temporal sequence of frames. Finally, a cascaded attention-based sequence-to-sequence model predicts the Chinese pinyin and Chinese character utterances. Experimental results on the Chinese lipreading dataset CMLR show that the proposed model achieves a lower Chinese character error rate. © 2022 Journal of Pattern Recognition and Artificial Intelligence.
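The abstract's key contrast is that convolutions see only short-distance pixel neighborhoods, while ViT's self-attention lets every lip-image patch attend to every other patch in one step. The following is a minimal NumPy sketch of that self-attention core, not the paper's implementation: all names, dimensions, and the single-head setup are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(patches, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over patch embeddings.

    patches: (N, d) array, one row per flattened lip-image patch.
    Every output row is a weighted mix of ALL patches, so relationships
    between distant regions of the lip image are captured directly.
    """
    Q, K, V = patches @ Wq, patches @ Wk, patches @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (N, N) patch-to-patch affinities
    return softmax(scores, axis=-1) @ V       # (N, d) globally contextualized features

# Toy example: a 4x4 grid of patches from one lip frame (N=16, d=32 assumed).
rng = np.random.default_rng(0)
N, d = 16, 32
patches = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(patches, Wq, Wk, Wv)
print(out.shape)  # one d-dimensional vector per patch
```

In the full model described by the abstract, these per-frame patch features would then be pooled per frame and fed as a sequence into a GRU for temporal modeling.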

Full Text