Abstract
With the rapid development of the internet and mobile internet, many emerging applications, such as signboard recognition and autonomous driving, rely heavily on rich text information in natural scenes. The analysis and processing of scene text therefore plays an essential role in these applications and has increasingly become a research hotspot in computer vision. Traditional text detection and recognition methods often rely on manually designed features; they involve a large amount of computation, are inefficient, and generalize poorly to complex scenes. With the development of deep learning in recent years, convolutional neural networks have brought great progress to scene text detection and recognition. These deep learning-based methods outperform traditional ones by a large margin and have become the mainstream for text reading in the wild.

For scene text detection, existing methods can be divided into two categories according to their target objects: top-down methods and bottom-up methods. Top-down methods mainly inherit the basic idea of general object detection or instance segmentation and directly regress the entire bounding box of a text instance. In contrast, bottom-up methods, following the idea of traditional approaches, first detect components of a text instance and then group them into instances through certain rules (a minimal grouping sketch is given below). Bottom-up methods are more effective than top-down methods at detecting text of arbitrary shapes and orientations, and they are less sensitive to text scale. However, grouping the detected components into different text instances requires complex design and post-processing, which makes the inference stage of bottom-up approaches inefficient. These methods also encounter difficulties when detecting long text, and text conglutination (adjacent instances sticking together) occurs when detecting dense text. Top-down methods do not have these issues and can achieve higher precision for text detection.

In recent years, recognizing text in natural scenes, also known as scene text recognition (STR), has aroused great interest in academia and industry. The objective of STR is to translate a cropped text instance image into a target string sequence. Although optical character recognition (OCR) on scanned documents is well developed, STR remains challenging due to many factors, such as complex backgrounds, various fonts, and imperfect imaging conditions. Early work relied on hand-crafted features, such as histogram of oriented gradients (HOG) descriptors, connected components, and the stroke width transform, but the performance of these approaches is limited by the low representational capability of such features. With the development of deep learning, the community has witnessed substantial advancements. Deep learning-based STR approaches can be roughly divided into two branches: segmentation-based approaches and segmentation-free approaches. Segmentation-based approaches attempt to locate each character in the input text instance image, apply a character classifier to recognize each character, and then group the characters into text lines to obtain the final recognition result. Segmentation-free approaches treat the text instance image as a whole and directly map it into a target string sequence.
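To make the bottom-up detection paradigm described above concrete, the following is a minimal, illustrative sketch of how detected character or component boxes might be greedily linked into text lines. The names (`Box`, `group_components`) and the distance/height thresholds are hypothetical choices for illustration only, not the grouping rule of any specific published method.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def cx(self): return (self.x1 + self.x2) / 2
    @property
    def cy(self): return (self.y1 + self.y2) / 2
    @property
    def h(self): return self.y2 - self.y1

def group_components(boxes: List[Box], dist_ratio: float = 1.5, height_ratio: float = 1.5):
    """Greedily link component boxes (scanned left to right) into text lines."""
    lines: List[List[Box]] = []
    for b in sorted(boxes, key=lambda box: box.x1):
        for line in lines:
            last = line[-1]
            close = abs(b.cx - last.cx) < dist_ratio * max(b.h, last.h)   # horizontal proximity
            aligned = abs(b.cy - last.cy) < 0.5 * min(b.h, last.h)        # vertical alignment
            similar = max(b.h, last.h) / min(b.h, last.h) < height_ratio  # comparable heights
            if close and aligned and similar:
                line.append(b)
                break
        else:  # no existing line accepts this box: start a new text line
            lines.append([b])
    return lines

# Three toy components: the first two are adjacent and form one line, the third is far away.
chars = [Box(10, 10, 20, 30), Box(22, 11, 32, 31), Box(200, 12, 212, 32)]
print(len(group_components(chars)))  # 2
```

This kind of rule-based grouping and post-processing is exactly what makes bottom-up inference slower and harder to tune than direct box regression.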
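As an illustration of the segmentation-free recognition branch, the sketch below follows the widely used CRNN-style recipe: a CNN extracts features, a bidirectional LSTM models the feature columns as a sequence, and greedy CTC-style decoding collapses per-timestep predictions into a string. The layer sizes, input resolution, and charset are arbitrary assumptions for demonstration, not a reference implementation of any particular paper.

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Minimal CRNN-style recognizer: CNN features -> BiLSTM -> per-timestep logits."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1), (2, 1)),
        )
        self.rnn = nn.LSTM(128 * 8, 256, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 256, num_classes)  # classes include a CTC blank at index 0

    def forward(self, x):                               # x: (N, 1, 32, W) grayscale crops
        f = self.cnn(x)                                 # (N, 128, 8, W // 2)
        n, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(n, w, c * h)  # image width becomes the time axis
        out, _ = self.rnn(f)
        return self.fc(out)                             # (N, T, num_classes)

def ctc_greedy_decode(logits, charset, blank=0):
    """Collapse repeated predictions and drop the blank symbol (greedy CTC decoding)."""
    texts = []
    for seq in logits.argmax(dim=-1).tolist():
        chars, prev = [], blank
        for i in seq:
            if i != blank and i != prev:
                chars.append(charset[i - 1])
            prev = i
        texts.append("".join(chars))
    return texts

charset = "0123456789abcdefghijklmnopqrstuvwxyz"        # assumed alphabet
model = TinyCRNN(num_classes=1 + len(charset))          # +1 for the CTC blank
logits = model(torch.randn(2, 1, 32, 128))              # two random 32x128 crops
print(logits.shape)                                     # torch.Size([2, 64, 37])
print(ctc_greedy_decode(logits, charset))               # two (meaningless) decoded strings
```

Note that no character position is ever predicted; the whole image is mapped to a sequence, which is precisely what distinguishes this branch from segmentation-based recognition.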
Both branches have their own advantages and limitations, so practitioners should choose the best trade-off according to the needs of different application scenarios. Although the practicality and efficiency of recognition approaches have improved significantly over the past decades, further research is still required on the generalization ability, evaluation protocols, and application scenarios of STR.

Finally, end-to-end scene text spotting aims to combine text detection and text recognition into a unified system that can be optimized in a single pipeline. Bridging the gap between the detection branch and the recognition branch is the most essential problem in designing an end-to-end text spotting system. Similar to general object detection and instance segmentation, end-to-end text spotting methods can be divided into two categories: two-stage methods and one-stage methods. Two-stage methods are mainly based on Faster R-CNN (region-based convolutional neural network) and Mask R-CNN, in which region of interest (RoI) pooling/align acts as a bridge between the two branches. However, these operations may lose information because the region proposals from the region proposal network (RPN) are not sufficiently accurate. One-stage methods follow a detection-then-recognition pipeline, in which various feature alignment operations are carefully designed to strengthen the link between the detection and recognition branches.

We review and summarize scene text detection and recognition methods, elaborate on the basic ideas of representative approaches, and analyze their advantages and disadvantages, aiming to provide a reference for researchers and to support future work.
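As a small illustration of the RoI pooling/align bridge used by two-stage spotters, the snippet below applies torchvision's `roi_align` to crop a fixed-size feature patch for each text proposal from a shared backbone feature map. The feature shape, stride, proposal coordinates, and pooled size are assumptions for demonstration only.

```python
import torch
from torchvision.ops import roi_align

# Shared backbone feature map with an assumed stride of 16 w.r.t. the input image.
feat = torch.randn(1, 256, 64, 64)                      # (N, C, H / 16, W / 16)

# Hypothetical text proposals from the detection branch, in image coordinates (x1, y1, x2, y2).
proposals = [torch.tensor([[100.0, 200.0, 400.0, 260.0],
                           [ 50.0,  80.0, 300.0, 130.0]])]

# RoI-Align samples a fixed-size grid per proposal; a short, wide output (8 x 32 here) is a
# common choice for roughly horizontal text before feeding a sequence recognition head.
text_feats = roi_align(feat, proposals, output_size=(8, 32),
                       spatial_scale=1.0 / 16, aligned=True)
print(text_feats.shape)                                 # torch.Size([2, 256, 8, 32])
```

The quality of this bridge depends on how well the proposals and the fixed sampling grid match the true text region, which is why imprecise RPN proposals can lose information, as noted above.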