Abstract
Objective: Single-image super-resolution (SISR) reconstruction is a classic problem in computer vision. SISR aims to reconstruct a high-resolution (HR) image from a single low-resolution (LR) image. Image super-resolution (SR) technology is now widely used in medical imaging, satellite remote sensing, video surveillance, and other fields. However, SR is an inherently ill-posed problem: many HR images are consistent with the same LR input. Many SISR methods have been proposed to address it, including interpolation-based and reconstruction-based methods, but at large upscaling factors their performance drops sharply and the reconstructed results are poor. With the rise of deep learning, deep convolutional neural networks have also been applied to this problem, and researchers have proposed a series of models that achieve significant progress. As understanding of deep learning techniques has matured, researchers have found that deeper networks yield better results than shallow ones, yet an excessively deep network can suffer from exploding or vanishing gradients, which make the model untrainable and prevent it from reaching its best results through training. In recent years, most deep-learning-based networks for single-image SR reconstruction adopt single-scale convolution kernels, generally using a 3×3 kernel for feature extraction. Although single-scale kernels can extract substantial detailed information, these algorithms usually ignore the different receptive field sizes produced by different kernel sizes. Receptive fields of different sizes make the network attend to different features; using only a 3×3 kernel therefore causes the network to ignore the macroscopic relations between different feature maps.
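The receptive-field argument can be made concrete: for a chain of stride-1 convolutions, each k×k layer grows the receptive field by k−1, so a 5×5 branch sees a wider neighborhood than a 3×3 branch of equal depth. A minimal sketch (the function name is illustrative, not from the paper):

```python
def receptive_field(kernel_sizes):
    """Receptive field of a chain of stride-1, dilation-1 convolutions:
    starting from a single pixel, each k x k layer adds k - 1."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

# Three stacked 3x3 convolutions see a 7x7 neighborhood,
# while three stacked 5x5 convolutions see 13x13.
print(receptive_field([3, 3, 3]))  # 7
print(receptive_field([5, 5, 5]))  # 13
```

This is why mixing 3×3 and 5×5 branches lets a network of the same depth capture both fine local detail and more global context.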
Considering these problems, this study proposes a multi-level perception network that draws on GoogLeNet, residual networks, and densely connected convolutional networks. Method: First, a feature extraction module, consisting of two 3×3 convolution layers, extracts low-frequency image features from the input; its output is fed to multiple densely connected multi-level perception modules. Each multi-level perception module combines 3×3 and 5×5 convolution kernels: the 3×3 kernel is responsible for extracting detailed feature information, and the 5×5 kernel for extracting global feature information. Second, the multi-level perception module is divided into shallow multi-level feature extraction, deep multi-level feature extraction, and a tandem compression unit. The shallow multi-level feature extraction consists of a 3×3 chain convolution and a 5×5 chain convolution; the former extracts fine local feature information from shallow features, whereas the latter extracts global features from shallow features. The deep multi-level feature extraction has the same structure: its 3×3 chain convolution extracts fine local feature information from deep features, and its 5×5 chain convolution extracts global feature information from deep features. In the tandem compression unit, the global feature information of the shallow features, the fine local and global feature information of the deep features, and the initial input are concatenated and then compressed to the same dimension as the input. In this way, both the low-level and high-level features of the image are preserved, and the macroscopic relations between features are retained. Finally, the reconstruction module obtains the final output by combining the upscaled image with the residual image.
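As a rough sketch of the tandem compression unit's role, concatenation along the channel axis followed by a 1×1 convolution (a per-pixel linear map) compresses the stacked branch outputs back to the input channel count. All shapes, names, and weights below are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def tandem_compress(features, weight):
    """Concatenate feature maps along the channel axis, then apply a
    1x1 convolution -- i.e., the same linear map at every pixel -- to
    shrink the channel count. features: list of (C_i, H, W) arrays;
    weight: (C_out, sum(C_i)) array."""
    x = np.concatenate(features, axis=0)               # (C_total, H, W)
    return np.tensordot(weight, x, axes=([1], [0]))    # (C_out, H, W)

rng = np.random.default_rng(0)
shallow_global = rng.standard_normal((64, 8, 8))       # hypothetical branch outputs
deep_local     = rng.standard_normal((64, 8, 8))
initial_input  = rng.standard_normal((64, 8, 8))
w = rng.standard_normal((64, 192)) / np.sqrt(192)      # 1x1-conv weights
out = tandem_compress([shallow_global, deep_local, initial_input], w)
print(out.shape)  # (64, 8, 8)
```

The 1×1 compression keeps the module's output compatible with dense connections: every module receives and emits the same channel count regardless of how many branches are concatenated.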
This study adopts the DIV2K dataset, which consists of 800 high-definition images, each with approximately 2 million pixels. To make full use of these data, the images are augmented by random rotation of 90°, 180°, or 270° and by horizontal flipping. Result: The reconstructed results are evaluated with the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) index and compared with several state-of-the-art SR reconstruction methods. At a scaling factor of 2, the PSNRs of the proposed algorithm on four benchmark test sets (Set5, Set14, Berkeley Segmentation Dataset (BSD100), and Urban100) are 37.8511 dB, 33.9338 dB, 32.2191 dB, and 32.1489 dB, respectively, all higher than those of the compared methods. Conclusion: Compared with other algorithms, the proposed convolutional network model better accounts for the receptive field and fully extracts hierarchical features at different levels through multi-scale convolution. At the same time, the model exploits the structural feature information of the LR image itself to complete the reconstruction, and good reconstructed results can be obtained with this model.
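The augmentation and evaluation steps described above can be sketched with NumPy; the peak value of 255 and the helper names are assumptions for illustration:

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def augment(img, rng):
    """Random 0/90/180/270-degree rotation plus an optional
    horizontal flip, as used to enlarge the training set."""
    img = np.rot90(img, k=rng.integers(0, 4))
    if rng.integers(0, 2):
        img = np.fliplr(img)
    return img

img = np.arange(16, dtype=np.uint8).reshape(4, 4)
print(psnr(img, img))  # inf
```

SSIM is omitted here for brevity; in practice both metrics are usually computed on the luminance (Y) channel of the reconstructed image.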