Abstract
Objective: Semantic segmentation, a challenging task in computer vision, aims to assign a semantic class label to every pixel in an image. It is widely applied in many fields, such as autonomous driving, obstacle detection, medical image analysis, 3D geometry and environment modeling, indoor environment reconstruction, and 3D semantic segmentation. Despite the many achievements in semantic segmentation, two challenges remain: 1) the lack of rich multiscale information and 2) the loss of spatial information. To capture rich multiscale information and extract abundant spatial information, a new semantic segmentation model is proposed that greatly improves segmentation results. Method: The new model is built on an encoder-decoder structure, which effectively promotes the fusion of high-level semantic information and low-level spatial information. The details of the entire architecture are as follows: First, in the encoder, the ResNet-101 network is used as the backbone to capture feature maps. In ResNet-101, the last two blocks use atrous convolutions with rate=2 and rate=4, which reduce the loss of spatial resolution. A multiscale information fusion module is designed in the encoder to capture feature maps with rich multiscale and discriminative information in the deep stage of the network. In this module, following the expansion and stacking principle, Kronecker convolutions are arranged in a parallel structure to expand the receptive field for extracting multiscale information. A global attention module is applied to selectively highlight discriminative information in the feature maps captured by the Kronecker convolutions. Subsequently, a spatial information-capturing module is introduced as the decoder in the shallow stage of the network.
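The receptive-field arithmetic behind the atrous (dilated) convolutions mentioned above can be illustrated with a minimal 1-D sketch. This is not the paper's implementation; the function name and toy signal are illustrative only. Spacing the kernel taps `rate` samples apart enlarges the effective receptive field from k taps to (k-1)*rate+1 samples without adding parameters, which is how rate=2 and rate=4 preserve resolution while widening context.

```python
import numpy as np

def dilated_conv1d(x, kernel, rate):
    """1-D dilated (atrous) convolution sketch: kernel taps are spaced
    `rate` samples apart, enlarging the receptive field without adding
    parameters. 'Valid' boundary handling, no padding."""
    k = len(kernel)
    span = (k - 1) * rate + 1  # effective receptive field in samples
    return np.array([
        sum(kernel[j] * x[i + j * rate] for j in range(k))
        for i in range(len(x) - span + 1)
    ])

signal = np.arange(10, dtype=float)  # toy input
kernel = np.array([1.0, 1.0, 1.0])   # 3-tap kernel (unnormalized)

# rate=1 spans 3 samples, rate=2 spans 5, rate=4 spans 9
y1 = dilated_conv1d(signal, kernel, rate=1)
y2 = dilated_conv1d(signal, kernel, rate=2)
y4 = dilated_conv1d(signal, kernel, rate=4)
```

In 2-D the same idea applies along both spatial axes; the Kronecker convolutions of the paper generalize this tap-spreading pattern further.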
The spatial information-capturing module supplies abundant spatial information, compensating for the spatial resolution loss caused by the repeated combination of max-pooling and striding at consecutive layers in ResNet-101. Moreover, the spatial information-capturing module plays an important role in strengthening the relationships between widely separated spatial regions. The feature maps with rich multiscale and discriminative information captured by the multiscale information fusion module in the deep stage and the feature maps with abundant spatial information captured by the spatial information-capturing module are fused to obtain a new set of feature maps rich in effective information. Afterward, a multikernel convolution block is used to refine these feature maps. In the multikernel convolution block, two convolutions with kernel sizes of 3×3 and 5×5 are applied in parallel. The feature maps refined by the multikernel convolution block are fed to a data-dependent upsampling (DUpsampling) operator to obtain the final prediction feature maps. Bilinear-interpolation upsampling is replaced with DUpsampling because DUpsampling not only exploits the redundancy in the segmentation label space but also effectively recovers the pixel-wise prediction. Arbitrary low-level feature maps can thus be safely downsampled to the lowest feature-map resolution and then fused to produce the final prediction. Result: To prove the effectiveness of the proposed modules, extensive experiments are conducted on two public datasets: PASCAL VOC 2012 and Cityscapes.
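The DUpsampling step can be sketched as a single linear projection. In the following toy sketch (assumptions, not the paper's code: the projection matrix W, the feature sizes, and the class count are placeholders; in the real model W is learned by compressing ground-truth label patches), each C-dimensional feature vector at a low-resolution location is mapped to an r×r block of N class scores, recovering the full-resolution prediction in one step rather than by bilinear interpolation.

```python
import numpy as np

def dupsample(features, W, r):
    """DUpsampling sketch: project each C-dim low-resolution feature
    vector through W (C, r*r*N) and rearrange the result into an
    r-by-r spatial block of N-class scores."""
    h, w, C = features.shape
    N = W.shape[1] // (r * r)
    proj = features.reshape(h * w, C) @ W          # (h*w, r*r*N)
    proj = proj.reshape(h, w, r, r, N)
    # interleave the r x r blocks back into the spatial grid
    return proj.transpose(0, 2, 1, 3, 4).reshape(h * r, w * r, N)

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 4, 16))   # toy 4x4 low-res features, C=16
W = rng.standard_normal((16, 2 * 2 * 5))  # assumed upsampling ratio r=2, 5 classes
pred = dupsample(feats, W, r=2)           # full-resolution class scores
```

Because the whole operator is linear, low-level feature maps can equivalently be downsampled to the lowest resolution before fusion, as the abstract notes.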
We first conduct several ablation studies on the PASCAL VOC 2012 dataset to evaluate the effectiveness of each module and then perform several contrast experiments on the PASCAL VOC 2012 and Cityscapes datasets against existing approaches, such as FCN (fully convolutional network), FRRN (full-resolution residual network), DeepLabv2, CRF-RNN (conditional random fields as recurrent neural networks), DeconvNet, GCRF (Gaussian conditional random field network), DeepLabv2-CRF, Piecewise, Dilation10, DPN (deep parsing network), LRR (Laplacian reconstruction and refinement), and RefineNet, to verify the effectiveness of the entire architecture. On the Cityscapes dataset, our model achieves 0.52%, 3.72%, and 4.42% mIoU improvements over the RefineNet, DeepLabv2-CRF, and LRR models, respectively. On the PASCAL VOC 2012 dataset, our model achieves 6.23%, 7.43%, and 8.33% mIoU improvements over the Piecewise, DPN, and GCRF models, respectively. Several visualization results from our model on the Cityscapes and PASCAL VOC 2012 datasets demonstrate the superiority of the proposed modules. Conclusion: Experimental results show that our model outperforms several state-of-the-art approaches and can dramatically improve semantic segmentation results. The model has great application value in many fields, such as medical image analysis, autonomous driving, and unmanned aerial vehicles. © 2020, Editorial and Publishing Board of Journal of Image and Graphics. All rights reserved.