Abstract

Objective In the semantic segmentation of high-resolution remote sensing images, it is difficult to distinguish regions with similar spectral features (such as lawns and trees, or roads and buildings) using visible images alone, owing to their single viewing angle. Most existing neural-network-based methods focus on extracting spectral and contextual features through a single encoder-decoder network, while geometric features are often not fully exploited. Introducing elevation information can improve classification results significantly. However, the feature distributions of visible images and elevation data differ considerably. Simply cascading the features of multiple modal streams fails to exploit the complementary information of multimodal data at the early, intermediate, and late stages of the network, and naive fusion by concatenation or addition cannot suppress the noise introduced by multimodal fusion, which degrades the results. In addition, high-resolution remote sensing images usually cover a large area, and the target objects vary widely in size and are unevenly distributed. Recent research has therefore explored modeling long-range relationships to extract contextual features. Method We propose a multi-source feature adaptive fusion network (MSFAFNet). To dynamically recalibrate the scene-context feature maps, a modal adaptive fusion block explicitly models the correlations between the feature maps of the two modalities. To reduce the influence of fusion noise and exploit the complementary information of multimodal data effectively, modal features are fused dynamically according to the target categories and the contextual information of each pixel. Meanwhile, a global context aggregation module improves the representation ability of the fully convolutional network by modeling long-range relationships between pixels.
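The modal adaptive fusion described above can be pictured as a channel-wise gating of the spectral features by statistics derived from the elevation branch. The following is a minimal NumPy sketch in a squeeze-and-excitation style; the function name, the two-layer gate, its weight shapes, and the residual combination are illustrative assumptions, not the paper's exact block design.

```python
import numpy as np

def modal_adaptive_fusion(spectral, elevation, w1, w2):
    """Recalibrate spectral features with elevation-derived channel weights.

    spectral, elevation: (C, H, W) feature maps from the two encoders.
    w1: (C // r, C) and w2: (C, C // r) weights of a small gating MLP
    (hypothetical reduction ratio r); both are illustrative parameters.
    """
    # Squeeze: global average pooling over the elevation branch -> (C,)
    z = elevation.mean(axis=(1, 2))
    # Excitation: two-layer gate, ReLU then sigmoid, per-channel weights in (0, 1)
    h = np.maximum(0.0, w1 @ z)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ h)))
    # Recalibrate spectral channels, then fuse the two modalities residually
    return spectral * gate[:, None, None] + elevation
```

Gating by elevation statistics, rather than concatenating the raw modalities, lets uninformative elevation channels be suppressed instead of injecting fusion noise.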
Our model consists of three components: 1) a dual encoder extracts the features of the spectral modality and the elevation modality; 2) a modal adaptive fusion block enhances the spectral features dynamically with elevation information; 3) a global context aggregation module models the global context from both the spatial and the channel perspective. Result Our efficient unimodal segmentation architecture (EUSA) is evaluated on the International Society for Photogrammetry and Remote Sensing (ISPRS) Vaihingen and Gaofen Image Dataset (GID) validation sets, reaching overall accuracies of 90.64% and 82.1%, respectively. Specifically, on the ISPRS Vaihingen test set, EUSA improves overall accuracy by 1.55% and mean intersection over union by 3.05% over the baseline while introducing only a small number of additional parameters and little extra computation. The proposed modal adaptive fusion block increases overall accuracy by 1.32% and mean intersection over union by 2.33% on the ISPRS Vaihingen test set. MSFAFNet performs favorably on the ISPRS Vaihingen test set, achieving an overall accuracy of 90.77%. Conclusion Our experimental results show that the efficient unimodal segmentation framework EUSA can model long-range contextual relationships between pixels. To improve the segmentation of regions in shadow or with similar textures, the proposed MSFAFNet extracts more effective features from elevation information.
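The spatial side of the global context aggregation module can be sketched as non-local-style attention, where each pixel aggregates features from all other pixels weighted by pairwise affinity. The NumPy sketch below illustrates the idea under that assumption; it omits the learned projections and the channel-perspective branch of the actual module.

```python
import numpy as np

def spatial_context_aggregation(x):
    """Aggregate global spatial context via pairwise pixel affinities
    (a non-local-attention-style sketch, not the paper's exact module).

    x: (C, H, W) feature map.
    """
    C, H, W = x.shape
    flat = x.reshape(C, H * W)                     # (C, N), N = H * W pixels
    # Pairwise affinity between every pair of pixel feature vectors
    logits = flat.T @ flat                         # (N, N)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)        # row-wise softmax
    # Each pixel becomes an affinity-weighted sum over all pixels,
    # and a residual connection preserves the local response
    out = flat @ attn.T                            # (C, N)
    return x + out.reshape(C, H, W)
```

Because every pixel attends to the whole image, this kind of aggregation captures long-range relationships that stacked local convolutions reach only slowly, which matters for the large, unevenly distributed objects noted above.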

Full text