Abstract

Non-contiguous data communication refers to transfers in which the sender moves multiple data blocks located at non-contiguous addresses into multiple memory regions at non-contiguous addresses on the receiver. This communication pattern is common in scientific computing applications such as numerical solvers, FFT computation, and fluid dynamics simulation. These applications perform transfers such as matrix transposes, submatrices of 2D, 3D, and 4D matrices, and unstructured data accesses. The performance of non-contiguous data communication therefore has a significant impact on many scientific computing applications. Several offloading and non-offloading methods currently exist for non-contiguous data communication, but to our knowledge they have not been measured on a single platform, and no guideline has been proposed indicating which situations each method suits. This paper summarizes the implementation methods of non-contiguous data communication and analyzes their performance. The main contributions are: (1) a comprehensive summary of the implementation methods; (2) a series of detailed performance experiments that apply the different methods to dependable micro-benchmarks and applications; (3) a comparative analysis and practical conclusions for our experimental platform; (4) potential problems and research directions for future work on non-contiguous data communication. First, this paper summarizes the current implementation methods. The non-offloading methods consist mainly of manual copying and callable interfaces such as MPI DDT (Message Passing Interface Derived Data Type), both of which rely on data movement in memory. The offloading methods include several implementations that use RDMA (Remote Direct Memory Access) technology to reduce data copying to varying degrees.
We then use both existing and self-designed benchmarks to measure the performance of non-contiguous data communication for each of the methods summarized. All experiments are run on the same experimental platform to allow comparison and analysis. We also give a fine-grained analysis of the overhead of data copying and RDMA communication under different data distributions. In particular, we compare offloading based on the RDMA sg_list (scatter/gather list) with offloading based on UMR (User-mode Memory Registration), and identify the applicable situations and potential problems of the various methods of non-contiguous data communication. We provide a data table as a guideline for non-contiguous data communication on our platform. We find that RDMA offloading methods do outperform memory copy when the block size is large, but some problems remain: the low efficiency of the UMR MTT (Memory Translation Table) may degrade performance when the number of blocks becomes large. Finally, this paper verifies the analysis results through micro-application experiments and proposes optimization directions for the related technology.
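
As an illustration of the manual-copy (non-offloading) approach mentioned above, the following is a minimal sketch in C. It assumes a regular strided layout, such as a submatrix of a row-major 2D matrix, in which every block has the same length and consecutive blocks are a fixed stride apart. The function names `pack_strided` and `unpack_strided` are hypothetical and are not taken from the benchmarks used in this paper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: gather `nblocks` blocks of `block_len` bytes, spaced
 * `stride` bytes apart in `src`, into the contiguous buffer `dst`.
 * Returns the number of bytes packed. */
size_t pack_strided(void *dst, const void *src,
                    size_t nblocks, size_t block_len, size_t stride)
{
    assert(dst != NULL && src != NULL && stride >= block_len);
    unsigned char *out = dst;
    const unsigned char *in = src;
    for (size_t i = 0; i < nblocks; i++) {
        /* One memcpy per block: this is the copy overhead on the sender. */
        memcpy(out + i * block_len, in + i * stride, block_len);
    }
    return nblocks * block_len;
}

/* Hypothetical helper: scatter a contiguous buffer `src` back into `nblocks`
 * blocks of `block_len` bytes, spaced `stride` bytes apart in `dst`.
 * Returns the number of bytes unpacked. */
size_t unpack_strided(void *dst, const void *src,
                      size_t nblocks, size_t block_len, size_t stride)
{
    assert(dst != NULL && src != NULL && stride >= block_len);
    unsigned char *out = dst;
    const unsigned char *in = src;
    for (size_t i = 0; i < nblocks; i++) {
        /* One memcpy per block: this is the copy overhead on the receiver. */
        memcpy(out + i * stride, in + i * block_len, block_len);
    }
    return nblocks * block_len;
}
```

In this scheme the sender packs the blocks, transmits the packed buffer as a single contiguous message, and the receiver unpacks it, so each side pays one copy per block. An MPI derived datatype (for example one built with `MPI_Type_vector`) can describe the same layout to the library, and the RDMA-based offloading methods aim to reduce or avoid these copies.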