Abstract
To handle operations that interfere with plagiarism detection, such as synonym substitution and paraphrasing, we propose a plagiarism detection approach for Chinese documents based on semantic textual similarity. First, we divide each document into sentences and use word2vec to represent each word of a sentence as a vector, which serves as the input to a convolutional neural network (CNN). The CNN then extracts and filters sentence features, computes the difference between the two sentences of a pair, and outputs their similarity. The sentence pairs with the highest similarity are treated as plagiarism candidates. Finally, we use copy-and-paste documents and semantically similar documents as the dataset to verify the proposed method and compare it with the traditional fingerprint feature extraction method. The proposed method is tested on a large, publicly available Tencent Cloud text similarity dataset and applied to plagiarism detection in students' homework. The results show that although the traditional fingerprint feature extraction method accurately finds identical fragments in two documents, it is sensitive to noise in semantically similar documents, whereas the proposed approach overcomes this weakness.
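The first and last steps of the pipeline described above (sentence segmentation and selection of the highest-similarity pairs) can be illustrated with a minimal sketch. It does not reproduce the word2vec and CNN scorer: `char_jaccard` is a placeholder similarity chosen only so the example runs without trained models, and the names `split_sentences`, `candidate_pairs`, and the `threshold` parameter are hypothetical, not taken from the paper.

```python
import re
from typing import Callable, List, Tuple

# Split on common Chinese and Western sentence-ending punctuation.
_SENTENCE_END = re.compile(r"[。！？!?；;]+")


def split_sentences(document: str) -> List[str]:
    """Divide a document into non-empty sentence units."""
    return [s.strip() for s in _SENTENCE_END.split(document) if s.strip()]


def char_jaccard(a: str, b: str) -> float:
    """Placeholder similarity: Jaccard overlap of the two character sets.

    ASSUMPTION: this stands in for the word2vec + CNN scorer, which the
    abstract does not describe in enough detail to reproduce here.
    """
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def candidate_pairs(
    suspect: str,
    source: str,
    similarity: Callable[[str, str], float] = char_jaccard,
    threshold: float = 0.5,  # hypothetical cut-off, not from the paper
) -> List[Tuple[str, str, float]]:
    """For each suspect sentence, keep its most similar source sentence.

    A pair is reported only if its score reaches `threshold`.
    """
    source_sentences = split_sentences(source)
    results: List[Tuple[str, str, float]] = []
    if not source_sentences:
        return results
    for sentence in split_sentences(suspect):
        # Highest-scoring source sentence for this suspect sentence.
        best = max(source_sentences, key=lambda t: similarity(sentence, t))
        score = similarity(sentence, best)
        if score >= threshold:
            results.append((sentence, best, score))
    return results
```

Passing the scorer as the `similarity` argument keeps the pairing logic independent of the model, so a trained sentence-pair scorer could replace the placeholder without changing `candidate_pairs`.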
- Affiliation