Abstract

Data mining, as a timely and high-fidelity method of analysis, plays an increasingly important role in society. Its ability to discover patterns and regularities quickly in large-scale data is gradually replacing manual effort. Current large-scale distributed systems (such as Hadoop and Spark) produce tens of thousands of system log entries every day. The volume of these logs and the tangled relationships among them place a heavy burden on programmers, and manual monitoring of system efficiency also raises the cost of training new programmers. Combining data mining with system analysis is therefore a natural trend, and machine learning models are increasingly discussed in industry for system log analysis. In most cases, however, only a very small share of log entries are the "serious" ones that programmers care about most. Because most machine learning models used for system log analysis assume balanced training data, they are prone to overfitting when used for system log (syslog) warnings, and their results are unsatisfactory. This paper explores the applicability of CNN-text (CT) to system log analysis from the perspective of deep learning. First, we compare CT with the mainstream machine learning models for system log analysis, Support Vector Machine (SVM) and Decision Tree (DT), to examine the advantages of CT over these algorithms. Second, we compare CT with CNN-RNN-text (CRT), analyze how CT handles features, and verify the advantage of CT among deep learning models for processing syslog-type texts. Finally, we apply all models to two different log datasets and contrast the results to demonstrate the generality of CT. In the experiments comparing CT with the mainstream machine learning models for log analysis, the recall of CT is nearly 15% higher than that of the best of those models.
In the experiments comparing CT with the CRT model, CT outperforms CRT: its accuracy is about 20% higher, its recall about 80% higher, and its precision about 60% higher. In the generality experiment, all models are applied to the experimental dataset of this paper, logstash, and the public dataset WC85_1, where the recall of CT was nearly 14% higher than that of the model with the next-highest recall (DT-Bi), while its accuracy was 100%, on par with the other better-performing models. These results indicate that CT's ability to abstract feature sets from text and its capacity for non-linear regression are better than those of the mainstream machine learning models for system log analysis. Meanwhile, CNN-RNN-text, which is also a CNN-based model, pays too much attention to the sequential features of system log texts, whereas CNN-text concentrates less on them. This difference allows CNN-text to maintain much better performance than CNN-RNN-text. We therefore argue that CNN-text is the best of the methods considered in this paper.
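
To illustrate why recall and precision, rather than accuracy alone, are the relevant measures when "serious" log entries are rare, the following is a minimal sketch in plain Python. The label counts and both sets of predictions are hypothetical and chosen only for illustration; they are not data or results from this paper, and the helper name `metrics` is our own.

```python
def metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for a binary labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted positive
    # or when no positive example exists.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical imbalanced log labels: 990 normal (0) and 10 serious (1).
y_true = [0] * 990 + [1] * 10

# A degenerate model that labels every entry as normal.
always_normal = [0] * 1000

# A hypothetical detector: 2 false alarms among the normal entries,
# 8 of the 10 serious entries caught, 2 missed.
detector = [0] * 988 + [1] * 2 + [1] * 8 + [0] * 2

for name, pred in [("always normal", always_normal), ("detector", detector)]:
    acc, prec, rec = metrics(y_true, pred)
    print(f"{name}: accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

In this toy setting the degenerate model still reaches 99% accuracy while finding none of the serious entries, which is why the comparisons above are reported in terms of recall and precision as well as accuracy.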