Abstract

A public opinion system monitors trends in public opinion on the Web. With such a system, we can identify hot spots on the Web and track how they evolve. Events are the focus of a public opinion system, and news data about public opinion events are complex: even the data about a single event often cover several sub-topics, each reflecting a different aspect of the event. For example, for an earthquake, the sub-topics include earthquake details, rescue work, and post-disaster reconstruction. These sub-topics not only embody different aspects of the event but also reflect the hot spots that the public may be concerned about. Sub-topic tags can be regarded as attributes of an event that help us describe and comprehensively understand it. Through sub-topics, we can compare the similarities and differences between events, and the sub-topic tags within a given period can reflect how public attention to the aspects of an event changes. Detecting the sub-topics of events and generating accurate sub-topic tags is therefore of great significance for a public opinion system. Generating sub-topic tags for a public opinion event usually involves two major steps: first discovering the sub-topics, and then generating effective tags for them based on their corresponding keywords and documents. Existing methods for discovering topics or sub-topics are usually based on clustering or classification, which place documents about the same topic into the same cluster. However, because documents about the same event are similar to one another, it is difficult for existing methods to measure the distance between them, and thus they cannot effectively differentiate the sub-topics within an event. Moreover, each document contains many high-frequency background words, which makes it hard to ensure the diversity of the discovered sub-topics.
In addition, traditional methods often generate sub-topic tags by extraction, which guarantees neither the accuracy nor the intelligibility of the tags. To overcome these problems, this paper proposes the ET-TAG model, which uses PLSA-BLM to discover sub-topic keywords, uses KL divergence to merge similar sub-topics, and then uses co-occurrence relations to update the sub-topic keywords. Based on the sub-topic keywords, an external knowledge base is used to generate the corresponding tags for each sub-topic. Experiments on the Sogou news corpus and on a corpus of multi-category public opinion events show that ET-TAG has clear advantages over traditional methods (including K-means and LDA) in sub-topic discovery. ET-TAG also achieves higher accuracy when generating sub-topic tags, and its tags summarize the sub-topics better. Finally, ET-TAG is applied to compare and track events, which shows that sub-topic tags can help find the common points between different events and reflect the heat trends of an event's sub-topics.
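As a rough illustration of the merging step mentioned above, the following minimal sketch treats each sub-topic as a word distribution over a shared vocabulary and merges sub-topics whose divergence is small. The abstract does not specify the details, so the symmetrized form of the KL divergence, the greedy merge order, the averaging of merged distributions, the threshold value, and the function names are all assumptions made for illustration, not the ET-TAG implementation.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary.

    eps is a small smoothing constant (an assumption) to avoid log(0).
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def symmetric_kl(p, q):
    """Symmetrized KL divergence (assumed here; plain KL is asymmetric)."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def merge_similar_topics(topics, threshold=0.1):
    """Greedily merge topics whose symmetric KL divergence is below threshold.

    topics: list of word distributions (lists of probabilities).
    threshold: hypothetical cut-off; it would need tuning on real data.
    Returns the list of merged word distributions.
    """
    merged = []  # each entry: (representative distribution, member count)
    for dist in topics:
        for i, (rep, n) in enumerate(merged):
            if symmetric_kl(dist, rep) < threshold:
                # Update the representative as a running average.
                new_rep = [(r * n + d) / (n + 1) for r, d in zip(rep, dist)]
                merged[i] = (new_rep, n + 1)
                break
        else:
            # No sufficiently close topic found: keep it as a new sub-topic.
            merged.append((list(dist), 1))
    return [rep for rep, _ in merged]
```

With three toy distributions such as `[0.7, 0.2, 0.1]`, `[0.68, 0.22, 0.10]`, and `[0.1, 0.2, 0.7]`, the first two are close enough to be merged under this threshold, while the third stays a separate sub-topic.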