Detection of topic on Health News in Twitter Data
Keywords:
Prolonged sitting, muscle activity, exercises on prolonged sitting
Abstract: The development and rapid popularization of the internet have led to exponential growth of data on the network, which makes text mining increasingly important. Users must search for information within the immense volume available online, and the need to obtain valuable information and to classify, organize and manage vast text data automatically makes text processing more difficult. To address these problems, intelligent information processing has been studied extensively, and topic modelling has been widely employed in natural language processing. Current research focuses on improving the speed and accuracy of text classification and topic detection, and on feature selection methods that achieve better dimensionality reduction. The Latent Dirichlet Allocation (LDA) topic model works well for reducing data noise and is widely used as a feature model in combination with a classifier to achieve good classification performance. This study aims to conduct data mining while reducing the load imposed by a huge database. Three supervised learning algorithms are run: Naïve Bayes, Decision Tree and Random Forest. The Random Forest classifier outperforms the other two classifiers, with 99.99% accuracy. Seven topic clusters are revealed using the Random Forest classifier. Each output is set to the four highest-weighted words and shows each term and its weight. The highest term in the dataset is ‘Ebola’. The findings show that combining LDA with a supervised learning algorithm effectively addresses the problem of data sparseness in short text sets. Selecting the microblogs that are most likely to discuss news topics significantly reduces the amount of data to be considered and, to a certain extent, eliminates the interference of non-news blogs.
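The reporting step described in the abstract, where each topic is reduced to its four highest-weighted words together with their weights, can be illustrated with a minimal sketch. The function name `top_terms` and the toy weights below are hypothetical and chosen only for illustration; they are not results from this study, and the sketch assumes a topic-word weight table has already been produced by a topic model.

```python
from typing import Dict, List, Tuple


def top_terms(
    topic_word_weights: List[Dict[str, float]], k: int = 4
) -> List[List[Tuple[str, float]]]:
    """Return the k highest-weighted (term, weight) pairs for each topic."""
    result = []
    for weights in topic_word_weights:
        # Sort by descending weight; ties are broken alphabetically
        # so that the output is deterministic.
        ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
        result.append(ranked[:k])
    return result


# Hypothetical, made-up weights for two topics (NOT values from the study).
toy_topics = [
    {"ebola": 0.12, "outbreak": 0.07, "virus": 0.05, "health": 0.04, "diet": 0.01},
    {"diet": 0.09, "sugar": 0.06, "heart": 0.05, "study": 0.03, "ebola": 0.01},
]

summary = top_terms(toy_topics, k=4)
for topic_id, terms in enumerate(summary):
    print(topic_id, terms)
```

In practice the weight table would come from a fitted topic model. For example, scikit-learn's `LatentDirichletAllocation` exposes topic-word weights through its `components_` attribute, although the abstract does not state which implementation was used.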