Malay Roman Corpus Annotation System
Keywords: Named Entity Recognition, Polyglot, language model
Abstract
The Malay Roman Corpus Annotation System is a web-based Natural Language Processing system. It was developed using Flask and is powered by Polyglot, a natural language pipeline that supports multilingual applications, with Malay among its supported languages. Web resources contain large amounts of unstructured text that computers cannot interpret directly, so analysing this text is slow and inefficient. Furthermore, unstructured text carries a great deal of information, such as the names of people and places, which frequently leads to incorrect information being extracted. This work therefore aims to define and extract Malay Roman Named Entity Recognition characteristics from unstructured documents, and to develop a system that annotates Malay Roman text using a suitable approach so that users can extract information correctly. The method consists of five phases: sentence segmentation, tokenization, part-of-speech tagging, entity recognition, and relationship recognition. The functionalities developed are user management, text input, text clearing, entity label viewing, text analysis, and result management. This innovation can help news providers by automatically scanning entire articles and identifying entities, which supports article categorization, and it can save students time by helping them summarize documents.
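To make the phases concrete, the following is a minimal sketch of three of the five phases (sentence segmentation, tokenization, and entity recognition) using only the Python standard library. It is not the system's implementation and does not use Polyglot: the gazetteer lookup stands in for a trained model, and the names `GAZETTEER`, `segment_sentences`, `tokenize`, `recognize_entities`, and `annotate`, the entity labels, and the sample text are all assumptions made for illustration. Part-of-speech tagging and relationship recognition are omitted.

```python
import re

# Hypothetical toy gazetteer with illustrative labels.
# A real system would rely on a trained NER model instead.
GAZETTEER = {
    "Ali": "PERSON",
    "Kuala Lumpur": "LOCATION",
    "Universiti Malaya": "ORGANIZATION",
}


def segment_sentences(text):
    """Phase 1: split after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Phase 2: split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def recognize_entities(tokens):
    """Phase 4 (simplified): longest-match lookup of token spans."""
    max_len = max(len(name.split()) for name in GAZETTEER)
    entities = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in GAZETTEER:
                entities.append((candidate, GAZETTEER[candidate]))
                i += n
                break
        else:
            # No span starting at this token matched; move on.
            i += 1
    return entities


def annotate(text):
    """Run the simplified pipeline over every sentence in the text."""
    results = []
    for sentence in segment_sentences(text):
        tokens = tokenize(sentence)
        results.append({
            "sentence": sentence,
            "tokens": tokens,
            "entities": recognize_entities(tokens),
        })
    return results


if __name__ == "__main__":
    sample = "Ali bekerja di Kuala Lumpur. Dia belajar di Universiti Malaya."
    for item in annotate(sample):
        print(item["sentence"], item["entities"])
```

A dictionary lookup only finds names it already contains, which is why a statistical or neural model is the usual choice for unseen entities; the sketch is meant only to show how the output of each phase feeds the next.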