Malay Roman Corpus Annotation System

Authors

  • Safwan Sufian Chang, School of Computing, UUM College of Arts and Sciences, Universiti Utara Malaysia, Sintok, Kedah, 06010, MALAYSIA
  • Juhaida Abu Bakar, School of Computing, UUM College of Arts and Sciences, Universiti Utara Malaysia, Sintok, Kedah, 06010, MALAYSIA
  • Norliza Katuk, School of Computing, UUM College of Arts and Sciences, Universiti Utara Malaysia, Sintok, Kedah, 06010, MALAYSIA

Keywords:

Named Entity Recognition, Polyglot, language model

Abstract

The Malay Roman Corpus Annotation System is a web-based Natural Language Processing system. It was developed using Flask and is powered by Polyglot, a natural language pipeline that supports multilingual applications, including the Malay language. Web resources contain large amounts of unstructured text that computers cannot readily interpret, so text analysis is slow and inefficient. Furthermore, these unstructured texts contain a large amount of information, such as people's names, places, and locations, which often leads to incorrect information being evaluated. Hence, this work aims to define and extract Malay Roman Named Entity Recognition characteristics from unstructured documents, and to develop a system that can annotate Malay Roman text using a suitable approach. The system helps users extract information correctly. The method consists of five phases: sentence segmentation, tokenization, part-of-speech tagging, entity recognition, and relationship recognition. The functionalities developed are: manage users, manage text input, manage clear text, view entity labels, manage analyse text, and manage results. This innovation can help news providers by automatically going through entire articles and identifying the entities, which helps in categorizing articles, and it can save students time by helping them summarize documents.
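
To make the phases named in the abstract more concrete, the following is a minimal sketch of three of them (sentence segmentation, tokenization, and entity recognition) using only the Python standard library. It is not the authors' implementation and does not use Polyglot: the entity step is a toy gazetteer lookup standing in for a trained model, and the names `GAZETTEER`, `split_sentences`, `tokenize`, `recognize_entities`, and `annotate`, along with the sample entries and labels, are all assumptions made for illustration. Part-of-speech tagging and relationship recognition are omitted.

```python
import re

# Hypothetical toy gazetteer (an assumption for illustration only);
# the system described in the abstract relies on Polyglot's models instead.
GAZETTEER = {
    "Kuala Lumpur": "LOC",
    "Universiti Utara Malaysia": "ORG",
    "Ali": "PER",
}


def split_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def recognize_entities(tokens, gazetteer=GAZETTEER, max_len=3):
    """Greedy longest-match lookup of token spans against the gazetteer."""
    entities = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in gazetteer:
                entities.append((candidate, gazetteer[candidate]))
                i += n
                break
        else:
            # No span starting at this token matched; move on.
            i += 1
    return entities


def annotate(text):
    """Run segmentation, tokenization, and entity lookup on raw text."""
    results = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        results.append({
            "sentence": sentence,
            "tokens": tokens,
            "entities": recognize_entities(tokens),
        })
    return results


if __name__ == "__main__":
    sample = "Ali belajar di Universiti Utara Malaysia. Dia tinggal di Kuala Lumpur."
    for item in annotate(sample):
        print(item["sentence"], item["entities"])
```

A gazetteer lookup only finds names it already lists, which is why a model-based recognizer is the more plausible choice for open-ended web text; the sketch is meant only to show how the phases feed into one another.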

Published

08-12-2021

How to Cite

Sufian Chang, S., Abu Bakar, J., & Katuk, N. (2021). Malay Roman Corpus Annotation System. Multidisciplinary Applied Research and Innovation, 2(3), 001-004. https://publisher.uthm.edu.my/periodicals/index.php/mari/article/view/5075
