Malay Roman Corpus Annotation System

Safwan Sufian Chang; Juhaida Abu Bakar; Norliza Katuk

Authors

Safwan Sufian Chang, School of Computing, UUM College of Arts and Sciences, Universiti Utara Malaysia, Sintok, Kedah, 06010, MALAYSIA
Juhaida Abu Bakar, School of Computing, UUM College of Arts and Sciences, Universiti Utara Malaysia, Sintok, Kedah, 06010, MALAYSIA
Norliza Katuk, School of Computing, UUM College of Arts and Sciences, Universiti Utara Malaysia, Sintok, Kedah, 06010, MALAYSIA

Keywords:

Named Entity Recognition, Polyglot, language model

Abstract

The Malay Roman Corpus Annotation System is a web-based Natural Language Processing system developed using Flask and powered by Polyglot, a natural language pipeline. Polyglot supports multilingual applications, and Malay is one of the languages the library supports. Web resources contain large amounts of unstructured text that computers cannot readily interpret, which makes text analysis slow and inefficient. Furthermore, these unstructured texts contain an excessive amount of information, such as names of people, places, and locations, which almost always results in incorrect information being evaluated. Hence, this work aims to define and extract Malay Roman Named Entity Recognition characteristics from unstructured documents, and to develop a system that can annotate Malay Roman text using a suitable approach. The resulting system helps users extract information correctly. The method consists of five phases: sentence segmentation, tokenization, part-of-speech tagging, entity recognition, and relationship recognition. The functionalities developed are: manage users, manage text input, manage clear text, view entity labels, manage analyse text, and manage results. This innovation can help news providers by automatically going through entire articles and identifying the entities, which helps in categorizing articles, and it can save students time by helping them summarize documents.
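The first phases named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not call Polyglot: it uses only the Python standard library, and a small hypothetical gazetteer (`GAZETTEER`) stands in for a trained entity recognizer. The function names, the sample entries, and the `I-PER` / `I-LOC` / `I-ORG` label style are assumptions chosen for illustration. Part-of-speech tagging and relationship recognition are omitted.

```python
import re

# Hypothetical toy gazetteer standing in for a trained NER model.
# The system described in the abstract uses Polyglot instead.
GAZETTEER = {
    "Universiti Utara Malaysia": "I-ORG",
    "Sintok": "I-LOC",
    "Kedah": "I-LOC",
    "Juhaida": "I-PER",
}


def split_sentences(text):
    """Phase 1: naive segmentation on '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Phase 2: split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def recognize_entities(tokens):
    """Phase 4 (simplified): longest-match lookup of token spans in the gazetteer."""
    entities = []
    max_len = max(len(name.split()) for name in GAZETTEER)
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in GAZETTEER:
                entities.append((candidate, GAZETTEER[candidate]))
                i += n
                break
        else:
            # No span starting at this token matched; move on.
            i += 1
    return entities


def annotate(text):
    """Run the phases in order and collect per-sentence results."""
    results = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        results.append({
            "sentence": sentence,
            "tokens": tokens,
            "entities": recognize_entities(tokens),
        })
    return results


if __name__ == "__main__":
    sample = ("Juhaida mengajar di Universiti Utara Malaysia. "
              "Kampus itu terletak di Sintok, Kedah.")
    for item in annotate(sample):
        print(item["sentence"], item["entities"])
```

A dictionary lookup only finds names it already lists, which is why a real system would rely on a trained model such as the one Polyglot provides; the sketch is meant only to show how the phases feed into each other.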

