Application of named entity recognition method for Indonesian datasets: a review

ABSTRACT


INTRODUCTION
The concept of a named entity (NE) was introduced at the sixth message understanding conference (MUC-6). With the introduction of NE, the MUC conferences helped to advance the field of information extraction [1]. An NE is a proper name that designates a person, location, or organization. For example, there are three NEs in the following sentence: "James is a doctoral student in the Faculty of Computer Science at the University of Indonesia." James is an NE insofar as it is the name of a person (P); Indonesia refers to a location (L); and the Faculty of Computer Science refers to an organization (O). Named entity recognition (NER) is a procedure that automatically finds, extracts, and classifies named entities in open-domain, unstructured texts such as newspaper articles, assigning each NE to a predefined type [2]. There are four approaches to NER: i) a rule-based approach, which does not require annotated data because it relies on hand-crafted rules; ii) an unsupervised learning approach; iii) a feature-based supervised learning approach that relies on supervised learning algorithms with careful feature engineering; and iv) a deep-learning-based approach, which automatically finds the required representation for detecting or classifying raw input in an end-to-end manner [3], [4]. NER is a straightforward process for humans because many named entities are proper names, most of which begin with a capital letter and are easily recognized; for machines, however, it is very difficult [5]. Information extraction often uses data available on social media, online news, and e-commerce [3]. Much information can be obtained from these sources, including product reviews and analyses. For example, NER has been applied to Indonesian news articles [6], to the extraction of comments related to flood monitoring and traffic monitoring [7], [8], and to quote identification [9].
The language of a text often determines which model is used in text analysis [10], because not every library supports every specific task [2].
NER has been applied to a wide variety of tasks [3], but a brief survey of the application of NER to texts in the Indonesian language reveals a total of only 241 documents (accessed December 2021). Meanwhile, the need to perform NER with Indonesian datasets is continuing to grow. Currently, there are libraries and tools available to facilitate machine learning (ML) as it pertains to the use of NE to extract information, but are there enough datasets? To what extent is NER used to extract information on social media and online news in an Indonesian-language context? Not all natural language processing (NLP) functions are available in Indonesian because, unlike in English, the functions that rely on the ML model mentioned above are not directly supported [11], [12].
In addition, this systematic literature review (SLR) is motivated by the emergence of problems with illegal financial technology (Fintech) [13], [14]. Several previous studies on Fintech have been carried out, and one of the main solutions they point to is monitoring entities on social media [15]. The challenge, however, is that not every corpus (text set) is available in every language.
This study looks at research trends in the application of NER to Indonesian datasets, including specific tasks, datasets, methods/techniques, and entity labels. This article will therefore help facilitate the design of experiments to extract Fintech information from social media and online news. We hope that the approach will serve not only Fintech platforms but also as a proposal for the supervision of agencies or organizations based on social media data and online news.

METHOD
Systematic literature review
First, this article presents an SLR of the field of NER research. An SLR aims to collect all research on a particular topic, evaluate it critically, and reach conclusions that synthesize that research. A discussion of how NER has been applied to Indonesian texts then follows. SLRs have been used in various research domains such as P2P lending [13], Fintech [14], teaching and learning via webinars [16], supply chain management models [17], and software engineering [18].
The SLR was carried out in three stages: the planning stage, the implementation stage, and the reporting stage (see Figure 1). The first stage, planning, identifies the need for a systematic review of the application of NER to Indonesian datasets. At this stage, a review protocol was also developed by setting the research question (RQ) and formulating a boolean search to determine the search keywords. This study used the population, intervention, comparison, outcomes, and context (PICOC) strategy to determine the RQ, as shown in Table 1. The RQ that guided the analysis is: "What are the trends in the application of NER to extract information from Indonesian online news and social media?" In this study, the search string is ("named-entity recognition" OR NER OR "named entity recognition") AND ("online news" OR "social media") AND (Indonesia* OR Bahasa). In line with the research question, the inclusion and exclusion criteria in Table 2 were used to define the results. In the second stage, this research defines a search strategy, namely selecting the publication databases, selecting the studies, extracting the data, and synthesizing the results. These processes are sequential, and each aims to find the right studies to be used in this research. The search and selection process is an elimination process based on the criteria specified at each step.
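The boolean search string above can be read as three OR-groups joined by AND. The following is a minimal sketch, assuming one wants to re-apply the same logic locally to exported titles or abstracts; the function name `matches_query` and the regular expressions are illustrative assumptions, and each database has its own query syntax that this sketch does not reproduce.

```python
import re

# Each inner list is an OR-group; the groups are AND-ed together,
# mirroring the search string quoted in the text.
QUERY_GROUPS = [
    [r"named[- ]entity recognition", r"\bNER\b"],
    [r"online news", r"social media"],
    [r"\bIndonesia\w*", r"\bBahasa\b"],  # Indonesia* is treated as a prefix wildcard
]

def matches_query(text: str) -> bool:
    """Return True if every AND-group has at least one matching OR-term."""
    return all(
        any(re.search(term, text, flags=re.IGNORECASE) for term in group)
        for group in QUERY_GROUPS
    )

# Hypothetical titles, used only to exercise the predicate.
print(matches_query("Named entity recognition for Indonesian social media posts"))  # True
print(matches_query("NER on English online news"))  # False
```

A record passes only if all three groups are satisfied, which is why the second title is rejected even though it mentions NER and online news.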
The authors collected papers from relevant electronic databases, namely Scopus, ACM, IEEE Xplore, and ScienceDirect, and used Mendeley software to organize the data. Some irrelevant papers were omitted in the first stage of selection based on their titles and abstracts. The second stage of selection is a full-text selection. Figure 2 illustrates the selection procedure. The total number of papers obtained from the four databases was initially 241. Upon completion of the selection procedure, however, only 20 papers remained. The low number of papers is both a challenge to and an opportunity for NER research in an Indonesian context, as few studies have used a "Bahasa" dataset. The third stage is reporting and analyzing the results of this review. We mapped the results of previous studies and examined how the experimental process in NER was conducted, what libraries can be used for Indonesian language datasets, and how NER is approached, and we proposed future research.
Table 2. Criteria of the study selection process
Inclusion criteria: i) the paper studies NER; ii) the study was published in the last 5 years, between 2016 and 2021; iii) the paper is a journal article or a proceedings/conference paper.
Exclusion criteria: i) the paper is not in English; ii) the paper is not available in full text; iii) the paper duplicates one found in a different database; iv) the paper discusses NER but does not use an Indonesian text dataset.

Bibliometrics analysis
This study also presents a bibliometric analysis of the documents retrieved at the initial stage of selection. Bibliometrics is the statistical analysis of books, articles, and other publications. The analysis uses data on the number and authors of scientific publications, and on the articles and citations within them, to measure the output of individuals, research teams, institutions, and countries, to identify national and international networks, and to map the development of new fields of science and technology. The VOSviewer tool displays clusters of keywords and authors in the NER field and thereby helps to expand the scope of NER research.
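Keyword maps of this kind are built from keyword co-occurrence counts. The sketch below only illustrates that counting step on a few hypothetical records; it is not VOSviewer's algorithm, whose normalization and clustering are more involved, and the function name `keyword_cooccurrence` is an assumption.

```python
from collections import Counter
from itertools import combinations

def keyword_cooccurrence(records):
    """Count how often each unordered keyword pair appears in the same record."""
    pair_counts = Counter()
    for keywords in records:
        # Normalize case and whitespace, and count each keyword once per record.
        unique = sorted({k.strip().lower() for k in keywords})
        for pair in combinations(unique, 2):
            pair_counts[pair] += 1
    return pair_counts

# Hypothetical author keywords for three papers.
records = [
    ["Named entity recognition", "CRF", "Indonesian"],
    ["named entity recognition", "deep learning", "Indonesian"],
    ["CRF", "named entity recognition"],
]
counts = keyword_cooccurrence(records)
print(counts[("indonesian", "named entity recognition")])  # 2
```

Pairs with high counts would appear close together in a keyword map, while keywords that never co-occur remain unconnected.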

Bibliometrics analysis results
VOSviewer software helps to visualize research trends by grouping the keywords of articles into clusters and constructing diagrams from them. Figure 3(a) shows that NER research began to develop in early 2018 (see blue cluster) with content analysis tasks and experiments to identify documents and sentences. Some of these tasks included comparing precision and recall across simple ML approaches such as conditional random fields (CRF) and support vector machines (SVM). It was not until the beginning of 2019 (see green cluster), however, that research using Indonesian datasets began. Around that time, several other classification tasks also began to develop. Research data come not only from scholarly articles but also from data and images on the web and on platforms such as Twitter. In 2019-2021 (see yellow cluster), NER research started to address fake news, aspect-based sentiment analysis, and the measurement of model performance. This is where the deep learning (DL) approach to NER begins to emerge. DL is a form of ML that aims to imitate the workings of the human brain using artificial neural networks. Its algorithms are expected to improve the performance of ML.
Additionally, we conducted a VOSviewer analysis with the co-authorship feature to see which authors were actively researching NER topics. Of 684 authors, 49 met the threshold; however, the results show that the researchers are not connected in a single network. This suggests that each NER experiment has its own research goals, dataset, methods/techniques, and part-of-speech (PoS) tagging process. As Figure 3 shows, one of the most active researchers in the field is Purwanti.

Discussion
The internet, and especially social media, is a strategic tool for disseminating information to the public. Techniques have recently emerged that allow one to extract information on a targeted topic from the internet and then to examine the relationship between the words associated with that topic. Moreover, these techniques allow one to map out the relationship between the chief exponents of that topic and perhaps even locate them by charting their movements. One technique, namely text mining, provides a set of methodologies and tools for finding, visualizing, and evaluating information from extensive collections of text data [36]. Four processes need to be executed in text mining (see Figure 4). There are two ways of collecting data from social media and online news: i) automated web crawling using an API or a bot; ii) web scraping by extracting HTML or XML elements over the HTTP protocol. After the data are collected and cleaned, the next stage is pre-processing, which can be done by tokenizing, removing stopwords, or stemming. The modeling stage then applies a machine-learning approach in an iterative loop of evaluation and validation. Finally, presentations of the data help to visualize the results of modeling after the tasks of categorization, recommendation, spam detection, and summarization have been completed.
As explained in the introduction, toolkits for analyzing language, especially the natural language toolkit (NLTK), are intended for English. NLTK is a library and program for NLP written in the Python programming language. It supports tokenization, classification, stemming, tagging, parsing, and semantic reasoning functions. Other languages cannot fully use NLTK and need to rely on other tools. Indonesian language libraries include InaNLP, kateglo, BimaNLP, Indonesian Stemmer, Sastrawi, PySastrawi, and SentiStrengthID. Table 4 describes the most frequently used Indonesian language libraries.
PySastrawi: ported from the Sastrawi project in PHP to Python
SentiStrengthID [38]: sentiment strength detection in Bahasa Indonesia
In addition, tools and libraries for NER include SpaCy, GATE, OpenNLP, CoreNLP, NLTK, and CogcompNLP. NER is one of the first steps toward information extraction; it seeks to find the entities mentioned in a text and classify them into predefined categories such as person name, organization, location, time, value, and percentage [3]. NER is used in many NLP fields and can help address many needs [25], [39], [40]. NER is a critical pre-processing tool for various downstream applications such as information retrieval, question answering, and machine translation. Recognizing named entities in search queries helps to understand user intent better and thus to provide better search results [41].
It is important to classify the various approaches that NER employs. Although they all carry out a classification function, the approaches to NER continue to develop. Figure 5 illustrates the NER approaches. The non-ML approaches to NER fall into four groups. First, the rule-based method identifies entities with rules that the system designers write themselves based on linguistic knowledge [9]. Second, the lexicon-based method works by first building a dictionary (lexicon) of opinion words. Third, the statistical method relies on probabilistic models, for example the CRF and hidden Markov model (HMM) algorithms [24]. A CRF is a framework for building discriminative probabilistic models for segmenting and labelling sequential data. HMM, in turn, is a primary technique for POS tagging in NLP. An HMM models observations using a Markovian process whose state is not directly observed (hidden). The main idea of HMM is to solve the problem of sequence tagging. Fourth is ontology-based NER, which, like a machine-learning approach, can identify known terms and concepts in unstructured or semi-structured text, but which also relies on the ontology being kept up to date. The ontology approach provides additional advantages in terms of further reasoning and knowledge acquisition for the extracted concepts [23], [30].
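The lexicon-based idea can be sketched with a simple dictionary lookup. This is a minimal illustration, assuming a hypothetical three-entry lexicon built from the example sentence in the introduction; the names `GAZETTEER` and `lexicon_ner` are illustrative, and a practical system would need a far larger dictionary and additional rules for ambiguity.

```python
import re

# Hypothetical lexicon; a real system would load a much larger dictionary.
GAZETTEER = {
    "James": "PERSON",
    "Faculty of Computer Science": "ORGANIZATION",
    "Indonesia": "LOCATION",
}

def lexicon_ner(text, gazetteer=GAZETTEER):
    """Return (start, end, surface, label) for non-overlapping lexicon matches."""
    found = []
    taken = [False] * len(text)  # marks characters already covered by a match
    # Try longer entries first so multi-word names win over their substrings.
    for name in sorted(gazetteer, key=len, reverse=True):
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            if not any(taken[m.start():m.end()]):
                found.append((m.start(), m.end(), m.group(), gazetteer[name]))
                for i in range(m.start(), m.end()):
                    taken[i] = True
    return sorted(found)

sentence = ("James is a doctoral student in the Faculty of Computer Science "
            "at the University of Indonesia.")
entities = [(surface, label) for _, _, surface, label in lexicon_ner(sentence)]
print(entities)
```

The lookup finds only names that are already in the lexicon, which is the main limitation of this family of methods and one reason statistical and DL approaches are used for open-ended text.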
In the field of NLP, researchers are interested in identifying the word class of each word in a sentence. Consider, for example, the sentence Ryan menendang bola ('Ryan kicks the ball'). After the POS tagging process, the classification is "Ryan/noun menendang/verb bola/noun." This is useful for selecting the nouns in a sentence. Word classes are also referred to as syntactic categories. POS tagging is a form of sequence classification.
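To make the HMM idea concrete, the sketch below decodes the example sentence with the Viterbi algorithm. All probabilities are made-up toy values chosen for illustration, not estimates from any corpus, and the two-tag set is a deliberate simplification.

```python
def viterbi(words, tags, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable tag sequence for `words` under an HMM."""
    # best[i][t] = (probability of the best path ending in tag t at position i,
    #               previous tag on that path)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], unk), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, unk), p)
                for p in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))

# Toy parameters (assumptions for illustration only).
tags = ["noun", "verb"]
start_p = {"noun": 0.7, "verb": 0.3}
trans_p = {"noun": {"noun": 0.3, "verb": 0.7}, "verb": {"noun": 0.8, "verb": 0.2}}
emit_p = {"noun": {"ryan": 0.4, "bola": 0.5}, "verb": {"menendang": 0.9}}

words = ["ryan", "menendang", "bola"]
tagged = viterbi(words, tags, start_p, trans_p, emit_p)
print(list(zip(words, tagged)))  # [('ryan', 'noun'), ('menendang', 'verb'), ('bola', 'noun')]
```

The hidden states are the tags and the observations are the words; unseen word and tag combinations receive a small constant probability (`unk`) so that the product never collapses to zero.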
Several schemes also exist for annotating NER data. Widely used tagging schemes include inside-outside (IO), inside-outside-beginning (IOB), and beginning-inside-last-outside-unit (BILOU). If two entities of the same type appear consecutively, IO cannot distinguish the boundary between them. IOB and BILOU, by contrast, both encode boundary information, but they differ in their ability to model finer-grained context information [41].
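The difference between the three schemes is easiest to see on a sequence with two adjacent entities of the same type. The following sketch converts token-level entity spans into tags; the function name `spans_to_tags`, the span format, and the example tokens are assumptions made for illustration.

```python
def spans_to_tags(tokens, spans, scheme="IOB"):
    """Convert entity spans (start, end_exclusive, label) over tokens into tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        for i in range(start, end):
            if scheme == "IO":
                prefix = "I-"
            elif scheme == "IOB":
                prefix = "B-" if i == start else "I-"
            elif scheme == "BILOU":
                if end - start == 1:
                    prefix = "U-"  # single-token (unit) entity
                elif i == start:
                    prefix = "B-"
                elif i == end - 1:
                    prefix = "L-"
                else:
                    prefix = "I-"
            else:
                raise ValueError(f"unknown scheme: {scheme}")
            tags[i] = prefix + label
    return tags

# Hypothetical tokens with two consecutive person entities and one location.
tokens = ["Budi", "Santoso", "Siti", "pergi", "ke", "Jakarta"]
spans = [(0, 2, "PER"), (2, 3, "PER"), (5, 6, "LOC")]

for scheme in ("IO", "IOB", "BILOU"):
    print(scheme, spans_to_tags(tokens, spans, scheme))
```

Under IO the first three tokens all receive I-PER, so the two persons merge into one apparent entity, whereas IOB and BILOU mark where the second entity starts.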

Figure 5. NER approaches
In conducting NER experiments, the lack of datasets in Indonesian provides an opportunity for further research to build new datasets. The important thing in building such a dataset is how to conduct crawling and scraping. If scraping is done manually, it may be necessary to spend time copying and pasting data. The suggestion, therefore, is to scrape using code, applications, or browser extensions. HTML parsing techniques can also be performed via JavaScript and can target linear and branching HTML pages. This method is more efficient in identifying the HTML elements of websites, which are then used to extract text, links, and data. No scraping technique is one hundred percent effective, because the data obtained are not always neat and depend on the structure of the page. Understanding the structure of website pages is therefore essential.
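As a small illustration of HTML parsing for scraping, the sketch below extracts visible text and links from a page. It uses Python's standard `html.parser` module rather than the JavaScript route mentioned above; the class name, the sample HTML, and the URL are hypothetical, and fetching pages over HTTP is left out.

```python
from html.parser import HTMLParser

class TextAndLinkExtractor(HTMLParser):
    """Collect visible text chunks and hyperlink targets from an HTML page."""

    def __init__(self):
        super().__init__()
        self.text_chunks = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.text_chunks.append(text)

def extract(html):
    """Parse an HTML string and return (text_chunks, links)."""
    parser = TextAndLinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.text_chunks, parser.links

# Hypothetical page content, used only to exercise the parser.
sample = ('<html><head><style>p{color:red}</style></head><body>'
          '<p>Pinjaman online ilegal</p>'
          '<a href="https://example.com/berita">Baca</a>'
          '<script>var x = 1;</script></body></html>')
chunks, links = extract(sample)
print(chunks, links)
```

Because the output depends entirely on which tags carry the content, the same extractor can return noisy results on a differently structured page, which is the point made above about understanding page structure.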
Second, after obtaining the dataset, we need to understand the data cleansing approach, including tokenization, stemming, and stopword removal. Punctuation marks, numbers, and emoticons are removed so that the text data are of high quality before being used in data analysis. Text preprocessing turns unstructured text into clean data that are ready to be processed. Third, in building ML models, the prepared dataset is generally divided into a training set, a development set, and a testing set. Training is the process of building the model, and testing measures the performance of the learned model. A development set is generally not used when the dataset is small. In that case the data may be split, for example, into 80% training data and 20% testing data, or 70% training data and 30% testing data. Finally, the right approach to NER must be selected carefully. Several studies demand a high level of accuracy and a high F1 score.
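The cleaning, splitting, and scoring steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the stopword list is a tiny hypothetical sample, stemming is omitted (a library such as Sastrawi would normally handle it), and the function names are illustrative.

```python
import random
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"yang", "di", "dan", "ke", "dari", "ini", "itu"}

def preprocess(text):
    """Lowercase, drop punctuation, digits, and emoticons, then remove stopwords."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # keep ASCII letters and whitespace
    return [tok for tok in text.split() if tok not in STOPWORDS]

def train_test_split(items, test_ratio=0.2, seed=42):
    """Shuffle a copy of `items` and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

def f1_score(true_positive, false_positive, false_negative):
    """F1 is the harmonic mean of precision and recall (0.0 if undefined)."""
    predicted = true_positive + false_positive
    actual = true_positive + false_negative
    if true_positive == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_positive / predicted
    recall = true_positive / actual
    return 2 * precision * recall / (precision + recall)

tokens = preprocess("Pinjaman online ini merugikan 100 nasabah di Jakarta!!! :)")
train, test = train_test_split(range(10), test_ratio=0.2)
print(tokens, len(train), len(test))
```

One design caveat: lowercasing discards the capitalization cue noted in the introduction, so for NER it may be preferable to keep the original casing and lowercase only when matching stopwords.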

Recommendations for illegal Fintech supervision strategies with the NER approach based on social media data and online news
Based on the SLR, new ideas emerge for using this method in an era of technological and social media transformation. The digital economy can change the economic activities of society and business from manual to fully automated. This affects the provision of financial services by startups and Fintech companies. Fintech practices in Indonesia are now well developed, ranging from payments and funding to robo-advisors. In practice, however, Fintech lending (online lending) has received special attention because it has caused several problems, notably the emergence of illegal Fintech. Unreasonable billing processes, personal data protection issues, and even moral hazards are the focus of supervision. In Indonesia, a government website is available for complaints about illegal Fintech, but people tend to submit their complaints through social media [15]. Together with the NER concept described in the previous section and the basics of libraries, POS tagging, and named entities, this research becomes a basis for developing ML models for the early identification of platform names on social media. Figure 6 shows our proposed Fintech supervision model based on social media data and online news, which can be used for further research.

CONCLUSION
In conclusion, this study has provided an overview of research trends in applying the NER method to Indonesian datasets, including the extraction of news articles, flood monitoring, traffic monitoring, and quotation identification. Other areas of research to consider are data collection, dataset construction, data cleaning, and the selection of ML models for NER tasks. The theoretical implication of this research is a clearer picture of the concept of NER and its application, including the researchers active in the field and a comparison of the NER methods used. The practical implication is that the NER approach can be used to extract social media comments for platform entity detection. A promising direction, as proposed here, is the development of an illegal Fintech supervision model based on social media data.
This survey has limitations because the number of articles reviewed is low, owing to the lack of research using Indonesian datasets. This is an opportunity for further research to develop models and libraries that use Indonesian datasets. In the field of computational linguistics, the grammatical structure of each language will be a consideration and a challenge that can be explored in future research.

Indra Budi
is a lecturer in computer science and information systems at the Faculty of Computer Science, Universitas Indonesia. He is also the head of the information retrieval and natural language processing (IR-NLP) Laboratory (ir.cs.ui.ac.id/new). His research fields include information extraction, text mining, e-commerce, sentiment analysis, and social network analysis. He can be contacted at email: indra@cs.ui.ac.id.

Ryan Randy Suryono
is a doctoral student at the Faculty of Computer Science, Universitas Indonesia. He is a member of the IR-NLP Laboratory and the E-government and E-business Laboratory in the Faculty of Computer Science, Universitas Indonesia. He is also a lecturer at Universitas Teknokrat Indonesia, Bandar Lampung. His research interests include information systems, financial technology, and text analysis. He can be contacted at email: ryan@teknokrat.ac.id.