Classification of handwritten Javanese script using random forest algorithm

ABSTRACT


INTRODUCTION
The Javanese language is one of the oldest elements of Indonesia's cultural heritage and one of the languages that characterize Indonesia as a nation. It is also widely spoken: in 2007, Javanese was reported to have 82 million speakers, or 1.25% of the world's population [1]. By 2015, however, this number had decreased to 68.2 million [2], a decline that runs counter to the growth of the world's population. The declining use of the language and its script threatens the loss of part of Indonesian culture.
One way to address this problem is to introduce the language intensively to students, who are the next generation of its speakers. Currently, Javanese is taught as local content at the primary and secondary education levels in some regions [3]. The basic competencies taught in Javanese subjects are wayang, tembang, geguritan, fairy tales, traditional games, and Javanese script. Among these, Javanese script is a particularly important competency for students to learn. However, because Javanese is a local content subject, it receives much less teaching time than compulsory subjects such as mathematics and English, and not all schools teach it. With limited teaching time and a variety of other materials to cover in Javanese lessons, students may have difficulty memorizing and writing Javanese characters. Given also the variety and complexity of these characters, reading and writing competencies in Javanese script are often not conveyed to students effectively.
To overcome these problems, media that support the learning of Javanese script need to be developed. One possible design is a system that asks students to write Javanese characters in response to given questions. The system relies on image classification: each student's handwriting is identified by a previously trained handwriting prediction model. If the result matches the expected answer, the student receives a score and continues to the next question; otherwise, the student is asked to try again. We hope that such a system will allow students to learn independently so that they can better master the writing of Javanese characters. As this description shows, the most crucial component of the learning system is the classification model. If the model performs poorly, correct answers may be classified as wrong and wrong answers accepted as correct, which would lower the quality of the whole system.
Handwriting classification has been widely researched in previous studies. The most popular benchmark is the MNIST database [4], a handwritten digit recognition task (digits 0 to 9). It has been addressed with several methods, including linear classifiers [4], K-nearest neighbors (KNN) [5], support vector machines (SVM) [6], artificial neural networks (ANN) [7], and convolutional neural networks (CNN) [8]. The best current models for MNIST digit classification achieve an accuracy of more than 99%, that is, an error of less than 1% [8].
Apart from MNIST, handwriting classification in various languages and scripts has also been widely explored. For example, [9] described the handwriting recognition mechanism used by Google to identify handwriting in 22 scripts with an average error rate below 10%, while [10] developed a handwritten alphabet recognition application using ANN that achieved an accuracy of 86.535%. For Javanese script recognition specifically, [11] combined SVM with the directional element feature to achieve an accuracy of 93.6%. However, [11] used a very limited amount of data (50 training samples and 10 test samples per character), so the approach needs to be re-examined on a larger and more varied dataset. In another study, Javanese characters were classified using ANN [12], but the resulting model was unsatisfactory, with an average accuracy of only 73%. Deep learning approaches such as CNN, which is well known for its image classification capabilities (e.g. [13]-[15]), have also been commonly used for Javanese handwriting recognition, for example in [16]-[18]. These applications, however, need to be reviewed because of limited data and unsatisfactory performance. Finally, several feature extraction methods have been extensively explored in [19]-[21]. The results show that, with feature extraction, some traditional machine learning methods such as KNN can achieve fairly good accuracy of more than 80%.
In this study, we propose applying the random forest algorithm [22] to the Javanese script classification problem. Random forest is an ensemble learning method that combines several decision tree models to make predictions. It has been widely implemented and performs very well in various related studies, including handwriting recognition with an accuracy above 90% [23]. Our contributions are as follows. First, to the best of our knowledge, we are the first to focus on the random forest algorithm for classifying Javanese characters. A previous study did include random forest in a performance comparison for Javanese script classification [24], but it focused mainly on SVM as its classification method. Second, we experiment with a fairly large amount of data and extensively compare several data preprocessing schemes to find out which is most effective for Javanese script classification.

RESEARCH METHODOLOGY
Data collection
Our research methodology is illustrated in Figure 1. The data used in this research are images of handwritten Javanese script. We use the 20 characters of the Nglegena character set (ha, na, ca, ra, ka, da, ta, sa, wa, la, pa, dha, ja, ya, nya, ma, ga, ba, tha, nga) shown in Figure 2. To collect the data, we asked respondents to copy the 20 Javanese characters onto the form shown in Figure 3; the forms were then scanned and cropped into one image per character. This process yielded 6000 images (300 per character). The data are then divided into two datasets, training data and test data, with a ratio of 7:3.
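The 7:3 division described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the text does not say whether the split was done per character, so the per-class (stratified) splitting, the function name `stratified_split`, and the fixed seed are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), splitting each class at train_fraction.

    Assumption: the split is stratified per character; the paper only
    states the overall 7:3 ratio.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return train_idx, test_idx


# 20 Nglegena characters with 300 images each, as described in the text.
CHARACTERS = ["ha", "na", "ca", "ra", "ka", "da", "ta", "sa", "wa", "la",
              "pa", "dha", "ja", "ya", "nya", "ma", "ga", "ba", "tha", "nga"]
labels = [name for name in CHARACTERS for _ in range(300)]

train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 4200 1800
```

Splitting per class keeps every character equally represented in both sets, which makes per-character precision and recall easier to compare.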

Data augmentation
Machine learning requires large amounts of data to improve the quality of the resulting model and avoid overfitting. In this research, we use image augmentation to increase the variety of our data. Random rotations and shears are performed to augment each image in both training data and test data 4 times as

Data preprocessing
The image data collected in the previous step are then preprocessed in three stages: conversion to black-and-white, cropping, and resizing. In the first stage, the images are converted into black-and-white images according to the pseudocode shown in Figure 5. The resulting images are then cropped to remove the empty space around the character, using the pseudocode in Figure 6. Finally, the cropped images are resized to 32x32 pixels. An example of the results of these three stages is shown in Figure 7.
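As a rough illustration of the three stages, the following NumPy sketch binarizes, crops, and resizes a grayscale image. It is not a transcription of the pseudocode in Figures 5 and 6, which is not reproduced here; the fixed threshold of 128, the nearest-neighbour resizing, and all function names are assumptions made for the example.

```python
import numpy as np


def binarize(gray, threshold=128):
    """Map a grayscale image (0-255) to 0/1, with 1 marking dark ink pixels.

    The threshold value is an assumption, not taken from the paper.
    """
    return (gray < threshold).astype(np.uint8)


def crop_to_ink(binary):
    """Crop to the bounding box of ink pixels; return the input if it is blank."""
    rows = np.flatnonzero(binary.any(axis=1))
    cols = np.flatnonzero(binary.any(axis=0))
    if rows.size == 0:
        return binary
    return binary[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]


def resize_nearest(image, size=32):
    """Resize to size x size using nearest-neighbour sampling."""
    height, width = image.shape
    row_idx = np.arange(size) * height // size
    col_idx = np.arange(size) * width // size
    return image[np.ix_(row_idx, col_idx)]


def preprocess(gray, threshold=128, size=32):
    """Run the three stages in order: black-and-white, crop, resize."""
    return resize_nearest(crop_to_ink(binarize(gray, threshold)), size)


# Toy input: a white 100x80 page with a dark 40x20 rectangle as the "character".
page = np.full((100, 80), 255, dtype=np.uint8)
page[30:70, 25:45] = 0

cropped = crop_to_ink(binarize(page))
features = preprocess(page)
print(cropped.shape, features.shape)  # (40, 20) (32, 32)
```

Flattening the 32x32 result gives a 1024-element feature vector per image, which is the kind of input a random forest classifier can consume directly.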
In addition to these three stages, we also tried several additional scenarios. The first applies a thinning process that reduces the character strokes to thin lines, as shown in Figure 7. The second applies feature extraction using the histogram of oriented gradients (HOG) [25]. Thus, four variations of the data are used and compared in training and testing: data without thinning or HOG, data with thinning only, data with HOG only, and data with both thinning and HOG.
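To give an idea of what HOG-style features look like, the sketch below computes per-cell histograms of gradient orientation with NumPy. It is a deliberately simplified version of the core HOG step: it uses hard binning and omits the block normalization of the full method [25]. The cell size of 8 pixels, the 9 orientation bins, and the function name are assumptions, since the text does not state the HOG settings used.

```python
import numpy as np


def cell_orientation_histograms(image, cell=8, bins=9):
    """Per-cell histograms of unsigned gradient orientation, weighted by magnitude.

    Simplified HOG-style features: hard binning, no block normalization.
    """
    image = image.astype(float)
    gy, gx = np.gradient(image)          # derivatives along rows, then columns
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0   # unsigned, in [0, 180]
    bin_idx = np.minimum((angle / (180.0 / bins)).astype(int), bins - 1)

    n_rows, n_cols = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((n_rows, n_cols, bins))
    for r in range(n_rows):
        for c in range(n_cols):
            window = (slice(r * cell, (r + 1) * cell),
                      slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(bin_idx[window].ravel(),
                                     weights=magnitude[window].ravel(),
                                     minlength=bins)
    return hist.ravel()


# Toy 32x32 image with a single vertical edge: all gradient energy is horizontal.
edge = np.zeros((32, 32))
edge[:, 16:] = 1.0
features = cell_orientation_histograms(edge)
print(features.shape)  # (144,)
```

With these assumed settings a 32x32 image yields 4 x 4 cells of 9 bins each, so the feature vector has 144 entries instead of the 1024 raw pixels.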

Training
To train our random forest models, we use grid search with 3-fold cross-validation to find the best combination of parameters. The explored parameters are shown in Table 1, for a total of 20 combinations.
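A grid search of this kind could be set up with scikit-learn roughly as follows. The parameter grid here is hypothetical: Table 1 is not reproduced in this text, so the two impurity criteria and ten forest sizes were chosen only to give 20 combinations. The data are synthetic stand-ins generated with `make_classification`, not the Javanese character images.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for flattened character images (NOT the paper's data).
X, y = make_classification(n_samples=200, n_features=64, n_informative=16,
                           n_classes=4, random_state=0)

# Hypothetical grid: 2 impurity criteria x 10 forest sizes = 20 combinations.
param_grid = {
    "criterion": ["gini", "entropy"],
    "n_estimators": list(range(10, 101, 10)),
}

# cv=3 gives the 3-fold cross-validation described in the text.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print(len(search.cv_results_["params"]))  # 20
print(search.best_params_)
```

After fitting, `search.best_estimator_` holds a model refit on the full training data with the best parameters, which is the model that would then be evaluated on the held-out test set.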

Testing
The best model found by the grid search is then evaluated on the previously prepared test data. From the test results, accuracy, precision, and recall are calculated and reported.

RESULTS AND DISCUSSION
Table 2 shows the performance of the random forest models generated from the four types of data. The model generated from data without thinning or HOG performs best in terms of accuracy, precision, and recall. This result is quite surprising, considering that thinning produces images that are easily recognized by the human eye and that HOG is a feature extraction method widely used in image classification. In this Javanese handwriting recognition problem, however, both produce slightly worse performance than data without any additional treatment. This may be because the images are small, so that any additional treatment removes some of the information needed for recognition.
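For reference, accuracy, precision, and recall for a multi-class problem can be computed from a confusion matrix as in the sketch below. The paper's own formulas are not reproduced in this text, so the macro-averaging of per-class precision and recall used here is an assumption, as are the function names and the toy labels.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix with true classes on rows and predicted classes on columns."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


def macro_metrics(cm):
    """Return (accuracy, macro precision, macro recall).

    Macro-averaging is an assumption; the paper does not state how the
    per-class values were combined.
    """
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)   # column sums: TP + FP per class
    actual = cm.sum(axis=1)      # row sums: TP + FN per class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp),
                          where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision.mean(), recall.mean()


# Toy labels for three classes (illustrative only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, 3)
accuracy, precision, recall = macro_metrics(cm)
print(f"{accuracy:.3f} {precision:.3f} {recall:.3f}")  # 0.667 0.722 0.667
```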
In terms of the best parameters, shown in Table 3, almost all models use Gini as the impurity measure; the exception is the model trained on data with both thinning and HOG feature extraction, which uses entropy. All models also tend to require a fairly large number of trees, which makes sense because the tendency of a random forest to overfit decreases as the number of trees increases.
Next, we examine the confusion matrix of the model trained on data without thinning or HOG feature extraction to see which characters are most often misclassified. Figure 8 shows that the most frequent error is the character tha being predicted as nga, which occurs 20 times. As Figure 9 shows, tha is indeed very similar to nga, so the model has difficulty distinguishing the two. To address this problem, the quality and quantity of the data need to be improved in future studies.
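Finding the most frequently confused pair in a confusion matrix amounts to locating the largest off-diagonal entry, as in this small sketch. The matrix is a made-up three-class excerpt: only the count of 20 for tha predicted as nga comes from the text, and every other number, like the function name, is illustrative.

```python
import numpy as np


def most_confused_pair(cm, class_names):
    """Return (true class, predicted class, count) for the largest off-diagonal entry."""
    off_diag = cm.copy()
    np.fill_diagonal(off_diag, 0)        # ignore correct predictions
    true_idx, pred_idx = np.unravel_index(np.argmax(off_diag), off_diag.shape)
    return (class_names[true_idx], class_names[pred_idx],
            int(off_diag[true_idx, pred_idx]))


# Illustrative counts only: rows are true classes, columns are predictions.
# The 20 matches the tha-to-nga errors reported in the text; the rest is made up.
names = ["ba", "tha", "nga"]
cm = np.array([[88,  1,  1],
               [ 2, 68, 20],
               [ 1,  5, 84]])

print(most_confused_pair(cm, names))  # ('tha', 'nga', 20)
```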

CONCLUSION
The results of this study show that the random forest algorithm can be applied to identify handwritten Javanese characters and performs very well. The best performance is obtained with preprocessing that converts the images into black-and-white, crops them, and resizes them. The additional steps, namely thinning and feature extraction with HOG, do not improve performance. This is presumably because the images are small, so that an additional step may remove useful information needed for character recognition. In future research, we would like to compare traditional machine learning algorithms such as random forest, SVM, and KNN with the state-of-the-art method, deep learning with CNN, on the same problem. CNN is recognized as one of the most powerful methods for image classification, and it would be interesting to see whether a CNN model performs significantly better than a traditional machine learning model on this particular problem.