Vietnamese character recognition based on CNN model with reduced character classes

ABSTRACT


INTRODUCTION
Text recognition is the process of converting the text in one or more images into editable documents such as those written on a computer [1][2][3]. Currently, the text recognition problem has been widely studied and applied in the automation of office operations in many languages such as English [4,5], Chinese [6,7], and Japanese [8]. In Vietnam, there are recognition systems such as the VietOCR software [9] and VNDOCR [10], with accuracies of about 90-95% [9,10]. In general, the recognition of Vietnamese text has been of concern to many researchers, based on different models and techniques: using HANDS-VNOnDB [11]; utilizing RetinaNet for text detection and an Inception-v3 CNN for feature extraction, followed by a bidirectional long short-term memory (Bi-LSTM)-based RNN for text recognition [12]; Bi-LSTM combined with a Conditional Random Field (CRF) [13]; SSD MobileNet V2 for text detection and Attention OCR (AOCR) for text recognition [14]; and CNNs, well known in image processing applications [15][16][17][18], integrated with grammatical features for emotion detection [16].
Unlike the works mentioned above, this work focuses on printed word recognition in Vietnamese educational textbooks with a simplified character set: 138 character classes instead of 178, since some characters differ only in size. Figure 1 shows an overview of the proposed reduced-character-class CNN model. As shown in Figure 1, the raw data in this work are photographs of textbook pages taken with a phone, where each page is a color image containing the entire text of the page. To make the data usable, it is pre-processed to form two data sets for CNN model training (training data and testing data).
The CNN model's input is the pre-processed data (in this case, grayscale images of accented characters with a fixed size of 28x28 pixels), and the output is the images' corresponding labels. It should be noted that the selected image size is significant for pattern recognition. Vietnamese has 178 accented character classes, but analysis of the words in grade-1 Vietnamese textbooks shows that some classes differ only in actual size, for example, "c" and "C", or "o" and "O". Therefore, the authors built recognition models for only 138 character classes after filtering out such similar characters. This increases the effectiveness of the developed model.

Pre-processing data
Given an input image taken with a phone (RGB color system), the authors performed the following pre-processing steps: converting the image to grayscale and binary images, finding the contour of each token in the image, finding the contour of each character in each token, and cutting out and saving each character. These pre-processing steps were implemented in Java in combination with the OpenCV library. The procedures for each step are detailed below.
a. Step 1: Converting to grayscale and binary images. OpenCV [19][20][21] is an open-source library built for image processing. The library provides functions that convert color images to grayscale and binary images based on the Otsu algorithm [22]. Unlike [23], which used a local thresholding technique, this work utilized the existing functions from the library.
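The library calls in step 1 can be sketched as follows. This is a minimal NumPy re-implementation of grayscale conversion and Otsu thresholding for illustration only; the authors used OpenCV's built-in functions from Java.

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an H x W x 3 RGB image to grayscale (ITU-R BT.601 weights)."""
    return (0.299 * rgb[..., 0] + 0.587 * rgb[..., 1]
            + 0.114 * rgb[..., 2]).astype(np.uint8)

def otsu_threshold(gray):
    """Return the threshold that maximizes between-class variance (Otsu)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)
    best_t, best_var = 0, 0.0
    w0, sum0 = 0.0, 0.0
    for t in range(256):
        w0 += hist[t]            # weight of the background class
        if w0 == 0:
            continue
        w1 = total - w0          # weight of the foreground class
        if w1 == 0:
            break
        sum0 += t * hist[t]
        mu0 = sum0 / w0          # background mean
        mu1 = (sum_all - sum0) / w1  # foreground mean
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binarize(gray):
    """Threshold a grayscale image with the Otsu threshold."""
    return (gray > otsu_threshold(gray)).astype(np.uint8) * 255
```

In OpenCV this whole pipeline is a pair of calls (`cvtColor` and `threshold` with the Otsu flag); the sketch above only shows what those calls compute.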
b. Step 2: Finding the contour of each token. To find the contour of each token, the authors built a function that performs the following:
- Perform a closing operation to create a seamless connection of the characters in a token, so that each token forms a block. However, the original binary image must be retained in order to cut each character into separate pictures in the next step.
- Next, apply the Canny [24][25][26][27][28] algorithm to find the contour of each word in the image.
- The image contains not only text but also other physical objects, so the boundaries found include those objects. These objects are retained but are ignored in the identification step.
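The closing operation used in step 2 is dilation followed by erosion. The sketch below is a minimal NumPy illustration of that idea (OpenCV provides it directly as `morphologyEx` with `MORPH_CLOSE`), not the authors' implementation.

```python
import numpy as np

def dilate(img, k=3):
    """Binary dilation with a k x k square structuring element."""
    pad = k // 2
    padded = np.pad(img, pad)
    out = np.zeros_like(img)
    h, w = img.shape
    for dy in range(k):
        for dx in range(k):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def erode(img, k=3):
    """Binary erosion with a k x k square structuring element."""
    pad = k // 2
    padded = np.pad(img, pad, constant_values=1)
    out = np.ones_like(img)
    h, w = img.shape
    for dy in range(k):
        for dx in range(k):
            out &= padded[dy:dy + h, dx:dx + w]
    return out

def closing(img, k=3):
    """Closing = dilation then erosion: bridges small gaps between
    characters so that a whole token becomes one connected block."""
    return erode(dilate(img, k), k)
```

After closing, each token is a single connected region whose outer contour can be traced in one pass.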
c. Step 3: Finding the contour of each character. A function has been built to accomplish this task:
- The input of the function is images of Vietnamese tokens (including the boundary connections); the output is images of the characters, including accents. This step again uses the Canny algorithm on a one-token binary image to find the contour of each character image in the token. Here, the boundaries found will enclose the characters (without accents), and there may be additional contours for individual accents (for example, 'ế' has three contours: the contour of the letter 'e', the circumflex '^', and the acute accent). To obtain the contour of a fully accented character image, the rectangular blocks containing these elements are rearranged, and the elements lying along the same vertical axis are combined, forming an image of the character to be recognized.
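The grouping at the end of step 3 can be sketched as merging bounding boxes that overlap horizontally, so an accent stacked above a letter joins its base character. The function below is a hypothetical illustration (the `(x, y, w, h)` box format and names are assumptions), not the authors' code.

```python
def merge_boxes(boxes):
    """Merge bounding boxes (x, y, w, h) whose x-ranges overlap,
    combining accents with the base character below them."""
    def x_overlap(a, b):
        return a[0] < b[0] + b[2] and b[0] < a[0] + a[2]

    merged = []
    for box in sorted(boxes):          # scan boxes left to right
        if merged and x_overlap(merged[-1], box):
            last = merged[-1]          # union of the two rectangles
            x0 = min(last[0], box[0])
            y0 = min(last[1], box[1])
            x1 = max(last[0] + last[2], box[0] + box[2])
            y1 = max(last[1] + last[3], box[1] + box[3])
            merged[-1] = (x0, y0, x1 - x0, y1 - y0)
        else:
            merged.append(box)
    return merged
```

For the 'ế' example, the three contours (base letter, circumflex, acute accent) share a vertical column, so their boxes collapse into one box covering the fully accented character.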

Building training and testing data
From the raw data of 2,000 photos of Vietnamese textbooks, a dataset of 86,544 samples was obtained and labelled manually after pre-processing. This dataset was then divided into two parts: training and testing. The testing dataset consists of 50 images per character class, for a total of 6,900 samples, while the training dataset is the remainder, as detailed in Table 1. Figure 3 presents some images of the "Ợ" character class retrieved from the training dataset. Figure 3. Training data of the "Ợ" character class
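A per-class split like the one described (exactly 50 test samples per class, the rest for training) can be sketched as follows; the function name and the `(image, label)` sample format are illustrative assumptions.

```python
import random
from collections import defaultdict

def split_per_class(samples, test_per_class=50, seed=0):
    """Split (image, label) pairs so each class contributes exactly
    `test_per_class` samples to the test set; the rest go to training."""
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)   # group by label
    rng = random.Random(seed)
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)                   # random per-class selection
        test.extend(group[:test_per_class])
        train.extend(group[test_per_class:])
    return train, test
```

With 138 classes, the fixed quota of 50 per class yields exactly the 6,900 test samples reported above.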

Building a CNN model for Vietnamese printed character recognition and evaluating identification results
The proposed CNN model consists of four main layer types: convolution, ReLU, pooling, and fully-connected layers. The proposed CNN model is presented in Figure 4. Based on this figure, a CNN model was built. The model has an input layer that takes grayscale images of 28x28 pixels, followed by 2 convolution layers, each followed by a max-pooling layer, then 2 fully-connected layers, and finally an output layer with 138 outputs.
In convolution layer 1, the rows and columns of the input image are zero-padded so that, after the convolution, its size remains 28x28 pixels. In this work, 32 randomly initialized filters are used, producing 32 output channels. After convolution, the next layer is max pooling with a size of 2x2 and a stride of 2x2, which reduces the image to 14x14. In the second convolution layer, the image is likewise zero-padded to keep the dimension at 14x14, and this layer produces 64 output channels. After convolution, the image is passed through another max-pooling layer with a size of 2x2 and a stride of 2x2, reducing it to 7x7. After the two convolution and max-pooling layers, we obtain a layer of 7x7x64 = 3,136 neurons. Next, a fully-connected layer of 1,000 neurons is added, and finally the output layer consists of 138 outputs.
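The layer sizes traced above can be verified with a short calculation (a 2x2 pooling window with stride 2 is assumed throughout, as stated in the text):

```python
def conv_output(size, kernel=3, stride=1, pad="same"):
    """Spatial size after a convolution; 'same' padding keeps the size."""
    return size if pad == "same" else (size - kernel) // stride + 1

def pool_output(size, pool=2, stride=2):
    """Spatial size after max pooling."""
    return (size - pool) // stride + 1

size = 28                    # input: 28x28 grayscale character image
size = conv_output(size)     # conv 1 (zero-padded): still 28
size = pool_output(size)     # 2x2 max pool, stride 2: 14
size = conv_output(size)     # conv 2 (zero-padded): still 14
size = pool_output(size)     # 2x2 max pool, stride 2: 7
flat = size * size * 64      # 64 channels flattened: 7 * 7 * 64 neurons
```

The flattened layer therefore has 7 x 7 x 64 = 3,136 neurons feeding the 1,000-neuron fully-connected layer.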
The above model was built with the Keras library [29] in Python, as shown in Figure 5. In case 1, the accuracy on the training set improved over time and the loss approached zero, but the model had low, unstable accuracy on the testing data, since the loss was still large. This is the main reason the authors conducted further training with a larger dataset. In case 2, with the full training dataset, the results were better and the accuracy on the testing dataset was higher. After the two training sessions, the authors found that 50 epochs per session were sufficient, because the accuracy showed almost no significant change in the final period and the loss had reached a sufficiently small threshold. The results of the two cases are summarized in Table 2.
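A minimal Keras sketch of this architecture might look as follows. The kernel size (3x3), ReLU activations, and optimizer are assumptions, since the paper's exact code is shown in Figure 5; only the layer arrangement and sizes come from the text.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),                          # 28x28 grayscale input
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),   # -> 14x14x32
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),   # -> 7x7x64
    layers.Flatten(),                                        # -> 3,136 neurons
    layers.Dense(1000, activation="relu"),                   # fully-connected layer
    layers.Dense(138, activation="softmax"),                 # 138 character classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

The `softmax` output gives a probability over the 138 character classes; the predicted label is the class with the highest probability.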

Token identification in Vietnamese textbooks based on the developed CNN model
A mobile demo application was built on the Android platform to test the developed CNN model's accuracy and practicality. The application's input is an image of a textbook page, and its output is a sequence of tokens in text form. The application takes a picture, performs the pre-processing steps outlined in Section 2, and finally uses the trained CNN model to produce the main result: a text string of the textbook page, with potential for use in an automatic reading application. Some demo images of the application are shown in Figure 10. The photographs had some issues with brightness and shooting angle, which made the pre-processing steps difficult and therefore degraded text recognition. It is also tricky to distinguish uppercase from lowercase letters, because some uppercase and lowercase letters have similar shapes.

CONCLUSION
Through research and experimentation, the authors achieved the following main results: studying the deep learning model in detail and building a reduced-character-class CNN model for Vietnamese printed character recognition. The CNN model's training and testing datasets comprise 2,000 raw images, 79,644 training samples, and 6,900 testing samples. This work demonstrated that the developed CNN model is useful in image recognition, since relatively satisfactory results were achieved. The developed model can be used in Vietnamese spelling applications for learning Vietnamese. The authors also built and installed the CNN model on a mobile device to solve the problem of Vietnamese token identification in textbooks for learning and spelling Vietnamese.