Orchid types classification using supervised learning algorithm based on feature and color extraction

Received Oct 15, 2020; Revised Mar 20, 2021; Accepted Jul 1, 2021

Orchids are ornamental plants with a wide variety of types, and a single type can vary in the shape and color of its flowers. Here, we chose the support vector machine (SVM), Naïve Bayes, and k-nearest neighbor (KNN) algorithms, which generate text input. This system aims to assist the community in recognizing orchid plants by type. We used more than 2250 images for training and 1500 for testing, covering 15 types. The test results show how the three supervised algorithms compare with and without feature extraction and over several distance settings. We used SVM with linear, polynomial, and Gaussian kernels, while KNN was run with K from 1 to 11. The experimental results show that the linear kernel is the best classifier and that the extraction process increases accuracy. SVM had better accuracy than Naïve Bayes, at 66%, and than the best KNN result, 98% at K=1 and d=1. SVM-GLCM-HSV outperformed SVM with HSV only, achieving 98.13% and 93.06% respectively, both with the linear kernel. In addition, the combination of SVM and KNN yielded the highest accuracy of all the algorithms selected here.


INTRODUCTION
Orchids are the largest group of monocot plants, with an estimated 25,000 species in the world. The orchid is a beautiful flower that is counted among the ornamental plants widely cultivated in Indonesia [1]; orchids are also produced as cut flowers. The uniqueness of orchids lies in the shape and color [2] of the lip, or labellum [1], which distinguishes them from other plants. Dendrobium orchids [3] are able to meet the demands of consumers whose tastes change over time. This can be seen from the orchids on the market, which vary in flower color and shape, and from the appearance of new varieties that are more beautiful and attractive [4].
Orchid types can be compared by the color, texture, and petals of their flowers. Knowing these differences makes it easier to classify orchid types, but many orchids are similar in petals, texture, and color, so people find it difficult to distinguish orchid species, especially lay people who do not yet know their characteristics. A computer system for automatic flower classification is therefore needed to make classifying orchid types easier [3]. The classification process [5] is carried out with digital image processing technology. Such an information system is expected to determine the name and type of an orchid with sufficient accuracy. With this application, people's knowledge of the names and types of orchids can be increased [6]. Feature extraction [7], [8] is needed to obtain true and accurate information about the object contained in the image. Classification is the process of finding a set of models or functions that describe and distinguish the classes of data, in order to predict the class of objects or data whose class labels are unknown.
In this paper, we compare the accuracy of supervised learning algorithms, namely support vector machine (SVM), k-nearest neighbor (KNN), and Naïve Bayes (NB), in non-real-time classification. This study provides an overview for novice researchers developing a classification process with a supervised learning algorithm. To obtain the best accuracy, we applied kernel and range variations. SVM, KNN, and NB are popular but old-school algorithms. We chose them after considering the operating time and parameter setting of neural networks and their variants, for example the convolutional neural network (CNN) and the multilayer perceptron (MLP), which take quite a long time. Not everything is better when processed with the latest algorithms. Older algorithms still deserve analysis, because there is still much to be gained from the results of implementing them.

RESEARCH METHOD

Related research
Nilsback [9] conducted a large-scale investigation of orchid images in 103 image classes using four features, namely texture, boundary, petal, and color. A single feature gave an accuracy of 55.1%, while the combination of all features gave 72.8%. Measuring accuracy reliably is made harder by the similarity between classes and by the large and small class sizes in the testing data. SVM was also used in [10], where 13 orchid genotypes were identified by FTIR spectroscopy with three models, namely SSAE, SVM, and KNN. SSAE proved more accurate than SVM and KNN: SSAE produced 99.4% accuracy and 97.9% calibration, while KNN produced 100% accuracy but only 92.6% calibration. Paphiopedilum orchid species were recognized using a CNN by Arwatchananukul et al. [11], who used 1500 images in 15 classes and reached an accuracy of 98.6%. Image classification with Naïve Bayes, a simple method based on statistics and probability, achieved high accuracy in grading shallot quality [12]. There, Naïve Bayes was combined with the hue saturation value (HSV) color model and, using 60 training and testing data, produced 91.67% accuracy. The choice of HSV color model and color channels was adjusted to the theory of human vision and might change with different preprocessing to reach the highest accuracy. Jayech and Mabjoud found that regular Naïve Bayes (RN) achieved the highest mean classification compared with tree augmented Naïve Bayes (TAN) and forest augmented Naïve Bayes (FAN) [13]. A comparative study by K. Chandel et al. [14] showed that KNN is better than NB, with accuracies of 93.44% and 22.56% respectively. H. T. Zaw et al. [15] used Naïve Bayes to detect brain tumors. H. M. Zawbaa et al. [16] applied SVM to flower image classification and compared it with a random forest (RF) classifier using 215 flowers.
With SIFT and SFTA as features, the dataset was divided into 70% training and 30% testing data, and the results showed that SIFT-SVM yielded higher accuracy, at 100%, than SFTA-SVM. I. Mohamed et al. [17] trained a support vector machine with different strategies according to the organs and species of plants, using SIFT and OpponentColor SIFT. That experiment used a large dataset of 5061 leaf images with uniform backgrounds, 2107 leaf images with natural backgrounds, and 2167 flower images. Before classification with SVM, the images were clustered using K-Means. The optimal parameters, K=4000 and C=100, improved the score from 0.67 to 0.74. L. Fu et al. [18] studied fast and accurate detection of kiwifruit and showed that their model is small and efficient for real-time kiwifruit detection in the orchard. The experiment, which required considerable time and costly hardware, used R-CNN with ZFNet, Faster R-CNN with VGG16, YOLOv2, and YOLOv3-tiny; the DY3TNet model achieved a precision of 0.9005 with 27 MB of data. A. Koirala et al. [19] used R-CNN on the COCO dataset to detect mango fruit, with around 400 training tiles. MangoYOLO(bu) achieved an F1 score of 0.89 on a daytime mango image dataset. Waxberry images have also been used for detection [20]. That research used the COCO dataset with MR-CNN, compared against K-Means on a verification sample set, and reached an average detection accuracy and recall rate of 97% and 91%, respectively.

Based on previous research and the need for orchid image classification, we propose SVM. The algorithm was tested with three kernels: linear, polynomial, and Gaussian. We investigated the accuracy obtained with GLCM-HSV and with HSV only, compared the results with KNN, and examined the effect of feature extraction on achieving the best accuracy.

Proposed method
The preprocessing steps include cropping, resizing, and feature extraction. Cropping cuts the image to keep its important part, while resizing changes the size of the image. Feature extraction is the stage that recognizes the characteristics or information of objects in the image, where a feature is a form that is unique to the image. In this study, orchid features are extracted using the gray level co-occurrence matrix (GLCM) method. Figure 1 shows the stages we implemented to find the best accuracy. In Figure 1, the orchid data consist of 2250 original images. The 2250 orchid images are cropped and then resized to a common size of 512x512 pixels. The data are grouped into two parts, training data and testing data: 2250 original images for training and 1500 original images for testing. The RGB values of each original image are then read and converted to grayscale, and a co-occurrence matrix is created with a distance of 1 and an angle of 0°. The four parameter values, namely contrast, energy, correlation, and homogeneity, are computed with MATLAB R2015a. These four values form the GLCM feature of the image. The HSV values of each orchid image are obtained by converting RGB to HSV, after which the average hue, saturation, and value are calculated. The GLCM and HSV values are then used as input to classify the orchid flowers with the SVM method. Once the classification results are obtained, the accuracy is calculated, and the results of the classification process are observed and recorded.
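The cropping, resizing, and grayscale steps can be illustrated with a minimal Python sketch. It is not the MATLAB R2015a implementation used in this study: the function names are hypothetical, nearest-neighbor resizing is an assumed interpolation choice, and the grayscale weights (0.299, 0.587, 0.114) are the common luminance weights, assumed here rather than taken from the text.

```python
def crop(image, top, left, height, width):
    """Keep the rectangular region of interest of a row-major image."""
    return [row[left:left + width] for row in image[top:top + height]]


def resize_nearest(image, out_h, out_w):
    """Resize by nearest-neighbor sampling (assumed interpolation method)."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def to_gray(image):
    """Convert (R, G, B) pixels to gray levels with common luminance weights."""
    return [
        [round(0.299 * R + 0.587 * G + 0.114 * B) for (R, G, B) in row]
        for row in image
    ]


# Tiny illustrative image: a 2x2 checkerboard of white and black pixels.
rgb = [[(255, 255, 255), (0, 0, 0)],
       [(0, 0, 0), (255, 255, 255)]]
gray = to_gray(rgb)
enlarged = resize_nearest(gray, 4, 4)   # the study resizes to 512x512
center = crop(enlarged, 1, 1, 2, 2)
```

In practice the target size would be 512x512, and the crop rectangle would be chosen around the flower.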

Hue saturation value
HSV is a color extraction feature used for basic color classification, and it tolerates changes in light intensity [21]. RGB can be converted to HSV with the calculation steps in (1) to (10). HSV has some advantages over other color spaces because its components follow human vision. Hue (H) describes the original color, such as blue, yellow, or green, as it is clearly seen by human vision; its angular values range from 0° to 360°. Saturation (S) is the relative purity of the color, represented as the distance from the black and white axis, with a value of 0 to 100. Value (V) is represented as the height on the black and white axis, that is, the darkness of a color. In these equations, R is the red value before normalization, r is the normalized red value, G is the green value before normalization, g is the normalized green value, B is the blue value before normalization, b is the normalized blue value, V is value, S is saturation, and H is hue. Equations (8) to (10) are used to convert the image to an 8-bit image. Value ranges from 0 to 100, where 0 is black. At a value of 100 the color is white or a more or less saturated color, depending on the saturation level.
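As a rough illustration of the color feature, the sketch below converts 8-bit RGB pixels to HSV and averages the three channels. It assumes the standard RGB-to-HSV conversion provided by Python's `colorsys` module, which may differ in detail from (1) to (10); dividing by 255 is the assumed normalization, and the function names are hypothetical.

```python
import colorsys


def rgb_to_hsv_scaled(R, G, B):
    """Convert 8-bit RGB to (hue in degrees, saturation 0-100, value 0-100)."""
    # Assumed normalization: divide each 8-bit channel by 255.
    r, g, b = R / 255.0, G / 255.0, B / 255.0
    h, s, v = colorsys.rgb_to_hsv(r, g, b)  # each result is in [0, 1]
    return h * 360.0, s * 100.0, v * 100.0


def mean_hsv(pixels):
    """Average H, S and V over a list of (R, G, B) pixels."""
    hsv = [rgb_to_hsv_scaled(*p) for p in pixels]
    n = len(hsv)
    return tuple(sum(c[k] for c in hsv) / n for k in range(3))


red = rgb_to_hsv_scaled(255, 0, 0)
feature = mean_hsv([(255, 0, 0), (0, 0, 0)])
```

Note that hue is an angle, so a plain arithmetic mean can be misleading for colors near 0° and 360°; the simple average is used here only because the text describes averaging the three channels.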

Gray level co-occurrence matrix
GLCM is a feature extraction method based on second-order texture calculation, where second-order texture calculation considers the relationship between pairs of pixels in the original image [22]. The neighboring pixel can be selected eastward (to the right). Such a relationship can be written as (1,0), meaning a pixel of value 1 followed horizontally by a pixel of value 0. The number of pixel pairs that satisfy each relationship is then counted. The GLCM is calculated in the following steps. First, form the initial GLCM matrix from pairs of pixels in the direction 0°, 45°, 90°, or 135°. Second, form a symmetric matrix by adding the initial GLCM matrix to its transpose. Third, normalize the GLCM matrix to eliminate dependence on image size by dividing each matrix element by the number of pixel pairs. Fourth, calculate the feature extraction values from the GLCM. In (10), i is the row index of the matrix, j is the column index of the matrix, and p(i, j) is the element of the co-occurrence matrix at row i and column j. In (11) to (16), i, j, and p(i, j) have the same meaning, µi and µj are the mean values of the elements over the rows and columns of the matrix, and σi and σj are the standard deviations over the rows and columns of the matrix.
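The four steps can be sketched in plain Python as follows. This is an illustrative sketch rather than the MATLAB code used in the study; the function names and the small test image are hypothetical, and the feature formulas are the commonly used definitions of contrast, energy, homogeneity, and correlation, assumed to correspond to (11) to (16).

```python
import math


def glcm(image, levels, dx=1, dy=0):
    """Symmetric, normalized co-occurrence matrix for offset (dx, dy).

    dx=1, dy=0 is distance 1 at angle 0 degrees (right-hand neighbor).
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    # Step 2: add the transpose to make the matrix symmetric.
    sym = [[counts[i][j] + counts[j][i] for j in range(levels)]
           for i in range(levels)]
    # Step 3: normalize by the total number of counted pairs.
    total = sum(map(sum, sym))
    return [[v / total for v in row] for row in sym]


def glcm_features(p):
    """Contrast, energy, homogeneity and correlation of a normalized GLCM."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    mu_i = sum(i * v for i, _, v in cells)
    mu_j = sum(j * v for _, j, v in cells)
    sd_i = math.sqrt(sum((i - mu_i) ** 2 * v for i, _, v in cells))
    sd_j = math.sqrt(sum((j - mu_j) ** 2 * v for _, j, v in cells))
    return {
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        "energy": sum(v * v for _, _, v in cells),
        "homogeneity": sum(v / (1 + abs(i - j)) for i, j, v in cells),
        # Undefined (division by zero) for a constant image.
        "correlation": sum((i - mu_i) * (j - mu_j) * v
                           for i, j, v in cells) / (sd_i * sd_j),
    }


# Hypothetical 4x4 image with 4 gray levels.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]
p = glcm(image, levels=4)
features = glcm_features(p)
```

For this image the symmetric count matrix before normalization is [[4, 2, 1, 0], [2, 4, 0, 0], [1, 0, 6, 1], [0, 0, 1, 2]], with 24 counted pairs in total.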

Support vector machine
SVM works very well on high-dimensional data sets [23]. The method uses a kernel technique that maps the original data from its original dimension to another, relatively higher dimension [24]. In the NN method, the training process studies all training data, whereas SVM studies only the selected data used in classification [25]. The k-nearest neighbor method stores all the training data for use at prediction time [26], whereas SVM stores only a small portion of the training data for use at prediction time, as in (17), where b is the bias value, m is the number of support vectors, and ɸ(x_i)·ɸ(z) is the kernel function. For non-linear data, the kernel method can be applied to the features of the initial data set [27]. The concept of kernel substitution can also be used in other data analysis methods, but SVM is one of the best-known methods that use the kernel to represent data [28], [29].
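The prediction step can be sketched as below, assuming the usual kernel form of the SVM decision function, f(z) = Σ α_i y_i K(x_i, z) + b summed over the m support vectors. The multipliers α_i and labels y_i are standard SVM quantities that are assumed here, and the support vectors, kernel parameters, and function names are hypothetical values chosen only for illustration.

```python
import math


def linear_kernel(x, z):
    """K(x, z) = x . z"""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=3, coef0=1.0):
    """K(x, z) = (x . z + coef0) ** degree"""
    return (linear_kernel(x, z) + coef0) ** degree


def gaussian_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def svm_decision(z, support_vectors, labels, alphas, bias, kernel=linear_kernel):
    """Return +1 or -1 from the sign of sum_i alpha_i * y_i * K(x_i, z) + b."""
    score = sum(a * y * kernel(x, z)
                for x, y, a in zip(support_vectors, labels, alphas)) + bias
    return 1 if score >= 0 else -1


# Hypothetical trained model with two support vectors.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]
bias = 0.0

prediction = svm_decision((2.0, 0.0), support_vectors, labels, alphas, bias)
```

This decision function separates two classes only. Classifying 15 orchid types requires a multi-class scheme such as one-vs-one or one-vs-rest, which is not specified in the text and is left out of the sketch.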

K-nearest neighbor
The KNN algorithm is a classification method that works on the proximity of data to other data: a new query instance is assigned to the category held by the majority of its nearest neighbors [10]. Euclidean distance is widely used for numeric data in the k-nearest neighbor technique; it measures the straight line from a training data point to a testing data point. According to Partiningsih et al. [30], Euclidean distance can order images by their level of similarity with good results. Here d(x, y) is the distance between the training data and the testing data, x_i is the training data, y_i is the testing data, i is the data variable, and p is the data dimension.
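A minimal sketch of this procedure, assuming the usual Euclidean distance d(x, y) = sqrt(Σ (x_i - y_i)²) over the p feature dimensions and a simple majority vote, is given below. The feature values, labels, and function names are hypothetical.

```python
import math
from collections import Counter


def euclidean_distance(x, y):
    """Straight-line distance between two feature vectors of equal length."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(train_features, train_labels, query, k=1):
    """Return the majority label among the k training points nearest to query."""
    ranked = sorted(
        zip(train_features, train_labels),
        key=lambda pair: euclidean_distance(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional features for two orchid types.
train_features = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_labels = ["Vanda", "Vanda", "Laelia", "Laelia"]

nearest = knn_predict(train_features, train_labels, (0.2, 0.1), k=1)
majority = knn_predict(train_features, train_labels, (5.5, 5.0), k=3)
```

With k=1 the query simply takes the label of its closest training point, which is the setting reported as most accurate for KNN in this study.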

Naïve Bayes
The NB algorithm is a supervised learning method, which means it needs labeled training data beforehand to make decisions or predictions. An advantage of the Naïve Bayes algorithm is that it does not use numerical optimization, so it is computationally cheaper [31]. The algorithm is efficient in training and can use binary or polynomial data [13], [32]. Here P(X | Y) is the probability of the data with vector X in class Y, P(Y) is the initial probability of class Y, P(Xi | Y) is the independent probability in class Y of each feature in vector X, and P(X) is the probability of X. Categorical data only require counting all the possibilities that occur, while continuous data can be handled with the following steps [33]: (1) calculate the prior probability of each class; (2) calculate the mean of each feature as in (19), where k is the amount of data and n is the data value; (3) compute the standard deviation of each feature as in (20); (4) calculate the probability density as in (21); (5) calculate the probability of each class as shown in (22).
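The steps for continuous data can be sketched as a small Gaussian Naïve Bayes classifier. This sketch assumes a normal probability density for each feature and the sample standard deviation (divisor k - 1); whether (20) uses k or k - 1 is not stated in the text. The data, labels, and function names are hypothetical.

```python
import math
from statistics import mean, stdev


def fit_gaussian_nb(samples, labels):
    """Steps 1-3: prior, per-feature mean and standard deviation per class."""
    model = {}
    for cls in set(labels):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        columns = list(zip(*rows))
        model[cls] = {
            "prior": len(rows) / len(labels),
            "means": [mean(col) for col in columns],
            # Sample standard deviation; needs at least 2 rows per class.
            "stds": [stdev(col) for col in columns],
        }
    return model


def gaussian_pdf(x, mu, sigma):
    """Step 4: normal probability density of x given mean and std."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * sigma)


def predict_gaussian_nb(model, sample):
    """Step 5: pick the class with the largest prior times densities."""
    scores = {}
    for cls, params in model.items():
        score = params["prior"]
        for x, mu, sigma in zip(sample, params["means"], params["stds"]):
            score *= gaussian_pdf(x, mu, sigma)
        scores[cls] = score
    return max(scores, key=scores.get)


# Hypothetical two-feature training data for two classes.
samples = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2),
           (5.0, 6.0), (5.2, 5.8), (4.8, 6.2)]
labels = ["A", "A", "A", "B", "B", "B"]

model = fit_gaussian_nb(samples, labels)
predicted = predict_gaussian_nb(model, (1.1, 2.1))
```

The denominator P(X) is omitted because it is the same for every class and does not change which class scores highest.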

RESULTS AND DISCUSSION
Here, we used 15 types of orchid taken from ImageNet, namely Dendrobium, Brassavola, Cattleya, Cymbidium, Epidendrum, Vanda, Pleurothallis, Oncidium, Calanthe, Coelogyne, Odontoglossum, Masdevallia, Laelia, Caladenia, and Helleborine, with 100 images of each type used as testing data at 512x512 pixels. A sample of the dataset for each type of orchid and a sample of the preprocessing are shown in Figure 2 and Figure 3 respectively. Figure 2 can be used as knowledge about the values of the GLCM and HSV. Table 1 shows the classification accuracy of the SVM algorithm with linear, polynomial, and Gaussian kernels, with and without GLCM-HSV. We also compared our SVM with Naïve Bayes. All data were tested using the accuracy measure in (23), based on [3]. SVM performs better than Naïve Bayes, which yields only 66%, far below the accuracy of the SVM of up to 85%. The classification results show that SVM with GLCM and HSV feature extraction is the best classifier for orchids. To give a further overview of this research, we conducted another trial and tested the classification model with k-nearest neighbor, using KNN with HSV and GLCM as shown in Table 2. We compare several supervised learning algorithms to illustrate the superiority of the SVM algorithm. In addition to these comparisons, we also present a combination of SVM and KNN, with and without feature optimization. SVM-KNN turns out to produce the highest accuracy, so although SVM alone classifies well, the combination of SVM and KNN produces higher accuracy than SVM alone, whether or not features are used. According to Table 2, each K achieves a different accuracy, because classes are identified by pixel distance. The highest accuracy is produced at d=1 and K=1.
We also tested our dataset using Naïve Bayes and then compared the best KNN result with SVM and Naïve Bayes, as spelled out in Table 1. SVM is the more reliable classifier; however, KNN is less computationally intensive than SVM [34]. SVM has been reported to perform better than KNN [10], whereas J. Kim et al. [35] concluded that KNN is better than SVM. In light of these studies, we examined the accuracy in Table 1 and Table 3, where KNN at K=1 still produces lower accuracy than SVM. Based on Table 3 and Figure 4, our proposed method produces the highest accuracy compared with Naïve Bayes and KNN, and SVM-KNN proved more accurate than SVM alone. Features provide new knowledge that can also improve the accuracy of the machine, although not significantly. This is because the GLCM feature plays a role in identifying image texture, especially in orchid images that have curves and embossed lines with a slightly hairy element, such as Dendrobium, Vanda, and Laelia. Table 3 also shows that SVM-HSV is better than Naïve Bayes and than KNN at several values of K.
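The accuracy values reported in this section can be read as the percentage of test images whose predicted type matches the true type. The sketch below assumes that (23) has this usual form; the labels and the function name are hypothetical.

```python
def accuracy(true_labels, predicted_labels):
    """Percentage of predictions that match the true labels."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


# Hypothetical example: 3 of 4 test images are classified correctly.
true_labels = ["Vanda", "Laelia", "Vanda", "Cattleya"]
predicted_labels = ["Vanda", "Laelia", "Cattleya", "Cattleya"]
score = accuracy(true_labels, predicted_labels)
```

Under this definition, an accuracy of 98.13% on 1500 test images corresponds to roughly 1472 correctly classified images.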

CONCLUSION
We tested 1500 orchid images of 15 flower types using three supervised learning algorithms, namely SVM, KNN, and Naïve Bayes. In the experiments with features, KNN and SVM achieved higher accuracy than on data processed without features. When the algorithms are not combined, SVM produces better accuracy than Naïve Bayes both with and without features. SVM without features is also more accurate than some KNN results with features at certain K. Conversely, at K=1 to K=5 with low d, KNN has higher accuracy than SVM-GLCM with the Gaussian kernel. Combining SVM and KNN increases accuracy whether or not features are used, but in our experiments we were not able to reach 100% accuracy. Another finding is that the linear kernel is the most suitable for this classification process, giving better results than the polynomial or Gaussian kernels. Maximizing accuracy remains our challenge. SVM may be combined with other algorithms or feature extraction methods, for example local binary patterns (LBP), to obtain better feature values than GLCM.