Bulletin of Electrical Engineering and Informatics

Received Sep 6, 2022; Revised Oct 6, 2022; Accepted Oct 21, 2022

Image classification is the process of assigning each input image a label from a fixed set of categories. Assigning labels with traditional methods is difficult when the number of images is large. To solve this problem, we resort to deep learning techniques, which enable computers to recognize and extract visual characteristics. The convolutional neural network (CNN) is a deep neural network used for many purposes, such as image classification, object detection, and face recognition, because of its high accuracy in classification and detection tasks. In this paper, we develop a CNN based on the transfer learning approach for image classification. The network uses blocks from two transfer learning architectures, ResNet and DenseNet, as its building blocks, together with a multilayer perceptron (MLP) classifier. The proposed method does not require the datasets to be preprocessed before they are input into the network. It was trained on two datasets: Cifar-10 and Sign-Traffic. We conclude that the proposed method achieves the best performance compared with other state-of-the-art methods. The accuracy obtained is 97.45% and 99.45%, respectively, and the proposed CNN improves on the accuracy of other methods by 3%.


INTRODUCTION
Image classification has been a significant problem in artificial vision systems for decades and is the subject of a large volume of published studies. The field aims to assign a label to a picture based on its visible content [1]. Researchers benefit from image classification because it allows them to organize images according to their shared characteristics. For example, if images A and B share specific characteristics, we may classify and label them as part of the same set. Earlier studies tested classification algorithms in various ways and compared them [2]. Object detection and classification are among the most challenging tasks in image processing, and several object classification methods have been suggested over the years to address them [3].
Researchers and developers are now able to build bigger models to handle complicated problems, something that was previously impossible with traditional artificial neural networks (ANNs) [4]. Most earlier studies used hand-crafted features, such as the histogram of oriented gradients (HoG) [5] or the scale-invariant feature transform (SIFT) [6], to describe a picture with discriminative power. The extracted features are then fed into a learnable classifier, such as a support vector machine, a random forest, or a decision tree.
However, discovering useful characteristics in a large number of images is highly challenging. For these and other reasons, models based on deep neural networks have come to the fore. The convolutional neural network (CNN) is widely used for image identification and is one of the best-known deep neural networks. Many computer vision and natural language processing applications, such as image identification and object identification, have benefited from it [3]. Furthermore, CNNs are highly effective at solving machine learning problems. For instance, a complete image classification dataset is useful for programs that work with images. Over the previous decade, CNNs have been widely used to improve image classification accuracy. Because a CNN learns features and classifier jointly, it can provide superior classification accuracy on large datasets [7]. The bag-of-features pipeline has also been used in image classification approaches: clustering is performed on SIFT descriptors [8], and features are aggregated via spatial pooling [9], histogram encoding [10], and, most recently, Fisher vector encoding [11]. While these representations have been shown to give workable results, it is not clear whether they are ideal for the task at hand. Designing them requires a lot of time and effort, not to mention the cost of hiring specialized personnel. The AlexNet deep CNN by Krizhevsky et al. [12] stands out as the most novel of these networks (a deep network trained on graphics processing units (GPUs), with 60 million parameters and 650,000 neurons). In the ImageNet challenge (ILSVRC 2012), AlexNet outperformed its rivals, achieving a top-five error rate of just 15.3%. The runner-up, which was not a CNN variant, had a top-five error rate of about 26.2%. Gehring et al. [13] developed a CNN architecture for sequence learning.
The model outperforms the recurrent models, which failed to understand the compositional nature of the sequences. In addition, all of the components may be parallelized entirely during training for more efficient calculations.
To further ease optimization, the number of nonlinearities is fixed and independent of the input length. Ye et al. [14] developed an alternative approach to CNNs. They detailed the pixel-by-pixel operation of the CNN and showed its use in several contexts. Since other kinds of CNN have more features and processing capacity, many researchers employ them as the primary image classifiers in their studies. A comparison of the suggested approach with others indicated that it was superior. To classify images more quickly and accurately than previous models, Han et al. [15] suggested a novel CNN approach, which they tested on six distinct small datasets to verify their findings. After analyzing the outcomes, they concluded that this strategy is simple to apply to small datasets. Song et al. [16] presented a novel model for multi-label image classification. The basic premise of their study was to train a model using various data sources, including multi-label image data, which is helpful when many labels are needed. According to the accuracy figures reported in the paper, the model beats other published models. Çalik and Demirci [17] developed and validated a method for classifying images on embedded systems. They employed a CNN to achieve an accuracy of 85.9% on the Cifar-10 dataset. Sharma et al. [18] reported a comparison of common CNNs in their paper "Empirical study of the output of common convolution neural networks for object identification in real-time video feeds."
The paper is structured as follows. Section 2 summarizes the components used to construct the suggested model. Section 3 discusses the proposed strategy for classifying images. Section 4 describes the experimental setup, which includes the datasets used to train the proposed model, the results, the evaluation metrics used to assess the accuracy and loss of the model, the discussion of the results, and a comparison of the accuracy of the proposed model with that of previous models. Section 5 concludes the paper.

METHOD
2.1. DenseNet and ResNet blocks of the suggested framework
ResNet [19] and DenseNet [20] are two CNNs that have been regarded as state of the art in recent years. Their success is primarily attributed to their respective building components, ResNet blocks (RBs) and DenseNet blocks (DBs). Figure 1 shows an example of an RB composed of three convolutional layers and one skip connection. The convolutional layers are named Conv1, Conv2, and Conv3. On Conv1, a reduced number of 1×1 filters shrinks the channel dimension of the input in order to reduce the computational cost of Conv2. On Conv2, larger filters, such as 3×3, learn features while keeping the spatial dimension unchanged. On Conv3, 1×1 filters are employed again, this time to raise the channel dimension so that more features can be generated. The output of Conv3 is added to the input to form the output of the RB. If the dimensions of the input and of Conv3's output differ, a convolution with 1×1 filters is applied to the input so that it matches the dimensionality of Conv3's output before the sum. Figure 2 displays an example of a DB. For simplicity, the DB shown has just four convolutional layers; in practice, the number of convolutional layers in a DB can be chosen by the user. Each convolutional layer in the DB takes as input the input data and the outputs of all previous convolutional layers. The studies in [21], [22] investigated the mechanism underlying the success of RBs and DBs, revealing that they mitigate the vanishing gradient problem [23], so that a deep architecture can efficiently learn the classification task from the training dataset and thereby improve classification accuracy.
Additionally, it has been suggested that the dense connections in DBs reuse low-level features to improve the discriminative power of the features learned in the upper layers of CNNs [20]. The suggested method selects RBs and DBs as its building blocks primarily because of these advantages.
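The two connection patterns described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: it works on the channel vector of a single pixel, so every layer, including the stand-in for the 3×3 Conv2, reduces to a 1×1 convolution, and all function names and weight shapes are assumptions made for illustration.

```python
import random


def conv1x1(x, weights):
    """Apply a 1x1 convolution to one pixel: a linear map over its channels."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def random_weights(out_ch, in_ch, rng):
    """Hypothetical helper: small random weights of shape (out_ch, in_ch)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_ch)] for _ in range(out_ch)]


def residual_block(x, w_reduce, w_mid, w_expand, w_proj=None):
    """Bottleneck RB: reduce channels, transform, expand, then add the input."""
    h = relu(conv1x1(x, w_reduce))   # Conv1: fewer channels
    h = relu(conv1x1(h, w_mid))      # stand-in for the 3x3 Conv2 (single pixel only)
    h = conv1x1(h, w_expand)         # Conv3: more channels
    # Skip connection; project the input with a 1x1 convolution if shapes differ.
    shortcut = x if w_proj is None else conv1x1(x, w_proj)
    return relu([a + b for a, b in zip(h, shortcut)])


def dense_block(x, layer_weights):
    """DB: every layer sees the input concatenated with all earlier outputs."""
    features = list(x)
    for w in layer_weights:
        new = relu(conv1x1(features, w))
        features = features + new    # concatenate along the channel axis
    return features
```

With an 8-channel input and four dense layers that each add 3 channels, `dense_block` returns 8 + 4 × 3 = 20 channels, whereas `residual_block` keeps the channel count of its shortcut path. This difference (summation versus concatenation) is the main structural distinction between the two blocks.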

Suggested framework for image classification
We suggest a framework based on a deep neural network with four blocks. The network consists of two blocks from ResNet50 followed by two blocks from DenseNet121, then a fully connected layer and a multilayer perceptron classifier that uses softmax as the activation function. By default, these blocks share the same configuration parameters. This CNN proved effective during training on both the Cifar-10 and the Sign-Traffic datasets. In every test, we split the dataset into two parts: training and validation. The best model is the one with the lowest validation loss within the first 30 training iterations. The architecture accepts images of varying sizes as input, and the input images are zero-padded. Figure 3 illustrates the suggested network, and Table 1 lists the settings of the blocks used to design the architecture.
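Two of the steps mentioned above, zero-padding the inputs and keeping the model with the lowest validation loss, can be sketched as follows. This is an illustrative sketch only: the paper does not state where the padding is placed, so centering the image is an assumption, and the function names `zero_pad` and `best_iteration` are hypothetical.

```python
def zero_pad(image, target_h, target_w):
    """Center a 2-D image (a list of rows) on a zero canvas of the target size.

    Centering is an assumption; the padding position is not given in the text.
    """
    h, w = len(image), len(image[0])
    if h > target_h or w > target_w:
        raise ValueError("image is larger than the target size")
    top = (target_h - h) // 2
    left = (target_w - w) // 2
    padded = [[0] * target_w for _ in range(target_h)]
    for r, row in enumerate(image):
        padded[top + r][left:left + w] = row
    return padded


def best_iteration(val_losses, window=30):
    """Return the index of the lowest validation loss within the first `window` entries."""
    considered = val_losses[:window]
    return min(range(len(considered)), key=considered.__getitem__)
```

For example, `zero_pad([[1, 2], [3, 4]], 4, 4)` surrounds the 2×2 image with a one-pixel border of zeros, and `best_iteration` ignores any loss recorded after the first 30 iterations.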

BENCHMARK DATA SETS
3.1. Cifar-10
Cifar-10 is a dataset of 60,000 natural RGB images of 32×32 pixels [24], with a training set of 50,000 images and a test set of 10,000 images. The images have different backgrounds and light sources. Objects are not restricted to the center of the image, and their sizes vary by orders of magnitude. The images are divided into ten classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck, as shown in Figure 4.

Sign signal
The traffic sign dataset [25] contains more than 360 images in total, divided into different classes. To keep the test data unseen during training, we hold out 180 images from the training set for validation and use 180 images for testing. The images belong to four classes: "stop sign", "non stop sign", "green light", and "red light". Both the training and the testing data are distributed over these categories, as shown in Figure 5.
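One common way to hold out part of a small dataset while keeping every class represented is a stratified split. The sketch below is an assumption about how such a split could be done, not the procedure used in the paper; the function name `stratified_split`, the 50% fraction, and the fixed seed are illustrative choices.

```python
import random
from collections import defaultdict


def stratified_split(labels, held_out_fraction=0.5, seed=0):
    """Split sample indices so each class keeps roughly the same share in both parts."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, held_out = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * held_out_fraction)
        held_out.extend(indices[:cut])   # validation or test part
        train.extend(indices[cut:])      # training part
    return sorted(train), sorted(held_out)
```

Applied to a hypothetical list of 360 labels with 90 images per class, the function returns two disjoint index lists of 180 images each, with 45 images of every class in the held-out part.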

EXPERIMENT RESULTS AND DISCUSSION
This experiment addresses multi-class image classification. In this part, we compare the proposed method with other image classification methods described in the literature, demonstrating that the suggested network performs well enough for our current needs. Cifar-10, which is widely used in detection and classification applications, and the Sign-Traffic dataset are used to evaluate the efficacy of the proposed architecture. The model is built using the Keras library and the Google TensorFlow framework on a machine with 16 GB of RAM and an NVIDIA GeForce GTX 1650. The Adam optimizer was used with a learning rate of 0.001 and a batch size of 50 samples. The model uses the categorical cross-entropy loss function, and a dropout rate of 0.5 is used to avoid overfitting. We used an MLP as the classifier because it combines the features non-linearly to model complex patterns. The accuracy receiver operating characteristic (ROC) curve results for the Cifar-10 dataset are shown in Figure 6(a), while the error rate ROC curve results for Cifar-10 are shown in Figure 6(b). Likewise, the accuracy ROC curve results for Sign-Traffic are shown in Figure 7(a), and the error rate ROC curve results for Sign-Traffic are shown in Figure 7(b). Table 2 shows that the proposed network outperforms competing methods, with a classification accuracy of 97.45% on Cifar-10 and 99.45% on the Sign-Traffic dataset. The results are reasonably stable and accurate, giving valuable insight into the classification performance on the images.
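The softmax output and the categorical cross-entropy loss mentioned above can be written out in a few lines of plain Python. This is a sketch of the standard definitions for a single sample, not the Keras implementation used in the experiments; the `TRAINING_CONFIG` dictionary merely collects the hyperparameters stated in the text, and its key names are assumptions.

```python
import math

# Hyperparameters as stated in the text; the key names are illustrative.
TRAINING_CONFIG = {
    "optimizer": "adam",
    "learning_rate": 0.001,
    "batch_size": 50,
    "dropout": 0.5,
}


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Loss for one sample: minus the log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))
```

The loss is small when the probability assigned to the true class is close to 1 and grows as that probability shrinks, which is why minimizing it pushes the softmax output toward the correct label.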

Comparisons with state-of-the-art works
To further check the performance of the proposed CNN model, we compare it with several state-of-the-art works. The handcrafted-AE-CNN by Sun, the CNN proposed by Yim et al. [1], and the network proposed by Aamir et al. [3] are designed for Cifar-10 classification tasks, so they are compared on that dataset only. In addition, we compare the results on the Sign-Traffic dataset with the CNN proposed by Jmour and Zayen. Table 3 reports the multi-class classification accuracy on these datasets. The proposed CNN obtains the best results in the multi-class classification tasks.

Evaluation metrics
We use accuracy (ACC) and error rate (ERR) to evaluate the model's performance. Accuracy indicates how often the model's predictions are correct. Accuracy and error rate are inversely related: high accuracy means a low error rate, and a high error rate means low accuracy. ACC is obtained by dividing the number of correct predictions by the total number of observations in the dataset, as shown in (1). ERR is obtained by dividing the number of incorrect predictions by the total number of observations, as shown in (2).

$$ACC = \frac{TP + TN}{P + N} \quad (1)$$

$$ERR = \frac{FP + FN}{P + N} \quad (2)$$

where TP + TN is the number of correct predictions, FP + FN is the number of incorrect predictions, and P + N is the total number of samples in the dataset. The values TP, TN, FP, and FN are taken from the confusion matrix shown in Table 4.
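The definitions of ACC and ERR given above translate directly into code. The following is a minimal sketch with hypothetical function names; the second helper extends the same idea to a multi-class confusion matrix, where the correct predictions lie on the diagonal.

```python
def accuracy_and_error(tp, tn, fp, fn):
    """Return (ACC, ERR) from the four confusion-matrix counts."""
    total = tp + tn + fp + fn  # equals P + N
    if total == 0:
        raise ValueError("the confusion matrix is empty")
    acc = (tp + tn) / total
    err = (fp + fn) / total
    return acc, err


def accuracy_from_matrix(matrix):
    """Multi-class accuracy: diagonal entries (correct predictions) over all entries."""
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("the confusion matrix is empty")
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total
```

Because every prediction is either correct or incorrect, the two values returned by `accuracy_and_error` always sum to 1, which is the inverse relationship between ACC and ERR.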

CONCLUSION AND FUTURE WORK
In this study, we developed a technique that employs a deep neural network consisting of two blocks from each of two transfer learning architectures, namely ResNet and DenseNet, followed by a fully connected layer. The approach was trained and tested on the Cifar-10 and Sign-Traffic datasets. This kind of network, the CNN, is used to classify image data. We have shown that fine-tuning the settings is a crucial and beneficial training strategy. Based on these results, we conclude that image classification using deep neural networks can achieve high performance. The suggested network needed fewer processing resources and less memory. It improves classification accuracy and yields acceptable identification results compared with conventional methods. In addition, the network's performance assessment indicates that it may be used to construct a considerably better classifier. In future work, researchers can use other transfer learning networks in place of ResNet and DenseNet for image classification.