Hyper-parameter optimization of convolutional neural network based on particle swarm optimization algorithm

ABSTRACT

A CNN detects edges from raw pixels in the first stage, then detects simple shapes in the second stage, and then detects progressively higher-level features. The outputs of the convolution process are passed through a nonlinear activation function, the most common of which is the rectified linear unit (ReLU). The pooling layer reduces the dimensionality of the feature map while keeping the critical data, as shown in Figure 2; this layer has a filter component called the kernel, and the output of a kernel is the mean or maximum value of the region it covers. After that, a fully connected layer is added to the network, which can be any kind of classifier, such as a multi-layer perceptron. When applying CNN-based methods, choosing the ideal hyper-parameters is a challenging task: from the machine learning perspective, there is no definite method to specify a network structure that will enhance performance, and most network architectures are selected by intuition rather than by deliberate choice. We propose a method that learns the optimal CNN hyper-parameter values automatically, leading to a competitive CNN structure that can be deployed in any application that uses CNNs. The approach is based on the PSO algorithm: with only 15 particles and 30 iterations, it finds architectures that achieve a testing error competitive with other designs, and it applies to most CNN architectures with no encoding strategy during computation and no parallel layers, using just a simple CNN. Since training a CNN is a time-consuming task, we chose the PSO algorithm for the search over optimal CNN architectures because its convergence is faster than that of genetic algorithms [9], [10]. Our contributions are:
- Introducing a CNN that uses optimized parameters in a simple way.
- Finding a CNN architecture that achieves a testing error competitive with other designs.
- Using CNN architectures with no encoding strategy during computation and no parallel layers.
- The first research to test the CNN hyper-parameter optimization problem with the bee colony algorithm.
The paper is organized as follows: section 2 reviews related work, section 3 presents the proposed method, section 4 evaluates the performance on the MNIST dataset, and the last section concludes.

RELATED WORK
Numerous studies have suggested CNN architectures such as AlexNet [11]; some increased the network depth to enhance accuracy [12], while others added new inner configurations [13]. Although these models have demonstrated their effectiveness, many of them were designed manually. On the other hand, there have been many attempts to design appropriate models automatically, such as Sehla and Afef [14], who optimized the CNN parameters with a genetic algorithm. More methods for hyper-parameter optimization have been proposed by David et al. [15], Ilya and Frank [16], and Francisco and Gary [17], who applied evolutionary algorithms (EAs). Several works have also optimized CNN hyper-parameters with particle swarm optimization (PSO). Foysal et al. applied a modified CNN by optimizing only one CNN parameter, the convolution size [18]; the optimization was done by the model-based selection technique of particle swarm optimization, synthetic datasets were used, and the classification accuracy was around 95%. Sinha et al. [19] used PSO to optimize the CNN hyper-parameters of the first layer of a 13-layer CNN and obtained a classification error of 18.53% on the CIFAR-10 dataset, compared with an error of 22.5% for the 8-layer AlexNet on the same dataset. Thus, a deeper CNN yields higher accuracy, but at a higher computation cost.
The parameters optimized in that work were the input image size, the filter size, and the number of filters, and they require an encoding strategy during the computations. Guo et al. [20] formulated the configuration of CNN parameters as an optimization problem and used a distributed PSO (DPSO) to solve it and obtain the best CNN model globally and automatically. In DPSO, they designed a mixed-variable encoding strategy and update operations so that each particle represents a CNN; the distributed framework also reduces the running time. Experimental results on the MNIST dataset give an accuracy of 99.25% for PSO and 99.2% for DPSO. Yamasaki et al. [21] applied PSO to obtain the optimal parameters for a CNN automatically; the best parameter setting was obtained for AlexNet on five different datasets, and accuracy improved by 0.7% to 5.7% over the standard AlexNet-CNN. However, the globally best parameters cannot be guaranteed by this algorithm due to its randomness, and the best parameters change frequently rather than remaining constant. Wang et al. [22] proposed three mechanisms: vectorized acceleration coefficients to adapt to the different ranges of CNN hyper-parameters, a compound normal confidence distribution to enhance exploration capability, and a linear estimation scheme for fast fitness evaluation. PSO is used with these mechanisms to improve the quality of CNN hyper-parameters and reduce the computation cost; experimental results on CIFAR-10 achieve a classification error of 8.67%.

RESEARCH METHOD
Swarm intelligence (SI) is based on the collective behavior of groups, and SI algorithms can powerfully find the optimal solution. The bee colony algorithm is an optimization algorithm inspired by the behavior of honeybees in finding the optimal solution and was proposed in 2005 [23], [24]. Kennedy and Eberhart developed the PSO algorithm [25], which is often grouped with the evolutionary algorithms. The PSO algorithm consists of several steps. First, initialize the particles' (searching agents') positions (x) and velocities (v). Second, evaluate each particle with a cost function to find the local bests (pbest) and the global best (gbest): the smallest cost found by each particle is its pbest, and the smallest cost among all pbest values is the gbest. The cost function depends on the problem, such as an error to be minimized or an accuracy to be maximized. Third, update the particles by (1) and (2) [25]:

v_n(t+1) = v_n(t) + c1 r1() (pbest_n - x_n(t)) + c2 r2() (gbest - x_n(t))   (1)
x_n(t+1) = x_n(t) + v_n(t+1)   (2)

In (1) and (2), n is the particle index; c1 and c2 are learning factors that adjust the step length of each iteration, usually set to 2 in practice, which gives the best results in many problems, although they can also be tuned by trial and error; v is the particle velocity, which is based on pbest and gbest; x is the current particle (solution); and r1() and r2() are random variables in (0, 1).
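As a minimal illustration of these three steps, the Python sketch below runs one PSO iteration over a toy cost function. The population size of 15 and the setting c1 = c2 = 2 follow the text; the dimensionality and the sphere cost function are illustrative assumptions, standing in for the CNN training used in this paper.

import random

N, DIM = 15, 5    # 15 particles as in the paper; DIM is illustrative
C1 = C2 = 2.0     # learning factors c1, c2 (the common default of 2)

def cost(x):
    # Illustrative sphere function; in this paper the cost is the
    # classification error of a CNN trained with the particle's values.
    return sum(xi ** 2 for xi in x)

# Step 1: initialize positions x and velocities v.
x = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(N)]
v = [[0.0] * DIM for _ in range(N)]

# Step 2: evaluate the cost function to find pbest and gbest.
pbest = [xi[:] for xi in x]
pcost = [cost(xi) for xi in x]
gbest = pbest[min(range(N), key=lambda n: pcost[n])][:]

# Step 3: update velocities and positions by (1) and (2).
# In the full algorithm, steps 2 and 3 repeat every iteration.
for n in range(N):
    r1, r2 = random.random(), random.random()
    for d in range(DIM):
        v[n][d] += C1 * r1 * (pbest[n][d] - x[n][d]) \
                 + C2 * r2 * (gbest[d] - x[n][d])
        x[n][d] += v[n][d]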
To design a CNN model with optimized parameters for our classification problem, we deployed the PSO algorithm on the CNN architecture. We therefore have to solve a problem over a number of variables that express the hyper-parameters of the CNN while guaranteeing a high classification accuracy. A CNN architecture is defined by various hyper-parameters; in this work, we focus on optimizing the convolution layer size and kernel size parameters that form the CNN structure. The fitness function is set to be the classification accuracy, so we aim to find the ideal hyper-parameter values with higher accuracy and lower error. Figure 3 shows a flowchart of the proposed method. The hyper-parameters form a set that contains, first, the convolutional layers' parameters (the number of filters and the filter size) and, second, the fully connected layer's parameter (the size of the layer). The proposed hyper-parameter optimization method thus aims to discover the hyper-parameters that maximize the classification accuracy of the CNN. The first population is created randomly, and the fitness function is the classification accuracy of the CNN. If no change occurs in an individual's hyper-parameters, the CNN does not need to be re-trained. This process is repeated until the number of iterations is exhausted. The steps of the proposed method, sketched in the code after this list, are:
- Initialize the parameters of the PSO (population size, iteration number, acceleration constants c1 and c2).
- Create the initial population of particles randomly.
- Train a CNN with each particle's hyper-parameters and use its classification accuracy as the fitness, skipping re-training when a particle's hyper-parameters have not changed.
- Update pbest and gbest, then update the particles' velocities and positions by (1) and (2).
- Repeat the evaluation and update steps until the number of iterations is reached.
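As a sketch of how a particle maps to a CNN and of the re-training shortcut, the code below encodes the hyper-parameters named above as a tuple and caches fitness values so an unchanged individual is not re-trained. The value ranges and the train_cnn_accuracy stub are illustrative assumptions, not values taken from the paper.

import random

# Each particle encodes one candidate CNN as
# (filters_1, kernel_1, filters_2, kernel_2, fc_size);
# the ranges below are illustrative assumptions.
SPACE = [(4, 64), (3, 7), (4, 64), (3, 7), (32, 512)]

def random_particle():
    # Create one random individual for the initial population.
    return [random.randint(lo, hi) for lo, hi in SPACE]

def train_cnn_accuracy(params):
    """Stub for the fitness function: build a CNN with these
    hyper-parameters, train it, and return its classification accuracy."""
    raise NotImplementedError

_fitness_cache = {}

def fitness(params):
    # If an individual's hyper-parameters did not change since the last
    # evaluation, reuse the stored accuracy instead of re-training the CNN.
    key = tuple(params)
    if key not in _fitness_cache:
        _fitness_cache[key] = train_cnn_accuracy(key)
    return _fitness_cache[key]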

RESULTS AND DISCUSSION
We evaluated the proposed approach in this section using the MNIST dataset; sample images from the dataset are shown in Figure 4. The dataset contains 28 × 28 grayscale images of handwritten digits from 0 to 9, with 60,000 training and 10,000 testing samples, and no further preprocessing is done on the data. The parameters used in the experiments are shown in Table 1 and were chosen by trial and error. At the end of each epoch, the training samples are permuted. To save training time and avoid overfitting, we terminate training if the training loss drops below the validation loss; otherwise, training continues until the end of the training iterations. The global best parameters found by PSO are convolutional layer parameters = {(6, 5), (36, 5)} and fully connected layer parameters = {192}. The proposed method consistently improves the training accuracy of the gbest particle after each iteration; Figure 5 shows the training accuracy for 10 iterations as a sample of the 30 iterations. Table 2 compares this result with a subset of representative methods: psoCNN, which automatically searches for meaningful CNN architectures with the PSO algorithm [17]; DPSO [20]; LeNet-1, LeNet-4, and LeNet-5 [27]; recurrent CNN, which incorporates recurrent connections into each convolutional layer [28]; PCANet-2 and RANDNet-2, which apply principal component analysis, binary hashing, and block-wise histograms to the deep learning network [29]; and CAE-1 and CAE-2, which present an approach for training deterministic auto-encoders [30]. The comparison shows that the optimized method is not only competitive but can also improve on other methods. We also tested our CNN hyper-parameter optimization problem with another swarm intelligence algorithm, the bee colony optimization algorithm.
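For concreteness, the following is a minimal PyTorch sketch of a network matching the reported global-best hyper-parameters. The text fixes only the two convolutional layers (6 and then 36 filters of size 5) and the 192-unit fully connected layer; the ReLU activations, 2 × 2 max pooling, and 10-way output layer here are illustrative assumptions.

import torch
import torch.nn as nn

class GBestCNN(nn.Module):
    """CNN implied by the reported gbest: conv(6, 5x5), conv(36, 5x5),
    and a 192-unit fully connected layer. Activation, pooling, and the
    output layer are our assumptions, not specified in the text."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 28x28 -> 24x24
            nn.ReLU(),
            nn.MaxPool2d(2),                  # -> 12x12
            nn.Conv2d(6, 36, kernel_size=5),  # -> 8x8
            nn.ReLU(),
            nn.MaxPool2d(2),                  # -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(36 * 4 * 4, 192),       # the optimized FC size
            nn.ReLU(),
            nn.Linear(192, 10),               # 10 MNIST classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Quick shape check with a dummy MNIST-sized batch.
out = GBestCNN()(torch.zeros(1, 1, 28, 28))
assert out.shape == (1, 10)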

Table 2. Comparison of MNIST classification error with state-of-the-art methods

Method                Test error (%)
psoCNN [17]           0.44
DPSO [20]             0.8
LeNet-1 [27]          1.7
LeNet-4 [27]          1.1
LeNet-5 [27]          0.95
Recurrent CNN [28]    0.31
PCANet-2 [29]         1.06
RANDNet-2 [29]        1.27
CAE-1 [30]            2.83
CAE-2 [30]            2.48
Proposed method       0.87
Bee colony            1.02

Table 2 shows that our proposed method achieves a testing error of 0.87% using a simple CNN with PSO; incorporating skip connections or parallel layers might yield a lower error. This error is competitive with the previously mentioned methods, with a few exceptions. psoCNN has a testing error of 0.44%, but its architecture is more complex than our method, needs an encoding strategy during the computations, and has a longer running time. DPSO has a testing error of 0.8%, but its architecture is also more complex, as it designs a mixed-variable encoding strategy and a distributed framework to reduce the running time. Recurrent CNN has a testing error of 0.31%, but it did not explore the best configuration and limited the search to a constrained hyper-parameter space, meaning that no optimization is done in this method, and it uses recurrent connections. LeNet-1, LeNet-4, and LeNet-5 obtained testing errors of 1.7%, 1.1%, and 0.95% respectively; these networks have a small number of parameters with different types of last-layer classifier. PCANet-2 and RANDNet-2, which apply principal component analysis, binary hashing, and block-wise histograms to the deep learning network, obtained testing errors of 1.06% and 1.27% respectively. CAE-1 and CAE-2, which present an approach for training deterministic auto-encoders, obtained testing errors of 2.83% and 2.48% respectively. Finally, optimization with the bee colony algorithm gives an error of 1.02%.

CONCLUSION AND FUTURE WORK
We proposed a method of CNN hyper-parameter optimization based on the particle swarm optimization algorithm. Experiments on the MNIST dataset demonstrate an improvement over some of the state-of-the-art methods, and the results show that this method is able to find optimized parameters for the CNN model. With 15 particles and 30 iterations, we can find architectures that achieve a testing error of 0.87%, which is competitive with other designs. In future work, we aim to extend the proposed method by adding more hyper-parameters to the optimization process.