Comparison of dimensionality reduction and clustering methods for SARS-CoV-2 genome

Received Jan 13, 2021 Revised Apr 29, 2021 Accepted May 20, 2021 This paper analyzes SARS-CoV-2 genome variation by comparing the results of genome clustering using several clustering algorithms and the distribution of sequences in each cluster. The clustering algorithms used are K-means, Gaussian mixture models, agglomerative hierarchical clustering, mean-shift clustering, and DBSCAN. However, clustering algorithms are weak at grouping data with very high dimensionality, such as genome data, so a dimensionality reduction step is needed. In this research, dimensionality reduction was carried out using principal component analysis (PCA) and an autoencoder, with three models that produce 2, 10, and 50 features. The main contributions are a dimensionality reduction and clustering scheme for SARS-CoV-2 sequence data and a performance analysis of each experiment for each scheme and each method's hyperparameters. Based on the experiments conducted, the combination of PCA and the DBSCAN algorithm achieves the highest silhouette score of 0.8770 with three clusters when using two features. However, dimensionality reduction using the autoencoder needs more iterations to converge. In the testing process with Indonesian sequence data, more than half of the sequences fall into one cluster and the rest are distributed over the other two clusters.


INTRODUCTION
In December 2019, a pneumonia outbreak occurred in Wuhan, caused by a new virus, SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2), which causes the disease covid-19 (coronavirus disease 2019). The virus was first detected in Wuhan City, China, on 12 December 2019 [1]. The virus has a very high transmission rate, so that in just two months it spread from Wuhan to all of China, as well as 33 other countries. As of November 30, 2020, there were 62,363,527 confirmed cases of covid-19 and 1,456,687 deaths due to covid-19 worldwide [2]. In Indonesia, the first case of covid-19 was announced on March 2, 2020, and as of November 30, 2020, 534,266 positive cases had been detected and 16,815 people had died due to covid-19 [3]. Symptoms that often appear in covid-19 patients include fever, headache, cough, expectoration, fatigue, and dyspnea. However, some patients show symptoms of shortness of breath, haemoptysis, and diarrhea, and some patients show no symptoms at all [4].
The SARS-CoV-2 virus has a high mutation rate because it is an RNA virus, so the variety of the virus increases [5], [6]. Therefore, analysis of variations in the viral genome is urgently needed. Variations in the covid-19 virus can be detected through genome analysis of samples from infected patients. The samples are collected from parts of the body where the coronavirus accumulates, such as a person's nose or throat. These samples are processed with a polymerase chain reaction (PCR) instrument, using optimized primers and probes to extract RNA from the virus [7], [8]. Based on the sequencing results, the SARS-CoV-2 genome has a length of 29,891 bp with a G+C content of 38% [9]. In terms of genetic structure, SARS-CoV-2 differs from the previously identified SARS-CoV and MERS-CoV viruses, with a similarity of 79% to SARS-CoV and nearly 50% to MERS-CoV [10]. In this paper, dimensionality reduction and clustering analyses were carried out on genome data from patients with covid-19 indication to analyze virus variation in several countries in Asia and in Indonesia. The methods used are principal component analysis (PCA) and an autoencoder for dimensionality reduction, and several clustering algorithms for genome clustering. Dimensionality reduction is needed to reduce dimensions and remove redundant features [11] from genome data, which has very high dimensionality, so that the data can be processed by the clustering methods. PCA reduces the dimensionality of the data by calculating the eigenvalues and eigenvectors of the training data [12]. PCA has been applied to a wide variety of data, including microarray data [13] and DNA sequence data [14]. The autoencoder is an effective method for reducing the dimensionality of image data, text, and other complex data [11], [15]-[18].
Several clustering methods are then used to cluster the data produced by the dimensionality reduction methods: K-means, agglomerative hierarchical clustering (AHC), Gaussian mixture models (GMM), density-based spatial clustering of applications with noise (DBSCAN), and mean-shift clustering. Clustering performance is measured using the silhouette score, which takes values in the range -1 to 1.
The findings of this research are a dimensionality reduction and clustering scheme for SARS-CoV-2 sequence data; a performance analysis of each experiment for each scheme and each method's hyperparameters; and the distribution of SARS-CoV-2 genome sample variations from China, Japan, South Korea, Singapore, and India (training data) and Indonesia (testing data) over the resulting clusters. This paper is arranged in the following order: section two (proposed method) describes the proposed methodology, which includes data acquisition and preprocessing, PCA and the autoencoder, and the clustering algorithms; section three (results and discussion) presents the experimental scenarios, results, and discussion; and the final section (conclusion) presents the research conclusions.

PROPOSED METHOD
Figure 1 shows a diagram of the SARS-CoV-2 sequence clustering system. Genome sequence data from patients infected with covid-19 were obtained from EpiCoV, GISAID [19], totaling 4,935 sequences (accessed 21 November 2020). The data come from several countries in Asia: China, Japan, South Korea, Singapore, India, and Indonesia. Genome sequences from China, Japan, South Korea, Singapore, and India serve as training data, and genome sequences from Indonesia serve as testing data. The data are stored in .fasta format, so a parsing step is required to obtain the sequence data alone, without the other information. The raw SARS-CoV-2 sequence data can be seen in Figure 2. In addition to parsing, preprocessing was carried out to form each complete sequence as a 1×N vector, where N is the number of nucleotides in one record, because a raw sequence in the fasta file spans several nucleotide lines. DNA sequence mapping is needed to convert the DNA sequences into numerical sequences so that they can be processed in later stages.
The mapping technique used is the integer representation [20], [21], with the integers assigned to the nucleotides A, C, T, and G modified so that the value 0 is reserved for the padding process and another value is reserved for missing symbols. The equation used for sequence mapping is presented in (1). A padding step with the value 0 is then carried out so that all sequences have the same length, producing a sequence length of 30,018. Padding is needed because PCA and the autoencoder require the same input length for all data. To extract important features from the sequence data and reduce its dimensionality, feature extraction/dimensionality reduction is performed using the PCA and autoencoder methods, as shown in Figure 3. PCA reduces the data dimensionality by transforming the data into principal components obtained from the eigenvalues and eigenvectors of the training data. The eigenvectors with the highest eigenvalues represent the most important features [22]. The eigenvalues and eigenvectors are obtained by solving (2), where Σ is the covariance matrix of the input X, v the eigenvectors, and λ the eigenvalues [13]. The eigenvectors are sorted by their eigenvalues and then become the principal components. In this paper, three PCA models are compared, producing data with 50 dimensions (PCA_50), 10 dimensions (PCA_10), and 2 dimensions (PCA_2).
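The integer mapping and zero-padding steps described above can be sketched as follows. The paper only specifies that 0 is reserved for padding, so the particular integers assigned to each nucleotide and the catch-all value for ambiguous symbols are assumptions of this sketch:

```python
import numpy as np

# Hypothetical integer assignment: 0 is reserved for padding (as in the
# paper); the values for A, C, G, T and for other symbols are assumptions.
NUCLEOTIDE_MAP = {"A": 1, "C": 2, "G": 3, "T": 4}
OTHER = 5           # any non-ACGT symbol (e.g. N) or missing value
PAD_LENGTH = 30018  # padded sequence length reported in the paper

def encode_sequence(seq: str) -> np.ndarray:
    """Map a nucleotide string to integers and right-pad with 0."""
    values = [NUCLEOTIDE_MAP.get(base, OTHER) for base in seq.upper()]
    values += [0] * (PAD_LENGTH - len(values))  # pad so all rows match
    return np.asarray(values, dtype=np.int32)
```

All encoded sequences then share the fixed 1×30,018 shape required by PCA and the autoencoder.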

Σv = λv (2)

Figure 3. General autoencoder architecture [17]

The autoencoder model consists of an encoder and a decoder. The encoder extracts features from the data and reduces its dimensionality, while the decoder reconstructs the reduced data back into the original data [23], [24]. Three autoencoder architectures are used in this task, as shown in Table 1, which differ in the depth of the architecture. All three models are trained with the Adam optimization algorithm [25], [26], a first-order gradient-based optimization algorithm that estimates adaptive learning rates from lower-order moments. Adam is computationally efficient and requires little memory, making it suitable for problems with large amounts of data or many parameters. The loss function used is the mean squared error (MSE). In this paper, we also compare the clustering results of the dimensionally reduced sequence data using several clustering methods: K-means, AHC, GMM, DBSCAN, and mean-shift clustering. The K-means algorithm starts by randomly initializing the centroid points according to a predetermined number of clusters [27], [28]. The distance from each data point to each centroid is then calculated to determine the cluster of each point. The centroid positions are updated by computing the mean of the data points in each cluster. These steps are repeated until the centroid positions no longer change or the maximum number of iterations is reached.
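The K-means steps described above can be sketched in a few lines of NumPy. For reproducibility, this sketch initializes centroids with evenly spaced samples rather than at random as in the paper, and keeps a centroid in place if its cluster becomes empty:

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, max_iter: int = 100):
    """Minimal K-means: assign each point to its nearest centroid, move
    each centroid to the mean of its cluster, repeat until convergence."""
    # Deterministic spread-out initialization (the paper uses random init).
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(max_iter):
        # Distance of every point to every centroid, then nearest assignment.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points;
        # keep the old centroid if a cluster ends up empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids
```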
GMM is more flexible than K-means in the shape of the clusters it forms. In GMM, two parameters are calculated to update each centroid point, namely the mean and the standard deviation, so the clusters formed can take various elliptical shapes. To compute the mean and standard deviation, GMM uses an optimization algorithm, expectation-maximization (EM), with a Gaussian distribution for each cluster [29], [30]. AHC uses a bottom-up algorithm in which each data point starts as its own cluster, and the clusters with the smallest distance between them are merged [31]. The distance can be computed as the average distance between the data points of two clusters, or as the shortest distance between a data point in one cluster and a data point in another. This process continues until one large cluster is obtained, after which the desired number of clusters is selected.
The mean-shift clustering algorithm begins by randomly choosing data points as the centers of sliding windows with a certain radius [32]. Each sliding window shifts toward the area with a higher density (number of data points in the window) until convergence is reached. Mean-shift clustering has the advantage that the number of clusters does not need to be set in advance, because the algorithm finds the number of clusters automatically. DBSCAN is an algorithm that, like mean-shift clustering, can determine the number of clusters automatically. DBSCAN starts by selecting a data point that has not yet been visited; other data points within a threshold distance of that point are then included in the same cluster [33]. If the number of data points in the cluster exceeds the minPoints parameter (the minimum number of data points in a cluster), the data points form a cluster and are marked as visited. The process repeats until no more data points can be visited. Data points that do not belong to any cluster are considered noise. The reduced data from the three autoencoder models and from PCA, with 2, 10, and 50 features, serve as input to the clustering algorithms. For the K-means, AHC, and GMM algorithms, the number of clusters is observed from 2 to 50. For the mean-shift clustering and DBSCAN algorithms, the number of clusters does not need to be set in advance. To evaluate the performance of the clustering algorithms, silhouette analysis based on the silhouette score is used [34].
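As a small illustration of DBSCAN's threshold distance (eps) and minPoints parameters, the following scikit-learn sketch (an implementation assumption; the paper does not name its tooling) clusters two tight groups and flags a distant point as noise:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of points plus one far-away outlier.
group_a = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
group_b = group_a + 10.0
outlier = np.array([[100.0, 100.0]])
X = np.vstack([group_a, group_b, outlier])

# eps is the threshold distance; min_samples is the minPoints parameter.
labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(X)
# Points that belong to no dense region receive the noise label -1.
```

Here the two groups become two clusters, while the outlier is labeled -1 rather than being forced into a cluster.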
The silhouette score measures the degree of separation between clusters and takes values in the range [-1, 1], where 1 indicates the sample is far from the neighboring clusters, 0 indicates the sample lies very close to a neighboring cluster, and -1 indicates the sample has been assigned to the wrong cluster.
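Formally, for a sample $i$ with mean intra-cluster distance $a(i)$ and mean distance to the nearest neighboring cluster $b(i)$, the silhouette value is defined as [34]:

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

The reported silhouette score is the mean of $s(i)$ over all samples.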

RESULTS AND DISCUSSION
In this paper, a system was built to reduce the dimensionality of, and cluster, genome data from the SARS-CoV-2 virus in order to analyze the variation of the virus genome. The methods used are PCA and an autoencoder for dimensionality reduction, and several clustering algorithms: K-means, Gaussian mixture models (GMM), agglomerative hierarchical clustering (AHC), density-based spatial clustering of applications with noise (DBSCAN), and mean-shift clustering. To test the system, experiments were carried out comparing the silhouette scores of the clustering methods for the various input dimensions from PCA and the autoencoder in the training and validation process. In the testing process, genome sequences from Indonesia were fed into the previously built dimensionality reduction and clustering models to analyze the variation of viruses found in Indonesia. The results and analysis of these experiments follow.

Autoencoder performance for dimensionality reduction
This paper proposes three autoencoder models for dimensionality reduction, namely AE_50, AE_10, and AE_2, which produce data with 50, 10, and 2 dimensions, respectively. Based on the results presented in Table 2, the training and validation MSE of AE_50 and AE_10 are better than those of AE_2, showing that AE_50 and AE_10 converge faster than AE_2. For AE_2, the dimensionality reduction is very aggressive, so the reconstruction of the original data is still poor. The training and validation running times do not differ significantly, because the three models differ only slightly in the number of parameters/weights of their autoencoder architectures.

Number of clusters observation on K-means, GMM, and AHC
This experiment observes the effect of the number-of-clusters parameter of the K-means, GMM, and AHC clustering algorithms on the silhouette score when clustering the features generated by the PCA and autoencoder models, which produce data with 2, 10, and 50 features. The number of clusters was observed in the range of 2 to 50. K-means, AHC, and GMM are clustering algorithms for which the choice of the number of clusters and the initial centroid positions strongly affect the clustering results [35]. Based on the results in Figure 4 to Figure 6, the output of PCA_2 with 2 features has the highest silhouette score using the K-means, GMM, and AHC algorithms for all numbers of clusters from 2 to 50. However, AE_2 with 2 features has a more stable silhouette score across all numbers of clusters. Data with 10 and 50 features have relatively smaller silhouette scores. This may happen because the clustering algorithms handle data with fewer features better, since such data contains less noise or redundancy. The observations also show that, for data with 10 and 50 features, the larger the number of clusters, the higher the silhouette score obtained.
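The sweep over the number of clusters can be sketched with scikit-learn (an implementation assumption), scoring each candidate number of clusters with the silhouette score and keeping the best:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k(X: np.ndarray, k_min: int = 2, k_max: int = 50) -> int:
    """Return the number of clusters with the highest silhouette score."""
    scores = {}
    # silhouette_score requires 2 <= k <= n_samples - 1.
    for k in range(k_min, min(k_max, len(X) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get)
```

The same loop applies to GMM and AHC by swapping in `GaussianMixture` or `AgglomerativeClustering`.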

Quantile parameter observation on mean-shift clustering
The quantile parameter is used to compute the bandwidth (window size/radius) in the mean-shift clustering algorithm from the pairwise distances of the samples used. The quantile value ranges from 0 to 1; a value of 0.5 means the median of all pairwise distances is used. In this study, the observed quantile values were 0.01, 0.05, 0.1, 0.15, 0.2, and 0.25. Based on Table 3 and Table 4, the larger the quantile value, the larger the resulting window size/bandwidth, and hence the smaller the number of clusters produced. A larger quantile value also yields a higher silhouette score for data with two features, but this is not the case for data with 10 and 50 features.
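The quantile-to-bandwidth relationship can be illustrated with scikit-learn's `estimate_bandwidth` (assuming that implementation): a larger quantile averages distances over more neighbors, yielding a larger bandwidth and, typically, fewer clusters:

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(0)
# Two well-separated groups of points.
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(10, 0.5, (30, 2))])

# A larger quantile produces a larger window radius (bandwidth).
bw_small = estimate_bandwidth(X, quantile=0.05)
bw_large = estimate_bandwidth(X, quantile=0.5)

# The estimated bandwidth is then passed to mean-shift clustering.
labels = MeanShift(bandwidth=bw_small).fit_predict(X)
n_clusters = len(np.unique(labels))
```

With a small bandwidth the algorithm resolves finer structure (more clusters); with a large bandwidth nearby groups merge.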

Epsilon parameter observation on DBSCAN clustering
The epsilon parameter determines whether a data point belongs to a cluster in DBSCAN: a point joins a cluster if its distance is less than epsilon. In this study, the observed epsilon values were 0.5, 1, 5, 10, 20, 30, 40, and 50. Based on Table 5 to Table 7, it can be concluded that each dataset requires a different epsilon value depending on the distribution of the data; in particular, data with fewer features need a smaller epsilon value. DBSCAN can also detect outliers, i.e., points very far from the existing clusters, rather than forcing them into a specific cluster.
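The sensitivity of the cluster count to epsilon can be sketched as follows (scikit-learn assumed): with a small epsilon two separated groups remain distinct clusters, while a very large epsilon merges everything into one:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two groups of points separated by a distance of about 10.
X = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(10, 0.2, (20, 2))])

def n_clusters(eps: float) -> int:
    """Number of DBSCAN clusters found at a given eps (noise excluded)."""
    labels = DBSCAN(eps=eps, min_samples=3).fit_predict(X)
    return len(set(labels) - {-1})  # -1 is the noise label
```

Sweeping eps over the values studied in the paper would trace out how the cluster count shrinks as eps grows past the inter-group distance.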

Comparison of five clustering algorithms results
In Table 8 and Table 9, for the data with 10 and 50 features from the autoencoder models (AE_10 and AE_50), mean-shift clustering has the best silhouette scores (0.7380 and 0.784) compared to the other algorithms, but the results are likely not optimal because the estimated numbers of clusters are very large (564 and 367 clusters). This differs for the data with 50 features from PCA_50, which achieves a silhouette score of 0.7791 with the AHC clustering algorithm and 48 clusters. Meanwhile, for the data with 10 features, PCA_10 achieves a silhouette score of 0.7080 with 10 clusters using the mean-shift clustering algorithm. Data with 2 features have the best silhouette scores for both the autoencoder (AE_2) and PCA (PCA_2) reductions, namely 0.8133 for AE_2 and 0.877 for PCA_2. Both reduced datasets can be clustered into 3 clusters, using the AHC algorithm for AE_2 and DBSCAN for PCA_2. This indicates that the dimensionality reduction performed is quite good, given that the training data are taken from five geographically close countries, namely China, South Korea, Singapore, Japan, and India.
DBSCAN and mean-shift clustering have the advantage of determining the optimal number of clusters automatically, while in the K-means, GMM, and AHC algorithms the number of clusters must be initialized at the start. In terms of clustering characteristics, AHC resembles phylogenetic tree construction, which builds trees by calculating the distance/similarity between sequences. DBSCAN can identify outliers in the data and treat them as noise, whereas mean-shift clustering forces outliers into a particular cluster. DBSCAN clusters can have any shape and size, while the clusters formed by mean-shift are circular. However, DBSCAN has the disadvantage of performing poorly when the data have high dimensionality, because estimating the epsilon value becomes more difficult.

Figure 7 and Figure 8 show the distribution of the training data in each cluster based on the best silhouette scores from the experiments, namely AE_2 data clustered with the AHC algorithm and PCA_2 data clustered with the DBSCAN algorithm. The training data used are SARS-CoV-2 sequence data from China, Japan, South Korea, Singapore, and India. In both figures, the sample distribution of each country across the clusters is almost the same, so it can be concluded that Cluster 1, Cluster 2, and Cluster 3 in Figure 7 correspond to Cluster 3, Cluster 1, and Cluster 2 in Figure 8, respectively. For both the AE_2 and PCA_2 data, most of the samples fall into one cluster. This may be because the SARS-CoV-2 sequences in the five countries are highly similar. The testing data used are SARS-CoV-2 sequences from Indonesia, totaling 109 sequences.
Figure 9 shows the distribution of the testing data in each cluster, produced by the dimensionality reduction and clustering processes using the two best models from the training process, namely AE_2 data clustered with the AHC algorithm and PCA_2 data clustered with the DBSCAN algorithm. These results show that the Indonesian sequence data have a slightly different distribution from the training data. However, as with the training data, half of the data still gather in one cluster, while the rest are spread over the other two clusters.

CONCLUSION
In this paper, a dimensionality reduction and clustering system was developed for genome data of the SARS-CoV-2 virus. Dimensionality reduction is done using PCA and an autoencoder, with three models that produce reduced data with 2, 10, and 50 features. The clustering algorithms used are K-means, GMM, AHC, mean-shift clustering, and DBSCAN. Based on the experimental results, the system achieves the highest silhouette score of 0.8770 with three clusters when using two features from PCA and the DBSCAN algorithm, and a silhouette score of 0.8133 with three clusters using two features from the autoencoder and the AHC algorithm. Meanwhile, the sequences reduced to 10 and 50 features have smaller silhouette scores for all clustering algorithms used. The K-means, GMM, and AHC algorithms require a predefined number of clusters, whereas the mean-shift clustering and DBSCAN algorithms can find the optimal number of clusters automatically, which is an advantage over K-means, GMM, and AHC. The optimal number of clusters with the highest silhouette score is three, whether using the two-feature data from the autoencoder or from PCA. This paper also analyzes the distribution of the training and testing data over the clusters formed by the two models with the best silhouette scores. The training data are SARS-CoV-2 sequences from China, Japan, South Korea, Singapore, and India, and the testing data are from Indonesia. Based on the clustering results of the two best models, for both training and testing data, more than half of the data fall into one cluster and the rest are distributed over the other two clusters. It can be concluded that the SARS-CoV-2 viruses found in China, Japan, South Korea, Singapore, India, and Indonesia mostly have similar sequences.
Future work includes detecting the mutations that occur in each cluster using alignment and machine learning approaches.