Profiling DNA Sequence of SARS-Cov-2 Virus Using Machine Learning Algorithm
Abstract
Corona virus disease-19 (COVID-19) is growing rapidly because it is an infectious disease. This disease is caused by a virus belonging to the type of DNA virus with very diverse genetics. This study proposes a feature extraction method using k-mer to obtain nucleotide frequencies in protein coding. In profiling viral DNA sequences, this study proposes to obtain similarity by country using hierarchical k-means, where the results are averaged by the hierarchical clustering method and then find the initial cluster center. The experimental results show that the silhouette, purity, and entropy are 0.867, 0.208, and 0.892, respectively. Then, we apply the Gini index feature selection to find the important components as characteristics in each country. The selected components are implemented using the ensemble method, Random Forest, to evaluate their performance. The experimental results showed high performance, including sensitivity, accuracy, specificity, and area under the curve (AUC).
Keywords
Covid-19; DNA sequence; Feature extraction; k-mer; Random Forest
Full Text:
PDFDOI: https://doi.org/10.11591/eei.v11i2.3487
Refbacks
- There are currently no refbacks.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Bulletin of EEI Stats