Profiling DNA Sequence of SARS-Cov-2 Virus Using Machine Learning Algorithm

Lailil Muflikhah, Muh. Arif Rahman, Agus Wahyu Widodo

Abstract


Coronavirus disease 2019 (COVID-19) is spreading rapidly because it is infectious. The disease is caused by SARS-CoV-2, a genetically diverse virus whose genome is analyzed here as DNA sequences. This study proposes a feature extraction method that uses k-mers to obtain nucleotide frequencies in protein-coding regions. To profile the viral DNA sequences, the study groups them by similarity across countries using hierarchical k-means, in which the results of hierarchical clustering are averaged to obtain the initial cluster centers for k-means. The experimental results give silhouette, purity, and entropy values of 0.867, 0.208, and 0.892, respectively. Gini index feature selection is then applied to find the important components that characterize each country. The selected components are evaluated with an ensemble method, Random Forest. The experimental results show high performance in terms of sensitivity, accuracy, specificity, and area under the curve (AUC).
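
The k-mer feature extraction step can be illustrated with a minimal sketch. The abstract does not state the value of k or any implementation details, so the function name `kmer_frequencies`, the choice k = 3, and the toy sequence below are assumptions made only for illustration, not the authors' code.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=3):
    """Return the relative frequency of every k-mer over the alphabet A, C, G, T.

    Hypothetical helper: k=3 is an assumed default, not a value from the paper.
    Windows containing other symbols (for example N) are ignored.
    """
    sequence = sequence.upper()
    # Count every overlapping window of length k in the sequence.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Enumerate all 4**k possible k-mers so every sequence maps to the
    # same fixed-length feature vector.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


# Toy sequence, not real SARS-CoV-2 data.
freqs = kmer_frequencies("ATGCGATGA", k=3)
```

Each sequence becomes a vector of 4^k frequencies, which is the kind of fixed-length representation that clustering and Random Forest models can consume.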

Keywords


COVID-19; DNA sequence; Feature extraction; k-mer; Random Forest


DOI: https://doi.org/10.11591/eei.v11i2.3487


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
