Common human diseases prediction using machine learning based on survey data

The time has come to move away from disease as the primary emphasis of medical treatment. The many techniques that have been developed to detect diseases are impressive. At present, COVID-19, common flu, migraine, lung disease, heart disease, kidney disease, diabetes, stomach disease, gastric disease, bone disease, and autism are among the most common diseases. In this work, we analyze disease symptoms and predict diseases from those symptoms. We studied a range of symptoms and surveyed people to collect the data. Several classification algorithms were employed to train the model, and performance evaluation metrics were used to measure the model's performance. Finally, we found that the part classifier surpasses the others.


INTRODUCTION
In the modern world, we cannot imagine daily life without technology. Artificial intelligence (AI) is one of the core parts of computer science and technology. Creating an AI advanced enough to build AI entities of even greater intelligence could change human technology permanently. Such entities would surpass human intelligence and reach superhuman achievements. Data mining is another fast-growing area within the field of machine learning. Because it extracts important information from a mountain of data and uses it for decision-making tasks, data mining is one of the better-known concepts in machine learning. It has proven to be a powerful tool for exploring new ways to automatically analyze, visualize, and uncover patterns in data that support the decision-making process. By quickly providing high-quality diagnostic results and significantly reducing or eliminating the need for human involvement, artificial intelligence, a fast-developing technology in healthcare image identification, has also benefited disease prediction. Machine learning and deep learning, two important fields of AI, have lately gained a lot of traction in clinical applications. For disease diagnosis, deep-learning-based support systems are being built using CT and X-ray samples. Computerized disease identification has emerged as a crucial area of medical research as a result of rapid population growth. A computerized disease identification framework aids clinicians in diagnosis by providing precise, consistent, and quick findings, as well as lowering the death rate [1].
Nowadays, many people are not mindful of their well-being. Some are reluctant to visit a doctor, which is one reason why younger people are also developing serious diseases. This type of problem has many complications that change over time. Although various strategies have been established, none of them can produce an exact and dependable result. All of these approaches depend on working with physicians and other health experts. As a result, a system that can function without hospital supplies or personnel may be an effective alternative [2]. Identifying consistent risk variables and constructing a prediction model for several illnesses is more important than doing so for a single disease. For instance, a patient with hyperlipidemia or hypertension is more likely to experience cardiovascular problems than someone in good health. Hyperlipidemia and hypertension are related conditions [3]. Many medical datasets are now easily accessible for research in a variety of medical specialties. Managing such enormous amounts of data manually is challenging, if not impossible. As a result, more effective computer-based procedures are replacing traditional techniques. The use of computers improves accuracy while saving both money and time [4]. Common human diseases have recently had a significant impact on the medical field and the global economy. Scientists, doctors, and specialists are working on new techniques to diagnose diseases more quickly, such as autonomous disease detection systems.
Chang et al. [5] proposed a two-phase research approach for predicting hyperlipidemia and hypertension at the same time. They began by selecting specific risk variables for these two diseases using four data mining methodologies and then applied the voting principle to discover the shared risk factors. After that, they built multiple predictive models for hyperlipidemia and hypertension using the multivariate adaptive regression splines (MARS) approach. Maleki et al. [6] proposed a method for determining the stage of lung cancer. In this research, a K-nearest neighbors (KNN) approach is used to diagnose the stage of a patient's disease, with a genetic algorithm used for effective feature selection to reduce dataset dimensionality and improve classifier speed. The optimal value of k is found experimentally to increase the accuracy of the presented algorithm. The proposed method was tested on a lung cancer database and reported to be 100 percent accurate. Bang et al. [7] developed an ML-based multiclass classification method that uses gut microbiome data to distinguish among the following six diseases: chronic fatigue syndrome/myalgic encephalomyelitis, acquired immune deficiency syndrome, juvenile idiopathic arthritis, multiple sclerosis, colorectal cancer, and stroke. To create the prediction model, they used microbial abundance at five taxonomic levels as features, with only 696 samples obtained from various studies. Four multiclass classifiers and two feature selection approaches, forward selection and backward elimination, were used to create the classification models.
Kunjir et al. [8] proposed a method for efficient and advanced disease prediction based on historical training data. Their strategy is to analyze and evaluate different data mining methods. The datasets chosen for implementation contain more than 20 medically relevant attributes for each training example. Heart disease, breast cancer, arthritis, and diabetes are among the medical datasets chosen for the research. The Naive Bayes (NB) method was chosen for implementation after assessing the prediction accuracy and latency test results. Besag and Newell [9] presented and demonstrated a Geographical Analysis Machine Learning method for detecting tiny illness clusters. A secondary goal of that work is to go over some frequent difficulties in applying clustering tests to epidemiological data. Using an epidemiological dataset of COVID-19 patients from South Korea, Muhammad et al. [10] established a model for predicting the recovery of COVID-19 patients. To create the models, decision tree, KNN, support vector machine (SVM), logistic regression, Naive Bayes, and random forest (RF) algorithms were applied directly to the dataset using the Python programming language. For the classification of breast cancer, Shamrat et al. [11] employed several supervised classification approaches. SVM, KNN, RF, and decision tree are examples of the algorithms used for early breast cancer prediction. They used specificity, sensitivity, F1 score, and overall accuracy to assess performance on the breast cancer dataset. The results show that SVM performed best, with a classification accuracy of 97.07 percent. NB and RF, on the other hand, had the second-highest prediction accuracy.
Islam et al. [12] presented a deep learning method that uses a convolutional neural network (CNN) and long short-term memory (LSTM) to diagnose COVID-19 from X-ray images. In this system, deep feature extraction is performed by the CNN, and the extracted features are classified by the LSTM. The dataset comprised 4575 X-ray scans, of which 1525 were COVID-19 images. The testing results showed that the proposed technique achieved an accuracy of 99.4%, a specificity of 99.2%, an AUC of 99.9%, a sensitivity of 99.3%, and an F1-score of 98.9%. Rajdhan et al. [13] used data mining techniques such as Naive Bayes, decision tree, random forest, and logistic regression to determine a patient's overall risk and estimate the likelihood of cardiac disease. Their study thus compares the effectiveness of various machine learning techniques. Priya et al. [14] analyzed liver patient datasets in order to develop classification algorithms for predicting liver disease. The initial phase applies min-max normalization to the original liver disease datasets retrieved from the UCI repository. In the second phase, PSO feature selection is used to obtain a subset of the normalized liver patient dataset that contains only the important features. Classification algorithms are then applied to the dataset in the third phase. In the fourth phase, accuracy is calculated using the root mean error value and the root mean square value.
ISSN: 2302-9285  Common human diseases prediction using machine learning based on survey data … (Jabir Al Nahian) 1113
Rahman et al. [15] used five supervised classification algorithms: logistic regression, SVM, KNN, random forest, and decision tree. The performance of the different classification methods was assessed with measures such as accuracy, recall, precision, specificity, and F1 score. In a large community pediatric clinic, Gabrielsen et al. [16] studied controls and children who tested negative for autism during widespread screening. Following the screening, medical evaluations were conducted to ascertain each child's group (autism, language delay, or typical). Unaware of the participants' diagnostic status, licensed psychologists with toddler and autism expertise assessed two 10-minute video samples of the participants' autism evaluations, rating five behaviors: responding, initiating, vocalizing, play, and response to name.
Reviewers were asked to give their opinions on autism referrals based purely on the 10-minute assessments. Some of the systems in the literature are built on a pre-trained model using transfer learning, while others are implemented using bespoke networks. Machine learning and data science are also being used to diagnose, prognose, predict, and forecast disease outbreaks [17]-[27].
Motivated by these problems, we set out to build a system that predicts human diseases from their symptoms. The goal of this work is to use data mining techniques to identify common diseases and build a prediction model for them. We cover eleven diseases using a 53-question symptom survey answered by a large number of people. With the data collected in this way, we worked with several algorithms: random forest, logistic regression, KNN, and SVM. We calculated several performance evaluation criteria and compared the results to select the best classifier for this setting. According to our analysis of the collected results, the part classifier produces the best result in terms of these metrics.
The rest of this paper is organized as follows. Section 2 describes the study methodology, along with a quick rundown of the dataset, the implementation plan, and the classifier algorithms. The outcome of the experiment and other findings are presented in section 3. Section 4 undertakes a thorough evaluation of related studies to identify any unresolved issues. Section 5 concludes the paper.

METHOD
This section explains how we carried out the project. It contains the following sub-sections: implementation procedure, data description and analysis, and classifier description. Each sub-section is described in full below.

Implementation procedure
The purpose of this work is disease prediction. Many important characteristics, especially disease symptoms, are considered to ensure an accurate prediction. Figure 1 shows the steps we followed to complete this project.

Data collection
First, we prepared a 53-question survey based on disease symptoms. We then used this survey to obtain data from a sizable number of respondents. Next, we applied several preprocessing procedures so that the data could be fed into the classifiers. Each question is encoded as a single variable, so a total of 53 variables represent all of the questions.

Data preprocessing
Data preprocessing is the procedure in which the input is modified where required. The dataset may contain missing attribute values. The median, mean, or another summary measure of the affected attribute can be used to fill in the blanks. Finally, the dataset is shuffled so that the records are evenly distributed.
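The two preprocessing steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the helper names impute_median and shuffle_rows, the toy survey rows, and the choice of the median (rather than the mean) are assumptions made for the example, and each column is assumed to have at least one answered value.

```python
import random
import statistics


def impute_median(rows):
    """Replace None entries in each column with that column's median."""
    n_cols = len(rows[0])
    medians = []
    for col in range(n_cols):
        # Only answered (non-missing) values contribute to the median.
        observed = [row[col] for row in rows if row[col] is not None]
        medians.append(statistics.median(observed))
    return [
        [medians[col] if value is None else value for col, value in enumerate(row)]
        for row in rows
    ]


def shuffle_rows(rows, seed=0):
    """Return a shuffled copy of the rows; the seed is fixed for repeatability."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Hypothetical toy answers: 1 = symptom present, 0 = absent, None = unanswered.
survey = [
    [1, 0, None],
    [0, None, 1],
    [1, 1, 1],
    [0, 0, 0],
]
clean = shuffle_rows(impute_median(survey))
```

Imputing before shuffling keeps the example easy to follow; the order of the two steps does not change the filled-in values.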

Apply algorithm
After preprocessing, the data is divided into training and testing sets. In this case, 70% of the complete dataset was used for training and the remaining 30% for testing. This division is done at random. During training, the classifier learns from the attributes of the dataset, while during testing it predicts the outcome for unseen records and its predictive accuracy is assessed. The four classifiers, random forest, SVM, logistic regression, and KNN, were then trained using the training data. After training the classifiers, we used the testing data to predict the current disease condition.
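As a rough illustration of the random 70/30 division described above, the sketch below splits a list of records using only the standard library. The function name train_test_split and the seed are assumptions for the example; the record count of 1443 is the figure reported in the data description.

```python
import random


def train_test_split(records, train_fraction=0.7, seed=42):
    """Randomly divide records into a training part and a testing part."""
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)      # random assignment of records
    cut = int(len(records) * train_fraction)  # number of training records
    train = [records[i] for i in indices[:cut]]
    test = [records[i] for i in indices[cut:]]
    return train, test


# Stand-in records: integers 0..1442 in place of the 1443 survey responses.
records = list(range(1443))
train, test = train_test_split(records)
print(len(train), len(test))  # 1010 433
```

With 1443 records, a 70% cut gives 1010 training records and 433 testing records, and every record lands in exactly one of the two sets.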

Calculate evaluation
Using these criteria, we identified the best classifier for this prediction task. A number of performance indicators, expressed in percent, were calculated from the confusion matrix that each classifier produced.
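The paper's own formulas are not reproduced in this text, so the sketch below assumes the standard definitions: accuracy is the share of correct predictions, precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 · precision · recall / (precision + recall), computed per class from a multiclass confusion matrix. The function name and the 3 × 3 toy matrix are illustrative only.

```python
def per_class_metrics(cm):
    """Compute accuracy and per-class (precision, recall, F1).

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, truly another class
        fn = sum(cm[k]) - tp                       # truly k, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        metrics.append((precision, recall, f1))
    return correct / total, metrics


# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted).
toy_cm = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
accuracy, metrics = per_class_metrics(toy_cm)
```

The same function applies unchanged to a 12 × 12 matrix; multiplying the returned values by 100 gives the percentages used in the tables.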

Data description and analysis
A survey was used to acquire the information for this research. The questionnaire consisted of 53 questions in total, covering both personal and disease symptom variables. A total of 53 attributes are used to encode all of the questions. The dataset has 52 independent variables and one dependent variable, shown in Table 1. Table 2 lists all of the variables and their possible values. A total of 1443 individual records were used for this task. 70% of the data is used to train the classifier, while 30% is used for testing.

Classifier description
The random forest classification method is a supervised learning algorithm that builds a collection (a forest) of randomized decision trees. It can be applied to both regression and classification problems, and it is a popular method for classification. It picks samples randomly from the given dataset and uses these samples to generate decision trees, which are subsequently used to make predictions [28]. The final answer is then chosen by voting, as shown in Figure 2. While growing the trees, the random forest adds extra randomness to the model. When splitting a node, it searches for the best feature within a random subset of features rather than for the most significant feature overall. This yields greater diversity among the trees, which generally produces better prediction accuracy. Because it is an ensemble learning technique, a random forest outperforms a single decision tree. The overfitting problem is reduced by averaging the results.
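The three ingredients described above (random samples of the data, a random subset of features per tree, and voting) can be sketched with a deliberately tiny forest. This is an assumption-laden illustration, not the classifier used in the study: every tree is a depth-1 stump, all features are assumed to be binary (0/1), and the names fit_stump, fit_forest, and predict_forest as well as the toy data are invented for the example.

```python
import random
from collections import Counter


def majority(labels, default=None):
    """Most common label, or the default when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def fit_stump(rows, labels, feature_ids):
    """Depth-1 tree: choose the candidate feature with the fewest training errors."""
    overall = majority(labels)
    best = None
    for f in feature_ids:
        # Each leaf predicts the majority label of the rows that reach it.
        leaf = {
            v: majority([y for x, y in zip(rows, labels) if x[f] == v], overall)
            for v in (0, 1)
        }
        errors = sum(leaf[x[f]] != y for x, y in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, f, leaf)
    return best[1], best[2]


def fit_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]       # bootstrap sample
        feats = rng.sample(range(len(rows[0])), n_features)  # random feature subset
        forest.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats)
        )
    return forest


def predict_forest(forest, x):
    """Each tree votes; the most common prediction wins."""
    return majority([leaf[x[feature]] for feature, leaf in forest])


# Hypothetical toy data: three binary symptoms, two classes.
rows = [[1, 1, 1], [0, 0, 0]] * 5
labels = ["flu", "healthy"] * 5
forest = fit_forest(rows, labels)
print(predict_forest(forest, [1, 1, 1]))
```

Real random forests grow much deeper trees and re-draw the feature subset at every split; the stump version only keeps the example short.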
The SVM is a supervised machine learning method that can solve both regression and classification problems. However, it is primarily used for classification [29]. Every data item is represented as a point in an n-dimensional space (where n is the number of attributes), with the value of each feature being the value of a particular coordinate. SVM selects the extreme points that help define the hyperplane. These extreme points are called support vectors, which gives the algorithm its name. Figure 3 shows how two distinct groups are separated by a decision hyperplane. Classification is performed by finding the hyperplane that clearly separates the class labels. Simply put, support vectors are the coordinates of individual observations, and the SVM classifier is the frontier (hyperplane or line) that best separates the two classes.
Logistic regression is a supervised learning approach used to predict a categorical outcome variable from a set of independent variables [30]. It is an important and powerful method because it can provide probabilities and classify new data using both discrete and continuous inputs. Figure 4 shows the logistic (sigmoid) function, a mathematical function that maps predicted values to probabilities. It maps any real value to a value between 0 and 1. The main assumptions of this method are that the predicted output is categorical and that the input variables are not multicollinear.
The KNN method is one of the most basic machine learning algorithms. It is based on the supervised learning approach and is used to address both classification and regression problems. Feature similarity is the foundation of the KNN method. KNN is a straightforward, understandable, and flexible machine learning technique.
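The logistic (sigmoid) mapping described above for logistic regression can be written in a few lines. The sketch below is illustrative only: the weights, bias, and feature values are hypothetical and are not taken from the trained model.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, this equivalent form avoids overflow in exp().
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(weights, bias, features):
    """Logistic regression score: sigmoid of a weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights for three binary symptom features.
p = predict_probability([1.5, -0.5, 2.0], -1.0, [1, 0, 1])
```

A probability above a chosen threshold (0.5 is a common assumption) is then read as a positive prediction for the class in question.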

Bulletin of Electr
KNN is used in a variety of fields, including economics, politics, medicine, computer vision, and video recognition. Financial institutions, for example, use it to predict a customer's credit rating [31]. The KNN approach assumes that a new case is similar to existing cases and assigns it to the category it most resembles (see Figure 5). The K in KNN is the number of nearest neighbors, and choosing it is the most important decision. When there are two classes, K is usually an odd number. When K = 1, the procedure is known as the nearest neighbor algorithm.
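A minimal sketch of the KNN procedure described above, assuming Euclidean distance and a simple majority vote among the K closest training records. The function name knn_predict and the toy symptom vectors are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k=3):
    """Assign the query to the most common class among its k nearest neighbors."""
    # Pair every training record's distance to the query with its label.
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]


# Hypothetical toy data: three binary symptoms per record.
train_rows = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
train_labels = ["flu", "flu", "migraine", "migraine"]

print(knn_predict(train_rows, train_labels, [1, 1, 0], k=3))  # flu
```

With k=1 the function reduces to the nearest neighbor rule mentioned above, and an odd k avoids tied votes when there are only two classes.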

RESULT AND DISCUSSION
Since this is essentially a multiclass problem, each classifier produced a 12 × 12 confusion matrix. Table 3 shows the resulting matrix for each of the classifiers. Accuracy, F1, precision, and recall scores are calculated from these confusion matrices to evaluate this work, and Table 3 also shows the results of these performance evaluation metrics. Table 3 demonstrates that the part classifier surpasses the other four classification algorithms when the results are examined as a whole. The part classifier has the highest accuracy of all the classifiers for all classes, with values of 88.2, 88.1, 88.5, and 88.2. The other results in Table 3 and Table 4 also support the part classifier. After applying the SVM, random forest, and logistic regression algorithms, we found random forest to be the best classifier algorithm. The results are also shown in the bar chart in Figure 6.

COMPARATIVE ANALYSIS
In this section, we review some sources related to our work. We found only three papers on multiple-disease prediction, and we identified some gaps and problems in that research. Chang et al. proposed a two-phase research approach for predicting hyperlipidemia and hypertension at the same time. They began by selecting specific risk variables for these two diseases using six data mining methodologies and then applied the voting principle to discover the shared risk factors. After that, they built multiple predictive models for hyperlipidemia and hypertension using the multivariate adaptive regression splines (MARS) approach [5]. Bang et al. developed an ML-based multiclass classification method that uses gut microbiome data to distinguish among six diseases. To create the prediction model, they used the abundance of microorganisms at five taxonomic levels as features, with only 696 samples obtained from various studies [7]. Kunjir et al. proposed a method for efficient and advanced disease prediction based on historical training data. Their strategy is to analyze and evaluate different data mining methods. The datasets chosen for implementation contain more than 20 medically relevant attributes for each training example. Heart disease, breast cancer, arthritis, and diabetes are among the medical datasets chosen for the research [8]. Our research differs from these works in that we predict human diseases directly from the symptoms that respondents report. The details and results of the related work are given in Table 5.

CONCLUSION AND FUTURE WORK
This task mainly consists of analyzing an individual's symptoms and identifying the disease. Several data mining techniques were used to achieve this result. The major goal of this research is to integrate data mining and machine learning approaches to provide credible results for common human diseases. To complete this task, 70% of the data was used to train the classifiers and 30% to test them. We examined several quality assessment criteria to identify the most effective classification algorithm. We found that the part classifier outperforms every other data mining technique. In the future, we will work with bigger datasets with more attributes and employ more data mining techniques.

Figure 1. Flowchart of disease prediction using data mining techniques based on symptoms

Figure 2. The random forest method is depicted in the diagram

Table 2. Disease names and their possible values

Table 3. Comparison of the four classifiers' performance

Table 4. Overall comparison of the four classifiers' performance

Table 5. Disease prediction related work