Toward enhanced skin disease classification using a hybrid RF-DNN system leveraging data balancing and augmentation techniques

Significant health concerns are associated with skin diseases, and accurate and timely diagnosis is essential for effective treatment and patient management. To improve the classification of cutaneous diseases, we propose a novel hybrid system that incorporates the strengths of random forest (RF) and deep neural network (DNN) algorithms. The system employs data augmentation and balancing techniques to enhance model performance and generalizability. The HAM10000 dataset of diverse dermatoscopic images is used for training and evaluation in this study. In the proposed hybrid system, the RF model provides an initial diagnosis based on patient-reported symptoms, while the DNN analyzes images of skin lesions, resulting in more precise and efficient diagnoses. Using hyper-parameter optimization, we fine-tune the system for optimal performance. In our evaluation, the hybrid model achieves an overall classification accuracy of 96.8% and demonstrates exceptional efficacy in six of seven skin disease classes. However, variations in sensitivity and a reliance on data quality and quantity remain limitations. Nevertheless, this hybrid system has the potential to revolutionize skin disease diagnosis and treatment.


INTRODUCTION
Diseases of the skin are a prevalent and significant health concern, affecting millions of people around the world and posing substantial challenges to medical professionals [1]. A timely and accurate diagnosis is essential for effective treatment and management of these conditions, as it has a direct impact on patient outcomes and quality of life [2]. Recent advances in machine learning algorithms have created new opportunities for medical diagnosis, particularly in the classification of cutaneous diseases [3]. The purpose of this study is to propose a novel and effective hybrid system for the classification of skin diseases that incorporates the strengths of two potent machine learning techniques: the random forest (RF) model and the deep neural network (DNN) [4], [5]. Both algorithms offer distinct benefits that, when combined, can improve diagnostic accuracy and overall system performance [6].
The RF model is well known for its capacity to deal with large datasets and make precise predictions quickly. It constructs an ensemble of decision trees, each of which is trained on a random subset of the data, and combines their individual predictions. In our hybrid system, the RF model functions as the initial diagnostic instrument, utilizing patient-reported symptoms such as irritation, erythema, and scaling to provide an early evaluation [7].
In contrast, DNNs are state-of-the-art image analysis tools that perform exceptionally well in image classification tasks [8]. DNNs, which consist of multiple layers of interconnected artificial neurons, are adept at deriving intricate image features and making accurate predictions [9], [10]. In our hybrid system, the DNN is tasked with analyzing dermatoscopic images of the patient's epidermis in order to provide a more precise diagnosis [11].
By combining the RF and DNN algorithms, this hybrid approach capitalizes on their complementary benefits. The system strives for superior diagnostic precision and dependability while optimizing time and resource utilization [12]. In addition, the hybrid system seeks to reduce dermatologists' reliance on subjective visual inspection, thereby mitigating the inherent subjectivity and time-intensiveness of conventional diagnostic methods. To increase the efficiency of the hybrid system, we also investigate data balancing and augmentation techniques [13]. It is essential to address class imbalance in the dataset to prevent bias and ensure that all skin disease types are represented fairly [14]. To modify the class distribution and enhance the efficacy of the model, balancing techniques such as over-sampling, under-sampling, and synthetic data generation are used. Data augmentation techniques are also utilized to increase the dataset's diversity, thereby improving the system's generalizability and diagnostic precision. Through transformations such as rotation, scaling, and inversion, the augmented dataset includes a broader spectrum of skin disease variants, thereby improving overall performance [15].
The paper is structured as follows. Section 2 introduces the proposed RF-DNN hybrid system for the classification of skin diseases, detailing how the RF and DNN algorithms are combined to achieve high accuracy. Section 3 provides information on the dataset employed, the data preprocessing techniques, and the hybrid system implementation methodology. Section 4 presents the results and discussion, including the model's performance metrics and its dermatological implications. Section 5 concludes the paper by discussing the system's efficacy, potential applications, and future directions for research.

PROPOSED RF-DNN HYBRID SYSTEM
Using the combined strengths of RF and DNN algorithms, the proposed research introduces a hybrid system for the classification of skin diseases. Incorporating data balancing and augmentation techniques helps to improve the system's precision and effectiveness. The first component of the hybrid system is the RF model, which is renowned for its ability to manage large datasets and make accurate predictions quickly. The RF model constructs a collection of decision trees, each trained on a random subset of the data, and aggregates their individual predictions in order to reach a final diagnosis. In this hybrid system, the RF model initially evaluates patient-reported symptoms, such as irritation, erythema, and scaling, in order to provide early diagnostic insight.
The DNN, a sophisticated image analysis tool with exceptional efficacy in image classification tasks, is the second component of the hybrid system. DNNs, which consist of layers of interconnected artificial neurons, excel at extracting complex image features and making accurate predictions. Dermatoscopic images of the patient's epidermis are analyzed by the DNN component, which then provides detailed and accurate diagnoses [16]. To address the issue of class imbalance in the dataset, data balancing techniques including oversampling, undersampling, and synthetic data generation are used. These techniques modify the class representation to prevent bias and ensure an equitable classification of all types of skin disease. The architecture of the proposed hybrid system is illustrated in Figure 1. By balancing the dataset, the hybrid system becomes more resistant to class imbalances, thereby enhancing its performance. In addition, data augmentation techniques are employed to increase the dataset's diversity, thereby enhancing the system's capacity for generalization and achieving greater diagnostic precision. Through transformations such as rotation, scaling, and inversion, the augmented dataset encompasses a wide array of skin disease variations, thereby enhancing the robustness of the hybrid system.
In the hybrid system, the combination of RF and DNN algorithms, along with data balancing and augmentation techniques, produces a synergistic effect [17]. This strategy seeks to improve diagnostic accuracy and efficiency while decreasing dermatologists' reliance on subjective visual inspection. The proposed hybrid system is evaluated using a comprehensive dataset of dermatoscopic images, and its performance is compared to that of existing individual classifiers. By combining the strengths of both algorithms and integrating data balancing and augmentation techniques, the hybrid system is anticipated to revolutionize the classification of skin diseases, providing accurate, efficient, and dependable diagnostic support to both patients and healthcare professionals. In addition, the system's potential extends beyond the classification of skin diseases to a variety of other medical diagnosis and classification tasks.

MATERIALS AND METHOD

Dataset collection
The HAM10000 dataset of dermatoscopic images [10], compiled by researchers at the Medical University of Vienna together with a skin cancer practice in Queensland, Australia, is a key resource for research on cutaneous cancer and its classification. This large dataset includes over 10,000 high-quality images, each labeled with one of seven diagnostic categories. Owing to its variety and the accuracy of its annotations, it is a credible and representative source for researchers studying skin lesions and cutaneous cancer. A dataset of this size allows sophisticated machine learning and deep learning models to be constructed and assessed, supporting improvements in skin disease detection and therapy and, ultimately, in dermatological patient care.
The well-maintained HAM10000 dataset is divided into two subsets: a training set of 7,500 images and a test set of 2,500 images, each with diagnostic annotations [15]. This split makes the dataset suitable for both training and assessing machine learning models, offering a large and varied collection with which to evaluate diagnostic methods reliably. Its rich and varied skin lesion images are essential for creating and improving cutaneous cancer diagnostic algorithms, and the wide range of skin abnormalities allows researchers to test their classification algorithms and diagnostic systems. Figure 2 shows samples of the seven cutaneous disease classes from the HAM10000 dataset.

Medical image information pre-processing
Medical image preprocessing is essential for preparing dermatoscopic images from the HAM10000 dataset for subsequent analysis and classification. The objective of preprocessing is to improve image quality and make the images appropriate for machine learning algorithms [18]. Several crucial stages are involved in the preprocessing of medical images to improve the precision and reliability of the skin disease classification system.

Image enhancement is the initial phase of the preprocessing pipeline. This entails employing a variety of techniques to enhance the overall quality and visibility of the images. A frequently employed technique is contrast enhancement, which modifies the image's contrast to make features more distinct. Noise reduction is another technique that eliminates random variations in pixel intensities, thereby reducing visual noise in images. In addition, image smoothing, which is accomplished through filtering, reduces high-frequency noise and produces images with a smoother appearance. Data normalization is the next stage, which is essential for making images comparable across modalities or patients. The intensity ranges of medical images acquired with different imaging devices may vary. Data normalization ensures that images can be consistently processed by machine learning algorithms by standardizing the intensity values. Lastly, data augmentation techniques are utilized to increase the dataset's diversity. This procedure involves generating new images by randomly transforming existing samples. Common transformations consist of rotation, inversion, and scaling. Data augmentation assists in overcoming the challenge of limited data, which can lead to overfitting, and improves the machine learning model's generalization capabilities [19].
The preprocessing of medical images from the HAM10000 dataset ensures that machine learning algorithms can extract features and classify skin diseases with precision. By enhancing image quality, normalizing data, and increasing dataset diversity via augmentation, the skin disease classification system can achieve greater accuracy and robustness, thereby contributing to more accurate diagnosis and treatment planning.
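As an illustration of the normalization and smoothing stages described in this subsection, the following minimal Python sketch rescales pixel intensities to a common range and applies a 3x3 mean filter. It is not the pipeline used in this study: the choice of min-max scaling, the 3x3 window, the function names, and the tiny grayscale image are all assumptions made for the example.

```python
def min_max_normalize(image):
    """Rescale pixel intensities linearly to the range [0.0, 1.0]."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant image: avoid division by zero
        return [[0.0 for _ in row] for row in image]
    return [[(p - lo) / (hi - lo) for p in row] for row in image]


def mean_filter_3x3(image):
    """Smooth an image by averaging each pixel with its in-bounds neighbours."""
    h, w = len(image), len(image[0])
    smoothed = []
    for r in range(h):
        row = []
        for c in range(w):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
            ]
            row.append(sum(window) / len(window))
        smoothed.append(row)
    return smoothed


# Hypothetical 2x2 grayscale image with device-specific intensity range.
raw = [[50, 100], [150, 250]]
normalized = min_max_normalize(raw)
smoothed = mean_filter_3x3(normalized)
print(normalized)
print(smoothed)
```

Normalizing before filtering keeps the smoothed values inside the same [0, 1] range, which is convenient when images from different devices are mixed in one training set.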

RF algorithm
The RF algorithm is a powerful ensemble approach often used in machine learning for classification and regression applications. During the training phase, it constructs multiple decision trees and combines their predictions to reach a conclusion [20]. The RF method is founded on the concept of bagging, in which multiple models are constructed from distinct subsets of the training data in an effort to reduce variance and improve precision. The RF algorithm operates as follows.
During the training process, the RF algorithm produces a large number of decision trees, with each tree trained on a randomly chosen fraction of the training data. Each decision tree learns to make predictions based on distinct data characteristics and patterns. Once all decision trees have been trained, the RF algorithm combines their predictions for classification tasks using a majority voting mechanism. For regression tasks, it averages the predictions of all decision trees. The ability of the RF algorithm to limit overfitting is one of its major advantages. By training each decision tree on a different subset of the data, RF reduces the likelihood that noise memorized by a single tree dominates the final prediction, thereby enhancing the model's overall generalization capability. Another advantageous aspect of RF is its capacity to evaluate the significance of features in the classification process. It can establish which features have the greatest influence on the model's predictions, thereby providing valuable insights into the data's underlying relationships.
The RF algorithm is a prime candidate for the initial diagnosis component of the hybrid system due to its adaptability and capacity to manage large, high-dimensional datasets. Using the strengths of the RF algorithm, the hybrid system can rapidly process patient-reported symptoms such as irritation, erythema, and scaling to provide an initial classification of skin diseases. This enables efficient data management and contributes to the overall precision and effectiveness of the classification procedure [21].
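The bagging and majority-voting steps described in this subsection can be sketched in a few lines of Python. This is a simplified illustration rather than the model used in this study: one-split decision stumps stand in for full decision trees, the random feature selection of a complete RF is omitted, and the symptom scores and class labels in `toy_samples` are hypothetical.

```python
import random
from collections import Counter


def train_stump(samples):
    """Fit a one-split tree: keep the (feature, threshold) with the fewest errors."""
    best = None
    for f in range(len(samples[0][0])):
        for x, _ in samples:
            t = x[f]
            left = [y for xi, y in samples if xi[f] <= t]   # never empty
            right = [y for xi, y in samples if xi[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    return best[1:]


def predict_stump(stump, x):
    f, t, left_label, right_label = stump
    return left_label if x[f] <= t else right_label


def train_forest(samples, n_trees=25, seed=0):
    """Bagging: train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(samples) for _ in samples]) for _ in range(n_trees)]


def predict_forest(forest, x):
    """Majority vote over the predictions of all stumps."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical symptom scores (irritation, erythema, scaling) on a 0-3 scale.
toy_samples = [
    ((0, 1, 0), "class_a"), ((1, 0, 1), "class_a"),
    ((0, 0, 1), "class_a"), ((1, 1, 0), "class_a"),
    ((3, 2, 3), "class_b"), ((2, 3, 2), "class_b"),
    ((3, 3, 3), "class_b"), ((2, 2, 3), "class_b"),
]
forest = train_forest(toy_samples)
print(predict_forest(forest, (0, 0, 0)), predict_forest(forest, (3, 3, 3)))
```

Because each stump sees a different bootstrap sample, an occasional poorly trained stump is outvoted by the rest of the ensemble, which is the variance-reduction effect that bagging is intended to provide.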

Deep NN algorithm
The DNN is a state-of-the-art approach in the field of image recognition and analysis. It is particularly well suited to applications such as image classification, object identification, and segmentation, and its architecture is tailored to handle the intricate patterns and characteristics that are often seen in images. DNNs consist of a series of linked layers of artificial neurons. The layers are structured in a hierarchical manner, beginning with an input layer that receives the raw pixel values of the image. This is followed by many hidden layers that gradually extract characteristics of increasing complexity from the input. Convolutional layers play a vital role in DNNs. These layers scan the input image using small filters (also known as kernels), identifying and learning spatial patterns and features [22].
Convolutional layers allow the network to identify visual characteristics automatically, eliminating the need for manual feature engineering. Nonlinear activation functions are applied to the output of each neuron in the network. Common activation functions, such as the rectified linear unit (ReLU), introduce non-linearity and allow the model to discover complicated relationships in the data. Pooling layers reduce the spatial dimensions of the feature maps, decreasing the computational burden and helping to limit overfitting [23]. DNNs often use max pooling and average pooling. The DNN generally ends with fully connected layers, which follow the convolutional and pooling layers. In these layers, the high-level characteristics from previous layers are used to construct the final predictions. Backpropagation is used to adjust the weights of the neurons in a DNN so as to minimize the discrepancy between the predicted and actual output [24]. Optimization is performed using stochastic gradient descent [25].
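To make the three building blocks described above concrete, the following minimal Python sketch applies a convolution, a ReLU activation, and 2x2 max pooling to a small image. It is an illustration only: the 5x5 image and the 2x2 edge-detecting kernel are assumptions chosen so the arithmetic is easy to follow, and, as in most deep learning frameworks, the "convolution" is computed without flipping the kernel.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Replace negative activations with zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 block."""
    h, w = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, w - 1, 2)
        ]
        for r in range(0, h - 1, 2)
    ]


# Hypothetical 5x5 image containing a bright vertical stripe.
image = [[0, 0, 1, 1, 0] for _ in range(5)]
# Hypothetical kernel that responds to dark-to-bright transitions.
kernel = [[-1, 1], [-1, 1]]

feature_map = conv2d_valid(image, kernel)   # 4x4 feature map
activated = relu(feature_map)               # negative responses removed
pooled = max_pool_2x2(activated)            # 2x2 summary
print(feature_map[0], pooled)
```

In a trained network the kernel values are not chosen by hand as they are here; they are the weights that backpropagation adjusts.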

Data balancing and enhancement methods
Data balancing and augmentation are important stages in preparing medical image data for machine learning algorithms. They play an important role in addressing class imbalance issues and improving the model's generalizability and performance.

Balanced data techniques
When one class in a dataset is represented by substantially fewer samples than other classes, class imbalance exists. Certain skin diseases may be less prevalent in medical image datasets, resulting in an unbalanced distribution of samples among the classes. This may lead to skewed model predictions, as the model may favor the majority class. To mitigate the effects of class imbalance, data balancing techniques are used. These techniques modify the representation of the classes within a dataset so that each class has an adequate number of training samples. Data balancing methods include:
Over-sampling: this produces more minority class samples to match the number of majority class samples. This strategy balances the class distribution and allows the model to learn from more varied samples [26].
Under-sampling: this decreases the number of majority class samples to match the number of minority class samples. Removing samples from the majority class leads to a more balanced distribution of classes.
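The two balancing methods listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in this study: over-sampling is done by random duplication rather than synthetic generation, the function names are hypothetical, and the class labels and image identifiers in `toy` are placeholders.

```python
import random
from collections import Counter, defaultdict


def group_by_class(samples):
    """Group (item, label) pairs by their label."""
    groups = defaultdict(list)
    for item, label in samples:
        groups[label].append((item, label))
    return groups


def random_oversample(samples, seed=0):
    """Duplicate random minority samples until every class matches the largest."""
    rng = random.Random(seed)
    groups = group_by_class(samples)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


def random_undersample(samples, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    groups = group_by_class(samples)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced


# Hypothetical imbalanced dataset: 6, 3, and 2 images per class.
toy = ([(f"img_a{i}", "class_a") for i in range(6)]
       + [(f"img_b{i}", "class_b") for i in range(3)]
       + [(f"img_c{i}", "class_c") for i in range(2)])

over = Counter(label for _, label in random_oversample(toy))
under = Counter(label for _, label in random_undersample(toy))
print(over, under)
```

Over-sampling keeps every original image but repeats some of them, whereas under-sampling discards majority-class images; which trade-off is preferable depends on how much data the smallest class contains.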

Data enhancement methods
Data augmentation is the process of increasing the dataset's diversity through the application of various transformations to the original samples. Augmentation improves the model's generalization and robustness, particularly when the dataset is small. Typical data augmentation techniques for medical images include:
Rotation: images can be rotated to simulate different perspectives, thereby enhancing the model's ability to recognize objects in various orientations [27].
Flipping: horizontal and vertical flipping creates mirrored variants, which can assist the model in learning invariant characteristics.
Scaling: resizing images to various dimensions provides variations in the size of objects, making the model more adaptable to various image sizes.
Shifting: moving images along the x and y axes introduces positional differences, thereby assisting the model in handling spatial disparities.
Noise addition: adding random Gaussian noise to images simulates real-world variations and makes the model more tolerant of noisy data.
By incorporating data balancing and augmentation techniques into the hybrid RF-DNN system, we can improve its classification performance for cutaneous diseases. The balanced and diverse dataset obtained through these techniques enables the model to make accurate predictions across all classes, resulting in more accurate and efficient skin disease diagnosis.
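The transformations described in this subsection can be sketched on a small image stored as nested lists. The sketch below is illustrative only: the function names are hypothetical, rotation is limited to 90-degree steps, scaling uses nearest-neighbour repetition, and the noise level `sigma` is an arbitrary assumption rather than a setting taken from this study.

```python
import random


def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Mirror the image top to bottom."""
    return [list(row) for row in image[::-1]]


def scale_nearest(image, factor):
    """Enlarge the image by an integer factor using nearest-neighbour repetition."""
    h, w = len(image), len(image[0])
    return [[image[r // factor][c // factor] for c in range(w * factor)]
            for r in range(h * factor)]


def shift_right(image, pixels, fill=0):
    """Shift the image along the x axis, padding the exposed columns."""
    width = len(image[0])
    return [([fill] * pixels + list(row))[:width] for row in image]


def add_gaussian_noise(image, sigma, seed=0):
    """Add zero-mean Gaussian noise to every pixel."""
    rng = random.Random(seed)
    return [[p + rng.gauss(0.0, sigma) for p in row] for row in image]


image = [[1, 2], [3, 4]]
print(rotate_90(image))
print(flip_horizontal(image))
print(flip_vertical(image))
print(scale_nearest(image, 2))
print(shift_right(image, 1))
noisy = add_gaussian_noise(image, sigma=0.1)
```

Each function returns a new image and leaves the original untouched, so several augmented variants can be generated from the same source sample.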

Experimental setup
This section describes the experimental setup for evaluating the hybrid RF-DNN system's classification performance for cutaneous disorders. The experiments were done using Python and Jupyter Lab. We used Google Colab, a cloud-based tool that uses Jupyter Notebook for machine learning research and training, to train the models [28].
A comprehensive dataset of dermatoscopic images of skin lesions, HAM10000, was employed in this study. The collection includes approximately 10,000 images representing seven skin disease classes. The dataset consists of 7,500 images for training and 2,500 images for testing [29]. We used data balancing and augmentation methods to correct class imbalance in the dataset. These strategies improve machine learning model performance by altering class representation and increasing dataset variety. Data samples were artificially altered using rotation, scaling, and inversion. To enhance performance, we optimized the hyperparameters of the hybrid RF-DNN system [30]. Table 1 summarizes the hyperparameters tuned for the algorithms employed in this study.
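The text does not state which search strategy was used for hyperparameter optimization. As one common possibility, the sketch below shows an exhaustive grid search in plain Python. Everything in it is an assumption made for illustration: the parameter names and candidate values are hypothetical and are not those of Table 1, and `toy_validation_score` is a placeholder for training the model and measuring validation accuracy.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Evaluate every combination in the grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space; not the values reported in Table 1.
grid = {
    "n_trees": [50, 100, 200],
    "learning_rate": [0.1, 0.01, 0.001],
}


def toy_validation_score(params):
    """Placeholder objective; a real run would train and validate the model."""
    return (-abs(params["n_trees"] - 100) / 100
            - abs(params["learning_rate"] - 0.01))


best_params, best_score = grid_search(grid, toy_validation_score)
print(best_params, best_score)
```

Grid search is simple and reproducible, but its cost grows multiplicatively with each added hyperparameter, so random or Bayesian search is often preferred for larger search spaces.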

Training results
This section reports how the hybrid RF-DNN model performs on HAM10000. The dataset has 10,000 instances, including 7,500 for training, 1,500 for testing, and 1,000 for validation. We monitor the hybrid model's training and validation curves throughout training to detect overfitting or underfitting. Figure 3 presents the accuracy and loss curves from hybrid model training and evaluation.
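A three-way partition with the sizes given above can be sketched as follows. The text does not say how images were assigned to each subset, so the random shuffle, the fixed seed, and the function name `split_indices` are assumptions made for this example.

```python
import random


def split_indices(n_items, n_train, n_test, n_val, seed=0):
    """Shuffle item indices and cut them into train, test, and validation parts."""
    if n_train + n_test + n_val != n_items:
        raise ValueError("subset sizes must add up to the dataset size")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    val = indices[n_train + n_test:]
    return train, test, val


train_idx, test_idx, val_idx = split_indices(10_000, 7_500, 1_500, 1_000)
print(len(train_idx), len(test_idx), len(val_idx))
```

Splitting by index rather than by copying images keeps the three subsets disjoint by construction, which matters because any overlap between training and test data would inflate the reported accuracy.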

Testing results
This sub-section evaluates the performance of the hybrid RF-DNN model on a separate test dataset. Before calculating the model's performance metrics, the confusion matrix is computed to obtain insight into its classification results. In a multi-class classification problem, the confusion matrix provides a detailed breakdown of the model's predictions for each class. Figure 4 depicts the resulting confusion matrix. By analyzing the confusion matrix, we can evaluate the performance of the model for each individual class and identify potential areas for improvement. The subsequent analysis examines the testing outcomes and discusses the implications of the confusion matrix and performance metrics. These findings shed light on the efficacy of the hybrid RF-DNN model in classifying skin diseases and its potential contribution to accurate and efficient medical diagnosis. A reduction in the classification threshold increases both false positives and true positives: when the threshold is decreased, more cases are classified as positive, including both correct positive predictions (true positives) and incorrect ones (false positives).
This observation emphasizes the trade-off between classification sensitivity (recall) and specificity. By lowering the classification threshold, we can identify more positive cases and potentially reduce the number of false negatives (positive cases that were overlooked). This, however, comes at the expense of an increase in false positives, which may result in superfluous treatments or interventions for patients who do not have the disease. The selection of the classification threshold is determined by the task-specific requirements and objectives. In situations where early detection of certain skin diseases is crucial, for instance, a lower threshold may be preferred to reduce false negatives. In situations where avoiding false positives is essential for preventing unnecessary interventions, a higher threshold may be selected to maximize specificity.
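The threshold trade-off described above can be checked numerically. The following minimal Python sketch counts the four confusion-matrix entries for a single class at two thresholds and derives sensitivity and specificity from them. The scores and labels are hypothetical values invented for the example and are not results from this study.

```python
def confusion_counts(scores, labels, threshold):
    """Return (TP, FP, FN, TN) when scores >= threshold are called positive."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predicted probabilities and true labels (1 = disease present).
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

high = confusion_counts(scores, labels, threshold=0.70)
low = confusion_counts(scores, labels, threshold=0.35)
print(high, sensitivity_specificity(*high))
print(low, sensitivity_specificity(*low))
```

On this toy data, lowering the threshold from 0.70 to 0.35 raises the true positives from 2 to 3 and the false positives from 0 to 2, so sensitivity rises from 0.5 to 0.75 while specificity falls from 1.0 to 0.5.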

Discussion
By combining the advantages of RF and DNN algorithms for the classification of skin diseases, the hybrid system proposed in this study represents a significant advancement in medical diagnosis. The hybrid system's 96.8% accuracy on the skin disease dataset is due to the synergy of its two major components, the RF classifier and the DNN classifier. The RF model's competence in handling large datasets and identifying key characteristics in the input data is critical to feature extraction. The RF model identifies diagnostically important elements and patterns and selects these relevant characteristics so that the hybrid system concentrates on the most important features of the skin disease dataset, which speeds up the diagnostic procedure. The extracted characteristics are then transformed with one-hot encoding. This technique turns categorical variables, such as skin lesion types, into numerical data that the DNN classifier can read, retaining the critical information while preparing it for integration with the DNN classifier. The DNN classifier, known for its ability to capture complicated patterns in visual data, is designed to categorize the transformed features. It is applied to a carefully selected subset of the data on which the RF model has reduced confidence. This approach improves the hybrid system's efficiency, enabling it to concentrate on difficult instances while preserving diagnostic accuracy. The cooperation between the RF model and the DNN classifier lets the hybrid system take advantage of the benefits of both: the DNN classifier builds on the RF model's efficient feature extraction from the skin disease dataset, and in turn uses its multilayered design to recognize the complicated patterns and minor changes in the transformed data that indicate skin diseases. By reducing the reliance on subjective visual inspection, the system provides a more objective and efficient approach to image analysis, resulting in more timely and accurate diagnoses. In addition, the hybrid system can assist dermatologists by providing second opinions and improving diagnostic precision.
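One possible reading of the RF-to-DNN hand-off described above is a confidence-gated cascade: the RF prediction is accepted when its class probability is high, and otherwise the case is passed to the DNN together with a one-hot encoding of the RF output. The sketch below illustrates that reading only. The function `hybrid_predict`, the 0.8 confidence threshold, and the stub `dnn_stub` are all hypothetical and are not taken from this study.

```python
def one_hot(index, n_classes):
    """Encode a class index as a vector with a single 1."""
    return [1 if i == index else 0 for i in range(n_classes)]


def hybrid_predict(rf_probs, dnn_predict, image, confidence_threshold=0.8):
    """Accept the RF class when it is confident; otherwise defer to the DNN."""
    rf_class = max(range(len(rf_probs)), key=rf_probs.__getitem__)
    if rf_probs[rf_class] >= confidence_threshold:
        return rf_class, "rf"
    # Low-confidence case: pass the image and the encoded RF output to the DNN.
    rf_encoding = one_hot(rf_class, len(rf_probs))
    return dnn_predict(image, rf_encoding), "dnn"


def dnn_stub(image, rf_encoding):
    """Hypothetical stand-in for a trained DNN; always returns class 2."""
    return 2


confident = hybrid_predict([0.05, 0.90, 0.05], dnn_stub, image=None)
uncertain = hybrid_predict([0.40, 0.35, 0.25], dnn_stub, image=None)
print(one_hot(2, 7), confident, uncertain)
```

Under this design the more expensive image model runs only on the cases the RF finds difficult, which is consistent with the efficiency argument made in the discussion.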
Despite its strengths, the hybrid system has a number of drawbacks. Its accuracy is contingent on the quality and quantity of training data, and its performance may vary across datasets of cutaneous diseases. Its efficacy also depends on the integrity of the diagnostic input images. Furthermore, the computational requirements of training and operating the model may limit its applicability in contexts with limited resources. To fully establish the hybrid system's robustness and generalizability, additional research and validation with diverse datasets and additional skin diseases are required. Nevertheless, the promising evaluation metrics presented in Table 2 demonstrate the system's potential to revolutionize the diagnosis and treatment of skin diseases, providing patients and healthcare professionals with a more efficient, accurate, and reliable diagnostic instrument.

CONCLUSION
Using data balancing and augmentation techniques, this study presents a novel and efficient hybrid RF-DNN system for the classification of cutaneous diseases. The system incorporates the strengths of RF and DNN algorithms to create a potent and precise diagnostic instrument. Through extensive testing and evaluation on the HAM10000 dataset, the hybrid system obtains a remarkable 96.8% accuracy. The hybrid system addresses the limitations of the individual models by pairing the ability of RF to manage large datasets with the expertise of DNN in image analysis. Despite its successes, the hybrid system has limitations, such as its reliance on the quality and quantity of training data and its potential performance variability across various skin disease datasets. To address these obstacles and fully realize the system's potential, additional research and validation with diverse datasets and actual patient cases are required. Future work may involve expanding the system's capabilities by integrating more advanced models, such as convolutional neural networks (CNNs), to further improve diagnostic accuracy and expand its applicability to a wider spectrum of skin diseases. In addition, integrating the hybrid system with electronic medical records (EMR) could facilitate real-time and automated diagnosis, streamlining the workflow of healthcare professionals and improving the overall efficiency of healthcare.

Figure 3. Accuracy and loss curves from hybrid model training and evaluation; (a) accuracy training curve and (b) loss training curve

Figure 4. The confusion matrix serves as a representation of the performance of the hybrid classifier

Table 1. Optimal hyper-parameters of the baseline and fine-tuned model
Toward enhanced skin disease classification using a hybrid RF-DNN system … (Soufiane Hamida)

Table 2. Performance evaluation metric of the fine-tuned model