Comparing the performance of linear regression versus deep learning on detecting melanoma skin cancer using Apple Core ML

Received Jun 24, 2021; Revised Sep 3, 2021; Accepted Oct 23, 2021

Melanoma is a deadly type of skin cancer. The survival rate can fall as low as 15.7% once the cancer has reached its final stage. Delayed treatment of melanoma can be attributed to its resemblance to the common nevus (mole). Two machine learning models were developed, each with a different approach and algorithm, to detect the presence of melanoma: image classification, which uses a regression algorithm, and object detection, which uses deep learning. The two models are then compared, and the best model is determined according to the achieved metrics. Testing was conducted on 120 images, made up of 60 positive and 60 negative samples. The results show that object detection achieved 70% accuracy compared with image classification's 68%. More importantly, linear regression's 43% false-negative rate is noticeably higher than the convolutional neural network's (CNN) 25%. A false-negative rate of 43% means almost half of the sick patients tested using image classification would be diagnosed as healthy. This is dangerous, as it can lead to delayed treatment and, ultimately, death. Thus, it can be concluded that CNN is the better method for detecting the presence of melanoma.


INTRODUCTION
Melanoma is a type of skin cancer that develops from melanocytes located in human skin. According to a report issued by the World Health Organization, melanoma is the most lethal skin cancer and accounts for 133,000 deaths globally every year [1]. According to a 2019 report issued by the Global Cancer Observatory, melanoma accounts for 1,392 deaths in Indonesia [2]. Melanoma is caused by the uncontrollable growth of melanocytes, the cells responsible for producing melanin. Melanin is a pigment that gives the skin a dark tone. UV light from the sun can damage melanocytes and trigger uncontrollable cell growth. This is how melanoma forms. Early detection is critical because the survival rate is highly dependent on early treatment. The survival rate when the cancer has metastasized is only 15.7% [3], which means roughly 17 out of 20 patients will die once melanoma reaches its final stages. In contrast, when the cancer is detected early and is still localized, the 5-year survival rate is 98.4% [4]. This is why it is crucial to detect melanoma while it is still in its early stage [5]. The problem is that melanoma is hard to detect in its early stages because it looks very similar to a common mole [6]. When it becomes a large lesion in the skin

RESEARCH METHOD

Acquiring data
The models are trained with data acquired from the International Skin Imaging Collaboration (ISIC). ISIC is an academic and industry partnership designed to facilitate digital skin imaging to help reduce melanoma mortality. ISIC's objective is to support efforts to reduce melanoma-related fatalities and excessive biopsies by enhancing the precision and reliability of early detection of melanoma. To that end, ISIC is establishing proposed guidelines for digital imaging and building a public database of clinical and dermoscopic skin lesion images [12]. Table 1 shows some samples of positive and negative images of melanoma obtained through ISIC's website.
These images show why a model is needed to differentiate a mole from melanoma: the two are extremely similar. Six hundred images are used, split into training, validation, and testing data. The dataset consists of 300 images of melanoma and 300 images of common moles (nevi). The ratio used is 60:20:20, giving 360 images for training, 120 images for validation, and another 120 images for testing. The images are chosen at random. Both models use the same images, which rules out bias from the data and makes the algorithm the only factor affecting the result. After the images are downloaded, they are placed in a folder corresponding to their usage (e.g., testing images are put in a folder named "Testing"). The training, validation, and testing folders each contain two subfolders named "Positive" and "Negative." Figure 1 shows how the folders are structured in preparation for training the models.
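The 60:20:20 split and folder layout described above can be sketched as follows. This is a minimal illustration only: the file names, the `split_dataset` helper, and the fixed random seed are assumptions, since the paper does not say how the random selection was performed.

```python
import random

def split_dataset(image_names, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle image names and split them into training/validation/testing lists."""
    names = list(image_names)
    random.Random(seed).shuffle(names)  # fixed seed so the split is repeatable
    n_train = round(len(names) * ratios[0])
    n_val = round(len(names) * ratios[1])
    return {
        "Training": names[:n_train],
        "Validation": names[n_train:n_train + n_val],
        "Testing": names[n_train + n_val:],
    }

# Hypothetical file names; the real ISIC file names are not given in the paper.
positive = [f"melanoma_{i}.jpg" for i in range(300)]
negative = [f"nevus_{i}.jpg" for i in range(300)]

# Build the "<split>/<label>" folder layout, splitting each class separately
# so that every split stays balanced between Positive and Negative.
layout = {}
for label, names in (("Positive", positive), ("Negative", negative)):
    for split, members in split_dataset(names).items():
        layout[f"{split}/{label}"] = members

split_sizes = {folder: len(members) for folder, members in layout.items()}
```

Splitting each class separately keeps every folder balanced, which matches the 60 positive and 60 negative testing images reported later.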

Image pre-processing
Images retrieved from ISIC already meet DICOM standards [7]. Therefore, no further pre-processing or image alteration is needed. The only pre-processing required is drawing a bounding box on each training and validation image, because the CNN (object detection) model requires annotated bounding boxes before training. IBM Annotation Cloud is used to give the images labels and bounding boxes. Figure 2 shows the process of giving a label and bounding box to an image.
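To make the bounding-box step more concrete, the sketch below builds one annotation entry in the JSON layout commonly used with Create ML's Object Detector: an `image` name plus a list of `annotations`, each with a `label` and center-based `coordinates`. The exact keys should be treated as an assumption to verify against Apple's documentation; the file name, the pixel values, and the `to_createml_annotation` helper are hypothetical.

```python
import json

def to_createml_annotation(image_name, label, left, top, width, height):
    """Convert a top-left pixel box into an entry with center-based x and y."""
    return {
        "image": image_name,
        "annotations": [{
            "label": label,
            "coordinates": {
                "x": left + width / 2,   # horizontal center of the box
                "y": top + height / 2,   # vertical center of the box
                "width": width,
                "height": height,
            },
        }],
    }

# Hypothetical image name and box, for illustration only.
entry = to_createml_annotation("positive-1.jpg", "Positive",
                               left=40, top=60, width=200, height=180)

# The annotation file is a JSON array with one entry per image.
annotations_json = json.dumps([entry], indent=2)
```

An annotation tool would normally export this file directly, so the conversion is only needed when boxes are stored in a top-left format.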

Training process
Two models are trained using Create ML, each with a different algorithm. Create ML is a user interface for training ML models that communicates with the Core ML framework for ease of use. Its output is a .mlmodel file. Figure 3 shows the opening interface of Create ML.
The "Image Classifier" option uses a linear regression model, whereas the "Object Detector" option uses a CNN. After the model type is chosen, the training interface is shown. For the regression model, training is left at the default of 25 iterations. Setting the iteration count beyond 30 did not make any difference to the model's accuracy because the training process converges at exactly 25 iterations. The training, validation, and testing data are then set to their corresponding folders. The process for the CNN model is the same, except that there is an extra .json file inside the folder.

Linear regression
The first model uses the linear regression algorithm. This algorithm is fast and lightweight but potentially less accurate. The model takes a labelled image and applies sharpening filters to accentuate the image's features. A grid is then superimposed on the image. Each box of the grid holds a value ranging from 0.00 to 1.00, where 0.00 means the box is entirely white and 1.00 means the box is entirely black. This grid is the feature of the image, expressed as numerical values. The grid is transformed into an array, which is then stored inside the model. This process is repeated for each image, so one array is created per training image. The arrays are stored in two classes based on the labels given, and these labelled arrays are what the model is made of. When identifying an image, the same process applies: sharpening filters are applied, a grid is made, and the grid is transformed into an array. Since this array has no label, the model compares it to the labelled arrays stored during training and checks whether the new array is more similar to the "Positive" arrays or the "Negative" arrays. A similarity percentage is then produced based on the comparison. This is how the linear regression model makes an identification. Figure 5 visualizes the process.
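The grid-and-compare process described above can be illustrated with a toy sketch. It is not Create ML's actual implementation: the sharpening step is omitted, and the 2x2 grid, the similarity measure (one minus the mean absolute difference), and all function names are assumptions made for illustration.

```python
def grid_features(image, cells=2):
    """Average darkness per grid box, flattened into one array.

    `image` is a square list of rows with values in [0.0, 1.0],
    where 0.0 is white and 1.0 is black.
    """
    size = len(image) // cells
    features = []
    for gy in range(cells):
        for gx in range(cells):
            block = [image[y][x]
                     for y in range(gy * size, (gy + 1) * size)
                     for x in range(gx * size, (gx + 1) * size)]
            features.append(sum(block) / len(block))
    return features

def similarity(a, b):
    """1.0 for identical arrays, lower as the mean absolute difference grows."""
    return 1.0 - sum(abs(p - q) for p, q in zip(a, b)) / len(a)

def classify(image, knowledge_base):
    """Return the label whose stored arrays are on average most similar."""
    query = grid_features(image)
    scores = {
        label: sum(similarity(query, arr) for arr in arrays) / len(arrays)
        for label, arrays in knowledge_base.items()
    }
    return max(scores, key=scores.get), scores

# Toy 4x4 "images": one dark training sample and one light training sample.
dark = [[0.9] * 4 for _ in range(4)]
light = [[0.1] * 4 for _ in range(4)]
knowledge_base = {
    "Positive": [grid_features(dark)],
    "Negative": [grid_features(light)],
}

unknown = [[0.8] * 4 for _ in range(4)]
label, scores = classify(unknown, knowledge_base)
```

Here the unlabelled image is closer to the stored "Positive" array, so it is classified as "Positive", with the per-label scores playing the role of the similarity percentage.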

Convolutional neural network
The second model uses an algorithm called a convolutional neural network (CNN). This algorithm mirrors how the human brain works. First, the training image is divided into multiple parts. Multiple filters are then applied to each of those parts. The filtered parts are associated with the labels and form the "neurons" of the model. Because many images are trained, each image contains multiple parts, and each part passes through multiple filters, one model can potentially contain more than 100,000 neurons. These neurons are the knowledge base of the model, similar to the arrays in the regression model. The main difference is that, in the regression model, one image results in only one array (one piece of "knowledge"). So, if 60 images are used in the training phase, the regression model has only 60 arrays in its knowledge base, whereas in a CNN, 60 images result in far more than 60 neurons. The CNN therefore has the benefit of having more knowledge to work with when identifying an image. For identification, the same process is applied to the new image.
However, since the neurons produced from the new image have no label, they are compared to the thousands of already labelled neurons inside the model's knowledge base. If they connect more strongly to the neurons labelled "positive", the image is classified as "positive", and vice versa. Figure 6 shows the process. Since the model in that figure is learning from an image labelled "positive", all the resulting neurons are associated with the "positive" label.
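The idea of sliding filters over parts of an image can be shown with a toy convolution. This is a generic sketch of the standard operation, not the architecture Create ML uses internally; the 4x4 image and the edge filter are made-up values.

```python
def convolve2d(image, kernel):
    """Valid (no padding) 2D cross-correlation of an image with a small kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [sum(image[y + i][x + j] * kernel[i][j]
             for i in range(kh) for j in range(kw))
         for x in range(out_w)]
        for y in range(out_h)
    ]

def relu(feature_map):
    """Keep positive responses and zero out the rest."""
    return [[max(0.0, v) for v in row] for row in feature_map]

# Toy 4x4 image whose right half is dark (1) and left half is light (0).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Filter that responds where darkness increases from left to right.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = relu(convolve2d(image, vertical_edge))
```

Each output value summarizes one 2x2 part of the image under one filter, and only the parts straddling the light-to-dark boundary respond. A real CNN learns many such filters and stacks several layers of them.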

Testing process
After the training process is done, the testing process is conducted. The testing data consist of 120 images, of which 60 are labelled "Positive 1" through "Positive 60" and another 60 are labelled "Negative 1" through "Negative 60". Each model is fed the testing images, and the classification made by each model is noted. The testing is done only once because each model classifies a given image consistently, with no variation across repeated tests (i.e., picture 1 is always identified as "negative" by the model no matter how many times the test is repeated). Figure 7 shows how the testing is conducted.

Figure 7. Testing menu
For example, in Figure 7, the testing picture has the label "Positive 1", meaning it shows a positive case of melanoma. The classification made by the model is 99% negative, but the true classification of the image is positive. Hence, the model failed to identify testing image number 1. The test continues with the image labelled "Positive 2" and so on up to "Positive 60". The same process applies to the images labelled "Negative 1" to "Negative 60". In short, whenever the classification made by the model differs from the label, the model has made a mistake in identification. The correct and incorrect identifications are then tallied to form the confusion matrix for each model, which gives the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). From the confusion matrix, various metrics can be calculated to determine which algorithm is best at diagnosing melanoma. The first basic metrics are accuracy and miss. Accuracy measures the percentage of data correctly classified by the test under evaluation [13]. Equation (1) shows the formula for calculating accuracy.

$$\text{Accuracy (\%)} = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \tag{1}$$

Miss measures the percentage of data incorrectly classified by the test. Miss is always the complement of accuracy. Equation (2) shows the formula for calculating miss.

$$\text{Miss (\%)} = \frac{FP + FN}{TP + TN + FP + FN} \times 100 = 100 - \text{Accuracy (\%)} \tag{2}$$

The next metrics used are sensitivity and specificity. Sensitivity measures how accurate the model is at identifying a true positive, i.e., a sick person [14]. The higher the sensitivity, the better the model is at detecting a sick individual. Equation (3) shows the formula for calculating sensitivity.

$$\text{Sensitivity (\%)} = \frac{TP}{TP + FN} \times 100 \tag{3}$$

On the other hand, specificity measures how accurate the model is at identifying a true negative, i.e., a healthy person [14]. A diagnostic model with high specificity is better at detecting a healthy individual. Equation (4) shows the formula for calculating specificity.

$$\text{Specificity (\%)} = \frac{TN}{TN + FP} \times 100 \tag{4}$$

The last metrics are the false-positive rate (FPR) and the false-negative rate (FNR). FPR and FNR measure how many errors a diagnostic tool makes in identifying a disease [15]. A model with a high FPR tends to make many false-positive diagnoses, i.e., healthy individuals identified as sick. Equation (5) shows the formula for calculating FPR.

$$\text{FPR (\%)} = \frac{FP}{FP + TN} \times 100 \tag{5}$$

On the other hand, a model with a high FNR tends to make many false-negative diagnoses, i.e., sick individuals identified as healthy. Equation (6) shows the formula for calculating FNR.

$$\text{FNR (\%)} = \frac{FN}{FN + TP} \times 100 \tag{6}$$
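The tallying step and the six metrics can be sketched in a few lines, using the standard confusion-matrix definitions (for example, sensitivity is TP / (TP + FN) and FNR is FN / (FN + TP)). The outcome counts below are an assumption: they are hypothetical values chosen to be roughly consistent with the percentages reported for the regression model, not figures taken from the paper's tables.

```python
def tally(results):
    """Count TP, TN, FP, FN from (true_label, predicted_label) pairs."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, predicted in results:
        if truth == "positive":
            counts["TP" if predicted == "positive" else "FN"] += 1
        else:
            counts["TN" if predicted == "negative" else "FP"] += 1
    return counts

def metrics(c):
    """Compute the six metrics, all expressed as percentages."""
    total = c["TP"] + c["TN"] + c["FP"] + c["FN"]
    accuracy = 100 * (c["TP"] + c["TN"]) / total
    return {
        "accuracy": accuracy,
        "miss": 100 - accuracy,
        "sensitivity": 100 * c["TP"] / (c["TP"] + c["FN"]),
        "specificity": 100 * c["TN"] / (c["TN"] + c["FP"]),
        "fpr": 100 * c["FP"] / (c["FP"] + c["TN"]),
        "fnr": 100 * c["FN"] / (c["FN"] + c["TP"]),
    }

# Hypothetical test outcome for 60 positive and 60 negative images
# (assumed counts, not the paper's confusion matrix).
results = ([("positive", "positive")] * 34 + [("positive", "negative")] * 26
           + [("negative", "negative")] * 48 + [("negative", "positive")] * 12)

counts = tally(results)
report = metrics(counts)
```

Note that sensitivity and FNR always sum to 100%, as do specificity and FPR, so improving one of each pair necessarily lowers the other.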

RESULTS AND DISCUSSION
Table 2 shows the metrics achieved by the regression model along with the time it took to complete the training process. Linear regression's training is fast, taking only 1 minute and 31 seconds. As can be seen in Figure 8, linear regression took only ten iterations to complete the training process.
As the training process converges to 100% at ten iterations, more iterations are not needed. The training process is fast, but there are potential consequences for the accuracy of the model. Table 3 shows the metrics achieved by the convolutional neural network model along with the time it took to complete the training process.
The convolutional neural network took a very long 71 hours and 35 minutes to complete the training process, which is almost three days. The convolution process and the formation of neurons are responsible for the long duration. Figure 9 shows the training performance of the convolutional neural network. The graph in Figure 9 shows the loss while the model is being trained. To minimize this loss, the model ran multiple iterations, each building on the result of the previous one. As more iterations pass, the model becomes more accurate. At 3,000 iterations, the loss had fallen to 0.6.

Figure 9. Convolutional neural network performance

Metric analysis
At first glance, the two models appear to achieve similar results. Table 2 and Table 3 show that the accuracies differ by only 2 percentage points, with CNN holding the lead. It is important to note that the CNN algorithm failed to identify three images, giving neither a positive nor a negative diagnosis. These three images are not included in the calculation. This is one reason why accuracy is not the be-all and end-all of machine learning metrics.
Other metrics can hold valuable information regarding which method is best, for example, sensitivity and specificity. Sensitivity measures how accurately a diagnostic model can identify a sick individual, whereas specificity measures how accurately it can identify a healthy individual. The CNN model has a higher sensitivity, at 75%, than the 56% achieved by the regression model. However, the regression model has a much higher specificity, at 80%, than CNN's 68%. This means that CNN is better at identifying an individual with melanoma, but regression is better at identifying a healthy individual without melanoma. This is a delicate balance because both metrics cannot be maximized at the same time. Other metrics that can be analyzed are the FPR and FNR. CNN has a higher FPR but a lower FNR. These metrics can be analyzed to determine which method is best at detecting melanoma.
For accuracy and miss, the analysis is simple: the higher the accuracy and the lower the miss, the better. Table 3 shows that CNN achieved the best result, at 70% accuracy and 30% miss. However, the difference from the regression model shown in Table 2 is a mere 2 percentage points. This is too small a difference to decide which method is best, so other metrics need to be analyzed. For sensitivity, CNN scores higher than linear regression, as shown in Table 3; therefore, the CNN model is more accurate at identifying the presence of melanoma. Meanwhile, the regression model is more accurate at identifying a healthy person. However, in a predictive diagnostic test, identifying a sick individual is more of a priority than identifying a healthy one [16]. Therefore, CNN achieved the best metric in this category.
The next metrics are FPR and FNR. These are important metrics to analyze because they affect how the model performs in real life. Table 2 shows that the regression model has a lower FPR, at 20%, compared to CNN's 30%, shown in Table 3. This means the regression model is less prone to false-positive errors than the CNN model. However, the regression model has an alarmingly high FNR of 43%. This means that if 100 sick individuals use this model as a diagnostic tool, 43 will be identified as healthy. A diagnostic test with a high false-negative rate is hazardous because individuals with melanoma can be identified as healthy and thus miss much-needed early treatment. A false-negative result can lead to death in some diseases [17], including diseases that need early treatment, like melanoma [8]. Therefore, CNN achieved the better result in this category. Table 4 shows a comparison of metrics achieved by both the regression and the CNN model.

[Table 4: only one row label survives: "Dermatologist with 1-2 years of experience [19]"]

It is important to highlight that CNN is still more accurate than the two biopsy methods, shave biopsy and visual biopsy. From this research, it can be determined that diagnosis with CNN, while not as accurate as a full-body biopsy, is accurate enough for a quick assessment of a highly suspicious mole. It does not require an operation, and the result comes within 5 seconds. The diagnosis can be done with a MacBook or an iPhone and does not need highly specialized equipment. Again, while the accuracy is not that high, immediate treatment is crucial when it comes to melanoma cases, and rapid analysis with a cellphone is one of the methods that can be used [24]. With this quick method, a diagnosis can be made at home, and when the result shows more than 50% positive, the person can go to a hospital to receive a more accurate biopsy and other treatments.
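The false-negative comparison above can be made concrete with a short worked calculation. It assumes a group of 100 individuals who all have melanoma and applies the reported FNR of each model (43% for regression, 25% for CNN).

```latex
\begin{align*}
\text{missed}_{\text{regression}} &= 100 \times 0.43 = 43 \\
\text{missed}_{\text{CNN}} &= 100 \times 0.25 = 25 \\
\text{difference} &= 43 - 25 = 18
\end{align*}
```

Under that assumption, the CNN model would miss 18 fewer sick individuals per 100 than the regression model.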
As melanoma cases continue to rise every year [25], and with skin disease being the most common form of disease globally [26], it is necessary to give everyone access to immediate treatment. Simple excision is usually curative when melanoma is detected while it is still confined to the skin's outer layers, where the 5-year relative survival rate is around 90%. The need to enhance the effectiveness, efficacy, and consistency of melanoma diagnosis is apparent. Diagnostic error is a prevalent, harmful, and costly phenomenon with emotional and financial consequences [27]. This method can help reduce those consequences.

CONCLUSION
This research set out to develop two machine learning models that can detect melanoma while it is still in its early stage, compare the two models, determine which one is better, and decide whether it can diagnose melanoma skin cancer. The results indicate that machines can diagnose melanoma better than human eyes, beating two biopsy methods and dermatologists with 1-5 years of experience. It is also concluded that a deep learning CNN is a better approach to detecting melanoma than linear regression, with higher accuracy and a lower false-negative rate. It is essential to highlight that while CNN achieved better metrics, it can by no means be used as the one and only diagnostic tool. An accuracy of 70%, while seemingly high, is too low to be reliable: it still means that for every 100 individuals using the model, 30 will receive the wrong diagnosis. Of course, the perfect model is one that can classify every individual, healthy or not, with 100% accuracy. However, there is currently no diagnostic tool that can detect melanoma with that kind of accuracy. Rather, this model can be used for a quick test before a full biopsy is done. The model can be integrated into a smartphone application that uses the smartphone camera to capture skin imagery and gives predictions in real time. Such a system could then be deployed anywhere, from an online doctor consultation app to staff in a hospital. Thus, this model can potentially be used as a secondary diagnostic tool, giving doctors and physicians the ability to detect melanoma with reasonably high accuracy and reducing the need for unnecessary biopsies.