Classifying lymphoma and tuberculosis case reports using machine learning algorithms

ABSTRACT

has proven to improve diagnosis prediction, as reported by [12]. NLP, on the other hand, focuses on extracting information from unstructured texts and converting it into a format that computers can process [13]. It has been successfully implemented in decision support systems for areas such as risk stratification, symptom identification and medical diagnosis [14], [15].
Although generic text classifiers exist, they are generally not tuned for scientific data analysis [16]. This is why specific NLP systems have been used to classify patients with TB since the early 1990s [17], [18], and to differentiate between TB and other pulmonary diseases [19]. To the best of our knowledge, however, there is no system that differentiates specifically between TB and lymphoma. Hence, the overall purpose of our study is to create an NLP system to classify lymphoma and TB diagnoses. The system could be used for screening purposes and help reduce the misdiagnosis rate between the two diseases.
In our previous paper [20], we classified the two diseases using case reports collected from ScienceDirect. The features in each report were extracted using TF-IDF as well as Amazon Comprehend Medical, which is an NLP API for medical feature extraction. The current paper aims to: 1) analyse the collected case reports using NLP and clustering, 2) explore their different characteristics, and 3) identify documents that are not case reports of either disease using machine learning algorithms.
This will help us collect additional relevant case reports from various sources, and design a more robust training dataset to be used in differentiating TB and lymphoma. All algorithms in this study are implemented using the "sklearn" Python module [21] with its default parameters and no parameter tuning. Another limitation of this study is that it disregards semantic value when extracting terms from the collected text. The rest of this paper is organised as follows: section 2 discusses the methods used in this study, while section 3 presents the results of the various experiments performed. Finally, we discuss the results obtained in section 4 and conclude the paper in section 5.
Figure 1 gives a summary of the methodology applied. To create our dataset, we automatically scraped tuberculosis and lymphoma case reports from ScienceDirect through its search API using the following search terms: "tuberculosis case report" and "lymphoma case report". The case reports were restricted based on title, as described in [20]. For each search result returned, we retrieved the full article using ScienceDirect's full-text retrieval API, then extracted the second section as the case report. This was achieved using a Python library called Beautiful Soup. A summary of the data collection process is shown in Figure 2.

Data pre-processing
The first part of preparing the data for our machine learning algorithms was done using the "natural language toolkit" (NLTK), a Python module for NLP. This process consisted of the following steps:
- Contraction expansion: using the 'contractions' Python package, known shortened combinations of words were expanded back to their original form.
- Tokenization: each document was split into a series of words. Punctuation, numbers and special characters were then removed and letters converted to lower case.
- Stopword removal: recurrent English words which convey little to no information, such as articles and pronouns, were removed from the text. NLTK's stopwords list was extended to also include terms such as 'patient', 'disease', 'using', 'figure', 'fig', 'clinic', 'hospital', 'et', 'al'. These terms appeared in multiple texts without bringing information necessary to our classification task.
- Lemmatization: using NLTK's WordNetLemmatizer algorithm, words were reduced to their root form in an effort to group together similar words (e.g., plural words were converted to their singular form).
The resulting data were then converted from free text into a vector space using term frequency-inverse document frequency (TF-IDF). Extra pre-processing consisted of extracting the age and gender of each patient. The detailed feature extraction process is reported in [20].
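The tokenization, stopword removal and TF-IDF steps above can be illustrated with a minimal, standard-library sketch. It is not the pipeline used in the study: contraction expansion and lemmatization are omitted, the stopword set is a tiny stand-in for the extended NLTK list, the example sentences are invented, and the smoothed IDF variant ln((1 + n) / (1 + df)) + 1 with L2 normalisation is an assumption about the weighting scheme.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption); the study extends NLTK's list.
STOPWORDS = {"the", "a", "an", "of", "with", "was", "and", "to", "in",
             "patient", "disease", "using", "figure", "fig", "clinic",
             "hospital", "et", "al"}


def tokenize(text):
    """Lower-case the text, keep alphabetic tokens only, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf(docs):
    """Return (vocabulary, vectors): one L2-normalised TF-IDF dict per document."""
    tokenised = [tokenize(d) for d in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency (assumed variant, see lead-in).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return sorted(df), vectors


# Invented example sentences, for illustration only.
docs = [
    "The patient presented with cough and fever.",
    "Biopsy of the lymph node showed Hodgkin lymphoma.",
    "Fever and enlarged lymph node in a patient with cough.",
]
vocabulary, vectors = tfidf(docs)
```

In this sketch, a term that occurs in a single document (such as "presented") receives a larger weight than a term of equal frequency shared by several documents (such as "cough"), which is the behaviour that makes TF-IDF useful for separating the two groups of reports.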

Data exploration
Using the "scikit-learn" library in Python, k-means++ clustering was applied to the vectorised dataset in order to group together similar case reports. The algorithm is described as follows [22]:
a. Choose k initial centroids. For k iterations:
- For each data point, calculate the Euclidean distance to the closest centroid.
- Choose a centroid using a distribution specified by the squared Euclidean distances.
The optimal number of clusters (k) was decided based on silhouette scores, which measure how cohesive and distinguishable clusters are. The formula used is given in (1):
$$ s = \frac{b - a}{\max(a, b)} \tag{1} $$
where a is the average distance between each data point and the other data points in the same cluster and b is the average distance between each data point and the data points in the closest cluster. For each data point i, a and b are calculated as follows:
$$ a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\, j \neq i} d(i, j), \qquad b(i) = \min_{C_k \neq C_i} \frac{1}{|C_k|} \sum_{j \in C_k} d(i, j) \tag{2} $$
where C_i is the cluster containing data point i and d(i, j) is the distance between data points i and j.
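The seeding step and the silhouette score described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the scikit-learn implementation used in the study: the function names `kmeanspp_init` and `silhouette_score` are chosen here for illustration, the first centroid is drawn uniformly at random, and the toy points are invented.

```python
import math
import random


def kmeanspp_init(points, k, rng):
    """Pick k initial centroids with k-means++ style seeding."""
    centroids = [rng.choice(points)]  # first centroid: uniform at random
    while len(centroids) < k:
        # Squared Euclidean distance from each point to its closest centroid.
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        # Draw the next centroid with probability proportional to d2.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids


def silhouette_score(points, labels):
    """Mean of s = (b - a) / max(a, b) over all points (needs >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the closest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Invented toy data: two tight, well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
seeds = kmeanspp_init(points, k=2, rng=random.Random(0))
good = silhouette_score(points, [0, 0, 1, 1])  # close to 1
bad = silhouette_score(points, [0, 1, 0, 1])   # negative
```

A labelling that matches the two groups yields a score near 1, while a labelling that mixes them yields a negative score, which is why the number of clusters with the highest average silhouette is preferred.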

Text classification
We implemented the following algorithms as a benchmark: logistic regression, k-Nearest Neighbours (KNN), artificial neural network (ANN), Naive Bayes, support vector machines (SVM) and perceptron. Brief descriptions of the algorithms follow:
a. Decision trees: this method returns a tree-like structure, where internal nodes perform tests based on attribute values and each branch represents the outcome of a test. The tree ends in leaf nodes, which are associated with the most probable decision [23]. Instances are classified by traversing the tree and applying rules at each internal node until a decision node is reached [24].
b. Artificial neural network: an ANN consists of layers of artificial neurons which are connected with each other. Input data traverse the layers, which process them and output a result [25], [26]. Each neuron receives input data from neurons in the previous layer, and each neuron-to-neuron connection has a weight representing its strength [25]. We used a multilayer perceptron (MLP) with one hidden layer of 100 hidden units. This ANN determines the input weights of each linear model as follows [27]:
- Initialize w = 0
- Go through the data points {x_i, y_i}
- If a data point x_i is misclassified, then w ← w - α sign(f(x_i)) x_i (equivalently, w ← w + α y_i x_i for labels y_i in {-1, +1})
- Repeat until all the data are correctly classified
c. Naive Bayes: Naive Bayes is a simple, statistics-based method, which predicts a class (Y) for a new example (X) based on the largest a posteriori probability, previous experience and event probability [28]. The probability of X belonging to a class c is given by the following formula:
$$ P(c \mid X) = \frac{P(X \mid c)\, P(c)}{P(X)} $$
where:
- P(c): probability of class c
- P(X): probability of the predictors X
- P(X|c): probability of having X features given class c
- P(c|X): probability of an instance X belonging to class c given the value of its dependent variables [29]
d. Support vector machines: using a dataset of n features, a Support Vector Machine (SVM) attempts to find a decision boundary which maximises the margin between two observed classes [30]. This makes it a robust choice for binary classification. In the simplest case, an SVM must come up with a linear classifier of the form [31]:
$$ f(x) = w^{\top} x + b $$
One method of determining the input weights is the perceptron algorithm described above.
e. k-Nearest Neighbours: this method classifies a new instance by finding the k most similar instances in an existing dataset. The similarity is determined using metrics such as Euclidean distance or Mahalanobis distance [32]. With two feature vectors A=(x1,x2,...,xm) and B=(y1,y2,...,ym), representing two data points with m features, the Euclidean distance is calculated as:
$$ d(A, B) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2} $$
We evaluated the performance of each algorithm using classification accuracy, precision and recall. Accuracy is the ratio of correctly classified instances to all instances. Precision is the ratio of true positives among all instances classified as positive. Finally, recall is the ratio of positive instances that were correctly classified.
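As an illustration of the perceptron learning rule from item b and of the three evaluation metrics, the following standard-library sketch trains a linear classifier on invented, linearly separable toy data and scores its predictions. It rests on several assumptions: labels are encoded as -1 and +1, the update is written in the label form w ← w + α y_i x_i, the bias is handled by appending a constant feature, and the function names are hypothetical. It is not the scikit-learn code used in the study.

```python
def train_perceptron(X, y, alpha=1.0, max_epochs=100):
    """Learn weights w (last entry is the bias) with the perceptron rule."""
    data = [list(x) + [1.0] for x in X]  # append a constant bias feature
    w = [0.0] * len(data[0])
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(data, y):
            f = sum(wi * xi for wi, xi in zip(w, x))
            if target * f <= 0:  # misclassified, or exactly on the boundary
                # Update: w <- w + alpha * y_i * x_i
                w = [wi + alpha * target * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:  # every training point is correctly classified
            break
    return w


def predict(w, X):
    """Return +1 or -1 for each row of X using the learned weights."""
    outputs = []
    for x in X:
        f = sum(wi * xi for wi, xi in zip(w, list(x) + [1.0]))
        outputs.append(1 if f > 0 else -1)
    return outputs


def evaluate(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Invented, linearly separable toy data.
X_train = [(2, 1), (3, 2), (-1, -1), (-2, -3)]
y_train = [1, 1, -1, -1]
weights = train_perceptron(X_train, y_train)
accuracy, precision, recall = evaluate(y_train, predict(weights, X_train))
```

On separable data the loop stops as soon as a full pass produces no update; on non-separable data it simply stops after `max_epochs`, which is one reason tuned or regularised classifiers are usually preferred in practice.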
For each evaluation metric above, the performance of each algorithm was estimated using cross-validation. The dataset was randomly split into 5 subsets, then each algorithm was run 5 times, each time with 4 subsets used for training and the remaining one used for testing.
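The 5-fold procedure can be sketched as follows, here combined with a small k-Nearest Neighbours classifier based on Euclidean distance. This is a standard-library illustration under assumptions, not the study's code: the fold assignment, the fixed random seed, k = 3, the function names and the toy data are all chosen for the example.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Predict the majority label among the k nearest training points."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cross_validate(X, y, n_folds=5, k=3, seed=0):
    """Return one accuracy score per fold of an n_folds cross-validation."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # random split into folds
    folds = [indices[i::n_folds] for i in range(n_folds)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        # Train on the other folds, test on the held-out fold.
        train_idx = [i for i in indices if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(knn_predict(train_X, train_y, X[i], k) == y[i]
                      for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores


# Invented toy data: two well-separated groups of 10 points each.
X = [(0.1 * i, 0.2 * i) for i in range(10)] + \
    [(10 + 0.1 * i, 10 + 0.2 * i) for i in range(10)]
y = [0] * 10 + [1] * 10
fold_scores = cross_validate(X, y)
mean_accuracy = sum(fold_scores) / len(fold_scores)
```

Averaging the per-fold scores gives the kind of cross-validation estimate reported for each algorithm; every data point is used for testing exactly once and never appears in the training set of its own fold.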

RESULT AND DISCUSSION
The search terms submitted to the ScienceDirect API returned 6080 and 4034 articles for tuberculosis and lymphoma, respectively. After automatic title review, 546 TB and 765 lymphoma case reports were kept for our study. Figure 3 gives a quick preview of some features obtained using TF-IDF. Looking at Figure 4, we see that the highest average silhouette score occurs when n=3. Considering three (3) clusters is therefore optimal in this case, since it minimises similarities between different clusters while maximising similarities within each cluster. This means that case reports are less likely to be assigned to the wrong cluster.
Figure 5 shows a word cloud for each cluster, which helps visualise the most important words per cluster. The most frequent words in Cluster 1, such as "hodgkin" and "cell", suggest that this cluster mainly contains lymphoma case reports. Examples of lymphoma cases that were assigned to this cluster include those reported by [33]-[35]. TB cases were mostly allocated to Cluster 2. These cases include those reported by [36]-[38]. However, this cluster also contained articles discussing tuberculosis that were not case reports but were not excluded during title review [39], [40]. The documents in Cluster 0 were neither tuberculosis nor lymphoma case reports. After analysis, it was found that this cluster consisted of many cases of diseases wrongly diagnosed as TB, as reported by [41], [42]. The cluster also contained cases where the patient had another disease on top of tuberculosis or lymphoma.

Cluster analysis
Analysing the age of patients in the different clusters revealed that lymphoma patients were on average older than TB patients, with respective mean ages of about 53 and 40 years, as shown in Figure 6. This is consistent with previous findings indicating that lymphoma cases tend to occur in older patients [43], [44]. We also noticed that lymphoma cases had a higher proportion of reported male patients.
After pre-processing the text and vectorising, we obtained 7088 features to be fed into the machine learning algorithms. Table 1 shows the average cross-validation performance of each algorithm in terms of accuracy, recall and precision. Performance evaluation of the various algorithms showed that the Multi-Layer Perceptron algorithm best identified the correct class of case reports (with 93.1% accuracy). This method also achieved the highest recall score (94.1%) and the highest positive predictive value, with a precision score of 95.4%. It therefore minimised the possibility of misclassifying a case report and maximised the number of documents from a given class that were identified correctly.

These results show that machine learning algorithms can differentiate between TB and lymphoma case reports with high accuracy. Given that the reported results are cross-validation scores, it is likely that the trained model will perform well on unseen case reports. If implemented to classify case reports, it can help feed the right data into a diagnosis or referral support system. Such a system can be used to screen patients and detect lymphoma cases earlier, potentially improving the patients' prognosis. This could be extremely useful in diagnosing cancer in people with HIV-related lymphoma, who tend to show non-specific symptoms [45]. It is important to note that SVM performed very poorly, most likely because the algorithm's default sklearn parameters were used. Future research will therefore look into tuning the algorithms and selecting their optimal parameters.

CONCLUSION
Since tuberculosis symptoms are shared by many other diseases, there is a high probability of misdiagnosis, especially in areas with restricted resources. Although various machine learning systems for diagnosis exist, this study focuses on collecting and exploring data for a system dedicated to differentiating between tuberculosis and lymphoma.
As a starting point, the study used web scraping to collect available TB and lymphoma case reports, then used unsupervised methods to explore them. Case reports were assigned to one of three clusters: lymphoma, TB and "others". The results obtained after applying various classification algorithms to the dataset showed that the MLP model outperformed the other algorithms in accuracy, recall and precision, making it the most likely to classify a case report correctly. This provides us with a tool for collecting additional case reports from different sources while ensuring the quality of the collected data.
Future research will aim to improve the MLP and decision tree models by tuning their hyperparameters. We will also compare the performance obtained when using stemming instead of lemmatization during pre-processing, since words like "abdomen" and "abdominal" are still seen as different concepts using the latter method. We will further collect case reports for the extraction of semantic features, such as patient symptoms. The resulting feature space will then be used to train a TB/lymphoma screening support system.