Giving more insight for automatic risk prediction during pregnancy with interpretable machine learning

ABSTRACT


INTRODUCTION
According to [1], the global maternal mortality rate, or maternal mortality ratio (MMR), decreased by approximately 44% over the 25 years from 1990 to 2015. This reduction was part of the Millennium Development Goals (MDGs) program, which was initiated in 2000 with a reduction target based on maternal deaths per 100,000 births. To sustain the program, the Sustainable Development Goals (SDGs) were formulated with the aim of reducing the global MMR to fewer than 70 per 100,000 live births. The decline in MMR needs to continue because approximately 830 women worldwide still die every day from causes related to pregnancy and childbirth [2]. In addition, 99% of these deaths occurred in developing countries. Environmental factors play a crucial role: the risk increases for women living in rural areas and in poor communities.
Indonesia, as a developing country, has made preventive efforts to reduce the MMR, such as conducting the 2015 intercensal population survey, which revealed that the national MMR had declined to 305 per 100,000 births [3]. One contributor to maternal mortality is pregnant women's limited insight into the risks of pregnancy. As the first level of health service, the public health center (puskesmas) carries out preventive and promotive actions to sustain public health. To monitor pregnancy status, puskesmas currently use the Poedji Rochjati screening card (KSPR) to detect risks and disorders in pregnant women [4]. In addition, a pregnancy control card serves as a risk monitoring tool; it is first completed at the initial pregnancy examination and then updated at subsequent examinations.
In practice, the two cards differ in their number of features: the KSPR contains only 20 attributes, whereas the pregnancy control card provides 117 features. The features on the pregnancy control card fall into four categories: (1) history of pregnancy, childbirth, and family planning; (2) current pregnancy history; (3) general, physical, and obstetric examination; and (4) laboratory examination. This difference raises questions about the role of each attribute on both the KSPR and the pregnancy control card. A further question is whether the 20 KSPR features, together with the attributes shared by both cards, are representative of all the factors that determine pregnancy risk.
In the specific case of pregnancy health, machine learning algorithms have shown promising performance for detecting high-risk pregnancy [5], [6] and for pregnancy health services in general [7]. Machine learning is also increasingly popular in healthcare, where it has been applied to various tasks with various data sources [8]-[10]. However, the resulting models are limited in that they are difficult for experts to interpret. Interpretability plays a key role in providing insight into why the model made a certain prediction for a patient's condition. It also serves as a means of transparency for an intelligent system that predicts pregnancy risk. With this transparency, experts such as doctors can validate the output of the system and guard against results distorted by biased datasets.
Interpretability in the context of artificial intelligence (AI) is defined as the degree to which humans (experts) can understand the cause of a decision [11]. Similarly, interpretability can be defined as the degree to which humans can consistently predict the results of a machine learning model [12]. Interpretability methods are increasingly combined with powerful methods such as deep learning and ensemble models, which provide high accuracy but less interpretability [13].
However, to date no study has built a system for predicting pregnancy risk that uses visualization techniques to give the user more insight. Such a facility is worth including in AI-based medical applications because it supports further analysis and helps verify prediction accuracy. The need for interpretability also arises because agreement is required on the most significant features of the two monitoring cards. Thus, this study aims to build a machine learning-based system for predicting pregnancy risk, guided by three research questions: (RQ. 1) Which features are most influential in determining the risk of pregnancy? (RQ. 2) Do the KSPR and the pregnancy control card share intersecting features? (RQ. 3) How can an interpretable machine learning model for predicting pregnancy risk be built? We propose to use two interpretability methods: local interpretable model-agnostic explanations (LIME) and SHapley Additive exPlanations (SHAP). LIME trains a local interpretable model (local surrogate model) to explain an individual prediction rather than training a global surrogate model [14]. SHAP assigns each feature a value for a given prediction [15]. This study implements both methods as a medium for experts to verify machine learning models in pregnancy risk cases.

PREGNANCY RISK PREDICTION
To answer the three research questions, the following stages are implemented: (1) pregnancy risk monitoring to identify the features on the KSPR and the pregnancy control card; (2) data acquisition to obtain patient data from both monitoring cards; (3) data preprocessing to turn raw patient data into a form ready for the next stage; (4) feature selection and classification to determine the most significant features of pregnancy risk; and (5) implementation of interpretable machine learning techniques to provide more insight into the classification results. Each stage is described in more detail in the sections below.

Poedji Rochjati screening card (KSPR)
Generally, every pregnant woman who has her pregnancy checked at a puskesmas receives a copy of the mother and child health (MCH) book along with the KSPR. The card is designed to help health workers screen pregnant women for potential risks. The screening results are used to classify mothers into three categories: the low risk pregnancy (LRP) group, the high risk pregnancy (HRP) group, and the very high risk pregnancy (VHRP) group. Medical personnel can then readily take proper action to minimize the potential risks that might arise. The information needed to complete the KSPR is obtained when a pregnant woman visits the health center and has her condition checked. Risks in the KSPR are expressed as scores: LRP with a score of 2, HRP with a score of 6-10, and VHRP with a score of ≥12. The attributes on the KSPR include: too many children (4 or more); too young (first pregnancy at ≤16 years old); too old (first pregnancy at ≥35 years old); too old (age ≥35 years); previous failed pregnancy; stillbirth (baby died in utero); previous cesarean section; pregnant again too soon (<2 years); pregnant again after too long (≥10 years); swelling of the face/legs with high blood pressure; bleeding during pregnancy; breech position; oblique position; and low blood supply. Table 1 shows the list of KSPR attributes; some of these attributes are accompanied by a list of indicators.
[Table 1: KSPR attributes. Recoverable entries: post-term pregnancy; 17 breech position; 18 oblique position; 19 pregnancy bleeding; 20 chronic pre-eclampsia/seizure]
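The score ranges quoted above can be expressed as a small lookup. The following is a minimal sketch, not the screening procedure itself: the function name `kspr_risk_group` is hypothetical, and because the text does not say how scores outside the three stated ranges are handled, the sketch simply rejects them.

```python
def kspr_risk_group(score: int) -> str:
    """Map a KSPR total score to a risk group using the ranges stated in the text.

    Assumption: scores outside the stated ranges (for example 4 or 11) are
    treated as invalid, since the text does not describe them.
    """
    if score == 2:
        return "LRP"   # low risk pregnancy
    if 6 <= score <= 10:
        return "HRP"   # high risk pregnancy
    if score >= 12:
        return "VHRP"  # very high risk pregnancy
    raise ValueError(f"score {score} is outside the ranges given for the KSPR")


for example_score in (2, 6, 10, 14):
    print(example_score, kspr_risk_group(example_score))
```

Rejecting undefined scores, rather than guessing a group, keeps the sketch faithful to the three ranges given in the text.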

Pregnancy control card
In this study, primary data from 400 pregnancy control cards were used. The data were obtained from the research partner, the Cipto Mulyo Malang Public Health Center, and cover 2016 to June 2017. The cards were physical files; the researchers therefore transcribed them manually into comma-separated values (CSV) format. The pregnancy control card contains 117 features. It records four examinations: history of pregnancy, childbirth, and birth control; current pregnancy history; general, physical, and obstetric examinations; and laboratory examinations. The examinations are explained below.
-Pregnancy history, delivery and family planning/birth control
A history of pregnancy, childbirth, and birth control is necessary especially for women who have been pregnant more than once. If a complication occurred in a previous pregnancy, for example an abortion or miscarriage, it may recur in the current pregnancy. The examination indicators are shown in Table 2.
-Current pregnancy history
This examination aims to find out whether there are danger signs for the mother's ongoing pregnancy. For example, a pregnant woman who suffers from hypertension in the current pregnancy is at risk of pre-eclampsia. Details of the pregnancy history are shown in Table 3.
Table 4 shows the types of examinations that may be carried out to determine general conditions.

Data preprocessing
In many cases, a dataset, both its training and test portions, requires further processing to bring it into a format that fits the classification requirements. This process is called data preprocessing. The preprocessing stages carried out in this study are as follows:
-Missing value replacement
The dataset used in this study still contains missing values, so a substitution technique based on central tendency is required. Missing entries are filled in with the mean value or with the mode value [16].
-Data transformation
Data transformation converts a dataset from one form or structure into another [16]. The data in this study have two attribute types, nominal and numeric. To obtain a suitable format for processing, the nominal attributes are transformed into numeric ones.
-Data normalization
When attribute values lie far apart, normalization is needed to rescale the numeric attributes. One way to normalize is the min-max formula [16].
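The three preprocessing steps can be illustrated on toy columns. This is a minimal sketch under stated assumptions: it assumes the mean is used for numeric attributes and the mode for nominal ones (the text names both values without assigning them), it encodes nominal labels as integer codes, and the column names and values are hypothetical rather than taken from the study's dataset.

```python
from collections import Counter


def impute(column, numeric):
    """Fill None entries with the mean (numeric) or the mode (nominal)."""
    present = [v for v in column if v is not None]
    if numeric:
        fill = sum(present) / len(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in column]


def encode_nominal(column):
    """Transform nominal labels into integer codes (assumed encoding)."""
    codes = {label: i for i, label in enumerate(sorted(set(column)))}
    return [codes[v] for v in column], codes


def min_max(column):
    """Rescale numeric values to [0, 1]: (v - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical toy columns with missing entries marked as None.
age = [22, None, 35, 41]
smoking = ["no", "yes", None, "no"]

age_filled = impute(age, numeric=True)
smoking_filled = impute(smoking, numeric=False)
smoking_codes, mapping = encode_nominal(smoking_filled)
age_scaled = min_max(age_filled)

print(age_filled, smoking_codes, mapping, age_scaled)
```

The order matters in this sketch: imputation runs first so that encoding and scaling never see missing entries.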

Feature selection: C5.0 algorithm and correlation-based feature selection
A total of 120 features were used in this study: the 117 features of the pregnancy control card plus 3 KSPR attributes that do not appear on the pregnancy control card (not intersecting). From these, features are selected to determine which are most influential in indicating the health risk of a pregnancy. Feature selection is conducted with two methods: correlation-based feature selection (CFS) and the C5.0 algorithm, which is based on information gain. CFS was selected because it is considered one of the most stable feature selection methods. It weighs the usefulness of individual features for predicting the class label against the level of intercorrelation among the features. In addition, several studies in medical diagnostics that used CFS reported satisfactory results [17], [18]. The C5.0 algorithm was selected because its feature selection is based on information gain, a fairly simple measure that is widely used in classification [19]-[21]. Information gain identifies the features that carry the most information about the class and can thereby help reduce noise from less relevant features. Compared with its predecessor, C4.5, the C5.0 method is claimed to produce better accuracy with less memory usage [22].
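To make the notion of information gain concrete, the sketch below computes it for a discrete feature as the class entropy minus the weighted entropy after splitting on the feature. It illustrates the measure only and is not the C5.0 algorithm; the feature names and toy values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy within each feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: one feature separates the classes, the other does not.
risk = ["LRP", "LRP", "VHRP", "VHRP"]
cesar = [0, 0, 1, 1]      # perfectly aligned with the class
smoking = [0, 1, 0, 1]    # unrelated to the class

print(information_gain(cesar, risk), information_gain(smoking, risk))
```

A feature whose values split the instances into pure class groups receives the maximum gain, while a feature that leaves the class mix unchanged receives none, which is why the measure can be used to rank features.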

Interpretable model for pregnancy risk classification
The increased use of predictive models such as linear models, rule-based models, classifiers, and many others drives the demand for accountability, transparency, and interpretability in machine learning. In healthcare, the need for model interpretability, fidelity, and performance is considered higher than in other domains [23]-[25]. This requirement stems from the large risk posed by a misclassification by the machine learning model. Model transparency allows users to understand, audit, and even correct decisions made by the model in a healthcare decision support system.
Every system developed on the basis of machine learning is expected to deliver a high-performance model. In practice, there is a trade-off between model interpretability and model performance (precision, recall, F-score): the higher the performance of an algorithm, the less interpretable it tends to be, as depicted in Figure 1. More interpretable models such as decision trees and regression models tend to have lower predictive performance than less interpretable models such as boosting and deep learning models [25].

Local interpretable model-agnostic explanations
As a local surrogate model, LIME [14] is designed to explain why a machine learning model makes a certain prediction. It does so by observing how the model's predictions change when the input data are varied. LIME builds a new dataset consisting of permuted samples and the corresponding predictions of the model. On this dataset, LIME then trains an interpretable model (e.g., a decision tree or lasso). The resulting model is a good approximation of the machine learning model's predictions locally, but not globally. LIME as a local surrogate model can be formulated as shown in (1) [27]:

$$\text{explanation}(x) = \arg\min_{g \in G} L(f, g, \pi_x) + \Omega(g) \qquad (1)$$

The explanation model for instance $x$ in (1) is the model $g$ that minimizes the loss $L$, which measures how close the explanation is to the prediction of the original machine learning model $f$, while keeping the model complexity $\Omega(g)$ at a fairly low value. Further, $G$ is the family of potential explanations that might be generated, and $\pi_x$ defines how large the neighborhood around instance $x$ is.

Figure 2 illustrates how LIME works. The complex machine learning model is represented by the pink and blue areas, which cannot be separated linearly. LIME samples instances, obtains the model's predictions for those samples, and weights each sample by its distance from the instance being explained (the closer the sample, the more important it is). These samples are then used to train a simple classifier that is locally faithful (dotted line) [14].
Figure 2. LIME interpretable method [14]
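The sampling, weighting, and local fitting steps described above can be sketched in a few lines. This is a simplified illustration and not the `lime` package: it assumes Gaussian perturbations, an exponential proximity kernel, and a weighted linear surrogate fitted through the normal equations. The names `explain_locally` and `black_box`, the kernel width, and the toy model are all hypothetical.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def explain_locally(predict, instance, n_samples=500, kernel_width=0.75, seed=0):
    """Fit a proximity-weighted linear surrogate around `instance`.

    Returns (intercept, coefficients); each coefficient is the local effect
    of one feature on the black-box prediction.
    """
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, 1.0) for xi in instance]       # perturbed sample
        dist_sq = sum((zi - xi) ** 2 for zi, xi in zip(z, instance))
        weights.append(math.exp(-dist_sq / kernel_width ** 2))  # closer = heavier
        rows.append([1.0] + z)                                   # leading 1 = intercept
        targets.append(predict(z))                               # black-box output
    p = len(instance) + 1
    # Weighted normal equations: (X^T W X) beta = X^T W y
    a = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(p)]
         for i in range(p)]
    b = [sum(w * r[i] * y for r, w, y in zip(rows, weights, targets)) for i in range(p)]
    beta = solve_linear_system(a, b)
    return beta[0], beta[1:]


def black_box(z):
    """Hypothetical nonlinear model standing in for a trained classifier."""
    return 1.0 / (1.0 + math.exp(-(3.0 * z[0] - z[1] ** 2)))


intercept, coefficients = explain_locally(black_box, [0.5, 1.0])
print(intercept, coefficients)
```

The surrogate is only meaningful near the explained instance; a different instance, or a different kernel width, generally yields different coefficients.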

SHapley additive explanation
In interpretable machine learning, the model is expected to explain a classification or prediction result. In pregnancy risk prediction, a mother may want to know why she is predicted to have a high-risk pregnancy despite an apparently uncompromised condition. The system is therefore expected to explain which features contribute most to increasing the risk of pregnancy, as well as which features reduce it.
SHAP is a suitable method for this purpose. SHAP decomposes the output of the trained machine learning model to determine the effect of each feature on the prediction or classification result. It works by comparing the predicted result with and without a given feature. The SHAP value of feature $j$ can be defined through the function in [27]:

$$\phi_j(val) = \sum_{S \subseteq \{1,\ldots,p\} \setminus \{j\}} \frac{|S|!\,(p-|S|-1)!}{p!} \left( val_x(S \cup \{j\}) - val_x(S) \right)$$

in which $S$ is a subset of the feature set in the model, $x$ is the vector of feature values of the instance being interpreted, and $p$ is the number of features; $val_x(S)$ denotes the prediction for the feature values in $S$, marginalized over the features not in $S$. One of the strengths of SHAP is its simplicity. SHAP runs on the trained model without affecting the model itself. SHAP also provides a convenient interface for presenting the effect of a feature on the prediction results. In Figure 3, the blue arrows mark features with a positive influence on the prediction result, while the red arrows mark features with a negative influence. The length of an arrow indicates the weight of the feature: a longer arrow means that the feature has a greater influence on the change in the prediction result.
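The comparison of predictions with and without a feature can be sketched as an exact Shapley computation over all feature subsets. This is an illustration under a simplifying assumption, not the `shap` library: an absent feature is replaced by a fixed baseline value instead of being marginalized over the data. The function names and the toy model are hypothetical, and the enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x`, with absent features set to `baseline`."""
    p = len(x)

    def value(subset):
        # Features in `subset` take their real value; the rest take the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(p)]
        return predict(z)

    phi = []
    for j in range(p):
        others = [i for i in range(p) if i != j]
        total = 0.0
        for size in range(p):
            # Weight |S|! (p - |S| - 1)! / p! for subsets S of this size
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {j}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    """Hypothetical risk score with one interaction term."""
    return 0.2 + 0.5 * z[0] + 0.1 * z[1] + 0.3 * z[0] * z[2]


instance = [1.0, 1.0, 1.0]
reference = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, reference)
print(phi, sum(phi), toy_model(instance) - toy_model(reference))
```

The printed sum of the values matches the difference between the prediction for the instance and the prediction for the baseline, which is the additive property that lets SHAP plots be read as a decomposition of a single prediction.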

RESULTS AND DISCUSSION
This section presents the results for the research questions, starting with the identification of the most influential attributes over several scenario stages. The data used in this test are the preprocessed data. The results of the two feature selection methods are then compared. CFS yields 14 attributes considered the most important features, and the C5.0 algorithm yields 20 attributes. The two lists share 11 attributes and contain 12 attributes that appear in only one of the lists. So that no attribute is lost, the resulting twenty-three (23) attributes are all used in the classification process. The intersecting attributes are listed in Table 5.
The next step is to predict pregnancy risk with classification-based machine learning algorithms. The prediction models are developed using the 23 attributes and 400 instances. However, the data are unevenly distributed across the risk levels: LRP (149 instances), HRP (183 instances), and VHRP (68 instances), as presented in Figure 4(a). An imbalanced data distribution lowers the quality of the resulting prediction models. Thus, in this study the synthetic minority over-sampling technique (SMOTE) is applied to balance the data distribution. The result of the balancing process is depicted in Figure 4(b), where every pregnancy risk class has 127 instances. To build the prediction model, this study used four algorithms: XGBoost, Random Forest, k-Nearest Neighbor (kNN), and Naïve Bayes.
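The core idea of SMOTE, generating synthetic minority instances by interpolating between a minority instance and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration and not the implementation used in the study: base instances are drawn at random, and the function name, the value of `k`, and the toy points are hypothetical.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create `n_new` synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (squared Euclidean distance)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class points with two normalized features.
minority_points = [[0.10, 0.20], [0.20, 0.10], [0.30, 0.30], [0.15, 0.25]]
new_points = smote(minority_points, n_new=6)
print(new_points)
```

Because every synthetic point lies on a segment between two existing minority points, the oversampled class stays inside the region already occupied by real minority instances.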

SHAP visualization result
Figure 6 presents the SHAP-based interpretation of the models for all classifiers. To interpret it, one first needs to know how SHAP visualizes a machine learning model. The y-axis lists the features (variables) in order of importance. The x-axis gives the SHAP value of each feature on the y-axis, running from the lowest value on the left to the highest on the right. The x-axis value indicates whether the feature value pushes the prediction higher or lower. Figure 6 shows the multiclass SHAP interpretation for the four classifiers. Each plot gives the distribution and the mean SHAP values for the three classes: LRP, HRP, and VHRP.
Figure 6 is interpreted by observing the SHAP value for each class. For example, with the XGBoost algorithm, the average SHAP value generated for the VHRP class is around 1.25, for LRP 0.55 (1.80-1.25), and for HRP 0.1 (1.90-1.80). Thus, in general, the Cesar feature is highly dominant in influencing model prediction for all classifiers except kNN; however, the classifiers differ in how the Cesar feature influences the model. In XGBoost and Random Forest, the Cesar feature influences the prediction of VHRP more strongly than that of the other two classes, HRP and LRP.
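The bracketed subtractions above suggest that the per-class values were read off a stacked bar, where each class segment begins where the previous one ends. Assuming that reading, the arithmetic can be reproduced as below; the function name is hypothetical and the edge positions are the approximate values quoted in the text.

```python
def segment_widths(right_edges):
    """Convert cumulative right edges of stacked segments into per-class widths."""
    widths = {}
    previous = 0.0
    for label, edge in right_edges.items():
        widths[label] = round(edge - previous, 2)
        previous = edge
    return widths


# Approximate right edges of the XGBoost segments, in stacking order, as quoted in the text.
xgboost_edges = {"VHRP": 1.25, "LRP": 1.80, "HRP": 1.90}
xgboost_widths = segment_widths(xgboost_edges)
print(xgboost_widths)
```

The widest segment identifies the class whose prediction the feature influences most, which for these values is VHRP.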
Furthermore, Figure 6 indicates that the kNN algorithm neglects more than half of the features (11 features) when predicting LRP, HRP, and VHRP. For the other algorithms, one could conclude that the resulting model ignores such features. However, because the kNN algorithm does not build a model, the interpretation here is that the values of these 11 features do not differ enough between instances to affect the proximity to the training data. Small average SHAP values also occur in the other algorithms, although they are not as small as those in the kNN algorithm.

LIME visualization result
The LIME-based visualization consists of three main parts. The first part (far left of the plot) shows the prediction probabilities, that is, the probability distribution over the target classes LRP, HRP, and VHRP. The second part (to the right of the prediction probabilities) lists the important features that contribute most to the resulting model. The third part (at the bottom of the plot) shows the actual values of the most important features. In Figure 7, LIME displays only the features that most influence the model, unlike the SHAP-based interpretation, which displays all features with a SHAP value for every single feature.
The LIME visualizations for all classifiers in Figure 7 show several notable phenomena. In the multiclass case, LIME also presents the negation of a class. For example, in the XGBoost model, the VHRP class is supported by one collection of features and its negation (non-VHRP) by another. In this example, the Cesar feature has a value > 0.42, which means that it satisfies the criterion supporting the VHRP class. Across the four LIME plots, the Cesar feature is very dominant for XGBoost and Random Forest in determining the three classes of pregnancy risk. For the other two algorithms, the LIME plot for Naïve Bayes indicates that the smoking feature dominates the formation of the model, and the LIME plot for kNN is strongly influenced by features such as blood pressure and age.

Comparison of accuracy of pregnancy risk classification
The prediction models are built on the dataset with three balanced classes. Classification accuracy is tested with a training-to-testing data ratio of 80:20. The resulting accuracy values are compared in Table 6. The XGBoost algorithm has the highest accuracy at 94%, followed by Random Forest, Naïve Bayes, and kNN at 87%, 66%, and 60%, respectively.
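The 80:20 evaluation protocol can be sketched as a shuffled split followed by an accuracy computation. This is a minimal illustration: the text does not state whether the split was stratified or which random seed was used, so the sketch assumes a plain random split, and the function names and toy labels are hypothetical.

```python
import random


def train_test_split(rows, train_ratio=0.8, seed=0):
    """Shuffle the rows and split them into training and testing portions."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical toy data: 100 instance ids and four example predictions.
train_rows, test_rows = train_test_split(list(range(100)))
example_accuracy = accuracy(
    ["LRP", "HRP", "VHRP", "HRP"],
    ["LRP", "HRP", "HRP", "HRP"],
)
print(len(train_rows), len(test_rows), example_accuracy)
```

Accuracy is informative here because the classes were balanced before training; on imbalanced classes it would need to be read alongside per-class precision and recall.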

CONCLUSION
In this study, a pregnancy risk prediction system was built for Indonesia based on the features recorded for pregnant women. Pregnancy risk is divided into three groups: low risk pregnancy (LRP), high risk pregnancy (HRP), and very high risk pregnancy (VHRP). A pregnancy dataset is therefore required to represent the mother's condition during pregnancy. This study involved 400 pregnancy records. The data could not be used directly because of problems with format and consistency and because of the uneven distribution of the three pregnancy risk statuses. For this reason, data preprocessing, attribute selection, and data balancing were carried out. The models were built with four machine learning algorithms: XGBoost, Random Forest, Naïve Bayes, and kNN. The classification results showed that XGBoost achieved the highest accuracy at 94%, followed by Random Forest, Naïve Bayes, and kNN at 87%, 66%, and 60%, respectively. Both the SHAP-based and the LIME-based plots indicated consistent feature importance across all classes and all applied algorithms, because both interpretable machine learning techniques interpret the same model. However, the two differ in that LIME displays only the features that most influence the model, whereas the SHAP-based interpretation displays all features (along with the SHAP value for every single feature).