Convolutional neural network for semantic segmentation of fetal echocardiography based on four-chamber view

Received Sep 14, 2020; Revised Mar 15, 2021; Accepted May 31, 2021

The acute shortage of trained and experienced sonographers makes the detection of congenital heart defects (CHDs) extremely difficult. To reduce this difficulty, accurate fetal heart segmentation is essential for locating such structural heart abnormalities early, prior to delivery. However, segmentation is not an easy task because of the small size of the fetal heart structure. Moreover, manually identifying the standard cardiac planes, primarily based on a four-chamber view, requires a well-trained and experienced clinician. In this paper, a convolutional neural network (CNN) method using the U-Net architecture is proposed to automate the segmentation of fetal cardiac standard planes from ultrasound images. A total of 519 fetal cardiac images were obtained from three videos. The data are divided into training and testing sets. The testing set consists of 106 slices covering three four-chamber segmentation cases: atrial septal defect (ASD), ventricular septal defect (VSD), and normal. Post-processing is needed to enhance the segmentation result. The combination of U-Net and Otsu thresholding gives the best performance, with 99.48% pixel accuracy, 96.73% mean accuracy, 94.92% mean intersection over union, and 0.21% segmentation error. In the future, the implementation of deep learning in the study of CHDs holds significant potential for identifying novel CHDs in heterogeneous fetal hearts.


INTRODUCTION
In developing countries such as India and Pakistan, about 1% of newborn babies are affected by congenital heart defects (CHDs). The number of newborn babies with CHDs is increasing, as reported in [1], where the 2011 prevalence of CHDs was 9.1 per 1000 births. CHDs are heart diseases that can be detected as early as the first trimester of intra-uterine life [2]. Such defects are characterized by abnormalities in the heart structure, with severity varying from mild to severe. CHDs still dominate the problem of heart disease in infants and children. With an incidence of around 1% of infants in Indonesia, about 45,000 babies are born every year with CHDs, ranging from mild to severe abnormalities and including complex conditions [3]. A newborn with undiagnosed heart disease will be discharged home, where it is only a matter of time before the condition worsens or the infant dies. Diagnosing CHDs in the early stages of pregnancy allows for prompt, lifesaving treatment. Fetal diagnosis depends on observations by experienced clinicians using ultrasound imaging [4]. Unfortunately, because there are very few experts in the field of CHDs, it is common for an infant to be born with an existing heart problem undiagnosed [5]. Undetected CHDs are a serious problem: when an infant has a serious heart problem, the outcome often depends on an accurate diagnosis at the time of birth.
Newborn children with severe heart disease that is not diagnosed before birth could die in the first month or become more severely ill while still in the maternity ward. Nonetheless, treating congenital heart disease within seven days after birth considerably improves the prognosis [6]. For this reason, numerous efforts have been made to develop technology that makes fast and accurate diagnosis possible. Precise fetal heart segmentation is fundamental to localizing structural heart abnormalities before birth [7]. The difficulty increases because of the small size of the fetal heart structure and depressions, particularly the thin chamber boundaries in the atrial septum, the membranous section of the ventricular septum, and the valvular leaflets [8].
A machine learning (ML) algorithm can learn from medical image data using statistical techniques [9], [10]. Using medical data allows ML to progressively improve its performance on a specific task. Furthermore, an ML algorithm allows a diagnostic system to detect disease faster and more accurately than a human being [11]. Unfortunately, this process requires more information on normal and abnormal subjects to recognize a particular disease [12]. The problem is that heart defects in infants are infrequent, so little information is available to train the ML algorithm. The same issue applies to congenital heart disease: the conditions are rare and no complete data sets exist, so predictions can only be made using relatively small and incomplete data sets [13]. Because of this limitation, diagnosis with ML-based methods has been insufficiently accurate and is not recommended for clinical use [14], [15]. Adding more ultrasound images to the system can help the ML algorithm learn better and improve its screening accuracy [16]. The study of CHDs has been conducted primarily with handcrafted features, referred to as 'shallow learning' [17], [18]. Maraci et al. [18] applied dynamic texture modelling with handcrafted, rotation-invariant features to detect the fetal heartbeat. Bridge, Ioannou, and Noble [17] proposed a framework based on sequential Bayesian filtering (SBF) to predict the visibility, position, and orientation of the fetal heart in consecutive frames. However, a shallow architecture cannot learn the essential features directly from the data because it requires feature engineering [15], [19].
One family of ML algorithms, deep learning (DL), can work with less data [20]-[22]. With an augmentation strategy, the image set can be expanded and the augmented images used as learning data as well [23]. DL algorithms are vastly improving the speed and accuracy of medical diagnosis [12], [24]. In general, fetal heart diagnosis experts look for parts of the heart, such as valves and blood vessels, that are in incorrect positions by comparing normal and abnormal fetal heart images. With DL, this process resembles the 'object detection' technique, which allows DL to locate and classify multiple objects appearing in images [25]-[28]. Using the ML approach, it is possible to develop an automatic diagnostic system that detects certain diseases faster than humans [12], [29]. Since the literature reviewed above shows that research on automated fetal heart segmentation is limited, this study proposes a DL algorithm to detect abnormalities in fetal hearts based on ultrasound (US) images. With accurate detection, appropriate treatment can be given as soon as possible. DL, with its automatic feature learning ability, is an efficient method for fetal pattern recognition. Baumgartner et al. [26] employed a fully convolutional network to detect 12 standard planes and localize the respective fetal anatomy. Gao, Maraci, and Noble [27] presented a transfer learning-based design to study the transferability of features learned from natural images to ultrasound image object recognition. Chen et al. [28] proposed a hybrid model composed of ConvNets and recurrent neural networks (RNNs) to explore spatiotemporal learning from contextual temporal information. However, fetal cardiac data sets are difficult to obtain, and large publicly available data sets are lacking. Therefore, 3D US videos from the internet have become the data source in [30]-[32].
Processing US video to provide a reliable data set is challenging because the data are unstructured and have varying dimensions. Hence, in this paper, a preliminary study of deep learning with a convolutional neural network-based U-Net architecture is proposed to automate semantic segmentation of the four-chamber view in fetal cardiac images from unstructured data of varying dimensions.

SEGMENTATION PROCESS
There are three stages in the segmentation process: (1) data collection and preparation, (2) automated segmentation using CNNs, and (3) post-processing and calculation of segmentation performance. Details of these stages are presented in the following subsections.

Data collection and preparation
The data were obtained in MP4 format from three maternity cases in the gestational age range of 18-23 weeks. The raw data cover three conditions: normal [7], ASD [33], and VSD [34]. The MP4 videos are converted into image frames (slices), yielding 519 images in total with different dimensions. The dimensions of the ASD, VSD, and normal data were 1280x720, 640x480, and 640x480, respectively. Before the ground-truth data were created, we performed a pre-processing stage that included cropping and resizing. The overall steps of the data preparation stage are shown in Figure 1.

Figure 1. Data preparation stages
All images must be cropped to eliminate the noise from the frame. After that, the images are resized to reduce their dimensions to 400x300. Once the pre-processed data have been obtained, the next step is to create ground truth for all images. The ground truth serves two purposes: it is the annotated data in the training phase, and it is the reference against which segmentation results are measured. Figure 2 illustrates the pre-processing steps and the ground-truth creation process.
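The cropping and resizing steps can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the centred 800x600 crop window and the nearest-neighbour interpolation are assumptions chosen for the example, and 400x300 is read here as width x height. Only the frame sizes and the 400x300 target come from the text.

```python
import numpy as np

def crop_center(frame, crop_h, crop_w):
    """Cut a centred crop_h x crop_w window out of a 2-D or 3-D frame."""
    h, w = frame.shape[:2]
    top = (h - crop_h) // 2
    left = (w - crop_w) // 2
    return frame[top:top + crop_h, left:left + crop_w]

def resize_nearest(frame, out_h, out_w):
    """Nearest-neighbour resize: pick one source pixel per output pixel."""
    h, w = frame.shape[:2]
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return frame[rows[:, None], cols]

# Synthetic stand-in for one 1280x720 (width x height) ASD frame.
frame = np.zeros((720, 1280), dtype=np.uint8)
cropped = crop_center(frame, 600, 800)      # hypothetical crop window
resized = resize_nearest(cropped, 300, 400)
print(resized.shape)  # (300, 400)
```

In practice an image library with proper interpolation would normally be used for resizing; the index-based version above only makes the geometry explicit.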

Automated semantic segmentation
The purpose of semantic segmentation is to find the specific characteristics in a fetal cardiac image and separate the image into distinct objects with those characteristics. In this study, a convolutional neural network (CNN) with the U-Net architecture is proposed for fetal cardiac segmentation [35]. The U-Net architecture has been proven to carry out semantic segmentation on medical data sets, and it has been used successfully in brain tumour segmentation [36] and cell segmentation [37]. It is an end-to-end fully convolutional network that contains only convolution layers, with no fully connected (dense) layer. Therefore, the architecture can accept images of different sizes.
In general, the CNN-based U-Net architecture has two types of layers: the convolution layer and the pooling layer. The convolution layer applies a small matrix (the kernel or filter) to the input image and changes the pixel values based on the filter's values. The convolution process is defined using (1):
$$G[m,n] = (f * h)[m,n] = \sum_{i}\sum_{j} h[i,j]\, f[m-i,\, n-j] \tag{1}$$
where $f[m,n]$ is the input image, and $h$ is a filter. The size of the output of the convolution operation is calculated by using (2):
$$n_{out} = \left\lfloor \frac{n_{in} + 2p - k}{s} \right\rfloor + 1 \tag{2}$$
where $n_{in}$ is the number of input features, $n_{out}$ the number of output features, $k$ the convolution kernel size, $p$ the convolution padding size, and $s$ the convolution stride size. The U-Net model uses max pooling in the pooling layer. This pooling function reduces the size of the feature map so that the network has fewer parameters. Max pooling selects the maximum pixel value from each window of the feature map and collects these values into a pooled feature map. The critical point of the pooling process is that the image resolution is reduced: high-resolution images are converted to low-resolution images. The U-Net architecture is divided into two parts, the encoder (contraction path) and the decoder (symmetric expansion path), as presented in Figure 3.
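The convolution, output-size, and max-pooling operations described above can be made concrete with a short NumPy sketch. It uses the standard textbook forms (a flipped-kernel 2-D convolution and the floor((n_in + 2p - k) / s) + 1 size relation); the function names are illustrative, and the loops are written for readability rather than speed.

```python
import numpy as np

def conv_output_size(n_in, k, p, s):
    """Output length along one axis: floor((n_in + 2p - k) / s) + 1."""
    return (n_in + 2 * p - k) // s + 1

def conv2d(image, kernel):
    """Direct 2-D convolution (kernel flipped), stride 1, no padding."""
    kh, kw = kernel.shape
    flipped = kernel[::-1, ::-1]
    out_h = conv_output_size(image.shape[0], kh, 0, 1)
    out_w = conv_output_size(image.shape[1], kw, 0, 1)
    out = np.zeros((out_h, out_w), dtype=float)
    for m in range(out_h):
        for n in range(out_w):
            # Weighted sum of the window under the flipped kernel.
            out[m, n] = np.sum(image[m:m + kh, n:n + kw] * flipped)
    return out

def max_pool2x2(feature_map):
    """2x2 max pooling with stride 2; an odd trailing row/column is dropped."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A 3x3 kernel with padding 1 and stride 1 keeps a 256-pixel axis at 256,
# while a 2x2 window with stride 2 halves it to 128.
print(conv_output_size(256, 3, 1, 1), conv_output_size(256, 2, 0, 2))
```

Note that most deep learning frameworks implement the convolution layer as cross-correlation (no kernel flip); because the kernel is learned, the distinction does not affect what the network can represent.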
The function of the contraction path, or encoder, is to capture the context contained in the image. In this first part, an image with dimensions 256x256x1 is down-sampled to produce the required context. The contraction path consists of several convolution and max-pooling layers, with a convolution kernel size of 3x3. A non-linear ReLU operation follows every convolution operation. In addition, there are 2x2 pooling operations with a stride of 2. This process finally produces feature maps of size 32x32x256. The second part is the symmetric expanding path, or decoder, which localizes objects using transposed convolutions. The symmetric expanding path applies up-sampling operations to the output of the contraction path. In this operation, the image size increases gradually and the depth decreases gradually, from 32x32x256 to 256x256x1. The symmetric expanding path restores the information generated by the contraction path through successive up-sampling stages. A skip connection is applied at each layer of the symmetric expanding path to produce better object segmentation results. This is done by combining the output of the convolution layers in the symmetric expanding path with the feature map from the contraction path at the same level.
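The change in feature-map size along the two paths can be traced with a few lines of plain Python. This is only a shape calculation under stated assumptions: size-preserving 3x3 convolutions, 2x2 pooling with stride 2, and illustrative filter counts of 32, 64, 128, and 256, chosen so that the bottleneck matches the 32x32x256 size given in the text. It is not the authors' network definition.

```python
def unet_shapes(size=256, in_channels=1, encoder_filters=(32, 64, 128, 256)):
    """Trace (height, width, channels) through a U-Net-style encoder and decoder.

    Assumes size-preserving convolutions and 2x2 pooling with stride 2;
    the filter counts are hypothetical values used for illustration.
    """
    shapes = [("input", (size, size, in_channels))]
    skips = []
    s = size
    # Contraction path: convolve, remember the feature map, then halve the size.
    for i, f in enumerate(encoder_filters):
        shapes.append((f"enc{i + 1}", (s, s, f)))
        if i < len(encoder_filters) - 1:
            skips.append((s, f))
            s //= 2
    # Expanding path: double the size, join the skip connection, convolve.
    for i, (skip_s, skip_f) in enumerate(reversed(skips)):
        s *= 2
        assert s == skip_s  # a skip connection joins maps of equal size
        shapes.append((f"dec{i + 1}", (s, s, skip_f)))
    shapes.append(("output", (s, s, 1)))
    return shapes

for name, shape in unet_shapes():
    print(name, shape)
```

Three pooling steps take the spatial size from 256 to 128, 64, and 32, and three up-sampling steps reverse this, which is why each decoder level has an encoder feature map of the same size available for its skip connection.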

Post-processing and performance measurement
In this study, we applied several post-processing methods to enhance the segmentation result. In the post-processing phase, each image pixel value becomes 0 for the background and 1 for the foreground. This process improves the quality of the segmentation results in terms of accuracy. The post-processing methods used are global thresholding (fixed thresholding; threshold=127), Otsu thresholding [38], and local thresholding (Gaussian thresholding [39]). These methods are compared to find the best one for four-chamber segmentation. After the post-processing stage, the testing phase is performed. The result is validated using pixel accuracy (PA), mean accuracy (MA), and mean intersection over union (MIoU) [40]. The PA metric is the ratio between the number of correctly classified pixels and the total number of pixels. The metrics are expressed in terms of the confusion matrix used in classification. Pixel accuracy is given in (3):
$$PA = \frac{\sum_{x=1}^{N_{cls}} N_{xx}}{\sum_{x=1}^{N_{cls}} \sum_{y=1}^{N_{cls}} N_{xy}} \tag{3}$$
where $N_{cls}$ is the number of classes and $N_{xy}$ is the number of pixels of class $x$ that were predicted as class $y$. The entries of the confusion matrix represent false positives ($N_{xy}$), false negatives ($N_{yx}$), true positives ($N_{xx}$), and true negatives ($N_{yy}$). Mean accuracy (MA) computes the accuracy ratio for each class and averages it over all classes ($N_{cls}$). MA is described in (4):
$$MA = \frac{1}{N_{cls}} \sum_{x=1}^{N_{cls}} \frac{N_{xx}}{\sum_{y=1}^{N_{cls}} N_{xy}} \tag{4}$$
The mean intersection over union (MIoU) is also known as the Jaccard index. It measures the percentage of overlap between the labelled mask and the predicted output. Intersection over union is computed per class, and the values of all classes are averaged. The IoU metric is highly effective and very straightforward. MIoU is presented in (5):
$$MIoU = \frac{1}{N_{cls}} \sum_{x=1}^{N_{cls}} \frac{N_{xx}}{\sum_{y=1}^{N_{cls}} N_{xy} + \sum_{y=1}^{N_{cls}} N_{yx} - N_{xx}} \tag{5}$$
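The three metrics can be computed directly from a confusion matrix, as in the following NumPy sketch. It follows the usual definitions from the semantic segmentation literature, with N[x, y] counting pixels of true class x predicted as class y; the function names and the tiny example masks are illustrative, and the code assumes every class occurs at least once in the ground truth.

```python
import numpy as np

def confusion_matrix(truth, pred, n_cls):
    """N[x, y] = number of pixels of true class x predicted as class y."""
    idx = truth.ravel().astype(int) * n_cls + pred.ravel().astype(int)
    return np.bincount(idx, minlength=n_cls * n_cls).reshape(n_cls, n_cls)

def segmentation_metrics(n):
    """Return (PA, MA, MIoU) computed from a confusion matrix n."""
    n = n.astype(float)
    correct = np.diag(n)        # N_xx: correctly classified pixels per class
    per_truth = n.sum(axis=1)   # pixels that belong to each class
    per_pred = n.sum(axis=0)    # pixels predicted as each class
    pa = correct.sum() / n.sum()
    ma = np.mean(correct / per_truth)
    miou = np.mean(correct / (per_truth + per_pred - correct))
    return pa, ma, miou

# Two-class toy example: 0 = background, 1 = foreground.
truth = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1]])
pred = np.array([[0, 0, 0, 1],
                 [0, 0, 1, 1]])
pa, ma, miou = segmentation_metrics(confusion_matrix(truth, pred, 2))
print(f"PA={pa:.3f} MA={ma:.3f} MIoU={miou:.3f}")  # PA=0.875 MA=0.875 MIoU=0.775
```

In this example one foreground pixel is missed, so the foreground IoU is 3/4 and the background IoU is 4/5, which shows how MIoU penalizes the error more strongly than pixel accuracy does.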

RESULTS AND ANALYSIS
In this study, we split the data into two sets, training and testing, with proportions of 80% and 20%, respectively. The training set of 413 images is used to train the U-Net architecture, while the test set of 106 images is used to measure the segmentation performance. We also tested several post-processing methods (fixed, Otsu, and Gaussian thresholding) to improve the segmentation result. The original U-Net architecture, with a sigmoid activation function in the last layer and a mean squared error loss function, is used as the baseline model. Table 1 presents the segmentation performance of the baseline model. From Table 1 it can be inferred that the fixed and Otsu thresholding methods increased the segmentation performance. Gaussian thresholding, however, failed because of the way it determines the threshold value. Fixed and Otsu thresholding use one global value as the threshold, while Gaussian thresholding determines the threshold for each pixel from a small region around it. This allows different thresholds for different regions, which gives better results for images with varying illumination but was unsuitable for the segmentation output here. Figure 4 shows the segmentation results of the baseline model.
To find the best parameters, several U-Net hyperparameters were tuned. We compared the segmentation results obtained with binary cross-entropy as the loss function and with different numbers of convolutional filters in each layer, scaling the filter counts down and up. Table 2 shows that the binary cross-entropy loss function increases the performance metrics. The most noticeable improvements are in mean IoU, which increased by more than 20%, and mean accuracy, which increased by about 15%. The effect of the number of filters in the convolution layers can also be seen from the experimental results.
The optimal numbers of convolutional filters were 64, 128, 256, 512, and 1024 in both the encoder and decoder paths, giving 99.48% pixel accuracy, 94.92% mean IoU, and 96.73% mean accuracy. In addition, the fixed and Otsu thresholds used as post-processing methods differ only slightly in the evaluation metrics. Figure 5 illustrates the segmentation result of the best model, U-Net with 64, 128, 256, 512, and 1024 convolutional filters and Otsu thresholding. Figure 6 shows the accuracy and loss curves of the training process. The loss curve decreases toward zero, and the accuracy curve increases toward 1.0. Furthermore, there is no gap between the training and validation curves, which suggests that the proposed architecture does not overfit.
In Table 3, the proposed architecture is compared with other segmentation methods. We found few segmentation methods that use DL for fetal cardiac studies. A number of previous studies measured segmentation performance using conventional methods without a learning process. Our proposed model produced 99.48% pixel accuracy, 94.92% mean IoU, and 96.73% mean accuracy, with a segmentation error of only about 0.21%. The best model gives satisfactory results compared with the others, even with very limited data.
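To make the global thresholding step concrete, the sketch below implements Otsu's method with NumPy and compares it with a fixed threshold of 127. It assumes the network's prediction has been scaled to an 8-bit image (0 to 255); the paper does not state which implementation was used, so this is an illustration of the technique rather than the authors' code. Gaussian (local) thresholding is not shown.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the 8-bit threshold that maximises between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    levels = np.arange(256)
    w0 = np.cumsum(prob)             # weight of the class with values <= t
    w1 = 1.0 - w0                    # weight of the class with values > t
    mu = np.cumsum(prob * levels)    # cumulative mean up to t
    mu_total = mu[-1]
    denom = w0 * w1
    valid = denom > 0                # skip thresholds that leave a class empty
    sigma_b = np.zeros(256)
    sigma_b[valid] = (mu_total * w0[valid] - mu[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b))

def binarize(gray, threshold):
    """Map pixels above the threshold to 1 (foreground) and the rest to 0."""
    return (gray > threshold).astype(np.uint8)

# Synthetic prediction map: dark background (50) and bright foreground (200).
pred = np.full((8, 8), 50, dtype=np.uint8)
pred[2:6, 2:6] = 200
t = otsu_threshold(pred)
mask_otsu = binarize(pred, t)
mask_fixed = binarize(pred, 127)
print(t, int(mask_otsu.sum()), int(mask_fixed.sum()))
```

On a clearly bimodal prediction map such as this one, the fixed and Otsu thresholds produce the same mask, which is consistent with the small difference between the two methods reported above.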

CONCLUSION
The diagnosis of CHDs in the fetus is a challenging task because of the small size of the fetal cardiac structure and the lack of available data. Ultrasonography videos in MP4 format from three maternity cases in the gestational age range of 18-23 weeks were used as the data set, with cardiologist validation. However, the raw data need to be pre-processed because of their unstructured dimensions and low signal-to-noise ratio. The limited data and the small structure of the fetal heart are therefore impediments to a deep investigation. A deep learning approach is proposed to help experts diagnose CHDs. The CNN-based U-Net architecture can help novice and expert sonographers identify the fetal cardiac standard planes. U-Net was selected as the base architecture and combined with several post-processing methods. The network was trained and tested on the acquired data set, and the segmentation performance was evaluated using PA, MA, and MIoU. U-Net combined with a global thresholding approach (the fixed and Otsu methods) produces the best performance. Local thresholding, on the other hand, gives unsatisfactory results because it applies a different threshold to different regions. From our preliminary results, we observe that, based on performance metrics such as accuracy and error, our network produces results comparable with state-of-the-art techniques, with 99.48% PA, 96.73% MA, 94.92% MIoU, and 0.21% segmentation error. As part of future work, we plan to test the network performance on a larger data set acquired directly from several fetal subjects. Furthermore, we will try to detect and classify structural anomalies in the fetal heart. Based on the retrieved slices, we will classify the volume as normal or abnormal for various types of CHDs and extend the work to extract standard planes associated with other anatomical structures.

ACKNOWLEDGEMENT
This work was supported by the Ministry of Research and Technology (RISTEKBRIN), Indonesia, through the Applied Research, under Grant 096/SP2H/LT/DRPM/2020.