Computer model for tsunami vulnerability using sentinel 2A and SRTM images optimized by machine learning

Received Jun 4, 2021 Revised Aug 3, 2021 Accepted Aug 30, 2021 This study aims to develop a software framework for modeling of tsunami vulnerability using DEM and Sentinel 2 images. The stages of study, are: 1) extraction Sentinel 2 images using algorithms NDVI, NDBI, NDWI, MSAVI, and MNDWI; 2) prediction vegetation indices using machine learning algorithms. 3) accuracy testing using the MSE, ME, RMSE, MAE, MPE, and MAPE; 4) spatial prediction using Kriging function and 5) modeling tsunami vulnerability indicators. The results show that in 2021 the area was dominated by vegetation density between (-0.1-0.3) with moderate to high vulnerability and risk of land use tsunami as a result of the decreasing of vegetation. The prediction results for 2021 show a low canopy density of vegetation and a high degree of land surface slope. Based on the prediction results in 2021, the study area mostly shows the existence of built-up lands with a high tsunami vulnerability risk (>0.1). Vegetation population had decreased to 67% from the original areas in 2017 with an area of 135 km. Forest vegetation had decreased by 45% from 116 km in 2017. Land use for fisheries had increased to the area of 86 km from 2017 with an area of 24 km.


INTRODUCTION
Throughout history, the two largest tsunami disasters ever recorded were the tsunami on the island of Sumatra, Indonesia in 2004 with casualties reached 220 thousand people, and the tsunami in Tohoku Japan in 2011 with 20 thousand dead and missing victims. In anticipation of similar events in the future, monitoring of various tsunami risk factors is required, such as run-up and inundation forecast data. Researchers from various countries around the world are looking for new data sources to develop tsunami models, one of which is by using remote sensing data, including images of ALOS AVNIR-2, ASTER GDEM, IKONOS, DEM SRTM, Synthetic Aperture Radar, Landsat, and Sentinel [1], [2]. Tsunami risk determination in several countries such as Portugal, Japan, and Indonesia so far is carried out using DEM to obtain an overview of the slope of the coastline [3], [4]. The currently available DEM data is generated from remote sensing image analysis, such as the SRTM interferometric synthetic aperture radar (InSAR) image [5]. The DEM data plays an important role in tsunami modeling because it can provide an overview of the geomorphology, hydrology, and ecology of a potential tsunami area. In addition to the DEM, tsunami modeling requires coastal LULC data for hazard and vulnerability assessments [6]. VI is a new method of LULC analysis based on radiometric measurements of visible light and infrared spectra with indicators of leaf area, percentage of vegetation cover, chlorophyll content, and vegetation biomass [7], [8]. LULC is dynamic and naturally occurring but Sentinel 2 MSI multispectral image contains 13 spectral bands in visible light and VNIR, as well as SWIR, with a total of four bands at 10 m, six bands at 20 m, and three bands at atmospheric correction spatial resolution of 60 m. A total of 4 bands of 10 m spatial resolution and 6 bands of 20 m spatial resolution are used for environmental monitoring, whereas 3 bands of 60 m spatial resolution are designed for atmospheric correction and cloud detection [20]. Judging from band composition, Sentinel 2 imagery is suitable for vegetation analysis because it has a finer spatial resolution compared to other satellite images. Besides, Sentinel 2 wavelength is sensitive to chlorophyll contents and its phenological status is higher than other images so that the classification of vegetation indices is more accurate [21]. The SRTM is a radar-based remote sensing image with a spatial resolution of 30-92 meters. SRTM is used as an alternative to replace the need for high-resolution satellite data which is very expensive. The SRTM is chosen usually for its accessibility, feature resolution, vertical accuracy, and lower amount of noise compared to other alternative global DEMs [22]. The SRTM is used to study topography, geomorphology, vegetation cover, tsunami assessment, and urban areas. Until now, SRTM has been available globally and can be accessed easily and cheaply [23]. The SRTM has a pixel size of 90 m, an accuracy of 16 m in a vertical position, and WGS84 as its reference ellipsoid [16]. The SRTM consistsof single-pass C and X-band interferometric synthetic aperture radars (InSAR). The SRTM can record terabytes of data from topographic maps of the earth with high accuracy from latitudes 60 N to 56 S. More than 95% of the targeted earth's surface areas have been mapped at least twice to reduce interferometric altitude errors. When completed, the SRTM topographic map will provide coverage of approximately 80% of the earth's surface. The C-band SRTM dataset consists of two types of data: 1) principal investigator processor (PI) data; 2) ground data processing system (GDPS) [24]. The VI is a combination of spectral measurements in different wavelengths as recorded by the radiometric sensor and assists in the analysis of multispectral image information by reducing the multidimensional data to a single value. The spectrum and wavelength VI are: ultraviolet (10-380), blue (450-495), green (495-570), red (620-750) anda NIR (850-1700). VI is defined as a dimension-reducing radiometric measurement mechanism involving linear ratios and combinations of the red and near-infrared (NIR) portion of the spectrum [25]. VI is a mathematical combination of various spectrum bands designed to separate the pixel values of different features numerically [26]. The VI equation in the Table 2. NDWI is designed to maximize the light reflectance of water bodies in the Green band and minimize the reflectance of water bodies in the NIR band [29]. The NDWI values vary from a negative value (-1) which represents soil to a positive value (+1) which represents the surface of open water bodies and thick land cover vegetation.
[30], [31] NDBI is used to assess the reflectance of an established land and has a higher index value compared to other land uses or land covers. The NDBI value varies from a negative value (-1) which represents land with a low proportion of buildings to a positive value (+1) which represents land with a higher proportion of buildings.
[33], [34] (1 + ) The soil surface and vegetation canopy will absorb solar radiation with a wavelength spectrum of 842 nm (NIR) and 705 nm (Red Edge1). The solution to overcome this problem is to calculate the square root of the reflectance absorbed by the canopy of vegetation and soil.
[35], [36] The data classification method using the ML algorithm has become the focus of many remote sensing researchers because it can model high-dimensional, non-parametric data that is abnormally distributed and produces a high accuracy when compared to conventional classification methods for a large number of predictors [37]. Literature studies show that various ML algorithms have been applied to classify remote sensing data, among which the focus of this study is RF, k-nn, MARS, and CART [38]- [39]. The RF algorithm is a part of the ML method used to classify high complexity and non-parametric data as well as remote sensing image data. The RF algorithm classifies data through a decision tree formation process, where each decision tree is built using bootstrap sample data from two-thirds of the training data taken randomly. The next third of the training data is used to determine the error factor in the data. Each classifying variable is given one label indicating that it is in a certain class. This is done on all sample data and the final result is the classifier that has the most labels [40]. RF uses bagging, which is an ensemble technique to increase the accuracy and stability of a model. RF uses the idea of bagging or bootstrap aggregation, namely selecting the class of classifying variables or predictors to make the final decision. For each the prediction function tree, ( ) is defined by (1) and (2).
(1) The notion M is denoted by the number of areas observed, is the area corresponding to the classifying variable or predictor variable with m, and is the constant corresponding to m. The final classification decision is made from the class that has the most labels [41].
MARS algorithm is a nonlinear and nonparametric regression method used to find the relationship between dependent and independent variables. The MARS method predicts using the basis function (BF). The definition of MARS is presented in (3).
Where 0 is the constant, is the BF coefficient, ( ) is BF and n is the number of BF in the model. BF is determined using the following function in (4).
Where x is the independent variable and is the constant related to the independent variable [42]. CART is a method in ML which is used to classify objects into two or more populations. CART works by identifying and configuring predictor variables to produce more homogeneous data. The determination of the classification variable is carried out using the Gini Index equation as shown in (5) [43].
Where c is the number of classes and is the probability of class at point t. is defined as shown in (6).

= (6)
Where is the number of samples in class and N is the number of trained samples [44]. The k-nn algorithm is intrinsically non-parametric and is generally applied for pattern recognition purposes. The k-nn algorithm works by calculating the distance between the unknown sample and the nearest known sample. The unknown sample is assigned a label or a class which is calculated from the average distance of the k-nearest neighbor variables [45]. The main principle of k-nn is that the classified points are determined based on their proximity to k. For example = {( 1 , 1 ), ( 2 , 2 ), … . , ( , )} is training set and T is the number of observed data entities. The notation ∈ is a vector and ∈ = { 1 , 2 , … } as a label for data classification = 1,2, … . If there is data input x, the entity distance will be determined in the training data as the k-nn value which is denoted as ( ). The distance function (d) in k-nn is denoted by the (7) [46].
Ordinary kriging (OK) is an interpolation method that works using the principle of Spatial Autocorrelation, which assumes that a point closer to the prediction point has a greater value than a sample point that is farther from the prediction point [47]. OK is defined using (8).
Where ( ) is the variogram value at the distance d, ( ) is the value of the observed variable at location ( ), the notation ( ) is the number of all observation points in the range of distance d. The prediction of the variable distribution is calculated using the Z * (x) linear regression, as shown in (9). Where ( ) is weight, m (x) and m ( ) are mathematical expectations of random variables Z * (x) and Z * ( ) [48].

RESEARCH METHOD
The research area is Kebumen District, Central Java Province, Indonesia as shown in Figure 1. The research was conducted in five step in Figure 2, are: a. Data preprocessing. This stage includes collecting Sentinel 2 satellite image data obtained from www.earthexplorer.usgs.gov. The Sentinel 2 satellite images are corrected geometrically, radiometrically, and atmospherically. Geometric correction aims to correct the position of the coordinates of each pixel in the image the same as shown on the earth's surface by taking into account satellite movements, earth rotation and terrain effect [49]. Radiometric correction aims to restore the digital number (DN) value that has been calibrated to reduce the noise component that interferes with the reflectance reception. Atmospheric correction aims to reduce the atmospheric effect that interferes with the reflectance and reduce the DN value [50]. b. Data analysis and classification. This stage includes the extraction of Sentinel 2 image data using the NDVI, NDBI, NDWI, MSAVI, and MNDWI algorithms. The results of data extraction are in the form of numerical values that can be used for classification and prediction using ML. Image data is classified through a supervised classification to form a thematic map of the Land Used land Cover of the study area. The SRTM DEM image data are classified to analyze the contours of the study area. c. Prediction of NDVI, NDBI, NDWI, MSAVI, and MNDWI data using RF, MARS, and CART. d. Testing the accuracy of prediction results using ML which is performed statistically using the MSE, ME, RMSE, MAE, MPE, and MAPE equations. If the result of the prediction testing is accurate, the analysis of the level of vulnerability will be continued. If the result of the analysis is not accurate, it will be analyzed again using ML. A high degree of statistical accuracy can be seen from the calculation of MSE, ME, RMSE, MAE, and MPE MAPE which is close to zero. e. Data visualization is done by using Geostatistical spatial interpolation, namely Ordinary Kriging.

RESULTS AND DISCUSSION
Vulnerability components are: 1) physical condition; 2) socio-culture, (3) economy and 4) disasterprone environment. One of the vulnerability components that play an important role in controlling the level of tsunami hazard is LULC. Areas that have LULC with high vegetation density will reduce the speed of free water flow compared to the presence of buildings, water bodies and mixed vegetation [51]. In this study, LC is classified into 5 classes, namely 1) vegetation class; 2) water body class; 3) built-up land class; 4) open land class and 5) land cover class outside the classification. The vegetation class in the study area shows a decline from 2017 that covered an area of 135 km 2 , in 2018 that covered an area of 124 km 2 , 2019 that covered an area of 96 km 2 and in 2020 that covered an area of 91 km 2 . This decrease was caused by changes in land cover from coastal forests to built-up lands for social, cultural and economic functions for the community in Figure 3  Smaller vegetation area compared to built-up land is an indicator of the low ability of the vegetation to absorb tsunami energy. In this study, land cover is classified into 9 classes, namely 1) rice field; 2) forest 3) river 4) pond 5) shoreline 6) settlement 7) road 8) industry and 9) vegetation. The forest land use is in a separate class from vegetation because forest is defined as an ecosystem in the form of a stretch of land containing biological natural resources dominated by trees with woody, high canopy and decades of age characteristics. Vegetation is defined as the entire plant community at a particular location in the form of forests, gardens, grasslands, agricultural crops and so on. Land use for social, cultural, physical and economic activities in coastal areas was very high from 2017 to 2020. Data show that land use for paddy fields around coasts was relatively widespread and fluctuated due to seasonal patterns. During the planting season in 2018, land use covered 65 km 2 . Land use for road access was very high in 2017 that covered an area of 132 km 2    The data distribution pattern vegetation indices and ML years of 2016-2020 is expressed in the form of Quartile (Q) values which consist of Quartile 1, Quartile 2, and Quartile 3. Quartile 1 is 25% of the data, Quartile 2 is 50% of the data and Quartile 3 is 75% of data using boxplot in Figure 4. Data of MNDWI has a minimum value of -0.3710, Q1 has a value of -0.2250, Q2 has a value of -0.1354, Q3 has a value of -0.930 and the maximum value is 0.1300. The prediction data of NDWI has a value of -0.4470, has a value of -0.3190, Q2 has a value of -0.2721, Q3 has a value of -0.2360 and the maximum is 0.0500. The positive value on the water index indicates that LULC in the coastal area of Kebumen Regency is in the form of water surfaces and water bodies while the negative value represents land surfaces. The testing of the accuracy and validation of NDWI and MNDWI predictions for 182 villages in the study area was carried out using the statistics method as presented in Table 3.  Table 3 show that compared to other ML algorithms, the RF and k-nn algorithms show the smallest value so that it is the most optimal algorithm approaching the observation data. The results of RF and k-nn predictions can be used as an indicator to assess changes in the LULC, especially changes in the water index, which can be seen from changes in MNDWI and NDWI.
The NDVI is the main source of land cover information on LULC, used as an indicator of whether vegetation contains a chlorophyll element. MSAVI is a result of NDVI transformation to reduce the soil effect on the analysis results. A NDVI value closing to zero represents water bodies, water masses, wetlands, while a NDVI value approaching one represents grasses, shrubs, and thick canopy forest vegetation, in Figure 4. The prediction data for NDVI (a) shows a minimum value of -0.0530, Q1 (25% of data) of 0.2180, Q2 (50% of data) of 0.2614, Q3 (75% of data) of 0.3187, and a maximum of 0.5970. The prediction data for MSAVI (b) shows a minimum value of -1.3930, Q1 (25% of data) of -0.2797, Q2 (50% of data) of -0.1653, Q3 (75% of data) of 0.0210, and a maximum value of 0.5590. These results show that most of the value data is close to zero so that LULC in the coastal area of Kebumen Regency is in the form of water bodies, water masses, wetlands, and thick canopy forest vegetation. Testing of the accuracy and validation of NDVI and MSAVI predictions of 182 villages in the study area was carried out using the Statistics method as presented in Table 4.  Table 4 show that compared to other ML algorithms, the RF and k-nn algorithms show the smallest value so that it is the most optimal algorithm approaching the observational data. The results of RF and k-nn predictions can be used as indicators to assess changes in the LULC, especially changes in canopy vegetation seen from changes in NDVI and MSAVI. NDBI is an indicator used to assess built-up lands based on the unique spectral characteristics of an image. NDBI values are between -1 and 1 which are used to differentiate built-up lands from vegetation and water bodies in Figure 4. Negative value represents vegetation and water bodies while positive value represents built-up lands. The data of NDBI prediction shows a minimum value of -0.2560, Q1 (25% of the data) of -0.1777, Q2 (50% of the data) of -0.1355, Q3 (75% of the data) of -0.1030 and the maximum value of 0.0920. Table 5 shows that the smallest values compared to other ML algorithms are RF and k-nn, so that in the prediction of NDBI data, the RF and k-nn are the most optimal algorithms. The results of RF and KNN predictions can be used as an indicator of built-up lands, and the dynamics of the NDBI value shows structural changes in the LULC. The MARS algorithm is applied to predict VI on LULV in relation to vulnerability because the VI data includes a large number of variables with nonlinear characteristics and contains a high degree of interactions among predictors. The MARS algorithm works by using the cubic spline function, namely the interpolation of predictor data or known as the basic function. This basic function is used as a connectivity representation function between the predictor variables, namely the VI value and the target variable i.e. the value of VI prediction or the value of VI 2021. Testing of the accuracy and validation of the VI prediction of 182 villages in the study area using the CART algorithm is shown in Table 6. The operation of CART Algorithm is based on the Decision Tree concept used for VI prediction purposes using historical data. In this 2829 study, the CART algorithm evaluates the response and independent-dependent correlation between VI attributes which include NDVI, MSAVI, NDWI, MNDWI, and NDBI. The data for each VI is evaluated by partitioning it to achieve the most homogeneous condition according to the characteristics of the training data. The new, more homogeneous data resulting from the partitioning process from data VI is data of the CART RF prediction which this is an algorithm that works by forming a set of decision tree classifiers, which each classifier is built randomly using 70% of the total VI data as bootstrap, and the remaining 30% is used as out-of-bag data for internal error estimation. Each classifier decision tree will provide one vote from data VI, which the votes collection from will form a new class functioning as the prediction result data. VI prediction using RF algorithm is done because this algorithm can handle a large amount of VI time series data and minimize data of outliers and noise. Testing of the accuracy and validation of VI prediction of 182 villages in the study area using the RF algorithm is shown in Table 6. K-nn is an algorithm of classification that works in two stages, namely (1) learning the training data using 70% of the original data to find the characteristics of the data to be labeled in the form of classes and (2) making a prediction by comparing the schemes in the previous stages to process the new data. The unknown scheme of the new data is labeled and entered in each class according to its closest characteristics. Testing the accuracy of the predicted data from the combination of all VIs aims to determine the accuracy of high vulnerability areas. The test is carried out using 2 classification methods of accuracy, namely (1) scale-dependent measure consisting of RMSE, MSE, MAE, and ME (2) percentage error-based measure, namely MPE and MAPE. The test results for each method are shown in Table 6.  Table 6 shows that the accuracy test produces values close to 0 so it can be interpreted that the prediction results of VI have a high accuracy. The high accuracy of VI prediction indicates that the future LULC will have characteristics that are not much different from the current or previous LULC characteristics. The characteristics of LULC include the health condition of the vegetation, the proportion of built-up lands, surface water bodies and open lands. Table 6 shows that the CART, k-nn and Random Forest algorithms show the smallest value so that it can be represented as the most accurate VI prediction compared to other algorithms.
The VI spatial distribution as a representation of LULC is predicted using the Ordinary Kriging method. The predicted value of NDVI in 2021 was classified into 3 categories based on the level of vulnerability of LULC as shown in Table 7. and Figures 5 (a). NDVI is a vegetation index which plays as an indicator of vegetation density in LULC. Vegetation density is related to the ability to absorb tsunami energy and withstand various solid materials carried by the tsunami waves. The prediction results using the CART, k-nn and random forest methods for NDVI data in 2021 show that most areas had vegetation densities between (-0.1) to (0.3) with moderate to high LULC tsunami vulnerability, which shown in Figures 5 (b), (c) and (d) in yellow color (moderate) and red color (high). NDVI prediction in 2021 using CART, k-nn, and Random Forest shows a decrease of vegetated land areas compared to VI in 2020 (shown as blue color) which is in the range of 0.1 up to 0.5 see Figure 5 (b), (c) and (d). The area that gets wider in high LULC vulnerability classification illustrates the use of that area for socio-economic, agricultural and urban activities within a distance of < 2404 from the coastline [48]. The effect of soil background reflectance on NDVI is reduced by adding the factors of canopy density and soil elevation represented in the form of VI with the name of MSAVI. The MSAVI value in 2021 was predicted to be in the range between (-0.2) and -0.4. This value represents a high vegetation canopy density (maximum 1) and a low vegetation canopy density (minimum -1) dominated by low vegetation  Table 8. Based on this study, it can be seen that the k-nn and Random Forest algorithms show a spatial pattern similar to the LULC condition in 2020. Most of the areas have a range of density from (-0.4) -0.0 to 0.0 -0.4 with topographic characteristics show vegetation lands with moderate up to high land sloping and with the prediction of LULC vulnerability towards tsunami waves from low to moderate Table 9. The LULC spatial distribution of vulnerability to tsunami waves has patterns as shown in Figure 5 (g) and (h). The CART algorithm predicts that most areas have moderate to low LULC vegetation density and land sloping. The level of vegetation density and land sloping is in the range of (-0.4) -(0.0) to (-0.8) -(-0.4). The topographical characteristics show that the vegetation land has a moderate to low sloping and the level of vulnerability of LULC to tsunami waves is predicted to be moderate to high ( Table 9). The LULC spatial distribution of vulnerability to tsunami waves has a pattern as shown in Figure 5 (b, g, i and q). The NDWI and MNDWI are VIs developed for the purpose of measuring moisture content in vegetation, soil moisture, detecting open water, and detecting soil damage caused by liquefaction. Open water detection is carried out using high reflectance water in the green band and low reflectance in the NIR band [53]. The results show that the range of value from (0.1) to (-0.4) with the positive value of open water and the negative value of built-up lands, soil surface, and vegetation (see Table 10 and Figure 5   NDBI is referred to as the urban index or index which represents an area that functions as a place of settlement, concentration and distribution of government services, social services, and economic activities. The results show that the range of NDBI values is from (-0.2) to (-0.6) with the interpretation that most of the study area are built-up lands ( Figure 5   The DEM in this study was used to predict the slope angle and assess the stability of the coastal slope. Altitude is a variable in the elevation model so it is the first derivative of elevation, which is calculated to measure elevation variations over a distance [54]. SRTM DEM is the result of the transformation from geographic coordinates to Cartesian coordinates in the UTM projection system. All DEM pixels are classified into 6 classes as shown in Table 9. As seen in Table 9, each slope class has a different LULC so that it has a different level of vulnerability. The VI distribution is a representation of the LULC analysis based on land  Figure 6. Most of the coastal areas in the study area have a slope of 5-10% with a land composition consists of: green vegetation, open land, water bodies in the form of beaches and shrimp ponds, and built-up land ( Figure 6). Other coastal areas with a slope of 10-15% which are located at a distance of 1 to 2.5 km from the coastline, have a land composition consists of limestone hills, low plant vegetation, and built-up lands in the form of houses for residents. Other coastal areas with a slope of > 15% have a land composition consists of limestone hills that overgrown with low plant vegetation to forest and built-up lands in the form of houses for residents.

CONCLUSION
The physical component of vulnerability shows that in LULC the vegetation class in the study area in 2020 decreased to 67% from the original area in 2017, which was the size of 135 km 2 . Forest vegetation covering an area of 116 km 2 in 2017 had decreased by 45% to 53 km 2 in 2020. This decrease was caused by changes in land cover from coastal forests to be built-up lands for social, cultural and economic functions with the size of 187 km 2 in 2017. Although in 2020 the built-up lands only covered a size of 51 km 2 , there was an increase of open lands covering an area of 86 km 2 . On the other hand, the land use for fisheries from an area of 24 km 2 in 2017 increased to a size of 86 km 2 in 2020. The prediction of future changes in LULC was carried out by predicting VI using MARS, CART, RF, and k-nn algorithms. Testing the accuracy of the prediction results shows that the most optimal algorithms for classification and prediction are RF and k-nn. The VI spatial distribution was predicted using the Ordinary Kriging method. The prediction results show that in 2021 most of the area will have a vegetation density from (-0.1) to (0.3) with moderate to high vulnerability and risk of tsunami in LULC as a result of the decrease of vegetation areas. The prediction results in 2021 show a low density of vegetation canopy and a high degree of land surface slope. The prediction results in 2021 shows that there are built-up lands with a high vulnerability of tsunami waves in most of the study area (> 0.1).