A machine learning-based computer model for the assessment of tsunami impact on built-up indices using 2A Sentinel imageries

ABSTRACT


INTRODUCTION
Methods for detecting and assessing tsunami vulnerability have developed rapidly and now range from linear, non-linear and numerical modeling to photogrammetric image analysis and remote sensing [1]-[5]. Remote sensing image analysis uses medium-resolution images such as Landsat 8 OLI and Sentinel 2A, or high-resolution images such as SPOT 5 and Quickbird [6]-[8]. Building damage caused by a tsunami can be estimated quickly by applying machine learning functions to built-up indices data extracted from remote sensing imagery [9]. Machine learning methods have long been applied to tsunami mitigation, including predicting inundation, maximum wave height and the arrival time of tsunami waves on land, even though the uncertainty of the predictions is very high [10]. In Indonesia, tsunamis remain a future disaster threat: ancient tsunami silt deposits have been found on the south coast of Java, and the Eurasian and Indo-Australian plates have the potential to cause large earthquakes that trigger tsunami waves from the Sunda Strait to Bali [11]-[18].
In terms of seismotectonic zones in Indonesia, the coastal areas of Central Java and Yogyakarta fall in zone B, where tsunamis occur more than 2.5 times in a period of 30-50 years. In this zone, tsunamis are generated by two types of earthquakes: those from the subduction of the Indian Ocean Plate under the Eurasian Plate, and those from the pressure of the arc plate that lies east to west, north of the islands of Bali, Lombok and Sumbawa [19]. Historical data show that zone B, especially the seas south of Central Java and Yogyakarta, was hit by 20 tsunamis of varying strength, recorded from before 1600 until the end of 2006 [20].

Currently, there is no computer-based modeling study that offers a model for detecting built-up land quickly and accurately in the zone B tsunami risk areas. To address this gap, this study aims to: i) build a computer model to detect built-up land from Sentinel 2A satellite imagery; ii) classify and optimize the detection of digital number (DN) data from Sentinel 2A images using the machine learning methods Random Forest (RF) and eXtreme gradient boosting (XGBoost); and iii) predict the spatial pattern of the distribution of built-up land using the Ordinary Kriging (OK) method. The novelty of this study is a computer model that detects and predicts the spatial distribution of built-up land on 4 scales (very low, low, high, and very high) based on normalized built-up area index (NBI), urban index (UI), normalized difference built-up index (NDBI), modified built-up index (MBI), and index-based built-up index (IBI) data extracted from Sentinel 2A imagery.

METHOD
The cluster K-Means algorithm (CKA) works on the Euclidean distance from a data point y to a centroid c [21]-[23]. The Euclidean distance is formulated by (1):

$$d(y, c) = \sqrt{\sum_{j} (y_j - c_j)^2} \tag{1}$$

where $y_j$ and $c_j$ are the j-th components of y and c.

The RF algorithm combines a non-parametric classification method with the classification and regression tree (CART) method, $\{h(x, \theta_k)\}_{k=1}^{T}$, in which $x$ is the observed input data in vector form and $\theta_k$ is sample data in vector form, taken randomly from the input data set of observations $x_1, x_2, x_3, \ldots, x_{n-1}$. $T$ denotes the sample data used as bootstrap training data. Pictured as a forest, the sample data are trees that are grown, selected randomly, and classified into nodes or classes using the CART method [24]. The number of trees grown is represented as n-tree, and the number of candidate variables sampled at each split is represented as m-try. To classify a tree and divide it into nodes or classes, the Gini index is used (2):

$$Gini = 1 - \sum_{i=1}^{n} p_i^2 \tag{2}$$

where $p_i$ is the ratio of training data, taken randomly from historical data, that falls in category $i$, and $n$ is the number of categories of classified nodes [25], [26].

XGBoost is a machine learning algorithm based on the decision tree concept, in which decision tree nodes are connected to one another hierarchically [27]. Each tree contributes to a large classifier by forming an ensemble of weak classifiers [28]. XGBoost is formulated as (3)-(5):

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i) \tag{3}$$

$$Obj = \sum_{i} l(y_i - \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \tag{4}$$

$$\Omega(f_k) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2 \tag{5}$$

where $f_k$ is the additive function that represents the k-th tree, and the sum of these functions forms the prediction $\hat{y}_i$. The term $l(y_i - \hat{y}_i)$ measures the difference between the prediction value $\hat{y}_i$ and the target value $y_i$. $\Omega(f_k)$ is the model complexity term, in which $T$ is the number of leaves of a tree, $w$ is the vector of leaf values, and $\gamma$ and $\lambda$ are regularization parameters [29].

OK uses structural analysis and the variogram to assess the weight of locations that are not observation points across the spatial field [30]. The semivariogram used by OK is (6):

$$\gamma(h) = \frac{1}{2N(h)} \sum_{i=1}^{N(h)} \left[ Z(x_i) - Z(x_i - h) \right]^2 \tag{6}$$

where $\gamma(h)$ is the semivariogram, $h$ is the distance lag, $N(h)$ is the number of pairs of observation points separated by distance $h$, and $Z(x_i)$ and $Z(x_i - h)$ are regionalized variables [31].

The accuracy of the RF and XGBoost classification and optimization is assessed using the overall accuracy method and Cohen's Kappa method. The overall accuracy is (7):

$$OA = \frac{\sum_{i=1}^{B} x_{ii}}{n} \tag{7}$$

where $B$ is the number of classes used in classification, $x_{ii}$ is the amount of testing data correctly classified in class $i$, and $n$ is the amount of data analyzed [32]. Cohen's Kappa is (8):

$$\kappa = \frac{p_0 - p_e}{1 - p_e} \tag{8}$$

where $p_0$ is the observed accuracy and $p_e$ is the probability of agreement by chance [32].

The proposed computer model, the novelty of this study, uses a framework of 3 layers: i) pre-processing, ii) data analytics, and iii) interpretation (Figure 1). The pre-processing layer consists of atmospheric, radiometric and geometric correction, image extraction using the NBI, UI, NDBI, MBI and IBI algorithms, and the Raster Statistics for Polygon function, which together produce numeric built-up indices data. The data analytics layer consists of data classification using the CKA method, classification optimization using machine learning, and accuracy testing using overall accuracy and Cohen's Kappa. The interpretation layer predicts the spatial distribution using OK and classifies the output on 4 scales: very low, low, high, and very high.
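As an illustration of the two accuracy measures used in the data analytics layer, the following minimal sketch computes overall accuracy and Cohen's Kappa from a confusion matrix. The 4-class matrix is hypothetical and is not data from this study; the function names are chosen here for illustration only.

```python
def overall_accuracy(cm):
    """Share of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def cohen_kappa(cm):
    """Cohen's Kappa: (p0 - pe) / (1 - pe)."""
    total = sum(sum(row) for row in cm)
    p0 = overall_accuracy(cm)
    # Chance agreement from the row (reference) and column (predicted) totals.
    row_tot = [sum(row) for row in cm]
    col_tot = [sum(row[j] for row in cm) for j in range(len(cm))]
    pe = sum(r * c for r, c in zip(row_tot, col_tot)) / total ** 2
    return (p0 - pe) / (1 - pe)


# Hypothetical confusion matrix for the classes very low, low, high, very high
# (rows: reference class, columns: predicted class).
cm = [
    [20, 2, 0, 0],
    [1, 18, 2, 0],
    [0, 1, 22, 2],
    [0, 0, 1, 19],
]
oa = overall_accuracy(cm)
kappa = cohen_kappa(cm)
print(f"overall accuracy = {oa:.3f}, Cohen's Kappa = {kappa:.3f}")
```

Kappa is lower than overall accuracy here because it discounts the agreement that would be expected by chance from the class totals alone.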
Coastal elevation is one of the indicators that must be analyzed to determine tsunami vulnerability, in addition to land use and land cover characteristics. Elevation is modeled using ASTER digital elevation model (DEM) imagery, with the elevation interpretation shown in Table 2.
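The semivariogram on which the OK step relies can be sketched as follows. This is only the experimental semivariogram for one lag, not a full kriging implementation; the sample points, values and tolerance are hypothetical, and pairs are matched by separation distance within a tolerance, which is an assumption of this sketch.

```python
import math


def empirical_semivariogram(points, values, lag, tol):
    """Return (gamma, n_pairs) for pairs whose separation is within tol of lag.

    gamma = 1 / (2 * n_pairs) * sum of squared value differences.
    """
    sq_sum = 0.0
    n_pairs = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            if abs(d - lag) <= tol:
                sq_sum += (values[i] - values[j]) ** 2
                n_pairs += 1
    if n_pairs == 0:
        return None, 0
    return sq_sum / (2 * n_pairs), n_pairs


# Hypothetical sampling points along a transect and their index values.
points = [(0, 0), (1, 0), (2, 0), (3, 0)]
values = [1.0, 2.0, 4.0, 7.0]

gamma_1, n_1 = empirical_semivariogram(points, values, lag=1.0, tol=0.1)
gamma_2, n_2 = empirical_semivariogram(points, values, lag=2.0, tol=0.1)
print(gamma_1, n_1, gamma_2, n_2)
```

In practice a variogram model would be fitted to these lag values before the kriging weights are solved, and a geostatistics library would normally handle both steps.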

RESULT AND DISCUSSION
The first indicator for assessing tsunami-induced building damage in this computer model is the built-up area along the coast. The built-up area is determined by extracting, identifying and separating the DN values of Sentinel 2A pixels in the built-up land category from the DN values of pixels in the fisheries, forest and agriculture categories using supervised classification. This process produces a map image in raster format, which is converted to vector format to make the land area easier to calculate. The comparison of built-up land area over the 5 years from 2017 to 2021 is shown in Figure 2.

Built-up indices data are classified with the CKA algorithm into 4 categories of built-up land density: very low, low, high, and very high. The purpose of this classification is to label each observation or sampling area with a value for each built-up index indicator. The classification results are then tested using the machine learning methods RF and XGBoost.
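The CKA labeling step described above can be sketched with a plain k-means on a single index. The index values below are hypothetical, the evenly spaced starting centroids are an assumption made to keep the result deterministic, and the study itself may have used a different implementation and several indices at once.

```python
LEVELS = ["very low", "low", "high", "very high"]


def kmeans_1d(values, k=4, iters=100):
    """Plain Lloyd's k-means on scalar values; returns (centroids, labels)."""
    lo, hi = min(values), max(values)
    # Deterministic start: centroids evenly spaced over the value range.
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [-1] * len(values)
    for _ in range(iters):
        # Assignment: nearest centroid by Euclidean distance (|v - c| in 1-D).
        new_labels = [
            min(range(k), key=lambda j: abs(v - centroids[j])) for v in values
        ]
        # Update: each centroid moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, new_labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels


def label_density(values):
    """Map each value to a density level by the rank of its cluster centroid."""
    centroids, labels = kmeans_1d(values, k=len(LEVELS))
    order = sorted(range(len(centroids)), key=lambda j: centroids[j])
    rank = {cluster: r for r, cluster in enumerate(order)}
    return [LEVELS[rank[lab]] for lab in labels]


# Hypothetical built-up index values for 11 sampling areas.
index_values = [-0.42, -0.40, -0.38, -0.21, -0.20, -0.18,
                -0.06, -0.04, 0.07, 0.08, 0.10]
density = label_density(index_values)
print(list(zip(index_values, density)))
```

Ranking the clusters by centroid value is what turns anonymous cluster numbers into the ordered labels that RF and XGBoost are then trained to reproduce.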
The machine learning test results are then used to predict the spatial distribution pattern with the OK method. The RF algorithm works by forming vectors from the input data, namely the built-up indices, denoted $x_1, x_2, x_3, \ldots, x_{n-1}$. The vectors are then split randomly, some as training data and some as testing data, before the Gini index is calculated. In the RF algorithm, the built-up indices are nodes described as trees, each of which has branches for the very low, low, high, and very high clusters.

Figure 4 compares the spatial patterns of building density in the study area obtained with the RF algorithm for 2016 (Figure 4(a)), 2020 (Figure 4(b)), and 2021 (Figure 4(c)) based on built-up indices data. In 2016, the study area (Figure 4(a)) has a low to very low building density, shown in green to blue. In most of the study area in 2020 (Figure 4(b)), the density of built-up land is high to very high (yellow and red). In 2021, density increases further compared with 2020: most of the study area (Figure 4(c)) has a high to very high building density, shown in yellow to red.

The XGBoost algorithm processes structured data using decision trees: the data attributes (built-up vegetation indices) are tested at each node against the criteria for the very low, low, high and very high clusters, and the test results are represented on each branch. The XGBoost results are almost the same as those of RF. In 2016, the study area has a low to very low building density, shown in green to blue. In 2021, density is higher than in 2016, and the study area has a high to very high building density, shown in yellow to red.

The accuracy of the RF and XGBoost results is tested using 2 methods, overall accuracy and Cohen's Kappa. Both RF and XGBoost score above 80% on both measures, so the classification and spatial prediction are considered very valid. RF achieves an overall accuracy of 0.913 and a Cohen's Kappa of 0.866; XGBoost achieves an overall accuracy of 0.947 and a Cohen's Kappa of 0.921.

The final decision represented in the computer model is made using the decision matrix method, which requires a scale for each variable (Table 3). Built-up density is scored as 1 for very low, 2 for low, 3 for high and 4 for very high density. Elevation is scored as 1 for very low, 2 for low, 3 for medium, 4 for high, and 5 for very high vulnerability. Accuracy is scored on a scale of > 0.9 for 1 and > 0.9 for 2. Based on the decision matrix in Table 3, XGBoost is more accurate than RF, and under both algorithms 2020 and 2021 show a very high level of tsunami vulnerability because of their high building density and very low elevation of < 5 m above sea level. The assessment scale is obtained by dividing the sum of the values by the number of variables used, giving a lowest value of 1 and a highest value of 4. Based on the experiments, XGBoost is the most optimal and accurate algorithm: it yields an assessment scale of 3.6 (a very high vulnerability area), whereas RF yields 3.3 (a high vulnerability area).
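The arithmetic behind the assessment scale can be illustrated with a short sketch. The per-variable scores used here (density 4, elevation 5, accuracy 1 for RF and 2 for XGBoost) are assumptions chosen for illustration, since the rows of Table 3 are not reproduced in the text; they are one combination that is consistent with the reported scales of 3.3 and 3.6 when the mean is truncated to one decimal.

```python
import math


def assessment_scale(scores):
    """Sum of the per-variable scores divided by the number of variables."""
    return sum(scores.values()) / len(scores)


def truncate_1dp(x):
    """Truncate (not round) a positive number to one decimal place."""
    return math.floor(x * 10) / 10


# Assumed scores, not taken from Table 3.
rf_scores = {"density": 4, "elevation": 5, "accuracy": 1}
xgb_scores = {"density": 4, "elevation": 5, "accuracy": 2}

rf_scale = assessment_scale(rf_scores)
xgb_scale = assessment_scale(xgb_scores)
print(truncate_1dp(rf_scale), truncate_1dp(xgb_scale))
```

Under these assumed inputs the only difference between the two algorithms is the accuracy score, which is why the gap between the two scales is one third of a point.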

CONCLUSION
The results of the study show that the built-up indices IBI (0.080) and NBI (0.077) represent the dominance of built-up land, with a small portion of open land and surface water in the study area. High values of built-up indices represent complexes of settlements, industrial areas and tourism areas, while low values represent open land. The mean values of the built-up indices MBI (-0.093), NBAI (-0.241), NDBI (-0.419) and UI (-0.204) represent the dominance of surface water areas such as paddy fields and aquaculture areas, and of vegetation such as shrubs, plantations and forests. The RF classifier achieves an overall accuracy of 0.913 and a Cohen's Kappa of 0.866, which indicates that the classification of built-up indices data is very valid. The XGBoost classifier achieves an overall accuracy of 0.947 and a Cohen's Kappa of 0.921, which likewise indicates that the classification of built-up indices data is very valid.

Figure 1. Framework of the computer model for assessing the impact of the tsunami on built-up indices, optimized by a machine learning classifier

Figure 2. Comparison of built-up land with fisheries, forest and agriculture in the 5-year period 2017-2021

Table 1. The built-up indices equations used in the study

Table 3. The confusion matrix on computer model for a decision of tsunami high vulnerability assessment
Algorithm | Year of data | Build-up density | Elevation | Accuracy | Assessment scale | Symbol