Landslide Hazard Analysis Using a Multilayered Approach Based on Various Input Data Configurations

Landslides are natural disasters that occur mostly in hilly areas. Landslide hazard mapping is used to classify landslide-prone areas in order to mitigate the risk of landslide hazards. This paper aims to compare spatial landslide prediction performance using an artificial neural network (ANN) model based on different data input configurations, different numbers of hidden neurons, and two types of normalization techniques on the data set of Penang Island, Malaysia. The data set involves twelve landslide influencing factors, of which five have continuous values and the remaining seven have categorical/discrete values. These factors are considered in three different configurations, i.e., original (OR), frequency ratio (FR), and mixed-type (MT) data, each of which is used separately as input to train the ANN model. The number of hidden neurons in the hidden layer has a significant effect on the final output. In addition, the three data configurations are processed using two different normalization methods, i.e., mean-standard deviation (Mean-SD) and Min-Max. Landslide causative data often contain correlated information caused by overlapping input instances. Therefore, the principal component analysis (PCA) technique is used to eliminate the correlated information. The area under the receiver operating characteristic (ROC) curve, i.e., AUC, is also applied to verify the produced landslide hazard maps. The best AUC values for the Mean-SD and Min-Max with PCA schemes are 96.72% and 96.38%, respectively. The results show that Mean-SD with PCA on the MT data configuration yields the best validation accuracy and AUC and the lowest AIC with 100 hidden neurons. The MT data configuration with Mean-SD normalization and the PCA scheme is more robust and stable in training the MLP model for landslide prediction.

Landslide events depend on various causative factors grouped into different categories, such as geomorphology, geology, soil, land cover, and hydrological conditions (Hutchinson, 1995; Varnes, 1984).
In Malaysia, most landslides occur due to heavy rainfall during the annual monsoon. Landslides have caused much damage in Malaysia over the last decade (Murakami et al., 2014; Pradhan & Lee, 2010). To mitigate and minimize the damage caused by landslides, many research projects have been carried out (Gian Quoc et al., 2018; Ya'acob et al., 2019). This is achieved by predicting hazardous areas and minimizing the hazard through proper action (Lee & Talib, 2005; Pradhan & Lee, 2010; Tay et al., 2014).
A landslide event cannot be predicted in time and space. Therefore, a landslide area is instead classified into various categories of possible hazard (Varnes, 1984).
The Geographical Information System (GIS) and remote sensing methods have been used to assess landslide hazards. Landslide hazard maps have been produced using a variety of mathematical techniques, ranging from conventional statistical methods such as frequency ratio (Chen et al., 2020), statistical index and weights-of-evidence (Regmi et al., 2014), and logistic regression (Lombardo & Mai, 2018; Sun et al., 2018) to more recent advanced intelligence methods such as the artificial neural network (ANN) (Shahri et al., 2019; Alkhasawneh et al., 2013; Alkhasawneh et al., 2014; Lee et al., 2020; Ortiz & Martínez-Graña, 2018). ANN is applied in many natural science applications such as speech recognition, human face recognition, classification of satellite images, and texture recognition. The key feature of ANN is that it can process data on any measurement scale, from nominal and ordinal to linear and ratio, and with any form of data distribution. It can also handle the qualitative factors, usually drawn from different sources, that are used for prediction and classification in the integrated analysis of spatial data (Kawabata & Bandibas, 2009). The multilayer perceptron (MLP) is a popular feedforward artificial neural network that is mostly used for classification and prediction problems.
Many researchers have studied landslide hazard mapping using either original data or frequency ratio data (Catani et al., 2013; Liu et al., 2019), and most research has been limited to these two configurations of the landslide data. Therefore, this work prepares three configurations of landslide causative factor values, i.e., original (OR), frequency ratio (FR), and mixed-type (MT), each of which is used to train the MLP model. The significant variations in the resulting landslide hazard maps are highlighted, and insights are provided to decide which data configuration performs better. This paper aims to compare spatial landslide prediction performance using an artificial neural network (ANN) model based on different data input configurations, different numbers of hidden neurons, and two types of normalization techniques on the data set of Penang Island, Malaysia.
Ilyas Ahmad Huqqani et al. / Geosfera Indonesia 6 (1), 2021, 20-39

Methods
This research focuses on using various input data configurations for the MLP model to produce the landslide hazard map of Penang Island, Malaysia. Different numbers of hidden neurons are applied to the MLP with the three configurations of input data for landslide analysis. In addition, the effects of normalization and the PCA scheme on the input data set are evaluated. The complete schematic diagram of landslide hazard analysis using the ANN model is depicted in Figure 1.

Selection of the study area: Penang Island
The location map of the study area is shown in Figure 2. Fault lines also affect the island; they run from north to south through the middle of the island. The geology profile covers the island's granites. The island's soil texture consists of six different kinds of soil. Precipitation is the natural triggering factor for landslides on the island: heavy rainfall dampens the soil and washes away debris and rocks.
Because of the limited number of rain measurement stations on the island, the inverse distance weighting (IDW) interpolation method is used to generate the precipitation profile. The landslide influencing profiles/maps are generated using the ArcGIS software and are shown in Figure 3.
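To make the interpolation step concrete, the following is a minimal sketch of inverse distance weighting in plain Python. It is not the ArcGIS implementation used in the study: the station coordinates, the precipitation values, and the distance power of 2 are all hypothetical assumptions chosen for illustration.

```python
import math


def idw_interpolate(stations, x, y, power=2.0):
    """Estimate a value at (x, y) from (sx, sy, value) station tuples.

    Each station is weighted by 1 / distance**power, so nearer stations
    influence the estimate more. The default power of 2 is an assumption.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in stations:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            # The query point coincides with a station: use its value directly.
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical stations: (x in km, y in km, annual precipitation in mm).
stations = [(0.0, 0.0, 2400.0), (10.0, 0.0, 2800.0), (0.0, 10.0, 2600.0)]

# A point equidistant from all three stations receives their plain average.
estimate = idw_interpolate(stations, 5.0, 5.0)
```

Evaluating this function over every grid cell of the study area would give a continuous precipitation surface from a handful of stations, which is the role the interpolation plays here.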

Data Set Configurations
The landslide data set consists of twelve landslide influencing factors, with five factors having continuous values, while the other seven have categorical/discrete values. The categorical data types are qualitative attributes treated as distinct symbols or just the name of the attributes. These twelve factors are arranged in three configurations. The first type of input data is Original (OR) data, which involves continuous and discrete/categorical values.
The second type of data is frequency ratio data, and the third one is mixed-type data. The Frequency Ratio (FR) data is calculated from the OR data for all twelve factors based on Eq. (1):

$$FR_{ab} = \frac{A^{*}_{ab} / A^{*}}{A_{ab} / A} \tag{1}$$

where $FR_{ab}$ indicates the frequency ratio of class $a$ of factor $b$, $A^{*}_{ab}$ is the area of landslide occurrence in class $a$ of factor $b$, $A_{ab}$ denotes the area of class $a$ of factor $b$, and $A^{*}$ and $A$ are the total areas of the landslide occurrence and study area maps, respectively.

Data set pre-processing
The normalization technique is employed at the pre-processing stage to transform the data set from its existing range into a new range. The input data set is normalized before PCA is applied and before it is fed into the MLP model. This helps the MLP model's weights and biases converge stably. Two types of normalization methods, i.e., mean-standard deviation (Mean-SD) (de Souto et al., 2008; Kotsiantis et al., 2006) and Min-Max (Husin et al., 2008; Kaur et al., 2016), are applied to all three data set configurations separately.
Different normalization techniques behave differently on the same data set. The purpose of applying two different normalization approaches is to identify the most appropriate technique for better results.
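Before normalization is applied, the FR configuration has to be derived from the OR data. As a rough illustration of the frequency ratio, i.e., the share of landslide area falling in a class divided by that class's share of the study area, the sketch below computes it for one factor. The class names and pixel counts are hypothetical, and the totals are taken as sums over the classes, which assumes the classes of a factor partition the whole study area.

```python
def frequency_ratio(landslide_area_by_class, class_area_by_class):
    """Return the frequency ratio of every class of a single factor.

    landslide_area_by_class: landslide area (or pixel count) per class.
    class_area_by_class: total area (or pixel count) per class.
    """
    total_landslide = sum(landslide_area_by_class.values())
    total_area = sum(class_area_by_class.values())
    ratios = {}
    for name, class_area in class_area_by_class.items():
        # Share of all landslide area that falls inside this class.
        landslide_share = landslide_area_by_class.get(name, 0.0) / total_landslide
        # Share of the study area that this class occupies.
        area_share = class_area / total_area
        ratios[name] = landslide_share / area_share
    return ratios


# Hypothetical slope-angle classes with pixel counts.
landslide_pixels = {"0-15": 10, "15-30": 30, "30+": 60}
class_pixels = {"0-15": 5000, "15-30": 3000, "30+": 2000}

fr = frequency_ratio(landslide_pixels, class_pixels)
```

A ratio above 1 marks a class where landslides are over-represented relative to its area, and a ratio below 1 marks one where they are under-represented.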
Mean-SD normalization: It gives the standardized values based on the mean ($\bar{x}$) and standard deviation ($\sigma$) of the data set ($X$). The mathematical expression of Mean-SD is given in Eq. (2):

$$x_i' = \frac{x_i - \bar{x}}{\sigma} \tag{2}$$

where $x_i$ and $x_i'$ denote the $i$th value of attribute $X$ and the normalized value of the $i$th value, respectively.

One of the MLP model's main challenges is to estimate the number of neurons in the hidden layer of the MLP structure, because there is no analytical method to identify the optimal structure in advance. Therefore, the Akaike Information Criterion (AIC) is used to determine the number of hidden neurons (Akaike, 1974). AIC is an asymptotically unbiased estimator of in-sample prediction error used to assess the goodness of a mathematical model. Candidate models can be ranked according to their AIC, with the best model for a data set being the one with the lowest AIC. The goodness of a model's parameters can be determined by the expected log-likelihood (Cousineau & Allan, 2015; Panchal et al., 2010). The triggering factors, like precipitation, influence future landslides (Chung & Fabbri, 1999).
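The AIC-based choice of hidden neurons described above can be sketched as follows. The text does not state which form of AIC was computed, so this sketch assumes the common least-squares form AIC = n ln(MSE) + 2k, where n is the number of samples and k is the number of weights and biases of a single-hidden-layer MLP. The sample size and the error values are hypothetical and are not results from this study.

```python
import math


def mlp_parameter_count(n_inputs, n_hidden, n_outputs=1):
    """Weights and biases of an MLP with a single hidden layer."""
    hidden_params = (n_inputs + 1) * n_hidden    # input weights + hidden biases
    output_params = (n_hidden + 1) * n_outputs   # hidden weights + output biases
    return hidden_params + output_params


def aic_least_squares(mse, n_samples, n_params):
    """AIC under Gaussian errors, up to an additive constant (assumed form)."""
    return n_samples * math.log(mse) + 2 * n_params


# Hypothetical mean squared errors for three candidate hidden-layer sizes.
candidate_mse = {5: 0.080, 10: 0.050, 20: 0.049}
n_samples = 1000   # hypothetical number of samples
n_inputs = 12      # twelve landslide influencing factors

scores = {
    hidden: aic_least_squares(mse, n_samples, mlp_parameter_count(n_inputs, hidden))
    for hidden, mse in candidate_mse.items()
}

# The candidate with the lowest AIC is preferred.
best_hidden = min(scores, key=scores.get)
```

With these made-up numbers the largest network has the lowest error but not the lowest AIC, because the 2k penalty outweighs its small gain in fit. This trade-off between fit and complexity is what the criterion is meant to capture.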

Min-Max normalization
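A common form of Min-Max normalization rescales each attribute linearly from its observed minimum and maximum to a target range. The sketch below assumes the target range [0, 1], which the text does not specify, and shows Mean-SD standardization next to it for comparison. The sample standard deviation (division by n - 1) is also an assumption, and the slope values are hypothetical.

```python
import math


def mean_sd_normalize(values):
    """Standardize values to zero mean and unit (sample) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    sd = math.sqrt(variance)
    return [(v - mean) / sd for v in values]


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that min -> new_min and max -> new_max.

    Assumes the attribute is not constant (max > min).
    """
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


# Hypothetical slope angles in degrees for five grid cells.
slope_degrees = [5.0, 12.0, 20.0, 35.0, 48.0]

z = mean_sd_normalize(slope_degrees)   # centered on 0, unbounded range
m = min_max_normalize(slope_degrees)   # bounded to [0, 1]
```

Mean-SD output is not bounded, so an extreme value stays extreme, whereas Min-Max compresses all values into a fixed interval determined by the extremes. This difference is one reason the two methods can behave differently on the same data set.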
The ROC curve is constructed by sorting the landslide hazard index (LHI) values from the trained MLP model in descending order. The LHI values are split into one hundred sub-levels on the y-axis with 1% gaps on the x-axis (Pradhan & Lee, 2010). The ROC curve indicates the efficiency of the methods, and the accuracy is obtained by measuring the area under the curve (Begueria, 2006; Chung & Fabbri, 1999). In this study, the MLP model is run using the three input data configurations, with and without pre-processing. The MLP model is implemented in MATLAB on a machine with a Core i7 3.40 GHz processor and 16 GB of installed RAM.
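The ranking procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the MATLAB code used in the study: cells are ranked by LHI, split into equal-sized bins, and the cumulative share of landslide cells captured is integrated with the trapezoidal rule. The function name, the ten sample cells, and the use of 10 bins instead of 100 are hypothetical choices to keep the example small.

```python
def success_rate_auc(lhi, is_landslide, n_bins=100):
    """Cumulative landslide share versus ranked area share, and its area.

    lhi: hazard index per cell; is_landslide: 1 for landslide cells, else 0.
    Ties in lhi are ordered arbitrarily.
    """
    # Rank cell indices from the highest to the lowest hazard index.
    order = sorted(range(len(lhi)), key=lambda i: lhi[i], reverse=True)
    total_landslides = sum(is_landslide)
    n_cells = len(lhi)

    curve = [0.0]
    for b in range(1, n_bins + 1):
        cutoff = round(b * n_cells / n_bins)   # cells in the top b bins
        captured = sum(is_landslide[i] for i in order[:cutoff])
        curve.append(captured / total_landslides)

    # Trapezoidal rule with equal steps of 1 / n_bins on the x-axis.
    auc = sum((curve[k] + curve[k + 1]) / 2 for k in range(n_bins)) / n_bins
    return curve, auc


# Hypothetical hazard indices and landslide labels for ten cells.
lhi = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_landslide = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

curve, auc = success_rate_auc(lhi, is_landslide, n_bins=10)
```

An area close to 1 means most landslide cells sit in the highest-ranked part of the map, while an area near 0.5 is what a random ranking would give.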

Results and Discussion
In this study, different numbers of hidden neurons are applied to the MLP with the three configurations of input data. The MLP is implemented with and without pre-processing by normalization and the PCA scheme. AIC and ROC are used to evaluate the performance of the MLP.

The MT data configuration performs better than the other two data configurations because all MT data variables are either continuous data or frequency ratios, which gives a better range of variation. It also provides a better and more meaningful representation of landslide data. In the FR data configuration, by contrast, all variables are frequency ratios, which show only minor variations; that is why FR data has lower validation accuracies. The OR data configuration is a mixture of continuous and categorical/discrete values. This data has relatively more variation and produces higher validation accuracy than the FR data configuration but slightly lower validation accuracy than the MT data configuration.

Furthermore, all three data configurations are pre-processed, i.e., normalized and then passed through the PCA technique, before being fed into the MLP model to improve accuracy and eventually produce the hazard maps. It is also observed that the Mean-SD with PCA scheme performs better than the Min-Max [...] performance (Nhu et al., 2020; Tien Bui et al., 2020).

The AUC results obtained from the MLP model are compared with those of other related studies in Table 1. The MLP model contains one hidden layer between the input and output layers, and its hidden neurons play a vital role in predicting landslide hazards. In addition, the MT data configuration has the advantage of achieving better accuracy with a lower number of hidden neurons. Overall, the MLP model of the present study shows a higher AUC than those of the other studies.

Conclusion
The MT data configuration produces good validation accuracy with a lower number of hidden neurons than the OR and FR data configurations, with and without pre-processing. The MT data configuration gives the MLP more meaningful information to interpret than the other two data configurations. Pre-processing with normalization techniques and PCA improves the performance of the output. The best AUC values for the Mean-SD and Min-Max with PCA schemes are 96.72% and 96.38%, respectively. The MT data configuration with Mean-SD normalization and the PCA scheme gives the best results, with good accuracies as well as a smaller number of hidden neurons. The Mean-SD with PCA scheme also yields a less complex MLP model, with a smaller number of hidden neurons. Therefore, the MT data configuration with Mean-SD normalization and the PCA scheme is more robust and stable in training the MLP model for landslide prediction.