Implementation of Long-Short Term Memory Neural Network (LSTM) for Predicting The Water Quality Parameters in Sungai Selangor

Predictions of future events must be factored into decision-making. Predictions of water quality are critical in helping authorities make operational, management, and strategic decisions that keep the water supply within specified quality criteria. Taking advantage of the good performance of long short-term memory (LSTM) deep neural networks in time-series prediction, this paper develops and trains an LSTM neural network to predict water quality parameters in the Selangor River. The primary goal of the study is to predict five water quality parameters in the Selangor River, namely Biochemical Oxygen Demand (BOD), Ammonia Nitrogen (NH3-N), Chemical Oxygen Demand (COD), pH, and Dissolved Oxygen (DO), using secondary data from different monitoring stations along the river basin. The accuracy of the forecasts was measured using the Root Mean Square Error (RMSE). The results show that the pH (Power of Hydrogen) dataset yielded the lowest RMSE values, with a minimum of 0.2106 at station 004 and a maximum of 1.2587 at station 001. The predicted values of the model and the actual values were in good agreement and revealed the future trend of the water quality parameters, showing the feasibility and effectiveness of using LSTM deep neural networks to predict water quality parameters.


INTRODUCTION
Today, the growing human population has increased the demand for water and intensified anthropogenic activities such as land use, deforestation, industrialisation, transportation, solid waste generation and excess wastewater generation, which cumulatively change the natural structure of the planet. According to the Department of Statistics (2019), domestic and non-domestic metered water consumption in 2018 had risen by 6.2% and 14.3% respectively from 2014. The number of public sewage treatment plants in Malaysia increased by 5.5% between 2015 and 2018. Rapid development has created vast volumes of domestic, industrial, commercial and transportation waste, which eventually end up in water sources (Huang et al., 2015). The Selangor River Basin occupies an area of 2,200 km² or about 28% of Selangor, the most developed state in Malaysia (Santhi & Mustafa, 2012). Huge watersheds, however, pose many challenges to water quality monitoring and management, especially in multinational basins where regulatory mechanisms and goals for water resource management can vary (Bloesch et al., 2012). Effective river basin management requires consistent monitoring through the following four efforts: 1) identify patterns over time; 2) thoroughly consider the impacts of activities and their relationships in the watershed; 3) identify the impacts of downstream activities; and 4) the rest (Chapman et al., 2016).
Many techniques have been developed to track changes in water quality, such as Fuzzy Mathematics, 3S Engineering and ANN (Lee, 2018; Maier et al., 2010). Of these, the ANN methodology is well known for its applicability to unforeseen and non-linear circumstances when forecasting water quality (Liu et al., 2019). The Artificial Neural Network (ANN) is one of the most reliable and commonly used forecasting models, with effective applications in social, technological, engineering, foreign exchange, and stock problems (Khashei & Bijari, 2010). In the field of information, the neural network can overcome the limitations of conventional information processing by offering fair recognition and judgement (Wu & Feng, 2017). Beyond information technology, they also noted that ANN is widely used in medical care because of the variability and unpredictability of the human body and health conditions; the complex non-linear interactions in biological information make it well suited to ANN. ANN is also popular in water quality analysis.
Najah et al. (2012) used three separate Artificial Neural Network (ANN) simulation techniques to identify the optimum forecast of water quality parameters: the Logistic Regression Model (LRM), Multi-Layer Perceptron Neural Networks (MLP-NN) and the Radial Basis Function Neural Network (RBF-NN). In their study, the RBF-NN model was found to be the fastest computational model and increased the precision of predicting water quality parameters. The feed-forward ANN also facilitates fast simulation of the Water Quality Index (WQI) and makes it possible to recognise the relative significance of input variables to the model predictions (Gazzaz et al., 2012). Their analysis emphasised that ANN is an important instrument for river water quality evaluation that simplifies the computation of the WQI and saves significant effort and time by optimising the calculations. According to Hayder et al. (2020), with enough data points a good prediction of water quality parameters (WQP) can be obtained using a three-layered Feedforward Neural Network. However, Zhou et al. (2018) reported that the Long Short-Term Memory (LSTM) neural network, a newer type of recurrent neural network, converges to the optimal solution faster and more easily when dealing with time-series prediction. This is supported by Wang et al. (2017), who concluded that the LSTM neural network is the best method for predicting water quality parameters when compared with the online sequential extreme learning method and the back propagation neural network method. They compared the Root Mean Square Error (RMSE) values obtained from all three methods and found that the LSTM neural network consistently produced the lowest RMSE for all time steps.
This paper proposes a water quality prediction model based on LSTM deep neural networks to predict the water quality parameters measured at the automatic monitoring stations along Sungai Selangor, and then compares the predicted results with the measured data. The results show the potential of applying LSTM and deep learning to the prediction of water quality parameters.

METHODOLOGY
This section focuses on data collection and data analysis. The steps for formulating the model and measuring its accuracy are explained alongside.

Method of Data Collection
This analysis examines the water quality parameters in the Selangor River. The dataset used is therefore the time series data of five water quality parameters: DO, BOD, COD, pH and NH3-N. The data for this study were obtained from the Department of Environment Malaysia (DOE) and were collected from 10 monitoring stations along the Selangor River.

Method of Data Analysis
The data analysis in this research consists of three phases: pre-processing of the data, formulation of the LSTM model to predict water quality parameters in the Selangor River, and measurement of model accuracy.

Pre-processing Data
The data size for each station varies with data availability. The first 4 of the 10 stations, namely 2BSEL001, 2BSEL004, 2BSEL005 and 2BSEL010, each have 24 data points measured every two months over four years, from January 2016 to November 2019. The next 5 stations, specifically 2BSEL011, 2BSEL014, 2BSEL015, 2BSEL017 and 2BSEL018, each have 15 data points spanning July 2017 to November 2019. Finally, station 2BSEL023, the newest monitoring station along the river basin, has the fewest, with 13 data points recorded from November 2017 to November 2019. Figure 1 depicts the data size distribution for every station. The dataset used in this study is the WQP from the first 4 stations (2BSEL001, 2BSEL004, 2BSEL005 and 2BSEL010), because the remaining 6 stations are still considered new and have too few data points for prediction. To avoid inaccurate predictions, only these 4 stations are used for training. Shortage of time also contributed to the reduced number of stations trained in this study.
For the WQP dataset of the 4 stations, missing values were treated by linear interpolation in Microsoft Excel with the NumXL add-in installed. Linear interpolation is a curve fitting method that generates new data points within the range of a discrete set of known data points. With this method, a missing value at a particular time is filled in by taking the values before and after that time into account. Following the linear interpolation, statistical analysis was performed to examine the data characteristics before proceeding to the prediction phases. The statistics computed for each station are the mean, minimum, maximum, standard deviation, skewness, kurtosis, white noise, and stationarity of the data.
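The interpolation step can be illustrated with a minimal Python sketch. It is not the NumXL procedure used in the study: the function name `interpolate_missing` and the sample readings are hypothetical, and the samples are assumed to be equally spaced in time (as with bimonthly monitoring).

```python
from typing import List, Optional


def interpolate_missing(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior gaps (None) by linear interpolation between known neighbours.

    Leading or trailing gaps have no neighbour on one side and stay None.
    Samples are assumed to be equally spaced in time.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    # Walk over each pair of consecutive known samples and fill the gap between them.
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            fraction = (i - left) / gap
            filled[i] = values[left] + fraction * (values[right] - values[left])
    return filled


# Hypothetical bimonthly DO readings (mg/L) with three missing samples.
do_series = [6.0, None, 7.0, 6.4, None, None, 5.5]
print(interpolate_missing(do_series))
```

Each missing value is placed on the straight line joining the nearest known readings before and after it, so a single gap becomes the average of its two neighbours.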

LSTM Neural Network Model Formulation and Measurement of Accuracy
The Artificial Neural Network (ANN) is a technique inspired by the biology of the human brain and nervous system. It is a computational model composed of multiple computing components, each with a predefined activation function. To predict the water quality parameters, the LSTM method is implemented. Like a standard neural network, an LSTM network consists of an input layer that receives external data for pattern recognition, an output layer that produces the solution, and a hidden fully connected intermediary layer that lies between them (Wang et al., 2017). In this research, the network consisted of 4 layers. Because the data were time series, the first layer, the input layer, is a Sequence Input Layer. The second is the LSTM Layer, which learns long-term dependencies between time steps in a time series or data sequence. Several essential properties are set in this layer, including the number of hidden units (hidden size), the output format, and the input size. Another important LSTM property is the activation function, of which there are two types: the state activation function, which updates the cell and hidden states, and the gate activation function, which controls the gates in the LSTM. The third layer is the Fully Connected Layer, which multiplies its input by a weight matrix and then adds a bias vector. Finally, the Regression Layer forms the output layer and computes the half-mean-squared-error loss for regression tasks. The network design used in this study is depicted in Figure 2 and is taken from the MATLAB model configuration.
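The data flow through the four layers can be sketched in plain Python as a forward pass only. This is an illustration under stated assumptions, not the MATLAB configuration used in the study: the gate activation is assumed to be the sigmoid and the state activation the hyperbolic tangent (common defaults), the hidden size of 4, the random weights, and the sample series are hypothetical, and training is omitted.

```python
import math
import random
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]
# Per gate: (input weights W, recurrent weights R, bias b).
Params = Dict[str, Tuple[Matrix, Matrix, Vector]]
GATES = ("input", "forget", "candidate", "output")


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def affine(W: Matrix, x: Vector, R: Matrix, h: Vector, b: Vector) -> Vector:
    """Return W @ x + R @ h + b using plain Python lists."""
    return [
        sum(w * xi for w, xi in zip(W[k], x))
        + sum(r * hi for r, hi in zip(R[k], h))
        + b[k]
        for k in range(len(b))
    ]


def lstm_step(x: Vector, h: Vector, c: Vector, params: Params) -> Tuple[Vector, Vector]:
    """Advance the hidden state h and cell state c by one time step."""
    z = {name: affine(W, x, R, h, b) for name, (W, R, b) in params.items()}
    i = [sigmoid(v) for v in z["input"]]        # gate activation (assumed sigmoid)
    f = [sigmoid(v) for v in z["forget"]]       # gate activation
    o = [sigmoid(v) for v in z["output"]]       # gate activation
    g = [math.tanh(v) for v in z["candidate"]]  # state activation (assumed tanh)
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]  # state activation
    return h_new, c_new


def forward(sequence: List[Vector], params: Params, w_fc: Vector, b_fc: float) -> List[float]:
    """Sequence input -> LSTM layer -> fully connected layer (one response)."""
    h = [0.0] * len(w_fc)
    c = [0.0] * len(w_fc)
    outputs = []
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
        # Fully connected layer: weights times hidden state, plus bias.
        outputs.append(sum(w * hk for w, hk in zip(w_fc, h)) + b_fc)
    return outputs


def half_mse(predicted: List[float], target: List[float]) -> float:
    """Half-mean-squared-error over a sequence with a single response."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / (2 * len(target))


def init_params(input_size: int, hidden_size: int, rng: random.Random) -> Params:
    """Small random weights and zero biases (illustrative initialisation only)."""
    def matrix(rows: int, cols: int) -> Matrix:
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {
        name: (matrix(hidden_size, input_size), matrix(hidden_size, hidden_size), [0.0] * hidden_size)
        for name in GATES
    }


# Hypothetical standardised pH readings; the target at each step is the next reading.
series = [0.2, -0.1, 0.4, 0.3, -0.2]
inputs = [[v] for v in series[:-1]]
targets = series[1:]
rng = random.Random(0)
params = init_params(input_size=1, hidden_size=4, rng=rng)
w_fc = [rng.uniform(-0.5, 0.5) for _ in range(4)]
predictions = forward(inputs, params, w_fc, b_fc=0.0)
print(half_mse(predictions, targets))
```

Because the weights are random and untrained, the printed loss has no predictive meaning; the sketch only shows how the gate and state activations, the fully connected layer, and the regression loss fit together.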

FINDINGS AND DISCUSSIONS
This section presents and discusses the results of the LSTM model. All results use the DOE data from stations 001, 004, 005 and 010, since the number of data points for these four stations is sufficient for prediction. The remaining six stations are still considered new and have too few data points for prediction, so they were excluded to avoid inaccurate predictions; shortage of time also contributed to the reduced number of stations trained in this study. The five water quality parameters (WQP) considered in this study are Dissolved Oxygen (DO), Biochemical Oxygen Demand (BOD), Chemical Oxygen Demand (COD), pH and Ammonia Nitrogen (NH3-N). The model's ability to predict each water quality parameter is discussed by examining the smallest error measurement.

Pre-Processing Data
The data were analysed using Microsoft Excel with the NumXL add-in installed. The steps involved in this process are as follows.
Step 1: The data received from DOE were divided by station into 4 sets and arranged in ascending time order. Missing data were filled in using the Linear Interpolation Method. Figure 3 shows the linear interpolation performed in Microsoft Excel for each WQP value in Station 001.
Step 2: Following the Linear Interpolation Method, the statistical analysis of the data was performed. Tables 1 to 4 summarise the statistical analysis of the water quality parameters for stations 001, 004, 005 and 010 respectively. Each table reports the mean, minimum, maximum, standard deviation, skewness, kurtosis, white noise, and data stationarity.
As shown in Table 1, the standard deviations of BOD and COD in Station 001 are high, indicating that the data are widely dispersed or less reliable. The rest of the parameters have a low standard deviation, indicating that the values are clustered closely around the mean. All parameters except DO and pH are highly skewed, as their skewness is less than -1 or greater than 1. DO and pH are moderately skewed, as their values lie between -1 and -0.5. Furthermore, the excess kurtosis values for DO and pH are negative, indicating that the distributions are less peaked. The other parameters have positive excess kurtosis, which indicates the presence of outliers and makes prediction difficult.
As shown in Table 2, the standard deviations of BOD and COD in Station 004 are high, indicating that the data are widely dispersed or less reliable. The rest of the parameters have a low standard deviation, indicating that the values are clustered closely around the mean. The DO parameter for this station is fairly skewed, while the BOD and pH parameters are moderately skewed; the remaining parameters are highly skewed. All parameters have positive excess kurtosis, indicating the presence of outliers. All the parameters except BOD and COD are white noise.
As shown in Table 3, the standard deviations of BOD and COD in Station 005 are high, indicating that the data are widely dispersed or less reliable. The rest of the parameters have a low standard deviation, indicating that the values are clustered closely around the mean. Based on the skewness values, DO and pH are skewed, while the rest of the parameters are highly skewed because their values are less than -1 or greater than 1. Furthermore, the excess kurtosis value for pH is negative, signalling that the distribution is less peaked. The other parameters have positive excess kurtosis, which indicates the presence of outliers and makes prediction difficult.
As shown in Table 4, the standard deviations of BOD and COD in Station 010 are high, indicating that the data are widely dispersed or less reliable. The rest of the parameters have a low standard deviation, signifying that the values are clustered closely around the mean. Based on the skewness values, DO and pH are skewed, while the parameter SS is highly skewed. The rest of the parameters are moderately skewed because their values are between 0.5 and 1. Furthermore, the excess kurtosis value for pH is negative, indicating that the distribution is less peaked. The other parameters have positive excess kurtosis, which indicates the presence of outliers and makes prediction difficult.
Finally, all parameters in stations 001, 004, 005 and 010 are stationary.
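The descriptive statistics and the skewness thresholds used in this discussion can be sketched as follows. The sketch uses simple moment-based definitions of skewness and excess kurtosis; NumXL or Excel may apply small-sample corrections, so values need not match the tables exactly. The function names and the sample readings are hypothetical, and the label for skewness between -0.5 and 0.5 follows the common rule of thumb rather than a threshold stated in this paper.

```python
import math
from typing import Dict, List


def describe(values: List[float]) -> Dict[str, float]:
    """Moment-based summary statistics (requires at least two distinct values)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return {
        "mean": mean,
        "min": min(values),
        "max": max(values),
        "std": math.sqrt(m2 * n / (n - 1)),   # sample standard deviation
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,  # 0 for a normal distribution
    }


def skew_label(skewness: float) -> str:
    """Classify skewness with the thresholds used in the discussion above."""
    magnitude = abs(skewness)
    if magnitude > 1.0:
        return "highly skewed"
    if magnitude >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"  # assumed rule-of-thumb label


# Hypothetical BOD readings (mg/L) with one outlier.
bod = [1.0, 1.0, 1.0, 1.0, 10.0]
stats = describe(bod)
print(skew_label(stats["skewness"]), stats["excess_kurtosis"] > 0)
```

A single large reading is enough to push the skewness above 1 and make the excess kurtosis positive, which is the pattern the discussion associates with outliers.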

Forecast Values Using the Trained LSTM
In this study, the data that had gone through the linear interpolation method were used to train the LSTM Neural Network in MATLAB. After the LSTM model was trained, the values of all WQP were predicted using the trained model. The following table summarises the forecasted values of WQP in all stations. The predicted values were in good agreement with the actual values, indicating that the model performed well in predicting the water quality parameters. Our result reveals the potential of applying LSTM and deep learning to predict drinking water quality, which can provide a reliable foundation for formulating water source protection policies and concrete measures.

RMSE for the LSTM Network
Table 9 summarises the RMSE values obtained from the LSTM predictions for all WQP in all stations.
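For reference, the RMSE is the square root of the mean of the squared differences between actual and predicted values. A minimal Python sketch is given below; the function name `rmse` and the sample readings are hypothetical and are not taken from the study's results.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root Mean Square Error between two equally long, non-empty sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical pH readings and forecasts for four time steps.
actual_ph = [6.8, 7.1, 6.9, 7.0]
predicted_ph = [6.9, 7.0, 7.1, 6.8]
print(rmse(actual_ph, predicted_ph))
```

The RMSE is expressed in the same unit as the parameter itself, so values for parameters measured on different scales (for example pH and COD) are not directly comparable.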

CONCLUSION AND RECOMMENDATIONS
Many important factors should be considered when developing a neural network, such as the ANN parameters and the design, which includes the number of layers, hidden units, epochs, and activation functions. In this study, the LSTM model was designed and trained to predict WQP at all four monitoring stations along the Selangor River using the parameters described in the methodology. The model trained with the pH dataset consistently produced the lowest RMSE, with a minimum of 0.2106 at Station 004. The established prediction model can be retrained automatically on different water quality data samples and thus has broad application scenarios. The results show that the model can predict future water quality parameters, offering a feasible approach to water quality prediction.
Several other methods can be used to forecast the WQP in the Selangor River. Examples include Regression Analysis (RA), Grey Systems (GS), Support Vector Regression (SVR), and other ANN models such as the Feedforward Neural Network, Backpropagation Neural Network, Non-Linear Input Variable Selection (IVS) algorithm, and Multi-Layer Perceptron Neural Network (MLP-NN). Future researchers can compare two or more of these methods for forecasting WQP, using datasets from the Selangor River and from river basins worldwide, to determine the best prediction method for different locations. Future researchers can also tune the ANN parameters to achieve a more accurate model, for example by changing the ratio of training to test data, using a different activation function, or increasing or decreasing the number of epochs and hidden units.