-
Sichuan Province, China, is selected as the study area. Located in southwestern China, Sichuan has the most complex topography of any Chinese province, with high terrain in the west and low terrain in the east, sloping from northwest to southeast (Xie and Wang [33]; Lu et al. [34]). The altitude difference between the highest and lowest points in Sichuan exceeds 7300 m, and the terrain is strongly undulating (Huang et al. [35]).
Meteorologically, Sichuan is generally divided into three regions for analysis: the Western Sichuan Plateau, the Sichuan Basin (central and eastern Sichuan), and the Panxi Area (southwestern Sichuan) (Luo et al. [36]; Zeng et al. [37]). The Western Sichuan Plateau lies on the eastern side of the Qinghai-Tibet Plateau, with an average altitude of more than 4000 m (Zhang [38]). The Sichuan Basin, the central region of the province, contains 17 cities and typically lies at altitudes of 500 m or lower (Chen and Xie [39]). The Panxi Area, with an average altitude of 1300 m, is part of the Yunnan-Guizhou Plateau (Li et al. [40]).
-
The CMPAS product is the target of the revision, and the inputs to the revision are the HRCLDAS temperature and wind speed products, DEM data, and automatic weather station data. Data from October 2020, January 2021, April 2021, and July 2021 are used to represent autumn, winter, spring, and summer, respectively. The heavy precipitation event of June 12 to 13, 2021 is selected for the case study, and hydrological station data from July 2021 are used for an independent analysis.
The CMPAS and HRCLDAS datasets are provided by the NMIC, CMA (China Meteorological Administration), with a spatial resolution of 0.01° × 0.01° (original resolution: 1 km) and an hourly temporal resolution. The China Meteorological Administration Multisource Precipitation Analysis System_Real Time (CMPAS_RT) product, a real-time radar-satellite-gauge merged precipitation product, is selected from the CMPAS products for this study.
The hourly surface precipitation, temperature, and wind speed data are collected from 1899 national and regional automatic weather stations in Sichuan Province and are provided by the CMA and the Sichuan Meteorological Service.
Topographic factors such as slope, slope direction, slope variability, and surface roughness at the CMPAS_RT grid points and at all weather stations are extracted from the 90-m resolution DEM data released by the NMIC.
Hourly precipitation data from hydrological stations are obtained through real-time data sharing with the Sichuan Provincial Water Resources Departments. After collaborative quality control against adjacent meteorological stations and radar products, 449 stations are selected for the independent evaluation.
-
Meteorological stations and CMPAS grids are spatially matched by nearest-neighbor interpolation. Wu et al. [21] demonstrated that this approach produces better results for CMPAS evaluation, and it also suits the local and scattered nature of precipitation. The errors between station precipitation and grid precipitation are then calculated. BIAS, Mean Bias (MB), Root Mean Square Error (RMSE), Relative Error (RE), and Correlation Coefficient (COR) are the primary evaluation metrics. The MB reflects the average deviation of the grid values from the observed values, the RMSE reflects the degree of dispersion of the data, the RE reflects the accuracy of the grid values relative to the observations, and the COR shows the degree of correlation between the grid and observed values. The BIAS, MB, RMSE, RE, and COR are calculated as
$$\mathrm{BIAS}=G_i-O_i,$$ (1)
$$\mathrm{MB}=\frac{1}{N} \sum_{i=1}^N\left(G_i-O_i\right),$$ (2)
$$\mathrm{RMSE}=\sqrt{\frac{1}{N} \sum_{i=1}^N\left(G_i-O_i\right)^2},$$ (3)
$$\mathrm{RE}=\frac{1}{N} \sum_{i=1}^N \frac{\left|G_i-O_i\right|}{O_i},$$ (4)
$$\mathrm{COR}=\frac{\sum_{i=1}^N\left(G_i-\bar{G}\right)\left(O_i-\bar{O}\right)}{\sqrt{\sum_{i=1}^N\left(G_i-\bar{G}\right)^2} \sqrt{\sum_{i=1}^N\left(O_i-\bar{O}\right)^2}},$$ (5)
where $O_i$ is the station observation value, $G_i$ is the value obtained by interpolating the CMPAS product to the station, $\bar{O}$ and $\bar{G}$ are the corresponding means over all samples, and $N$ is the total number of samples (number of stations).
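As an illustration only, Eqs. (1)-(5) can be sketched in plain Python as follows. The function name `evaluate` and the sample values are hypothetical, and skipping samples with $O_i=0$ in the RE is an assumption made here to avoid division by zero; the text does not state how such samples are handled.

```python
import math

def evaluate(grid, obs):
    """MB, RMSE, RE and COR for paired grid/station values, following Eqs. (2)-(5)."""
    n = len(obs)
    bias = [g - o for g, o in zip(grid, obs)]            # Eq. (1), one value per sample
    mb = sum(bias) / n                                   # Eq. (2)
    rmse = math.sqrt(sum(b * b for b in bias) / n)       # Eq. (3)
    # Eq. (4) divides by O_i; samples with O_i = 0 are skipped here (an assumption).
    wet = [(g, o) for g, o in zip(grid, obs) if o > 0]
    re = sum(abs(g - o) / o for g, o in wet) / len(wet)
    g_bar, o_bar = sum(grid) / n, sum(obs) / n
    cov = sum((g - g_bar) * (o - o_bar) for g, o in zip(grid, obs))
    var_g = sum((g - g_bar) ** 2 for g in grid)
    var_o = sum((o - o_bar) ** 2 for o in obs)
    cor = cov / math.sqrt(var_g * var_o)                 # Eq. (5)
    return {"MB": mb, "RMSE": rmse, "RE": re, "COR": cor}

# Made-up hourly values (mm h-1) at four stations.
grid = [1.0, 2.0, 4.0, 0.5]
obs = [1.0, 2.5, 3.5, 1.0]
scores = evaluate(grid, obs)
print(scores)
```

In this toy case the MB is negative (-0.125 mm h-1), which corresponds to the grid values underestimating the observations on average.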
The traditional score is based on the classification of observations and CMPAS_RT, as shown in Table 1.
| Type | OBSERVATIONS TRUE | OBSERVATIONS FALSE |
| --- | --- | --- |
| CMPAS_RT TRUE | NA | NB |
| CMPAS_RT FALSE | NC | ND |

Table 1. Precipitation dichotomy.
The Threat Score(TS), Probability of Detection(POD), Missing Alarm Rate(MR), and False Alarm Rate(FAR)are calculated as:
$$\mathrm{TS}=\frac{\mathrm{NA}}{\mathrm{NA}+\mathrm{NB}+\mathrm{NC}},$$ (6)
$$\mathrm{POD}=\frac{\mathrm{NA}}{\mathrm{NA}+\mathrm{NC}},$$ (7)
$$\mathrm{MR}=\frac{\mathrm{NC}}{\mathrm{NA}+\mathrm{NC}},$$ (8)
$$\mathrm{FAR}=\frac{\mathrm{NB}}{\mathrm{NA}+\mathrm{NB}}.$$ (9)
-
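The dichotomy of Table 1 and the scores of Eqs. (6)-(9) can be sketched as below. This is a minimal illustration: the function name `categorical_scores`, the sample values, and the rain/no-rain threshold of 0.1 mm h-1 are assumptions introduced here, since the text does not state the threshold used.

```python
def categorical_scores(grid, obs, threshold=0.1):
    """Count NA, NB, NC, ND (Table 1) and return TS, POD, MR, FAR (Eqs. 6-9)."""
    na = nb = nc = nd = 0
    for g, o in zip(grid, obs):
        grid_rain = g >= threshold   # CMPAS_RT TRUE
        obs_rain = o >= threshold    # OBSERVATIONS TRUE
        if grid_rain and obs_rain:
            na += 1                  # hit
        elif grid_rain:
            nb += 1                  # false alarm
        elif obs_rain:
            nc += 1                  # miss
        else:
            nd += 1                  # correct negative
    # Note: the ratios are undefined if no rain is observed or analysed at all.
    return {
        "NA": na, "NB": nb, "NC": nc, "ND": nd,
        "TS": na / (na + nb + nc),
        "POD": na / (na + nc),
        "MR": nc / (na + nc),
        "FAR": nb / (na + nb),
    }

# Made-up hourly values (mm h-1) for six station-hours.
grid = [0.5, 0.0, 1.2, 0.3, 0.0, 2.0]
obs = [0.4, 0.2, 1.0, 0.0, 0.0, 1.5]
result = categorical_scores(grid, obs)
print(result)
```

By construction POD and MR sum to one, which is why the improvement in POD and the decrease in MR always mirror each other.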
The BIAS between the CMPAS_RT product and the station rain gauge data serves as the target variable of the machine learning correction.
Temperature and wind speed are taken as the meteorological factors. Slope, slope variability, slope direction, and surface roughness, extracted from the DEM data, are the topographic factors. Let $x_{ij}$ denote the value of the jth factor in the ith set of variables associated with precipitation. The meteorological and topographic factors are standardized (Chen et al. [32]) as
$$y_{i j}=\frac{x_{i j}-\overline{x_j}}{S_j},$$ (10)
where $y_{ij}$ is the standardized factor value, $x_{ij}$ is the original factor value, $\overline{x_j}$ is the arithmetic mean of the jth factor, and $S_j$ is the sample standard deviation of the jth factor.
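A minimal sketch of Eq. (10), applied column by column, is given below. The function name `standardize` and the two-factor sample rows are hypothetical; `statistics.stdev` is used because it returns the sample standard deviation (divisor n-1).

```python
from statistics import mean, stdev

def standardize(rows):
    """Apply Eq. (10) to every factor (column) of a list of samples (rows)."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]    # arithmetic mean of each factor
    stds = [stdev(c) for c in columns]    # sample standard deviation S_j
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Made-up samples: [temperature, wind speed] at three stations.
rows = [[10.0, 2.0], [12.0, 4.0], [14.0, 9.0]]
z = standardize(rows)
print(z)
```

After this step each factor has zero mean and unit sample standard deviation, so factors with different units can enter the principal component analysis on an equal footing.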
All the standardized impact factors are transformed into several mutually uncorrelated principal components using Principal Component Analysis (Abdi and Williams [41]; Lasisi and Attoh-Okine [42]). The implementation of Principal Component Analysis in this study is based on the scikit-learn library for Python, and the main principles are as follows:
First, the contribution of each principal component to the precipitation results is calculated. Then, by computing the loadings between the impact factors and the principal components, the contribution of each impact factor to the precipitation results is analyzed. Finally, the leading principal components that together reach a cumulative contribution of 90% are selected for the machine learning revision.
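The three steps above can be sketched as follows. The study itself uses scikit-learn; the version below is an assumption-laden stand-in written with NumPy only, so that each step is visible. The function name `pca_select` and the synthetic six-factor data are hypothetical.

```python
import numpy as np

def pca_select(X, target=0.90):
    """Return component scores, contribution ratios and loadings for the
    leading components whose cumulative contribution reaches `target`."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # Eq. (10)
    corr = np.cov(Z, rowvar=False)                     # correlation matrix of the factors
    eigvals, eigvecs = np.linalg.eigh(corr)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]                  # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = eigvals / eigvals.sum()                    # contribution of each component
    # Smallest k whose cumulative contribution is >= target.
    k = min(int(np.searchsorted(np.cumsum(ratio), target)) + 1, len(ratio))
    scores = Z @ eigvecs[:, :k]                        # principal component values
    loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])   # factor-component correlations
    return scores, ratio, loadings

# Synthetic stand-in for six factors driven by three hidden variables.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 6)) + 0.1 * rng.normal(size=(300, 6))
scores, ratio, loadings = pca_select(X)
print(scores.shape[1], np.round(np.cumsum(ratio), 3))
```

Because the factors are standardized first, each loading is the correlation between a factor and a component, which is one common way to read off how strongly a factor contributes.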
-
Parameter experiments on the machine learning models are conducted using grid search (Bergstra and Bengio [43]) and k-fold cross-validation (Refaeilzadeh et al. [44]). Grid search optimizes model performance by traversing every combination in a given set of parameter values. Each parameter combination is scored by k-fold cross-validation, and the scores are compared to select the optimal parameters.
The whole training set is divided evenly into k parts. Each part is used in turn as the validation set, while the remaining k−1 parts are used as the cross-validation training set. Under the current parameter setting this yields k models; each is evaluated on its corresponding validation set to produce k accuracy indicators, which are then averaged to give the score for that setting. The optimal parameters of the model are those with the best score.
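The procedure can be sketched in plain Python as below. This is only an illustration of grid search with k-fold cross-validation: the toy nearest-neighbour regressor, the parameter grid, the mean squared error score, and all function names are hypothetical and are not the models or settings used in the study.

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train, x, n_neighbors):
    """Toy model: mean target of the n_neighbors closest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:n_neighbors]
    return sum(y for _, y in nearest) / len(nearest)

def cv_score(data, n_neighbors, k=5):
    """Average validation MSE over the k folds for one parameter value."""
    fold_mse = []
    for held_out in kfold_indices(len(data), k):
        held = set(held_out)
        train = [data[i] for i in range(len(data)) if i not in held]
        errors = [(knn_predict(train, data[i][0], n_neighbors) - data[i][1]) ** 2
                  for i in held_out]
        fold_mse.append(sum(errors) / len(errors))
    return sum(fold_mse) / k

def grid_search(data, grid, k=5):
    """Score every candidate and keep the one with the lowest average MSE."""
    scores = {p: cv_score(data, p, k) for p in grid}
    return min(scores, key=scores.get), scores

# Made-up data: a noisy quadratic relationship.
rng = random.Random(1)
data = [(x / 10, (x / 10) ** 2 + rng.gauss(0, 0.5)) for x in range(100)]
best, scores = grid_search(data, grid=[1, 5, 15, 60])
print(best, scores)
```

The same folds are reused for every candidate so that the scores differ only because of the parameter value, not because of the split.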
-
Following the results of the experiments, three ensemble learning methods (Sagi and Rokach [45]) are chosen for the revision. Ensemble learning addresses a single prediction problem by building several models. Its working principle is to generate multiple classifiers or models that learn and predict independently (Dong et al. [46]). Their predictions are then combined into a single prediction, which generally outperforms that of any individual model.
-
Random forests (Belgiu and Drǎgut [47]) draw multiple resampled datasets from the original sample, fit a decision tree to each, and average the predictions of the decision trees to obtain the final prediction. First, the dataset is defined as
$$D=\left\{\left(x_m, y_m\right), m=1, 2, \cdots, n\right\},$$ (11)
where $y_m$ is the bias between the CMPAS value interpolated to the station and the station rain gauge value, and $x_m$ is the corresponding set of principal components.
A training dataset $D_j$ is then drawn at random from the dataset $D$. A random forest is created by repeatedly training N decision trees $h_i$, each built with a random subspace approach in which the best feature for each split is chosen from a random subset of the features. The average of the predictions of all decision trees is the final output.
-
AdaBoost (Cao et al. [48]; Rätsch et al. [49]) is an abbreviation for 'Adaptive Boosting'. The AdaBoost regression algorithm can be briefly described in three steps:
First, initialize the weights. For the dataset D (Eq. 11) with n samples, the weight of each sample $x_m$ is initialized to 1/n. The weighted dataset $D_1$ is used to train the first weak learner $h_1$, and $D_t$ is used to train the tth weak learner $h_t$.
Second, repeat the loop T times, with the weak learner of iteration t denoted by $h_t$, t=1, 2, 3, …, T. In each iteration, the error rate of the learner $h_t$ is calculated, the weight of the learner is set according to the size of the error, and the weight distribution of the samples is then updated. In this way, the entire training procedure is iterated.
Third, after T rounds of iterations the learners are ranked according to their weights, and the learner at the weighted median is chosen to give the final output.
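The weight update and the median-based combination can be sketched as follows. The text does not name the exact variant, so this sketch assumes the AdaBoost.R2 scheme with a linear loss; the function names and the sample numbers are hypothetical.

```python
import math

def update_sample_weights(weights, abs_errors):
    """One round of an AdaBoost.R2-style update (linear loss, assumed variant).

    `weights` must sum to 1, at least one error must be non-zero, and the
    weighted average loss must be below 0.5. Returns the new normalized
    sample weights and the weight assigned to the current learner.
    """
    max_err = max(abs_errors)
    losses = [e / max_err for e in abs_errors]             # scaled to [0, 1]
    avg_loss = sum(w * l for w, l in zip(weights, losses))
    beta = avg_loss / (1.0 - avg_loss)                     # small beta = accurate learner
    raw = [w * beta ** (1.0 - l) for w, l in zip(weights, losses)]
    total = sum(raw)
    return [w / total for w in raw], math.log(1.0 / beta)

def weighted_median(predictions, learner_weights):
    """Prediction of the learner at which the cumulative weight reaches half."""
    order = sorted(range(len(predictions)), key=lambda i: predictions[i])
    half = 0.5 * sum(learner_weights)
    running = 0.0
    for i in order:
        running += learner_weights[i]
        if running >= half:
            return predictions[i]

# Three samples with equal initial weights and made-up absolute errors.
new_weights, learner_weight = update_sample_weights([1 / 3] * 3, [0.0, 0.5, 2.0])
# Three learners predicting the bias at one station, with made-up learner weights.
final = weighted_median([1.0, 3.0, 2.0], [1.0, 4.0, 2.0])
print(new_weights, learner_weight, final)
```

Samples with larger errors receive larger weights in the next round, so later learners concentrate on the cases that earlier learners handled poorly.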
-
Bootstrap aggregating, also known as the Bagging algorithm (Bauer and Kohavi [50]), serves as the foundation for more sophisticated algorithms such as Random Forest. Samples are drawn with replacement from the original dataset D (Eq. 11), and this is repeated t times to obtain t Bootstrap resampled datasets. A weak learner is then trained on each resampled dataset, giving a total of t weak learners for regression. The final result is the mean of the predictions of these t weak learners.
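A minimal sketch of this procedure is shown below. For brevity the weak learner is a one-variable least-squares line rather than a decision tree; that substitution, the function names, and the synthetic data are all assumptions made for illustration.

```python
import random
from statistics import mean

def fit_line(sample):
    """Weak learner (stand-in for a decision tree): least-squares line."""
    xs, ys = zip(*sample)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in sample)
    slope = sxy / sxx if sxx else 0.0
    return lambda x: y_bar + slope * (x - x_bar)

def bagging_fit(data, n_learners=25, seed=0):
    """Train one weak learner on each Bootstrap resampled dataset."""
    rng = random.Random(seed)
    learners = []
    for _ in range(n_learners):
        sample = [rng.choice(data) for _ in data]   # draw n samples with replacement
        learners.append(fit_line(sample))
    return learners

def bagging_predict(learners, x):
    """Final result: mean of the weak learners' predictions."""
    return mean(h(x) for h in learners)

# Made-up data following y = 2x + 1 with small noise.
noise = random.Random(2)
data = [(x / 5, 2 * (x / 5) + 1 + noise.gauss(0, 0.3)) for x in range(50)]
learners = bagging_fit(data)
prediction = bagging_predict(learners, 5.0)
print(prediction)
```

Random Forest follows the same resample-and-average pattern, with decision trees as weak learners and an additional random choice of features at each split.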
2.1. Study area
2.2. Data
2.3. Analysis methods
2.4. Construction of the revised model
2.4.1. DATA PREPROCESSING
2.4.2. MODEL TRAINING AND VALIDATION
2.5. Machine learning methods
2.5.1. RANDOM FOREST REGRESSION
2.5.2. ADABOOST REGRESSION
2.5.3. BAGGING REGRESSION
-
The overall evaluation scores before and after the revision are displayed in Table 2. For the original CMPAS_RT product, the TS is 0.910, the POD is 0.952, the MR is 0.048, and the FAR is 0.047, indicating that its precipitation accuracy is already high. The three machine learning algorithms give comparable revised results, with a TS of about 0.94, a POD of 0.984, an MR of 0.016, and a FAR of about 0.046. All indicators improve to varying degrees, with the most notable change being a 66% decrease in MR.
| Product | TS | POD | MR | FAR |
| --- | --- | --- | --- | --- |
| CMPAS_RT | 0.910 | 0.952 | 0.048 | 0.047 |
| Random Forest Regression | 0.940 | 0.984 | 0.016 | 0.046 |
| AdaBoost Regression | 0.941 | 0.984 | 0.016 | 0.045 |
| Bagging Regression | 0.941 | 0.984 | 0.016 | 0.046 |

Table 2. TS, POD, MR, and FAR for CMPAS_RT and revised products.
The hourly precipitation is subdivided by class, and Table 3 shows that the RMSE rises as the amount of precipitation increases. The revised RMSE is lower than that of CMPAS_RT in nearly every class (the only exception is AdaBoost Regression in the 5-9.9 mm h-1 class, where it is unchanged at 1.614), and Random Forest Regression gives the lowest RMSE in each class.
| Product / Precipitation (mm h-1) | 0.1-1.9 | 2-4.9 | 5-9.9 | 10-19.9 | ≥20 |
| --- | --- | --- | --- | --- | --- |
| CMPAS_RT | 0.284 | 0.846 | 1.614 | 3.463 | 6.495 |
| Random Forest Regression | 0.274 | 0.828 | 1.597 | 3.414 | 6.239 |
| AdaBoost Regression | 0.277 | 0.834 | 1.614 | 3.420 | 6.275 |
| Bagging Regression | 0.281 | 0.833 | 1.605 | 3.456 | 6.299 |

Table 3. RMSE of different precipitation levels for CMPAS_RT and revised products.
-
CMPAS_RT and the revised products from the three methods are statistically evaluated against rain gauge precipitation from 1899 automatic weather stations across Sichuan. Figs. 1-4 depict the spatial distribution of the indicators. The number of stations falling in each range of each indicator before and after the revision is also counted (Fig. 5).
Figure 1. MB distribution of CMPAS_RT and revised products from each station(a: CMPAS_RT; b: Random Forest Regression; c: AdaBoost Regression; d: Bagging Regression).
Figure 2. RMSE distribution of CMPAS_RT and revised products from each station(a: CMPAS_RT; b: Random Forest Regression; c: AdaBoost Regression; d: Bagging Regression).
Figure 3. RE distribution of CMPAS_RT and revised products from each station. (a: CMPAS_RT; b: Random Forest Regression; c: AdaBoost Regression; d: Bagging Regression).
Figure 4. COR distribution of CMPAS_RT and revised products from each station(a: CMPAS_RT; b: Random Forest Regression; c: AdaBoost Regression; d: Bagging Regression).
Figure 5. Comparison of indicators of CMPAS_RT with revised products (a: MB; b: RMSE; c: RE; d: COR).
The overall revised results are similar for all three methods. The MB of CMPAS_RT is mainly concentrated in the range of -0.01 to 0.01 mm h-1, with 980 stations having an MB of less than 0 mm h-1, indicating that CMPAS_RT underestimates precipitation at just over half of the stations. The three methods greatly reduce the MB between the products and the stations. Following the adjustment, the MB of roughly 800 stations is between -0.005 and 0 mm h-1, and that of nearly 600 stations is between 0 and 0.005 mm h-1. The number of stations with MB between -0.01 mm h-1 and 0.01 mm h-1 is reduced by about 30%. Overall, the revision reduces the MB at 95% of the stations, and at 85% of the stations the MB is reduced by more than 30%.
For the RMSE, the stations with a large RMSE in CMPAS_RT are mostly concentrated in the basin, with the RMSE of 939 stations falling in the range of 0.12 to 0.2 mm h-1. The three approaches produce comparable spatial distributions of the corrected RMSE. After correction, the RMSE of roughly 900 stations is between 0 and 0.04 mm h-1, and it is greatly reduced throughout the basin. The RMSE decreases at 82% of the stations, and at 80% of the stations it decreases by more than 20%. This demonstrates that the three revision techniques are effective at reducing the dispersion of the data.
The stations with a large RE in CMPAS_RT are mostly concentrated within the Sichuan Basin, with 54% of stations having an RE of 0.15 to 0.35. The revised RE within the basin is significantly lower, with 547 stations having an RE of 0 to 0.05 and 50% having an RE of less than 0.1. In brief, the RE is lowered at 77% of the stations in Sichuan, and at 62% of the stations it is reduced by more than 20%.
Most of the stations in Sichuan have a COR greater than 0.85, indicating that CMPAS_RT is strongly correlated with the non-independent stations. After the revision, stations with a COR of 0.9 or higher account for 70% of the total. Since the COR is already high before the correction, it rises only slightly after the revision.
-
For analysis, Sichuan Province is split into three regions: the Western Sichuan Plateau, the Panxi Area, and the Sichuan Basin. The RE of CMPAS_RT and the revised products for the different regions is shown in Table 4. Owing to the large number of stations in the Sichuan Basin and the complicated station environment, the average RE of CMPAS_RT is largest in the Sichuan Basin (0.380) and smallest on the Western Sichuan Plateau (0.266). AdaBoost Regression has the best correction effect on the RE across Sichuan, giving 0.136 for the Western Sichuan Plateau, 0.184 for the Panxi Area, and 0.169 for the Sichuan Basin. The PDF of the RE change rate for the revised CMPAS_RT products in the various regions is shown in Fig. 6a-6c. The Western Sichuan Plateau shows the most obvious effect of the revision, with the RE reduced by more than 90% at about 30% of the stations. For the Panxi Area and the Sichuan Basin, Bagging Regression performs best in the 70%-80% RE reduction range, which also contains the largest proportion of stations, at 20% and 23%, respectively.
| Region | CMPAS_RT | Random Forest Regression | AdaBoost Regression | Bagging Regression |
| --- | --- | --- | --- | --- |
| Western Sichuan Plateau | 0.266 | 0.140 | 0.136 | 0.154 |
| Panxi Area | 0.357 | 0.184 | 0.184 | 0.193 |
| Sichuan Basin | 0.380 | 0.170 | 0.169 | 0.175 |

Table 4. RE for CMPAS_RT and revised products in different regions.
Figure 6. PDF of indicators' change rate of the revised CMPAS_RT products in different regions (a, b, c: RE; d, e, f: RMSE; g, h, i: COR).
Table 5 shows the RMSE of CMPAS_RT and the revised products for different regions, which is greater in the Sichuan Basin than in the Panxi area or the Western Sichuan Plateau. With a reduction of 61% and 62% in the RMSE for the Panxi Area and the Sichuan Basin, respectively, Bagging Regression performed marginally better than the other approaches. The three machine learning techniques produced comparable outcomes for the Western Sichuan Plateau, showing a significant 77% reduction.
| Region | CMPAS_RT | Random Forest Regression | AdaBoost Regression | Bagging Regression |
| --- | --- | --- | --- | --- |
| Western Sichuan Plateau | 0.141 | 0.032 | 0.032 | 0.033 |
| Panxi Area | 0.197 | 0.079 | 0.080 | 0.076 |
| Sichuan Basin | 0.241 | 0.093 | 0.093 | 0.091 |

Table 5. RMSE (mm h-1) for CMPAS_RT and revised products in different regions.
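The percentage reductions quoted for Table 5 can be reproduced with the short calculation below. It assumes that the reduction is defined as the relative decrease with respect to the CMPAS_RT value, which the text does not state explicitly; the function name `reduction_rate` is hypothetical.

```python
def reduction_rate(before, after):
    """Relative decrease of an error metric, in percent of the original value."""
    return (before - after) / before * 100.0

# (CMPAS_RT RMSE, revised RMSE) in mm h-1, taken from Table 5.
cases = {
    "Western Sichuan Plateau, Random Forest Regression": (0.141, 0.032),
    "Panxi Area, Bagging Regression": (0.197, 0.076),
    "Sichuan Basin, Bagging Regression": (0.241, 0.091),
}
rates = {name: reduction_rate(b, a) for name, (b, a) in cases.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
```

Under this definition the three cases give roughly 77%, 61%, and 62%, matching the figures quoted for the Western Sichuan Plateau, the Panxi Area, and the Sichuan Basin.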
The PDF of the RMSE change rate is shown in Fig. 6d-6f. On the Western Sichuan Plateau, the RMSE is decreased by more than 90% at roughly 30% of the stations, and the Random Forest revision performs best in this interval. In addition, the RMSE is reduced by 70%-80% at approximately 20% of the stations, and AdaBoost Regression is optimal in this interval. For the Panxi Area, 20% of the stations have a 70%-80% reduction in RMSE. Bagging Regression works best in the interval where the RMSE is decreased by 60%-70%, while Random Forest Regression performs best in the interval where the RMSE is decreased by 30%-60%. The most significant outcome of the correction is the reduction in RMSE by more than 70% at approximately 70% of the stations on the Western Sichuan Plateau.
The COR for the various regions is given in Table 6. Owing to the lower precipitation there, the original COR is highest for the Western Sichuan Plateau at 0.925, and it is lowest for the Panxi Area at 0.857. The PDF of the COR change rate for the revised products is displayed in Fig. 6g-6i. Random Forest has a good effect on the COR: more than 50% of the stations on the Western Sichuan Plateau show a rise of roughly 0-10% in COR, and a small number of stations in the Panxi Area show an increase of more than 10%. In the Sichuan Basin, the COR varies less.
| Region | CMPAS_RT | Random Forest Regression | AdaBoost Regression | Bagging Regression |
| --- | --- | --- | --- | --- |
| Western Sichuan Plateau | 0.925 | 0.941 | 0.938 | 0.934 |
| Panxi Area | 0.857 | 0.892 | 0.881 | 0.884 |
| Sichuan Basin | 0.917 | 0.921 | 0.920 | 0.918 |

Table 6. COR for CMPAS_RT and revised products in different regions.
-
The RE of CMPAS_RT and the revised products for each season is compared in Table 7. CMPAS_RT has the largest RE in summer (0.590) and the smallest RE in winter (0.208). Bagging Regression performs somewhat better in spring and summer, Random Forest Regression is marginally better in autumn, and the three methods give similar results in winter. The average RE is lowered by 80% in winter, 62% in spring, 60% in summer, and 22% in autumn. Since the three machine learning revision results are comparable for all seasons, the Random Forest revisions are chosen for further analysis (Fig. 7). Fig. 7a-7c show that the decrease in RE is smallest in autumn, and that for the Sichuan Basin the RE reduction is mostly concentrated at around 20%.
| Season | CMPAS_RT | Random Forest Regression | AdaBoost Regression | Bagging Regression |
| --- | --- | --- | --- | --- |
| Spring | 0.298 | 0.117 | 0.114 | 0.112 |
| Summer | 0.590 | 0.250 | 0.238 | 0.234 |
| Autumn | 0.285 | 0.221 | 0.226 | 0.222 |
| Winter | 0.208 | 0.040 | 0.040 | 0.041 |

Table 7. RE of CMPAS_RT and revised product for different seasons.
Figure 7. PDF of RE, RMSE, and COR change rate of the revised CMPAS_RT products in different seasons (a, b, c: RE; d, e, f: RMSE; g, h, i: COR).
The RMSE is presented in Table 8, with the largest RMSE of 0.541 mm h-1 in summer and the smallest RMSE of 0.031 mm h-1 in winter. The overall effect of the three revision approaches is similar in all seasons, with a considerable reduction in RMSE. The reduction in RMSE is largest in winter (71%), followed by spring (69%) and summer (65%), and smallest in autumn (31%).
| Season | CMPAS_RT | Random Forest Regression | AdaBoost Regression | Bagging Regression |
| --- | --- | --- | --- | --- |
| Spring | 0.123 | 0.038 | 0.038 | 0.038 |
| Summer | 0.541 | 0.190 | 0.193 | 0.194 |
| Autumn | 0.138 | 0.064 | 0.065 | 0.066 |
| Winter | 0.031 | 0.009 | 0.009 | 0.010 |

Table 8. RMSE (mm h-1) of CMPAS_RT and revised product for different seasons.
The PDF of the RMSE change rate for the modified CMPAS_RT products for each region during different seasons is shown in Fig. 7d-7f. For the majority of stations and during all seasons, the RMSE drops by over 90% for the Western Sichuan Plateau. For stations in the Panxi Area, the reduction in RMSE is greater in the spring, summer, and winter months than in the autumn. The majority of the stations in the Sichuan Basin see a decrease of more than 90% in RMSE in spring and summer, a reduction concentrated in the range of 60% to 90% in winter, while the revision effect in autumn is limited, with a decline of 40% to 60% in RMSE.
The COR of CMPAS_RT and the revised products in the various seasons is shown in Table 9. The COR of CMPAS_RT is highest in summer at 0.935 and lowest in autumn at 0.862. Because the COR is already high before the adjustment, it increases only modestly in all seasons afterwards, by 2.2% in autumn, 1.6% in winter, 1.2% in spring, and 0.8% in summer. Fig. 7g-7i display the PDF of the COR change rate for the revised CMPAS_RT products in the various seasons. The trend of the change is generally similar in the three regions across the four seasons, with the majority of stations exhibiting a 0-5% increase in COR.
| Season | CMPAS_RT | Random Forest Regression | AdaBoost Regression | Bagging Regression |
| --- | --- | --- | --- | --- |
| Spring | 0.930 | 0.941 | 0.938 | 0.932 |
| Summer | 0.935 | 0.943 | 0.940 | 0.940 |
| Autumn | 0.862 | 0.881 | 0.875 | 0.874 |
| Winter | 0.932 | 0.947 | 0.942 | 0.944 |

Table 9. COR of CMPAS_RT and revised product for different seasons.
In the Panxi Area and the Sichuan Basin, the revision effect for CMPAS_RT is somewhat weaker in autumn. The weather and climate summaries for October 2020 show that most of the basin had more than 15 days of precipitation and a large part of the Panxi Area had between 10 and 18 days, both more than in the same period of a typical year. The high precipitation in that month may have affected the error relationship that the machine learning methods construct between the precipitation product and the observed values, leading to a marginally lower accuracy.
-
On June 12-13, 2021, heavy rainfall occurred over the northeastern part of the Sichuan Basin, where the precipitation center was located. Both the hourly rainfall intensity and the cumulative rainfall during this event were high. When the 24-hour cumulative precipitation from CMPAS_RT is compared with the rain gauge data (Fig. 8a and 8b), the precipitation area, orientation, and rainband pattern of CMPAS_RT closely match the observations. Since the three machine learning techniques produce similar results, the Random Forest revision is chosen for in-depth investigation.
Figure 8. Spatial distribution of 24-hour cumulative precipitation and MB (a: 24-hour cumulative precipitation of weather stations; b: 24-hour cumulative precipitation of CMPAS_RT; c: MB distribution of CMPAS_RT; d: MB distribution of the revised product).
The MB of hourly rainfall before and after the revision is illustrated in Fig. 8c and Fig. 8d. Before the adjustment, the highest MB is found in the heavy precipitation zone in the northeastern part of the basin, where 18% of the stations have an MB greater than 0.02 mm h-1 or less than -0.02 mm h-1. After the revision, the MB inside the heavy precipitation area decreases significantly, with only 9% of the stations having such a large MB.
For analysis, the top five stations in terms of 24-hour cumulative precipitation are chosen (Fig. 9). Four stations recorded 24-hour cumulative precipitation totals higher than those of CMPAS_RT, showing that CMPAS_RT somewhat underestimates heavy precipitation. The MB is decreased to varying degrees after the revision, with a maximum reduction of 58%. The COR at the five stations also increases slightly following the correction.
-
The data from 449 quality-controlled hydrological stations in July 2021 are selected for the independent analysis, and the station distribution is shown in Fig. 10a. Fig. 10b-10f display a comparison of the indicators before and after the adjustment. Bagging Regression has the strongest effect in the independent revision. The range of MB variations decreases significantly after the revision, although the median MB changes only slightly. The median COR between the hydrological stations and CMPAS_RT, at 0.6, is slightly lower than that for the meteorological stations, because precipitation is recorded to 0.5 mm at hydrological stations but to 0.1 mm in CMPAS_RT, which introduces some error. The range of fluctuation of the COR is dramatically reduced after the revision, with the median increasing to 0.63.
Figure 10. Revision effect of hydrological stations (a: Spatial distribution of hydrological stations; b: MB; c: COR; d: TS; e: POD; f: MR).
Before the revision, the TS and POD between CMPAS_RT and the hydrological stations mostly fall within the range of 0.7 to 1. Following the revision, the medians increase slightly, and the TS and POD are primarily in the range of 0.8 to 1. The MR is largely between 0 and 0.3 before the adjustment and between 0 and 0.2 after it, indicating fewer missed events.