-
We take weather radar echoes collected at regular intervals as the input sequence and treat the radar echoes at future times, at the same temporal resolution, as the output sequence to be predicted. We therefore regard radar echo extrapolation as a spatiotemporal sequence prediction task, which Shi et al. [20] formulated as follows:
$$ \widetilde{X}_{t+1}, \ldots, \widetilde{X}_{t+K}=\underset{X_{t+1}, \ldots, X_{t+K}}{\operatorname{argmax}} p\left(X_{t+1}, \ldots, X_{t+K} \mid \widehat{X}_{t-J+1}, \widehat{X}_{t-J+2}, \ldots, \widehat{X}_t\right) $$ Each frame X contains the measurements of all grid points on a grid of M rows and N columns. The spatiotemporal sequence prediction task is to predict the most likely K future frames given the J previously observed frames. Based on this formulation, we describe the CFSN in detail in this section.
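Concretely, the formulation maps J observed frames to K predicted frames on the same grid. A minimal shape-level sketch (all sizes are illustrative, and the persistence predictor below is only a stand-in for a learned model):

```python
import numpy as np

# Illustrative sizes (not from the paper): J observed frames, K future
# frames, on an M-row by N-column grid of reflectivity values in dBZ.
J, K, M, N = 10, 10, 120, 140

observed = np.random.rand(J, M, N) * 70.0   # X_{t-J+1}, ..., X_t

# Placeholder "persistence" predictor standing in for the learned model:
# it simply repeats the last observed frame K times.
def predict(x, k):
    return np.repeat(x[-1:], k, axis=0)

predicted = predict(observed, K)            # X~_{t+1}, ..., X~_{t+K}
assert predicted.shape == (K, M, N)
```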
3.1. Single-input neural network
As shown in Fig. 1, most spatiotemporal neural network models are based on an RNN structure, and the usual multi-input method is to fuse the sources on the channel dimension at an early stage of the model (Zhang et al. [22]), i.e., multiple input channels with a single output channel. We superimpose pre-processed radar echo data and wind speed data with the same spatial and temporal resolution on the channel dimension and input them together. The model extracts the spatiotemporal information of the first n frames through an encoder with l layers and stores it in the memory unit for transmission to the decoder. Finally, the decoder outputs the radar echo maps predicted for the next m frames.
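A minimal sketch of this channel-wise early fusion (shapes are illustrative):

```python
import numpy as np

# Hypothetical shapes: batch B, time T, height H, width W.
B, T, H, W = 2, 10, 120, 140

radar = np.random.rand(B, T, 1, H, W)   # radar echo frames, 1 channel
wind  = np.random.rand(B, T, 1, H, W)   # wind speed frames, 1 channel

# Early fusion: stack the two sources along the channel axis, so each
# time step presents a 2-channel frame to the encoder.
fused = np.concatenate([radar, wind], axis=2)
assert fused.shape == (B, T, 2, H, W)
```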
3.2. Fusion GRU
The front section of the Fusion GRU module adds a memory N to the Gated Recurrent Unit (GRU) (Chung et al. [39]) to store the spatiotemporal information of the two data types separately, while using a cascade to compute iteratively around the memory H, as shown on the left of Fig. 2. We introduce the current secondary input, the wind speed data Yt, and combine it with the memories Ht–1 and Nt–1 from the previous moment to update the spatiotemporal information of the wind speed data stored in N. The radar echo data then serves as the current primary input Xt and is combined with the updated Nt to update the memory H. Meanwhile, skip connections of the inputs Xt and Yt deepen the construction of the non-linear relationship, facilitating the fitting of the evolutionary relationships of the data and enabling the description and memorization of the feature information of Xt and Yt. The latter part of the Fusion GRU further captures the long-term spatial dependencies of the updated memories H and N, as shown on the right of Fig. 2. Through a cross-attention mechanism, the interactions between the data are enhanced, and the spatial distribution characteristics of the current data features are extracted and stored in the memory M.
Figure 2. The Fusion GRU consists of two parts. The left part accepts the input data Xt and Yt, and completes the computation of the cascade structure, while the right part uses the cross-attention module to further extract and store features.
The Fusion GRU generates the gates rt', zt', and gt' to control the flow of information in the memory N by performing convolution operations on Yt, Ht–1, and Nt–1. The reset gate rt' selects the information stored in N at the previous moment, and gt' combines it with the input data Yt to generate the information for the current moment. N then uses the update gate zt' to blend the information stored at the previous moment with the new information from gt', completing its update. Subsequently, we use N to influence the generation of the gate zt and reintroduce Xt, Yt, and Nt into the gate gt through skip connections to complete the update of the memory H. We compute cross attention between the updated Ht' and the memory units Mt–1 and Nt, respectively, and combine the results by convolution. Finally, combining the information generated by the cross-attention mechanism, the memory M is updated to capture the long-term spatial dependencies of the data and to generate the new Ht. The update equations of the Fusion GRU are shown below, where σ is the sigmoid activation function, Attention is defined as in Vaswani et al. [40], and * and ⊙ denote the convolution operator and element-wise multiplication, respectively.
$$ \begin{aligned} & r_t^{\prime}=\sigma\left(W_1 *\left[Y_t, H_{t-1}, N_{t-1}\right]\right) \\ & z_t^{\prime}=\sigma\left(W_2 *\left[Y_t, H_{t-1}, N_{t-1}\right]\right) \\ & g_t^{\prime}=\tanh \left(r_t^{\prime} \odot\left(W_3 * N_{t-1}\right)+W_4 * Y_t\right) \\ & N_t=\left(1-z_t^{\prime}\right) \odot N_{t-1}+z_t^{\prime} \odot g_t^{\prime} \\ & r_t=\sigma\left(W_5 *\left[X_t, H_{t-1}\right]\right) \\ & z_t=\sigma\left(W_6 *\left[X_t, H_{t-1}, N_t\right]\right) \\ & g_t=\tanh \left(r_t \odot\left(W_7 * H_{t-1}\right)+X_t+Y_t+N_t\right) \\ & H_t^{\prime}=\left(1-z_t\right) \odot H_{t-1}+z_t \odot g_t \\ & Z_n=\operatorname{Attention}\left(W_8 * H_t^{\prime}, W_9 * N_t, W_{10} * N_t\right) \\ & Z_m=\operatorname{Attention}\left(W_{11} * H_t^{\prime}, W_{12} * M_{t-1}, W_{13} * M_{t-1}\right) \\ & Z=W_{14} *\left[Z_n, Z_m\right] \\ & z_t^{\prime \prime}=\sigma\left(W_{15} *\left[Z, H_t^{\prime}, M_{t-1}\right]\right) \\ & g_t^{\prime \prime}=\tanh \left(W_{16} *\left[Z, H_t^{\prime}, M_{t-1}\right]\right) \\ & o_t=\sigma\left(W_{17} *\left[Z, H_t^{\prime}, N_t, M_{t-1}\right]\right) \\ & M_t=\left(1-z_t^{\prime \prime}\right) \odot M_{t-1}+z_t^{\prime \prime} \odot g_t^{\prime \prime} \\ & H_t=o_t \odot \tanh \left(M_t\right) \end{aligned} $$
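The cascade (left) half of these update equations can be sketched in a few lines. This is only a toy illustration: each convolution W_i * [...] is replaced by a 1×1 channel-mixing matrix so the sketch stays dependency-free, the attention half and the memory-M update are omitted, and all weight names and sizes are assumptions, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy stand-ins for learned convolutions: 1x1 channel-mixing matrices.
C, H, W = 4, 8, 8
rng = np.random.default_rng(0)
w = lambda cin: rng.standard_normal((C, cin)) * 0.1
W1, W2 = w(3 * C), w(3 * C)            # gates r't, z't over [Yt, Ht-1, Nt-1]
W3, W4 = w(C), w(C)                    # candidate g't
W5, W6, W7 = w(2 * C), w(3 * C), w(C)  # gates rt, zt and candidate gt

def mix(Wm, *tensors):
    # "1x1 convolution": mix the concatenated channels at every grid point
    return np.einsum("oc,chw->ohw", Wm, np.concatenate(tensors, axis=0))

def fusion_gru_cascade(Xt, Yt, Hprev, Nprev):
    r1 = sigmoid(mix(W1, Yt, Hprev, Nprev))
    z1 = sigmoid(mix(W2, Yt, Hprev, Nprev))
    g1 = np.tanh(r1 * mix(W3, Nprev) + mix(W4, Yt))
    Nt = (1.0 - z1) * Nprev + z1 * g1                 # wind memory N updated
    rt = sigmoid(mix(W5, Xt, Hprev))
    zt = sigmoid(mix(W6, Xt, Hprev, Nt))
    gt = np.tanh(rt * mix(W7, Hprev) + Xt + Yt + Nt)  # skip connections
    Ht = (1.0 - zt) * Hprev + zt * gt                 # intermediate H't
    return Ht, Nt

Xt, Yt = rng.standard_normal((2, C, H, W))
Ht, Nt = fusion_gru_cascade(Xt, Yt, np.zeros((C, H, W)), np.zeros((C, H, W)))
assert Ht.shape == (C, H, W) and Nt.shape == (C, H, W)
```

A real implementation would use learned k × k convolutions (e.g., in a deep learning framework) and append the cross-attention block that updates M and emits Ht.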
3.3. Cascade Fusion Spatiotemporal Network
CFSN uses an encoder-decoder structure, as shown in Fig. 3. The encoder can be divided into three parts: the Secondary Block, which uses CBAM to extract feature maps of the wind speed data that aid prediction; a backbone network consisting of three layers of ST-LSTM (Wang et al. [31]) for radar echo extrapolation; and the fourth-layer module, the Fusion GRU, which fully combines the information from both. The decoder is the prediction network corresponding to the encoder. In addition, the Top Connection extracts global features from the encoder output of the Fusion GRU and generates the input data for the decoder accordingly.
Figure 3. (a) The architecture of the CFSN. (b) The flow chart of the Secondary Block. Convolution is adopted for downsampling and transposed convolution for upsampling. The blue line segments indicate radar information, the green indicate wind speed information, and the purple indicate a mixture of radar and wind speed information.
We feed the main input, the historical radar echo data, into the encoder backbone network to construct the spatiotemporal information flow and output feature information at each moment. The secondary input, the wind speed maps, is resized by pooling operations, and its spatial information is fully extracted into feature maps by CBAM. These features enter the Fusion GRU separately to calculate the spatiotemporal relationship and capture long-term spatial dependencies. We then stack the radar echo state feature maps output by the Fusion GRU along the channel dimension as a long sequence and further extract feature information from them. The Top Connection, as shown in Fig. 4, uses this sequence to obtain a global receptive field through the CBAM module and to strengthen the connection between the encoder and decoder. It generates an initialization distribution that gives the decoder input prior knowledge rather than learning from scratch. This ensures that the model bases its predictions on the global spatial characteristics of the input data, imposing constraints on the spatial distribution of the data and improving radar echo extrapolation.
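The CBAM-style attention used by the Secondary Block and Top Connection can be sketched as follows. This is a loose, dependency-free approximation of CBAM (Woo et al.): the single random matrix stands in for the shared bottleneck MLP of the channel branch, and the 7×7 convolution of the spatial branch is replaced by a plain average, so all weights here are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
C, H, W = 8, 16, 16
x = rng.standard_normal((C, H, W))
W_mlp = rng.standard_normal((C, C)) * 0.1   # toy stand-in for the shared MLP

# Channel attention: weight each channel by its global pooled statistics.
avg_c, max_c = x.mean(axis=(1, 2)), x.max(axis=(1, 2))
ch_att = sigmoid(W_mlp @ avg_c + W_mlp @ max_c)       # (C,)
x = x * ch_att[:, None, None]

# Spatial attention: weight each grid point by cross-channel statistics
# (real CBAM applies a 7x7 convolution to the stacked avg/max maps).
avg_s, max_s = x.mean(axis=0), x.max(axis=0)
sp_att = sigmoid((avg_s + max_s) / 2.0)               # (H, W)
x = x * sp_att[None, :, :]
assert x.shape == (C, H, W)
```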
-
In this section, we measure the performance of the model to demonstrate its efficiency and reliability. We use the data from the Aliyun Tianchi 2022 Jiangsu Weather AI Algorithm Challenge as the dataset to complete the comparison experiments and ablation study of the model on the radar echo extrapolation task.
-
The dataset from the competition covers weather radar and automatic station observation elements in Jiangsu Province from April to September of 2019–2021, containing radar echo sequences, precipitation, and mean wind elements. All data possess a horizontal resolution of 0.01°, a temporal resolution of 6 minutes, and a grid size of 480 × 560 pixels. The data in the radar echo sequences represent the radar reflectivity at a height of 3 km after quality control and mosaicking of multiple S-band weather radars in Jiangsu, covering the entire province. The values range from 0 to 70 dBZ, indicating the intensity of the radar echoes. The mean wind element dataset is generated by Inverse Distance Weighting (IDW) interpolation (Babak et al. [41]) of mean wind data collected at automatic meteorological stations in Jiangsu and its surrounding regions onto a uniform grid; its values range from 0 to 35 m s⁻¹. IDW is based on Tobler's First Law (Miller [42]): the value at a point to be interpolated is determined by the reciprocal of its distance to each sample point, so the farther a sample point is, the less it influences the interpolated value. Although this introduces a certain degree of error and does not fully represent the actual observations, it makes full use of discrete observations to reflect the spatial distribution of meteorological elements objectively.
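A minimal sketch of IDW interpolation as described above, with inverse-square distance weights (station coordinates and values are made up for illustration):

```python
import numpy as np

# Inverse-distance weighting: nearby stations influence a grid point
# more than distant ones, with weight ~ 1 / distance^power.
def idw(stations_xy, values, query_xy, power=2.0, eps=1e-12):
    d = np.linalg.norm(stations_xy - query_xy, axis=1)
    w = 1.0 / (d ** power + eps)
    return float(np.sum(w * values) / np.sum(w))

stations = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
wind = np.array([2.0, 6.0, 4.0])          # toy wind speeds (m/s)

# A query sitting on a station essentially reproduces that station's value.
assert abs(idw(stations, wind, np.array([0.0, 0.0])) - 2.0) < 1e-6
# A query equidistant from all three stations gets their plain average.
assert abs(idw(stations, wind, np.array([0.5, 0.5])) - 4.0) < 1e-6
```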
We reduce the original 480 × 560 pixel resolution of the radar and mean wind data to 120 × 140 pixels by bilinear interpolation to facilitate the experiments. We use 28,158 sequences for the training set and 2987 sequences for the validation and test sets. We read the image sequences in chronological order using a 20-frame sliding window: the model takes the first 10 frames of radar echo and wind speed and predicts the radar echo sequence for the following 10 frames. In other words, we predict the next 60 minutes of radar echo based on the previous 60 minutes of observations.
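The windowing can be sketched as follows (frame contents are dummy indices; the helper name and stride parameter are illustrative):

```python
# Read a sequence with a 20-frame sliding window and split each window
# into 10 input frames and 10 target frames.
def make_samples(frames, window=20, split=10, stride=1):
    samples = []
    for start in range(0, len(frames) - window + 1, stride):
        seq = frames[start:start + window]
        samples.append((seq[:split], seq[split:]))   # (inputs, targets)
    return samples

frames = list(range(25))                  # 25 dummy frame indices
samples = make_samples(frames, stride=2)  # e.g., stride 2 as in training
assert len(samples) == 3                  # windows start at frames 0, 2, 4
assert samples[0][0] == list(range(10))   # first 10 frames as input
assert samples[0][1] == list(range(10, 20))
```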
-
We set the batch size to 8, the patch size to 4, and the learning rate to 10⁻³ during training. All models are trained for 50 epochs using Adam as the optimizer and MSE as the loss function. To prevent overfitting during training, we sample batches with a stride of 2 and revert to a stride of 1 during testing. Since the field of meteorology focuses on CSI, we select the model with the highest CSI score at a threshold of 30 dBZ as the best-trained model for the test comparison.
We use four meteorological forecast scores as indicators of forecast effectiveness: probability of detection (POD), false alarm ratio (FAR) (Barnes et al. [43]), CSI, and HSS. CSI reflects the proportion of events that are both predicted and observed, penalizing false alarms and misses, while HSS excludes cases in which a random forecast would be correct. For both the predicted and true values, a grid point is recorded as 1 if its radar echo value exceeds the specified threshold, and as 0 otherwise. The corresponding counts are then accumulated into a confusion matrix, where TP denotes true positives (prediction = 1, truth = 1), TN true negatives (prediction = 0, truth = 0), FP false positives (prediction = 1, truth = 0), and FN false negatives (prediction = 0, truth = 1). We take thresholds of 20 dBZ and 30 dBZ in our experiments and calculate POD, FAR, CSI, and HSS as defined below.
$$ \begin{aligned} & \mathrm{CSI}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}+\mathrm{FP}} \\ & \mathrm{POD}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \\ & \mathrm{FAR}=\frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TP}} \\ & \mathrm{HSS}=\frac{2 \times(\mathrm{TP} \times \mathrm{TN}-\mathrm{FN} \times \mathrm{FP})}{(\mathrm{TP}+\mathrm{FN}) \times(\mathrm{FN}+\mathrm{TN})+(\mathrm{TP}+\mathrm{FP}) \times(\mathrm{FP}+\mathrm{TN})} \end{aligned} $$ -
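These confusion-matrix scores can be computed directly from a pair of binarized echo fields (the function name and toy arrays are illustrative):

```python
import numpy as np

# Compute POD, FAR, CSI, and HSS from echo fields binarized at a
# reflectivity threshold, following the definitions above.
def scores(pred, truth, threshold=30.0):
    p, t = pred >= threshold, truth >= threshold
    TP = int(np.sum(p & t));  TN = int(np.sum(~p & ~t))
    FP = int(np.sum(p & ~t)); FN = int(np.sum(~p & t))
    pod = TP / (TP + FN)
    far = FP / (FP + TP)
    csi = TP / (TP + FN + FP)
    hss = 2 * (TP * TN - FN * FP) / (
        (TP + FN) * (FN + TN) + (TP + FP) * (FP + TN))
    return pod, far, csi, hss

# Toy case: one hit, one false alarm, one miss, one correct negative.
pred  = np.array([35.0, 35.0, 10.0, 10.0])
truth = np.array([35.0, 10.0, 35.0, 10.0])
pod, far, csi, hss = scores(pred, truth)
assert pod == 0.5 and far == 0.5 and abs(csi - 1 / 3) < 1e-12
```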
Table 1 shows the four scores of each model on the competition dataset at the two thresholds. In addition to the CFSN proposed in this study, we test single-input models that fuse the two input types on the channel dimension: ConvLSTM, PredRNN, PredRNN++, MIM, and SimVP. All models use the same training, validation, and test sets, and the same randomization mechanism to initialize the data order and model weights. Overall, CFSN performs best.
Model      CSI↑(20) CSI↑(30) POD↑(20) POD↑(30) FAR↓(20) FAR↓(30) HSS↑(20) HSS↑(30)
ConvLSTM   0.538    0.321    0.697    0.516    0.309    0.440    0.634    0.429
PredRNN    0.548    0.323    0.670    0.431    0.265    0.348    0.644    0.430
PredRNN++  0.538    0.314    0.658    0.433    0.279    0.377    0.636    0.419
MIM        0.541    0.321    0.639    0.410    0.250    0.342    0.639    0.426
SimVP      0.481    0.258    0.596    0.353    0.282    0.364    0.575    0.350
CFSN       0.565    0.357    0.669    0.467    0.237    0.326    0.661    0.463
Table 1. Comparison results of the four indices for each model on the dataset at thresholds of 20 and 30 dBZ. PredRNN serves as the baseline, and the highest scores are indicated in red. ↑ indicates that higher is better; ↓ indicates that lower is better.
Whether the threshold is 20 or 30 dBZ, CFSN has the best CSI, FAR, and HSS. At the 30 dBZ threshold, the CSI and HSS of CFSN are 10.5% and 7.6% higher, respectively, than those of the baseline PredRNN; at the 20 dBZ threshold, they are 3.1% and 2.6% higher. Although ConvLSTM has the highest POD, its FAR is also the highest. CFSN, which has the lowest FAR, is 5.2% and 4.6% lower than the second-best MIM at thresholds of 20 and 30 dBZ, respectively. In addition, the POD of CFSN at the 30 dBZ threshold is second best. This suggests that CFSN adequately incorporates the data and forms effective constraints on the distribution, focusing especially on strong echoes. To better illustrate the experimental results, we visualize some of the experimental data in Fig. 5.
Figure 5. Each model predicts the last 10 frames of the radar echo sequence on the dataset. Shown are frames 1, 4, 7, and 10.
From the above, we can see that our model retains strong echoes in the corresponding area of the weather radar image at the late stage of the forecast. In contrast, the echoes predicted by PredRNN and MIM decay faster and contain fewer strong echoes, while ConvLSTM and SimVP predict too much strong echo. Fig. 6, which shows the scores of each model at each lead time, confirms this discussion: CFSN maintains a low FAR at every predicted frame regardless of threshold, and at both the 20 and 30 dBZ thresholds its CSI and HSS are higher than those of the comparison models.
-
To further investigate the effectiveness of the Fusion GRU and Top Connection in CFSN, we conduct ablation experiments on the Jiangsu meteorological station dataset. We remove the cross-attention module, the Fusion GRU, and the Top Connection from CFSN, respectively, and compare the resulting models with the full CFSN. The experimental results are shown in Table 2.
Model                    CSI↑(20) CSI↑(30) POD↑(20) POD↑(30) FAR↓(20) FAR↓(30) HSS↑(20) HSS↑(30)
CFSN                     0.565    0.357    0.669    0.467    0.237    0.326    0.661    0.463
CFSN no Fusion GRU       0.553    0.342    0.663    0.444    0.260    0.349    0.650    0.446
CFSN no Top Connection   0.560    0.351    0.679    0.473    0.264    0.363    0.656    0.457
CFSN no cross attention  0.567    0.353    0.673    0.462    0.236    0.318    0.663    0.457
Table 2. Comparative results of the four indices for the ablation experiments at thresholds of 20 and 30 dBZ. The highest scores are indicated in red. ↑ indicates that higher is better; ↓ indicates that lower is better.
Table 2 shows that removing either the Fusion GRU or the Top Connection prevents CFSN from reaching the highest CSI and HSS achieved by the full model; both modules benefit the scores of multi-data-fusion radar extrapolation. Without the Fusion GRU, the model cannot fully fuse the data. Without the Top Connection, the model loses constraints on the predicted data distribution, resulting in a high POD but also a high FAR. Only when the two are combined are the best extrapolation results achieved. In addition, we removed the cross-attention module from the Fusion GRU. Although this yields the lowest FAR, the model's focus on strong echoes decreases: the CSI and HSS scores at the 30 dBZ threshold drop by 0.004 and 0.006, respectively. Given the need to focus on strong echo development, we retain the cross-attention module for its higher CSI score. The detailed scores in Fig. 7 again confirm this discussion: the complete CFSN generally outperforms the CFSN with any module removed, and the cross-attention mechanism makes the model tend to focus on the stronger parts of the echo.