
Improving air pollutant prediction in Henan Province, China, by enhancing the concentration prediction accuracy using autocorrelation errors and an Informer deep learning model


Air pollution is an important issue affecting sustainable development in China, and accurate air quality prediction has become an important means of air pollution control. Traditional methods, such as deterministic and statistical approaches, have large prediction errors and cannot provide effective information to prevent the negative effects of air pollution; as a result, few existing methods obtain accurate air pollutant time series predictions. To this end, a deep learning-based air pollutant prediction method, the autocorrelation error-Informer (AE-Informer) model, is proposed in this study. The model incorporates autocorrelation errors (AEs) into the Informer model and is used to predict the hourly concentrations of multiple air pollutants, including PM10, PM2.5, NO2, and O3. The experimental results show that the mean absolute error (MAE) and root mean square error (RMSE) values of AE-Informer in multivariate prediction are 3% lower than those of the Informer model; thus, the prediction error is effectively reduced. In addition, a stacking ensemble model is proposed to supplement missing values in the air pollutant time series. This study uses Henan Province in China as an example to test the validity of the proposed methodology.

1 Introduction

Air pollutants (PM10, PM2.5, O3, NO2, etc.) are a major problem for ecological environments [1,2,3], causing issues such as reduced air quality and risks to human health [4]. The maximum 8-h 90th-quantile concentration of ozone in cities such as Beijing, Tai'an, Zibo, Dezhou, Handan, and Kaifeng increased from 2015 to 2018, with the annual concentration rising from 168 to 212 μg m−3 [5]. In recent years, public health studies have found that the levels of PM2.5 and O3 are closely related to cardiovascular, cerebrovascular, nervous system, and respiratory diseases [6], while long-term exposure to O3 and NO2 increases the risk of death [7]. In addition, air pollutants affect people's happiness, population migration, and other livelihood issues [8]. Short-term air quality prediction can provide early warning of high-pollution events and help reduce resident exposure. Therefore, the real-time acquisition of air pollutant concentration information and the accurate prediction of future concentrations are of great significance for air pollution management and public health protection.

The most commonly used air pollutant concentration prediction methods are deterministic methods, statistical methods, and machine learning methods [9,10,11]. Deterministic methods predict the concentration of air pollutants by simulating atmospheric chemical diffusion and transport processes. Commonly used deterministic methods include chemical transport models [12] and operational street pollution models [13]. Although these methods can generate pollutant predictions, they have considerable computational costs, and the prediction results may be inaccurate due to a lack of actual observation data [14, 15]. Statistical methods address the problem of limited data in deterministic methods. The most commonly used statistical methods include the autoregressive integrated moving average (ARIMA) method, the geographically weighted regression method, and the generalized additive model [16,17,18]. These methods have been widely used for time series prediction of air pollutant levels. For example, Slini et al. [19] used the ARIMA model to predict ozone concentrations in Greece. However, most statistical methods assume linear relationships between variables and labels, which is inconsistent with real-world nonlinearities. To solve this problem, researchers applied nonlinear models in machine learning. For example, Ma et al. [20] used support vector machines to predict the concentrations of air pollutants such as PM10 and PM2.5. Rubal et al. [21] used the random forest (RF) model to predict the future 1-h concentrations of seven pollutants, including NO2. Although these models achieve improved prediction accuracy, they ignore temporal trends in air pollutant concentrations.

With the rapid development of deep learning techniques, traditional machine learning and shallow neural network models no longer obtain state-of-the-art performance. Different kinds of deep learning models have been proposed to improve the air quality prediction performance. For example, Ma et al. [22] used a bidirectional long short-term memory (Bi-LSTM) neural network based on the recurrent neural network (RNN) structure and transfer learning to predict future 1-h, 1-d and 1-wk concentrations of PM10. Chauhan et al. [23] used a convolutional neural network (CNN) structure to predict the future 1-d concentrations of five pollutants, including PM10 and PM2.5, in India. However, RNNs suffer from vanishing and exploding gradients, and CNNs struggle to capture long-term historical information, so neither can obtain sufficiently accurate predictions of air pollutant concentrations. In the past two years, the transformer model [24] was introduced to the field of time series prediction; its self-attention mechanism provides an effective method for obtaining long-term macroscopic information from time series. Many improved transformer-based models have been proposed. For example, the LogTrans model [25] showed high accuracy in predicting future hourly electricity consumption while reducing the running cost of the model. The Star-Transformer model [26] improved the prediction performance for future hourly meteorological indices. Additional models, such as the MetaFormer [27], AutoFormer [28], Transformer-XL [29], and Set Transformer [30] models, all exhibited considerable gains in time series prediction. The Informer model [31], proposed in 2021, is an improved transformer time series prediction model based on the Kullback–Leibler (KL) divergence. The Informer model improves the time series prediction accuracy while reducing the running cost of the model, saving considerable time.
This model showed improved performance for power consumption time series prediction and traffic flow time series prediction but has not been applied in air quality prediction. In this study, we apply the Informer model to air quality time series prediction and modify the method to further improve its prediction accuracy.

Due to factors such as human error and equipment failure, a large amount of data is missing from the acquired state-controlled monitoring site data sets, resulting in discontinuous time series that seriously affect subsequent data analyses. The issue of missing data in the time series therefore needs to be addressed. In this paper, we use a stacking ensemble model to supplement the missing data in the air pollutant concentration time series and compare its effectiveness with that of existing approaches. Then, the Informer model is used to obtain air pollutant concentration time series predictions, and its performance is compared with that of other deep learning models. Finally, the AE-Informer model, which combines the AE strategy [32] with the Informer model, is proposed to improve the prediction accuracy. To verify the framework of the proposed method, we model the levels of four major air pollutants, namely, PM10, PM2.5, NO2, and O3, in the study area in Henan Province.

2 Materials and methods

Figure 1 presents the methodological framework of the model proposed in this paper. The framework has three parts: (1) air pollutant data collection and missing value supplementation, (2) structural design of the AE-Informer model and the prediction of air pollutants, and (3) analysis of the prediction results and generalization tests.

Fig. 1

Methodological framework (data, model application and result analysis)

2.1 Research area and data

As shown in Fig. 2, Henan Province is located at the junction of the coastal open areas and the central and western regions. It is the core area of China's economic and social development, as well as one of the most densely populated and polluted areas in China [1, 33]. In this study, ground measurements of the PM10, PM2.5, NO2 and O3 mass concentrations were collected hourly from January 1, 2019, to December 31, 2020, at 60 stations in Henan Province by the China Environmental Monitoring Centre (CEMC). The green dots denote ground-based CEMC sites, and the red five-pointed star denotes the site used in the case study and experiments in this paper. We first removed invalid values and outliers due to instrument calibration issues. The collected data have missing values due to instrument damage, human error, and other factors. Therefore, we use the stacking ensemble learning model to fill in the missing values; more details are provided in Sect. 3.1.

Fig. 2

Overview of Henan Province overlaid with ground-based stations

2.2 Methods

2.2.1 Informer model

In air pollutant time series, the value at the current moment is correlated with the value at each moment in the previous period, and the air pollutant concentration at the current moment can be predicted based on the historical time series information. Informer [31] is an improved time series prediction model based on the Transformer. The Informer model has an encoder-decoder structure, and the core of this model is the self-attention mechanism. In contrast to models with RNN and CNN structures, models with self-attention mechanisms do not need to consider the position in the sequence when obtaining the historical time series information, and the cost of calculating the association between two positions in the time series does not increase with increasing distance. Therefore, historical information can be obtained more effectively to accurately predict the air pollutant concentration at the current moment. The calculation equation is shown in Eq. (1):

$$\begin{array}{c}\mathrm{Attention}\left(\mathrm{q},\mathrm{k},\mathrm{v}\right)=\mathrm{softmax}\left(\frac{\mathrm{q}{\mathrm{k}}^{\mathrm{T}}}{\sqrt{\mathrm{d}}}\right)\mathrm{v}\end{array}$$
The calculation process is shown in Fig. 3. In this figure, q and k are sequences that are obtained by multiplying \(\mathrm{X}\) by the weights \({\mathrm{W}}^{\mathrm{q}}\) and \({\mathrm{W}}^{\mathrm{k}}\), respectively; these sequences are essentially the same as \(\mathrm{X}\). The inner product of q and k is equivalent to \({\mathrm{XX}}^{\mathrm{T}}\), which represents the inner product of the current moment and the value at each moment in the previous period. The result of the inner product is normalized by the softmax function to generate a new sequence \(\widehat{\mathrm{\alpha }}\). The larger the value at a certain position in the sequence is, the higher the correlation between the value at the moment to be predicted and the value at that position. When predicting the value at the current moment, more information about this moment is considered. Finally, the inner product of \(\widehat{\mathrm{\alpha }}\) and v is used to obtain the attention score, which is an internal representation of \(\mathrm{X}\) in the model that represents various features of \(\mathrm{X}\).

Fig. 3

Self-attention calculation process (X represents the input pollutant time series; q, k and v are sequences obtained by multiplying \(\mathrm{X}\) by the weights \({\mathrm{W}}^{\mathrm{q}}\), \({\mathrm{W}}^{\mathrm{k}}\) and \({\mathrm{W}}^{\mathrm{v}}\); \(\mathrm{\alpha }\) is the inner product of q and k; and \(\widehat{\mathrm{\alpha }}\) is \(\mathrm{\alpha }\) normalized by the softmax function)
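The self-attention computation described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the toy dimensions, the random matrices standing in for the learned weights \({\mathrm{W}}^{\mathrm{q}}\), \({\mathrm{W}}^{\mathrm{k}}\) and \({\mathrm{W}}^{\mathrm{v}}\), and the function names are our assumptions, not code from the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a pollutant time series.

    X          : (L, d_in) input sequence, one row per hour
    Wq, Wk, Wv : (d_in, d) projection weights (learned in practice)
    """
    q, k, v = X @ Wq, X @ Wk, X @ Wv          # project X into q, k and v
    d = q.shape[-1]
    alpha = q @ k.T / np.sqrt(d)              # inner products of q and k
    alpha_hat = softmax(alpha, axis=-1)       # normalized correlation weights
    return alpha_hat @ v                      # attention score (representation of X)

rng = np.random.default_rng(0)
L, d_in, d = 8, 4, 16                         # toy sizes (assumptions)
X = rng.normal(size=(L, d_in))
Wq, Wk, Wv = (rng.normal(size=(d_in, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)                              # (8, 16)
```

Each row of `alpha_hat` sums to 1, so the output at every time step is a weighted mixture of the values at all positions, with larger weights at the more correlated positions.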

In addition, the Informer model combines the self-attention mechanism with the KL divergence strategy to create ProbSparse self-attention. Since most of the historical information is provided by the values at a few positions in the time series, to reduce the computational costs, the positions that provide a large amount of information are found according to the sparsity scores at various positions, and dot product calculations are performed only at these positions to obtain the available historical information; the dot product operations are not required at the other positions. The calculation equation is shown in Eq. (2):

$$\begin{array}{c}\mathrm{M}\left({\mathrm{q}}_{\mathrm{i}},\mathrm{K}\right)=\underset{\mathrm{j}}{\mathrm{max}}\left\{\frac{{\mathrm{q}}_{\mathrm{i}}{\mathrm{k}}_{\mathrm{j}}^{\mathrm{T}}}{\sqrt{\mathrm{d}}}\right\}-\frac{1}{{\mathrm{L}}_{\mathrm{k}}}\sum\limits_{\mathrm{j}=1}^{{\mathrm{L}}_{\mathrm{k}}}\frac{{\mathrm{q}}_{\mathrm{i}}{\mathrm{k}}_{\mathrm{j}}^{\mathrm{T}}}{\sqrt{\mathrm{d}}}\end{array}$$
where \({\mathrm{q}}_{\mathrm{i}}\) is the value at the i-th position in the air pollutant time series \(\mathrm{X}\), \(\mathrm{K}\) is the entire \(\mathrm{X}\) sequence, and \({\mathrm{L}}_{\mathrm{k}}\) is the length of \(\mathrm{X}\). Dividing by \(\sqrt{\mathrm{d}}\) ensures that the input to the softmax function is not too large, which would cause the partial derivative to approach 0. The larger the M value at the i-th position is, the more information this position carries, and the more important this position is in the self-attention operation.
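A minimal sketch of the sparsity measurement in Eq. (2) follows; the function names, toy sizes and the top-u selection step are illustrative assumptions based on the description above, not the authors' implementation.

```python
import numpy as np

def sparsity_scores(Q, K):
    """Sparsity measurement M for each query position (cf. Eq. 2).

    M(q_i, K) = max_j(q_i.k_j / sqrt(d)) - mean_j(q_i.k_j / sqrt(d)).
    A position with a large M dominates the attention distribution and
    therefore carries most of the usable historical information.
    """
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                  # (L_q, L_k) scaled dot products
    return S.max(axis=1) - S.mean(axis=1)     # M for every query position

def top_u_queries(Q, K, u):
    """Indices of the u most informative query positions; only these
    positions keep the full dot-product computation."""
    M = sparsity_scores(Q, K)
    return np.argsort(M)[::-1][:u]

rng = np.random.default_rng(1)
Q = rng.normal(size=(24, 8))                  # toy query/key sequences
K = rng.normal(size=(24, 8))
idx = top_u_queries(Q, K, u=5)
print(idx)
```

Because M is a maximum minus a mean, it is always non-negative, and it is large exactly when one key dominates the query's attention distribution, i.e. when that distribution is far from uniform.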

2.2.2 AE-Informer model

Autocorrelated errors arise when insufficient covariates are included, when data collection errors occur, or when the time series prediction model does not fully fit the data. To reduce the influence of these errors on the prediction results, the AE strategy [32] can be incorporated into the Informer model. Since the errors are autocorrelated, the error at the current moment can be represented by the errors at each moment in the previous period. The calculation equation is shown in Eq. (3):

$$\begin{array}{c}{e}_{t}={\rho }_{1}{e}_{t-1}+\dots +{\rho }_{p}{e}_{t-p}+{\varepsilon }_{t},\left|{\rho }_{i}\right|<1,\forall i\end{array}$$

where \({\uprho }_{\mathrm{i}}\) are the autocorrelation coefficients of the errors, e is the error at each moment, and \({\upvarepsilon }_{t}\) is the random error at time t. For convenience of calculation, the equation is reduced to the first-order form, yielding \({\mathrm{e}}_{\mathrm{t}}={\uprho }_{1}{\mathrm{e}}_{\mathrm{t}-1}\). Assuming that \(\widehat{\varepsilon }=\widehat{{\mathrm{X}}_{\mathrm{t}}}-\widehat{\uprho }{\mathrm{X}}_{\mathrm{t}-1}\), the new input and output of the model can be constructed by combining the two equations. The input changes from the observed value of the air pollutant concentration at each moment in the previous period to the error value at each moment, and the output changes from the predicted value at the current moment to the predicted value of the error at the current moment, where ρ is used as a parameter to train the model. Finally, by applying \(\widehat{{\mathrm{X}}_{\mathrm{t}}}=\widehat{\upvarepsilon }+\widehat{\uprho }{\mathrm{X}}_{\mathrm{t}-1}\), the predicted value of the error at the current moment is added to the observed value at the previous moment to obtain the predicted value at the current moment. This approach has been shown to improve deep learning models such as LSTM [32]; thus, it is used in the Informer model to improve the accuracy of the air pollutant concentration predictions.

To improve the hourly prediction accuracy of the Informer model [31], in this study, we fuse the Informer model with AEs [32] (Fig. 4) and propose the AE-Informer model. Figure 4a shows the traditional Informer model, and Fig. 4b presents the modified AE-Informer model. When the air pollutant concentration \(\widehat{{\mathrm{X}}_{\mathrm{t}}}\) is predicted at time t, the input to the Informer model is adjusted from the hourly pollutant concentration observations \(\left\{{\mathrm{X}}_{\mathrm{t}-\mathrm{w}},\dots ,{\mathrm{X}}_{\mathrm{t}-1}\right\}\) to the error values between these observations and those in the previous hour \(\left\{{\mathrm{X}}_{\mathrm{t}-\mathrm{w}}-{\mathrm{X}}_{\mathrm{t}-\mathrm{w}-1},\dots ,{\mathrm{X}}_{\mathrm{t}-1}-{\mathrm{X}}_{\mathrm{t}-2}\right\}\). The output changes from the predicted air pollutant concentration at the current time \(\widehat{{\mathrm{X}}_{\mathrm{t}}}\) to the predicted value of the error between the current and previous time \(\widehat{{\mathrm{X}}_{\mathrm{t}}}-\uprho {\mathrm{X}}_{\mathrm{t}-1}\). Then, \(\uprho {\mathrm{X}}_{\mathrm{t}-1}\) is added to the prediction result to obtain the final prediction value \(\widehat{{\mathrm{X}}_{\mathrm{t}}}\) at time t. \(\uprho\), the parameter relating the error at each moment to that at the previous moment, is added to the Informer model. Finally, the model is trained and iterated.

Fig. 4

Improvement of the Informer model based on the AE strategy: Informer (a) and AE-Informer (b)
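The AE input/output transformation can be illustrated with a small sketch. Here ρ is fixed at 1 for simplicity (in AE-Informer it is a trainable parameter), and the toy values and function names are our assumptions:

```python
import numpy as np

def to_error_inputs(x):
    """Turn observations into first-order error inputs e_t = x_t - x_{t-1}
    (illustration with rho = 1; AE-Informer learns rho during training)."""
    return x[1:] - x[:-1]

def recover_prediction(err_hat, x_prev, rho=1.0):
    """x_hat_t = eps_hat + rho * x_{t-1}: add the predicted error back onto
    the previous observation to get the concentration forecast."""
    return err_hat + rho * x_prev

x = np.array([35.0, 38.0, 40.0, 37.0])   # toy hourly PM2.5 observations
errors = to_error_inputs(x)               # model inputs: [3.0, 2.0, -3.0]
# Suppose some forecaster predicts the next error as 1.5:
x_hat = recover_prediction(1.5, x[-1])
print(x_hat)                              # 38.5
```

The forecaster thus learns the (autocorrelated) error sequence rather than the raw concentrations, and the final prediction is recovered by adding the predicted error back onto the last observation.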

2.2.3 Stacking ensemble learning

Before the air pollutant concentration can be predicted, the time series must be supplemented to address the missing values. In recent years, with the rapid development of machine learning, many studies have applied machine learning models to the field of missing data supplementation [34, 35], and ensemble methods can integrate these basic machine learning models to improve performance.

Stacking ensembles are ensemble learning techniques that fuse multiple regression models through a meta-regressor. Each base regression model uses the complete training set during training, and the output of each base regression model during the ensemble learning process is used as a meta-feature that becomes the input of the meta-regressor. The meta-regressor fits these meta-features to produce the fused model, which effectively reduces the bias and variance of the prediction results. Therefore, in this study, we use the stacking ensemble method (Fig. 5) to fuse five basic models: extreme gradient boosting (XGBoost), light gradient boosting (LGBM), gradient boosting decision tree (GBDT), random forest (RF) and extra trees (ET). This approach improves the accuracy of the supplemented missing data in the air pollutant time series.

Fig. 5

Stacking ensemble model structure and workflow
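The two-layer stacking workflow in Fig. 5 can be sketched as follows. For a self-contained illustration, two trivial base learners (a mean predictor and ordinary least squares) stand in for the five tree-based models, and a closed-form ridge regression plays the role of the second-layer meta-regressor; all names and sizes are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X'X + lam*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def stack_predict(base_models, X_train, y_train, X_test, k=5):
    """Minimal stacking: out-of-fold base predictions become meta-features;
    a ridge meta-regressor fuses them into the final prediction."""
    n = len(y_train)
    folds = np.array_split(np.arange(n), k)
    meta_train = np.zeros((n, len(base_models)))
    meta_test = np.zeros((len(X_test), len(base_models)))
    for j, fit in enumerate(base_models):
        for fold in folds:                        # cross (out-of-fold) predictions
            mask = np.ones(n, bool); mask[fold] = False
            model = fit(X_train[mask], y_train[mask])
            meta_train[fold, j] = model(X_train[fold])
        model = fit(X_train, y_train)             # refit on the full training set
        meta_test[:, j] = model(X_test)
    w = ridge_fit(meta_train, y_train)            # second-layer ridge regressor
    return meta_test @ w

# Toy base learners standing in for XGBoost / LGBM / GBDT / RF / ET:
mean_learner = lambda X, y: (lambda Xn: np.full(len(Xn), y.mean()))
ols_learner = lambda X, y: (lambda Xn, w=np.linalg.lstsq(X, y, rcond=None)[0]: Xn @ w)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
pred = stack_predict([mean_learner, ols_learner], X[:80], y[:80], X[80:])
print(pred.shape)                                 # (20,)
```

The out-of-fold construction is the key design choice: each base model's meta-feature is produced on data it was not trained on, so the ridge layer learns how to weight the base models without being misled by their in-sample fit.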

2.2.4 Model evaluation

To evaluate the collected time series data, the missing data supplementation experiments were first performed. Then, the data were input into the model to conduct several experiments to (1) determine the optimal input sequence length, (2) determine the optimal sampling factor size, and (3) generate the multivariate air pollutant concentration time series predictions. To evaluate the results of each experiment, three performance metrics were used in this study: the correlation coefficient (R2), the root mean square error (RMSE), and the mean absolute error (MAE). These metrics are calculated as shown in Eqs. (4), (5) and (6):

$$\begin{array}{c}{R}^{2}=\frac{{\sum }_{i=1}^{n}\left({\widehat{y}}_{i}-{u}_{\widehat{y}}\right)\left({y}_{i}-{u}_{y}\right)}{\sqrt{{\sum }_{i=1}^{n}{\left({\widehat{y}}_{i}-{u}_{\widehat{y}}\right)}^{2}}\sqrt{{\sum }_{i=1}^{n}{\left({y}_{i}-{u}_{y}\right)}^{2}}}\end{array}$$

$$\begin{array}{c}RMSE=\sqrt{\frac{1}{n}{\sum }_{i=1}^{n}{\left({\widehat{y}}_{i}-{y}_{i}\right)}^{2}}\end{array}$$

$$\begin{array}{c}MAE=\frac{1}{n}{\sum }_{i=1}^{n}\left|{\widehat{y}}_{i}-{y}_{i}\right|\end{array}$$

where n is the predicted sequence length, \({y}_{i}\) is the observed value at the i-th position in the air pollutant time series, and \({\widehat{y}}_{i}\) is the predicted value at the i-th position in the sequence. \({u}_{\widehat{y}}\) and \({u}_{y}\) are the average forecasted and average observed air pollutant concentrations, respectively.
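The three metrics of Eqs. (4)-(6) can be computed directly; a short sketch with a hypothetical toy example follows. R2 is implemented here exactly as the correlation-style expression written in Eq. (4).

```python
import numpy as np

def r2(y_hat, y):
    """Eq. (4): correlation-based R2 between predicted and observed values."""
    num = np.sum((y_hat - y_hat.mean()) * (y - y.mean()))
    den = np.sqrt(np.sum((y_hat - y_hat.mean()) ** 2)) * \
          np.sqrt(np.sum((y - y.mean()) ** 2))
    return num / den

def rmse(y_hat, y):
    """Eq. (5): root mean square error."""
    return np.sqrt(np.mean((y_hat - y) ** 2))

def mae(y_hat, y):
    """Eq. (6): mean absolute error."""
    return np.mean(np.abs(y_hat - y))

# Hypothetical observed and predicted hourly concentrations:
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 4.5])
print(r2(y_hat, y), rmse(y_hat, y), mae(y_hat, y))
```

For this toy pair, MAE is 0.375 and RMSE is about 0.433; a perfect prediction gives R2 = 1.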

3 Results and discussion

3.1 Missing data supplementation

The stacking ensemble model and five machine learning models were used to conduct missing data supplementation experiments to verify the effectiveness of the stacking ensemble model in improving the accuracy of the supplemented data. First, the five models, namely, ET, RF, GBDT, XGBoost, and LGBM, were tuned by Bayesian optimization to achieve the best effect for each model. These five models were then used as the first layer of the stacking ensemble model. The out-of-fold predictions from each model's cross-validation were fused to form a new training set, and each model's predictions on the test set were fused to form a new test set. The new training and test sets were passed to the second-layer ridge regression model for training, yielding more accurate supplemented data.

As shown in Table 1, compared with the other machine learning models, the stacking ensemble model achieved the highest R2 values for all four pollutants. Among them, the supplemented PM2.5 values had the highest R2 (0.979), and the R2 values of the other three air pollutants were all greater than 0.87. Compared with the XGBoost model, the MAE and RMSE of the proposed model were generally reduced by 1–6%. Except for the MAE of NO2 in the XGBoost model, the stacking ensemble method yielded better results in terms of all other metrics. The stacking ensemble model improved the accuracy of the supplemented data relative to its component models and exhibited a wide range of applicability to the four pollutants (PM10, PM2.5, NO2, and O3).

Table 1 Comparison of the supplemented missing data with the stacking ensemble models

In addition, to more intuitively display the effects of supplementing the missing data, Fig. 6 shows the verification results of the four predicted pollutants, PM10 (a), PM2.5 (b), NO2 (c), and O3 (d), versus the ground measurements. In the scatter plots of the four pollutants, the points are distributed near the diagonal, which indicates that the predicted values are close to the observed values with low error. Therefore, the stacking ensemble model reliably supplements the missing CEMC data. Among the four pollutants, the NO2 scatter is more dispersed than those of PM2.5, PM10 and O3. This result occurred because NO2 has more missing values, leading to fewer training samples, which affects the prediction performance of the model. Compared with NO2 and PM10, the concentration values of O3 and PM2.5 are more evenly distributed, which leads to higher prediction accuracy.

Fig. 6

Scatter plots of the predicted and observation results of PM10, PM2.5, NO2 and O3 based on the stacking ensemble model

3.2 Determining the input sequence length

In the Informer model, the input sequence length represents how many hours or days of pollutant concentration data the model uses to predict the pollutant concentration in the next hour or day. Shorter input sequences cannot ensure that the model has sufficient historical air pollutant data, while longer input sequences increase irrelevant inputs and the computational complexity. Therefore, it is necessary to determine the optimal input sequence length to achieve the best model prediction performance. To determine the most appropriate input sequence length, 10 prediction experiments were conducted with different input sequence lengths for the 60 state-controlled stations in the study area, and the results of each experiment were averaged to obtain the final RMSE and MAE performance indicators.

Table 2 shows the MAE and RMSE values obtained with different encoder lengths at the Luohe University site. When the input sequence length ranged from 6 to 48, the MAE and RMSE ranged from 5.5–5.8 and 9.4–9.8, respectively. When the encoder length was 8, the MAE and RMSE reached their lowest values, indicating that the AE-Informer model had the lowest prediction error and highest accuracy. Other national control stations also obtained their best prediction performance when the encoder length was 8. Therefore, the length of the input sequence is set to 8 in this study.

Table 2 Prediction performance using different input sequence lengths
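The trade-off above can be made concrete with a sliding-window sketch: for a chosen input length w, each training sample consists of w consecutive hourly values and the target is the next hour. The helper name and the toy series are assumptions; w = 8 mirrors the length selected in this section.

```python
import numpy as np

def make_windows(series, w):
    """Slice an hourly pollutant series into (input, target) pairs:
    w consecutive hours in, the next hour out."""
    X = np.stack([series[i:i + w] for i in range(len(series) - w)])
    y = series[w:]
    return X, y

pm25 = np.arange(20.0)           # stand-in for hourly concentrations
X, y = make_windows(pm25, w=8)   # w = 8, the best length found here
print(X.shape, y.shape)          # (12, 8) (12,)
```

Increasing w enlarges each input row (more history per sample) while shrinking the number of samples that can be cut from a fixed-length record, which is exactly the balance the sweep in Table 2 resolves.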

3.3 Optimal sampling factor size

By evaluating the KL divergence, we found that there was a large difference between the attention distribution and the uniform distribution, demonstrating that the self-attention mechanism was sparse, and only a small amount of data in the air pollutant sequence contributed important historical information. Selecting data at fewer positions in the input sequence yields less historical information, resulting in insufficient information to make accurate predictions. However, selecting data at too many locations leads to considerably complex historical information, which increases the noise in the prediction results and the computational costs. Therefore, it is very important to select data at an appropriate number of positions in the sequence. According to the size of the sampling factor (Factor), the first few positions with the highest sparsity scores are selected by the model to obtain historical information. To determine the optimal sampling factor size, we conducted 10 prediction experiments using different sampling factors, and the default input sequence length was the optimal input sequence length determined in Sect. 3.2. The results of each experiment were averaged to obtain the final RMSE and MAE to determine the optimal sampling factor.

The experimental results from the Luohe University site are shown in Table 3. When the sampling factor was 5, the MAE and RMSE of the air pollutant prediction results reached 5.57 and 9.4, respectively, which proves that the model achieves the best prediction effect with this sampling factor. In the experiments at other national control stations, high prediction accuracy was also achieved when the factor was 5. Therefore, the sampling factor is set to 5 in this work.

Table 3 Prediction performance using different sampling factors on site data of Luohe University

3.4 Comparison of experimental results

To demonstrate the effectiveness of the AE-Informer model, multivariate prediction experiments were conducted on the AE-Informer model and other commonly used models, and the experimental results were compared. The comparison models included ARIMA [17], Informer [31], Transformer [24], Bi-LSTM [22], the gated recurrent unit (GRU) [36], long short-term memory (LSTM) model [37] and long short-term network (LST-Net) [38]. The ARIMA model was divided into three components: the autoregressive (AR) term, the differential term, and the moving average (MA) term. The AR term refers to the past value used to predict the next value, the MA term defines the number of past prediction errors when predicting future values, and the difference term specifies the number of times that the difference operation is performed on the sequence. The difference operation ensures that the data remain balanced. The traditional Transformer and Informer models are typical models that use attention mechanisms for time series prediction. The Bi-LSTM, GRU, and LSTM models are typical models that use RNN structures to solve time series prediction problems. LST-Net applies the CNN structure to the field of time series prediction.
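The AR and differencing components described above can be sketched in NumPy (the MA term is omitted for brevity). The function names and the toy series are illustrative assumptions, not the experimental configuration:

```python
import numpy as np

def difference(x, d=1):
    """The 'I' term of ARIMA: difference the series d times to remove trend."""
    for _ in range(d):
        x = x[1:] - x[:-1]
    return x

def fit_ar(x, p):
    """The AR term: least-squares fit of x_t on its p previous values."""
    rows = np.stack([x[i:i + p] for i in range(len(x) - p)])
    coef, *_ = np.linalg.lstsq(rows, x[p:], rcond=None)
    return coef

def ar_forecast(x, coef):
    """One-step-ahead forecast from the last p values."""
    p = len(coef)
    return x[-p:] @ coef

x = np.array([1.0, 2.0, 4.0, 7.0, 11.0, 16.0, 22.0])  # quadratic-trend toy series
dx = difference(x)               # [1, 2, 3, 4, 5, 6] -- trend removed, now linear
coef = fit_ar(dx, p=2)
print(ar_forecast(dx, coef))
```

Differencing once turns the quadratic-trend toy series into a linear one that the AR(2) term fits exactly, so the one-step forecast of the differenced series comes out at 7, matching the pattern.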

The results show that the AE-Informer model proposed in this paper outperforms the traditional Informer model and the other comparative models in terms of air pollutant prediction (Table 4). The R2, MAE, and RMSE of the AE-Informer model reached 0.976, 5.42, and 9.41, respectively, and the error was reduced by 3–7% compared with the other models. The MAE and RMSE of the Informer model were 13% less than those of the ARIMA model, and the prediction accuracy was significantly improved. The experiments proved that the Transformer and Informer models based on the self-attention mechanism outperform the RNN-based Bi-LSTM, GRU, and LSTM models and the CNN-based LST-Net model. The traditional ARIMA prediction model showed inadequate time series prediction performance.

Table 4 Performance comparison of the AE-Informer model and other models for multivariate time series prediction

Table 5 shows the evaluation indicators for the individual prediction results of the four pollutants. The simultaneous prediction of multiple pollutants does not affect the prediction effects of single pollutants, and the correlation between the predicted and true values of each pollutant is approximately 0.85.

Table 5 Evaluation indicators for the prediction results of PM2.5, PM10, O3, and NO2

3.5 Prediction performance of the AE-Informer model

After improving the Informer model by introducing AEs, AE-Informer was used to predict the hourly concentrations of the four common pollutants. To compare the prediction performance before and after the model was improved, curves of the predicted and actual values were generated; their consistency demonstrates the effectiveness of the AE-Informer model. Figure 7 depicts a comparison of the observed and predicted curves of the AE-Informer model ((a) PM10, (b) PM2.5, (c) NO2, and (d) O3). The blue line represents the actual air pollutant values, and the red line represents the predicted values. In the curves of the four pollutants, even for some extreme values, the error between the AE-Informer predictions and the observations is small. In addition, at some time steps, the predicted value of the AE-Informer model coincides with the observed value.

Fig. 7

Comparison between the change curves of the predicted and observed values based on the AE-Informer model

The above experiments were based on the data of one site as an example. To evaluate the broad applicability of the proposed AE-Informer model, the model was applied to all other state control sites within the study area. The Informer model and the AE-Informer model were applied to predict the hourly pollutant concentrations at each monitoring station in Henan Province. Then, kriging interpolation was used to interpolate the RMSE and MAE values in the study area to more intuitively demonstrate the improvement in the prediction performance.

The AE-Informer and Informer prediction evaluation indicators after kriging interpolation are shown in Fig. 8, with (a) showing the AE-Informer multivariate prediction MAE and (b) showing the AE-Informer multivariate prediction RMSE. Colors toward the bottom of the color bar indicate better prediction performance and smaller prediction errors, while colors toward the top indicate worse performance and larger errors. The MAE value of the AE-Informer model ranges from 4.56–7.48, and the RMSE value ranges from 7.53–12.8. The prediction performance across the whole study area is significantly improved in terms of both the MAE and RMSE. Thus, adding the AE effectively reduces the prediction error of the Informer model and improves the hourly prediction accuracy. Moreover, the model is generally applicable across the whole research area, proving the effectiveness of the method proposed in this paper.

Fig. 8

Spatial distribution of the performance indicators for air pollutant time series prediction: a MAE and b RMSE

4 Conclusions

In conclusion, in this paper, we proposed a methodological framework for studying the effectiveness of the Informer model and AE in improving the prediction accuracy of air pollutant concentrations and compared the prediction performance of various models. Introducing the AE improved the air pollutant concentration time series prediction accuracy. The hourly air pollutant concentration data from all available monitoring stations in Henan Province from 2019 to 2020 were obtained to test the validity of our method. The main contributions of this study can be summarized as follows:

  (1) The stacking ensemble method was introduced to supplement missing time series data. Five base regressors (XGBoost, LGBM, GBDT, RF, and ET) were integrated, and their performance was compared. The experimental results showed that stacking improved the accuracy of missing-data imputation: compared with the XGBoost model alone, the MAE and RMSE of PM2.5 were reduced by up to 6%.

  (2) The Informer model was applied to air pollutant time series prediction for the first time. The self-attention mechanism in the Informer model efficiently captures historical time series information. The experimental results showed that the MAE and RMSE of the proposed model were 13% lower than those of the ARIMA model, significantly improving the prediction accuracy.

  (3) This paper is one of the few pioneering studies to fuse deep learning with the AE strategy to predict air pollutant concentrations. The model can help governments and researchers assess trends more accurately in long-term air quality analyses, especially for multivariate time series forecasting.

  (4) The proposed AE-Informer model effectively improved multivariate time series prediction of air pollutant concentrations: compared with the Informer model, its MAE and RMSE were reduced by 3%.
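The AE strategy underlying contribution (3) can be illustrated with a simple post hoc variant: treat the base model's residuals as an AR(1) process and carry the previous step's error forward. This is a hedged sketch of the idea from Sun et al. [32]; in the actual AE-Informer the adjustment is integrated with model training, whereas here the coefficient `rho` is merely estimated from held-out residuals.

```python
import numpy as np

def ae_adjust(y_true, y_base):
    """Adjust base predictions using the lag-1 autocorrelation of the
    residuals e_t = y_t - yhat_t:  yhat'_t = yhat_t + rho * e_{t-1}."""
    y_true = np.asarray(y_true, dtype=float)
    y_base = np.asarray(y_base, dtype=float)
    e = y_true - y_base                        # residuals of the base model
    rho = np.corrcoef(e[:-1], e[1:])[0, 1]     # estimated AR(1) coefficient
    adjusted = y_base.copy()
    adjusted[1:] += rho * e[:-1]               # carry previous error forward
    return adjusted, rho
```

When the residuals are strongly autocorrelated (rho close to 1), each prediction absorbs most of the previous hour's error, which is the mechanism behind the error reduction reported for the AE-Informer relative to the plain Informer.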

This research can be extended to explore higher-resolution data. Moreover, transfer learning can be introduced to achieve daily time series prediction, and data from additional state-controlled monitoring sites can be applied to assess areas with fewer such sites.

Availability of data and materials

All data that were generated or analyzed during this study are available upon request.

References


  1. Cai K, Li SS, Zheng FB, Yu C, Zhang XY, Liu Y, et al. Spatio-temporal variations in NO2 and PM2.5 over the Central Plains Economic Region of China during 2005–2015 based on satellite observations. Aerosol Air Qual Res. 2018;18:1221–35.


  2. Li SS, Ma ZW, Xiong XZ, Christiani DC, Wang ZX, Liu Y. Satellite and ground observations of severe air pollution episodes in the winter of 2013 in Beijing, China. Aerosol Air Qual Res. 2016;16:977–89.


  3. Li SS, Chen LF, Xiong XZ, Tao JH, Su L, Han D, et al. Retrieval of the haze optical thickness in North China Plain using MODIS data. IEEE T Geosci Remote. 2013;51:2528–40.


  4. Li G, Zeng Q, Pan X. Disease burden of ischaemic heart disease from short-term outdoor air pollution exposure in Tianjin, 2002–2006. Eur J Prev Cardiol. 2016;23:1774–82.


  5. Xiao CC, Chang M, Guo PK, Gu MF, Li Y. Analysis of air quality characteristics of Beijing–Tianjin–Hebei and its surrounding air pollution transport channel cities in China. J Environ Sci-China. 2020;87:213–27.


  6. Zhou CJ, Wei G, Zheng HP, Russo A, Li CC, Du HD, et al. Effects of potential recirculation on air quality in coastal cities in the Yangtze River Delta. Sci Total Environ. 2019;651:12–23.


  7. Chen ZJ, Cui LL, Cui XX, Li XW, Yu KK, Yue KS, et al. The association between high ambient air pollution exposure and respiratory health of young children: A cross sectional study in Jinan, China. Sci Total Environ. 2019;656:740–9.


  8. Song Y, Zhou AN, Zhang M. Exploring the effect of subjective air pollution on happiness in China. Environ Sci Pollut R. 2020;27:43299–311.


  9. Kang Z, Qu ZY. Application of BP neural network optimized by genetic simulated annealing algorithm to prediction of air quality index in Lanzhou. In: 2nd IEEE International Conference on Computational Intelligence and Applications (ICCIA). Beijing: IEEE; 2017.

  10. Li X, Peng L, Yao XJ, Cui SL, Hu Y, You CZ, et al. Long short-term memory neural network for air pollutant concentration predictions: Method development and evaluation. Environ Pollut. 2017;231:997–1004.


  11. Singh KP, Gupta S, Kumar A, Shukla SP. Linear and nonlinear modeling approaches for urban air quality prediction. Sci Total Environ. 2012;426:244–55.


  12. Stern R, Builtjes P, Schaap M, Timmermans R, Vautard R, Hodzic A, et al. A model inter-comparison study focussing on episodes with elevated PM10 concentrations. Atmos Environ. 2008;42:4567–88.


  13. Berkowicz R. OSPM - A parameterised street pollution model. Environ Monit Assess. 2000;65:323–31.


  14. Catalano M, Galatioto F. Enhanced transport-related air pollution prediction through a novel metamodel approach. Transport Res D-Tr E. 2017;55:262–76.


  15. Kukkonen J, Partanen L, Karppinen A, Ruuskanen J, Junninen H, Kolehmainen M, et al. Extensive evaluation of neural network models for the prediction of NO2 and PM10 concentrations, compared with a deterministic modelling system and measurements in central Helsinki. Atmos Environ. 2003;37:4539–50.


  16. Jian L, Zhao Y, Zhu YP, Zhang MB, Bertolatti D. An application of ARIMA model to predict submicron particle concentrations from meteorological factors at a busy roadside in Hangzhou, China. Sci Total Environ. 2012;426:336–45.


  17. Shukur OB, Lee MH. Daily wind speed forecasting through hybrid KF-ANN model based on ARIMA. Renew Energ. 2015;76:637–47.


  18. Davis JM, Speckman P. A model for predicting maximum and 8 h average ozone in Houston. Atmos Environ. 1999;33:2487–500.


  19. Slini T, Karatzas K, Moussiopoulos N. Statistical analysis of environmental data as the basis of forecasting: an air quality application. Sci Total Environ. 2002;288:227–37.


  20. Ma J, Cheng JCP. Data-driven study on the achievement of LEED credits using percentage of average score and association rule analysis. Build Environ. 2016;98:121–32.


  21. Rubal, Kumar D. Evolving differential evolution method with random forest for prediction of air pollution. Procedia Comput Sci. 2018;132:824–33.


  22. Ma J, Cheng JCP, Lin CQ, Tan Y, Zhang JC. Improving air quality prediction accuracy at larger temporal resolutions using deep learning and transfer learning techniques. Atmos Environ. 2019;214:116885.


  23. Chauhan R, Kaur H, Alankar B. Air quality forecast using convolutional neural network for sustainable development in urban environments. Sustain Cities Soc. 2021;75:103239.


  24. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: 31st Annual Conference on Neural Information Processing Systems (NIPS). Long Beach: MIT Press; 2017.

  25. Li SY, Jin XY, Xuan Y, Zhou XY, Chen WH, Wang YX, et al. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In: 33rd Conference on Neural Information Processing Systems (NIPS). Vancouver: MIT Press; 2019.

  26. Guo QP, Qiu XP, Liu PF, Shao YF, Xue XY, Zhang Z. Star-Transformer. In: Conference of the North-American-Chapter of the Association-for-Computational-Linguistics - Human Language Technologies (NAACL-HLT). Minneapolis: ACL; 2019.

  27. Yu WH, Luo M, Zhou P, Si CY, Zhou YC, Wang XC, et al. MetaFormer is actually what you need for vision. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans: IEEE; 2022.

  28. Chen MH, Peng HW, Fu JL, Ling HB. AutoFormer: searching transformers for visual recognition. In: IEEE/CVF International Conference on Computer Vision (ICCV). Virtual: IEEE; 2021.

  29. Dai ZH, Yang ZL, Yang YM, Carbonell J, Le QV, Salakhutdinov R. Transformer-XL: attentive language models beyond a fixed-length context. In: Association for Computational Linguistics (ACL) 2019. Florence: ACL; 2019.

  30. Lee J, Lee Y, Kim J, Kosiorek AR, Choi S, Teh YW. Set Transformer: a framework for attention-based permutation-invariant neural networks. In: 36th International Conference on Machine Learning (ICML). California: ACM; 2019.

  31. Zhou HY, Zhang SH, Peng JQ, Zhang S, Li JX, Xiong H, et al. Informer: beyond efficient transformer for long sequence time-series forecasting. In: 35th AAAI Conference on Artificial Intelligence (AAAI-21). Virtual: AAAI; 2021.

  32. Sun FK, Lang CI, Boning DS. Adjusting for autocorrelated errors in neural networks for time series. In: 35th Conference on Neural Information Processing Systems (NIPS). Virtual; 2021 Dec 6–14.

  33. Liu SH, Hua SB, Wang K, Qiu PP, Liu HJ, Wu BB, et al. Spatial-temporal variation characteristics of air pollution in Henan of China: Localized emission inventory, WRF/Chem simulations and potential source contribution analysis. Sci Total Environ. 2018;624:396–406.


  34. Arriagada P, Karelovic B, Link O. Automatic gap-filling of daily streamflow time series in data-scarce regions using a machine learning algorithm. J Hydrol. 2021;598:126454.


  35. Chacon Hurtado JC, Alfonso L, Solomatine D. Comparison of machine learning methods for data infilling in hydrological forecasting. In: EGU General Assembly 2014. Vienna; 2014 Apr 27–May 2.

  36. Dey R, Salem FM. Gate-variants of Gated Recurrent Unit (GRU) neural networks. In: 60th IEEE International Midwest Symposium on Circuits and Systems (MWSCAS). Boston; 2017 Aug 6–9.

  37. Yu Y, Si XS, Hu CH, Zhang JX. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 2019;31:1235–70.


  38. Li LD, Wang K, Li S, Feng XC, Zhang L. LST-Net: Learning a convolutional neural network with a learnable sparse transform. In: 16th European Conference on Computer Vision (ECCV) 2020. Glasgow; 2020 Aug 23–28.



Acknowledgements

We would like to acknowledge the use of the mass PM10, PM2.5, NO2, and O3 concentration data from


Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 42071409), the Open Foundation of Key Laboratory of Ecological Environment Protection of Space Information Application of Henan (22FW070108), the Key Research Projects of Henan Higher Education Institutions (23A520024), the Shenzhen Special Foundation of Central Government to Guide Local Science & Technology Development (2021Szvup032) and the Major Project of Science and Technology of Henan Province (201400210300, 201300311400).

Author information

Authors and Affiliations



Contributions

Kun Cai designed the experiments, downloaded the data, processed the data, and wrote the paper. Xusheng Zhang and Ming Zhang edited the paper and adjusted unreasonable sentences. Qiang Ge provided visualizations, participated in the investigation, and developed the methodology. Shenshen Li revised the full text, provided useful suggestions about the validation experiments, and provided funding support. Baojun Qiao supervised the experiments and provided resources. Yang Liu provided funding support. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Qiang Ge.

Ethics declarations

Competing interests

The authors declare no conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit


About this article


Cite this article

Cai, K., Zhang, X., Zhang, M. et al. Improving air pollutant prediction in Henan Province, China, by enhancing the concentration prediction accuracy using autocorrelation errors and an Informer deep learning model. Sustain Environ Res 33, 13 (2023).
