Several factors affect the electricity consumption of a country. One that must be considered is the level of industrial development. In a less developed country, many people lack access to electricity and therefore cannot consume it. This is the case for many developing countries such as Nigeria, Bangladesh and Indonesia. In a developed country with advanced industry, on the other hand, a significant portion of the electricity consumption is dedicated to industry and most citizens are connected to the electricity grid.
Another very important aspect is the overall climate in that country. When electricity consumption per capita is investigated, it can be seen that a country that is generally considered to have a colder climate and harsher weather conditions tends to have a higher consumption per capita. This is most probably due to the increasing utilization of heating sources in that country. That’s one of the reasons why countries like Canada, Iceland, Norway and Sweden have the highest electricity consumption per person among other countries. However, one should also consider that air-conditioning requires a significant amount of electricity too. So it’s not necessary for a country to be considered “cold” to consume a great deal of electricity, but this is mostly applicable to a more developed country.
Population is also an important indicator of the level of electricity consumption. If more people live in a country, it is expected to have a higher overall electricity consumption. The effect of population can be seen in countries like China and India. However, it is worth mentioning that because some highly populated countries are still “developing”, they have relatively lower electricity consumption levels. Therefore, a high population is not a guarantee of high electricity consumption either.
All in all, many aspects have an impact on the electricity consumption of a country, some of which have not been discussed here, such as the level of insulation of houses. In the case of Turkey, the weather conditions will certainly have a great influence, since Turkey is a country where all 4 seasons can be experienced. Furthermore, Turkey has 7 regions with very distinct climates, industrial developments and populations. So, it should be expected that different regions will have very different levels of consumption.
The aim of this study is to investigate the hourly electricity consumption in Turkey, decide on the possible factors that may affect the electricity consumption and build a forecasting model accordingly. The study starts with the visualization of the electricity consumption data, then moves on to building a forecasting model. Afterwards, hourly consumption amounts are predicted for 14 days using the model, the results are interpreted and several alternative improvements are discussed.
The data set used for the assignment includes the date, the hour, the consumption amount for each hour of the day and the hourly temperature values from 7 different locations in Turkey, from January 2017 to February 2021. Below, the head of the data is shown.
For the sake of the project, the consumption amounts will be subjected to an in-depth visual analysis in order to decide on the general outline of the approach that will be implemented to make forecasts. Firstly, the hourly consumption amounts should be plotted together before making any further comments.
Although it is difficult to interpret the graph since the scale of the data to be manipulated is quite big, it can be observed that there is a significant degree of yearly seasonality, in that for each year, there seems to be a repeating pattern.
It should be stated that for the rest of the visual analysis and for the approach to be implemented, daily values will be utilized, since it is much easier to make observations with daily values. In order to obtain the new data set, the mean consumption for each day was calculated first. For the temperatures, the maximum and the minimum among all locations were taken for each hour, and two columns were created that store the daily means of these hourly maximum and minimum temperatures. Below, the head of the newly created data can be seen.
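The aggregation described above can be sketched in base R as follows. This is only an illustration on synthetic data: the column names `consumption`, `t_1`, `t_2`, `t_3` and the use of three locations instead of seven are assumptions, not the study's actual table.

```r
# Synthetic hourly data: 2 days x 24 hours, three hypothetical temperature columns
set.seed(1)
hourly <- data.frame(
  date = rep(as.Date(c("2021-01-01", "2021-01-02")), each = 24),
  hour = rep(0:23, times = 2),
  consumption = runif(48, 25000, 40000),
  t_1 = rnorm(48, 5, 2),
  t_2 = rnorm(48, 10, 2),
  t_3 = rnorm(48, 0, 2)
)

temp_cols <- c("t_1", "t_2", "t_3")

# Hourly maximum and minimum temperature across the locations
hourly$max_t <- apply(hourly[, temp_cols], 1, max)
hourly$min_t <- apply(hourly[, temp_cols], 1, min)

# Daily means of consumption and of the hourly max/min temperatures
daily <- aggregate(cbind(consumption, max_t, min_t) ~ date,
                   data = hourly, FUN = mean)
names(daily) <- c("date", "mean_consumption", "mean_maxt", "mean_mint")
print(daily)
```

The result has one row per day, with the column names `mean_consumption` and `mean_maxt` chosen to match the model formulas shown later in the report.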
Now, it may be useful to plot the mean consumption amounts together and reanalyze the daily graph to make further inferences.
At first glance, the yearly seasonality is again very visible and it should be accounted for when building a forecasting approach. Secondly, there seem to be some outliers where the mean consumption plummets. It may be a good idea to investigate the points where these sudden decreases occur, since if they are left unconsidered, they might affect the approach and result in poor forecasts. Plotting the mean consumption values in one year should be useful to make further comments.
When the outliers are investigated, it can be noticed that most of them coincide with religious holidays such as Eid al-Fitr or Eid al-Adha. The first day of the year can also almost always be considered an outlier. The consumption probably decreases on those days because workplaces are mostly closed or operate with limited capacity. Therefore, it is wise to take into account the dates of religious holidays and the first day of the year when building the approach.
Above, the plot of mean consumption amounts for 5 weeks can be seen. By looking at the graph, it is immediately observed that there is a significant degree of weekly seasonality that should be considered. Moreover, the consumption appears to be higher during the week and lower on weekends. This is most probably due to the fact that workplaces are usually closed on weekends.
It may be helpful to plot the autocorrelation function and make further comments regarding the data.
The autocorrelation function seems to support the argument that there is a degree of weekly seasonality, in that the ACF values rise at every 7th lag. It should also be mentioned that the data shows a significant level of autocorrelation at lag 1.
Above, the data is shown starting from the beginning of September 2019. It should be noticed that starting with April 2020, there seems to be a significant decrease in consumption. This is presumably because the Coronavirus started to spread in Turkey in March 2020 and, soon after, the schools were temporarily closed and some businesses switched to working from home. So, the effects of the pandemic have to be considered when building a forecasting model.
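To illustrate how a weekly pattern shows up in the autocorrelation function, the sketch below applies `acf()` from base R to a synthetic daily series with lower values on two days of each week. The series and its parameters are made up for illustration and are not the study's data.

```r
# Synthetic daily series with a weekday/weekend pattern (assumed values)
set.seed(42)
weekly_pattern <- c(0, 0, 0, 0, 0, -1500, -4500)  # lower on the last two days
y <- 33000 + rep(weekly_pattern, times = 20) + rnorm(140, sd = 300)

# Autocorrelation up to lag 21, computed without plotting
acf_values <- acf(y, lag.max = 21, plot = FALSE)$acf[, 1, 1]
names(acf_values) <- 0:21

# Compare the multiples of 7 with some of the lags in between
round(acf_values[c("1", "3", "7", "14", "21")], 2)
```

Calling `acf(y, lag.max = 21)` without `plot = FALSE` draws the usual ACF plot with the dashed confidence bounds.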
Before moving forward, the temperature values should be considered.
By comparison, it can easily be observed that both the mean maximum temperature and the mean minimum temperature have a degree of correlation with the daily electricity consumption amounts, in that their overall patterns are quite similar, with each of them increasing and decreasing in similar periods of the year. Therefore, it should be a good idea to take at least one of them as a regressor when building a model.
As discussed in the visualization part of the study, the independent variables that were considered as possible regressors (i.e. Temperature, Day of the Week, Day of the Year, Holidays etc.) are added to the main daily data.
The tail of the data with its new additions can be seen below.
To find the most suitable model, different combinations of the candidate regressors in the daily data set are fitted in separate regression models. Based on the comparison of the results from these models, the best one will be chosen to make predictions.
As a base model, only the time index and a temperature variable are included, in order to explore the relationship between the mean consumption, time and the given temperature information. Since the distributions of the mean maximum and mean minimum temperature values appear to be extremely similar, using only one of them is reasonable. It should be mentioned that including both of them in a model together may result in overfitting, i.e. poor forecasts.
##
## Call:
## lm(formula = mean_consumption ~ time_index + mean_maxt, data = daily_consumption1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14274 -1898 189 2484 6804
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.043e+04 2.143e+02 142.003 < 2e-16 ***
## time_index 1.615e+00 2.066e-01 7.816 1.03e-14 ***
## mean_maxt 5.779e+01 9.879e+00 5.850 6.05e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3320 on 1463 degrees of freedom
## Multiple R-squared: 0.06928, Adjusted R-squared: 0.06801
## F-statistic: 54.45 on 2 and 1463 DF, p-value: < 2.2e-16
As expected, the added variables have turned out to be significant to the model; however, the adjusted R-squared value (0.068) is much lower than desired and the residual standard error (3320) is very high. To increase the adjusted R-squared value, other possible regressors will be considered.
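A base model of this form can be fitted with `lm()` as sketched below. The data frame is a synthetic stand-in (all values are made up); only the formula mirrors the call shown in the summary above.

```r
# Synthetic stand-in for the daily table; all values are made up
set.seed(7)
n <- 200
daily <- data.frame(
  time_index = seq_len(n),
  mean_maxt = 15 + 10 * sin(2 * pi * seq_len(n) / 365) + rnorm(n)
)
daily$mean_consumption <- 30000 + 1.5 * daily$time_index +
  50 * daily$mean_maxt + rnorm(n, sd = 500)

# Base model: linear trend plus one temperature regressor
fit_base <- lm(mean_consumption ~ time_index + mean_maxt, data = daily)

# The two quantities compared across models in the text
summary(fit_base)$adj.r.squared
summary(fit_base)$sigma  # residual standard error
```

Extracting `adj.r.squared` and `sigma` from each fitted model makes it easy to tabulate the comparison as regressors are added one by one.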
Since weekly seasonality is very visible in the graphs shown while analyzing the daily consumption data, a variable that specifies the day of the week can be added to the model in order to somewhat account for the seasonality.
##
## Call:
## lm(formula = mean_consumption ~ time_index + mean_maxt + w_day,
## data = daily_consumption1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14363.7 -1784.3 261.1 2042.0 5805.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31420.6061 261.1683 120.308 < 2e-16 ***
## time_index 1.6183 0.1793 9.024 < 2e-16 ***
## mean_maxt 56.0500 8.5754 6.536 8.71e-11 ***
## w_dayCumartesi -1580.8573 281.8935 -5.608 2.45e-08 ***
## w_dayÇarşamba 177.2913 281.9261 0.629 0.52954
## w_dayPazar -4728.0174 281.5578 -16.792 < 2e-16 ***
## w_dayPazartesi -872.3076 281.5708 -3.098 0.00199 **
## w_dayPerşembe 286.6998 281.9039 1.017 0.30932
## w_daySalı -32.7910 281.5812 -0.116 0.90731
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2882 on 1457 degrees of freedom
## Multiple R-squared: 0.3019, Adjusted R-squared: 0.2981
## F-statistic: 78.76 on 8 and 1457 DF, p-value: < 2.2e-16
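When the summary is examined, it appears that although only some of the days are significant to the model, the adjusted R-squared value has increased to 0.2981 and the residual standard error has dropped to 2882. Note that the day names are printed in Turkish and that each day coefficient is measured relative to the omitted baseline level, Friday (Cuma).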
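Since `weekdays()` returns names in the current locale, a day-of-week factor can also be built in a locale-independent way. The following is a minimal sketch under that assumption, with English labels and Friday set as the reference level to mirror the summaries; it is not necessarily how `w_day` was created in the study.

```r
dates <- seq(as.Date("2021-01-01"), by = "day", length.out = 14)

# as.POSIXlt()$wday is locale independent: 0 = Sunday, ..., 6 = Saturday
day_labels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")
w_day <- factor(day_labels[as.POSIXlt(dates)$wday + 1], levels = day_labels)

# Use Friday as the reference level
w_day <- relevel(w_day, ref = "Friday")

# lm() expands the factor into six dummy columns plus the intercept
head(model.matrix(~ w_day), 3)
```

With this encoding, a coefficient such as the one for Sunday reads as the expected difference from a Friday with the same values of the other regressors.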
Additionally, the yearly seasonality that was observed earlier may be explained by specifying the day of the year with a new variable. This variable can take values between 1 and 365 (or 366 if it’s a leap year).
##
## Call:
## lm(formula = mean_consumption ~ time_index + mean_maxt + w_day +
## y_day, data = daily_consumption1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14280.0 -1784.9 249.4 2019.7 5810.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31321.1754 269.9555 116.023 < 2e-16 ***
## time_index 1.5631 0.1833 8.529 < 2e-16 ***
## mean_maxt 51.5022 9.1298 5.641 2.03e-08 ***
## w_dayCumartesi -1581.4533 281.7879 -5.612 2.39e-08 ***
## w_dayÇarşamba 179.7523 281.8253 0.638 0.524
## w_dayPazar -4728.9171 281.4527 -16.802 < 2e-16 ***
## w_dayPazartesi -871.2618 281.4660 -3.095 0.002 **
## w_dayPerşembe 287.1545 281.7982 1.019 0.308
## w_daySalı -30.3767 281.4804 -0.108 0.914
## y_day 1.1292 0.7801 1.448 0.148
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2881 on 1456 degrees of freedom
## Multiple R-squared: 0.3029, Adjusted R-squared: 0.2986
## F-statistic: 70.3 on 9 and 1456 DF, p-value: < 2.2e-16
Unfortunately, the day of the year appears to be insignificant to the model with a p-value of 0.148, and it brings very little change in the residual standard error. This may be due to some of the yearly seasonality being explained by the day of the week effect. Another likely reason is that the variable enters the model as a single linear term, which cannot capture a pattern that rises and falls within a year. However, before removing this variable immediately, one or two variables will be put into the model to see whether the day of the year becomes significant. If not, it is safe to assume that removing the variable will not cause a notable loss in the accuracy of the model. The reason for this approach is that the effect of a variable can change depending on the addition or removal of other regressors.
In this model, a variable to specify the month is added, which may be helpful to see if the month a day is in has a considerable influence on the electricity consumption on that day or not.
##
## Call:
## lm(formula = mean_consumption ~ time_index + mean_maxt + w_day +
## y_day + mon, data = daily_consumption1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11157.8 -665.4 147.8 1112.3 5770.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34171.0838 1452.2437 23.530 < 2e-16 ***
## time_index 1.5578 0.1286 12.118 < 2e-16 ***
## mean_maxt 25.0177 16.3657 1.529 0.12657
## w_dayCumartesi -1597.7124 195.7758 -8.161 7.16e-16 ***
## w_dayÇarşamba 182.5186 195.9386 0.932 0.35175
## w_dayPazar -4749.1448 195.5703 -24.284 < 2e-16 ***
## w_dayPazartesi -891.1668 195.6448 -4.555 5.68e-06 ***
## w_dayPerşembe 299.5775 195.8291 1.530 0.12629
## w_daySalı -27.1716 195.7462 -0.139 0.88962
## y_day 2.2608 5.9431 0.380 0.70370
## monAralık -1743.0739 814.6483 -2.140 0.03255 *
## monEkim -5167.2020 455.5215 -11.343 < 2e-16 ***
## monEylül -1814.5456 314.4283 -5.771 9.63e-09 ***
## monHaziran -4237.6778 454.5140 -9.324 < 2e-16 ***
## monKasım -3589.8386 635.6301 -5.648 1.96e-08 ***
## monMart -3228.3358 1001.4614 -3.224 0.00129 **
## monMayıs -5635.4884 621.0982 -9.073 < 2e-16 ***
## monNisan -5235.3711 812.2896 -6.445 1.57e-10 ***
## monOcak -1164.8935 1365.0538 -0.853 0.39360
## monŞubat -1080.9577 1188.4150 -0.910 0.36320
## monTemmuz 947.3277 313.6770 3.020 0.00257 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2001 on 1445 degrees of freedom
## Multiple R-squared: 0.6661, Adjusted R-squared: 0.6615
## F-statistic: 144.1 on 20 and 1445 DF, p-value: < 2.2e-16
The increased adjusted R-squared (0.6615) and the decreased residual standard error (2001) indicate that the model has improved a lot, and most of the months appear to be significant. On the other hand, the day of the year still seems irrelevant to the model; so, it is removed permanently.
As discussed in the previous part, some visible changes were observed in the pattern of the data with the onset of the pandemic. To reflect these changes, a regressor that marks the pandemic period can be added to the model. For this purpose, the period starting from the 11th of March 2020 is taken, which is the date when the Ministry of Health announced the first Covid-19 case in Turkey.
##
## Call:
## lm(formula = mean_consumption ~ time_index + mean_maxt + w_day +
## mon + pandemic, data = daily_consumption1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11159.7 -665.8 150.3 1108.3 5245.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34135.1386 466.2938 73.205 < 2e-16 ***
## time_index 2.4306 0.1738 13.986 < 2e-16 ***
## mean_maxt 33.1048 16.0907 2.057 0.039828 *
## w_dayCumartesi -1598.9568 192.2932 -8.315 < 2e-16 ***
## w_dayÇarşamba 180.2325 192.4522 0.937 0.349170
## w_dayPazar -4750.6096 192.0908 -24.731 < 2e-16 ***
## w_dayPazartesi -888.5020 192.1642 -4.624 4.11e-06 ***
## w_dayPerşembe 298.2547 192.3417 1.551 0.121205
## w_daySalı -25.6659 192.2595 -0.133 0.893820
## monAralık -1418.6609 398.3872 -3.561 0.000381 ***
## monEkim -5018.7715 280.5919 -17.886 < 2e-16 ***
## monEylül -1756.7039 253.6953 -6.924 6.56e-12 ***
## monHaziran -4288.6737 260.9167 -16.437 < 2e-16 ***
## monKasım -3344.8202 345.1331 -9.691 < 2e-16 ***
## monMart -3457.6416 378.0016 -9.147 < 2e-16 ***
## monMayıs -5701.7918 278.0683 -20.505 < 2e-16 ***
## monNisan -5297.2229 331.2813 -15.990 < 2e-16 ***
## monOcak -1580.1680 451.9711 -3.496 0.000486 ***
## monŞubat -1494.6953 431.2924 -3.466 0.000545 ***
## monTemmuz 900.5274 249.8231 3.605 0.000323 ***
## pandemic -1328.9835 182.6090 -7.278 5.55e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1966 on 1445 degrees of freedom
## Multiple R-squared: 0.6779, Adjusted R-squared: 0.6734
## F-statistic: 152.1 on 20 and 1445 DF, p-value: < 2.2e-16
The regressor is found to be significant to the model. Moreover, the adjusted R-squared value has increased slightly (from 0.6615 to 0.6734) and the residual standard error has decreased (from 2001 to 1966).
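An indicator of this kind can be created with a simple date comparison. The sketch below uses the start date given in the text; the short date range is only for illustration.

```r
dates <- seq(as.Date("2020-03-08"), by = "day", length.out = 7)

# 1 from 11 March 2020 onward, 0 before (start date taken from the text)
pandemic <- as.integer(dates >= as.Date("2020-03-11"))

data.frame(date = dates, pandemic = pandemic)
```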
In this further model, a variable that is named holiday is added in order to explain the effect of the religious holidays and the first day of the year since it appears that there is usually very low electricity consumption on these days. Among the mentioned dates, only weekdays (not weekends) are marked as 1 and the remaining ones as 0, since in a regular week, the consumption is notably lower on weekends, so including those days when creating this variable may mislead the model.
##
## Call:
## lm(formula = mean_consumption ~ time_index + mean_maxt + w_day +
## mon + pandemic + holiday, data = daily_consumption1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11155.3 -764.5 84.6 993.8 4676.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34225.1573 385.9110 88.687 < 2e-16 ***
## time_index 2.5159 0.1439 17.489 < 2e-16 ***
## mean_maxt 55.2648 13.3440 4.142 3.65e-05 ***
## w_dayCumartesi -1858.0445 159.4543 -11.653 < 2e-16 ***
## w_dayÇarşamba 85.6355 159.3117 0.538 0.590981
## w_dayPazar -5007.5227 159.2819 -31.438 < 2e-16 ***
## w_dayPazartesi -892.8200 159.0312 -5.614 2.37e-08 ***
## w_dayPerşembe 119.6018 159.3286 0.751 0.452979
## w_daySalı -78.1821 159.1230 -0.491 0.623267
## monAralık -1584.5964 329.7599 -4.805 1.71e-06 ***
## monEkim -5433.2642 232.7670 -23.342 < 2e-16 ***
## monEylül -2147.0971 210.4974 -10.200 < 2e-16 ***
## monHaziran -4320.4510 215.9328 -20.008 < 2e-16 ***
## monKasım -3611.5633 285.8121 -12.636 < 2e-16 ***
## monMart -3642.5297 312.9085 -11.641 < 2e-16 ***
## monMayıs -5964.1456 230.3482 -25.892 < 2e-16 ***
## monNisan -5571.3988 274.3675 -20.306 < 2e-16 ***
## monOcak -1363.9768 374.1358 -3.646 0.000276 ***
## monŞubat -1599.3305 356.9518 -4.481 8.03e-06 ***
## monTemmuz 381.8056 207.7235 1.838 0.066261 .
## pandemic -1415.4601 151.1606 -9.364 < 2e-16 ***
## holiday -9013.6658 349.3159 -25.804 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1627 on 1444 degrees of freedom
## Multiple R-squared: 0.7795, Adjusted R-squared: 0.7763
## F-statistic: 243.1 on 21 and 1444 DF, p-value: < 2.2e-16
The low p-value of the holiday variable indicates that it is a significant regressor and that the changes in the mean consumption can be explained by it to some extent. Moreover, the adjusted R-squared value has increased again (to 0.7763) with a lower residual standard error (1627).
In addition to the pandemic variable, it was decided to add a lockdown variable that marks the dates starting from the very first day of the strict prohibitions related to the spread of the Coronavirus, since this period is also expected to cause a change in the electricity consumption.
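The weekday-only rule for the holiday variable can be sketched as below. The holiday dates in `holiday_dates` are a hypothetical list used only for this example; the study's full list of dates is not shown in the report.

```r
dates <- seq(as.Date("2020-05-22"), by = "day", length.out = 6)

# Illustrative holiday dates (hypothetical list for the sketch)
holiday_dates <- as.Date(c("2020-05-24", "2020-05-25", "2020-05-26"))

# Weekend check that does not depend on the locale: 0 = Sunday, 6 = Saturday
is_weekend <- as.POSIXlt(dates)$wday %in% c(0, 6)

# Mark a holiday only when it falls on a weekday
holiday <- as.integer(dates %in% holiday_dates & !is_weekend)

data.frame(date = dates, is_weekend = is_weekend, holiday = holiday)
```

Here 24 May 2020 is a Sunday, so it stays 0 even though it is in the holiday list, while the following Monday and Tuesday are marked as 1.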
##
## Call:
## lm(formula = mean_consumption ~ time_index + mean_maxt + w_day +
## mon + pandemic + holiday + lockdown, data = daily_consumption1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11232.0 -741.9 81.2 972.2 4583.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34270.4530 379.8182 90.229 < 2e-16 ***
## time_index 2.5005 0.1416 17.661 < 2e-16 ***
## mean_maxt 57.0470 13.1339 4.343 1.50e-05 ***
## w_dayCumartesi -1859.3492 156.9138 -11.849 < 2e-16 ***
## w_dayÇarşamba 92.8859 156.7768 0.592 0.5536
## w_dayPazar -5006.7033 156.7440 -31.942 < 2e-16 ***
## w_dayPazartesi -890.5242 156.4977 -5.690 1.53e-08 ***
## w_dayPerşembe 126.8054 156.7933 0.809 0.4188
## w_daySalı -76.7092 156.5878 -0.490 0.6243
## monAralık -2043.1352 331.1670 -6.170 8.88e-10 ***
## monEkim -5423.1828 229.0628 -23.676 < 2e-16 ***
## monEylül -2146.7119 207.1435 -10.363 < 2e-16 ***
## monHaziran -4314.7654 212.4939 -20.305 < 2e-16 ***
## monKasım -3768.6325 282.1677 -13.356 < 2e-16 ***
## monMart -3655.1458 307.9281 -11.870 < 2e-16 ***
## monMayıs -5955.7437 226.6812 -26.274 < 2e-16 ***
## monNisan -5554.2990 270.0071 -20.571 < 2e-16 ***
## monOcak -1467.7975 368.4785 -3.983 7.13e-05 ***
## monŞubat -1644.8245 351.3255 -4.682 3.11e-06 ***
## monTemmuz 376.2539 204.4153 1.841 0.0659 .
## pandemic -1722.5520 155.1973 -11.099 < 2e-16 ***
## holiday -9090.0306 343.9263 -26.430 < 2e-16 ***
## lockdown 1959.2086 282.3770 6.938 5.98e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1601 on 1443 degrees of freedom
## Multiple R-squared: 0.7867, Adjusted R-squared: 0.7834
## F-statistic: 241.9 on 22 and 1443 DF, p-value: < 2.2e-16
Although the lockdown variable is not very effective in increasing the adjusted R-squared value (0.7763 to 0.7834) or lowering the residual standard error (1627 to 1601), it has turned out to be significant to the model. Therefore, it will remain in the further steps of the analysis.
After reaching a considerably good adjusted R-squared value with this latest model, it is argued that checking the residuals of the model could be useful in determining what to implement next.
##
## Breusch-Godfrey test for serial correlation of order up to 30
##
## data: fitlm_7
## LM test = 919.88, df = 30, p-value < 2.2e-16
Looking at the plot of the residuals obtained, it can be seen that the residuals are distributed over a wide range of values due to a few outliers. It may be necessary to narrow this range by adding outlier variables that flag the residuals below the 10% quantile and above the 95% quantile, which are obtained with the quantile function.
Also, based on the ACF plot and the result of the Breusch-Godfrey test, the autocorrelation at lag 1 seems to be problematic. It can be handled with different approaches, at least to some extent. One of them is to shift the residuals by one day and add them to the daily consumption data, so that a new model can be built with the lagged residuals as a regressor.
The new model is established by adding three new variables (small outliers, large outliers and lagged residuals) that were found to be necessary for the accuracy of the predictions.
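The quantile-based flags can be sketched as follows. The residual vector `res` is simulated here as a stand-in for the residuals of the fitted model, and the cut-offs follow the 10% and 95% levels stated in the text.

```r
set.seed(3)
res <- rnorm(500, sd = 1600)  # stand-in for the residuals of the fitted model

# Cut-offs at the 10% and 95% quantiles, as stated in the text
cutoffs <- quantile(res, probs = c(0.10, 0.95))

small_outlier <- as.integer(res < cutoffs[1])
large_outlier <- as.integer(res > cutoffs[2])

# Share of observations flagged by each dummy
c(small = mean(small_outlier), large = mean(large_outlier))
```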
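Shifting the residuals by one day and refitting can be sketched as below, on a synthetic series with autocorrelated noise. The column name `lag1_res` is hypothetical (the report's own variable is called `lag1_manip`), and the one-regressor model is only a placeholder for the full specification.

```r
set.seed(11)
n <- 100
daily <- data.frame(time_index = seq_len(n))

# AR(1)-type noise so that consecutive residuals are correlated (synthetic)
noise <- as.numeric(arima.sim(list(ar = 0.6), n = n, sd = 500))
daily$mean_consumption <- 30000 + 2 * daily$time_index + noise

fit <- lm(mean_consumption ~ time_index, data = daily)
res <- unname(residuals(fit))

# Shift the residuals by one day; the first day has no previous residual
daily$lag1_res <- c(NA, head(res, -1))

# lm() drops the row with the missing lag, so one observation is lost
fit_lag <- lm(mean_consumption ~ time_index + lag1_res, data = daily)
c(nobs(fit), nobs(fit_lag))
```

The loss of one row is consistent with the summaries in the report, where the degrees of freedom imply 1466 observations before the lagged residuals are added and 1465 afterwards.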
##
## Call:
## lm(formula = mean_consumption ~ time_index + mean_maxt + w_day +
## mon + pandemic + holiday + lag1_manip + small_outlier + large_outlier +
## lockdown, data = trydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5789.2 -395.6 30.1 428.9 3868.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.495e+04 1.916e+02 182.424 < 2e-16 ***
## time_index 2.645e+00 7.123e-02 37.134 < 2e-16 ***
## mean_maxt 2.810e+01 6.596e+00 4.261 2.17e-05 ***
## w_dayCumartesi -1.759e+03 7.808e+01 -22.533 < 2e-16 ***
## w_dayÇarşamba 1.724e+02 7.799e+01 2.211 0.027222 *
## w_dayPazar -4.904e+03 7.807e+01 -62.819 < 2e-16 ***
## w_dayPazartesi -8.486e+02 7.785e+01 -10.901 < 2e-16 ***
## w_dayPerşembe 2.008e+02 7.801e+01 2.575 0.010134 *
## w_daySalı -5.234e+01 7.790e+01 -0.672 0.501758
## monAralık -2.703e+03 1.668e+02 -16.206 < 2e-16 ***
## monEkim -5.795e+03 1.161e+02 -49.924 < 2e-16 ***
## monEylül -2.207e+03 1.047e+02 -21.084 < 2e-16 ***
## monHaziran -4.461e+03 1.070e+02 -41.684 < 2e-16 ***
## monKasım -4.124e+03 1.419e+02 -29.052 < 2e-16 ***
## monMart -4.288e+03 1.557e+02 -27.548 < 2e-16 ***
## monMayıs -5.992e+03 1.135e+02 -52.788 < 2e-16 ***
## monNisan -5.668e+03 1.357e+02 -41.759 < 2e-16 ***
## monOcak -2.039e+03 1.850e+02 -11.018 < 2e-16 ***
## monŞubat -2.348e+03 1.773e+02 -13.243 < 2e-16 ***
## monTemmuz 3.471e+02 1.022e+02 3.397 0.000699 ***
## pandemic -1.568e+03 7.764e+01 -20.202 < 2e-16 ***
## holiday -7.230e+03 1.849e+02 -39.104 < 2e-16 ***
## lag1_manip 4.969e-01 1.741e-02 28.546 < 2e-16 ***
## small_outlier -2.400e+03 8.624e+01 -27.824 < 2e-16 ***
## large_outlier 1.411e+03 1.085e+02 12.998 < 2e-16 ***
## lockdown 1.735e+03 1.412e+02 12.284 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 796.1 on 1439 degrees of freedom
## Multiple R-squared: 0.9472, Adjusted R-squared: 0.9463
## F-statistic: 1034 on 25 and 1439 DF, p-value: < 2.2e-16
The most recently added variables are all found to be significant and the adjusted R-squared value has increased dramatically, from 0.7834 to 0.9463. Moreover, the residual standard error is roughly halved, from 1601 to 796.1.
It can be seen that the distribution of the residuals is quite similar to the Normal distribution, as desired. Additionally, the mean is around zero with mostly constant variance, although some outliers remain that were not accounted for by the quantile function. Most of the ACF values are within the desired interval shown with the blue dashed lines. However, there is still a significant level of autocorrelation at lag 1. At this point, it may be reasonable to add a lagged value of the mean consumption itself as a new regressor; however, this approach can be challenging when making predictions, since this study aims to make predictions for 2 days and one of the lagged values would be based solely on the prediction of the previous day.
To compare the last two models, residuals are plotted together:
When the plots of the residuals are compared, it can be seen that the residuals of the last model, Model 8, lie within a narrower range thanks to the added variables. Based on the adjusted R-squared value, the residual standard error, the plot of the residuals and the ACF plot, it is decided that Model 8 will be used to make predictions.
In this part, two new rows are added to the data set. These rows represent the day that the submission is made and the day after (actual day to be forecasted); so, they take Sys.Date() and Sys.Date() + 1 for their Date columns respectively. The other variables that are used in the model are filled as well, since their values are already known.
After filling the empty slots for the day of the submission, a prediction was made. With the help of the newly forecasted value, the model was utilized again, since the lagged residuals had to be filled using the prediction from the previous day.
In this step, the forecasted daily mean consumption amounts are divided into hourly values, since the aim of this study is to predict the electricity consumption for each hour. This transformation was done by utilizing the hourly distributions from the same day of the week in the last 4 weeks. For each of these days, a percentage distribution was obtained by dividing each hour’s value by that day’s total consumption. With the resulting percentage vector, the daily predictions were distributed over 24 hours.
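The distribution step can be sketched as follows. The hourly matrix is synthetic, and averaging the four weekly share vectors is an assumption about how the four weeks were combined; the report does not show the exact computation.

```r
set.seed(5)
# Synthetic hourly consumption for the same weekday in the last 4 weeks:
# a 4 x 24 matrix, one row per week (values are made up)
profile <- 30000 + 5000 * sin(pi * (0:23) / 23)
last4 <- t(replicate(4, profile + rnorm(24, sd = 500)))

# Share of each hour in its own day's total, then averaged over the 4 weeks
shares_by_week <- last4 / rowSums(last4)
hourly_share <- colMeans(shares_by_week)

# A daily mean forecast implies a daily total of 24 times the mean
daily_mean_forecast <- 35000
hourly_forecast <- hourly_share * daily_mean_forecast * 24

round(hourly_forecast)
```

Because the shares sum to one, the mean of the 24 hourly forecasts equals the daily mean forecast, so the hourly split changes only the shape within the day and not the daily level.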
Below is an exemplary plot of the hourly distribution for the 7th of January 2021.
For this next part of the study, the obtained results are summarized and the model is evaluated by comparing the actual and predicted values of the estimated days.
When the established models are examined from the beginning, it can be seen that the adjusted R-squared values of the models were constantly increasing with the last model’s value being 0.9463. Similarly, the residual standard error was decreasing and the lowest value was obtained with the latest model. After obtaining the hourly values by distribution, the forecasts should be compared with the actual values.
According to the graph of the predictions that were made for 14 days and their actual values, some days appear to be overpredicted and some underpredicted. The predicted hourly shape for a given day of the week is quite similar from one week to the next, because the daily consumption was distributed over 24 hours based on the same day of the week in the last 4 weeks. Another point to notice is that the early hours of the day are predicted quite close to the actual values.
In addition to the visual interpretation, some statistics such as bias, MAPE, MAD and WMAPE are obtained by passing the forecasted and actual values to a function, in order to evaluate the estimates according to fixed criteria.
## n mean sd bias mape mad wmape
## 1 24 35959.29 3555.814 0.02527854 0.03484564 1172.108 0.03259541
When the function is run and the statistics are taken for the first day of the forecast, it can be observed that the model shows relatively promising results, with a WMAPE of about 0.033. Also, when the plot is examined, it should be noticed that the predicted values overlap with the actual ones, especially for the later hours of the day.
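A function producing these columns could look like the sketch below. The definitions are assumptions chosen to be consistent with the printed column names, with the error taken as actual minus forecast; the function name `accuracy_stats` is hypothetical and the study's own function is not shown.

```r
# Assumed definitions, chosen to be consistent with the printed column names
accuracy_stats <- function(actual, forecast) {
  error <- actual - forecast
  n <- length(actual)
  mad <- sum(abs(error)) / n
  data.frame(
    n = n,
    mean = mean(actual),
    sd = sd(actual),
    bias = sum(error) / sum(actual),
    mape = sum(abs(error / actual)) / n,
    mad = mad,
    wmape = mad / mean(actual)
  )
}

# Tiny made-up example
actual <- c(100, 200, 300)
forecast <- c(110, 190, 300)
stats <- accuracy_stats(actual, forecast)
print(stats)
```

In the printed results, the WMAPE equals the MAD divided by the mean (for the first day, 1172.108 / 35959.29 ≈ 0.0326), which agrees with this definition. Under these definitions, a bias equal to the WMAPE, as in the third day's output, would mean that all 24 hourly errors have the same sign.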
## n mean sd bias mape mad wmape
## 1 24 36445.91 5573.227 0.06803414 0.06401683 2479.566 0.06803414
When the statistics for the third day are examined, it is observed that the WMAPE value (0.068) is roughly twice the value of the first day (0.033). The result was not found good enough; therefore, it was necessary to return to the model and make some improvements. The small and large outlier variables mentioned above in the Approach part were among the improvement steps added to the model at this stage, in order to make the residuals of the model more stationary and obtain better predictions. Furthermore, although the actual mean consumption of the day in question and the forecasted mean value are not very different from each other, it was concluded that the model was not successful enough in distributing the obtained forecasts to the hourly level.
The methods tried to perform this transformation can be listed as follows:
Obtaining a vector of percentage coefficients by taking the mean of the last two weeks’ hourly consumption values for each hour and dividing them by their total sum
Obtaining a vector of percentage coefficients by taking the mean of the last four weeks’ hourly consumption values for each hour and dividing them by their total sum, based on the idea that a longer period would result in a more reliable distribution
Obtaining a vector of percentage coefficients by taking the mean of the same day of the week in the last 4 weeks for each hour and dividing them by their sum, since the distribution of the consumption over the hours is quite similar for the same day of the week
Since the last one of the possible transformations gave the best results, this method was chosen for the rest of the forecasting period.
## n mean sd bias mape mad wmape
## 1 24 36262.57 4036.248 -0.03133955 0.03130055 1218.997 0.03361584
The effect of the improvements was visible on the fifth day, and the statistics started to give better results consistently. After this stage, no new arrangement was made on the model. The results of the next few days are shown below as an example.
## n mean sd bias mape mad wmape
## 1 24 34965.04 5117.476 -0.02127399 0.02052862 790.4313 0.02260633
## n mean sd bias mape mad wmape
## 1 24 36265.92 4376.688 -0.007807631 0.01088025 417.0553 0.01149992
When the entire estimated period is evaluated, the average results turn out to be as follows:
## n mean sd bias mape mad wmape
## 1 360 35002.22 4174.216 -0.0140788 0.03337818 1210.059 0.03457093
It can be seen that although significantly better results were obtained after the first couple of days, the overall WMAPE and MAPE still show the effects mistakes that were made in earlier predictions, however it is worth mentioning that especially in the last week the WMAPE results were between 0.01149 and 0.03775 with most of them being below 0.03457093.
To conclude, it sis important to go over what has been done in the study so far. The aim of the study was to predict the next day’s hourly electricity consumption with the help of a model which could make use of the exact values up to the day before the actual day of the prediction. Since handling an hourly data can cause problems and requires the use of the predictions, daily data was preferred to make forecasts. Everyday, actual values are taken from the EPIAS website and transformed into mean daily consumption amounts by summing up the values for each hour and dividing the acquired sum by 24.
There are several ways to build a forecasting approach, such as decomposing the time series and fitting an ARIMA model (which requires working with stationary data), using an ARIMAX model, or using a linear regression model, which was the method preferred in this project. Besides the given temperature values of seven different locations around Turkey, some other candidate regressors were added to a data table and used in alternative models in different combinations. It’s worth mentioning that there might have been better options for the independent variables, such as humidity values (which strongly affect the felt temperature), a variable for national holidays, or a curfew variable that specifies the restrictions for weekends as well. Hence, the model could be improved by adding further, more effective variables. Their absence can be considered one of the aspects that might increase the margin of error of this study.
Furthermore, the best model, which had the highest adjusted R-squared value and the lowest residual standard error, was chosen to make predictions. Since the predictions were made for the total consumption of a day, they had to be transformed into hourly values. After rather high WMAPE values were observed in the first couple of days, outliers were handled, the lockdown variable was added, and a different approach was followed while distributing daily predictions to each hour, which provided better WMAPE values in the following days. In addition, in the first days of the submission period the transition to hourly values was based on the last two weeks; however, since every day differs from the others, the code was changed to consider the day of the week while distributing to 24 hours. As a result, the model was improved based on each day’s WMAPE results by adding new variables.
Although a model is already built and used for predictions, there is always room for improvement. Here are several possible suggestions that could result in a better model, hence a better forecast.
Adding Lagged Values as a Regressor
As discussed before, there is a degree of autocorrelation among the daily consumption values, specifically at lag 1. This was not accounted for in the model. Moreover, although Model 8 was accepted and used for predictions, the autocorrelation function of its residuals still showed significant autocorrelation at lag 1. Therefore, adding lagged electricity consumption values as a regressor may in fact be a good idea.
##
## Call:
## lm(formula = mean_consumption ~ time_index + mean_maxt + w_day +
## mon + pandemic + holiday + lag1_manip + small_outlier + large_outlier +
## lockdown + lagged_cons_manip, data = trydata_for_lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5775.8 -345.5 17.7 361.9 2945.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.286e+04 6.807e+02 33.577 < 2e-16 ***
## time_index 1.761e+00 8.017e-02 21.969 < 2e-16 ***
## mean_maxt 2.012e+01 5.954e+00 3.379 0.000748 ***
## w_dayCumartesi -1.612e+03 7.076e+01 -22.777 < 2e-16 ***
## w_dayÇarşamba 2.917e+02 7.052e+01 4.137 3.72e-05 ***
## w_dayPazar -4.214e+03 7.971e+01 -52.866 < 2e-16 ***
## w_dayPazartesi 8.822e+02 1.174e+02 7.512 1.02e-13 ***
## w_dayPerşembe 2.563e+02 7.029e+01 3.646 0.000276 ***
## w_daySalı 3.565e+02 7.358e+01 4.845 1.40e-06 ***
## monAralık -1.841e+03 1.573e+02 -11.699 < 2e-16 ***
## monEkim -3.903e+03 1.467e+02 -26.605 < 2e-16 ***
## monEylül -1.511e+03 1.015e+02 -14.884 < 2e-16 ***
## monHaziran -2.894e+03 1.287e+02 -22.479 < 2e-16 ***
## monKasım -2.718e+03 1.490e+02 -18.247 < 2e-16 ***
## monMart -2.895e+03 1.593e+02 -18.168 < 2e-16 ***
## monMayıs -3.942e+03 1.513e+02 -26.057 < 2e-16 ***
## monNisan -3.743e+03 1.609e+02 -23.258 < 2e-16 ***
## monOcak -1.298e+03 1.714e+02 -7.573 6.48e-14 ***
## monŞubat -1.583e+03 1.650e+02 -9.596 < 2e-16 ***
## monTemmuz 1.753e+02 9.245e+01 1.897 0.058088 .
## pandemic -1.024e+03 7.593e+01 -13.481 < 2e-16 ***
## holiday -5.644e+03 1.875e+02 -30.094 < 2e-16 ***
## lag1_manip 2.644e-01 2.015e-02 13.123 < 2e-16 ***
## small_outlier -1.863e+03 8.295e+01 -22.462 < 2e-16 ***
## large_outlier 1.081e+03 9.935e+01 10.878 < 2e-16 ***
## lockdown 1.150e+03 1.310e+02 8.777 < 2e-16 ***
## lagged_cons_manip 3.440e-01 1.873e-02 18.369 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 716.7 on 1438 degrees of freedom
## Multiple R-squared: 0.9573, Adjusted R-squared: 0.9565
## F-statistic: 1239 on 26 and 1438 DF, p-value: < 2.2e-16
By looking at the summary of the model, it appears that the adjusted R-squared value has increased and the residual standard error has decreased significantly. Furthermore, all of the variables can be considered significant to the model, with only one of them having a p-value between 0.05 and 0.1. It may be useful to look at the autocorrelation function of the residuals and compare it with that of the actual model.
It should be noticed that the overall autocorrelation seems to have decreased in the alternative model; however, the autocorrelation at lag 1 is still significant for the residuals.
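The idea of adding a lagged regressor and then checking the residual autocorrelation can be sketched on simulated data. Everything below is an assumption made for illustration: the variable names, the simulated series and the single temperature regressor do not reproduce the model in the summary above.

```r
# Simulated daily series with AR(1) noise (illustration only)
set.seed(42)
n <- 300
temp <- 15 + 10 * sin(2 * pi * seq_len(n) / 365) + rnorm(n)
noise <- as.numeric(arima.sim(model = list(ar = 0.6), n = n, sd = 300))
d <- data.frame(cons = 30000 + 50 * temp + noise, temp = temp)

# Lag-1 regressor: yesterday's consumption (the first day has no predecessor)
d$lag1_cons <- c(NA, head(d$cons, -1))

fit_base <- lm(cons ~ temp, data = d)
fit_lag  <- lm(cons ~ temp + lag1_cons, data = d)  # the row with NA is dropped

# Element 1 of the acf array is lag 0, so element 2 is the lag-1 autocorrelation
r1_base <- acf(residuals(fit_base), plot = FALSE)$acf[2]
r1_lag  <- acf(residuals(fit_lag),  plot = FALSE)$acf[2]

# Approximate 95% bound used in ACF plots
bound <- 1.96 / sqrt(n)
```

Comparing `abs(r1_base)` and `abs(r1_lag)` with `bound` shows whether the lag-1 autocorrelation remains significant after the lagged value is added.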
Adding Trend for Pandemic as a Regressor
When daily electricity consumption values are plotted together, it can be seen that after plummeting at the beginning of the pandemic, the consumption has risen in an almost linear fashion. So instead of marking the period of the pandemic with 1’s, adding a trend component for that period only may lead to better predictions.
##
## Call:
## lm(formula = mean_consumption ~ time_index + mean_maxt + w_day +
## mon + holiday + lag1_manip + small_outlier + large_outlier +
## pandemic_new_trend + lockdown, data = trydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5299.0 -420.6 19.9 456.6 4574.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.501e+04 2.066e+02 169.437 < 2e-16 ***
## time_index 2.262e+00 7.145e-02 31.656 < 2e-16 ***
## mean_maxt 3.291e+01 7.132e+00 4.615 4.28e-06 ***
## w_dayCumartesi -1.749e+03 8.361e+01 -20.922 < 2e-16 ***
## w_dayÇarşamba 1.733e+02 8.352e+01 2.074 0.0382 *
## w_dayPazar -4.896e+03 8.360e+01 -58.560 < 2e-16 ***
## w_dayPazartesi -8.516e+02 8.336e+01 -10.215 < 2e-16 ***
## w_dayPerşembe 2.072e+02 8.354e+01 2.480 0.0132 *
## w_daySalı -5.380e+01 8.342e+01 -0.645 0.5190
## monAralık -2.516e+03 1.786e+02 -14.089 < 2e-16 ***
## monEkim -5.623e+03 1.252e+02 -44.903 < 2e-16 ***
## monEylül -2.106e+03 1.122e+02 -18.775 < 2e-16 ***
## monHaziran -4.558e+03 1.145e+02 -39.797 < 2e-16 ***
## monKasım -3.904e+03 1.528e+02 -25.552 < 2e-16 ***
## monMart -4.342e+03 1.666e+02 -26.057 < 2e-16 ***
## monMayıs -6.136e+03 1.213e+02 -50.582 < 2e-16 ***
## monNisan -5.842e+03 1.449e+02 -40.312 < 2e-16 ***
## monOcak -1.875e+03 1.987e+02 -9.439 < 2e-16 ***
## monŞubat -2.200e+03 1.902e+02 -11.570 < 2e-16 ***
## monTemmuz 2.737e+02 1.094e+02 2.501 0.0125 *
## holiday -7.056e+03 1.980e+02 -35.637 < 2e-16 ***
## lag1_manip 5.387e-01 1.894e-02 28.445 < 2e-16 ***
## small_outlier -2.541e+03 9.212e+01 -27.583 < 2e-16 ***
## large_outlier 1.338e+03 1.161e+02 11.528 < 2e-16 ***
## pandemic_new_trend -7.188e+00 5.484e-01 -13.107 < 2e-16 ***
## lockdown 2.382e+03 1.835e+02 12.977 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 852.5 on 1439 degrees of freedom
## Multiple R-squared: 0.9395, Adjusted R-squared: 0.9385
## F-statistic: 893.9 on 25 and 1439 DF, p-value: < 2.2e-16
After investigating the summary of the alternative model, it should be noticed that although the trend component is considered significant to the model, the R-squared value has decreased and the residual standard error has increased. Consequently, the alternative model shouldn’t be expected to give better results than the model at hand. However, it may make more sense to introduce a trend element for a shorter period.
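The difference between a pandemic indicator and a pandemic trend regressor can be illustrated as follows. The start date used here is an assumption chosen for the example, and the construction of `pandemic_new_trend` in the fitted model may differ from this sketch.

```r
dates <- seq(as.Date("2020-03-01"), by = "day", length.out = 30)
pandemic_start <- as.Date("2020-03-11")  # assumed start date, for illustration only

d <- data.frame(date = dates)

# Indicator: 1 on every day of the pandemic period, 0 before it
d$pandemic <- as.integer(d$date >= pandemic_start)

# Trend: 0 before the pandemic, then 1, 2, 3, ... from the first pandemic day
d$pandemic_new_trend <- pmax(0, as.numeric(d$date - pandemic_start) + 1)
```

With the indicator, the model can only shift the level during the pandemic; with the trend, the fitted effect changes linearly as the pandemic goes on.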
Using Hourly Values for Regression
Using hourly values to build a regression model is a possibility and may lead to better results, since the model at hand does not consider whether there is any autocorrelation among the consecutive hours of a day. However, for convenience of application and ease of interpretation, it was decided to distribute the predicted daily value to the hours after the prediction was made.
Time Series Analysis for Predictions
By treating the given consumption data as a time series, one can remove its trend and seasonality by decomposition and then fit an ARIMA model to the remainder. This approach would lead to different predictions. In this case, however, time series analysis was not preferred because an external variable (i.e. temperature) was given. It’s also worth mentioning that there might be difficulties in dealing with the data as a time series, since multiple seasonalities are present.
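A minimal sketch of this approach in base R is given below, on simulated data with a weekly pattern only. The series, the additive decomposition and the AR(1) order are assumptions chosen for the illustration, not results from the study.

```r
# Simulated daily series with a linear trend and a weekly pattern (illustration only)
set.seed(7)
n <- 7 * 40
weekly <- rep(c(500, 600, 550, 500, 300, -1000, -1450), length.out = n)
cons <- 30000 + 2 * seq_len(n) + weekly +
  as.numeric(arima.sim(model = list(ar = 0.5), n = n, sd = 200))

# frequency = 7 tells R that the seasonal period is one week
y <- ts(cons, frequency = 7)

# Classical additive decomposition into trend, seasonal and random parts
dec <- decompose(y, type = "additive")

# The moving-average trend leaves NAs at both ends of the random component
remainder <- na.omit(dec$random)

# Fit an ARMA model to the (approximately stationary) remainder
fit <- arima(remainder, order = c(1, 0, 0))
```

`decompose` handles a single seasonal period, which is one reason why data with both weekly and yearly seasonality is harder to treat this way.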
Using an ARIMAX Model
An ARIMAX model differs from an ARIMA model in that the X at the end of the name stands for “exogenous”. In other words, it adds one or more external variables to help model the variable at hand. For this study, it makes more sense to use an ARIMAX model, since there are known outside factors affecting the electricity consumption, such as the weather.
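In base R, a regression with ARIMA errors can be fitted by passing the external variables to `arima` through its `xreg` argument, which is one common way to work with an ARIMAX-type model. The sketch below uses simulated data; the AR(1) order, the variable names and the next-day temperature value are assumptions.

```r
# Simulated consumption driven by temperature plus AR(1) noise (illustration only)
set.seed(11)
n <- 200
temp <- 15 + 8 * sin(2 * pi * seq_len(n) / 365) + rnorm(n)
cons <- 30000 + 40 * temp +
  as.numeric(arima.sim(model = list(ar = 0.6), n = n, sd = 250))

# AR(1) model with temperature as an exogenous regressor
fit <- arima(cons, order = c(1, 0, 0), xreg = cbind(temp = temp))

# One-day-ahead forecast needs the regressor value for that day
next_temp <- cbind(temp = 18)  # hypothetical temperature forecast for tomorrow
fc <- predict(fit, n.ahead = 1, newxreg = next_temp)
```

`fc$pred` holds the point forecast and `fc$se` its standard error.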
Using a Different Distribution Technique
For the distribution of the total consumption to each hour of the day, the last four weeks were utilized. A different possible application would be to extend this period or to weight the weeks unequally, i.e. to give more weight to more recent dates.
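One way to give more weight to recent weeks is an exponentially decaying weight, as in the sketch below. The function name `weighted_hourly_shares`, the six-week window, the decay factor of 0.7 and the simulated data are all hypothetical choices, not values tested in the study.

```r
# Hypothetical helper: hourly shares with more weight on recent weeks.
# `hourly` is assumed to have columns date (Date), hour (0-23) and consumption.
weighted_hourly_shares <- function(hourly, target_date, n_weeks = 6, decay = 0.7) {
  # Same day of the week in each of the previous n_weeks weeks, most recent first
  ref_dates <- target_date - 7 * seq_len(n_weeks)
  # Weights 1, decay, decay^2, ... normalised to sum to 1
  w <- decay^(seq_len(n_weeks) - 1)
  w <- w / sum(w)
  profile <- numeric(24)
  for (k in seq_len(n_weeks)) {
    day <- hourly[hourly$date == ref_dates[k], ]
    stopifnot(nrow(day) == 24)        # each reference day must be complete
    day <- day[order(day$hour), ]
    profile <- profile + w[k] * day$consumption
  }
  # Normalise the weighted hourly profile into shares that sum to 1
  profile / sum(profile)
}

# Simulated hourly data for 49 days (illustration only)
set.seed(3)
dates <- seq(as.Date("2021-01-01"), by = "day", length.out = 49)
hourly <- data.frame(
  date = rep(dates, each = 24),
  hour = rep(0:23, times = length(dates))
)
hourly$consumption <- 30000 + 5000 * sin(pi * hourly$hour / 24) +
  rnorm(nrow(hourly), sd = 200)

shares <- weighted_hourly_shares(hourly, as.Date("2021-02-19"))
```

Setting `decay = 1` gives every week the same weight, which corresponds to a simple average over the window.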
The R Markdown file of the report and the code used for submissions can be reached by clicking the following links: