1 Introduction

In this project, the main task is to develop an approach for forecasting Turkey’s hourly electricity consumption rate (MWh) from 30 January 2021 to 13 February 2021. Every day, the predictions consist of the electricity consumption rates of the next 24 hours of the next day using the data available, i.e. electricity consumption data until the day before (included).

In addition to the past consumption data, the hourly temperature data belonging to seven different locations close to the big cities in Turkey are also provided. These cities are Antalya, Adana, Konya, İzmir, Eskişehir, Ankara and İstanbul respectively. However, some of the locations coincide with the outskirts of the cities, so they may not reflect the temperatures of the corresponding cities accurately.

Moreover, the monthly industrial production index data is incorporated to the prediction model so that the residuals resulting from lower-than-normal production levels can be explained.

The best point to start is analyzing the data visually. Here is the plot containing three consecutive days:

# Read data from file
bulkdata <- fread('bulk_consumption_with_temp.csv')

# Data manipulation
bulkdata <- bulkdata[, Date := as.Date(Date)]
upcons <- fread('realtimecons.csv')
upcons[,Consumption := gsub('\\,', '', `Consumption (MWh)`)]
upcons[,Consumption := as.numeric(Consumption)]
bulkdata$Consumption <- upcons$Consumption[1:nrow(bulkdata)]

# Plotting a part of data
plot(bulkdata$Consumption[1:72],
     xlab = "Hour",
     ylab = "Consumption (MWh)",
     main = "Hourly Electricity Consumption Data between 01/01/17 and 03/01/17",
     type = 'l')

As can be seen above, hourly electricity consumption patterns of different days are similar. However, the overall consumption levels change from day to day. From now on, data is going to be transformed from hourly to daily level by using mean consumption for convenience. The back transformation from daily to hourly levels will be mentioned later. To see the daily patterns, three weeks of daily consumption data is plotted below:

# Adding weighted  mean temperature column to bulkdata
bulkdata[, weightedT:= 0.0919*T_1 + 0.0721*T_2 + 0.066*T_3 + 0.1656*T_4 + 0.0321*T_5 + 0.1504*T_6 + 0.4219*T_7]

daily_data <- bulkdata[,list(mean_consumption=mean(Consumption, na.rm = T),
                             mean_temp = mean(weightedT, na.rm = T)), by = list(Date)]

# Reading day type data from file
day_type <- read_excel("day_type.xlsx", range = "C399:C1866", col_names = FALSE)

# Adding day_type column to daily_data
daily_data <- cbind(daily_data, day_type)
names(daily_data)[4] <- "day_type"

# Plotting a portion of daily data
ggplot(data = daily_data[1:21], aes(x = Date, y = mean_consumption)) +
  geom_line(color = "darkred") +
  labs(title = "Daily Consumption Data between 01/01/17 and 21/01/17 ",
       x = "Date",
       y= "Consumption (MWh)") +
  scale_x_date(date_minor_breaks = "1 day", date_breaks = "3 days", date_labels = "%d %b") +
  theme_minimal()

Plot above implies that daily consumption data has a weekly pattern with lows on weekends. To model this pattern, day types in the model are going to be used. Day types are going to indicate the days of the week.

Next, it is sensible to plot the data in a wholesome manner to see any yearly pattern:

# Plotting the whole daily data
ggplot(data = daily_data[Date < '2021-01-01'], aes(x=Date, y= mean_consumption)) +
  geom_line(color= "darkred") +
  labs(title = "Daily Electricity Consumption Data between 01/01/17 and 01/01/21 ",
       x = "Date",
       y= "Consumption (MWh)") +
  scale_x_date(date_minor_breaks = "3 months", date_breaks = "6 months", date_labels = "%b %Y") +
  theme_minimal()

As expected, the data also has a seasonal pattern with highs on summers and winters. Hence, it makes sense to add another regressor indicating the month. Along with lagged consumption values, these two factors will be the non-external regressors.

In order to use the temperature data in the model without making the model complicated, weighted means of 7 different temperature values are calculated. Weights are determined using the consumption proportions of these cities in Turkey in 2019 and these proportions are supplied by Republic of Turkey Energy Market Regulatory Authority (Turkish: T.C. Enerji Piyasası Düzenleme Kurumu (EPDK)). After summarizing 7 values in one weighted mean temperature value, it will be transformed into daily basis by taking means as it is done for consumption data. After this point, the most crucial part is to decide how to use these temperatures. The method used is explained with reasons in the next section.

2 Related Literature

In this project, temperature’s effect on electricity consumption is approached in a unique way. From the lectures, it is known that there is no direct linear relationship between temperature and electricity consumption. Therefore, a better relationship between the two is needed. A research study made by Derya AYDIN, Ahmet Faruk KAVAK and Hüseyin TOROS which favors heating degree day (HDD) and cooling degree day (CDD) values over temperature itself when modeling electricity consumption..

Basically, heating degree day (HDD) is a measurement aimed to quantify the demand for energy needed to heat a building. HDD values are obtained from measurements of outside air temperature. The heating necessities of a given building at a particular region are considered to be directly proportional to the HDD value at that region. Oppositely, cooling degree day (CDD), quantifies demand for air conditioning. (Chen 2019)

HDD expresses how cold the average air temperature is in a day compared to the base value. In other words, the high number of heating degree days means that the need for heating is high in that area. CDD expresses how hot the average temperature of a day is compared to the base value. The number of degree days for cooling also refers to the number of days needed for cooling. (Toros, Aydın, and Kavak 2015, 30) For example, if HDD is 3, then one will be in need of heating for 3 degrees. The logic behind CDD is also the same. For instance, if CDD is 3, then one will be in need of cooling for 3 degrees.

From the definitions, base degrees have to be chosen for HDD and CDD computations. In the research study, the base degrees 18°C and 22°C are suggested for HDD and CDD, respectively. (Toros, Aydın, and Kavak 2015, 36) In this project, different combinations of other base degrees for HDD and CDD are tested. However, the base degrees 16°C and 20°C are the most correlated ones, which will be shown in the later parts of this report.

The formulas of HDD and CDD are: (Toros, Aydın, and Kavak 2015, 31)

\[ \mathrm{HDD} = \begin{cases} (T_b - T_m) \times d & T_m \leq T_b \\ 0 & T_m > T_b \end{cases} \hspace{4em} \mathrm{CDD} = \begin{cases} (T_m - T_b) \times d & T_m \geq T_b \\ 0 & T_m < T_b \end{cases} \]

Where

\(T_b\): Base temperature
\(T_m\): Daily average temperature
\(d\): number of days

There is also one more useful information used from the research study: the day type. It is mentioned above that data has weekly seasonality. In addition to days of the week, three more day types are included: religious holidays (Eid al-Fitr and Eid al-Adha), national holidays, and eve days. (Toros, Aydın, and Kavak 2015, 33–35) With the help of these additional day types, outliers resulting from holidays are greatly reduced. In the research study, it is suggested that the day type’s effect on consumption can be discarded. This approach is not used in this model. Instead, these day types are added as regressors in the regression model in order to reduce the steps for forecasting.

3 Approach

In this section, one base model will be established with only day type and month regressors, and the addition of other regressors will be decided with a comparative analysis. The base model is as follows:

# Adding month column to daily_data
daily_data[, month:= as.factor(month(Date))]

# First linear regression model
model_base <- lm(formula = mean_consumption~ -1 + as.factor(day_type) + as.factor(month),
             data =  daily_data)
summary(model_base)

## 
## Call:
## lm(formula = mean_consumption ~ -1 + as.factor(day_type) + as.factor(month), 
##     data = daily_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6857.9  -818.3   132.0   850.1  5579.5 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## as.factor(day_type)1  30426.42     161.69 188.180  < 2e-16 ***
## as.factor(day_type)2  34697.85     161.16 215.306  < 2e-16 ***
## as.factor(day_type)3  35573.58     160.27 221.958  < 2e-16 ***
## as.factor(day_type)4  35720.66     160.02 223.224  < 2e-16 ***
## as.factor(day_type)5  35836.10     160.03 223.938  < 2e-16 ***
## as.factor(day_type)6  35523.22     161.18 220.390  < 2e-16 ***
## as.factor(day_type)7  33790.60     161.31 209.480  < 2e-16 ***
## as.factor(day_type)8  25078.27     314.93  79.632  < 2e-16 ***
## as.factor(day_type)9  30873.07     297.41 103.808  < 2e-16 ***
## as.factor(day_type)10 29464.39     448.39  65.711  < 2e-16 ***
## as.factor(month)2      -528.51     188.95  -2.797  0.00523 ** 
## as.factor(month)3     -2486.78     184.43 -13.484  < 2e-16 ***
## as.factor(month)4     -4332.10     185.68 -23.331  < 2e-16 ***
## as.factor(month)5     -4076.65     184.43 -22.104  < 2e-16 ***
## as.factor(month)6     -2135.20     187.55 -11.385  < 2e-16 ***
## as.factor(month)7      2912.74     184.16  15.817  < 2e-16 ***
## as.factor(month)8      2791.02     186.21  14.989  < 2e-16 ***
## as.factor(month)9       -44.34     186.21  -0.238  0.81182    
## as.factor(month)10    -3143.24     184.62 -17.025  < 2e-16 ***
## as.factor(month)11    -1642.87     185.99  -8.833  < 2e-16 ***
## as.factor(month)12       22.50     184.42   0.122  0.90292    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1469 on 1447 degrees of freedom
## Multiple R-squared:  0.9981, Adjusted R-squared:  0.9981 
## F-statistic: 3.598e+04 on 21 and 1447 DF,  p-value: < 2.2e-16

checkresiduals(model_base)

## 
##  Breusch-Godfrey test for serial correlation of order up to 24
## 
## data:  Residuals
## LM test = 1007.8, df = 24, p-value < 2.2e-16

From the outputs above, it is possible to see that day type and months have significant effects on the consumption amounts. Base model’s residual standard error is 1469 which is the value that allows comparison between models. From residual analysis, it can be seen that none of the linear regression assumptions holds: The mean and variance of residuals are not stationary, residuals are highly autocorrelated, and normality assumption doesn’t hold either.

Next, an autoregressive part consisting of lag1 values of consumption data will be added with the expectation of lowering autocorrelation. New model is as follows:

# Adding consumption lag-1 column to daily_data
daily_data[, lag_1:= shift(mean_consumption,1)]

# Second linear regression model
model <- lm(formula = mean_consumption~ -1 + as.factor(day_type) + as.factor(month) +
               lag_1,
             data = daily_data)
summary(model)

## 
## Call:
## lm(formula = mean_consumption ~ -1 + as.factor(day_type) + as.factor(month) + 
##     lag_1, data = daily_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3787.8  -388.7   -24.4   352.9  4882.6 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## as.factor(day_type)1   5659.3722   412.9080  13.706  < 2e-16 ***
## as.factor(day_type)2  12412.4721   373.3616  33.245  < 2e-16 ***
## as.factor(day_type)3  10223.7814   422.0531  24.224  < 2e-16 ***
## as.factor(day_type)4   9743.0648   432.0588  22.550  < 2e-16 ***
## as.factor(day_type)5   9690.0068   434.7527  22.289  < 2e-16 ***
## as.factor(day_type)6   9309.0628   435.9616  21.353  < 2e-16 ***
## as.factor(day_type)7   7840.2561   431.7572  18.159  < 2e-16 ***
## as.factor(day_type)8   6230.3600   349.3914  17.832  < 2e-16 ***
## as.factor(day_type)9   6355.4894   432.7220  14.687  < 2e-16 ***
## as.factor(day_type)10  5151.9014   461.4833  11.164  < 2e-16 ***
## as.factor(month)2      -313.3105    99.5426  -3.148  0.00168 ** 
## as.factor(month)3      -884.9829   100.6424  -8.793  < 2e-16 ***
## as.factor(month)4     -1264.5902   110.0535 -11.491  < 2e-16 ***
## as.factor(month)5     -1050.1449   109.2461  -9.613  < 2e-16 ***
## as.factor(month)6      -507.8245   102.3384  -4.962 7.80e-07 ***
## as.factor(month)7       701.2907   103.3111   6.788 1.65e-11 ***
## as.factor(month)8       716.2072   103.5845   6.914 7.04e-12 ***
## as.factor(month)9      -203.6299    98.0525  -2.077  0.03800 *  
## as.factor(month)10     -860.6618   104.2868  -8.253 3.45e-16 ***
## as.factor(month)11     -518.8496    99.6809  -5.205 2.22e-07 ***
## as.factor(month)12     -201.4896    97.1409  -2.074  0.03824 *  
## lag_1                     0.7362     0.0120  61.361  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 772.5 on 1445 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9995, Adjusted R-squared:  0.9995 
## F-statistic: 1.244e+05 on 22 and 1445 DF,  p-value: < 2.2e-16

checkresiduals(model)

## 
##  Breusch-Godfrey test for serial correlation of order up to 25
## 
## data:  Residuals
## LM test = 352.38, df = 25, p-value < 2.2e-16

New model’s residual standard error is almost the half of the base model, and the lag_1 regressor is highly significant. These results imply that added regressor is useful on explaining the data. New regressor was also helpful at decreasing the autocorrelation and stationarizing the mean of residuals as expected.

At this point, temperature’s effect will be added to the model with the help of CDD and HDD calculations as mentioned before. CDD20, CDD22 and HDD16, HDD18 options are compared and HDD16 - CDD20 pair is the most successful pair. Here is the new version of the model with temperature effect added:

# Calculating various HDD and CDD data and adding them to daily_data
daily_data <- daily_data[mean_temp<20,CDD20:=0]
daily_data <- daily_data[is.na(CDD20),CDD20:= mean_temp - 20]
daily_data <- daily_data[mean_temp<22,CDD22:=0]
daily_data <- daily_data[is.na(CDD22),CDD22:= mean_temp - 22]

daily_data <- daily_data[mean_temp>16,HDD16:=0]
daily_data <- daily_data[is.na(HDD16),HDD16:= 16 - mean_temp]
daily_data <- daily_data[mean_temp>18,HDD18:=0]
daily_data <- daily_data[is.na(HDD18),HDD18:= 18 - mean_temp]

# Adding second model's residuals to daily_data
daily_data[, residual := NA]
daily_data$residual[2:1468] <- model$residuals

# Third linear regression model
model <- lm(formula = mean_consumption~ -1 + as.factor(day_type) + as.factor(month) +
               lag_1 +
               HDD16 + CDD20,
             data = daily_data)
summary(model)

## 
## Call:
## lm(formula = mean_consumption ~ -1 + as.factor(day_type) + as.factor(month) + 
##     lag_1 + HDD16 + CDD20, data = daily_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4044.2  -345.3   -12.1   336.7  4468.6 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## as.factor(day_type)1   7.120e+03  4.129e+02  17.242  < 2e-16 ***
## as.factor(day_type)2   1.368e+04  3.750e+02  36.488  < 2e-16 ***
## as.factor(day_type)3   1.171e+04  4.213e+02  27.802  < 2e-16 ***
## as.factor(day_type)4   1.126e+04  4.308e+02  26.145  < 2e-16 ***
## as.factor(day_type)5   1.123e+04  4.337e+02  25.896  < 2e-16 ***
## as.factor(day_type)6   1.087e+04  4.355e+02  24.964  < 2e-16 ***
## as.factor(day_type)7   9.390e+03  4.314e+02  21.768  < 2e-16 ***
## as.factor(day_type)8   7.075e+03  3.478e+02  20.341  < 2e-16 ***
## as.factor(day_type)9   7.865e+03  4.317e+02  18.217  < 2e-16 ***
## as.factor(day_type)10  6.558e+03  4.565e+02  14.366  < 2e-16 ***
## as.factor(month)2     -2.736e+02  9.578e+01  -2.856 0.004347 ** 
## as.factor(month)3     -8.574e+02  1.065e+02  -8.055 1.65e-15 ***
## as.factor(month)4     -1.263e+03  1.281e+02  -9.856  < 2e-16 ***
## as.factor(month)5     -1.061e+03  1.492e+02  -7.109 1.83e-12 ***
## as.factor(month)6     -8.559e+02  1.563e+02  -5.476 5.12e-08 ***
## as.factor(month)7     -2.762e+01  1.725e+02  -0.160 0.872785    
## as.factor(month)8     -2.592e+01  1.730e+02  -0.150 0.880934    
## as.factor(month)9     -4.702e+02  1.547e+02  -3.039 0.002419 ** 
## as.factor(month)10    -7.514e+02  1.470e+02  -5.110 3.66e-07 ***
## as.factor(month)11    -3.918e+02  1.185e+02  -3.307 0.000965 ***
## as.factor(month)12    -8.477e+01  9.858e+01  -0.860 0.389954    
## lag_1                  6.832e-01  1.199e-02  57.000  < 2e-16 ***
## HDD16                  2.688e+01  9.612e+00   2.797 0.005229 ** 
## CDD20                  2.754e+02  2.051e+01  13.426  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 727.9 on 1443 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9995, Adjusted R-squared:  0.9995 
## F-statistic: 1.284e+05 on 24 and 1443 DF,  p-value: < 2.2e-16

checkresiduals(model)

## 
##  Breusch-Godfrey test for serial correlation of order up to 27
## 
## data:  Residuals
## LM test = 409.63, df = 27, p-value < 2.2e-16

Seemingly, HDD16 and CDD20 regressors are significant and residual standard error decreases with the usage of them. There is no obvious difference in residual analysis. Model becomes more accurate, but there is still need for other regressors in order to get close to the assumptions.

As an improvement, industrial production index value is proposed next. If the residual plot is analyzed, it can be said that there is an overall drop on consumption at March-April 2020 which coincides with the emergence of COVID-19 pandemic. This values are probably due to low production volumes at shutdown conditions. Therefore, adding the industrial production index as an indicator of the electricity consumption with industrial production purposes may explain this overall decrease and maybe other fluctuations.

The data will be used is obtained from EVDS and it is on monthly basis. For the sake of the compliance, monthly values are reproduced in a daily manner with duplicated values in months. Following is the model with industrial production index added:

# Reading industrial production index data from file
prod <- read_excel("production.xlsx")

# Adding industrial production index column to daily_data
daily_data[, prod_index:= prod$production[1:1468]]

# Fourth linear regression model
model <- lm(formula = mean_consumption~ -1 + as.factor(day_type) + as.factor(month) +
              lag_1 +
              HDD16 + CDD20 +
              prod_index,
            data = daily_data)
summary(model)

## 
## Call:
## lm(formula = mean_consumption ~ -1 + as.factor(day_type) + as.factor(month) + 
##     lag_1 + HDD16 + CDD20 + prod_index, data = daily_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3487.5  -314.4   -12.6   328.2  4043.4 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## as.factor(day_type)1   6148.2008   378.6664  16.236  < 2e-16 ***
## as.factor(day_type)2  12267.1200   349.5680  35.092  < 2e-16 ***
## as.factor(day_type)3  10812.5524   385.6165  28.040  < 2e-16 ***
## as.factor(day_type)4  10483.0837   393.3126  26.653  < 2e-16 ***
## as.factor(day_type)5  10478.3324   395.8353  26.471  < 2e-16 ***
## as.factor(day_type)6  10137.9464   397.3417  25.514  < 2e-16 ***
## as.factor(day_type)7   8617.6611   393.8158  21.882  < 2e-16 ***
## as.factor(day_type)8   5313.3230   330.9863  16.053  < 2e-16 ***
## as.factor(day_type)9   6860.2246   395.8166  17.332  < 2e-16 ***
## as.factor(day_type)10  5641.0683   417.3992  13.515  < 2e-16 ***
## as.factor(month)2      -157.4671    87.1468  -1.807   0.0710 .  
## as.factor(month)3     -1379.8058   101.0173 -13.659  < 2e-16 ***
## as.factor(month)4     -1384.8236   116.4608 -11.891  < 2e-16 ***
## as.factor(month)5     -1320.9606   136.1773  -9.700  < 2e-16 ***
## as.factor(month)6      -822.9284   141.8051  -5.803 7.98e-09 ***
## as.factor(month)7       -74.6286   156.4852  -0.477   0.6335    
## as.factor(month)8       371.3014   158.5616   2.342   0.0193 *  
## as.factor(month)9      -740.9664   141.2143  -5.247 1.78e-07 ***
## as.factor(month)10    -1567.2284   141.2000 -11.099  < 2e-16 ***
## as.factor(month)11    -1064.8411   114.0449  -9.337  < 2e-16 ***
## as.factor(month)12     -960.6026   102.2957  -9.390  < 2e-16 ***
## lag_1                     0.5576     0.0130  42.889  < 2e-16 ***
## HDD16                    72.3174     9.0932   7.953 3.65e-15 ***
## CDD20                   359.7871    19.2164  18.723  < 2e-16 ***
## prod_index               44.4560     2.5205  17.638  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 660.4 on 1442 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9996, Adjusted R-squared:  0.9996 
## F-statistic: 1.498e+05 on 25 and 1442 DF,  p-value: < 2.2e-16

checkresiduals(model)

## 
##  Breusch-Godfrey test for serial correlation of order up to 28
## 
## data:  Residuals
## LM test = 297.06, df = 28, p-value < 2.2e-16

Since production index has a very low p-value, it is significant in the model. Residual standard error is also lower than the previous model’s error. Luckily, residuals analysis also gives better results in terms of assumptions. Residuals get closer to normal distribution, and some parts of the changing mean is explained with the help of industrial production index.

At this point, it might be helpful to check the outliers. When outliers are examined using 0.05 quantiles at both ends, it is realized that some days that are close to holidays are generally overestimated. For example, people generally take vacation on friday when an holiday coincides with thursday. Since the reason behind this outliers are known and there is no such days in the prediction period, they can be specified with two dummy variables, one for .05 and one for .95 quantiles, and model can be enhanced with this approach.

Following is the model with abovesaid dummy variables added:

# Detecting lower and upper 5% residuals and marking them as outlier_small and outlier_great
daily_data$residual[2:1468] <- model$residual
daily_data[!is.na(residual), quant5:=quantile(residual, 0.05)]
daily_data[!is.na(residual), quant95:=quantile(residual, 0.95)]
daily_data[,outlier_small:= as.numeric(residual < quant5)]
daily_data[,outlier_great:= as.numeric(residual > quant95)]

# Fifth linear regression model
model <- lm(formula = mean_consumption~ -1 + as.factor(day_type) + as.factor(month) +
              lag_1 +
              HDD16 + CDD20 +
              prod_index +
              outlier_small + outlier_great,
            data = daily_data)
summary(model)

## 
## Call:
## lm(formula = mean_consumption ~ -1 + as.factor(day_type) + as.factor(month) + 
##     lag_1 + HDD16 + CDD20 + prod_index + outlier_small + outlier_great, 
##     data = daily_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1983.08  -290.68   -16.72   284.26  2245.00 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## as.factor(day_type)1   5.666e+03  2.597e+02  21.819  < 2e-16 ***
## as.factor(day_type)2   1.196e+04  2.393e+02  49.968  < 2e-16 ***
## as.factor(day_type)3   1.037e+04  2.646e+02  39.204  < 2e-16 ***
## as.factor(day_type)4   1.003e+04  2.701e+02  37.120  < 2e-16 ***
## as.factor(day_type)5   9.990e+03  2.715e+02  36.798  < 2e-16 ***
## as.factor(day_type)6   9.683e+03  2.724e+02  35.544  < 2e-16 ***
## as.factor(day_type)7   8.217e+03  2.702e+02  30.417  < 2e-16 ***
## as.factor(day_type)8   5.050e+03  2.303e+02  21.927  < 2e-16 ***
## as.factor(day_type)9   6.475e+03  2.766e+02  23.413  < 2e-16 ***
## as.factor(day_type)10  5.687e+03  2.918e+02  19.489  < 2e-16 ***
## as.factor(month)2     -1.010e+02  5.777e+01  -1.748   0.0807 .  
## as.factor(month)3     -1.273e+03  6.698e+01 -18.999  < 2e-16 ***
## as.factor(month)4     -1.158e+03  7.733e+01 -14.979  < 2e-16 ***
## as.factor(month)5     -1.148e+03  9.033e+01 -12.708  < 2e-16 ***
## as.factor(month)6     -6.488e+02  9.405e+01  -6.899 7.81e-12 ***
## as.factor(month)7      1.487e+02  1.038e+02   1.432   0.1523    
## as.factor(month)8      4.933e+02  1.051e+02   4.695 2.93e-06 ***
## as.factor(month)9     -5.948e+02  9.360e+01  -6.355 2.79e-10 ***
## as.factor(month)10    -1.457e+03  9.368e+01 -15.555  < 2e-16 ***
## as.factor(month)11    -9.146e+02  7.567e+01 -12.087  < 2e-16 ***
## as.factor(month)12    -8.271e+02  6.784e+01 -12.192  < 2e-16 ***
## lag_1                  5.800e-01  8.865e-03  65.423  < 2e-16 ***
## HDD16                  7.541e+01  6.032e+00  12.501  < 2e-16 ***
## CDD20                  3.334e+02  1.286e+01  25.925  < 2e-16 ***
## prod_index             4.069e+01  1.672e+00  24.339  < 2e-16 ***
## outlier_small         -1.628e+03  5.700e+01 -28.553  < 2e-16 ***
## outlier_great          1.501e+03  5.670e+01  26.467  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 437.4 on 1440 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9998, Adjusted R-squared:  0.9998 
## F-statistic: 3.162e+05 on 27 and 1440 DF,  p-value: < 2.2e-16

checkresiduals(model)

## 
##  Breusch-Godfrey test for serial correlation of order up to 30
## 
## data:  Residuals
## LM test = 154.81, df = 30, p-value < 2.2e-16

As seen from the summary of the model, residual standard error decreases. More importantly, latest model’s residuals show the requested behaviour in most aspects. To be more precise, residuals seem more stationary than ever, and they fit to the normal distribution pretty well. For autocorrelation, possible enhancements will be discussed at next sections.

The last topic that will be discussed in this section is the transformation of the daily predictions to hourly ones. In this study, an easy but powerful method is used: Most recent same-type day, which is generally one week before, is used to find ratios of (hourly consumption)/(daily consumption mean). These ratios then used to calculate hourly predictions after daily mean consumption forecasting.

4 Results

As discussed above, the final model includes outlier points, lag-1, HDD-16, CDD-20, production index and day and month types. The summary of the final model and plotted fitted values and actual values are below.

summary(model)

## 
## Call:
## lm(formula = mean_consumption ~ -1 + as.factor(day_type) + as.factor(month) + 
##     lag_1 + HDD16 + CDD20 + prod_index + outlier_small + outlier_great, 
##     data = daily_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1983.08  -290.68   -16.72   284.26  2245.00 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## as.factor(day_type)1   5.666e+03  2.597e+02  21.819  < 2e-16 ***
## as.factor(day_type)2   1.196e+04  2.393e+02  49.968  < 2e-16 ***
## as.factor(day_type)3   1.037e+04  2.646e+02  39.204  < 2e-16 ***
## as.factor(day_type)4   1.003e+04  2.701e+02  37.120  < 2e-16 ***
## as.factor(day_type)5   9.990e+03  2.715e+02  36.798  < 2e-16 ***
## as.factor(day_type)6   9.683e+03  2.724e+02  35.544  < 2e-16 ***
## as.factor(day_type)7   8.217e+03  2.702e+02  30.417  < 2e-16 ***
## as.factor(day_type)8   5.050e+03  2.303e+02  21.927  < 2e-16 ***
## as.factor(day_type)9   6.475e+03  2.766e+02  23.413  < 2e-16 ***
## as.factor(day_type)10  5.687e+03  2.918e+02  19.489  < 2e-16 ***
## as.factor(month)2     -1.010e+02  5.777e+01  -1.748   0.0807 .  
## as.factor(month)3     -1.273e+03  6.698e+01 -18.999  < 2e-16 ***
## as.factor(month)4     -1.158e+03  7.733e+01 -14.979  < 2e-16 ***
## as.factor(month)5     -1.148e+03  9.033e+01 -12.708  < 2e-16 ***
## as.factor(month)6     -6.488e+02  9.405e+01  -6.899 7.81e-12 ***
## as.factor(month)7      1.487e+02  1.038e+02   1.432   0.1523    
## as.factor(month)8      4.933e+02  1.051e+02   4.695 2.93e-06 ***
## as.factor(month)9     -5.948e+02  9.360e+01  -6.355 2.79e-10 ***
## as.factor(month)10    -1.457e+03  9.368e+01 -15.555  < 2e-16 ***
## as.factor(month)11    -9.146e+02  7.567e+01 -12.087  < 2e-16 ***
## as.factor(month)12    -8.271e+02  6.784e+01 -12.192  < 2e-16 ***
## lag_1                  5.800e-01  8.865e-03  65.423  < 2e-16 ***
## HDD16                  7.541e+01  6.032e+00  12.501  < 2e-16 ***
## CDD20                  3.334e+02  1.286e+01  25.925  < 2e-16 ***
## prod_index             4.069e+01  1.672e+00  24.339  < 2e-16 ***
## outlier_small         -1.628e+03  5.700e+01 -28.553  < 2e-16 ***
## outlier_great          1.501e+03  5.670e+01  26.467  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 437.4 on 1440 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9998, Adjusted R-squared:  0.9998 
## F-statistic: 3.162e+05 on 27 and 1440 DF,  p-value: < 2.2e-16

daily_data[,fitted := NA]
daily_data$fitted[2:1468] <- fitted(model)
cols <- c("fitted" = "orange", "actual" = "blue")
ggplot() + 
  geom_line(data = daily_data, aes(x = Date, y = fitted,color = "fitted")) +
  geom_line(data = daily_data, aes(x = Date, y = mean_consumption,color = "actual")) +
  labs(title = "Predicted vs. Actual Daily Electricity Consumption (MWh) in  Turkey",
    subtitle = "1 Jan 2017- 7 Jan 2021",
    x = "Time",
    y = "Electricity Consumption Rate (MWh)",
    color = "Color") +
  scale_color_manual(values = cols) +
  scale_x_date(date_breaks = "6 months", date_labels = "%b %Y", date_minor_breaks = "3 month") +
  theme_minimal()

statistics <- function(actual, forecasted){
  n=length(actual)
  error = actual - forecasted
  mean = mean(actual)
  sd = sd(actual)
  bias = sum(error) / sum(actual)
  mad = sum(abs(error)) / n
  wmape = mad / mean
  l = data.frame(n, mean, sd, bias, mad, wmape)
  colnames(l) <- c("N", "Mean", "Standard Deviation", "Bias", "MAD", "WMAPE")
  return(l)
}

daily_data[, base_fitted := fitted(model_base)]

kable(statistics(daily_data$mean_consumption[-1], daily_data$base_fitted[-1]),
      caption = "Statistics for Fitted Daily Mean Consumption According to Base Model", align = 'c')

Statistics for Fitted Daily Mean Consumption According to Base Model
N	Mean	Standard Deviation	Bias	MAD	WMAPE
1467	33189.57	3473.855	7.75e-05	1094.322	0.0329719

kable(statistics(daily_data$mean_consumption[-1], daily_data$fitted[-1]),
      caption = "Statistics for Fitted Daily Mean Consumption According to Final Model", align = 'c')

Statistics for Fitted Daily Mean Consumption According to Final Model
N	Mean	Standard Deviation	Bias	MAD	WMAPE
1467	33189.57	3473.855	0	341.8342	0.0102994

Along with the model, all of the regressors are significant. Even though month 7 and 2 have large p-values, they won’t be disposed, since they represent the monthly seasonality. Another important indicator of a regression model is residual standard error. This final model has the lowest residual standard error.

From the plotted fitted values, the model seems to perform very well in explaining of the available data. Fitted and actual data have a very close pattern. The deviations are very rare. It can be inferred from the statistics that the model has improved by the addition of the new regressors, when compared with the base (first) model. WMAPE (weighted mean absolute percentage error) has decreased significantly.

Let’s see how well the model performed when forecasting the day ahead consumption values:

test_data <- read_excel("test_data.xlsx")
hours <- c(0:23)
test_data$Hour <- rep(hours, times = 15)
test_data <- as.data.table(test_data)
daily_test_data <- test_data[,list(mean_actual=mean(Actual, na.rm = T),
                                    mean_predictions=mean(Predictions, na.rm = T)),by=list(Date)]

cols <- c("predicted" = "orange", "actual" = "blue")
ggplot() + 
  geom_line(data = daily_test_data, aes(x = Date, y = mean_predictions,color = "predicted")) +
  geom_line(data = daily_test_data, aes(x = Date, y = mean_actual,color = "actual")) +
  labs(title = "Predicted vs. Actual Daily Electricity Consumption (MWh) in  Turkey",
    subtitle = "30 Jan 2021 - 13 Feb 2021",
    x = "Time",
    y = "Electricity Consumption Rate (MWh)",
    color = "Color") +
  scale_color_manual(values = cols) +
  scale_x_datetime(date_breaks = "2 days", date_labels = "%d %b", date_minor_breaks = "1 day") +
  theme_minimal()

kable(statistics(daily_test_data$mean_actual, daily_test_data$mean_predictions),
      caption = "Statistics for Predicted Day Ahead Daily Mean Consumption", align = 'c')

Statistics for Predicted Day Ahead Daily Mean Consumption
N	Mean	Standard Deviation	Bias	MAD	WMAPE
15	35099.98	1775.932	-0.0040023	321.0273	0.0091461

As it can be seen from the daily predictions plot, the model performed quite well. There are no large deviations from the actual consumption values. Also WMAPE is quite small, indicating that the predictions are made by small errors.

Let’s see if the transformation of daily mean predictions into hourly levels are also satisfactory. The hourly predictions of the data was collected every day. The test data and the plot of actual and predicted hourly electricity consumption and related statistics are below.

test_data[,Date := as.POSIXct(Date) + Hour * 60 * 60,]
rmarkdown::paged_table(test_data[,-2])

cols <- c("predicted" = "orange", "actual" = "blue")

ggplot() + 
  geom_line(data = test_data, aes(x = Date, y = Predictions, color = "predicted")) +
  geom_line(data = test_data, aes(x = Date, y = Actual, color = "actual")) +
  labs(title = "Predicted vs. Actual Hourly Electricity Consumption (MWh) in  Turkey",
       subtitle = "30 Jan 2021 00:00 - 13 Feb 2021 23:00",
       x = "Time",
       y = "Electricity Consumption Rate (MWh)",
       color = "Color") +
  scale_color_manual(values = cols) +
  scale_x_datetime(date_breaks = "2 days", date_labels = "%d %b", date_minor_breaks = "1 day") +
  theme_minimal()

kable(statistics(test_data$Actual, test_data$Predictions),
      caption = "Statistics for Predicted Day Ahead Hourly Mean Consumption", align = 'c')

Statistics for Predicted Day Ahead Hourly Mean Consumption
N	Mean	Standard Deviation	Bias	MAD	WMAPE
360	35099.98	4214.446	-0.0040023	643.6498	0.0183376

It can be seen that the transformations made the predictions a bit more problematic. However, the predicted values are still good enough with 1.8% WMAPE.

5 Conclusion and Future Work

Even though the final model has a very low WMAPE, further improvements and extensions can be made in the model to have a better approach.

First, when the daily forecasts are transformed back to hourly data, the hourly distribution of the electricity consumption of the one week before is used. Choosing the best strategy to transform the daily data into an hourly one is a topic that is too complicated to discuss in this scope. Nevertheless, the method used in this model is a decent method to forecast the hourly electricity consumption rate in Turkey between 30 January 2021 to 13 February 2021, because luckily there was no religious/national holiday in this interval. If the interval included holidays or eves, this method would be problematic and not give satisfactory results. Because of this, a better back transformation method should be chosen for the future forecasts.

Next, it was realized that a considerable amount of the outliers were from the dates very close to holidays/eves. In Turkey, sometimes national/religious holidays are combined with weekends when there are only 1-2 week days in between. Because of this, those week days also become holidays and electricity consumption drops significantly. In this model, those types of days are not considered. For future work, those days should be detected and their corresponding day types should be fixed.

Moreover, outliers contained so many New Year’s Eve’s (December 31). In Turkey, New Year’s Eve is not a real “eve,” it is not an official holiday. Therefore, New Year’s Eve’s are not marked as an “eve” in the day_type regressor in the final model. We could have been less strict when deciding on whether a day is an eve or not.

Furthermore, the industrial production index data could be included in the model in a different way. The industrial production index data obtained from EVDS is monthly. The data is transformed to daily data by replicating the monthly value for all days of that month. This is rather a primitive method and a much more better method of transformation can be developed.

Last but not least, in this model, the weighted average of the temperature values of the seven different coordinates are calculated for every day. Thus, only the mean temperature value is used when calculating HDD and CDD values. Instead of this, the temperature values of the seven coordinates could have been used separately. This may reduce the residual standard error of the final model.

All things considered, the final multiple linear regression model built to forecast electricity consumption rates in Turkey for one day ahead is an acceptable model that explains the consumption rates in a satisfactory level.

6 Appendicies

The R Markdown of this report is here. (.Rmd file)
The R script used for retrieving data from API and making submissions is here. (.R file)
The data (.csv and .xlsx files) used in these files are here. (.zip file)

7 References

Chen, James. 2019. “Heating Degree Day - HDD.” https://www.investopedia.com/terms/h/heatingdegreeday.asp.

Toros, Hüseyin, Derya Aydın, and Ahmet Faruk Kavak. 2015. “Isınma Ve Soğuma Derece Günlerin Elektrik Tüketimi Üzerindeki Etkisi.” Edited by Ali Deniz, Bahtiyar Efe, Bihter Durna, and Pelin Cansu Çavuş. https://www.researchgate.net/publication/321155122_isinma_ve_soguma_derece_gunlerin_elektrik_tuketimi1.

IE 360 Project Report

Group 12: Tufan Berk Tuğ, Ferhat Turhan, Aral Dörtoğul