1. Introduction

In this homework, the ultimate task is to forecast the sales of UGS for every quarter of 2007 by using the data between 2000 and 2006.

1.1 Required Packages

library(RcppRoll)

## Warning: package 'RcppRoll' was built under R version 4.0.5

library(readxl)

## Warning: package 'readxl' was built under R version 4.0.5

library(lubridate)

## Warning: package 'lubridate' was built under R version 4.0.5

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(zoo)

## Warning: package 'zoo' was built under R version 4.0.5

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.0.5

library(scales)

## Warning: package 'scales' was built under R version 4.0.5

library(data.table)

## Warning: package 'data.table' was built under R version 4.0.5

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year

library(corrplot)

## corrplot 0.92 loaded

library(ggcorrplot)

## Warning: package 'ggcorrplot' was built under R version 4.0.5

library(GGally)

## Warning: package 'GGally' was built under R version 4.0.5

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(forecast)

## Warning: package 'forecast' was built under R version 4.0.5

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.0.5

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:data.table':
## 
##     between, first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

All Metrics:

UGS: Unleaded gasoline sale in a given quarter, RNUV: An index indicating the rate of new unleaded gasoline using vehicles being added to the traffic in a quarter, PU: Average price (adjusted with an index) of a liter of unleaded gasoline in a quarter, PG: Average price (adjusted with an index) of a liter of diesel gasoline in a quarter, NUGV: Number of unleaded gasoline using vehicles in the traffic, NDGV: Number of diesel gasoline using vehicles in the traffic (per 1000 people), GNPA: Agriculture component of Gross National Product (adjusted with an index), GNPC: Commerce component of Gross National Product (adjusted with an index), GNP: Grand total for GNP (agriculture, commerce and other components total).

1.2 Data Acquisition

path <- "C:/Users/asus/Desktop/IE360/HW2/IE360_Spring22_HW2_data.csv"

Data_HW2 <- read.csv(path,colClasses=c('character',rep('numeric',10)))
colnames(Data_HW2) <- c("Quarters", "UGS", "RNUV","NLPG","PU","PG","NUGV","NDGV","GNPA","GNPC","GNPT")

Data_HW2$Quarters <- as.Date(as.yearqtr(Data_HW2$Quarters, format = "%Y_Q%q"))


Data_Given <- Data_HW2[c(1:28),]
Data_Forecast <- Data_HW2[c(29,30,31,32),]

Data_Given <- data.table(Data_Given)
Data_Forecast <- data.table(Data_Forecast)

1.3 Visualization of the Data 2000-2006 and Results

ggplot(Data_Given, aes(x=Quarters,y=UGS, group = 1)) +
geom_point() +
geom_line() +
labs(y ='Quarters')+ 
ggtitle('Unleaded Gasoline Sales (UGS) 2000-2006')

Results: 1) There is a yearly seasonality in the data. 2) There is a strongly decreasing trend. 3) The data is not stationary with respect to mean because of declining trend 4) The data is stationary with respect to variance. With respect to this results, no transformation is unnecessary.

1.4 Autocorrelation Function of UGS

acf(Data_Given$UGS)

In the autocorrelation plot, the values for lag 1 and lag 4 are above the limits. Because the lag 4 can be explained by quarterly seasonality, the only lagged variable which is added in the model is lag 1.

1.5 Correlation Diagram

ggpairs(Data_Given)

According to correlation diagram above, attributes highly correlated with UGS are: 1) NLPG 2) PU 3) PG 4) NUGV 5) NDGV 6) GNPA

2. FORECASTING

2.1 Adding Trend and Seasonality Components to the Model

Data_Given[,trend := 1:.N ]
Data_Given[,Quarters_:=as.character(month(Quarters))]

Data_Forecast[,trend := 29:32 ]
Data_Forecast[,Quarters_:=as.character(month(Quarters))]

In the part of 1.3 Visualization of the Data 2000-2006 and Results, we conclude that there is a yearly seasonality in the data and strongly decreasing trend. To optimize the model, seasonality and trend component is added.

Model_2.1 <- lm(UGS ~ trend+Quarters_, Data_Given)
summary(Model_2.1)

## 
## Call:
## lm(formula = UGS ~ trend + Quarters_, data = Data_Given)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -81167 -31283  -3458  28640  94502 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1060372      23653  44.830  < 2e-16 ***
## trend         -13497       1147 -11.764 3.28e-11 ***
## Quarters_10    95049      26189   3.629 0.001405 ** 
## Quarters_4    121532      25987   4.677 0.000104 ***
## Quarters_7    273619      26063  10.498 3.03e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48570 on 23 degrees of freedom
## Multiple R-squared:  0.9119, Adjusted R-squared:  0.8966 
## F-statistic: 59.53 on 4 and 23 DF,  p-value: 8.446e-12

Since the value of residual standard error is low and the value of R-squared is high, we prove that the new model is much better that the old one just by adding seasonality and trend components.

2.2 Adding the Correlated Variables

In the part of 1.5 Correlation Diagram, it is mentioned that some variables are correlated with UGS. Because of this correlations, the correlated variables should be added in the model.

Model_2.2 <- lm(UGS ~ trend+NLPG+PU+PG+NUGV+NDGV+GNPA + Quarters_, Data_Given)
summary(Model_2.2)

## 
## Call:
## lm(formula = UGS ~ trend + NLPG + PU + PG + NUGV + NDGV + GNPA + 
##     Quarters_, data = Data_Given)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67794 -10461   1987  18461  32208 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.094e+06  8.283e+05   3.735  0.00165 ** 
## trend        1.514e+04  1.284e+04   1.178  0.25486    
## NLPG        -2.937e-01  1.804e-01  -1.628  0.12195    
## PU           2.276e+01  1.017e+03   0.022  0.98241    
## PG          -1.129e+03  1.437e+03  -0.786  0.44295    
## NUGV        -1.023e+00  3.343e-01  -3.061  0.00708 ** 
## NDGV         1.238e+04  3.581e+03   3.456  0.00302 ** 
## GNPA        -3.613e-02  2.764e-02  -1.307  0.20854    
## Quarters_10  1.607e+05  4.385e+04   3.664  0.00192 ** 
## Quarters_4   1.463e+05  2.356e+04   6.212 9.48e-06 ***
## Quarters_7   4.953e+05  1.597e+05   3.102  0.00648 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29700 on 17 degrees of freedom
## Multiple R-squared:  0.9757, Adjusted R-squared:  0.9613 
## F-statistic: 68.12 on 10 and 17 DF,  p-value: 1.055e-11

After adding correlated variables the value of residual standard error becomes lower than part 2.1 and the value of R-squared increase by adding correlated variables. So it is proven that the correlated variable adding is an essential update to get better forecasts. However, the NUGV and NDGV variables are not important variables according to p-values of them.

2.3 Adding the Lagged Variable to the Model

Data_Given$UGSlag1=lag(Data_Given$UGS,1)
Model_2.3 <-  lm(UGS ~ trend+PU+PG+NUGV+NDGV+GNPA +UGSlag1 + Quarters_, Data_Given)
summary(Model_2.3)

## 
## Call:
## lm(formula = UGS ~ trend + PU + PG + NUGV + NDGV + GNPA + UGSlag1 + 
##     Quarters_, data = Data_Given)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -43759 -14074  -4172  14341  37003 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.768e+06  6.313e+05   4.385 0.000462 ***
## trend       -1.136e+04  5.231e+03  -2.172 0.045214 *  
## PU          -1.120e+03  8.383e+02  -1.336 0.200129    
## PG           7.172e+02  1.103e+03   0.650 0.524635    
## NUGV        -7.265e-01  2.127e-01  -3.416 0.003536 ** 
## NDGV         9.541e+03  2.131e+03   4.478 0.000380 ***
## GNPA         1.350e-02  2.925e-02   0.462 0.650522    
## UGSlag1     -5.592e-01  2.347e-01  -2.383 0.029912 *  
## Quarters_10  1.927e+05  4.662e+04   4.134 0.000780 ***
## Quarters_4   7.895e+04  3.639e+04   2.170 0.045422 *  
## Quarters_7   2.357e+05  1.639e+05   1.438 0.169611    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28210 on 16 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9786, Adjusted R-squared:  0.9652 
## F-statistic: 73.17 on 10 and 16 DF,  p-value: 2.015e-11

After adding lagged variable the value of residual standard error becomes lower than part 2.1 and the value of R-squared increase by adding lagged variable. So it is proven that the lagged variable adding is an essential update to get better forecasts.

2.4 Model Check

checkresiduals(Model_2.3$residuals)

## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

Plot_Model2.3=copy(Data_Given)
Plot_Model2.3[,actual:=UGS]
Plot_Model2.3[,predicted_trend:=predict(Model_2.3,Plot_Model2.3)]
Plot_Model2.3[,residual_trend:=actual-predicted_trend]

ggplot(Plot_Model2.3 ,aes(x=Quarters)) +
        geom_line(aes(y=UGS,color='current data')) + 
        geom_line(aes(y=predicted_trend,color='forecast'))

## Warning: Removed 1 row(s) containing missing values (geom_path).

As it seen above, the forecast approximately fits the real values. Therefore, Model2.3 can be used to forecast UGS values of 2007 quarters.

3. Forecasting 2007 UGS Values

1st Quarter - 655894.0 2nd Quarter - 850454.8 3rd Quarter - 961714.9 4th Quarter - 783118.0

Data_Forecast$UGSlag1[1]=Data_Given$UGS[28]


Data_Forecast[1,"UGS"]=as.numeric(predict(Model_2.3,newdata=Data_Forecast[1,]))
Data_Forecast[2,"UGS"]=predict(Model_2.3,newdata=Data_Forecast[2,])
Data_Forecast[3,"UGS"]=predict(Model_2.3,newdata=Data_Forecast[3,])
Data_Forecast[4,"UGS"]=predict(Model_2.3,newdata=Data_Forecast[4,])

Data_Forecast$UGSlag1[2]=as.numeric(Data_Forecast[1,"UGS"])
Data_Forecast$UGSlag1[3]=as.numeric(Data_Forecast[2,"UGS"])
Data_Forecast$UGSlag1[4]=as.numeric(Data_Forecast[3,"UGS"])

Data_Forecast[,"UGS"]

##         UGS
## 1: 655894.0
## 2: 729602.9
## 3: 949666.3
## 4: 833288.8

1st Quarter - 655894.0 2nd Quarter - 850454.8 3rd Quarter - 961714.9 4th Quarter - 783118.0