In this homework, the ultimate task is to forecast the sales of UGS for every quarter of 2007 by using the data between 2000 and 2006.
library(RcppRoll)
## Warning: package 'RcppRoll' was built under R version 4.0.5
library(readxl)
## Warning: package 'readxl' was built under R version 4.0.5
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.0.5
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(zoo)
## Warning: package 'zoo' was built under R version 4.0.5
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
library(scales)
## Warning: package 'scales' was built under R version 4.0.5
library(data.table)
## Warning: package 'data.table' was built under R version 4.0.5
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
library(corrplot)
## corrplot 0.92 loaded
library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.0.5
library(GGally)
## Warning: package 'GGally' was built under R version 4.0.5
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(forecast)
## Warning: package 'forecast' was built under R version 4.0.5
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
UGS: Unleaded gasoline sale in a given quarter, RNUV: An index indicating the rate of new unleaded gasoline using vehicles being added to the traffic in a quarter, PU: Average price (adjusted with an index) of a liter of unleaded gasoline in a quarter, PG: Average price (adjusted with an index) of a liter of diesel gasoline in a quarter, NUGV: Number of unleaded gasoline using vehicles in the traffic, NDGV: Number of diesel gasoline using vehicles in the traffic (per 1000 people), GNPA: Agriculture component of Gross National Product (adjusted with an index), GNPC: Commerce component of Gross National Product (adjusted with an index), GNP: Grand total for GNP (agriculture, commerce and other components total).
path <- "C:/Users/asus/Desktop/IE360/HW2/IE360_Spring22_HW2_data.csv"
Data_HW2 <- read.csv(path,colClasses=c('character',rep('numeric',10)))
colnames(Data_HW2) <- c("Quarters", "UGS", "RNUV","NLPG","PU","PG","NUGV","NDGV","GNPA","GNPC","GNPT")
Data_HW2$Quarters <- as.Date(as.yearqtr(Data_HW2$Quarters, format = "%Y_Q%q"))
Data_Given <- Data_HW2[c(1:28),]
Data_Forecast <- Data_HW2[c(29,30,31,32),]
Data_Given <- data.table(Data_Given)
Data_Forecast <- data.table(Data_Forecast)
ggplot(Data_Given, aes(x=Quarters,y=UGS, group = 1)) +
geom_point() +
geom_line() +
labs(y ='Quarters')+
ggtitle('Unleaded Gasoline Sales (UGS) 2000-2006')
Results: 1) There is a yearly seasonality in the data. 2) There is a
strongly decreasing trend. 3) The data is not stationary with respect to
mean because of declining trend 4) The data is stationary with respect
to variance. With respect to this results, no transformation is
unnecessary.
acf(Data_Given$UGS)
In the autocorrelation plot, the values for lag 1 and lag 4 are above the limits. Because the lag 4 can be explained by quarterly seasonality, the only lagged variable which is added in the model is lag 1.
ggpairs(Data_Given)
According to correlation diagram above, attributes highly correlated with UGS are: 1) NLPG 2) PU 3) PG 4) NUGV 5) NDGV 6) GNPA
Data_Given[,trend := 1:.N ]
Data_Given[,Quarters_:=as.character(month(Quarters))]
Data_Forecast[,trend := 29:32 ]
Data_Forecast[,Quarters_:=as.character(month(Quarters))]
In the part of 1.3 Visualization of the Data 2000-2006 and Results, we conclude that there is a yearly seasonality in the data and strongly decreasing trend. To optimize the model, seasonality and trend component is added.
Model_2.1 <- lm(UGS ~ trend+Quarters_, Data_Given)
summary(Model_2.1)
##
## Call:
## lm(formula = UGS ~ trend + Quarters_, data = Data_Given)
##
## Residuals:
## Min 1Q Median 3Q Max
## -81167 -31283 -3458 28640 94502
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1060372 23653 44.830 < 2e-16 ***
## trend -13497 1147 -11.764 3.28e-11 ***
## Quarters_10 95049 26189 3.629 0.001405 **
## Quarters_4 121532 25987 4.677 0.000104 ***
## Quarters_7 273619 26063 10.498 3.03e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48570 on 23 degrees of freedom
## Multiple R-squared: 0.9119, Adjusted R-squared: 0.8966
## F-statistic: 59.53 on 4 and 23 DF, p-value: 8.446e-12
Since the value of residual standard error is low and the value of R-squared is high, we prove that the new model is much better that the old one just by adding seasonality and trend components.
Data_Given$UGSlag1=lag(Data_Given$UGS,1)
Model_2.3 <- lm(UGS ~ trend+PU+PG+NUGV+NDGV+GNPA +UGSlag1 + Quarters_, Data_Given)
summary(Model_2.3)
##
## Call:
## lm(formula = UGS ~ trend + PU + PG + NUGV + NDGV + GNPA + UGSlag1 +
## Quarters_, data = Data_Given)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43759 -14074 -4172 14341 37003
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.768e+06 6.313e+05 4.385 0.000462 ***
## trend -1.136e+04 5.231e+03 -2.172 0.045214 *
## PU -1.120e+03 8.383e+02 -1.336 0.200129
## PG 7.172e+02 1.103e+03 0.650 0.524635
## NUGV -7.265e-01 2.127e-01 -3.416 0.003536 **
## NDGV 9.541e+03 2.131e+03 4.478 0.000380 ***
## GNPA 1.350e-02 2.925e-02 0.462 0.650522
## UGSlag1 -5.592e-01 2.347e-01 -2.383 0.029912 *
## Quarters_10 1.927e+05 4.662e+04 4.134 0.000780 ***
## Quarters_4 7.895e+04 3.639e+04 2.170 0.045422 *
## Quarters_7 2.357e+05 1.639e+05 1.438 0.169611
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28210 on 16 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.9786, Adjusted R-squared: 0.9652
## F-statistic: 73.17 on 10 and 16 DF, p-value: 2.015e-11
After adding lagged variable the value of residual standard error becomes lower than part 2.1 and the value of R-squared increase by adding lagged variable. So it is proven that the lagged variable adding is an essential update to get better forecasts.
checkresiduals(Model_2.3$residuals)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.
Plot_Model2.3=copy(Data_Given)
Plot_Model2.3[,actual:=UGS]
Plot_Model2.3[,predicted_trend:=predict(Model_2.3,Plot_Model2.3)]
Plot_Model2.3[,residual_trend:=actual-predicted_trend]
ggplot(Plot_Model2.3 ,aes(x=Quarters)) +
geom_line(aes(y=UGS,color='current data')) +
geom_line(aes(y=predicted_trend,color='forecast'))
## Warning: Removed 1 row(s) containing missing values (geom_path).
As it seen above, the forecast approximately fits the real values. Therefore, Model2.3 can be used to forecast UGS values of 2007 quarters.
1st Quarter - 655894.0 2nd Quarter - 850454.8 3rd Quarter - 961714.9 4th Quarter - 783118.0
Data_Forecast$UGSlag1[1]=Data_Given$UGS[28]
Data_Forecast[1,"UGS"]=as.numeric(predict(Model_2.3,newdata=Data_Forecast[1,]))
Data_Forecast[2,"UGS"]=predict(Model_2.3,newdata=Data_Forecast[2,])
Data_Forecast[3,"UGS"]=predict(Model_2.3,newdata=Data_Forecast[3,])
Data_Forecast[4,"UGS"]=predict(Model_2.3,newdata=Data_Forecast[4,])
Data_Forecast$UGSlag1[2]=as.numeric(Data_Forecast[1,"UGS"])
Data_Forecast$UGSlag1[3]=as.numeric(Data_Forecast[2,"UGS"])
Data_Forecast$UGSlag1[4]=as.numeric(Data_Forecast[3,"UGS"])
Data_Forecast[,"UGS"]
## UGS
## 1: 655894.0
## 2: 729602.9
## 3: 949666.3
## 4: 833288.8
1st Quarter - 655894.0 2nd Quarter - 850454.8 3rd Quarter - 961714.9 4th Quarter - 783118.0