This data-driven work is dedicated to investigating possible underlying reasons of fluctuations in unleaded gasoline sales of a major distributor by generating forecasts of quarterly sales in 2007, based on the historical data recorded from 2000 to 2006. The data used to fit models contains the following fields:
The R Markdown file used to generate this report is provided in the GitHub repository along with the csv file used in the below code.
\(~\)
The mainly used libraries in the code are “forecast”, “ggplot2”, and “tidyverse”, along with several supplementary ones. The time series are first cleaned as “data.frame” objects, then stored in “ts” class to generate particular plots.
# Library imports
library(ggplot2)
library(GGally)
library(forecast)
library(tidyverse)
library(data.table)
library(zoo)
# Setting the working directory
setwd(getwd())
# Importing the data set
df <- fread("Data.csv")
colnames(df)<- c("Quarter", "UGS", "RNUV", "NLPG", "PU", "PG",
"NUGV", "NDGV", "GNPA", "GNPC", "GNP")
df$Quarter <- as.yearqtr(df$Quarter,format="%Y_Q%q")
df <- df %>%
mutate(across(where(is.character),
str_remove_all,
pattern = fixed(" "))) %>%
mutate_if(is.character,as.numeric)
str(df)
## Classes 'data.table' and 'data.frame': 32 obs. of 11 variables:
## $ Quarter: 'yearqtr' num 2000 Q1 2000 Q2 2000 Q3 2000 Q4 ...
## $ UGS : num 1128971 1199569 1370167 1127548 1033918 ...
## $ RNUV : num 0.0146 0.0205 0.0207 0.0163 0.0071 0.0051 0.0041 0.0048 0.0012 0.0032 ...
## $ NLPG : num 940000 941000 943500 948000 950000 ...
## $ PU : num 469 459 440 402 412 ...
## $ PG : num 356 345 327 301 306 ...
## $ NUGV : num 4647500 4742876 4840931 4919685 4954754 ...
## $ NDGV : num 282 284 287 288 288 ...
## $ GNPA : num 1040173 1760460 6974808 3267125 1004528 ...
## $ GNPC : num 3483132 4525451 5915204 4929778 3418387 ...
## $ GNP : num 18022686 21797130 30050207 24480153 15832648 ...
## - attr(*, ".internal.selfref")=<externalptr>
UGStsr <- ts(df[,"UGS"][1:28],
freq = 4,
start = c(2000,1),
end = c(2006,4))
ts.plot(UGStsr,
xlab = "Year",
ylab = "UGS")
\(~\)
From the above plot, it can be seen that UGS time series is not stationary due to its obviously declining trend through years. It poses a yearly seasonality since it draws a predictable pattern in each year: The sales make a peak at the 3rd quarter and decrease again in the following quarter. On the other hand, the variance of the series look somewhat stationary. By so far, it can be claimed that the sales data has a non-stationary mean and stationary variance.
acf(UGStsr)
\(~\)
The autocorrelation plot validates the assumptions made before: The strong correlation in the lag 0.25 implies the existence of trend and the high correlation in the lag 1 confirms the seasonal behavior. It can be noted that slightly higher correlations also exist at the lags 2 and 3, which can be regarded as repercussions of lag 1.
# Inclusion of the trend variable
df[,Trend:=1:.N]
# Encoding the seasonality variable into the data
df$Season <- as.factor(quarter(as.Date(df$Quarter)))
ggpairs(df[,-1])
\(~\)
The first thing can be noticed from the correlation plots is that except RNUV, GNPC, and GNP, all the features have significant correlation with the response variable UGS. Most of the correlated predictors seem like they have seasonal behaviors and it looks like NUGV potentially has a nonlinear relationship with UGS.
# Building the first model with trend and seasonality
model_01 <- lm(UGS~Trend+Season,
data=df)
summary(model_01)
##
## Call:
## lm(formula = UGS ~ Trend + Season, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -81167 -31283 -3458 28640 94502
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1060372 23653 44.830 < 2e-16 ***
## Trend -13497 1147 -11.764 3.28e-11 ***
## Season2 121532 25987 4.677 0.000104 ***
## Season3 273619 26063 10.498 3.03e-10 ***
## Season4 95049 26189 3.629 0.001405 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48570 on 23 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.9119, Adjusted R-squared: 0.8966
## F-statistic: 59.53 on 4 and 23 DF, p-value: 8.446e-12
checkresiduals(model_01)
##
## Breusch-Godfrey test for serial correlation of order up to 8
##
## data: Residuals
## LM test = 11.462, df = 8, p-value = 0.1769
\(~\)
The first trial with pure trend and seasonality gives a fairly good model, which has almost 90% adjusted R-squared value, yet the current state of residuals imply that it can possibly be enhanced by the elimination of correlations of the lags 1 and 2. Probably, introduction of the other variables will also have positive effects on the model. From this point on, a method similar to the “Forward Selection Method” in the machine learning domain will be used, that is, the significantly correlated features will be added to the model one by one. The change in the R-squared values will be checked at each step.
# Adding NUGV
model_02 <- lm(UGS~NUGV+Trend+Season,
data=df)
# Adding NLPG
model_03 <- lm(UGS~NLPG+NUGV+Trend+Season,
data=df)
# Adding GNPA
model_04 <- lm(UGS~GNPA+NLPG+NUGV+Trend+Season,
data=df)
# Adding PG
model_05 <- lm(UGS~PG+GNPA+NLPG+NUGV+Trend+Season,
data=df)
# Adding NDGV
model_06 <- lm(UGS~NDGV+PG+GNPA+NLPG+NUGV+Trend+Season,
data=df)
# Adding PU
model_07 <- lm(UGS~PU+NDGV+PG+GNPA+NLPG+NUGV+Trend+Season,
data=df)
Adjusted R-squared Values:
After including all the significantly correlated variables, it can be seen that the seventh model performed worse than the previous one, implying that dropping one of the features may improve the performance. To select the one to be dropped, individual p-values can be checked, and the least statistically significant predictor can be dropped.
summary(model_07)
##
## Call:
## lm(formula = UGS ~ PU + NDGV + PG + GNPA + NLPG + NUGV + Trend +
## Season, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67794 -10461 1987 18461 32208
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.094e+06 8.283e+05 3.735 0.00165 **
## PU 2.276e+01 1.017e+03 0.022 0.98241
## NDGV 1.238e+04 3.581e+03 3.456 0.00302 **
## PG -1.129e+03 1.437e+03 -0.786 0.44295
## GNPA -3.613e-02 2.764e-02 -1.307 0.20854
## NLPG -2.937e-01 1.804e-01 -1.628 0.12195
## NUGV -1.023e+00 3.343e-01 -3.061 0.00708 **
## Trend 1.514e+04 1.284e+04 1.178 0.25486
## Season2 1.463e+05 2.356e+04 6.212 9.48e-06 ***
## Season3 4.953e+05 1.597e+05 3.102 0.00648 **
## Season4 1.607e+05 4.385e+04 3.664 0.00192 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29700 on 17 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.9757, Adjusted R-squared: 0.9613
## F-statistic: 68.12 on 10 and 17 DF, p-value: 1.055e-11
\(~\)
The summary table suggests that PU is the variable with highest p-value. Thus, from this point on, the study will be continued based on the sixth model which includes all significantly correlated variables but PU.
The next step is checking the current residuals to investigate the possible improvements by lagged variable additions.
checkresiduals(model_06)
##
## Breusch-Godfrey test for serial correlation of order up to 13
##
## data: Residuals
## LM test = 25.683, df = 13, p-value = 0.01875
\(~\)
The ACF plot suggests that addition of 1 lagged response variable may have positive effect. This raises the following claim: The sales of the previous quarter has a remarkable effect on the considered quarter.
# Adding 1 lagged UGS field
df$UGS_1_lagged <- lag(df$UGS, 1)
model_08 <- lm(UGS~UGS_1_lagged+NDGV+PG+GNPA+NLPG+NUGV+Trend+Season,
data=df)
\(~\)
The 8th model gave the best performance so far in terms of adjusted R-squared value, which is 0.9672047, which confirms that previous quarter’s sales is an important predictor.
In the previous steps, it is claimed that a nonlinear relationship may be present between NUGV and UGS. Keeping that in mind, the final step of the model building in this study is to check the effect of adding the reciprocal of the NUGV column to the model.
df$NUGV_reciprocal <- 1/df$NUGV
model_09 <- lm(UGS~NUGV_reciprocal+UGS_1_lagged+NDGV+PG+GNPA+NLPG+NUGV+Trend+Season,
data=df)
summary(model_09)
##
## Call:
## lm(formula = UGS ~ NUGV_reciprocal + UGS_1_lagged + NDGV + PG +
## GNPA + NLPG + NUGV + Trend + Season, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47554 -11498 4387 14550 31631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.729e+07 1.277e+07 1.354 0.195918
## NUGV_reciprocal -3.388e+13 3.222e+13 -1.052 0.309581
## UGS_1_lagged -4.167e-01 2.297e-01 -1.814 0.089737 .
## NDGV 2.191e+04 7.537e+03 2.907 0.010836 *
## PG -1.286e+03 3.986e+02 -3.226 0.005657 **
## GNPA -2.883e-02 3.645e-02 -0.791 0.441425
## NLPG -4.937e-01 2.619e-01 -1.885 0.078971 .
## NUGV -2.926e+00 1.633e+00 -1.792 0.093326 .
## Trend 2.246e+04 1.639e+04 1.370 0.190793
## Season2 1.002e+05 3.745e+04 2.677 0.017247 *
## Season3 4.661e+05 2.036e+05 2.290 0.036954 *
## Season4 2.344e+05 5.314e+04 4.410 0.000506 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27310 on 15 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.9812, Adjusted R-squared: 0.9674
## F-statistic: 71.19 on 11 and 15 DF, p-value: 6.872e-11
\(~\)
Even the improvement in the adjusted R-squared is slight, the ninth model can be considered as a sufficiently good model, and it can be picked to use for forecasts.
# Setting the ninth model as the final model
finalModel <- model_09
# Making predictions for 2007
for (i in 29: 32){
df$Predictions <- predict(finalModel, df)
df$UGS[i] <- df$Predictions[i]
if (i < 32){
df$UGS_1_lagged[i+1] <- df$Predictions[i]
}
}
df[29:32, c("Quarter", "Predictions")]
## Quarter Predictions
## 1: 2007 Q1 672777.4
## 2: 2007 Q2 843554.7
## 3: 2007 Q3 967914.0
## 4: 2007 Q4 800034.0
\(~\)
ggplot(df, aes(x=Quarter)) +
geom_line(aes(y=UGS,
color='real',
group = 1)) +
geom_line(aes(y=Predictions,
color = 'predictions',
group = 1) )
\(~\)
The above plot shows that the predictions are consistent when the pattern of the time series is considered.
# Residual analysis
df$Residuals <- df$UGS - df$Predictions
ggplot(df[1:28]) +
geom_point(aes(x = Predictions,
y = Residuals)) +
geom_abline(slope = 0,
color="red") +
labs(x = "Predicted UGS Values",
y = "Residuals")
# Actual versus predicted UGS
ggplot(df[1:28]) +
geom_point(aes(x = Predictions,
y = UGS)) +
geom_abline(slope = 1,
color = "red") +
labs(x = "Predicted Sales",
y = "Actual Sales")
\(~\)
The final state of the residuals seems satisfactory in terms of their distributions above and below zero. They have a mean of 3.4450314^{-9}, which is fairly close to zero. Moreover, the distribution of the deviations from the actual values pose no problem, as can be seen in the right hand plot.
After checking the final model, it can be concluded that unleaded gasoline sales in the given period is a simple time series and good results can be obtained with a quick study, keeping in the mind that variance of residuals and the minor autocorrelations for the remaining lags can still be improved with a more elaborate work.
\(~\)