1 Introduction

This data-driven work is dedicated to investigating possible underlying reasons of fluctuations in unleaded gasoline sales of a major distributor by generating forecasts of quarterly sales in 2007, based on the historical data recorded from 2000 to 2006. The data used to fit models contains the following fields:

  • UGS : Unleaded gasoline sale in a given quarter
  • RNUV : An index indicating the rate of new unleaded gasoline using vehicles being added to the traffic in a quarter
  • PU : Average price (adjusted with an index) of a liter of unleaded gasoline in a quarter
  • PG : Average price (adjusted with an index) of a liter of diesel gasoline in a quarter
  • NUGV : Number of unleaded gasoline using vehicles in the traffic
  • NDGV : Number of diesel gasoline using vehicles in the traffic (per 1000 people)
  • NLPG : Number of LPG using vehicles in the traffic
  • GNPA : Agriculture component of Gross National Product (adjusted with an index)
  • GNPC : Commerce component of Gross National Product (adjusted with an index)
  • GNP : Grand total for GNP (agriculture, commerce and other components total)

The R Markdown file used to generate this report is provided in the GitHub repository along with the csv file used in the below code.

\(~\)

2 Library Imports and Data Preprocessing

The mainly used libraries in the code are “forecast”, “ggplot2”, and “tidyverse”, along with several supplementary ones. The time series are first cleaned as “data.frame” objects, then stored in “ts” class to generate particular plots.

# Library imports

library(ggplot2)
library(GGally)
library(forecast)
library(tidyverse)
library(data.table)
library(zoo)

# Setting the working directory

setwd(getwd())

# Importing the data set

df          <- fread("Data.csv")

colnames(df)<- c("Quarter", "UGS", "RNUV", "NLPG", "PU", "PG",
                 "NUGV", "NDGV", "GNPA", "GNPC", "GNP")

df$Quarter  <- as.yearqtr(df$Quarter,format="%Y_Q%q")

df          <- df    %>%
    
    mutate(across(where(is.character),
                  str_remove_all,
                  pattern = fixed(" ")))    %>%
    
    mutate_if(is.character,as.numeric)
    
str(df)
## Classes 'data.table' and 'data.frame':   32 obs. of  11 variables:
##  $ Quarter: 'yearqtr' num  2000 Q1 2000 Q2 2000 Q3 2000 Q4 ...
##  $ UGS    : num  1128971 1199569 1370167 1127548 1033918 ...
##  $ RNUV   : num  0.0146 0.0205 0.0207 0.0163 0.0071 0.0051 0.0041 0.0048 0.0012 0.0032 ...
##  $ NLPG   : num  940000 941000 943500 948000 950000 ...
##  $ PU     : num  469 459 440 402 412 ...
##  $ PG     : num  356 345 327 301 306 ...
##  $ NUGV   : num  4647500 4742876 4840931 4919685 4954754 ...
##  $ NDGV   : num  282 284 287 288 288 ...
##  $ GNPA   : num  1040173 1760460 6974808 3267125 1004528 ...
##  $ GNPC   : num  3483132 4525451 5915204 4929778 3418387 ...
##  $ GNP    : num  18022686 21797130 30050207 24480153 15832648 ...
##  - attr(*, ".internal.selfref")=<externalptr>

3 Model Building

UGStsr <- ts(df[,"UGS"][1:28],
             freq = 4,
             start = c(2000,1),
             end = c(2006,4))

ts.plot(UGStsr,
        xlab = "Year",
        ylab = "UGS")
Figure 1. Unleaded gasoline sales through years (in 1000 cubic meters)

Figure 1. Unleaded gasoline sales through years (in 1000 cubic meters)

\(~\)

From the above plot, it can be seen that UGS time series is not stationary due to its obviously declining trend through years. It poses a yearly seasonality since it draws a predictable pattern in each year: The sales make a peak at the 3rd quarter and decrease again in the following quarter. On the other hand, the variance of the series look somewhat stationary. By so far, it can be claimed that the sales data has a non-stationary mean and stationary variance.

acf(UGStsr)
Figure 2. Autocorrelation Function of Quarterly Unleaded Gasoline Sales (Lags in Years)

Figure 2. Autocorrelation Function of Quarterly Unleaded Gasoline Sales (Lags in Years)

\(~\)

The autocorrelation plot validates the assumptions made before: The strong correlation in the lag 0.25 implies the existence of trend and the high correlation in the lag 1 confirms the seasonal behavior. It can be noted that slightly higher correlations also exist at the lags 2 and 3, which can be regarded as repercussions of lag 1.

# Inclusion of the trend variable

df[,Trend:=1:.N]

# Encoding the seasonality variable into the data

df$Season <- as.factor(quarter(as.Date(df$Quarter)))

ggpairs(df[,-1])
Figure 3. Pairwise Correlation Plots of the Data

Figure 3. Pairwise Correlation Plots of the Data

\(~\)

The first thing can be noticed from the correlation plots is that except RNUV, GNPC, and GNP, all the features have significant correlation with the response variable UGS. Most of the correlated predictors seem like they have seasonal behaviors and it looks like NUGV potentially has a nonlinear relationship with UGS.

# Building the first model with trend and seasonality

model_01 <- lm(UGS~Trend+Season,
              data=df)

summary(model_01)
## 
## Call:
## lm(formula = UGS ~ Trend + Season, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -81167 -31283  -3458  28640  94502 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1060372      23653  44.830  < 2e-16 ***
## Trend         -13497       1147 -11.764 3.28e-11 ***
## Season2       121532      25987   4.677 0.000104 ***
## Season3       273619      26063  10.498 3.03e-10 ***
## Season4        95049      26189   3.629 0.001405 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48570 on 23 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9119, Adjusted R-squared:  0.8966 
## F-statistic: 59.53 on 4 and 23 DF,  p-value: 8.446e-12
checkresiduals(model_01)
## 
##  Breusch-Godfrey test for serial correlation of order up to 8
## 
## data:  Residuals
## LM test = 11.462, df = 8, p-value = 0.1769
Figure 4. Residual Analysis for Model 01

Figure 4. Residual Analysis for Model 01

\(~\)

The first trial with pure trend and seasonality gives a fairly good model, which has almost 90% adjusted R-squared value, yet the current state of residuals imply that it can possibly be enhanced by the elimination of correlations of the lags 1 and 2. Probably, introduction of the other variables will also have positive effects on the model. From this point on, a method similar to the “Forward Selection Method” in the machine learning domain will be used, that is, the significantly correlated features will be added to the model one by one. The change in the R-squared values will be checked at each step.

# Adding NUGV

model_02 <- lm(UGS~NUGV+Trend+Season,
              data=df)

# Adding NLPG

model_03 <- lm(UGS~NLPG+NUGV+Trend+Season,
              data=df)

# Adding GNPA

model_04 <- lm(UGS~GNPA+NLPG+NUGV+Trend+Season,
              data=df)

# Adding PG

model_05 <- lm(UGS~PG+GNPA+NLPG+NUGV+Trend+Season,
              data=df)

# Adding NDGV

model_06 <- lm(UGS~NDGV+PG+GNPA+NLPG+NUGV+Trend+Season,
              data=df)

# Adding PU

model_07 <- lm(UGS~PU+NDGV+PG+GNPA+NLPG+NUGV+Trend+Season,
              data=df)

Adjusted R-squared Values:

  • Model 2: 0.8932869 (NUGV added)
  • Model 3: 0.9288838 (NLPG added)
  • Model 4: 0.9305015 (GNPA added)
  • Model 5: 0.9288137 (PG added)
  • Model 6: 0.9634755 (NDVG added)
  • Model 7: 0.9613281 (PU added)

After including all the significantly correlated variables, it can be seen that the seventh model performed worse than the previous one, implying that dropping one of the features may improve the performance. To select the one to be dropped, individual p-values can be checked, and the least statistically significant predictor can be dropped.

summary(model_07)
## 
## Call:
## lm(formula = UGS ~ PU + NDGV + PG + GNPA + NLPG + NUGV + Trend + 
##     Season, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67794 -10461   1987  18461  32208 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.094e+06  8.283e+05   3.735  0.00165 ** 
## PU           2.276e+01  1.017e+03   0.022  0.98241    
## NDGV         1.238e+04  3.581e+03   3.456  0.00302 ** 
## PG          -1.129e+03  1.437e+03  -0.786  0.44295    
## GNPA        -3.613e-02  2.764e-02  -1.307  0.20854    
## NLPG        -2.937e-01  1.804e-01  -1.628  0.12195    
## NUGV        -1.023e+00  3.343e-01  -3.061  0.00708 ** 
## Trend        1.514e+04  1.284e+04   1.178  0.25486    
## Season2      1.463e+05  2.356e+04   6.212 9.48e-06 ***
## Season3      4.953e+05  1.597e+05   3.102  0.00648 ** 
## Season4      1.607e+05  4.385e+04   3.664  0.00192 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29700 on 17 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9757, Adjusted R-squared:  0.9613 
## F-statistic: 68.12 on 10 and 17 DF,  p-value: 1.055e-11

\(~\)

The summary table suggests that PU is the variable with highest p-value. Thus, from this point on, the study will be continued based on the sixth model which includes all significantly correlated variables but PU.

The next step is checking the current residuals to investigate the possible improvements by lagged variable additions.

checkresiduals(model_06)
## 
##  Breusch-Godfrey test for serial correlation of order up to 13
## 
## data:  Residuals
## LM test = 25.683, df = 13, p-value = 0.01875
Figure 4. Residual Analysis for Model 06

Figure 4. Residual Analysis for Model 06

\(~\)

The ACF plot suggests that addition of 1 lagged response variable may have positive effect. This raises the following claim: The sales of the previous quarter has a remarkable effect on the considered quarter.

# Adding 1 lagged UGS field

df$UGS_1_lagged     <- lag(df$UGS, 1)

model_08            <- lm(UGS~UGS_1_lagged+NDGV+PG+GNPA+NLPG+NUGV+Trend+Season,
                          data=df)

\(~\)

The 8th model gave the best performance so far in terms of adjusted R-squared value, which is 0.9672047, which confirms that previous quarter’s sales is an important predictor.

In the previous steps, it is claimed that a nonlinear relationship may be present between NUGV and UGS. Keeping that in mind, the final step of the model building in this study is to check the effect of adding the reciprocal of the NUGV column to the model.

df$NUGV_reciprocal  <- 1/df$NUGV

model_09            <- lm(UGS~NUGV_reciprocal+UGS_1_lagged+NDGV+PG+GNPA+NLPG+NUGV+Trend+Season,
                          data=df)

summary(model_09)
## 
## Call:
## lm(formula = UGS ~ NUGV_reciprocal + UGS_1_lagged + NDGV + PG + 
##     GNPA + NLPG + NUGV + Trend + Season, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -47554 -11498   4387  14550  31631 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.729e+07  1.277e+07   1.354 0.195918    
## NUGV_reciprocal -3.388e+13  3.222e+13  -1.052 0.309581    
## UGS_1_lagged    -4.167e-01  2.297e-01  -1.814 0.089737 .  
## NDGV             2.191e+04  7.537e+03   2.907 0.010836 *  
## PG              -1.286e+03  3.986e+02  -3.226 0.005657 ** 
## GNPA            -2.883e-02  3.645e-02  -0.791 0.441425    
## NLPG            -4.937e-01  2.619e-01  -1.885 0.078971 .  
## NUGV            -2.926e+00  1.633e+00  -1.792 0.093326 .  
## Trend            2.246e+04  1.639e+04   1.370 0.190793    
## Season2          1.002e+05  3.745e+04   2.677 0.017247 *  
## Season3          4.661e+05  2.036e+05   2.290 0.036954 *  
## Season4          2.344e+05  5.314e+04   4.410 0.000506 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27310 on 15 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.9812, Adjusted R-squared:  0.9674 
## F-statistic: 71.19 on 11 and 15 DF,  p-value: 6.872e-11

\(~\)

Even the improvement in the adjusted R-squared is slight, the ninth model can be considered as a sufficiently good model, and it can be picked to use for forecasts.

# Setting the ninth model as the final model

finalModel <- model_09

4 Forecasting

# Making predictions for 2007

for (i in 29: 32){
    
    df$Predictions  <- predict(finalModel, df)
    df$UGS[i]       <- df$Predictions[i]
    
    if (i < 32){
        df$UGS_1_lagged[i+1] <- df$Predictions[i]
    }
}

df[29:32, c("Quarter", "Predictions")]
##    Quarter Predictions
## 1: 2007 Q1    672777.4
## 2: 2007 Q2    843554.7
## 3: 2007 Q3    967914.0
## 4: 2007 Q4    800034.0

\(~\)

ggplot(df, aes(x=Quarter))      +
    
    geom_line(aes(y=UGS,
                  color='real',
                  group = 1))   +
    
    geom_line(aes(y=Predictions,
                  color = 'predictions',
                  group = 1) ) 
Figure 5. Time Series of UGS with Predictions (in 1000 cubic meters)

Figure 5. Time Series of UGS with Predictions (in 1000 cubic meters)

\(~\)

The above plot shows that the predictions are consistent when the pattern of the time series is considered.

# Residual analysis

df$Residuals <- df$UGS - df$Predictions

ggplot(df[1:28])                        +
    
    geom_point(aes(x = Predictions,
                   y = Residuals))      +
    
    geom_abline(slope = 0,
                color="red")            +
    
    labs(x = "Predicted UGS Values",
         y = "Residuals")

# Actual versus predicted UGS
 
ggplot(df[1:28])                    +
    
    geom_point(aes(x = Predictions,
                   y = UGS))        +
    
    geom_abline(slope = 1,
                color = "red")      +
    
    labs(x = "Predicted Sales",
         y = "Actual Sales")
Figure 6. Left: Residual Plot for the Final Model, Right: Actual vs. Predicted UGS ValuesFigure 6. Left: Residual Plot for the Final Model, Right: Actual vs. Predicted UGS Values

Figure 6. Left: Residual Plot for the Final Model, Right: Actual vs. Predicted UGS Values

\(~\)

The final state of the residuals seems satisfactory in terms of their distributions above and below zero. They have a mean of 3.4450314^{-9}, which is fairly close to zero. Moreover, the distribution of the deviations from the actual values pose no problem, as can be seen in the right hand plot.

5 Conclusion

After checking the final model, it can be concluded that unleaded gasoline sales in the given period is a simple time series and good results can be obtained with a quick study, keeping in the mind that variance of residuals and the minor autocorrelations for the remaining lags can still be improved with a more elaborate work.

\(~\)