Overview and Exploratory Analysis

The data consist of demand information for 8 different products, along with related daily statistics such as the product’s price, the number of times the product was clicked, and the number of times it was marked as a favorite or added to a basket.

The aim is to make a 2-day-ahead forecast: today’s data is not yet available, and planning is required for tomorrow. All of our models are therefore built with this horizon in mind.

Sales clearly have a large random component. Even where external factors are related to the value being predicted, their values for the next day are not available either. Therefore, if they are to be included in a model, they must be predicted too.

Sales Data over Time

The most intuitive starting point is to look at the target variable over time. To spot any trends, seasonality, or peaks, we started by plotting the sales count over time, separately for each product. As an example, we use the product with id #32939029, as it has a “predictable” progression.

As seen with this product and the others, there is usually no measurable seasonal effect; instead, there are random fluctuations and peaks. The autocorrelation function supports this, and since the most recent observations are highly correlated, the most useful statistic here is likely the current level of the data. This indicates that a moving average model will be useful, but first, let’s see if we can extract any other knowledge from the external variables.
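The autocorrelation check described above can be sketched in base R as follows. This is a minimal illustration on a synthetic series, not the report's data; the variable names and the 14-lag window are assumptions made for the example.

```r
# Synthetic stand-in for a daily sales series (not the real data):
# a slowly wandering level plus day-to-day noise.
set.seed(1)
level <- 50 + cumsum(rnorm(200, sd = 2))
sold_count <- pmax(round(level + rnorm(200, sd = 5)), 0)

# Sample autocorrelation up to 14 lags, without drawing the plot.
sales_acf <- acf(sold_count, lag.max = 14, plot = FALSE)

# Drop lag 0 (always 1) and label the remaining lags.
acf_values <- as.numeric(sales_acf$acf)[-1]
names(acf_values) <- paste0("lag", 1:14)
print(round(acf_values[1:7], 2))

# A rough 95% band for "no autocorrelation" is +/- 1.96 / sqrt(n).
threshold <- 1.96 / sqrt(length(sold_count))
significant_lags <- which(abs(acf_values) > threshold)
```

Large values at the first few lags with no recurring spike at lag 7 or 14 would be consistent with a level-driven series without weekly seasonality.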

Visit Count

As we see, the number of visits closely tracks the number of items sold each day (when the two are properly scaled). So we may assume that a constant percentage of customers buy the product after viewing it. However, since the next day’s visit data is not available, we build this model with the 2-lagged visit count instead, assuming that this lagged value also has a constant ratio to the sold count, and predict with it.

A similar test checks whether visit/sold is constant over time:

It does appear to stay close to a constant value. We then check whether the same holds for the 2-lagged version of the visit count.

Although it has a higher variance than the same-day visit/sold ratio, this can be helpful in prediction. The mean of the (visit count)/(sold count) observations is 83.37, so we can make a prediction this way:

Next day’s sold count = Previous day’s visit count / 83.37

When applied, this gives a MAPE of 59%. Although the result already looks meaningful, a model that updates the ratio over time could be developed for further precision: the ratio assumption makes sense, but the ratio itself changes over time.
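A minimal sketch of the constant-ratio predictor and its error measures, in base R, might look like the following. The data are synthetic, `lag_by` is a hypothetical helper written for this example, and the mean ratio is computed in-sample on the whole series, as the text above appears to do.

```r
# Hypothetical helper: shift a vector back by k >= 1 days, padding with NA.
lag_by <- function(x, k) c(rep(NA, k), head(x, -k))

# Synthetic data (not the real product): a slowly moving visit count,
# with sales of roughly one per 80 visits.
set.seed(2)
visit_count <- pmax(round(4000 + cumsum(rnorm(120, sd = 100))), 500)
sold_count  <- pmax(round(visit_count / 80 + rnorm(120, sd = 5)), 1)

# Ratio of the 2-day-old visit count to the sold count on each day.
visit_lag2 <- lag_by(visit_count, 2)
ratio      <- visit_lag2 / sold_count
mean_ratio <- mean(ratio, na.rm = TRUE)  # plays the role of 83.37 in the text

# Constant-ratio forecast: lagged visits divided by the mean ratio.
predicted <- visit_lag2 / mean_ratio

# Error measures on the days where a forecast exists.
ok   <- !is.na(predicted)
mae  <- mean(abs(sold_count[ok] - predicted[ok]))
mape <- 100 * mean(abs(sold_count[ok] - predicted[ok]) / sold_count[ok])
```

Note that MAPE is only defined when no actual value is zero; the synthetic sales are floored at 1 here for that reason.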

Other External Parameters

When plotted and correlated, price, favored count, and basket count also show strong relationships with sold count. However, they are more difficult to incorporate into such a simple model, and some may have polynomial rather than linear relations. They may or may not improve the models, so they are included in the ARIMA models as external regressors to check this.
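One way to run such a correlation check is sketched below in base R. The data frame is synthetic, and the column names other than `sold_count` and `visit_count` are assumptions about how the variables might be named. Comparing Pearson with Spearman correlations is a rough way to spot monotone but nonlinear relations.

```r
# Synthetic stand-ins for the daily variables (not the real data).
set.seed(5)
n <- 150
visit_count   <- round(runif(n, 2000, 6000))
basket_count  <- round(visit_count * 0.05 + rnorm(n, sd = 10))
favored_count <- round(visit_count * 0.03 + rnorm(n, sd = 8))
price         <- round(runif(n, 40, 60), 2)
sold_count    <- pmax(round(visit_count / 80 - 0.5 * (price - 50) +
                            rnorm(n, sd = 5)), 0)

features <- data.frame(price, favored_count, basket_count,
                       visit_count, sold_count)

# Linear (Pearson) and rank-based (Spearman) correlation matrices.
pearson  <- cor(features, method = "pearson")
spearman <- cor(features, method = "spearman")

# Correlation of each variable with the target, side by side.
comparison <- cbind(pearson  = pearson[, "sold_count"],
                    spearman = spearman[, "sold_count"])
print(round(comparison, 2))

# Correlation between two candidate regressors: a high value means
# they carry largely the same information.
basket_vs_visit <- pearson["basket_count", "visit_count"]
```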

ARIMA Fit

As expected, the sales numbers are usually best described by ARIMA models with a high number of moving average terms. For this product, the best fit is ARIMA(3,1,4). Its mean absolute error is about 16 (15.97 in the output below), compared to 47 for the previous model based purely on visit count.

## Series: d32939029$sold_count 
## ARIMA(3,1,4) 
## 
## Coefficients:
##          ar1      ar2     ar3      ma1     ma2      ma3     ma4
##       1.7689  -1.5899  0.5967  -2.1900  2.3295  -1.3508  0.2518
## s.e.  0.2215   0.2479  0.1400   0.2302  0.3683   0.2936  0.1312
## 
## sigma^2 estimated as 2182:  log likelihood=-2265.39
## AIC=4546.78   AICc=4547.12   BIC=4579.31
## 
## Training set error measures:
##                     ME     RMSE      MAE MPE MAPE      MASE          ACF1
## Training set 0.2268554 46.27601 15.96943 NaN  Inf 0.9278543 -0.0004444436

Adding the visit count to the model further decreased the mean absolute error to 8.83. Beyond this point, adding the other external data did not improve the model’s performance, probably because the basket and favored counts are correlated with the visit count. Surprisingly, price did not bring any improvement either. Therefore, for the final model we decided to use an ensemble of the ARIMA fit without external regressors and the initial model we developed from the “constant purchases per visit” assumption.
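A 2-day-ahead ARIMA forecast without external regressors can be sketched as follows. This is a hedged illustration, not the report's fitting code: it uses base R's `stats::arima` as a stand-in (the output above appears to come from the forecast package), a synthetic series, and a small ARIMA(1,1,1) order chosen only so the example fits quickly and stably.

```r
# Synthetic stand-in for a daily sales series (not the real data):
# an integrated ARMA(1,1) process around a level of 100.
set.seed(3)
increments <- as.numeric(arima.sim(model = list(ar = 0.5, ma = -0.3), n = 300))
sold_count <- 100 + cumsum(increments)

# Hold out the last two days, so the final one is two steps past the training data.
train <- head(sold_count, -2)

# Fit by maximum likelihood; the order here is illustrative only.
fit <- arima(train, order = c(1, 1, 1), method = "ML")

# Forecast two steps ahead and keep the second step.
fc <- predict(fit, n.ahead = 2)
two_day_ahead <- as.numeric(fc$pred)[2]

# Compare with the held-out value for that day.
actual    <- sold_count[length(sold_count)]
abs_error <- abs(actual - two_day_ahead)
```

Repeating this over a rolling window of origins would give an out-of-sample mean absolute error, which is a fairer comparison than the training-set measures alone.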

Ensemble Predictor

Later on, we also tried to improve on the standard ARIMA fit by using the visit-count-based predictions, taking the average of the two prediction results. However, because the average ratio changes over time, we decided to calculate the current level at each time step by averaging the previous (2-lagged visit count)/(sold count) ratios. This way, the ratio assumption gave better results, comparable to those of the ARIMA fit. Taking the average of the two predictions improved the result for most of the products, since the visit count data is also taken into account.

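library(data.table)  # shift() is assumed to be data.table's; it lags a vector by n rows
# The two most recent (2-lagged visit count)/(sold count) ratios for each day
d32939029$prev_ratio1 <- shift(d32939029$visit_count, 3) / shift(d32939029$sold_count, 1)
d32939029$prev_ratio2 <- shift(d32939029$visit_count, 4) / shift(d32939029$sold_count, 2)
# Current ratio level: the average of those two ratios
d32939029$current_ratio1 <- (d32939029$prev_ratio1 + d32939029$prev_ratio2) / 2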
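The remaining steps are not shown in the report, so the following base R sketch rests on assumptions: that the ratio-based forecast for a day is the 2-lagged visit count divided by the current ratio, and that the ensemble is a plain average of the two forecasts. The data are synthetic, `lag_by` is a hypothetical helper, and a naive 2-lagged sales value stands in for the ARIMA forecast. Whether the one-day-old sold count is actually available at forecast time depends on how the rows are aligned, which the report does not state.

```r
# Hypothetical helper: shift a vector back by k >= 1 days, padding with NA.
lag_by <- function(x, k) c(rep(NA, k), head(x, -k))

# Synthetic data (not the real product).
set.seed(4)
visit_count <- pmax(round(4000 + cumsum(rnorm(60, sd = 100))), 500)
sold_count  <- pmax(round(visit_count / 80 + rnorm(60, sd = 5)), 1)

# Same construction as above: the two most recent
# (2-lagged visit count)/(sold count) ratios, then their average.
prev_ratio1   <- lag_by(visit_count, 3) / lag_by(sold_count, 1)
prev_ratio2   <- lag_by(visit_count, 4) / lag_by(sold_count, 2)
current_ratio <- (prev_ratio1 + prev_ratio2) / 2

# Assumed ratio-based forecast: 2-lagged visits over the current ratio.
ratio_pred <- lag_by(visit_count, 2) / current_ratio

# Stand-in for the ARIMA forecast (NOT an ARIMA model): the 2-lagged sales.
arima_like_pred <- lag_by(sold_count, 2)

# Assumed ensemble: a plain average of the two forecasts.
ensemble_pred <- (ratio_pred + arima_like_pred) / 2

# Mean absolute error of each forecast on the days where all exist.
ok  <- !is.na(ensemble_pred)
mae <- c(ratio    = mean(abs(sold_count[ok] - ratio_pred[ok])),
         stand_in = mean(abs(sold_count[ok] - arima_like_pred[ok])),
         ensemble = mean(abs(sold_count[ok] - ensemble_pred[ok])))
```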