The data consist of daily demand information for 8 different products, along with related statistics such as the product's price on that day, the number of times the product was clicked, and the number of times it was marked as a favorite or added to the basket.
The aim is to make a 2-day-ahead forecast: today's data is not available yet, and planning is required for tomorrow. All of our models are therefore built with this horizon in mind.
Clearly, the sales have a considerable random component, and even where the external factors are related to the value to be predicted, the next day's values of those factors are not available either. Therefore, if they are to be included in the model, they must be predicted too.
The most intuitive starting point for the analysis is to examine the target values over time. To spot any trends, seasonalities, or peaks, we began by plotting the sales count over time, separately for each product. As an example, we will use the product with id 32939029, as it has a relatively “predictable” progression.
As with this product and the others, there is usually no measurable seasonal effect; instead there are random fluctuations and peaks. The autocorrelation function supports this idea: the closest previous observations are highly correlated, so the most informative statistic here is the current level of the series. This suggests that a moving-average model will be useful, but first let us see whether any other knowledge can be extracted from the external variables.
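This first look can be sketched as follows. We use a toy autocorrelated series here as a stand-in; in the actual analysis the column is d32939029$sold_count.

```r
# Sketch: eyeball the series, then inspect its autocorrelation.
# 'sold' is a toy AR(1)-style series standing in for the real sales column.
set.seed(42)
sold <- as.numeric(round(100 + 20 * arima.sim(list(ar = 0.7), n = 200)))

plot(sold, type = "l", xlab = "day", ylab = "sold count")  # look for trend / seasonality / peaks
acf(sold, lag.max = 30)  # nearby lags dominate; no seasonal spikes
```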
As we see, the number of visitors coincides closely with the number of items sold each day (when the values are properly scaled), so we may assume that a roughly constant percentage of customers buy the product after viewing it. However, since the next day's view data is not available, we build this model with a lagged version of the visit counts, assuming that the 2-lagged visit/sold ratio is also constant and predicting with it.
A similar test is to check whether the visit/sold ratio is constant over time:
It does appear to follow a constant value. We then check whether the same holds for the 2-lagged version of the visit count.
Although it has a higher variance than the same-day visit/sold data, this can still be helpful for prediction. The mean of the visit count / sold count observations is 83.37, so we can make a prediction as follows:
next day's sold count = previous day's visit count / 83.37
When applied, this gives the following results, with 59% MAPE. Although this is already meaningful, a model that updates the ratio over time could improve precision further: the constant-ratio assumption is reasonable, but the ratio does drift over time.
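The constant-ratio baseline and its MAPE can be sketched like this. The toy visit/sold series below is our own stand-in for the real product data; only the 83.37 ratio and the 2-day lag come from the analysis above.

```r
library(data.table)  # for shift()

# Toy stand-in for product 32939029: visits drive sales at a roughly constant ratio.
set.seed(1)
visit_count <- rpois(200, 8000)
sold_count  <- pmax(1, round(visit_count / 83.37 + rnorm(200, 0, 10)))

# 2-day-ahead baseline: predicted sold(t) = visit(t - 2) / mean ratio
ratio <- mean(visit_count / sold_count)
pred  <- shift(visit_count, 2) / ratio

# MAPE of the baseline (first two days have no lagged visits, hence na.rm)
mape <- mean(abs(sold_count - pred) / sold_count, na.rm = TRUE) * 100
```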
When plotted and their correlations are calculated, price, favorite count, and basket count also show strong relationships with the sold count. However, they are harder to incorporate into such a simple model, and some of the relationships appear polynomial rather than linear. Since they may or may not improve the models, they will be included in the ARIMA models as external regressors to check this.
As expected, the sales numbers are usually best described by ARIMA models with a high number of moving-average terms. In the current case, the best fit is ARIMA(3,1,4). The mean absolute error is about 17, compared to 47 for the previous model based purely on visit counts.
## Series: d32939029$sold_count
## ARIMA(3,1,4)
##
## Coefficients:
##          ar1      ar2     ar3      ma1     ma2      ma3     ma4
##       1.7689  -1.5899  0.5967  -2.1900  2.3295  -1.3508  0.2518
## s.e.  0.2215   0.2479  0.1400   0.2302  0.3683   0.2936  0.1312
##
## sigma^2 estimated as 2182:  log likelihood = -2265.39
## AIC=4546.78   AICc=4547.12   BIC=4579.31
##
## Training set error measures:
##                      ME     RMSE      MAE MPE MAPE      MASE          ACF1
## Training set  0.2268554 46.27601 15.96943 NaN  Inf 0.9278543 -0.0004444436
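A fit of this kind can be obtained with the forecast package; auto.arima for the order selection is our assumption about the workflow, and the random-walk toy series below stands in for the real sales column.

```r
library(forecast)

# Toy non-stationary series in place of d32939029$sold_count
set.seed(7)
sold <- as.numeric(50 + cumsum(rnorm(200)))

fit <- auto.arima(sold)   # on the real data, this selects ARIMA(3,1,4)
fc  <- forecast(fit, h = 2)  # 2-day-ahead forecast, matching the planning horizon
accuracy(fit)             # ME, RMSE, MAE, ... as in the summary above
```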
Adding the visit count to the model further decreased the mean absolute error, to 8.83. Beyond this point, adding the other external data did not improve the model's performance, probably because the basket and favorite counts are correlated with the visit count. Surprisingly, price did not bring any improvement either. Therefore, for the final model we decided to use an ensemble of the ARIMA fit without external regressors and the initial model built on the "constant purchases per visit" percentage.
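Adding the 2-lagged visit count as an external regressor can be sketched as below. The alignment of the regressor (dropping the first two days) and the toy data are our assumptions; the source only states that the visit count was added as xreg.

```r
library(forecast)

# Toy stand-in data for the product
set.seed(2)
n     <- 200
visit <- rpois(n, 8000)
sold  <- pmax(1, round(visit / 83 + rnorm(n, 0, 10)))

# Align the 2-lagged visit count with the sales it should explain:
# sold on day t is regressed on visit from day t - 2.
y <- sold[3:n]
x <- visit[1:(n - 2)]

fit_x <- auto.arima(y, xreg = x)

# Forecast the next two days; their regressors are the two most recent known visits.
fc_x <- forecast(fit_x, xreg = visit[(n - 1):n])
```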
Later on, we also tried feeding the visit-count predictions into the standard ARIMA fit and taking the average of the two prediction results. However, because the average ratio changes over time, we instead computed the current level at each time step as the average of the previous (2-lagged visit count)/(sold count) ratios. With this running estimate, the ratio assumption gave better results, comparable to those of the ARIMA fit. Averaging the two predictions improved the result for most of the products, since the view-count data is then also taken into account.
# data.table::shift lags a vector; each ratio pairs a visit count with the
# sold count observed two days later (the 2-lag used for forecasting)
d32939029$prev_ratio1 <- shift(d32939029$visit_count, 3) / shift(d32939029$sold_count, 1)
d32939029$prev_ratio2 <- shift(d32939029$visit_count, 4) / shift(d32939029$sold_count, 2)
# current level of the ratio: average of the two most recent observations
d32939029$current_ratio1 <- (d32939029$prev_ratio1 + d32939029$prev_ratio2) / 2
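The ensemble step can then be sketched as a simple average of the two predictions. Using ARIMA one-step fitted values as the ARIMA component, and the toy data, are our simplifications for illustration.

```r
library(data.table)
library(forecast)

# Toy stand-in data for the product
set.seed(3)
n     <- 200
visit <- rpois(n, 8000)
sold  <- pmax(1, round(visit / 83 + rnorm(n, 0, 10)))

# Running estimate of the 2-lagged visit/sold ratio, as in the snippet above
prev_ratio1   <- shift(visit, 3) / shift(sold, 1)
prev_ratio2   <- shift(visit, 4) / shift(sold, 2)
current_ratio <- (prev_ratio1 + prev_ratio2) / 2

ratio_pred <- shift(visit, 2) / current_ratio  # ratio-based prediction for each day
arima_pred <- fitted(auto.arima(sold))         # ARIMA component (one-step fitted values here)
ensemble   <- (ratio_pred + arima_pred) / 2    # final prediction: simple average
```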