ods. Since the sales amount depended more strongly on definite
reservations than on the other types of reservations, daily sliding
values of definite reservations were also added to the dataset.
In addition to these features, as the target variable contained a
large number of zeros, we inserted a new feature representing the
number of days on which the hotel had no sales in a specific period.
With this feature, we aimed to improve the ability of the model to
distinguish selling from non-selling hotels. Apart from these, the
cumulative sums of net total cost, total night, total rooms, my price,
total min-price, clicks, hotel impression, and the number of
reservations of each type (definite, pre, and canceled) were added to
the dataset.
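The two feature types above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper name `zero_sale_days`, the trailing 3-day window, and the sample sales values are all hypothetical, since the text does not specify the period length.

```python
from itertools import accumulate


def zero_sale_days(daily_sales, window):
    """Count zero-sale days in the trailing `window` days, including the current day."""
    counts = []
    for i in range(len(daily_sales)):
        start = max(0, i - window + 1)
        counts.append(sum(1 for s in daily_sales[start:i + 1] if s == 0))
    return counts


# Hypothetical daily sales amounts for one hotel, oldest day first.
sales = [0, 0, 120, 0, 80, 0, 0, 0]

# Assumed window of 3 days; the paper only says "a specific period".
no_sale_last_3_days = zero_sale_days(sales, window=3)

# Running total of the same series, analogous to the cumulative-sum features.
cumulative_sales = list(accumulate(sales))
```

In a real pipeline both features would be computed separately for each hotel, with rows ordered by date.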
Last but not least, some date-related features considered important
in the online travel agency sector were integrated into the model.
These were the day information, the number of days left until the
closest public holiday, and the length of the closest holiday in
days. With these, all the columns were determined and the enriched
dataset was obtained. Yet, the dataset required additional
pre-processing steps to handle missing and categorical values.
Missing values of features other than the standard deviation features
were filled with zeros. The missing values in the standard deviation
columns were filled with the average value of the related column.
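The imputation rule described above could look like the sketch below. The function name `fill_missing`, the column names `clicks` and `price_std`, and the sample rows are hypothetical; missing values are represented as `None` for the purpose of the example.

```python
from statistics import mean


def fill_missing(rows, std_columns):
    """Fill None with the column mean for std columns and with 0 elsewhere."""
    col_means = {}
    for col in std_columns:
        observed = [r[col] for r in rows if r[col] is not None]
        # Fall back to 0.0 if a std column has no observed value at all.
        col_means[col] = mean(observed) if observed else 0.0

    filled = []
    for r in rows:
        new_row = {}
        for col, value in r.items():
            if value is None:
                value = col_means[col] if col in std_columns else 0
            new_row[col] = value
        filled.append(new_row)
    return filled


# Hypothetical rows with one missing value per column.
rows = [
    {"clicks": 10, "price_std": 2.0},
    {"clicks": None, "price_std": None},
    {"clicks": 4, "price_std": 4.0},
]
filled = fill_missing(rows, std_columns={"price_std"})
```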
Another pre-processing operation applied to the dataset was to encode
the categorical features using one-hot encoding. Once we completed
the pre-processing steps, the features that belong to time t + 1 (the
next day), for which the sales prediction is made, were removed from
the dataset to avoid bias about the sales value at time t + 1, which
is the target variable of the regression problem. After all these
steps, the final enriched dataset contained 375,000 rows and 315
columns, covering the dates between 1 February 2018 and 1 July 2018.
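A rough sketch of these two steps is given below: one-hot encoding of a categorical column, and pairing the features of day t with the sales of day t + 1 so that no next-day feature enters the input. The field names (`sales`, `clicks`, `hotel_type`) and the helper names are assumptions made for illustration only.

```python
def one_hot(value, categories):
    """Encode one categorical value as 0/1 indicator columns."""
    return {f"is_{c}": int(value == c) for c in categories}


def make_supervised_pairs(days):
    """Pair day-t features with day-(t+1) sales for one hotel, ordered by date."""
    pairs = []
    for today, tomorrow in zip(days, days[1:]):
        # Only information available at day t goes into the features.
        features = {"sales_t": today["sales"], "clicks_t": today["clicks"]}
        features.update(one_hot(today["hotel_type"], ["city", "summer"]))
        # The target is the next day's sales amount.
        pairs.append((features, tomorrow["sales"]))
    return pairs


# Hypothetical three-day history of a single city hotel.
days = [
    {"sales": 0, "clicks": 5, "hotel_type": "city"},
    {"sales": 150, "clicks": 9, "hotel_type": "city"},
    {"sales": 90, "clicks": 7, "hotel_type": "city"},
]
pairs = make_supervised_pairs(days)
```

The last day has no known next-day target, so three days of history yield two training pairs.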
4 MODELLING
Sales prediction is a regression problem in which the aim is to
predict the sales amount of each hotel for the next day. As described
in the previous section, hotel reservation data has high variance.
There exist seasonal trends, weekly trends, different patterns for
summer and city hotels, increases in bookings independent of seasonal
trends due to marketing strategies, etc. Furthermore, approximately a
third of all reservations get canceled. Due to this high variance in
the data, we focused on non-linear prediction methods and on creating
relevant features with a time-delay data pre-processing approach.
During modelling, we used a train/test split cross-validation
approach for model training and validation. We created the training
and test sets by including 66% of the data belonging to each hotel in
the training set and 33% in the test set, in order to have samples
from each hotel in the training set. We ran a 5-fold cross-validated
(again, hotel-based random split) random search (Bergstra and Bengio,
2012) to tune the hyper-parameters, using a part of the training set
as the validation set. Three different evaluation metrics were used:
R Squared (RSQ, or coefficient of determination), Root Mean Square
Error (RMSE), and Mean Absolute Error (MAE). We considered all three
evaluation metrics to determine the best model.
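One way to read the hotel-based split is that the rows of each hotel are shuffled and divided separately, so every hotel contributes samples to the training set. The sketch below follows that reading; the function name `split_per_hotel`, the `hotel_id` field, the fixed seed, and the toy data are assumptions, not details given in the text.

```python
import random
from collections import defaultdict


def split_per_hotel(rows, train_fraction=0.66, seed=0):
    """Randomly split each hotel's rows so every hotel appears in the training set."""
    rng = random.Random(seed)
    by_hotel = defaultdict(list)
    for row in rows:
        by_hotel[row["hotel_id"]].append(row)

    train, test = [], []
    for hotel_rows in by_hotel.values():
        shuffled = hotel_rows[:]
        rng.shuffle(shuffled)
        # Keep at least one row per hotel in the training set.
        cut = max(1, round(train_fraction * len(shuffled)))
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test


# Hypothetical data: two hotels with nine daily rows each.
rows = [{"hotel_id": h, "day": d} for h in ("A", "B") for d in range(9)]
train, test = split_per_hotel(rows)
```

The same per-hotel grouping could be reused to build the folds for the cross-validated random search.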
RSQ is a well-known evaluation metric used in regression problems,
and it is defined as the proportion of the variance in the target
variable that is predictable from the explanatory variables. It
measures the goodness of fit of the predicted values to the real
values. RMSE is the square root of the mean squared difference
between the actual and predicted target values, which equals the
standard deviation of the prediction errors when their mean is zero.
It penalizes large errors more heavily than small ones. MAE is
another error metric used in regression problems, which measures the
average magnitude of the error between the actual and predicted
target values.
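The three metrics can be computed directly from their standard definitions, as in the sketch below. The function name `regression_metrics` and the sample values are illustrative only.

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Return RSQ, RMSE and MAE for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    return {
        # RSQ is undefined when all actual values are equal (ss_tot == 0).
        "rsq": 1 - ss_res / ss_tot,
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
    }


# Hypothetical actual and predicted daily sales amounts.
metrics = regression_metrics([0, 100, 200, 300], [10, 90, 210, 290])
```

In this example every prediction is off by exactly 10, so RMSE and MAE coincide; they diverge as soon as the errors differ in magnitude.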
We used different types of non-parametric machine learning algorithms
to validate the contribution of the data-enrichment process to sales
prediction. One of these approaches is the family of tree-based
ensemble algorithms, which combine multiple weak learners to obtain a
single generalizable model. Extreme gradient boosting (XGBoost) (Chen
and Guestrin, 2016) is a technique that recently became popular among
data scientists owing to its success in many machine learning
challenges (Mangal and Kumar, 2016; Hengl et al., 2017; Zhou and
Feng, 2017). Gradient boosting combines the gradient descent
algorithm with boosting: each new tree is fitted to the negative
gradient of the loss of the current ensemble, so that the ensemble
error decreases step by step. In XGBoost (Chen and Guestrin, 2016),
there are additional regularization parameters that control the size
and shape of the trees, which makes predictions more robust and the
algorithm more generally applicable. Finally, the random forest,
gradient boosting, and extreme gradient boosting (XGBoost) algorithms
were chosen from among the tree-based algorithms to be applied in our
study, as they have been shown to achieve high accuracy on various
regression tasks (Breiman, 2001; Geurts et al., 2006; Friedman, 2001).
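To make the idea of boosting weak learners concrete, the sketch below implements plain gradient boosting for squared error with one-split trees (stumps) on a single feature. It is a teaching example, not XGBoost and not the configuration used in the study: it has no regularization terms, and the function names, learning rate, number of rounds, and toy data are all assumptions.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    # Excluding the largest value keeps the right side of every split non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def fit_gradient_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Fit an additive model of stumps; requires at least two distinct x values."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, learning_rate


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, stumps, learning_rate = model
    return base + sum(
        learning_rate * (left if x <= threshold else right)
        for threshold, left, right in stumps
    )


# Hypothetical step-shaped data: no sales for x <= 3, sales of 100 afterwards.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 100, 100, 100]
model = fit_gradient_boosting(xs, ys)
```

Each round removes a fixed fraction of the remaining residual, which is why a small learning rate needs many rounds to approach the training targets.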
In addition to the above-mentioned tree-based algorithms, we also
used a deep neural network, which has more than one hidden layer, to
cope with the highly complex nature of the underlying model. Each
Forecasting Hotel Room Sales within Online Travel Agencies by Combining Multiple Feature Sets