Forecasting Hotel Room Sales within Online Travel Agencies by Combining Multiple Feature Sets

Gizem Aras, Gülşah Ayhan, Mehmet Ali Sarikaya, A. Aylin Tokuç and C. Okan Sakar

Data Science Department, Cerebro Software Services Inc., Istanbul, Turkey

Computer Engineering Department, Bahçeşehir University, Istanbul, Turkey

Keywords:

Sales Forecasting, Data Enrichment, XGBoost, Online Travel Agency (OTA), Advanced Bookings Model.

Abstract:

Hotel room sales prediction using previous booking data is a prominent research topic for the online travel agency (OTA) sector. Various approaches have been proposed to predict hotel room sales for different prediction horizons, such as yearly demand or daily number of reservations. An OTA website includes offers from many companies for the same hotel, and the position of a company's offer on the OTA website depends on the bid amount the company gives for each click. Therefore, accurate prediction of the sales amount for a given bid is crucial for revenue and cost management for the companies in the sector. In this paper, we forecast the next day's sales amount in order to provide an estimate of daily revenue generated per hotel. An important contribution of our study is the use of an enriched dataset constructed by combining the most informative features proposed in various related studies for hotel sales prediction. Moreover, we enrich this dataset with a set of OTA-specific features that carry information about the position of the company's offers relative to those of its competitors on a travel metasearch engine website. We provide a real application on the hotel room sales data of a large OTA in Turkey. The comparative results show that enriching the input representation with the additional OTA-specific features increases the generalization ability of the prediction models, and that tree-based boosting algorithms achieve the best results on this task.

1 INTRODUCTION

A few decades ago, the idea of booking a hotel with just a few clicks would have been unthinkable. Nowadays, making hotel reservations online is becoming the norm. According to an online travel industry report from 2016, about 180 million people visited online travel sites in a month, a 27 percent increase from the year-earlier period (Nasr, 2015). This increase in online search for hotel reservations reflects the growth of hotel bookings made through online travel agencies (OTAs). OTAs simplify the process of searching for and selecting hotel rooms. Instead of visiting a few individual hotel pages, people can go to OTA websites and browse many hotels in a specific place for a specific date at once. Due to the popularity of online research, hotel marketers are now very dependent on OTAs. With the help of intermediary companies such as booking.com, expedia.com, hostelworld.com, and kayak.com, online hotel booking has become widely popular in the hospitality industry. In recent years, this area has become of utmost importance for practitioners and hotel marketers due to the increasing share of bookings obtained from these platforms.

The stochastic nature of hotel bookings (in terms of metrics such as number of nights, number of rooms, type of room, and fare class) calls for stochastic programming and simulations. Although daily hotel booking predictions are more practical for short-term cost calculations for players in the OTA industry, a realistic model of market demand needs to be more complex and should offer innovative and differentiated solutions. Forecasting hotel room sales has proven to be a challenging task because of the dynamics and complexity of the booking process. Bookings are influenced by many factors such as seasonality, group bookings, events, hotel types, and hotel properties. Capturing such factors properly is essential for the accuracy of the forecast.

In this paper, we propose a novel forecasting

framework on hotel room reservation that uses book-

ing data, hotel properties, and OTA information. Our

model is based on the comparison of various machine

learning models. As an application, we used a 4-month-long dataset collected by one of the largest OTAs in

Aras, G., Ayhan, G., Sarikaya, M., Tokuç, A. and Sakar, C.

Forecasting Hotel Room Sales within Online Travel Agencies by Combining Multiple Feature Sets.

DOI: 10.5220/0007383205650573

In Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2019), pages 565-573

ISBN: 978-989-758-351-3


Turkey. The forecast is updated with daily incom-

ing booking information, and the model predicts the

bookings for the next day. Most existing work on

forecasting for OTAs uses only a single point forecast

(past reservation information). Our work differs from

previous studies as we combine booking data, hotel

facility properties, hotel prices, and OTA report data

which includes information about the click, bid and

impression values for hotels. Our hypothesis was that

this enrichment would improve sales forecasting. The

model is used as a real application for daily booking

forecasting by a company that offers daily bids for

many hotels in a large OTA.

The rest of the paper is organized as follows: Section 2 gives a literature review on hotel sales prediction. Section 3 describes the datasets and the preprocessing steps, and Section 4 presents the modelling approach and implementation details. Section 5 reports and discusses the results. Finally, Section 6 gives conclusions and potential future work.

2 LITERATURE REVIEW

Reviewing the literature, we observed that there are different approaches to forecasting hotel room sales. We present some of this work briefly in this section. In a recent related study, Tse and Poon (Tse and Poon, 2015) used one-year-long reservation data provided by Hotel ICON in Hong Kong. They observed certain trends in the data and developed a regression model using a second-order equation. Their proposed system predicts hotel reservations using reservation curves from past data. To predict a specific day's reservations, they use a 90-day window. The authors make daily and weekly predictions and observe that their model performs better on weekly predictions. This result makes sense, as daily changes in reservations smooth out in weekly cumulative data.

A study by Cezar and Öğüt (Cezar and Öğüt, 2016) examines the impact of consumer feedback, suggestion systems and ranking in search listings for hotels. For 1037 hotels in Paris and 540 hotels in Barcelona, they collect data about the number of online reservations, average hotel price per night, number of stars, number of customer comments, average customer rating and supply, number of employees, service, facility, cleanliness, comfort, money and location. Their data cover April 2012 to June 2012 and were obtained from Booking.com. The authors use two fractional models: a quasi-maximum likelihood estimation model and a regression model with beta distribution. They find that high ratings, a high number of recommendations and high ranks in search listings have a significant positive effect on conversion rates. Furthermore, they observe that a high number of recommendations increases reservations even for hotels with low ranks in search listings. Their findings suggest that enriching our data with recommendation or user-based ranking features could improve our model's performance, which we consider a future direction.

Ellero and Pellegrini (Ellero and Pellegrini, 2014) assess the performance of different widely adopted models from the literature for forecasting Italian hotel occupancy. They find that exponential smoothing, advanced pick-up, and moving average models perform best among the compared models. They build their models using historical data on occupancy rates for five Italian hotels between 2007 and 2010. Their study concludes that even with the best models, the predictions were unsatisfactory (average MAPE of 20%). Their findings imply that hotel occupancy prediction is a hard problem and that it may be difficult to get satisfactory results in this domain.

In a recent study, Efendioğlu and Bulkan (Efendioğlu and Bulkan, 2017) estimate the demand for hotel rooms in Turkey between 2002 and 2013 using ARIMA. They determine the hotel room capacity according to the costs of the unsold rooms and the ARIMA distribution. They also report that hotel room demand in the country could be affected by political crises and warnings about terrorism. This result is another example of the non-deterministic nature of hotel room sales, as it shows how unpredictable factors can affect the demand for hotels.

In another study, Shenoy et al. (Shenoy et al., 2017) estimate reservation information based on user activity and search results using data provided by Expedia. Their study shows that significant results can be obtained through clustering and ensemble operations.

Lee (Lee, 2018) aims to predict hotel room demand using the basic characteristics of reservations. This study examines the time-varying demand rates, the high variability in recent demand and the positive correlations between demands at different time intervals. Three Poisson mixture models (Poisson, Negative Binomial, Negative Multinomial) are investigated to capture the characteristics of reservations. The study reports that a dynamic updating method that utilizes inter-temporal correlations significantly improves the short-term predictability of hotel room demand.

Xie and Lee investigate the relationship between search, click and reservation (Xie and Lee, 2015).

They use data obtained from 39,574 search queries

that include 81,648 distinct hotels. Their data comes


from Expedia, which is a major online travel agency (OTA) website. The study shows that scoring high on hotel features (search-result position, quality indicators, incentives, and brand link) improves bookings significantly.

3 DATA PREPARATION

3.1 Dataset Descriptions

In this section, we describe the dataset used to construct our model. We worked with Company X, an OTA that sells hotels online. Company X sells hotels through a travel metasearch engine as well as other channels. We built our sales model to predict the sales that Company X would make through this travel metasearch engine. Our datasets were provided by Company X, and the daily predictions of our model were shared with Company X. Hereafter, we refer to Company X as the OTA, and to the intermediary OTA website that Company X was using as a channel to sell its hotels as the travel metasearch engine. Features were chosen from the datasets based on their correlations with the target variable and on their importance in a simple regression. Then, additional features were extracted from these columns following a time-delay approach.

The first dataset contains sales-related information about the hotels. This dataset differs from the other datasets in that it contains the target variable of the model, called the 'net total cost'. The features extracted from the provided raw dataset are listed in Table 1.

Table 1: The descriptions of features obtained from reservation data of Company X.

| Feature | Description | Range |
|---|---|---|
| Total night | The total number of nights stayed at a hotel | (1, 542) |
| Total rooms | The total number of rooms purchased at a hotel | (1, 60) |
| Person | The number of persons who stay in a hotel | (1, 98) |
| Net total cost | The net sale amount of a hotel in Turkish Liras | (0, 3608.08) |
| Count pre | The number of pre-reservations made for a hotel | (0, 16) |
| Count exacts | The number of definite reservations made for a hotel | (1, 44) |
| Count cancels | The number of cancelled reservations for a hotel | (0, 9) |

The first dataset is enriched with the report dataset, which the travel metasearch engine provides to each company that gives daily bids in order to be displayed on the travel metasearch engine website. The daily report describes how the OTA is performing in terms of appearing in search results and getting bookings, based on the bid the OTA gives per hotel. The features of the report dataset consist of clicks, bid, average booking value, gross revenue, top position share, hotel impression, opportunity cost per click, cost, region, stars, rating and hotel types. These features are listed in Table 2 along with their descriptions. Some of these features, such as the number of clicks received per hotel or the bid needed to guarantee the first position, are online evaluation metrics that are commonly used in e-commerce.

Table 2: The descriptions of features obtained from daily report provided by the travel metasearch engine.

| Feature | Description | Range |
|---|---|---|
| Clicks | The daily amount of clicks received for a hotel in the metasearch engine. | (0, 1618) |
| Bid | Given bid from OTA to travel metasearch engine | (0.009, 0.56) |
| Average booking value | The given value of each hotel regarding their bookings by travel metasearch engine | (0, 131534) |
| Gross Revenue | The obtained gross revenue of hotels in terms of Turkish Liras | (0, 131534) |
| Top position share | The ratio of being in the first position of a hotel | (0, 1) |
| Hotel impression | Number of daily pageviews of a hotel | (0, 28418) |
| Opportunity Cost Per Click | The bid amount recommended by the travel metasearch engine to appear first on hotel listings. | (0, 1) |
| Cost | Total cost per click of a hotel. | (0, 244.63) |
| Region | The region where the hotel is located. | 7 Class Categorical Values |
| Stars | The star number of a hotel. | (0, 5) |
| Rating | The rating of a hotel given by the travel metasearch engine's users | (0, 96.17) |
| Hotel Types | The binary variable that indicates if a hotel is a summer or winter hotel. | 2 Class Categorical Values |

The third dataset provided by the OTA includes price and position information for all the companies making offers in the travel metasearch engine, along with their hotel IDs and the starting and ending dates of the reservation. The raw features of this dataset are listed in Table 3. As there are many companies that give offers in the metasearch engine, we organized this data and extracted the daily prices and the daily positions of each hotel offered by our OTA. In addition, we took the minimum price of a hotel on the travel metasearch website and the minimum price of a hotel among the top 4 displayed positions, regardless of which company made the offer. We took the top 4 because the travel metasearch website initially shows prices for four hotels. In this way, we knew the competitor companies' prices for the hotels that the OTA offers on that particular day. We summarized the information from this dataset with the following columns: my price, my position, top4 min price and total min price. The features created from the price and position data are given in Table 4.

Table 3: The descriptions of features obtained from the price and position data.

| Feature | Description | Range |
|---|---|---|
| Price | Price of a hotel | (26, 57399) |
| Position | The display ranking of a hotel in the travel metasearch website | (1, 244) |

Table 4: The descriptions of features enriched from price and position data.

| Feature | Description | Range |
|---|---|---|
| My price | The price of a hotel which OTA offers. | (30, 27037) |
| My position | The position of the OTA's hotel on the metasearch engine listings. | (1, 200) |
| Total min price | Minimum price of a hotel offered by any company in the metasearch engine listings. | (28, 27037) |
| Top 4 min price | Minimum price of a hotel belonging to the top 4 position demonstrated in travel metasearch website. | (28, 24094) |

Finally, the fourth dataset used to construct the

model is called the facility dataset. It contains a set

of features that determine the success of the hotels re-

garding some evaluation criteria. The features with

their descriptions from this dataset can be seen in Ta-

ble 5.

Table 5: The descriptions of features obtained from facility dataset.

| Feature | Description | Range |
|---|---|---|
| Score | Online Reputation | (-110, 5780) |
| Survey Score | Customer Satisfaction | (0, 100) |
| Total Points | Total of The Scores | (0, 316) |
| Total Votes | Total Votes by The Public | (0, 4) |

3.2 Preprocessing

After collecting data under the four subsets mentioned above, a combining process was required. While the reservation dataset includes only the hotels that were selling, the other three datasets contain all hotels of the OTA that are displayed in the travel metasearch engine. Therefore, merging these four subsets resulted in a final dataset containing 375000 rows, with the target variable being 0 in 91% of them. This means that there were no confirmed reservations for 91% of the rows that came from the daily metasearch engine report dataset.

Since we need to predict the next day's sales amount, we applied a sliding window approach to construct a time-delay based machine learning model. For this purpose, additional features that capture temporal aspects were created. These features were obtained by computing the moving averages and standard deviations of the original features over 3, 7, 15, 30 and 45 days. Furthermore, a daily sliding window approach was implemented only for the features that were correlated with the target variable. The feature most correlated with the target variable was the target variable itself, as represented in the past days' sliding window columns. The findings also support the fact that total nights and total rooms are directly related to the sales amount. These features were used as sliding window parameters, and their values were added as columns to the dataset for each day from one day to eleven days before the current day. To find the optimal window size in days, we tried different sizes with XGBoost under cross-validation and found the optimal size to be 10. The results can be seen in Fig. 1. Hence, 10 days of history are used for each daily prediction of each hotel to capture seasonal trends. Additional features were added to the dataset with their daily values over the past 10 days: opportunity cost per click, clicks, average booking value, bid, cost, gross revenue, and person, i.e., the number of people who stayed in the hotel. In addition, a profit value was calculated by subtracting the product of bid and clicks from the net total cost. The moving sums over 3, 7, 15, 30 and 45 days were also integrated into the dataset. The correlations on which the sliding operation was based are shown in Fig. 2.
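The time-delay features described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the choice to use only days strictly before day t, and the sample series are assumptions made for illustration.

```python
from statistics import mean, stdev


def lag_features(values, n_lags):
    """For each day t, collect the values observed on days t-1 ... t-n_lags.

    Days that fall before the start of the series are marked as None.
    """
    return [
        [values[t - k] if t - k >= 0 else None for k in range(1, n_lags + 1)]
        for t in range(len(values))
    ]


def rolling_stats(values, window):
    """Moving average and sample standard deviation of the `window` days before day t."""
    stats = []
    for t in range(len(values)):
        past = values[max(0, t - window):t]  # strictly earlier days only
        avg = mean(past) if past else None
        std = stdev(past) if len(past) >= 2 else None  # stdev needs two points
        stats.append((avg, std))
    return stats


def profit(net_total_cost, bid, clicks):
    """Profit as described in the text: net total cost minus bid times clicks."""
    return net_total_cost - bid * clicks


# Hypothetical daily net total cost of a single hotel (not data from the study).
sales = [0.0, 120.0, 0.0, 80.0, 200.0, 0.0]
lags = lag_features(sales, n_lags=3)
stats = rolling_stats(sales, window=3)
```

In this sketch the first rows contain None because no history exists yet, which corresponds to the missing values that the pre-processing step later fills in.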

[Figure 1 panels: RSQ versus window size and MSE versus window size, with window sizes 7 and 10 marked on the horizontal axis.]

Figure 1: Window size effect on XGBoost Cross Validation.

Moreover, we were able to obtain the number of definite reservations, pre-reservations and canceled reservations from the reservation data. These counts were added to the data with their moving averages and standard deviations over periods of 3, 7, 15, 30 and 45 days. Since the sales amount depended more strongly on definite reservations than on the other types of reservations, daily sliding values of definite reservations were also added to the dataset.

In addition to these features, as the target variable contained a large number of zeros, we added a new feature representing the number of days on which the hotel was not sold in a specific period. With this feature, we aim to improve the success of the model in distinguishing between selling and non-selling hotels. Apart from these, the cumulative sums of net total cost, total night, total rooms, my price, total min price, clicks, hotel impression, and the numbers of the different types of reservations (definite, pre, and canceled) were added to the dataset.
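A minimal sketch of the two feature types in this paragraph follows. The text does not state the length of the "specific period", so the window of 3 days, the function name and the sample series are assumptions for illustration only.

```python
from itertools import accumulate


def days_without_sale(sales, window):
    """Count zero-sale days among the `window` days ending at day t (inclusive)."""
    counts = []
    for t in range(len(sales)):
        recent = sales[max(0, t - window + 1):t + 1]
        counts.append(sum(1 for s in recent if s == 0))
    return counts


# Hypothetical daily net total cost of a single hotel (not data from the study).
sales = [0.0, 120.0, 0.0, 0.0, 200.0]
no_sale_3d = days_without_sale(sales, window=3)
cumulative_sales = list(accumulate(sales))  # running total of net total cost
```

A hotel that rarely sells accumulates a high count, which gives the model a direct signal for separating selling from non-selling hotels.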

Last but not least, some date-related features considered important in the online travel agency sector were integrated into the model. These were the day information, the number of days left until the closest public holiday and the length of the closest holiday in days. With these, all the columns were determined and the enriched dataset was obtained. However, the dataset required additional pre-processing steps to handle missing and categorical values. Missing values of features other than the standard deviation features were filled with 0's. The missing values in the standard deviation columns were filled with the average value of the related column.
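The holiday features and the missing-value rule above could look like the following sketch. The holiday calendar format, the `stdev_` column prefix used to recognise standard deviation columns, and all sample values are assumptions for illustration, not details confirmed by the text.

```python
from datetime import date
from statistics import mean


def holiday_features(day, holidays):
    """Return (days until the closest upcoming holiday, its length in days).

    `holidays` is a list of (start_date, length_in_days) pairs.
    """
    upcoming = [((start - day).days, length)
                for start, length in holidays if start >= day]
    return min(upcoming) if upcoming else (None, None)


def fill_missing(columns, std_prefix="stdev_"):
    """Fill None with the column mean for standard deviation columns, else with 0."""
    filled = {}
    for name, values in columns.items():
        if name.startswith(std_prefix):
            observed = [v for v in values if v is not None]
            default = mean(observed) if observed else 0.0
        else:
            default = 0.0
        filled[name] = [default if v is None else v for v in values]
    return filled


# Hypothetical calendar entries and feature columns (not data from the study).
holidays = [(date(2018, 4, 23), 1), (date(2018, 6, 15), 3)]
features_for_day = holiday_features(date(2018, 4, 20), holidays)
clean = fill_missing({"stdev_30_gross_rev": [None, 2.0, 4.0], "clicks": [None, 5]})
```

Filling standard deviation columns with the column mean rather than 0 avoids implying that a hotel with no history has perfectly stable values.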

Another pre-processing operation applied to the dataset is encoding the categorical features using one-hot encoding. Once we completed the pre-processing steps, the features that belong to time t + 1 (the next day), for which the sales prediction will be made, were removed from the dataset so that they do not leak information about the sales value at time t + 1, which is the target variable of the regression problem. After all these steps, we obtained the enriched dataset, which contains 375000 rows and 315 columns covering the dates between 1 February 2018 and 1 July 2018.
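As a rough illustration of these two steps, the sketch below one-hot encodes a categorical column and pairs the features of day t with the target of day t + 1. The function names and sample values are hypothetical.

```python
def one_hot(values, prefix):
    """Turn a categorical column into one 0/1 indicator column per category."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }


def next_day_pairs(features, target):
    """Pair the feature row of day t with the target value of day t + 1.

    The last feature row has no next-day target and is dropped.
    """
    return list(zip(features[:-1], target[1:]))


# Hypothetical values for a single hotel (not data from the study).
hotel_type = one_hot(["summer", "winter", "summer"], prefix="hotel_type")
pairs = next_day_pairs([[1], [2], [3]], [10.0, 20.0, 30.0])
```

Because each row only contains information available up to day t, the model never sees values recorded on the day it is asked to predict.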

4 MODELLING

Sales prediction is a regression problem in which the aim is to predict the sales amount of each hotel for the next day. As described in the previous section, hotel reservation data has high variance. There exist seasonal trends, weekly trends, different patterns for summer and city hotels, increases in bookings independent of seasonal trends due to marketing strategies, etc. Furthermore, approximately a third of all reservations get canceled. Due to this high variance in the data, we focused on non-linear prediction methods and on creating relevant features with a time-delay data pre-processing approach.

During modelling, we used a train/test split together with cross-validation for model training and validation. We created the training and test sets by including 66% of the data belonging to each hotel in the training set and 33% in the test set, in order to have samples from each hotel in the training set. We ran a 5-fold cross-validated (again, hotel-based random split) random search (Bergstra and Bengio, 2012) to tune the hyper-parameters, using a part of the training set as the validation set. Three different evaluation metrics were used: R Squared (RSQ, or coefficient of determination), Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). We considered all three evaluation metrics to determine the best model. RSQ is a well-known evaluation metric used in regression problems, and it is defined as the proportion of the variance in the target variable that is predictable from the explanatory variables. It measures the goodness of fit of the predicted values to the real values. RMSE is the square root of the mean squared difference between the actual and predicted values of the target variable. MAE is another error metric used in regression problems, which measures the average magnitude of the error between the actual and predicted target values.
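The three metrics can be computed directly from their standard definitions, as in the sketch below. The function names and sample values are illustrative and are not taken from the study's code or data.

```python
from math import sqrt


def rsq(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def rmse(actual, predicted):
    """Square root of the mean squared error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical actual and predicted sales amounts.
actual = [0.0, 100.0, 200.0]
predicted = [10.0, 90.0, 230.0]
scores = (rsq(actual, predicted), rmse(actual, predicted), mae(actual, predicted))
```

RMSE penalises large errors more heavily than MAE, so the two can rank models differently when a few sales amounts are predicted badly.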

We used different types of non-parametric machine learning algorithms to validate the contribution of the data-enrichment process to sales prediction. One of these approaches is the family of tree-based ensemble algorithms, which combine multiple weak learners to obtain a single generalizable model. Extreme gradient boosting (XGBoost) (Chen and Guestrin, 2016) is a technique that recently became popular among data scientists, owing to its success in many machine learning challenges (Mangal and Kumar, 2016; Hengl et al., 2017; Zhou and Feng, 2017). Gradient boosting combines the gradient descent algorithm with boosting: trees are added to the ensemble sequentially, each one fitted to reduce the remaining error of the current ensemble. In XGBoost (Chen and Guestrin, 2016), there are additional regularization parameters that control the size and shape of the trees, which makes predictions more robust and the algorithm more generally applicable. Finally, random forest, gradient boosting, and extreme gradient boosting (XGBoost) were chosen from among the tree-based algorithms for our study, as they have been shown to achieve high accuracy on various regression tasks (Breiman, 2001; Geurts et al., 2006; Friedman, 2001).
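To make the boosting idea concrete, the following sketch implements generic gradient boosting for squared error with one-split trees (stumps) on a single feature. It is a toy illustration under simplifying assumptions, not XGBoost and not the configuration used in this study; all names and sample values are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single threshold on x that best fits the residuals in squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # keeps both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def fit_boosting(x, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the residuals are the negative gradient of the loss.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            pi + learning_rate * (left_mean if xi <= threshold else right_mean)
            for xi, pi in zip(x, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, xi):
    """Sum the base value and the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    return base + sum(
        learning_rate * (left if xi <= threshold else right)
        for threshold, left, right in stumps
    )


# Hypothetical one-feature training data.
x = [1, 2, 3, 4, 5, 6]
y = [0.0, 0.0, 10.0, 10.0, 50.0, 50.0]
model = fit_boosting(x, y)
```

The learning rate shrinks each tree's contribution, which is one of the simplest forms of regularization in boosting; XGBoost adds further penalties on tree complexity.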

In addition to the above-mentioned tree-based al-

gorithms, we have also used a deep neural network

which has more than one hidden layer to cope with the

highly complex nature of the underlying model. Each


Figure 2: Correlation Matrix.

hidden layer in the deep neural network increases both the selectivity and the invariance of the representation. A deep neural network can implement extremely intricate functions of its inputs that are simultaneously sensitive to correlations and insensitive to large irrelevant variations (Candel et al., 2016). We also applied a generalized linear model (GLM) as a simpler baseline model. GLM is a generalization of standard linear regression used to predict responses both for dependent variables with discrete distributions and for those which are nonlinearly related to the predictors (Nykodym et al., 2016). The results obtained with XGBoost, random forest, gradient boosting, deep neural network and generalized linear model algorithms are presented in the next section.

4.1 Environment Settings

All experiments were run on a remote Linux (Ubuntu 17.10) server with a 32-core CPU. Algorithms were implemented using open source libraries. The XGBoost algorithm was used in Python through its scikit-learn wrapper (Chen and Guestrin, 2016). H2O (H2O.ai, 2018) implementations of the gradient boosting, random forest, deep neural network, and generalized linear model algorithms were used in R. Once initialized, the H2O web interface was accessed on port 54321 of localhost.

5 RESULTS & DISCUSSION

The most successful model obtained from the reservation dataset was gradient boosting, as seen in Tables 6 and 8. On the other hand, XGBoost gave the highest RSQ on the enriched dataset, on both the validation and test sets (see Tables 7 and 9). On the reservation dataset, the difference between the top two performing models is not significant, as the ranking of the models differs when different evaluation metrics are used. Similarly, on the enriched dataset, the R-squared values of the top three models fall within each other's ranges. Therefore, it is difficult to conclude that one model performs significantly better than the others. We should also note that, as seen in Tables 7 and 9, the tree-based models performed better than the generalized linear and deep neural network models on the enriched dataset. In addition, different platforms were used to train different models. For instance, the gradient boosting model was taken from the H2O library and implemented in R. H2O also has its own implementation of XGBoost, but we preferred to use the native library in Python. Trying these models on other platforms might have led to slightly different results, but not to significant changes.

The obtained results show that the machine learning algorithms perform better on the enriched dataset. Comparing Tables 8 and 9, the RSQ obtained with XGBoost increased from 0.269 to 0.574 on the test set, a difference of 0.305. In cross validation results, the


Table 6: Cross Validation Results for the Reservation Dataset.

| Model | RSQ | MSE | MAE |
|---|---|---|---|
| eXtreme Gradient Boosting | 0.254 +/- 0.03 | 3612.53 +/- 158.11 | 20.15 +/- 0.48 |
| Gradient Boosting Machines | 0.279 +/- 0.04 | 3492.20 +/- 173.91 | 20.06 +/- 0.47 |
| Random Forest | 0.288 +/- 0.02 | 3737.91 +/- 400.93 | 20.44 +/- 0.44 |
| Generalized Linear Models | 0.310 +/- 0.05 | 3618.99 +/- 240.44 | 19.82 +/- 0.25 |
| Deep Neural Network | 0.286 +/- 0.06 | 3754.95 +/- 450.36 | 19.90 +/- 1.37 |

Table 7: Cross Validation Results for the Enriched Dataset.

| Model | RSQ | MSE | MAE |
|---|---|---|---|
| eXtreme Gradient Boosting | 0.611 +/- 0.02 | 1600.18 +/- 251.76 | 8.70 +/- 0.29 |
| Gradient Boosting Machines | 0.585 +/- 0.03 | 2088.96 +/- 413.94 | 9.42 +/- 0.39 |
| Random Forest | 0.582 +/- 0.02 | 2105.95 +/- 245.94 | 9.42 +/- 0.15 |
| Generalized Linear Models | 0.450 +/- 0.04 | 2754.66 +/- 133.55 | 16.16 +/- 0.38 |
| Deep Neural Network | 0.517 +/- 0.04 | 2426.38 +/- 248.03 | 14.86 +/- 1.20 |

Table 8: Test Results for the Reservation Dataset.

| Model | RSQ | MSE | MAE |
|---|---|---|---|
| eXtreme Gradient Boosting | 0.269 | 3884.67 | 19.83 |
| Gradient Boosting Machines | 0.280 | 3827.25 | 19.91 |
| Random Forest | 0.203 | 3571.57 | 20.17 |
| Generalized Linear Models | 0.238 | 3415.74 | 19.60 |
| Deep Neural Network | 0.216 | 3516.00 | 20.38 |

difference was even slightly larger: 0.357. These results show that the additional features integrated during the data preparation step detailed in Section 3 carry important information for capturing the trend observed in the net total cost. Fig. 3 shows the top ten features ranked according to their feature importance in the XGBoost model. Feature importance in this graph is calculated using the average coverage of splits, which is defined as the number of samples affected by the split (xgboost developers, 2018). As seen in Fig. 3, three of the top ten features are not from the initial

Table 9: Test Results for the Enriched Dataset.

| Model | RSQ | MSE | MAE |
|---|---|---|---|
| eXtreme Gradient Boosting | 0.574 | 2267.02 | 9.45 |
| Gradient Boosting Machines | 0.564 | 2159.98 | 9.54 |
| Random Forest | 0.572 | 2121.13 | 9.40 |
| Generalized Linear Models | 0.443 | 2760.80 | 16.18 |
| Deep Neural Network | 0.507 | 2442.98 | 15.15 |

reservation dataset. The gross revenue and proﬁt re-

lated features were obtained from the OTA’s report

dataset. Net total cost is actually the daily revenue

generated per hotel by company X. So, the actual rev-

enue is naturally related to proﬁt and gross revenue

values calculated in the OTA’s report. The important

features that are from the reservation dataset represent

information about past net total costs and bookings.

From this observation, we can suggest that with ap-

propriate representation of past information, it is pos-

sible to capture relevant trends, even in hotel bookings

data, which has notably high variability and noise.

The actual and predicted values of the target variable (net total cost) are plotted in Figure 4. The actual net total costs with comparatively low values are predicted better, as they cluster in a relatively narrow range around the line. Since there are slightly more points under the line than above it, the predictions generated by the model tend to be lower than the actual values. Sales amounts higher than 1000 can be considered the minority group in the data, and the predictions become less accurate beyond this point. Furthermore, sales amounts beyond the 1500-2000 range can be interpreted as outliers for our dataset, and the model does not produce accurate predictions for values around this range.

[Figure 3 plot: feature importance (axis from 0.000 to 0.040) for the features net_total_cost_5, total_night_3, avg_45_net_total_cost, cnt_2_30, profit_all, profit_30days, net_total_cost_3, cnt_2_3, stdev_30_gross_rev and net_total_cost_1.]

Figure 3: Top 10 Features from Extreme Gradient Boosting Model with the Enriched Dataset.


Figure 4: XGboost Model’s Predictions Plotted Against Real Sales.

6 CONCLUSION & FUTURE WORK

In this study, we investigated the performance of several machine learning techniques on a baseline and an enriched dataset for hotel sales prediction. For this purpose, different prediction algorithms, including gradient boosting, XGboost, random forest, generalized linear model and deep neural network, were applied to predict the next day’s sales using historical data. The obtained results demonstrated that feature enrichment was crucial for solving the complex problem of hotel sales prediction. Compared to the studies in the literature, the nature of our problem was different. To the best of our knowledge, our study is the first that aims to predict the future net total cost, which is a real-valued quantity related to the price of hotel sales. We improved our models by using features that summarize the trends in the target variable well. The results also showed that algorithms using ensembles of trees, especially boosting algorithms, outperformed the other methods. We should also note that the boosting algorithms, i.e., XGboost and gradient boosting, gave comparable results.

In future work, the data can be enriched further by adding consumer-generated recommendation and comment data. Some studies in the literature showed that consumer-generated ratings and the number of recommendations are important features for hotel reservations (e.g., Cezar and Öğüt, 2016). We can integrate this information into our prediction framework to obtain more generalizable models. For this purpose, natural language processing techniques such as sentiment analysis can be used to extract features from consumer-generated text data, and the summarized information obtained from this module can be combined with the features used in this study.
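To make the proposed combination concrete, the sketch below derives two summary features from review text and merges them into a per-hotel feature dictionary. It is only an illustration of the idea: the word lists are a hypothetical toy lexicon standing in for a proper sentiment model, and the function names, feature names and sample reviews are assumptions made for this example.

```python
import re

# Hypothetical toy lexicon; a real system would rely on a trained sentiment model.
POSITIVE = {"clean", "friendly", "great", "comfortable", "excellent"}
NEGATIVE = {"dirty", "noisy", "rude", "bad", "broken"}


def sentiment_score(text):
    """Score in [-1, 1]: (positive - negative) / matched words, 0.0 if none match."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0


def review_features(reviews):
    """Summarize a hotel's reviews as a review count and an average sentiment."""
    scores = [sentiment_score(r) for r in reviews]
    return {
        "review_count": len(scores),
        "avg_sentiment": sum(scores) / len(scores) if scores else 0.0,
    }


# Hypothetical reviews for one hotel.
reviews = [
    "Great location and friendly staff.",
    "Noisy room, dirty bathroom, but great breakfast.",
]

# Combine the text-derived summary with an existing (hypothetical) feature value.
hotel_features = {"net_total_cost_1": 110.0}
hotel_features.update(review_features(reviews))
```

Keeping the text module's output as a small set of numeric summaries lets it be appended to the existing feature vector without changing the prediction models themselves.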

REFERENCES

Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305.

Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.

Candel, A., Parmar, V., LeDell, E., and Arora, A. (2016). Deep learning with H2O. H2O.ai Inc.

Cezar, A. and Öğüt, H. (2016). Analyzing conversion rates in online hotel booking: the role of customer reviews, recommendations and rank order in search listings. International Journal of Contemporary Hospitality Management, 28(2):286–304.

Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM.

Efendioğlu, D. and Bulkan, S. (2017). Capacity management in hotel industry for Turkey. In Handbook of Research on Holistic Optimization Techniques in the Hospitality, Tourism, and Travel Industry, pages 286–304. IGI Global.

Ellero, A. and Pellegrini, P. (2014). Are traditional forecasting models suitable for hotels in Italian cities? International Journal of Contemporary Hospitality Management, 26(3):383–400.

Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232.

Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine learning, 63(1):3–42.

H2O.ai (2018). R Interface for H2O. version 3.20.0.3.

Hengl, T., de Jesus, J. M., Heuvelink, G. B., Gonzalez, M. R., Kilibarda, M., Blagotić, A., Shangguan, W., Wright, M. N., Geng, X., Bauer-Marschallinger, B., et al. (2017). SoilGrids250m: Global gridded soil information based on machine learning. PLoS ONE, 12(2):e0169748.

Lee, M. (2018). Modeling and forecasting hotel room demand based on advance booking information. Tourism Management, 66:62–71.

Mangal, A. and Kumar, N. (2016). Using big data to enhance the Bosch production line performance: A Kaggle challenge. In Big Data (Big Data), 2016 IEEE International Conference on, pages 2029–2035. IEEE.

Nasr, R. (2015). Online travel industry is booming: Report. Retrieved July 6, 2016.

Nykodym, T., Kraljevic, T., Hussami, N., Rao, A., and Wang, A. (2016). Generalized linear modeling with H2O. Published by H2O.ai, Inc.

Shenoy, G. G., Wagle, M. A., and Shaikh, A. (2017). Kaggle competition: Expedia hotel recommendations. arXiv preprint arXiv:1703.02915.

Tse, T. S. M. and Poon, Y. T. (2015). Analyzing the use of an advance booking curve in forecasting hotel reservations. Journal of Travel & Tourism Marketing, 32(7):852–869.

xgboost developers (2018). Python API Reference for XGBoost.

ICPRAM 2019 - 8th International Conference on Pattern Recognition Applications and Methods


Xie, K. and Lee, Y.-J. (2015). Hotels at our fingertips: Understanding consumer conversion from search, click-through, to book.

Zhou, Z.-H. and Feng, J. (2017). Deep forest: Towards an alternative to deep neural networks. arXiv preprint arXiv:1702.08835.
