of requests for ad slot filling as episodes, and we use
the Monte Carlo algorithm to learn state-action values
by averaging sample returns (Sutton et al., 1998).
Because there is not enough data to estimate all
state-action values, we use a prediction model to find
initial state-action values and use these initial values
in the averaging step of the Monte Carlo algorithm,
given a real bidding dataset provided by our industrial
partner. For each ad request, the prediction model
outputs the probability of filling the ad slot when a
certain ad network is selected. We then use these values
to find state-action values. Finally, using experiments
on a real bidding dataset, we show that the expected
revenue increases if we choose ad networks based on
the state-action values.
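The averaging step described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used in this paper: the class name `MonteCarloQ`, the callable `prior` (standing in for values derived from the prediction model), and the pseudo-count `prior_weight` are hypothetical, and how fill probabilities are turned into initial values is not specified here. An episode is assumed to be a list of (state, action, reward) triples.

```python
class MonteCarloQ:
    """Every-visit Monte Carlo averaging with prior-initialised values.

    `prior` is a hypothetical callable (state, action) -> initial estimate;
    `prior_weight` is an assumed pseudo-count given to that estimate.
    """

    def __init__(self, prior, prior_weight=1.0):
        self.prior = prior
        self.prior_weight = prior_weight
        self.q = {}  # running average return per (state, action)
        self.n = {}  # visit count per (state, action), including the pseudo-count

    def value(self, state, action):
        # Fall back to the prior for pairs never seen in the data.
        key = (state, action)
        return self.q.get(key, self.prior(state, action))

    def update(self, episode, gamma=1.0):
        # Walk the episode backwards so the return g accumulates in one pass.
        g = 0.0
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            key = (state, action)
            if key not in self.q:
                self.q[key] = self.prior(state, action)
                self.n[key] = self.prior_weight
            self.n[key] += 1
            # Incremental mean: move the estimate towards the sampled return.
            self.q[key] += (g - self.q[key]) / self.n[key]


# Hypothetical usage: the first ad network does not fill, the second fills.
learner = MonteCarloQ(prior=lambda s, a: 0.5)
learner.update([("attempt_1", "network_A", 0.0), ("attempt_2", "network_B", 1.0)])
```

With a larger `prior_weight`, the initial values dominate for longer before the sampled returns take over, which is one way to cope with sparsely observed state-action pairs.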
This paper is structured as follows. Section 2
presents a brief literature review. Section 3 discusses
the proposed method. Section 4 presents the results of
applying our method to a real-time bidding dataset.
Finally, in Section 5, we make our concluding remarks
and discuss future work.
2 LITERATURE REVIEW
Much research has addressed methods to increase
publisher revenue from online advertising. Most of
this work focuses on setting the floor price
dynamically. Few approaches consider the ordering of
ad networks in the waterfall strategy. In this section
we review some of these works.
Programmatic advertising is defined as the automated
serving of digital ads in real time based on individual
ad impression opportunities (Busch, 2016).
Programmatic advertising helps publishers and
advertisers to reach their goals and increases the
efficiency of online advertising. The programmatic
buying and selling of ad slots creates a new
environment in which publishers and advertisers can
communicate better with each other. Publishers may
easily find suitable advertisements for their ad slots,
while advertisers may target suitable users, thus
increasing potential product sales and brand visibility
(Wang et al., 2017).
An important factor in determining publisher revenue
is the reserve price. The reserve price, or floor price,
is the minimum price that a publisher expects to obtain
(Zhang et al., 2014). If it is too high and no
advertiser is willing to pay it, the advertisement slot
will not be sold, whereas if it is set too low the
publisher's profit suffers. For this reason, specifying
that price is important, and adjustments to the reserve
price may lead to an increase in publisher revenue.
Adjusting the reserve price is not a trivial issue and
has motivated a lot of research.
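The trade-off described above can be illustrated with a small sketch. It assumes a single second-price auction in which the winner pays the larger of the second-highest bid and the reserve price; this auction format, the function name `auction_revenue`, and the example bids are assumptions made for illustration only.

```python
def auction_revenue(bids, reserve):
    """Publisher revenue for one auction, assuming second-price rules."""
    # Only bids at or above the reserve price can win the slot.
    eligible = sorted((b for b in bids if b >= reserve), reverse=True)
    if not eligible:
        return 0.0  # reserve too high: the slot stays unsold
    if len(eligible) == 1:
        return float(reserve)  # a lone eligible bidder pays the reserve
    return float(eligible[1])  # otherwise the winner pays the second bid


# Hypothetical bids for one impression.
bids = [5.0, 3.0, 1.0]
low = auction_revenue(bids, reserve=0.0)       # reserve plays no role
well_set = auction_revenue(bids, reserve=4.0)  # reserve between top two bids
too_high = auction_revenue(bids, reserve=6.0)  # no bidder meets the reserve
```

In this toy case, a reserve between the two highest bids raises the payment above the second bid, while a reserve above the top bid leaves the slot unsold.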
Wu et al. use a censored regression model to fit
censored bidding data; a DSP suffers from such censored
information, especially for lost bids (Wu et al.,
2015). Because the assumption of censored regression
does not hold on real-time bidding data, they propose
a mixture model, namely a combination of linear
regression for observed data and censored regression
for censored data, to predict the winning price.
Xie et al. present a method in which the calculation
of the reserve price is based on their prediction of
the top bid and of the difference between the top bid
and the second bid (Xie et al., 2017). They build
several families of classifiers and fit them to
historical data. They convert the identification of
high-value inventories into a binary decision. They
also convert the gap between the top and the second
bid into a binary value, assigning 1 for a significant
and 0 for a non-significant difference compared to a
threshold. In the next step they use the idea of
cascading (Quinlan, 1986) and try to reduce the false
positive rate of the prediction algorithm by combining
the series of classifiers obtained before. They are
inspired by (Jones and Viola, 2001), who follow the
same basic idea with their own features and
classification models. After predicting whether the
top bid is high or low and whether the difference
between the top bid and the second bid is significant,
they change the reserve price for high top bids. In
other research, the reserve price is predicted by
optimizing the weights of features (Austin et al.,
2016). In that paper, two vectors define the feature
values and the feature weights respectively. The inner
product of these two vectors gives the value of the
reserve price. The main process lies in learning the
feature weight vector. For this purpose, they use
gradient descent.
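The inner-product formulation lends itself to a short sketch. The objective minimized in (Austin et al., 2016) is not stated here, so the example below assumes, purely for illustration, a squared error between the predicted reserve price and a hypothetical target price; the function names, the learning rate, and the example numbers are likewise assumptions.

```python
def predict_reserve(weights, features):
    """Reserve price as the inner product of weight and feature vectors."""
    return sum(w * x for w, x in zip(weights, features))


def gradient_step(weights, features, target, lr=0.1):
    """One gradient descent step on the assumed loss 0.5 * (prediction - target)**2."""
    error = predict_reserve(weights, features) - target
    # The gradient of the assumed loss with respect to each weight is error * x.
    return [w - lr * error * x for w, x in zip(weights, features)]


# Hypothetical example: two features and an illustrative target price of 5.0.
features = [1.0, 2.0]
weights = [0.0, 0.0]
for _ in range(20):
    weights = gradient_step(weights, features, target=5.0)
```

Each step moves the weights in the direction that reduces the assumed error, so the predicted reserve price approaches the target over the iterations.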
Yuan et al. model real-time bidding as a dynamic
game and adjust the reserve price by following a game
between publisher and advertiser (Yuan et al., 2014).
The game is to increase or decrease the current value
of the reserve price based on the auction. Other works
also address floor price optimization, e.g.
(Cesa-Bianchi et al., 2015).
The other research area, which is the main topic of
this paper, is choosing suitable ad networks in the
waterfall strategy. Finding the best ad network for
each user impression is a research topic that has
received less attention in recent years than reserve
price optimization. However, it is an important topic.
Sometimes there is a contract between a publisher and
an ad network. There should be a balance between
selecting this ad network and other ad networks that
may achieve higher revenue (Muthukrishnan, 2009).
According to (Ghosh et al., 2009), when the number