validate the proposed method through simulations of SOA system construction. For real-world applications, the proposed method is effective in problem settings such as SOA system construction, where the environment changes dynamically and strictly accurate AI is not required.
2 PRELIMINARIES
2.1 Service Oriented Architecture
Computer systems are now spread everywhere around us. It is impossible to control all of these computers from a single place, so distributed control is needed. The computers are built on various platforms and devices, and each computer has its own function. Linking these functions across devices and platforms has the potential to create useful, large-scale services. Service Oriented Architecture (SOA) is one method of building a large-scale computer system composed of multiple services across networks and platforms.
2.2 Deep Reinforcement Learning
Reinforcement learning (RL) (Sutton and Barto, 2018) optimizes a sequence of actions in a given environment. An agent learns behavior that fulfills a goal through interaction with the environment and decides its behavior based on a policy. The agent perceives various information from the environment, and this information is treated as the state. After the agent takes an action, it moves to the next state because the information given by the environment changes. This single process is called a step and is the minimum unit of the agent's behavior. The agent repeats this process until it moves from the initial state to a terminal state; this sequential process is called an episode. In a shooting game, for example, one step is one frame (the minimum recognizable unit) and one episode is the flow from the start to the end of the game. Behavior is evaluated with a scalar value called the reward: the agent acts in the environment, receives rewards, and improves its policy based on them. In reinforcement learning, a Markov decision process (MDP) is used to define the learning environment. An MDP consists of states, rewards, actions, and transition probabilities.
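To make these terms concrete, the following minimal Python sketch shows a generic agent-environment interaction loop; the env and agent interfaces (reset, step, act, observe) are hypothetical placeholders, not part of the cited works.

# Minimal sketch of the step/episode loop described above.
# `env` and `agent` are hypothetical objects used only for illustration.
def run_episode(env, agent):
    state = env.reset()                 # initial state
    total_reward = 0.0
    done = False
    while not done:                     # one iteration = one step
        action = agent.act(state)       # action chosen by the policy
        next_state, reward, done = env.step(action)
        agent.observe(state, action, reward, next_state, done)
        total_reward += reward
        state = next_state
    return total_reward                 # return collected over one episode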
Q-Learning: Q-Learning (Watkins and Dayan, 1992) is a classic reinforcement learning method. In Q-Learning, a Q-function is used to predict the expected total reward. The Q-function is updated according to the equation below, where the learning rate α adjusts the weight of the current Q-value in the update and the discount rate γ determines the importance of rewards received in later steps.
\[
Q(s_t, a_t) = (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( R(s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \right)
\]
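A minimal tabular sketch of this update rule in Python, assuming a small discrete state and action space; the sizes, hyperparameters, and sampled transition are illustrative assumptions.

import numpy as np

# Tabular Q-learning update corresponding to the equation above.
n_states, n_actions = 10, 4             # assumed sizes for illustration
alpha, gamma = 0.1, 0.99                # learning rate and discount rate
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    # Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target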
Deep Q-Network: DQN (Mnih et al., 2013; Mnih et al., 2015) is an extension of Q-Learning that uses a deep neural network to represent the Q-function. DQN also uses experience replay, a method that stores the most recent transitions in a fixed-size buffer and reuses them to update the Q-function.
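A minimal sketch of such a replay buffer, assuming transitions are stored as (state, action, reward, next state, done) tuples; the capacity and batch size below are illustrative.

import random
from collections import deque

# Fixed-size experience replay buffer: old transitions are discarded
# automatically once the capacity is exceeded.
class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random minibatch used to update the Q-network.
        return random.sample(self.buffer, batch_size)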
2.3 Offline RL
Traditional RL assumes interaction with the environment to collect information. However, an unlimited number of interactions with the environment is impossible in real applications. For example, a recommender system can interact with its environment only a limited number of times because the time users spend accessing it is finite. In autonomous driving, learning a model from scratch is dangerous because an immature model may cause traffic accidents. Social games often change their settings, for example by introducing new functions, and it would be very costly to learn from scratch every time the setting changes slightly. Hence, the problems of applying RL to real applications are as follows.
• Problem 1: the number of opportunities to collect data, or the amount of available data, is limited.
• Problem 2: a pre-trained model that provides the minimum required ability is needed.
• Problem 3: a change to the environment setting makes the learned model useless.
In Offline RL, the agent learns from a fixed dataset. Offline RL tackles Problems 1 and 2, but it does not always solve Problem 3; this research tackles Problem 3. The dataset is a collection of transitions, each consisting of a state, an action, the resulting next state, and a reward, and Offline RL learns only from this dataset. Learning from a fixed dataset limits the information the agent can use: the agent cannot use action sequences that are not contained in the dataset. This forces the agent to learn a model that maximizes its utility within the distribution given by the dataset. The region not covered by the dataset is called out-of-distribution (OOD). It is important to prevent values for OOD actions from affecting the utility estimated by the Q-function. The rest of this section introduces methods for Offline RL.
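As a sketch of this setting, the following shows an offline training loop that samples only from a fixed dataset of transitions and never interacts with the environment; the dataset contents, table sizes, and hyperparameters are illustrative assumptions, not from the paper.

import random
import numpy as np

# Hypothetical fixed dataset of (state, action, reward, next_state) transitions,
# e.g. logged by a previously deployed policy.
dataset = [
    (0, 1, 0.0, 1),
    (1, 0, 1.0, 2),
    (2, 3, 0.5, 0),
]

n_states, n_actions = 3, 4
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

# Offline training: minibatches come only from the fixed dataset;
# no new transitions are collected during learning.  Note that the
# max over a' may bootstrap from actions never seen in the dataset (OOD).
for _ in range(1000):
    s, a, r, s_next = random.choice(dataset)
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target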