prediction outcomes. However, these interventions need to be reviewed for effectiveness. Controlled experiments are the best scientific design for establishing a causal relationship between an intervention and changes in user behavior (Kohavi et al. 2007). A/B-tests are used on large sites such as Amazon, Google, or Bing to measure the effect of changes to the user interface (UI) of apps and websites, to algorithms, or of other adjustments (Kohavi and Longbotham 2017). In the simplest case, an A/B-test consists of a control (the default version, A) and a treatment (the changed version, B). Users are randomly assigned to one of these versions, and their actions on the website or app are logged. A previously defined overall evaluation criterion (OEC) provides a quantitative measure of the change's objective. At the end of the experimentation period, a hypothesis test is performed to determine whether the difference in the OEC between the two variants is statistically significant (Kohavi et al. 2007). This enables data-driven decision-making in web-facing industries.
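As a concrete sketch of this final step, the following Python example compares a hypothetical per-user OEC between control and treatment using Welch's two-sample t-test; the simulated data, sample sizes, and significance level are assumptions for illustration, not taken from the cited work.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical per-user OEC observations (e.g., revenue per user)
# collected over the experimentation period.
oec_control = rng.normal(loc=10.0, scale=3.0, size=5000)    # variant A
oec_treatment = rng.normal(loc=10.2, scale=3.0, size=5000)  # variant B

# Welch's two-sample t-test on the difference in mean OEC.
t_stat, p_value = stats.ttest_ind(oec_treatment, oec_control,
                                  equal_var=False)

alpha = 0.05  # conventional significance level, an assumption here
print(f"mean difference: {oec_treatment.mean() - oec_control.mean():.3f}")
print(f"p-value: {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the difference in OEC is statistically significant.")
else:
    print("Fail to reject H0: no significant difference detected.")
```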
2.1.1 Randomization Unit
The randomization or experimentation unit is the item on which observations are made; in most cases, this is the user (Kohavi et al. 2007). Users are randomly assigned to one variant, and the assignment is persistent. Further, the entities should be distributed equally, which in the case of an A/B-test means that users are split 50/50 between the variants. While the final distribution should be equal, it is best practice to ramp up the treatment first (Kohavi et al. 2007): the treatment starts at a lower percentage, which is then gradually increased. Each phase runs for a few hours, which offers the opportunity to check for problems and errors before the treatment is exposed to a wide range of users.
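A common way to obtain such a persistent assignment with a gradual ramp-up is deterministic hashing of the user identifier. The following sketch illustrates the idea; the hash scheme, phase schedule, and all names are assumptions for illustration rather than the setup described by Kohavi et al.

```python
import hashlib

# Illustrative ramp-up schedule: treatment share per phase (assumption).
RAMP_UP = [0.01, 0.05, 0.20, 0.50]

def bucket(user_id: str, experiment: str) -> float:
    """Map a user deterministically to a value in [0, 1).

    Hashing (experiment, user_id) keeps the assignment persistent:
    the same user always lands in the same bucket.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000

def assign(user_id: str, experiment: str, phase: int) -> str:
    """Return 'treatment' or 'control' for the current ramp-up phase."""
    share = RAMP_UP[min(phase, len(RAMP_UP) - 1)]
    return "treatment" if bucket(user_id, experiment) < share else "control"

print(assign("user-123", "new-checkout", phase=0))  # 1% ramp-up phase
print(assign("user-123", "new-checkout", phase=3))  # full 50/50 split
```

Because the threshold only grows across phases, a user who entered the treatment during an early phase stays in the treatment as the share increases, preserving the persistence of the assignment.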
There are different designs of randomized trials, for example student-level, teacher-level, or school-level random assignment. The theoretical and practical considerations behind the choice of randomization design are summarized by Wijekumar et al. (2012). Choosing teacher-level or school-level assignment has the advantage that all users in a school class belong to the same variant. This is particularly useful if the experiment is carried out in class; otherwise the teacher would have to divide the class. Statistical power plays a role in the decision between teacher-level and school-level assignment, as analyses have shown that within-school random assignments are more efficient than school-level random assignments (Campbell et al. 2004). The disadvantage of choosing teacher-level or school-level assignment is a reduction in effective sample size, since observations within a cluster tend to be correlated (Campbell et al. 2004).
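This loss can be quantified with the standard design effect for cluster randomization, DEFF = 1 + (m - 1) * ICC, where m is the cluster size and ICC the intra-cluster correlation; the sketch below plugs in assumed numbers purely for illustration.

```python
def effective_sample_size(n_total: int, cluster_size: int, icc: float) -> float:
    """Effective sample size under cluster randomization.

    The design effect DEFF = 1 + (m - 1) * ICC inflates the variance,
    so the effective sample size shrinks to n / DEFF.
    """
    deff = 1 + (cluster_size - 1) * icc
    return n_total / deff

# Assumed numbers for illustration: 2000 students in classes of 25,
# with an intra-cluster correlation of 0.15.
print(effective_sample_size(2000, 25, 0.15))  # ~435 effective observations
```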
2.1.2 Overall Evaluation Criterion
The OEC defines the goal of the experiment and must be defined in advance (Kohavi et al. 2007). It can also be referred to as a response variable, dependent variable, or evaluation metric. The definition of the OEC is of great importance, as the rejection of the null hypothesis is based on the comparison between the OEC of the two variants. As the experimentation period is in most cases only a few weeks, the OEC must be measurable in the short term while being predictive of the long term. Deng and Shi differentiate between three types of metrics that can be used as OECs (Deng and Shi 2016): business-report-driven metrics, simple heuristic-based metrics, and user-behavior-driven metrics. Business-report-driven metrics are based on long-term goals and are associated with business performance, such as revenue per user (Deng and Shi 2016). Simple heuristic-based metrics describe the user's interaction with the website, for example an activity counter. User-behavior-driven metrics are based on a behavioral model, for example of satisfaction or frustration. Whatever type of metric is chosen in the end, two characteristics are important for metrics: directionality and sensitivity (Deng and Shi 2016). Directionality means that the interpretation of the metric must have a clear direction, for example, the larger the OEC the better, or vice versa. Sensitivity means that the metric should be sensitive to the changes made in the variant (Deng and Shi 2016).
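To make these metric types concrete, the following sketch derives a business-report-driven metric (revenue per user) and a simple heuristic-based metric (clicks per user) per variant from a hypothetical event log; the log schema, field names, and values are invented for illustration.

```python
from collections import defaultdict

# Hypothetical clickstream events: (user_id, variant, event_type, value).
events = [
    ("u1", "A", "click", 0.0),
    ("u1", "A", "purchase", 12.99),
    ("u2", "B", "click", 0.0),
    ("u2", "B", "click", 0.0),
    ("u3", "B", "purchase", 7.50),
]

revenue = defaultdict(float)   # total revenue per variant
clicks = defaultdict(int)      # total clicks per variant
users = defaultdict(set)       # distinct users per variant

for user_id, variant, event_type, value in events:
    users[variant].add(user_id)
    if event_type == "purchase":
        revenue[variant] += value  # input to revenue per user
    elif event_type == "click":
        clicks[variant] += 1       # input to the activity counter

for variant in sorted(users):
    n = len(users[variant])
    print(variant,
          f"revenue/user={revenue[variant] / n:.2f}",
          f"clicks/user={clicks[variant] / n:.2f}")
```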
2.1.3 Architecture
There are three important architectural components of A/B-tests: the randomization algorithm, the assignment method, and the data path (Kohavi and Longbotham 2017). The randomization algorithm is the function that maps each user persistently to one variant; as stated above, the distribution between the variants should be equal. In the second step, the assignment method allocates the user to the mapped variant. This can be done by redirecting the user to a new webpage, by traffic splitting, or client-side, by dynamically adjusting the web page according to the variant's changes (Kohavi and Longbotham 2017). The
data path describes the component that collects the user interactions (e.g., the clickstream data) and afterwards aggregates and processes them. Another tool which should be built in is a diagnostics system, which graphs the numbers of randomization units in each variant.
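One widely used diagnostic of this kind checks for a sample ratio mismatch, i.e., whether the observed unit counts per variant are consistent with the configured split. The sketch below assumes a 50/50 design and made-up counts, and uses a chi-square test instead of the graphing described above.

```python
from scipy import stats

# Observed numbers of randomization units (users) per variant;
# made-up counts for illustration.
observed = [50_421, 49_302]

# Expected counts under the configured 50/50 split.
total = sum(observed)
expected = [total / 2, total / 2]

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
if p_value < 0.001:  # strict threshold, an assumption for this sketch
    print("Possible sample ratio mismatch: investigate the assignment path.")
else:
    print("Observed split is consistent with the configured 50/50 design.")
```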