address this problem by exploring one kind of trace left by virtualization on Xen-based Clouds: domain ids (domids). The rest of this paper is structured as follows: in section 2 we review relevant related work to provide background to the problem; in section 3 we discuss the Xen hypervisor and the generation of domain ids; in section 4 we discuss domids collected from a small sample (100) of virtual servers in the Amazon Cloud, and use these as a basis for co-location tests in section 5. In section 6 we use domids collected from further samples to demonstrate likely recycling of resources. Finally, in section 7 we present our conclusions and future directions for this work.
2 RELATED WORK
The ability of one instance to degrade the performance of other co-located instances is well known, and is referred to as the noisy neighbour problem. Intel identifies the primary cause as the sharing of resources, such as the L2 cache, that cannot be partitioned (Intel, 2014); that is, there is no mechanism to limit how much of the resource an instance may consume. Consequently, it is possible for instances to use such resources disproportionately, to the detriment of others.
The standard metric for compute performance is execution time. Identifying whether a running task is likely to suffer from poor performance, i.e. to need increased execution time, is difficult. On their production clusters, Google detects likely poor performance by repeatedly measuring a task's cycles per instruction (CPI), i.e. the number of cycles required to execute an instruction, and comparing against the known CPI distribution (Zhang et al, 2013). If more outliers (defined as samples more than 2 standard deviations from the mean) are detected than expected, then the performance of the task is likely to be poor. The protagonist, i.e. the noisy neighbour, is identified by correlating other instances' CPU usage with the increase in CPI outliers for the victim.
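As an illustrative sketch only (not Google's production implementation, and using hypothetical CPI values), the outlier test described above amounts to counting samples that fall more than 2 standard deviations from the baseline mean:

```python
import statistics

def cpi_outlier_fraction(samples, baseline_mean, baseline_stdev):
    """Fraction of CPI samples lying more than 2 standard
    deviations from the mean of the baseline CPI distribution."""
    outliers = [s for s in samples
                if abs(s - baseline_mean) > 2 * baseline_stdev]
    return len(outliers) / len(samples)

# Hypothetical baseline CPI distribution for a task type:
baseline = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.1, 0.9]
mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)

# Recent measurements; repeated high-CPI samples suggest contention:
recent = [1.0, 1.8, 1.9, 1.05, 2.1, 1.0]
print(cpi_outlier_fraction(recent, mean, stdev))  # 0.5
```

A fraction well above what the baseline distribution predicts flags the task as likely suffering from interference.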
On a Public Cloud, information about when an instance is scheduled for CPU time by the hypervisor is only available to the provider, and is not subsequently made available to customers. As such, it is not possible to state precisely when an instance is running or not. A coarser approach would be to attempt to correlate instance performance using compute benchmarks. Such an approach would likely require a minimum number of co-located instances on a given host in order to succeed; it therefore already requires co-location to be knowable, and risks missing a small degree of co-location per host.
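The coarse correlation approach just described could, under the assumption that co-resident instances slow down together, be sketched as computing the correlation between benchmark timings sampled at the same instants on two candidate instances (the timing values below are hypothetical):

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical benchmark execution times (seconds), sampled at the
# same instants on two instances suspected of sharing a host:
a = [10.1, 10.0, 12.5, 10.2, 13.1, 10.0]
b = [9.8, 9.9, 12.0, 10.0, 12.8, 9.9]
print(pearson(a, b))  # values near 1.0 suggest correlated slowdowns
```

In practice such a test would be confounded by unrelated load variation, which is one reason the approach remains coarse.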
The problem of extracting information between co-locating virtual machines has been investigated by a number of authors. In (Zhang et al, 2012) the sharing of an L2 cache between VMs was shown to be a vulnerability when it was demonstrated that one VM may extract cryptographic keys from another VM on the same host. Such an attack is known as an access-driven side-channel attack. Particularly noteworthy is the fact that the attack was demonstrated on an SMP system. In this case the challenge of core migrations, i.e. the scheduling of a VM onto different cores during its lifetime, as would be encountered in a Cloud environment, needs to be overcome. However, the demonstration was on a standalone Xen system rather than on a Public Cloud.
The vulnerability of a shared cache relies, in part, on exploiting hypervisor scheduling. Methods to increase the difficulty of successfully mounting such attacks are under development (Lui, Ren and Bai, 2014), and indeed are already being integrated into Xen. Whilst such work mitigates fine-grained attacks, denial-of-service attacks, which seek to obtain a large share of the L2 cache, are still considered viable.
This has led to work on targeted attacks in the
Cloud, whereby an attacker seeks to co-locate with a
specific target. This requires methods for
determining co-location with the target before the
attack can be launched. In (Ristenpart, Tromer, Shacham and Savage, 2010) a number of network-based probes were proposed, for example ping round-trip time and a common dom0 IP address. In order to test the veracity of these methods the authors also use access timings of shared drives. No details are provided of the type of drive being used (local or network) or of how the disk is shared.
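The dom0-address probe of Ristenpart et al. can be sketched as follows, assuming first-hop addresses have already been collected externally (e.g. via traceroute); the addresses below are purely hypothetical:

```python
def shared_dom0(trace_a, trace_b):
    """Heuristic from Ristenpart et al. (2010): on early EC2, the
    first traceroute hop from a guest was its host's dom0 address,
    so a matching first hop hinted at co-residence.  Hop lists are
    assumed to be collected externally (e.g. with traceroute)."""
    return bool(trace_a) and bool(trace_b) and trace_a[0] == trace_b[0]

# Hypothetical first hops collected from two instances:
print(shared_dom0(["10.1.2.3", "10.0.0.1"], ["10.1.2.3", "10.0.0.5"]))  # True
print(shared_dom0(["10.1.2.3"], ["10.9.9.9"]))  # False
```

As discussed next, a provider can defeat this probe simply by normalising or hiding the addresses visible to guests.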
However, as the authors state, a provider can easily obfuscate network-based probes, and this already appears to be the case; our own experiments confirm it. Whilst access times to shared drives may potentially be used for detecting co-locating siblings, there are a number of issues, not discussed, that require further investigation. Perhaps most important is the widely reported variation in disk read/write timings on EC2 (Armbrust et al, 2009), which clearly needs to be accounted for in any test that proposes to use them.
In (Bates et al, 2013) watermarking of network flows is proposed and demonstrated on a variety of stand-alone virtual systems. However, as the authors state, there are a number of defences against watermarking in place in Public Clouds, and in particular on EC2, which prevented them from successfully using the tests.
In (Zhang, Juels, Oprea and Reiter, 2011) a
cache avoidance strategy is used so that instances