In parallel to the VM Scheduler process there is a continuous credit update. Each VM runs a VM Monitor agent, which reports periodic heartbeats with statistical information to the VM Manager. For governance purposes, the heartbeat includes the number of VCPUs of the VM. On the server side, the VM Manager stores an instance history record in a database, calculates the gap between the last heartbeat of the instance and the current one, and adds the corresponding VCPU hours to the used VCPU hours in the running pod database. VCPU hours are accounted as wall time, since the resource is allocated even if no real CPU is used, following the same scheme as commercial clouds.
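As a minimal sketch, this accounting step could look as follows; the database handle and its methods are hypothetical, not part of the DIRAC API:

    from datetime import datetime, timezone

    def account_heartbeat(db, instance_id, vcpus, now=None):
        # Charge the wall-time VCPU hours elapsed since the last
        # heartbeat of this instance to its running pod.
        now = now or datetime.now(timezone.utc)
        record = db.get_instance_history(instance_id)   # hypothetical accessor
        gap_hours = (now - record.last_heartbeat).total_seconds() / 3600.0
        # Wall-time accounting: VCPUs are charged while allocated,
        # whether or not real CPU is consumed.
        db.add_used_vcpu_hours(record.running_pod, vcpus * gap_hours)
        db.set_last_heartbeat(instance_id, now)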
5 PROTOTYPE TEST
This section presents a test designed to evaluate the key elasticity features, including scale-up and scale-down. It also evaluates the overall system response when a running pod is active but has no credit left, so that it has to run its payload using opportunistic resources. The test does not aim at a stage with a power demand larger than the power supply. In this sense it is a test with no job response time constraints; the target is to validate the core features of the credit model for cloud governance, particularly its elasticity.
5.1 Test Set-up
A top-down description of the test set-up defines two user groups, A and B. Each group has two running pods because it uses two different instance types: the m1.small flavour with 1 VCPU, 2 GB of memory and 20 GB of disk, and the small instance type with 1 VCPU, 1 GB of memory and 10 GB of disk. The job submissions of a group can run in the running pods of both instance types, but these jobs cannot match the VMs submitted for the other group.
The RPs of the test are part of the EGI FedCloud: the CESGA cloud running OpenNebula and the BIFI cloud running OpenStack. The maximum number of VCPUs for each IaaS provider is 25, with small instances at CESGA and m1.small at BIFI. Both of them have a maximum of 5 VCPUs for opportunistic use.
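This set-up can be summarized with the following data structure; it is purely illustrative and does not correspond to an actual DIRAC or VMDIRAC configuration schema:

    # Illustrative summary of the test set-up (not a real configuration schema).
    RUNNING_PODS = {
        # Each group gets one running pod per instance type.
        ("A", "CESGA"): {"flavour": "small",    "vcpus": 1, "ram_gb": 1, "disk_gb": 10},
        ("A", "BIFI"):  {"flavour": "m1.small", "vcpus": 1, "ram_gb": 2, "disk_gb": 20},
        ("B", "CESGA"): {"flavour": "small",    "vcpus": 1, "ram_gb": 1, "disk_gb": 10},
        ("B", "BIFI"):  {"flavour": "m1.small", "vcpus": 1, "ram_gb": 2, "disk_gb": 20},
    }

    PROVIDERS = {
        "CESGA": {"cloud_manager": "OpenNebula", "max_vcpus": 25, "opportunistic_vcpus": 5},
        "BIFI":  {"cloud_manager": "OpenStack",  "max_vcpus": 25, "opportunistic_vcpus": 5},
    }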
The submission pattern is composed of 10 series of 100 jobs each. The jobs of a single series are submitted at about the same moment using DIRAC parametric jobs. The sequence starts by submitting the 100-job series of one group, then the series of the other group, repeating this period up to 10 series. Using this pattern it is possible to evaluate the adaptability of the resources, since the VMs can only be used by a particular group. A sketch of this submission loop is given after the next paragraph.
The job workload runs a fractal plotting software (Mandelbrot) with a high CPU/IO ratio. The binary is placed in the VMs using the job input sandbox, and the fractal plot is returned in the output sandbox. A single job process had previously been tested in the clouds, with very similar execution times between the different providers. Group B has been configured with a maximum of VCPU hours sufficient to run 3 series of 100 jobs; after that, opportunistic use of resources applies.
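The following sketch shows how such a series could be submitted with the DIRAC job API. The mandelbrot binary name, the output file name, the parameter name and the gap values are placeholders for the actual test payload, and setParameterSequence is the parametric interface of recent DIRAC releases (older releases used a different call):

    import random
    import time

    from DIRAC.Core.Base import Script
    Script.parseCommandLine()

    from DIRAC.Interfaces.API.Dirac import Dirac
    from DIRAC.Interfaces.API.Job import Job

    def submit_series(dirac, group, n_jobs=100):
        # One parametric job expands into n_jobs identical tasks.
        job = Job()
        job.setName("mandelbrot_%s_%%n" % group)    # %n is the parameter index
        job.setInputSandbox(["mandelbrot"])         # binary shipped in the input sandbox
        job.setOutputSandbox(["fractal.png"])       # plot returned in the output sandbox
        job.setParameterSequence("plotid", list(range(n_jobs)))
        job.setExecutable("mandelbrot", arguments="%(plotid)s")
        return dirac.submitJob(job)

    # Mean gap between series: half of the largest job response time
    # measured on a cold system; the value and the sigma are assumptions.
    GAP_MEAN_S = 1800.0

    dirac = Dirac()
    for series in range(10):
        group = "A" if series % 2 == 0 else "B"     # alternate the two groups
        submit_series(dirac, group)
        time.sleep(max(0.0, random.gauss(GAP_MEAN_S, GAP_MEAN_S / 4)))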
The VM horizontal auto-scale policy is configured as a compromise between VM efficiency and total wall time. The VM is configured with a margin time of 5 minutes before halting, i.e. before stopping a VM a margin time without any running workload is required. If more workload from the next series is matched within this halting margin time, then the VM is not stopped. Only jobs of the same group can match the payload of a VM. The time gaps of the series pattern are drawn from a normal distribution whose average is half of the largest job response time, obtained from a previous processing of 100 jobs on a cold system. The idea is not to simulate real user behaviour, but to test the elastic scale-up and scale-down of the system, producing series that use previous VMs of the same group as well as new VMs, depending on the workload gap times and the power available to a group at a specific moment.
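The halt decision of this policy can be sketched as follows; the timestamp handling is an assumption about how the VM Monitor could track idle time:

    from datetime import datetime, timedelta, timezone

    HALT_MARGIN = timedelta(minutes=5)

    def should_halt(last_payload_finished, now=None):
        # Stop a VM only after a full margin time without any workload
        # running; a payload matched inside the margin keeps the VM
        # alive for the next series.
        now = now or datetime.now(timezone.utc)
        return now - last_payload_finished >= HALT_MARGIN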
The contextualization is done by ssh, with an agent polling the VMs that are waiting for contextualization. This method needs synchronization between the submission time and the contextualization time, which is a disadvantage compared with contextualization methods integrated in the cloud managers, like cloud-init or prolog and epilog. On the other hand, ssh contextualization requires neither additional libraries in the images nor particular implementations in the cloud managers. With a public IP and an sshd running in the VM, the same contextualization scripts can be launched independently of the IaaS site.
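A minimal sketch of this polling and contextualization loop, assuming a root login, a context.sh script name and illustrative timeouts:

    import subprocess
    import time

    SSH_OPTS = ["-o", "StrictHostKeyChecking=no", "-o", "ConnectTimeout=5"]

    def contextualize(vm_ip, script="context.sh", timeout_s=600):
        # Poll the VM until sshd answers, then push and run the same
        # context script regardless of the IaaS site.
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            probe = subprocess.run(["ssh", *SSH_OPTS, "root@%s" % vm_ip, "true"])
            if probe.returncode == 0:
                subprocess.run(["scp", *SSH_OPTS, script, "root@%s:/root/" % vm_ip], check=True)
                subprocess.run(["ssh", *SSH_OPTS, "root@%s" % vm_ip, "sh /root/" + script], check=True)
                return True
            time.sleep(10)   # VM not reachable yet; retry
        return False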
5.2 Test Results
This section presents the different results of the test, starting from general plots and then breaking them down into more detailed metrics.
Fig. 3 shows the running VMs, i.e. those VMs which have been submitted and booted, then contextualized, and finally declared running by the VM Monitor when the Job Agent starts to match jobs. While the VM is running, the VM Monitor sends periodic heartbeats. If a VM does not change status or send a running heartbeat within 30 minutes, it is declared stalled. Two VMs stalled in the test; without reaching the running status, these VMs were in error status.
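A sketch of this stall rule; the timestamps are hypothetical fields of the instance history record:

    from datetime import datetime, timedelta, timezone

    STALLED_AFTER = timedelta(minutes=30)

    def is_stalled(last_status_change, last_running_heartbeat, now=None):
        # A VM is declared stalled when it has neither changed status
        # nor sent a running heartbeat for 30 minutes.
        now = now or datetime.now(timezone.utc)
        last_activity = max(last_status_change, last_running_heartbeat)
        return now - last_activity >= STALLED_AFTER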