planner is run once per day for each server in a PoD
to detect resource constraints, for example whether
a capacity shortfall would prevent all LPARs from
being hosted on the remaining hosts. If
this condition is detected, a warning notification is
sent to the cloud administrators for planning
purposes.
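The daily check described above can be illustrated with a minimal sketch. It assumes that capacity is expressed as free CPU and memory per server and that placement is tested with a simple first-fit pass over the largest LPARs first; the text does not specify the planner's actual placement logic. The names `Lpar`, `Server`, `unplaceable_after_failure` and `daily_capacity_check` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Lpar:
    name: str
    cpu: float       # required processor units (hypothetical unit)
    mem_gb: float    # required memory in GB


@dataclass
class Server:
    name: str
    cpu_free: float      # spare processor units on this server
    mem_free_gb: float   # spare memory on this server
    lpars: list          # LPARs currently hosted on this server


def unplaceable_after_failure(failed, survivors):
    """Return the LPARs of `failed` that fit on no surviving server.

    Assumption: first-fit placement, largest LPARs first.
    """
    free = {s.name: [s.cpu_free, s.mem_free_gb] for s in survivors}
    left_over = []
    ordered = sorted(failed.lpars, key=lambda l: (l.mem_gb, l.cpu), reverse=True)
    for lpar in ordered:
        for cap in free.values():
            if cap[0] >= lpar.cpu and cap[1] >= lpar.mem_gb:
                cap[0] -= lpar.cpu       # reserve capacity on this survivor
                cap[1] -= lpar.mem_gb
                break
        else:
            left_over.append(lpar)       # no survivor had room
    return left_over


def daily_capacity_check(pod):
    """Map each server name to the LPAR names that could not be re-hosted."""
    warnings = {}
    for failed in pod:
        survivors = [s for s in pod if s is not failed]
        stuck = unplaceable_after_failure(failed, survivors)
        if stuck:
            warnings[failed.name] = [l.name for l in stuck]
    return warnings
```

A non-empty result would correspond to the condition that triggers the warning notification to the cloud administrators.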
5 IMPLEMENTATIONS AND RESULTS
5.1 Initial Implementation: Serial Restart
The restart priority of LPARs is based on their SLA.
Thus, in case of failover, the highest SLA workloads
would be restarted first followed by the next highest
SLA. Within the same SLA level, restart priority is
random. In an early CMS release, restart capability
was needed only for workloads with the two highest
SLA levels. The initial Remote Restart
implementation ran as a single process that
initiated failover once a server failure had been
detected and the need for failover had been
determined.
For each LPAR on the affected server, the
failover planner determines a destination server, and
the restart process starts. The failover process is
performed for the highest priority LPARs first,
configuring the storage and network for these
LPARs to their destination servers, and restarting
them at the destination server. After all LPARs with
the highest restart priority are restarted at their target
servers, the next lower priority level LPARs are
processed.
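The serial procedure can be sketched as follows. This is an illustration under stated assumptions, not the CMS implementation: `Lpar`, `serial_restart`, and the callbacks `remap_lun` and `boot` are hypothetical names, a larger `sla` value is assumed to mean a higher SLA, and `plan` stands in for the output of the failover planner.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Lpar:
    name: str
    sla: int                                   # assumption: larger = higher SLA
    luns: list = field(default_factory=list)   # LUN identifiers attached to the LPAR


def serial_restart(lpars, plan, remap_lun, boot, rng=random):
    """Restart LPARs one at a time, highest SLA first.

    plan      : dict mapping LPAR name to destination server (from the planner)
    remap_lun : callback(lun, dest), moves one LUN to the destination (hypothetical)
    boot      : callback(lpar, dest), restarts the LPAR on the destination (hypothetical)
    """
    order = list(lpars)
    rng.shuffle(order)                             # random order within an SLA level
    order.sort(key=lambda l: l.sla, reverse=True)  # stable sort keeps the shuffled ties
    log = []
    for lpar in order:
        dest = plan[lpar.name]
        for lun in lpar.luns:                      # storage is moved before the restart
            remap_lun(lun, dest)
        boot(lpar, dest)
        log.append((lpar.name, dest))
    return log
```

Because everything runs in one loop, the total failover time is the sum of the per-LPAR times, which is the limitation the later sections address.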
There are two significant time components to
executing the restart. The first is the process of
unmapping the LUNs from the (failed) original
server and mapping them to the designated failover
server. This time is proportional to the number of
LUNs connected to the LPAR. The second time
component is the process of restarting the LPAR on
the designated failover server.
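These two components suggest a simple cost model, sketched below. It assumes that the mapping time is exactly linear in the number of LUNs and that the restart time is constant per LPAR; `t_map` and `t_boot` are hypothetical parameters, not values reported here.

```python
def lpar_restart_time(n_luns, t_map, t_boot):
    """Time to fail over one LPAR: remap each LUN in turn, then restart.

    n_luns : number of LUNs attached to the LPAR
    t_map  : assumed time to unmap and remap one LUN (seconds)
    t_boot : assumed time to restart the LPAR on the failover server (seconds)
    """
    return n_luns * t_map + t_boot


def serial_failover_time(lun_counts, t_map, t_boot):
    """Total time when LPARs are processed strictly one after another."""
    return sum(lpar_restart_time(n, t_map, t_boot) for n in lun_counts)
```

Under this model the serial total grows with both the number of LPARs and the number of LUNs per LPAR, so any values plugged in for `t_map` and `t_boot` should be treated as placeholders rather than measurements.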
In this early CMS release, each LPAR was
allowed to have up to two LUNs. For the case where
only the top two SLAs were to be restarted, with up
to two LUNs per LPAR, the SLA time budget was
readily met.
However, in the subsequent releases of CMS, the
number of disks per LPAR was continuously
increased. In addition, it was necessary to extend
restart capabilities to all SLA levels. With these
increases, it was clear that we needed a solution for
Remote Restart which would handle restarts for a
larger number of LPARs containing more LUNs,
within the SLA time limits.
5.2 Parallel Restart
The requirement for more LUNs per LPAR and the
larger number of LPARs to be restarted motivated
us to improve the Remote Restart solution using
parallel processes. We chose server-level
parallelism, in which the level of parallelism
depends on the number of operational servers in
the PoD.
In our parallelization scheme, one restart process
is launched for each destination server. For example,
in a PoD with 6 servers and one failed server, there
would be up to 5 destination failover servers and
therefore up to 5 concurrent restart processes.
LPARs assigned for restart on a particular server
are restarted sequentially, starting with the highest
priority LPARs in that group. For each LPAR,
storage is mapped, storage and network drivers are
reconfigured for the target server, and the LPAR is
restarted at the destination server. Once all highest
priority LPARs assigned to that destination server
are restarted, the next SLA priority level LPARs are
processed. A similar process is performed in parallel
for all destination servers.
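The server-level scheme can be sketched as one worker per destination server. The sketch uses a thread pool purely for illustration, whereas the text describes separate restart processes. `restart_stream`, `parallel_restart` and the `restart_one` callback are hypothetical names, and a larger `sla` value is assumed to mean a higher priority.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def restart_stream(dest, lpars, restart_one):
    """Restart the LPARs assigned to one destination sequentially, highest SLA first."""
    done = []
    for lpar in sorted(lpars, key=lambda l: l["sla"], reverse=True):
        restart_one(lpar, dest)   # map storage, reconfigure drivers, restart (hypothetical)
        done.append(lpar["name"])
    return done


def parallel_restart(lpars, plan, restart_one):
    """Run one restart stream per destination server, all streams concurrently.

    lpars : list of dicts with "name" and "sla" keys
    plan  : dict mapping LPAR name to its destination server
    """
    by_dest = defaultdict(list)
    for lpar in lpars:
        by_dest[plan[lpar["name"]]].append(lpar)
    if not by_dest:
        return {}
    # One worker per destination, so parallelism equals the number of target servers.
    with ThreadPoolExecutor(max_workers=len(by_dest)) as pool:
        futures = {
            dest: pool.submit(restart_stream, dest, group, restart_one)
            for dest, group in by_dest.items()
        }
        return {dest: f.result() for dest, f in futures.items()}
```

Priority ordering is enforced only within each stream, so a lower priority LPAR on one destination may restart before a higher priority LPAR on another.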
These parallelization steps ensured that the
failover time was well within the allowed SLA for
the subsequent releases of CMS.
5.3 Parallel Disk Mapping
However, the disk capacity in CMS continues to
increase. For the current release, each LPAR can
have up to 24 LUNs and up to 96 TB of storage. For
a large number of LPARs on a single server, this can
lead to a very large number of storage LUNs having
to be mapped to different servers in a short
time.
Analysis indicated that the most time-consuming
step was mapping disks to the destination server, so
our next improvement focused on parallel disk
mapping. In this implementation, in addition to
running the failover processes in parallel, we also
map multiple disks attached to a single LPAR in
parallel. We limit the number of simultaneous disk
mappings for a single failover stream to four to
avoid a potential bottleneck at the storage
management interface. By measuring the
time needed for restarting individual LPARs with a