On the standby device, we use a script that checks for entries sent by its allotted active device to the server and raises a notification when the active device has not responded within a user-determined time period, i.e., has timed out. This was extended further to
inform the master device about the dead active de-
vice. The idea was to keep a list of active devices in
a file on the standby device, and vice versa on the ac-
tive device. The entries would include the hostname
and private IP of the devices since they are all in the
same network. Similarly, the master device maintains lists of active, standby, and free devices in files.
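A minimal sketch of the standby-side timeout check could look as follows (the file layout, the TIMEOUT value, and the report_failure callback are illustrative assumptions, not the exact code used):

import time

TIMEOUT = 5.0  # user-determined timeout in seconds (assumed value)

def check_active_devices(entries_file, report_failure):
    """Flag active devices whose last entry is older than TIMEOUT.

    Hypothetical file layout: one "hostname,private_ip,timestamp"
    line per status entry sent by the allotted active device.
    """
    last_seen = {}
    with open(entries_file) as f:
        for line in f:
            hostname, ip, ts = line.strip().split(",")
            last_seen[(hostname, ip)] = float(ts)  # keep the newest entry
    now = time.time()
    for (hostname, ip), ts in last_seen.items():
        if now - ts > TIMEOUT:
            report_failure(hostname, ip)  # notify the master device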
Each active device performs its computation (simple incremental addition) and reports its status along with the current timestamp. Its corresponding standby devices monitor this status, and when the active device has not reported within the timeout period, a failure is reported to the master device. The master device then stops all threads corresponding to the failed active device, assigns the first standby device that reported the failure as the new active device, stops the monitoring threads that device was previously running, and reassigns standby devices to all the active devices from the existing ones.
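As an illustration, the loop run by an active device could be sketched as follows (the report_status callback, the run flag, and the reporting interval are assumptions):

import time

def active_loop(hostname, report_status, running, interval=1.0):
    """Computation and status reporting of an active device.

    `report_status` is a hypothetical callback delivering the
    (hostname, state, timestamp) triple to the standby device(s);
    `running` is the thread's run flag (a threading.Event).
    """
    state = 0
    while running.is_set():
        state += 1                                  # simple incremental addition
        report_status(hostname, state, time.time())
        time.sleep(interval)                        # assumed reporting interval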
Python API Implementation: We use the Python-based API exposed by Mininet to implement different variations of the control flow in an edge cluster and simulate different scenarios for live migration. Since Mininet does not create each host as a separate process with its own resource constraints, custom Python code is developed to allot a thread to each host for simulating various actions, such as performing a computation, announcing its live status, reporting a failure, etc. There is a master thread that monitors the whole cluster, active threads for active devices, standby threads for standby devices, and a printer thread that prints the current status of each device. A global Python dict is maintained, containing the status, action, and associated devices for each device in the cluster, along with queues of alive, dead, and free devices for easier allocation and access.
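This bookkeeping might be organized roughly as follows (the field names, host names, and the use of queue.Queue are assumptions based on the description above):

import queue

# One entry per host: its status, its current action, and the devices
# associated with it (standbys for an active device, and vice versa).
devices = {
    "h1": {"status": "active",  "action": "compute", "associated": ["h2"]},
    "h2": {"status": "standby", "action": "monitor", "associated": ["h1"]},
    "h3": {"status": "free",    "action": "idle",    "associated": []},
}

# Queues of alive, dead, and free devices for easier allocation and access.
alive_q, dead_q, free_q = queue.Queue(), queue.Queue(), queue.Queue()
alive_q.put("h1")
free_q.put("h3")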
We also implemented various functions of the master device in the simulation code. The master device can detect a failure from the reports it receives, select the failed active device, mark it as dead, select the first standby device that reported the failure, remove it from the standby list, and mark it as the new active device. The master thread could not wait for the active threads to stop, since it was not their parent thread. The changes in host data are printed on a separate terminal for easier debugging.
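A condensed sketch of this failure handling on the master thread could look as follows (the handle_failure name, the flags mapping of per-device run flags, and the devices/queue structures from the sketch above are assumptions):

def handle_failure(failed, reporter, devices, flags, dead_q):
    """Failover handling on the master thread.

    `flags` maps a hostname to the threading.Event controlling its
    threads; since the master is not their parent, it only clears the
    flag and lets each thread exit on its own instead of joining it.
    """
    devices[failed]["status"] = "dead"       # mark the failed device as dead
    dead_q.put(failed)
    flags[failed].clear()                    # stop its threads

    flags[reporter].clear()                  # stop its old monitoring thread
    devices[reporter]["status"] = "active"   # promote the first reporter
    devices[reporter]["action"] = "compute"
    # Reassignment of standby devices to all active devices follows here.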
We used threads instead of processes, since processes do not share data among themselves but work on their own separate copies. Each thread now had its own flag, which indicated whether the thread was running and could also be used to stop it. The master thread was now able to handle multiple device failures. A separate thread was created to continually print the status of all hosts; it wrote the status to a file, and a bash script displayed the latest updated lines.
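A minimal example of such a run flag, using threading.Event as the flag (names are illustrative):

import threading
import time

def worker(running):
    while running.is_set():          # the flag indicates the thread is running
        time.sleep(0.5)              # placeholder for the thread's actual work

running = threading.Event()
running.set()
t = threading.Thread(target=worker, args=(running,))
t.start()
running.clear()                      # clearing the flag stops the thread
t.join()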
6.2 Evaluation Methodology
The active devices, along with their hostname, IP, and current state, also reported the current timestamp to their respective standby devices. Every communication between devices had a timestamp attached to it, which was logged locally on the devices. These timestamps were used to calculate the time between successive communications from an active device to its corresponding standby device(s), which in turn was used to set the timeout on the standby device(s). The difference between the timestamp at which a failure of an active device was reported and the timestamp at which a standby resumed the computation from the last state received was used to calculate the downtime during a random device failure. Since the timestamps are recorded at the granularity of each communication, the duration of operations, such as how long the master device takes to find a free device and allot it as the new standby, or how long the standby device takes to resume operation, could also be obtained and observed for further improvement and optimization.
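For instance, the downtime for one failure could be computed from the logged timestamps roughly as follows (the log format and event names are assumed for illustration, not taken from the actual logs):

def downtime(log_lines, hostname):
    """Downtime of one failed device, from its logged timestamps.

    Hypothetical log format: "event,hostname,unix_timestamp" lines,
    where `failure_reported` marks the report to the master and
    `resumed` marks the standby resuming from the last state.
    """
    reported = resumed = None
    for line in log_lines:
        event, host, ts = line.strip().split(",")
        if host != hostname:
            continue
        if event == "failure_reported" and reported is None:
            reported = float(ts)
        elif event == "resumed" and reported is not None:
            resumed = float(ts)
            break
    return resumed - reported if resumed is not None else None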
6.3 Simulation Results
We have implemented two variants of the device al-
location algorithm (refer to Algorithm 3), and have
done a comparative evaluation of them. One variant
exactly matches Algorithm 3, while the other one was
assigned a dedicated standby device to each active de-
vice. In the latter variant, each standby device regu-
larly checks with its corresponding active device to
determine if it is alive. However, this approach has
the following pitfall. The standby devices transmit
broadcast messages to probe the active devices, which
results in overhead exceeding 7 seconds in case of
loss of response. The above overhead comprises the
timeout interval for the broadcast and the delay in de-
tection of outage of an active device from a respec-
tive standby device. On the other hand, the overhead
with the former approach is the interval between the
exchange of broadcast messages from active devices
notifying consecutive changes in the status.