execution and comparison. According to the
granularity of the replicated objects, transient
error detection techniques can be divided into
instruction-level, thread-level and application-level
fault tolerance.
EDDI (Oh et al., 2002) and SWIFT (Reis et al.,
2005) are typical representatives of instruction-level
fault tolerance: they duplicate the instructions of the
original program at compile time and insert
comparison instructions at appropriate locations to
detect errors. Thread-level fault tolerance methods,
such as AR-SMT (Rotenberg, 1999), SRT (Reinhardt
et al., 2000) and CRT (Mukherjee et al., 2002),
use two or more hardware threads or cores to
execute the same task; a dedicated buffer added to
the processor stores the execution results of the
redundant threads, so that errors can be detected by
comparing them. Application-level fault tolerance
methods, such as PLR (Shye et al., 2009), perform
replication and comparison at a higher software level,
for example by copying a process into multiple
redundant processes for concurrent execution and
then comparing the program outputs.
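To make the replicate-and-compare idea concrete, the following sketch applies it at a coarse granularity in Python: a task is executed redundantly and the outputs are compared, with any mismatch flagged as a transient error. The function names are illustrative only; real systems replicate at the instruction, thread, or process level rather than re-calling a function as done here.

```python
def replicated_run(task, arg, copies=2):
    """Execute `task` redundantly and compare the outputs.

    Duplicate-and-compare in miniature: if the redundant executions
    disagree, a transient error is assumed to have corrupted one of
    them.  (Illustrative sketch -- real systems replicate instructions,
    threads, or processes, not Python calls.)
    """
    results = [task(arg) for _ in range(copies)]
    if len(set(results)) != 1:
        raise RuntimeError("output mismatch: transient error detected")
    return results[0]
```

Note that pure duplication can only detect an error, not correct it: with two copies there is no way to tell which copy is wrong, which is what motivates the triple redundancy discussed below for recovery.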
Permanent error detection techniques can be
divided into two categories. One is hardware
module fault detection at the micro-architecture
level, which is often used in the design of
reconfigurable processors. The other is the detection
of node faults in high performance systems; as
system scale grows, the mean time between failures
(MTBF) decreases sharply, so node faults are
common.
However, repeated execution may cost too much
time, so it is rarely adopted by high performance
computing applications. High performance
computing systems instead mainly screen for node
errors with dedicated screening programs, filtering
out faulty nodes in advance; but a screening program
cannot cover all workload situations, and errors that
arise during application execution cannot be found
this way.
Error recovery techniques can be divided into
two categories: forward error recovery and backward
error recovery.
Forward error recovery tries to correct the error
after it is detected and to continue executing forward
without rolling back to the state before the error
occurred. Redundancy is the basic way to realize
forward error recovery. Triple Modular Redundancy
(TMR) is a widely used FER technique: three
modules perform the same operation, and a majority
voter at the output selects the result, masking a
single faulty module. However, this method requires
three times the computing resources, and the
overhead is so large that it is generally not used in
high performance systems.
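A minimal sketch of TMR's majority voting, assuming the three redundant modules return comparable values (the voter below is illustrative, not any particular hardware design):

```python
from collections import Counter

def tmr_vote(outputs):
    """Majority vote over the outputs of three redundant modules.

    One faulty module is outvoted by the two correct copies, so the
    error is masked and execution continues forward without rollback.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one module failed")
    return value
```

A single corrupted output, e.g. `tmr_vote([42, 41, 42])`, is masked and the voter returns 42; the price is that every operation runs three times, which is the 3x resource overhead noted above.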
Backward error recovery returns the system to a
state saved before the error occurred once an error
is detected. The most widely used backward error
recovery method is checkpointing. According to the
content stored in a checkpoint, checkpoint techniques
can be divided into system-level and application-level
checkpointing (Bronevetsky et al., 2004;
Faisal et al., 2018). According to the storage
medium, they can be divided into disk-based
and diskless checkpointing (Chen, 2010;
Alshboul et al., 2019).
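The following is a minimal sketch of application-level, disk-based checkpointing, assuming the program state can be serialized with `pickle`; the helper names and checkpoint layout are illustrative only:

```python
import os
import pickle

def save_checkpoint(state, path):
    # Write atomically: dump to a temporary file, then rename it over
    # the old checkpoint, so a crash mid-write cannot corrupt it.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    # Backward recovery: if a checkpoint exists, resume from it.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"i": 0, "acc": 0}  # fresh start, nothing saved yet

def run(n, path, interval=10):
    state = load_checkpoint(path)
    for i in range(state["i"], n):
        state["acc"] += i
        state["i"] = i + 1
        if state["i"] % interval == 0:  # periodic checkpoint
            save_checkpoint(state, path)
    return state["acc"]
```

If the process is killed and restarted, `run` resumes from the last saved iteration instead of from zero, trading periodic checkpoint I/O for reduced recomputation after a failure.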
Usually, error detection and recovery techniques
are combined to ensure the correctness of
applications. A task-based parallel programming
model is proposed in (Wang et al., 2016), in which a
work-stealing scheduling scheme supporting fault
tolerance is adopted to achieve dynamic load
balancing with fault tolerance.
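As a toy illustration of fault-tolerant work stealing (a simplified sketch, not the scheme of Wang et al., 2016), the code below simulates workers that pop tasks from their own deque, steal from a victim's deque when idle, and re-enqueue any task whose execution fails:

```python
import random
from collections import deque

def work_stealing_run(task_lists, run_task, seed=0):
    """Sequentially simulated work stealing with retry on failure.

    Each worker owns a deque: it pops tasks from its own end (LIFO)
    and, when empty, steals from the opposite end (FIFO) of a random
    non-empty victim.  A task that raises is pushed back and retried,
    a toy form of fault tolerance (a task that always fails would
    loop forever; real schedulers bound the retries).
    """
    rng = random.Random(seed)
    deques = [deque(ts) for ts in task_lists]
    results = []
    while any(deques):
        for dq in deques:
            if dq:
                task = dq.pop()                       # own work: LIFO
            else:
                victims = [d for d in deques if d]
                if not victims:
                    break
                task = rng.choice(victims).popleft()  # steal: FIFO
            try:
                results.append(run_task(task))
            except Exception:
                dq.append(task)                       # retry later
    return results
```

Stealing from the opposite end of the victim's deque is the conventional design choice: it reduces contention with the victim and tends to move the largest remaining units of work.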
2.2 Parallel Application Model and
Task Scheduling
Most parallel applications can be divided into two
categories: data parallelism and task parallelism. Task
parallel applications usually decompose the task into
many sub-tasks, divide the data set, and execute the
tasks and corresponding data in parallel on different
computing resources. Task parallel applications are
widely used in drug screening, genetic research,
cryptanalysis, nuclear simulation and other fields.
There is no dependence between subtasks, but their
computational costs may vary significantly. In
large-scale environments, an efficient load balancing
mechanism is therefore the key to ensuring
application performance, and the result of each
subtask contributes to the overall result of the
application.
Corresponding to task parallel applications, task
division can be static or dynamic (Mohit et al.,
2019). In static division, each computing node is
statically assigned the same number of tasks, which
it executes independently. Dynamic partitioning
adjusts the tasks assigned to computing resources
according to their load, including dynamic
scheduling with management nodes and task
stealing (Dinan et al., 2009). In high performance
computing, dynamic task partitioning is generally
used so that applications can more fully utilize
computing resources (He et al., 2016).
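A minimal sketch of manager-style dynamic scheduling over a shared task queue, assuming independent subtasks as described above (the names are illustrative): worker threads pull the next task as soon as they finish the previous one, so faster workers naturally process more tasks and the load balances even when task costs vary widely.

```python
import queue
import threading

def dynamic_schedule(tasks, run_task, n_workers=4):
    """Workers pull tasks from a shared queue until it is drained.

    Unlike static division (each worker pre-assigned len(tasks)/n
    tasks), assignment happens at run time, so an expensive task on
    one worker does not leave the others idle.
    """
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                t = q.get_nowait()  # pull next task; exit when drained
            except queue.Empty:
                return
            r = run_task(t)
            with lock:              # results list is shared
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

The results arrive in completion order rather than submission order, which is acceptable here because the subtasks are independent.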