EFFICIENT IMPLEMENTATION OF FAULT-TOLERANT DATA

STRUCTURES IN PC-BASED CONTROL SOFTWARE

Michael Short

Embedded Systems Laboratory, University of Leicester,University Road, Leicester,UK

Keywords: Open architecture controllers, software fault tolerance, critical systems.

Abstract: Recent years have seen an increased interest in the use of open-architecture, PC-based controllers for

robotic and mechatronic systems. Although such systems give increased flexibility and performance at low

unit cost, the use of commercial processors and memory devices can be problematic from a safety

perspective as they lack many of the built-in integrity testing features that are typical of more specialised

equipment. Previous research has shown that the rate of undetected corruptions in industrial PC memory

devices is large enough to be of concern in systems where the correct functioning of equipment is vital. In

this paper the mechanisms that may lead to such corruptions and the level of risk is examined. A simple,

portable and highly effective software library is also presented in this paper that can reduce the impact of

such memory errors. The effectiveness of the library is verified in a small example.

1 INTRODUCTION

Recent years have seen much interest in the use of

open-architecture controllers for robotic and

mechatronic systems. Such systems typically consist

of a combination of commercial off the shelf

(COTS) PC equipment alongside motion control,

network interface, and sensor/actuator equipment –

e.g. (Hong et al. 2001; Short 2003; Lee & Mavroidis

2000; Schofield & Wright, 1998). Such architectures

have been used to successfully implement novel

control algorithms in a number of research

installations (e.g. fuzzy force control (Burn et al.

2003) and H

∞

vibration control (Lee & Mavroidis

2000)); they are also being used in increasing

numbers in industrial applications (e.g. KUKA®,

STAUBLI®, RWT® systems). The flexibility of

such platforms primarily arises by giving engineers

the ability to develop and/or modify the control and

interfacing hardware and software, which is

typically developed in C++.

However, despite increased flexibility and

performance (along with marked unit cost

reductions), such COTS equipment lacks many of

the built-in integrity testing elements which are often

employed in more proprietary, specialised control

equipment. Many robotic and mechatronic systems,

by virtue of their design, are somewhat critical in

nature.

A study of available data by Dhillon and

Fashandi (1997) concluded that robot-related

accidents are primarily caused by unexpected or

unplanned motions of the manipulator; a

contributory cause of which was malfunctions of the

robot control system. Unexpected motions of a

manipulator or tooling may result in damage (or

complete destruction) of the system itself and any

(potentially expensive) equipment in the systems’

workspace. Additionally, when considering

applications such as surgical robotics, any

unexpected or unplanned motion could also result in

serious injury or death.

When such COTS equipment is used in

situations where their correct functioning is vital,

special care must be therefore taken to ensure that

the system is both reliable and safe (Storey 1996;

Levenson 1995). When considering the equipment

employed in a typical robot control system, attention

must be paid to potential permanent, transient or

intermittent failures of the hardware and software.

The need for fault-tolerant techniques is dependant

on the potential risk, which is primarily dependant

on the application and environment the system is

employed in.

Much research has concentrated on providing

hardware fault tolerance for such systems (e.g. see

Storey 1996). Recent years have also seen the

development of several software-based approaches

to implementing transient fault detection on COTS

214

Short M. (2007).

EFFICIENT IMPLEMENTATION OF FAULT-TOLERANT DATA STRUCTURES IN PC-BASED CONTROL SOFTWARE.

In Proceedings of the Fourth International Conference on Informatics in Control, Automation and Robotics, pages 214-219

DOI: 10.5220/0001618402140219

 SciTePress

processors. They are based around instruction

counting, instruction/task duplication and control

flow checking (e.g. Rajabzadeh & Miremadi 2006;

Rebaudengo et al. 2002; Oh et al. 2000).

Although such techniques are effective at

detecting many control flow errors, systems which

incorporate them may still be vulnerable to transient

errors in data memory (which may not result in

control-flow errors). This paper is particularly

concerned with the mitigation of transient errors in

COTS memory devices used in open-architecture

controllers. In section 2 of the paper, we will

consider the mechanisms that may lead to memory

corruption, the resulting effects, and the level of risk.

In section 3, a simple yet highly effective

software library that reduces the impact of memory

corruptions and overcomes these implementation

difficulties is presented and described in detail. In

section 4 we apply this library to a simple test

program; a 6 x 6 matrix multiplication program.

Fault injection results are described for both the un-

hardened and hardened programs. Section 5

concludes the paper.

2 MEMORY ERRORS

2.1 Mechanisms

Corruption of data in memory devices can come

from a variety of sources. Single event effects -

(SEE’s) - caused by particle strikes, may manifest

themselves in a variety of ways. They may cause

transient disturbances known as single event upsets

(SEU’s), manifested as random bit-flips in memory.

They may also cause permanent stuck-at faults over

an array of memory, caused by damage to the

read/write circuitry or chip latchup.

In addition, memory devices may also fail due to

normal electrical and thermal breakdown effects.

Such electrical or thermal failures and disturbances

in memory devices may be highly unpredictable,

manifesting themselves as complete device failures

or stuck-at faults over part (or all) of the memory

array.

Memory devices are also susceptible to

electromagnetic interference (EMI) from a variety of

sources. For example in an industrial robot workcell,

numerous devices such as electromechanical relays,

motor drives and welding equipment are all sources

of noise that are capable of corrupting many

electronic circuits (Ong & Pont 2002). Other

mechanisms that may lead to memory upsets include

power supply fluctuations and radio frequency

interference (RFI).

2.2 Level Of Risk

Failure rates for SEU’s in ground-based installations

are in the region of 10

-10

- 10

-12

failures per bit per

hour (Normand 1996). Failure rates for individual

devices due to electrical effects may be calculated

using a methodology such as (MIL 1991); they are

typically in the region of 10

-6

failures per device per

hour. Predicting the effects of EMI, RFI and power

supply disturbances are extremely difficult and

highly dependant on the operating environment and

the hardware mitigation techniques that are

employed (e.g. signal shielding).

From a practical perspective, experimental

studies have demonstrated that on COTS memory

devices with built-in integrity checks (such as parity

and error correction codes) the problem of

undetected memory corruption is large enough to be

of concern for some critical systems. Additionally,

much PC hardware does not even support such

integrity checks (Messer et al. 2001).

For example, a 4MB DRAM memory chip is

likely to encounter 6000 undetected memory failures

in 10

hours of operation (Messer et al. 2001). If a

control system PC employs several such devices,

with a total of 512 MB memory, this translates to an

undetected memory corruption approximately every

55 days of operation.

2.3 Activation Effects

Obviously, not all memory errors will become

activated. However, robotic control systems

typically involve extremely data-intensive

processing with hard real-time constraints.

Techniques such as co-ordinate transforms,

kinematics, resolved-motion rate control, path

planning and force control all typically require

hundreds (perhaps even thousands) of matrix

manipulations and feedback control calculations

every second (e.g. Fu et al. 1997). Considering that a

simple 6 x 6 matrix multiplication and storing of the

result typically involves 864 memory read/write

operations, it can be argued that the probability of

activating an error in such control software is

relatively high when compared to (for example) a

word processing application.

If a memory error does become activated, this

can lead to a variety of unpredictable faults (Ong &

Pont 2002). For example, they may cause an

incorrect value to be output to a port or peripheral;

EFFICIENT IMPLEMENTATION OF FAULT-TOLERANT DATA STRUCTURES IN PC-BASED CONTROL

SOFTWARE

215

or they may cause a further area of memory to be

corrupted by indexing an array out of its normal

bounds. In an open-architecture controller, all of

these faults can potentially escalate to full system

failures, and cause unpredictable motions of the

manipulator or tooling.

2.4 Mitigation Techniques

In order to address this vulnerability, some

researchers have investigated the use of Single-

Program Multiple Data (SPMD) techniques for data

redundancy in both single and multi processor

systems (e.g. Redaudengo et al. 2002; Gong et al.

1997). However, such approaches can be

problematic from the point of view of the control

system developer. Primarily because when the

techniques are actually applied, the complexity of

the resulting source code can increase dramatically,

and the basic meaning of the code can become

obscured. This may have an impact on code

development, testing and subsequent code

maintenance. To illustrate this point, consider the

segment of C code shown in Figure 1.

01: #define N (10)

02: int i;

03: int a[N],b[N];

04: for(i=0;i<N;i++)

05: {

06: b[i]=a[i];

07: }

Figure 1: Un hardened code.

For most programmers, this is “self

documenting” code, and the meaning is clear (the

programmer wishes to copy the contents of any

array of ten integers to another array of the same

size). Now, consider the same code, hardened using

the technique suggested by Redaudengo et al.

(2002). This is shown in Figure 2 (note the required

checksum initialization code and the XOR macro

CHK have been omitted for space reasons). The total

code segment, including this initialization (which

must be called before each operation), and the CHK

macro, is in excess of 36 lines in length; the meaning

of the code is also somewhat obscured. In addition,

the variable i in Figure 2 remains un-hardened. If the

variable i were to be hardened, the meaning of the

code would become further obscured, with the

check-and-correct code for i embedded within the

for loop construct; as more nested variables are

hardened, the problem can soon become difficult to

manage. This can be particularly troublesome when

writing matrix manipulation code which can often

require many levels of nesting.

01: #define N (10)

02: int i;

03: int a0[N],b0[N];

04 int b1[N],b1[N];

05 int c0,c1;

06: for (i=0;i<N;i++)

07: {

08: c0=c0^b0;

09: c1=c1^b1;

10: b0[i]=a0[i];

11: b1[i]=a1[i];

12: c0=c0^b0;

13: c1=c1^b1;

14: if(a0[i]!=a1[i])

15: {

16: if(CHK(a0,b0)==C0)

17: {

18: a1[i]=a0[i];

19: c1=c0;

20: }

21: else

22: {

23: a0[i]=a1[i];

24: c0=c1;

25: }

26: }

27: }

Figure 2: Hardened code.

Although this problem may be overcome by the

use of automatic code generators, this adds an extra

level of complexity and abstraction to the software

development process, and adds a real possibility of

introducing systematic errors into the design

process.

In the following section, a software-based

methodology will be proposed to simplify the

implementation of data redundancy. This technique

is an implementation of an SPMD-like architecture

to provide fault tolerance to transient errors in data

memory. This approach directly addresses the

problems of code complexity and compatibility with

other software-based approaches. It is primarily

suited to C++, but can easily be ported to other

object-oriented languages (e.g. JAVA).

3 THE NEW APPROACH

3.1 Requirements

The requirement for this software library was to

provide a portable and highly flexible set of new

data types for use with C++ programs. The new data

types should encapsulate a Triple Modular

Redundant (TMR) approach which is completely

ICINCO 2007 - International Conference on Informatics in Control, Automation and Robotics

216

hidden from the programmer. The data types, to all

intents and purposes, appear to the programmer as

their basic simplex counterparts; and all the new

data types can also be used interchangeably with

their simplex counterparts. The library should be as

non-intrusive as possible and not require the use of

an automatic code generator for its implementation.

Every write operation to the new types invokes a

write to three duplicated variables of the

corresponding type, and each read operation invokes

a two-from-three vote on the duplicated variables.

This concept is shown for a generic data type in

Figure 3.

We assume that if the data is so corrupted that a

two-from-three vote cannot be achieved, then a user-

defined error handler is called. This required

functionality of this error hander is highly

application dependant; it could, for example, freeze

the mechanical system and execute a software based

self test (SBST) algorithm to verify that no

permanent hardware failures have occurred in the

CPU peripherals or RAM. Hamdioui et al. (2002)

and Sosnowski (2006) have proposed efficient SBST

algorithms to achieve this.

Figure 3: Generic TMR data concept.

3.2 C++ Implementation

In order to create a generic and flexible

implementation, the required TMR behaviour was

defined in a generic C++ class template named

TMR_datatype. The prototype of the class template

is shown in Figure 4. This class template for a given

data type T can then be applied to any of the basic

in-built C++ data types by means of suitable #define

statements, also shown in Figure 4.

01: template <class T>

02: class TMR_datatype

03: {

04: public:

05: inline TMR_datatype(const T);

06: inline TMR_datatype(void);

07: inline T operator =(const T);

08: inline operator T();

09: inline T operator+=(const T);

10: inline T operator-=(const T);

11: inline T operator*=(const T);

12: inline T operator/=(const T);

13: inline T operator++(int);

14: inline T operator++(void);

15: inline T operator--(int);

16: inline T operator--(void);

17: inline T operator &=(const T);

18: inline T operator |=(const T);

19: inline T operator ^=(const T);

20: private:

21: T Primary_Copy;

22: T Secondary_Copy;

23: T Tertiary_Copy;

24: };

25: #define TMR_int TMR_datatype

<int>

26: #define TMR_float TMR_datatype

<float>

Figure 4: TMR_Datatype class template.

From the class template, it can be seen that each

derived object of the template contains three private

data declarations, Primary_Data, Secondary_Data

and Tertiary_Data, corresponding to the simplex

data type T. The required read and write operations

on this data are then achieved by defining new

operator member functions using the operator

keyword. It can be seen that all of these operator

functions are expanded inline by the compiler, with

the use of the inline keyword; this is to reduce any

overheads associated with the call of a member

function.

By way of example, the member functions for

both the assignment and reference operations on the

class template are shown in Figure 5. Note that the

use of the explicit reference operator is used in the

implementation; thus only the operator functions

that explicitly modify the data contents (such as =,

++, --, +=, and so on) needed to be overloaded; this

creates a very efficient and portable implementation.

01: inline T TMR_data::operator

=(const T Value)

02: {

03: Primary_Copy=Value;

04: Secondary_Copy=Value;

05: Tertiary_Copy=Value;

06: return(Value);

07: }

08: inline TMR_data::operator T()

09: {

10: if(Primary_Copy==Secondary_Copy)

EFFICIENT IMPLEMENTATION OF FAULT-TOLERANT DATA STRUCTURES IN PC-BASED CONTROL

SOFTWARE

217

11: {

12: Return(Primary_Copy);

13: }

14: else

if(Primary_Copy==Tertiary_Copy)

15: {

16: Return(Primary_Copy);

17: }

18: else

if(Secondary_Copy==Tertiary_Copy)

19: {

20: Return(Secondary_Copy);

21: }

22: else

23: {

24: Error();

25: }

26: }

Figure 5: Assignment and reference member functions.

From Figure 5, it can be seen that the TMR

behaviour has been captured by the template; when

an assignment (write) operator is encountered, the

value is written to the three copies of the data. When

the reference (read) operator is encountered, a two-

from-three vote is employed and the data returned. If

no vote is possible, the user defined function Error

is called.

3.3 Hardening Procedure

In Figure 6, the code library described in this section

is applied to the code example shown in Figure 1.

From Figure 6 it can be seen that the length of the

hardened source code is identical to the original and

is also highly readable. Additionally, it is noted that

– unlike the code shown in Figure 2 - the variable i

is also hardened in this case.

01: #define N (10)

02: TMR_int i;

03: TMR_int a[N],b[N];

04: for (i=0;i<N;i++)

05: {

06: b[i]=a[i];

07: }

Figure 6: Hardened code.

From these descriptions it can be seen that this

library does not require the use of automatic code

generators for its implementation: all that is required

is for the programmer to have a basic understanding

of the new data types. The hardening procedure can

be accomplished extremely rapidly; all that is

required is the inclusion of the new TMR template

into a project, and altering the variable declarations

that require hardening to their redundant

counterparts.

4 EXPERIMENTAL RESULTS

To assess the effectiveness of the proposed code

library, a fault-injection study was performed on a

Intel® Pentium 4-based PC, with a CPU speed of

2.6 GHz and 512 MB RAM, running the Windows

NT® operating system. A simple (yet

representative) application program was created to

perform a 6x6 floating point matrix multiplication.

During each experiment, transient faults were

injected into the program data area at random times,

performing random single bit-flips in the used data

areas. The fault injection was performed using a

secondary application running on the PC. In the

program, the source matrices are first initialized with

known constant values. The matrix multiplication is

then performed. The values contained in the result

matrix are then compared with known constant

results. The process then repeats endlessly. Any

failures or corrected errors are logged by the

application program.

Two different implementations of the program

were considered; the normal (simplex) case, and the

hardened TMR version. To asses the impact of

applying the library on execution time, we also

measured the iteration time for each loop of each

program using the Pentium performance counter.

Table 1 shows the recorded results. In the hardened

program, the number of faults injected was increased

to reflect the increased size of the program data

areas. Fault effects were classified into one of three

categories, as follows:

• Effect-less: the fault does not result in a

computation failure.

• Corrected: the fault is detected and has been

corrected.

• Failure: the fault is not detected or corrected and

results in an invalid computation output.

From these results, it can be seen that for both

cases, the error activation level was approximately

77%. Application of the TMR data structures

increased the execution time of the multiplication

task by a factor of 3.2; this is to be expected as we

have introduced instruction duplication and voting.

Additionally it should be noted that each hardened

variable increases the overall memory usage due to

its triplicate implementation. The increase in overall

program code size was 7.1% in this case.

ICINCO 2007 - International Conference on Informatics in Control, Automation and Robotics

218

Table 1: Fault injection results for each program.

Normal Hardened

Injected 10000 30000

No Effect 2240 6690

Failures 7760 0

Corrected 0 23010

Calc. Time (us) 4.72 15.27

We can also see from the results that of the 77%

of activated faults, 100% of these caused

computation failures in the normal program case.

The hardened case however, detected and corrected

100% of the activated faults.

5 CONCLUSIONS

In this paper, the mechanisms that can lead to

memory corruption in COTS PC control devices

have been considered. A novel approach to software

implemented fault-tolerance has been presented. The

approach, based on an SPMD architecture, can be

used to compliment existing error detection and

SBST techniques for COTS processors used in open

architecture controllers. The approach relies on data

and instruction duplication. It has been shown that

the method is easily applied, results in readable

code, and is able to tolerate 100% of the injected

faults in the benchmark described. Whilst the

application of the techniques provides high levels of

data fault tolerance, there is obviously a trade-off

with increases in the code and data size and task

execution time. Prospective designers must

obviously take these factors into account when

considering the techniques.

ACKNOWLEDGEMENTS

The work described in this paper was supported by

the Leverhulme Trust (Grant F/00 212/D).

REFERENCES

Burn, K., Short, M., Bicker, R., 2003. Adaptive And

Nonlinear Force Control Techniques Applied to

Robots Operating in Uncertain Environments. Journal

of Robotic Systems, Vol. 20, No. 7, pp. 391-400.

Dhillon, B.S., Fashandi, A.R.M., 1997. Safety and

reliability assessment techniques in robotics. Robotica,

Vol. 15, pp. 701-708.

Fu, K.S., Gonzales, R.C., Lee, C.S.G., 1987. Robotics:

Control, Sensing, Vision And Intelligence. McGraw-

Hill International Editions.

Gong, C., Melhem, R., Gupta, R., 1997. On-line error

detection through data duplication in distributed

memory systems. Microprocessors and Microsystems,

Vol. 21, pp. 197-209.

Hamdioui, S., van der Goor, A., Rogers, M., 2002. March

SS: A Test for All Static Simple RAM Faults. In Proc.

Of the 2002 IEEE Intl. Workshop on Memory Tech.,

Design and Testing.

Hong, K.S., Choi, K.H., Kim, J.G., Lee, S., 2001. A PC-

based open robot control system: PC-ORC. Robotics

and ComputerIntegrated Manufacturing, Vol. 17, pp.

355-365.

Lee, C.J., Mavroidis, C., 2000. WinRec V.1: Real-Time

Control Software for Windows NT and its

Applications. In Proc. American Control Conf.,

Chicago, Il., pp. 651-655.

Levenson, N.G., 1995. Safeware: System Safety and

Computers, Reading, M.A., Addison-Wesley.

Messer, A., Bernadat, P., Fu, G., Chen, G., Dimitrijevic,

Z., Lie, D., Mannaru, D.D, Riska, A., Milojicic, D.,

2001. Susceptibility of Modern Systems and Software

to Soft Errors, In Proc. Int. Conf. on Dependable Sys.

And Networks, Goteburg, Sweden.

MIL-HDBK-217F, 1991. Military Handbook of Reliability

Prediction of Electronic Equipment. December 1991.

Normand, E., 1996. Single Event Effects in Avionics,

IEEE Trans. on Nuclear Science, Vol. 43, No. 2.

Oh, N., Shivani, P.P., McCluskey, E.J., 2001. Control

Flow Checking by Software Signature. IEEE Trans.

On Reliability, September 2001.

Ong, H.L.R, Pont, M.J., 2002. The impact of instruction

pointer corruption on program flow: a computational

modelling study. Microprocessors and Microsystems,

25: 409-419.

Rajabzadeh, A., Miremadi, S.G., 2006. Transient detection

in COTS processors using software approach,

Microelectronics Reliability, Vol. 46, pp. 124-133.

Rebaudengo, M., Sonza Reorda, M., Violante, M., 2002.

A new approach to software-implemented fault

tolerance. In Proc. IEEE Latin American Test

Workshop, 2002.

Schofield, S., Wright, P., 1998. Open Architecture

Controllers for Machine Tools, Part 1: Design

Principles. Trans. ASME Journ. of Manufacturing Sci.

& Engineer, Vol. 120, Pt. 2, pp. 417-424.

Short, M., 2003. A Generic Controller Architecture for

Advanced and Intelligent Robots. PhD. Thesis,

University of Sunderland, UK.

Sosnowski, J., 2006. Software-based self-testing of

microprocessors. Journal of Systems Architecture,

Vol. 52, pp. 257-271.

Storey, N., 1996. Safety Critical Computer Systems.

Addison Wesley Publishing.

EFFICIENT IMPLEMENTATION OF FAULT-TOLERANT DATA STRUCTURES IN PC-BASED CONTROL

SOFTWARE

219