NEW DESIGN TECHNIQUES FOR ENHANCING FAULT
TOLERANT COTS SOFTWARE WRAPPERS
Luping Chen and John May
Safety Systems Research Centre, University of Bristol, Bristol, BS8 1TR, UK
Keywords: Safety critical system, COTS software, fault tolerance, wrapper.
Abstract: Component-based systems can be built by assembling components developed independently of the systems.
Middleware code that connects the components is usually needed to assemble them into a system. The
ordinary role of the middleware is simple glue code, but there is an opportunity to design it as a safety
wrapper to control the integration of the components to help assure system dependability. This paper
investigates some architectural designs for the safety wrappers using a nuclear protection system example. It
integrates new fault-tolerant techniques based on diagnostic assertions and diverse redundancy into the
middleware designs. This is an attractive option where complete trust in component reliability is impossible
or costly to achieve.
1 INTRODUCTION
Commercial-off-the-shelf (COTS) software based
components are increasingly being included within
complex safety critical systems (
Profeta et al 1996).
Therefore it is vital both to distinguish adequate
software components from inadequate ones as well
as to determine the effect on system dependability of
replacing previous systems with COTS software
based components. Yet the know-how to construct
dependable safety critical applications from
dependable COTS software based components
remains the ‘holy grail’ within the area of
component-based software engineering (Crnkovic
and Larsson 2002).
The component-based software development
process is based on the reuse and integration of high-
level software components and bespoke components
to form a new system (
Brown and Wallnau 1998). A
key issue for such hybrid systems is to show that the
use of the software components (which will be
considered as 'black boxes') does not compromise
the safety, reliability and (perhaps) security of the
overall system, since the reliability of software
components cannot be fully assured prior to
integration (
Voas 1998). Furthermore, even if such
pre-assurance was a theoretical possibility, it would
seldom be available, since the software components
are commonly developed to unknown standards or
standards aimed at general use, which are
insufficient for safety applications (
Profeta et al 1996,
Lindsay et al 2000)
. These difficulties are sometimes
compounded by the inaccessibility of some COTS
code. Where examination of code is not permitted,
traditional assurance techniques (with the exception
of black box testing) cannot be applied to COTS
components post-purchase, to supplement the
supplier’s verification and validation (V&V)
activities. In general it is necessary to use
'middleware', possibly based on standard
infrastructure technologies to integrate the inevitably
disparate components (May 2002). This middleware
can also be used to play the role of a safety wrapper,
and offers an important opportunity to include
component adaptation and monitoring strategies, to
help ensure the overall system's dependability (Shin
and Paniagua 2006). One significant use of these
wrappers is to enable the replacement one COTS
software component of a safety critical system by
another without significantly modifying the wrapper
itself: for example, an upgraded component. The
wrapper in this case ensures correct behavior over
key safety aspects of the components functionality.
This paper develops some design techniques for
enhancing fault tolerant COTS software wrappers.
The underlying concept of software fault tolerance
assumes that any system has unavoidable and
undetectable software faults no matter how
thoroughly the software has been debugged,
277
Chen L. and May J. (2007).
NEW DESIGN TECHNIQUES FOR ENHANCING FAULT TOLERANT COTS SOFTWARE WRAPPERS.
In Proceedings of the Second International Conference on Software and Data Technologies - SE, pages 277-282
DOI: 10.5220/0001340202770282
Copyright
c
SciTePress
modularised, verified and tested. Hence
programming strategies to prevent or recover from
software failures must be included within a complex
safety critical system such that it can provide service
even in the presence of software faults. Current
programming strategies are classified as N-Version
Programming, Recovery Block and Self-Checking
Version Schemes. Some new methods have been
developed for improving software fault tolerance
based on diverse redundancy and diagnostic
assertions (
Napier 2001, Chen et al 2002), and these
designs can be assessed by fault injection
techniques. An approach to assessment using
Perturbation of Interface Parameters (PIP) of COTS
components has been developed to simulate a range
of internal COTS component faults (Chen et al
2004).
Based on the example of using smart sensors as
COTS components within a plant protection system,
this paper considers new diversity and diagnosis
strategies for safety wrapper design in COTS-based
systems, together with methods for assessing their
effectiveness.
2 FAULT-TOLERANCE DESIGNS
2.1 Fault Tolerance by Assertion
The use of assertion/diagnostics is often based on a
rather restricted view of failures. Traditional
approaches to fault detection in software often focus
on specific anticipated problems, usually those that
halt the program execution. However this view is far
from general because anticipated problems are a
small subset of all possible problems. The more
subtle and interesting failures are quite different and
not addressed by these traditional approaches. Such
failures are caused by errors in the design of the
underlying program algorithms - i.e. it is the
proposed solution to the problem that is flawed, not
the implementation of the solution (
Harel 1992).
Failures caused by these algorithmic errors do not
necessarily halt the program execution, they simply
compute an incorrect answer (Napier 2001).
Anticipated problems are easier to detect and
contingencies can be put in place to put it into a safe
state or initiate an appropriate recovery procedure.
Non-halting failures of a COTS component are
much more likely to remain unrevealed and if such
insidious erroneous states are allowed to propagate
from a component into the rest of the system
potentially disastrous consequences can result. A
wider view of fault detection is required which aims
to detect these algorithmic errors in addition to the
traditional anticipated problems.
There is a range of on-line diagnostic techniques
available including data encoding (Napier 2001).
Most approaches to safety wrapper design use a
conventional user-defined executable assertions
(
Napier et al 2000). User-defined assertions can be
either external to the original program, based on
input/output relationships, or applied to check
internal program states. A safety wrapper can be
applied to a particular component using assertions in
one of three ways: Pre-condition assertion, Post-
condition assertion or Point/intermediate assertion.
In general, user-defined executable assertions for
inspecting internal data states may be an integral
part or an external wrapper for the underlying
program. They can be implemented relatively simply
by adding extra lines of code, possibly utilising
special mechanisms provided by the high level
language. Assertions requiring access to program
variables that are not accessible at the components
I/O, are unsuitable for use with COTS software since
these variables are not visible outside the
component. In contrast, external assertions can be
integrated into the middleware of component-based
systems even if a program component needs to be
treated as a black box.
2.2 Fault Tolerance by Diversity
Early ideas for reliability improvement by diversity
design centred on multiple versions of software
fulfilling the same requirement specifications. The
versions are expected to show different (diverse)
failure behaviours, both in terms of the inputs that
cause them to fail and in terms of failure behaviour
when both versions fail at the same time, so that
discrepancies between the two versions flag failures.
Currently, the main strategy for building versions
with such diverse behaviour is to use developers
with different backgrounds or to force diversity by
use of different hardware, languages, compilers etc.
The hope is that the different versions will not
contain the same errors. However, practical
applications of diverse software have shown that
normal design methods can not be assumed to
achieve this goal. Certainly, assuming independent
failures in versions will often be optimistic
(overestimate reliability).
Two kinds of diversity can be considered in
middleware designs:
Structural diversity: An algorithm can be
implemented with different structures.
ICSOFT 2007 - International Conference on Software and Data Technologies
278
Functional diversity: a specification can be fulfilled
by different algorithms.
Structural diversity has to be abandoned as a
technique for achieving diversity between a wrapper
and a COTS component, since the internal software
design in a COTS component may be unknown
(Chen and May 2004). However, structural diversity
can be used in a multi-version approach to wrapper
design. In a multi-wrapper design, the reliability of a
system with a COTS component will depend on the
combined failure coverage of all safety wrappers.
The diversity between wrappers will be an important
factor. There is still a need for diversity between
wrappers and the COTS component; to be effective,
the wrapper should not fail (to trap the COTS
failure) at the same time as the COTS component
fails. This is a black-box issue and can not be
judged/assessed directly. But it seems plausible to
expect that if wrapper A is not diverse with the
COTS component and wrappers A and B are
diverse, then wrapper B will show some diversity
with the COTS component. Thus a potential strength
of a multi-wrapper approach is that the
diversity/reliability improvement issues are judged
on the basis of wrapper design and there is less
emphasis on the COTS design, about which little
may be known.
2.3 Data Diversity
Data diversity generally can be regarded as a special
assertion/diagnostic technique defending against
design faults. The rationale of this technique is that
failures of defective software are usually input
dependent, e.g. the faults contained in the software
can only be triggered by fixed input sequences
(Ammann & Knight 1988). The key technique is to
use re-expression of an original input in ‘retries’ that
use the same software code. In one form of the
technique, the goal of the retry executions is to use
these different inputs to generate outputs that can be
manipulated to re-create the correct output for the
original input. This provides two (or more)
computations of the required output based on
different inputs, so there is a reduced probability of
failure.
Data diversity is a diverse software technique
where different multi-versions use the same code,
but use re-expressions of the input, but it might be
possible to pre-empt this approach and incorporate
some degree of execution flexibility into the design
of software components to simplify the use of the
data diversity concept. In this paper, the data
diversity technique could be used in conjunction
with safety wrappers or in fault tolerance functions
within the component).
3 CASE STUDY PROGRAM
The case study used the protection system for a
power plant from a multi-version software diversity
project known as DARTS (Quirk & Wall 1991). A
single C version of the software was used in the
experiments. The DARTS software was developed
specifically for experimentation, but this was done
under commercial conditions to produce software
representative of that used in practice. The program
monitors various parameters from the plant
environment: neutron pressure (NP), steam drum
pressure (SDP) and steam drum level (SDL), and
sets one of seven output trIp signals together with
several status signals under various circumstances.
Each of the physical parameters is monitored by
triplicated Sensor Devices.
The signals received from the smart sensor can
not be used directly by the protection system. A
middleware module ASSIGN_VALUE has been
designed to transfer the sensor values into proper
formats accepted by the protection system. This
software module also plays an important role as a
safety wrapper, attempting to check the correctness
of the input values from the smart sensors using
consistency checks.
In its role as ‘safety wrapper,’ ASSIGN_VALUE
is a common module monitoring all three physical
parameters NP, SDL and SDP. It receives three
readings for a physical parameter, and checks their
compatibility by observing if the differences
between the values from the three sensors are within
allowed ranges (specified in the system
requirements). Then it will decide how to pass on
the raw input data from the sensors and set the data
statuses.
For example, one assertion/diagnosis function in
ASSIGN_VALUE is to check the three input values
then decide whether:
All three values are averaged
Only low and medium values are averaged
Only high and medium values are averaged
A valid value can not be assigned
The protection system then computes using the
pre-processed values output from wrapper
ASSIGN_VALUE. ASSIGN_VALUE was used in
a range of experiments measuring diversity and fault
coverage, as follows.
NEW DESIGN TECHNIQUES FOR ENHANCING FAULT TOLERANT COTS SOFTWARE WRAPPERS
279
3.1 Wrappers with Structural Diversity
Two distinct factors influencing software diversity,
structural coupling and fault distribution, were
identified in recent research [Chen et al 2002]. It
was also shown that there is scope to manipulate
these factors in practice to actively improve the
diversity of software versions. Therefore, for
example, if a safety system design were to use
diverse software-based sensors, the implementation
should choose COTS sensors that are, or can be
configured to be, naturally diverse according to
these factors. This has the potential to strengthen
safety cases for systems containing diverse channels.
One important structural factor for diversity
enhancement is to implement ‘orthogonal structures
in versions e.g. a two-dimensional input-space can
be processed: {For X=1 to N then {For Y=1 to N
then OP1}} or orthogonally {For Y=1 to N then
{For X=1 to N OP2}. This idea has been used to
redesign a new wrapper ASSIGN_VALUE_New.
Then two version-pairs were constructed for the
purpose of assessing diverse designs: A pair with
two versions of the original wrapper (original-
original) and a pair with two versions of the new
wrapper (New-New). The diversity between a
version pair is a measure of the difference between
their failure behaviours when the two versions are
injected with different faults. Using identical
versions with different faults simulates the case
where two versions are developed by different
developers that use the same structure (or at least
largely similar structures since bugs can alter
structure), but containing different bugs.
Four systems were compared using: the original
single wrapper, the new single wrapper, the original-
original wrapper pair and the new-new wrapper pair.
An empirical approach to diversity assessment
was employed in which diversity is measured based
on a sample of injected faults (Chen et al 2002). The
resulting diversity assessment measures the average
behaviour of the software over a range of fault pairs
in two versions. This process was used to assess the
diversity present in multi-safety wrappers. That is, to
measure their degree of independence (avoidance of
common failures) when each safety wrapper
contains one fault, and averaging this measured
diversity over the range of fault pairs used.
To assess the effectiveness of the fault tolerant
designs, the proportion of the faults caught by the
safety wrappers will be used as an index. A safety
wrapper is designed with the requirement to protect
system from failures in a software component, and
the quality of safety wrappers can be directly
measured by fault coverage. The approach is to
simulate a number of failures of the software
component and then to check how many faults can
be detected by the safety wrapper (Panel 2002). This
experiment used a test support tool to simulate 512
failure scenarios of the COTS component. For each
system, the proportion of the 512 failures that was
not trapped by its wrapper(s) was observed (called a
failure probability in the following results).
In summary, the tasks involved are:
1. Observing the system behaviours with different
faults in the multiple wrappers to assess the
diversity of the wrappers
2. Observing the system behaviour with faults in
the software component and wrappers to
estimate the improvement of fault-tolerant
ability due to multiple safety wrappers.
The main burden of these assessments is the
selection and simulation of representative faults in
the COTS component and in the safety wrapper by
means of PIP (perturbation of interface parameters)
and SFI (software fault injection) methods
respectively. Use of SFI to simulate faults inside a
safety wrapper is a normal fault injection approach.
Applying PIP to simulate faults inside the
component has been specially developed to simulate
fault injection where the COTS software must be
treated as a black box (Chen 2002) i.e. where fault
injection is not possible. The safety designs of the
wrapper/s aims to catch all potential failures
propagated from a COTS component. So the PIP
approach needs to simulate all possible types of
anomalies of the interface parameters that could be
generated by the COTS code.
One result was that the new wrapper pair showed
a lower failure probability than the original pair. The
new wrapper design does appear to provide a better
structure for avoiding common failures. The reduced
coupling of the functions has made the faults
distribute uniformly over the structure and over the
input space, affording fewer opportunities for
different faults to be triggered by the same inputs.
Moreover, the use of multiple wrappers can improve
the level of fault tolerance achieved. The results are
summarized in figure 1.
Figure 1: Failure probabilities of different systems.
0
0.02
0.04
0.06
0.08
0.1
pai
r
o
f
ori
g
inal
pai
r
o
f
new
sin
g
le
ori
g
inal
sin
g
le
new
failure
p
robabilit
y
ICSOFT 2007 - International Conference on Software and Data Technologies
280
The systems with multiple wrappers showed
higher fault tolerance than those with a single one.
The system containing the wrapper pair with a
higher diversity showed the best fault tolerance.
3.2 Wrappers with Functional
Diversity
The aim of designing an effective wrapper in this
experiment was to enhance its reliability using
functional diversity.
In a practical design, a safety wrapper may
contain some productive computation functions i.e.
that perform computation of values rather than just
checking relations between values. Therefore in
addition to a structural diversity approach, we can
use functional diversity in the design of multiversion
wrappers. Essentially, the idea is to build a
‘checking style’ wrapper for the initial wrapper —
thus ‘wrapping a wrapper.’ For example, the safety
wrapper ASSIGN_VALUE includes some productive
functions to calculate the average value of three
sensor readings under various conditions. Our
approach is to use different diagnosis strategies in
wrappers to check if the calculation of the average
value is accurate.
In the experiment, a ‘checking style’ wrapper
was built as a redundant assertion. With the original
wrapper, they can compose a two-version wrapper
array. This check style wrapper used reverse
computation to validate the output from the COTS
component or the previous wrapper. The assessment
simply compared the fault coverage of the diverse
wrappers with the original one. We use a subset of
faults (from the fault set in last experiment) to test
the two wrappers. Of these faults, the system shows
diversities on 22.2% of them.
3.3 Wrappers with Data Diversity
Application of data diversity to smart sensors relies
on finding an equivalent re-expression of the input-
output relation of the sensors.
A smart sensor is normally used as a COTS
component. This means that it is difficult to
implement diversity design because little is known
about the code inside the component. On the other
hand, smart sensors are often well suited to data
diversity due to their straightforward dataflow.
The functions of a smart sensor mainly include
Data Acquisition and Signal Processing. The
processor may be also used for data conversion. The
specification of a smart sensor is often to measure a
variable, manipulate it and, in some cases, to take
action based on its value. In many wider applications
involving smart sensors, one does not care about the
raw data, but only about the information derived
from it by the sensor. For example in the DARTS
protection system, the application may not need the
exact temperature or pressure, but is interested in
whether it has exceeded a certain threshold or not.
Instead of sending a continuous stream of
temperature or pressure readings, a smart sensor
would send just one message when the temperature
criteria are met. Thus, only the relevant information
is sent out to the surrounding system.
Similarly to design strategies using functional
diversity, data diversity techniques can be integrated
in the safety wrapper based on origin-point shifting
methods. To process an input variable, smart sensors
mainly use functions such as scaling, trimming, and
filtering. Input values within different ranges and
with different amplitudes of noise have been
specified to be handled by different formulas. The
above discussion suggests a realistic approach to use
of data diversity within a smart sensor i.e. treating it
as a white box. If a smart sensor is a black box, it is
difficult to set up a proper post-condition assertion
without knowledge about its filter, calibration or
linearization functions. But we often know the
function of the smart sensor essentially is
monotonic:
Suppose
f is the function of the smart sensor
such that:
)(inputfoutput
=
and D is its
defined input domain. Then we have the relation:
inputs
Dyx
, , if y
> then we have
)()( yfxf > .
To improve the safety of a smart sensor being
used in an alarm function, we can design data
diversity by re-expressing the trip points or other
thresholds using device configuration facilities. For
example, we can shift both the input value and trIp
point by a value L at the same time. The shifted
values are used as re-expressed inputs and fed into
the sensor. In our solution, the sensor component
was wrapped with a new checker to judge if the trip
criteria were met based on these shifted values.
Thereafter, an assertion simply compared this result
with original one.
4 CONCLUSIONS
This paper investigates some design strategies to
provide middleware with safety functions to enhance
dependability of systems containing COTS
components. The designs were evaluated using an
empirical method and support tool developed
NEW DESIGN TECHNIQUES FOR ENHANCING FAULT TOLERANT COTS SOFTWARE WRAPPERS
281
systematically for diversity estimation based on fault
injection and failure tracing. A key issue is the
assessment of fault tolerance, on which conclusions
are based. The presented approach used fault
injection. By definition these faults are artificial. The
current understanding of software failure modes is
insufficient to allow the definition of realistic fault
sets - it is not clear what these might be. Therefore
this approach relies on an assumption that
hypothetical fault sets used in this way are
informative.
Diverse wrappers and wrappers embedded with
diagnostic assertions or data diversity have been
demonstrated as providing some level of increased
effectiveness at protecting system from potential
defects inside a COTS component.
The experiments on diversity designs of safety
features in middleware did not distinguish functional
and structural diversity: in terms of performance
they were similar. Anyhow, from these limited
experimental results it would clearly not be possible
to make general claims about the fault detection
capabilities of the different assertion type.
Some tentative conclusions are suggested that
are relevant to practice:
A ‘wrapper’ can be built from multiple
smaller complementary wrappers which can
be very effective and easy to implement
Functional diversity is easier to design than
structural diversity in multi wrappers
The application of check-style wrappers
reduces the scope for faults because they are
usually simpler modules than other kinds of
functional wrappers. It was clear in our
experiments that check-style wrappers can be
considerably more succinct than the code they
check. This is not surprising; it is well known
that checking a function can be a less complex
task than computing it. This effect was
sometimes so pronounced that it was difficult
to select plausible fault modes for injection
into the check-style wrappers.
A degree of orthogonality between old and
new wrappers was observed, which suggests
that software reliability will be most improved
if both assertion types are used (particularly
for faults with small footprints in the input
space).
Data diversity would appear to offer an
effective and appropriate way to improve
safety in smart sensors. It remains unexplored
in practice.
ACKNOWLEDGEMENTS
The work presented in this paper comprises aspects
of a study (NewDISPO2-4) performed as part of the
UK Nuclear Safety Research programme, funded
and controlled by the CINIF together with elements
from the SSRC Generic Research Programme
funded by British Energy, Lloyd's Register, and the
Health and Safety Executive.
REFERENCES
Ammann, P.E., Knight, J.C., 1988. Data Diversity: An
Approach to Software Fault Tolerance. IEEE Trans.
on Computers, 37(4): pp. 418-425.
Brown, A., Wallnau, K., 1998. The Current State of
CBSE. IEEE Software, 15(5): pp.37-46.
Chen, L., May, J., Hughes, G., 2002. Assessment of the
Benefit of Redundant Systems, Lecture Notes in
Computer Science, volume 2434, Springer, pp.151-162.
Chen, L., May, J., 2004. Safety Assessment of Systems
Embedded with COTS Components by PIP technique,
Lecture Notes in Informatics 58 GI.
Crnkovic, I., Larsson, M., 2002. Building Reliable
Component-Based Software System, Artech House Books.
Harel, D., 1992. Algorithmics: The Spirit of Computing,
Addison-Wesley.
Lindsay, P., Smith, G., 2000. Safety Assurance of
Commercial-Off-The-Shelf Software, Proc 5th
Australian Workshop on Safety Critical Systems and
Software.
May, J., 2002. Testing the reliability of component-based
safety critical software. Proc. 20th International
System Safety Conference, pp. 214—224.
Napier, J., Chen, L., May, J., Hughes, G., 2000. Fault
Simulating to validate fault-tolerance in Ada.
International Journal of Computer Systems, 15(1):61-67
Napier, J., 2001. Assessing Diagnostics for Fault Tolerant
Software. PhD thesis, Department of Computer
Science, University of Bristol.
Panel Discussion, 2002. How useful is software fault
injection for evaluating the security of COTS
products. Proceedings of the 17
th
ACSAC, IEEE
Computer Society.
Profeta., J, Andrianos, N., Yu, B., 1996. Safety-Critical
Systems Built with COTS, IEEE Comp. 29(11), pp 46-
54.
Quirk, J., Wall, N., 1991. Customer Functional
Requirements for the Protection System to be used as
the DARTS Example, DARTS consortium deliverable
report DARTS-032-HAR-160190-G supplied under the
HSE programme on Software Reliability.
Shin, M., Paniagua, F., 2006. Self-Management of COTS
Component-Based Systems Using Wrappers, 30th
COMPSAC, pp. 33-36.
Voas, J., 1998. Certifying Off-The Shelf Software
Components, IEE Computer, pp.53-59.
ICSOFT 2007 - International Conference on Software and Data Technologies
282