
4. Servers of a given type on different computers
may have different processing capacities.
5. The prototype derived from the conceptual
model should accommodate safety mechanisms
that will enable it to handle both crash-type and
arbitrary (Byzantine) failures, resulting in a higher
failure mode coverage (Laprie, 1995).
The challenge in providing fault tolerance in the
scenario described above stems from the dynamic
and uncertain nature of the network. For example,
computers can be installed or removed in real time,
and unexpected software or hardware crashes may
occur. The need to provide end-users with quality
service at minimal response time motivates the
search for mechanisms that detect faults and
rapidly adjust the performance of the network so
that the desired quality standards are maintained.
Effective synchronization and communication
protocols are a critical asset for the success of such
a system.
The proposed reconfiguration model is algorithmic
and comprises the following elements:
Network Status: A set of vectors and matrices that
capture the actual state of the network at any given
point in time (termed logical configuration). These
elements describe which servers and computers are
active and which tasks are processed on each server
at any point in time. They also include operational
instructions on what to do with the tasks running on
a server in case the host computer becomes
inoperative.
Task-Reconfiguration Algorithm: An algorithmic
set of procedures that transform the network status
elements so that they capture and react to changes
in the state of the network (termed events) with
minimal delay.
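The two elements above can be illustrated with a minimal sketch in C. It is
not the authors' implementation: the array bounds, the names NetworkStatus,
EventKind, status_init, tasks_on and apply_event, and the choice to treat an
arbitrary failure like a crash (taking the computer out of service) are all
assumptions made for illustration. The operational instructions stored with
the network status are omitted for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative bounds; the paper does not state actual sizes. */
#define MAX_COMPUTERS    4
#define MAX_SERVER_TYPES 3
#define MAX_TASKS        32
#define NO_COMPUTER      (-1)

/* Hypothetical logical configuration (network status). */
typedef struct {
    bool computer_active[MAX_COMPUTERS];                 /* status vector */
    bool server_active[MAX_SERVER_TYPES][MAX_COMPUTERS]; /* status matrix */
    int  task_type[MAX_TASKS]; /* server type that processes task t      */
    int  task_host[MAX_TASKS]; /* computer running task t, or NO_COMPUTER */
} NetworkStatus;

/* Hypothetical event kinds that change the logical configuration. */
typedef enum {
    EV_COMPUTER_ADDED,
    EV_COMPUTER_REMOVED,
    EV_COMPUTER_CRASHED,
    EV_ARBITRARY_FAILURE
} EventKind;

/* Start with no active computers or servers and no placed tasks. */
void status_init(NetworkStatus *ns) {
    memset(ns, 0, sizeof *ns);
    for (int t = 0; t < MAX_TASKS; t++)
        ns->task_host[t] = NO_COMPUTER;
}

/* Count the tasks of one server type currently hosted on one computer. */
int tasks_on(const NetworkStatus *ns, int type, int computer) {
    int n = 0;
    for (int t = 0; t < MAX_TASKS; t++)
        if (ns->task_host[t] == computer && ns->task_type[t] == type)
            n++;
    return n;
}

/* Update the status elements for one event. Returns the number of tasks
 * left without a host, which a reallocation step would then redistribute. */
int apply_event(NetworkStatus *ns, EventKind kind, int computer) {
    assert(computer >= 0 && computer < MAX_COMPUTERS);
    if (kind == EV_COMPUTER_ADDED) {
        ns->computer_active[computer] = true;
        return 0;
    }
    /* Assumption: removal, crash and arbitrary failure all take the
     * computer and every server on it out of service. */
    ns->computer_active[computer] = false;
    for (int s = 0; s < MAX_SERVER_TYPES; s++)
        ns->server_active[s][computer] = false;
    int orphaned = 0;
    for (int t = 0; t < MAX_TASKS; t++) {
        if (ns->task_host[t] == computer) {
            ns->task_host[t] = NO_COMPUTER;
            orphaned++;
        }
    }
    return orphaned;
}
```

In this sketch a caller would invoke apply_event for each incoming event and
pass the returned count of unhosted tasks to a separate reallocation step.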
Note that, in contrast to the logical configuration, the
physical configuration of the network refers to the
hardware profile (e.g., ratio of memory/CPU
power, number of I/O devices etc.). Changes in the
physical configuration are therefore less frequent
than changes in the logical configuration. The
former fall outside the focus of the model because
they cannot affect the behavior of the model unless
they are first reflected in the logical configuration
(e.g., by registering a newly acquired computer in an
appropriate status matrix).
The basic principle of the model is to dynamically
redistribute tasks between servers available on the
network in response to threatening events. When
such an event occurs in the network (e.g., a
computer crashes, or an arbitrary failure occurs),
the model reallocates the active tasks running on the
stalled computer to other available computers
according to a proportional ratio determined by the
relative importance of the servers. The importance
(vote) of a server is based on the system manager's
perception of the relative processing capacity of all
servers of a given type (running on different
computers). If the proportional split leaves a task
unassigned, this task is allocated to the computer
that has the highest remainder, following the
Hamilton method (Ibaraki and Katoh, 1988). This
approach can be applied to the event
of system initialization as well.
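The proportional reallocation with the Hamilton (largest remainder) rule can
be sketched as follows. This is an illustrative sketch rather than the
authors' code: the function name hamilton_allocate, the integer votes, and
the tie-breaking rule (lowest index wins) are assumptions. Each surviving
server first receives the integer part of its proportional share, and each
leftover task then goes to the server with the largest remainder.

```c
#include <assert.h>
#include <stddef.h>

/* Distribute `tasks` among n servers in proportion to votes[].
 * out[i] receives the number of tasks assigned to server i.
 * Ties between equal remainders go to the lowest index (assumption). */
void hamilton_allocate(int tasks, const int votes[], size_t n, int out[]) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        assert(votes[i] >= 0);
        total += votes[i];
    }
    assert(tasks >= 0 && total > 0);

    /* Step 1: give each server the integer part of its quota. */
    int assigned = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = (int)((long)tasks * votes[i] / total);
        assigned += out[i];
    }

    /* Step 2: hand out leftover tasks one at a time to the server with
     * the largest remainder that has not yet received an extra task. */
    for (int left = tasks - assigned; left > 0; left--) {
        size_t best = n;
        long best_rem = -1;
        for (size_t i = 0; i < n; i++) {
            long quota = (long)tasks * votes[i] / total;
            long rem   = (long)tasks * votes[i] % total;
            if (out[i] == quota && rem > best_rem) {
                best = i;
                best_rem = rem;
            }
        }
        assert(best < n);
        out[best]++;
    }
}
```

For example, with 7 tasks and votes of 5, 3 and 2, the quotas are 3.5, 2.1
and 1.4. The integer parts place 6 tasks, and the remaining task goes to the
first server, whose remainder (0.5) is the largest, giving 4, 2 and 1.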
3 EVALUATION OF THE MODEL
The proposed reconfiguration model was evaluated
on a large national digital telecommunications
network comprising approximately 200 hubs of the
following types: TX-1, TMX-10, and TMX-100
(manufactured by Northern Telecom) and System-12
(manufactured by Alcatel). Each of these hubs
serves between 1,000 and 20,000 customers.
As an example, the System-12 hub is a
complex hardware and software device running
several tens of modules concurrently. The modules
are responsible for various tasks (e.g., central
control, connection with customers, message
routing, connection bus with other hubs,
distribution control and more). The System-12 hub
uses approximately 100 types of status messages in
order to monitor and coordinate the operation of
the hub (e.g., detecting and handling malfunctions).
The model for controlling the network was
implemented using the C programming language.
The system operates over the VAX/OpenVMS
operating system running on two VAX 4000-5000
computers and using the Digital RMS software for
file management. The computers are connected in a
cluster using the Digital Storage Systems
Interconnect (DSSI), which enables sharing of
disks among computers, synchronization of events
and transmission of data. Connection between the
servers on the computers and the hubs they are
serving is implemented using an X.25 packet
switching network. This network transmits
instructions from the servers to the hubs and events
from the hubs to the servers. The performance of
the network was measured and recorded using
Digital's MONITOR software package over a period of
one month. Several measurements were performed
during the day and an arithmetic average was used
to summarize the results. The effect of the
workload created by MONITOR on the results is
ICEIS 2004 - DATABASES AND INFORMATION SYSTEMS INTEGRATION