3.2 Current Configuration
Currently, the SDX1 and USA15 areas together host
more than 200 fully occupied racks of equipment
from the ATLAS TDAQ and detector sub-systems,
which can be subdivided into the following groups:
- Core services, comprising a NAS, 2 gateways, 2 DNS servers, 2 CFS (Central File Server) nodes, web servers, and LDAP and MySQL clusters;
- More than 1500 netbooted nodes served by 60 LFS (Local File Server) machines, including 154 ROS (Read-Out System) nodes, 841 HLT PUs (6728 CPU cores combined), 77 EB (Event Building) nodes, 6 SFOs (Sub-Farm Output buffer nodes to permanent storage), 64 online and monitoring nodes, 161 SBCs (Single Board Computers), and more than 110 nodes of the preproduction system;
- Many special-purpose, locally installed systems: 24 ACR machines (standard Control Room machines, 4 screens each), 45 SCR machines, Detector Control System (DCS) nodes, sub-detector PCs, and public nodes.
The number of HLT Processor Units (PUs) deployed
in SDX1 up to now represents only 50% of the total
SDX1 capacity. This is assumed to be sufficient for
the initial phase of the accelerator programme, and
the number will be increased to meet demand.
3.3 Increasing System Availability
Given the complexity and interdependency of the IT
infrastructure components, it takes a significant
amount of time to bring the infrastructure back to its
production state after a complete shutdown. Measures
were therefore taken to protect the TDAQ
infrastructure from power cuts of various origins.
The whole facility is provided with two centralized
UPS lines backed by diesel generators, plus
independent UPS lines that are available in SDX1 for
mission-critical equipment. At present, about 5% of
the equipment deployed in SDX1 is on UPS lines,
some of it on dual power (both UPS and mains), and
just a few machines, such as the CFS nodes, are
dual-powered from both a dedicated locally installed
UPS and the general UPS power.
3.4 Hardware and Service Monitoring
Host monitoring is currently done using the
NAGIOS platform, which allows various services to
be monitored on each machine. For all machines, the
basic OS, network, and hardware state is monitored.
On certain hosts, such as the LFS or the Application
Gateways, specific services such as NTP, NFS, and
DHCP are also monitored.
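As an illustration of what such a check amounts to, the Python sketch below probes a single TCP service and reports the result using NAGIOS-style exit codes; the host name and port are hypothetical examples, and the production checks are performed by the standard NAGIOS plugins rather than by custom code like this.

    import socket
    import sys

    def check_tcp(host, port, timeout=5.0):
        """Return True if a TCP connection to host:port succeeds within timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        # Hypothetical host name; NFS normally listens on TCP port 2049.
        host = "pc-lfs-example-01"
        if check_tcp(host, 2049):
            print("NFS OK - %s answers on port 2049" % host)
            sys.exit(0)   # NAGIOS exit code for OK
        else:
            print("NFS CRITICAL - %s does not answer on port 2049" % host)
            sys.exit(2)   # NAGIOS exit code for CRITICAL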
The NAGIOS graphs collected by the monitoring
system are stored on disk in the form of more than
29400 RRD files with a total size of approximately
5.5 GB. Status information for all the nodes and
NAGIOS graphs of the various parameters are
published automatically in the monitoring section of
the private ATLAS Operations web server. For
selected indicators that are of crucial importance to
the proper functioning of the ATLAS TDAQ
infrastructure, e-mail/SMS alerts are provided.
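The time series behind these graphs can be read back with the standard rrdtool command-line client; the sketch below, in which the RRD file path is a hypothetical example, extracts the last day of averaged samples from one such file.

    import subprocess

    # Hypothetical path to one of the RRD files written by the monitoring system.
    rrd_file = "/var/lib/nagios/rrd/pc-lfs-example-01_load.rrd"

    # "rrdtool fetch <file> AVERAGE" prints the stored consolidated samples;
    # --start -1d restricts the output to the last 24 hours.
    out = subprocess.check_output(
        ["rrdtool", "fetch", rrd_file, "AVERAGE", "--start", "-1d"])

    for line in out.decode().splitlines():
        print(line)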
3.5 Node Management Tools
Due to the large number of netbooted nodes being
managed, nontrivial automation is required to reduce
the time consumed by various routine operations,
such as rebooting a group of nodes, assigning clients
to the LFS servers, etc. A dedicated set of tools based
on a MySQL database, together with a GUI front end
called ConfdbUI, was developed to cover these
requirements. With few exceptions, all the nodes
currently run Scientific Linux CERN 5 (SLC5).
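The exact schema and commands are internal to these tools, but the Python sketch below illustrates the general idea under assumed, hypothetical table and column names: look up all netbooted clients assigned to a given LFS in the configuration database and reboot them one by one.

    import subprocess
    import MySQLdb  # MySQL-Python bindings

    # Hypothetical connection parameters and schema; the real ConfDB layout differs.
    db = MySQLdb.connect(host="confdb-server", user="reader",
                         passwd="secret", db="confdb")
    cur = db.cursor()
    cur.execute("SELECT hostname FROM nodes WHERE lfs = %s AND netbooted = 1",
                ("pc-lfs-example-01",))

    for (hostname,) in cur.fetchall():
        # Trigger a clean reboot on each client; in practice this would be
        # rate-limited and logged by the management tools.
        subprocess.call(["ssh", "root@" + hostname, "/sbin/reboot"])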
The standard solution for remote hardware
management in ATLAS Point 1 is based exclusively
on IPMI. Locally installed nodes, such as the LFS,
Control Room machines, and central servers, are
installed and configured centrally using a
Quattor-based (Leiva, 2005) configuration
management system.
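Concretely, IPMI allows the power state of a node to be queried and changed through its BMC; the sketch below uses the standard ipmitool client for this, with hypothetical BMC host names and credentials.

    import subprocess

    def ipmi(bmc_host, *command):
        """Run an ipmitool command against one BMC over the LAN interface.
        The user name and password shown here are placeholders."""
        return subprocess.check_output(
            ["ipmitool", "-I", "lanplus", "-H", bmc_host,
             "-U", "admin", "-P", "secret"] + list(command))

    # Query the current power state, then power-cycle the node if it is stuck.
    print(ipmi("pc-tdq-example-01-ipmi", "chassis", "power", "status").decode())
    ipmi("pc-tdq-example-01-ipmi", "chassis", "power", "cycle")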
3.6 Remote Access Subsystems
Access to the ATLAS Technical and Control
Network (ATCN) from outside Point 1 is highly
restricted and is only allowed via one of the following
gateway systems:
- Point 1 Linux gateways, allowing expert users to access a particular set of machines within Point 1 via the SSH protocol (see the sketch after this list);
- Remote Monitoring System, providing graphical terminal services for remote monitoring of the ATLAS sub-detector systems;
- Windows Terminal Servers, allowing experts to access the DCS SCADA system.
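For the SSH case, the sketch below illustrates the two-hop access pattern this implies, with hypothetical host and user names: the expert's SSH session to a machine inside ATCN is tunnelled through one of the Point 1 gateways, so the target host is never reached directly from outside.

    import subprocess

    # Hypothetical names; the gateway stands for one of the Point 1 Linux gateways
    # and the target for a machine inside ATCN.
    gateway = "expert@atlas-gw-example.cern.ch"
    target = "expert@pc-atlas-example-01"

    # Tunnel the second SSH connection through the gateway with ProxyCommand,
    # so the target is only ever reached from the gateway itself.
    subprocess.call([
        "ssh",
        "-o", "ProxyCommand=ssh %s -W %%h:%%p" % gateway,
        target,
    ])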
All the gateways and remote monitoring nodes are
provided with both host-based and network-based
accounting, security monitoring, and intrusion
prevention systems. The gateways are implemented
on a Xen-based virtualization solution, which
ensures high availability and manageability for this
vital subsystem.