with direct Grid submission and are unwilling to
completely break the ties. While important players,
they are a shrinking minority, so they will not be
further considered in this paper.
The remaining communities are served by only
a couple of Glidein Factories, each of which covers
most of the available worldwide Grid resources used
by these communities. The funds for operating
these Glidein Factories are provided by large-scale Grid
infrastructures, the most prominent being the Open
Science Grid (OSG) (Sfiligoi et al., 2011a). Note
that several of the authors are involved with
operations of either a VO Frontend+WMS instance or
a Glidein Factory instance.
In our experience, the operational tasks of
the Glidein Factory and the VO Frontend+WMS
operators are clearly distinct:
• the Glidein Factory operators are responsible
for monitoring and reacting to the changes in
the Grid resource configurations, discovering and
troubleshooting misbehaving resources, and for
keeping up to date with updates in the pilot
software; while
• the VO Frontend+WMS operators are responsible
for keeping up to date with the needs of the users,
thus modifying accordingly both the validation
scripts and the matchmaking expressions, and for
keeping up to date with the updates in the central
services software.
The VO Frontend+WMS instances are extremely
varied, and include small departmental groups,
campus-wide deployments of mostly independent sci-
entists, and portal-based, well-organized international
scientific collaborations. Their configurations and
operational procedures are understandably just as
varied, so they will not be described in this paper.
Nevertheless, they all report very low operational
costs, with the vast majority of effort being spent on
helping users with misbehaving jobs. Please note that
this does not include any system level maintenance,
since that is typically handled by an independent IT
team in a uniform way across the local infrastructure.
Since there are only a few Glidein Factory
instances, there is instead very little variation between
them. For consistency, we will skip configuration and
operational procedures for these as well. Here the
operational costs tend to be higher, with the bulk of
the effort going into adapting to the ever-changing
environment of the Grid, and they tend to scale linearly
with the number of independent Grid sites used
for resource provisioning. Again, system-level
maintenance is not factored in.
4 MOVING TO THE CLOUD
As described in the previous section, the glideinWMS
architecture calls for a clear separation of duties
between the Glidein Factory and the VO Frontend.
When we decided to add support for Cloud resource
providers, we were thus required to decide which one
would own the system image.
Moreover, one has to decide if the pilot processes
will run as a superuser, e.g. UNIX root, or as a
non-privileged user. On the Grid, we manage to run
securely even without superuser privileges, but we do
sacrifice some flexibility in the process.
In this section we present the three options that
we consider feasible, together with the advantages and
disadvantages of each.
4.1 Image Owned by the Glidein
Factory and Pilot Running as a
Non-privileged User
One option is to give full control over the system
image to the Glidein Factory, and to drop privileges
early in the process, thus running the pilot as a
non-privileged user. In this scenario, we basically
mimic the Grid operational mode.
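To make "dropping privileges early" concrete, the following is a minimal
sketch of how a boot-time script on the system image might switch from
root to a non-privileged account before starting the pilot. It is an
illustration only, not the glideinWMS implementation: the function names,
the uid/gid arguments and the pilot command line are all hypothetical.

```python
import os
from typing import Callable, List, Sequence, Tuple

# A named step, so the ordering can be inspected without being root.
Step = Tuple[str, Callable[[], None]]


def privilege_drop_steps(uid: int, gid: int) -> List[Step]:
    """Return the POSIX calls needed to become (uid, gid), in a safe order.

    Hypothetical helper. Groups must be changed before the user id,
    because a process that has already given up root may no longer
    change its groups.
    """
    return [
        ("setgroups", lambda: os.setgroups([])),  # clear supplementary groups
        ("setgid", lambda: os.setgid(gid)),       # switch primary group
        ("setuid", lambda: os.setuid(uid)),       # finally, give up root
    ]


def start_pilot_unprivileged(uid: int, gid: int, pilot_argv: Sequence[str]) -> None:
    """Drop root privileges, then replace this process with the pilot.

    `pilot_argv` is an assumed command line for the pilot startup script.
    """
    if os.geteuid() != 0:
        raise PermissionError("must start as root in order to drop privileges")
    for _name, step in privilege_drop_steps(uid, gid):
        step()
    # Sanity check: regaining root must now be impossible.
    try:
        os.setuid(0)
    except PermissionError:
        pass
    else:
        raise RuntimeError("privilege drop did not take effect")
    # The pilot, and every user job it starts, now runs unprivileged.
    os.execvp(pilot_argv[0], list(pilot_argv))
```

A boot script would call, for example,
start_pilot_unprivileged(uid, gid, ["/path/to/pilot_startup.sh"]) once all
root-only setup is finished; the path is a placeholder. Since nothing
runs as root after this point, the image content stays unchanged at
runtime.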
This choice would provide several advantages:
• The maintenance and security patching of the
system image are handled by the Glidein
Factory operators, likely leading to a much better
security posture and lower total operational costs
compared to the other alternatives.
• The absence of non-system services running with
superuser privileges further minimizes security
risks.
• The absence of runtime changes to the system
level settings allows for certification of the system
images.
• The VO Frontend administrators are completely
unaware of the difference between the Grid and
the Cloud resources, making life much easier for
them.
The disadvantages of this choice are instead:
• Reduced flexibility for the Glidein Factory op-
erators, since any change to the system level
requires a new image. This would likely require a
separate image for each resource type, increasing
the operational costs.
• The inability of the VO Frontend administrators
to influence the operating system in any way.
• The inability of the pilot to use all the OS features for
user job control and monitoring.
THE glideinWMS APPROACH TO THE OWNERSHIP OF SYSTEM IMAGES IN THE CLOUD WORLD