SOFTWARE VERSIONING IN THE CLOUD
Towards Automatic Source Code Management
Filippo Gioachin, Qianhui Liang, Yuxia Yao and Bu-Sung Lee
Hewlett-Packard Laboratories Singapore, Singapore, Republic of Singapore
Nanyang Technological University, Singapore, Republic of Singapore
Keywords:
Software development, Cloud computing, Version control system, Revisions, Collaboration.
Abstract:
With the introduction of cloud computing and Web 2.0, many applications are moving to the cloud environment. Version control systems have also taken a first step in this direction. Nevertheless, existing systems are either client-server oriented or completely distributed, and neither model matches the nature of the cloud well. In this paper we propose a new cloud version control system focusing on the requirements imposed by cloud computing, which we identified as: concurrent editing, history refinement, accountability, scalability, security, and fault tolerance. Our plan is to tackle these issues in a systematic way, and we present in this paper an overview of the solutions, organized in three separate layers: access API, logical structure, and physical storage.
1 INTRODUCTION
With the advent of cloud computing, and the ubiq-
uitous integration of network-based devices, online
collaboration has become a key component in many
aspects of people’s life and their work experience.
Users expect their data to be accessible from every-
where, and proximity to other users in the real world
is no longer essential for collaboration. In fact, many
projects and industries are set up such that different
people are not geospatially co-located, but rather dis-
tributed around the globe. Projects range from simple
asynchronous discussions on certain topics of interest,
to the creation of books or articles, to the development
of multipurpose software.
Software development, in particular, is a complex process composed of many phases, including analysis, coding, integration, and testing. A large number of programmers usually collaborate on the development of a product, so efficient and effective collaboration tools are key to improving productivity, raising the quality of the developed software, and shortening time-to-market. During the coding phase, programmers normally use version control systems (VCS) to record the changes made to the code. This is useful for multiple reasons. First of all, it allows different programmers to better coordinate independent changes to the code, and it simplifies the task of integrating these changes together, in particular if the integration is performed or augmented by code reviewers. Furthermore, during the testing phase, it helps to detect more quickly why or how certain bugs have been introduced, and therefore to correct them more easily.
In order to be effective for the tasks described, the
revision history needs to be clean and clear. If too
few revisions are present, each exposing a very large
set of modifications, the reader may have a hard time understanding exactly which changes produce certain results. On the other hand, when changes are committed very often and in small amounts, the history may be
cluttered with irrelevant changes that have already
been reverted or further modified. Although the revi-
sion history is believed by many to be an immutable
entity, allowing the freedom to refine it can be benefi-
cial to better understand the code at later stages of the
software development process, thus increasing pro-
ductivity. These refinements to the history can be sug-
gested either by the user explicitly, or by automatic
tools that inspect the existing history and detect better
sets of changes.
Clearly, history refinement needs to be organized
and regulated appropriately, and we shall describe
some principles in the later sections. For the moment,
we would like to highlight the fact that changes to the
history ought not to affect users during their routine
tasks of developing the software. This entails that the
tools, automatic or otherwise, ought to have a com-
plete view of all the revisions existing in the system.
A cloud environment is the most favorable for such
tools since all the data is known and always accessi-
ble, and users are not storing part of it in local disks.
The storage and management of many project
repositories in the cloud also requires special atten-
tion. This management does not affect the user di-
rectly, but can help to significantly improve system
performance and reduce the resource allocation of
cloud providers. For example, if a repository and
all its copies in use by different users can share the
same underlying resource, the space requirement can
be reduced. Ideally, the cloud should reduce resource allocation to the minimum necessary to guarantee to its users the correct retention of data in case
of failures.
This paper continues by describing related work
in Section 2, and how our tool positions itself among
them. Subsequently, Section 3 will focus on the re-
quirements we deem important for our cloud version-
ing system, and Section 4 will give an overview of the
system design. Final remarks will conclude the paper.
2 RELATED WORK
Version control systems are heavily used in software
development, and many solutions are available to this end, starting from the oldest Source Code Con-
trol System (SCCS) and Concurrent Versions System
(CVS) (Grune, 1986), which are based on a central-
ized client-server model. From that time till today, the
definition of change-sets has been delegated to the user, forcing him to spend time recording the changes,
while the system merely keeps track of the different
versions. The third generation of centralized systems
is currently embodied by Subversion (Pilato et al.,
2004). All these systems are based on a central repos-
itory storing all the revisions; any operation to up-
date the history is performed with an active connec-
tion to the server. Once a change-set has been com-
mitted to the server, and becomes part of it, it cannot
be modified anymore. This implies that history is im-
mutable, and can only progress forward. This is gen-
erally viewed as an advantage, but it forces mistakes to be corrected with additional commits once they have been pushed to the central repository. Given their structure, these systems also
impose a limit of one commit operation performed at
a time on a repository, thus limiting the potential scal-
ability of the infrastructure.
Moreover, in Subversion, there is a single point
of failure due to the presence of a local .svn direc-
tory in the user’s local disk. If this directory is tam-
pered with or deleted, the local modifications may be-
come impossible to commit, or, in the worst scenario,
corruption could be propagated to the central reposi-
tory. Given the presence of complete snapshots of the
repository on local disks, the central server does not
have access to all the knowledge. In particular, in our
view, this limits the applicability of automatic tools
capable of refining the history.
More recently, distributed repositories like
Git (Loeliger, 2009) or Mercurial (Mackall, 2006)
have started to appear as a substitute for centralized
repositories, and have become extremely popular.
A key aspect of their success is their capability
to perform many operations locally, even when
connectivity is not available. This approach has the
disadvantage that tracking what users are doing has
become even harder since users can now maintain
locally not only their most recent modifications, but
a much larger number of change-sets. This hinders
even more what automatic tools can do to improve
the revision history.
As mentioned earlier, in all versioning systems the history is supposed to be immutable, and projects should only move forward. In practice, however, the history can be arbitrarily changed by contributors: for example, they can apply a new set of changes to an old version, and then override the repository safety checks.
When the history is modified in such a way, no trace
will be left of the original history once the new one
has propagated. This poses a serious problem for the
accountability of changes for future reviewing pro-
cess.
Several additions to distributed systems have en-
abled them to be seamlessly shared in the cloud.
GitHub (GitHub, 2008) or SourceForge (Geeknet,
1999) are two such examples. These also alleviate
another problem of distributed VCSs, which is the
management and coordination of all the clones of the
same repository. On top of these hosting facilities,
other efforts are integrating repositories with online
software development, where users edit their code di-
rectly in the cloud. One example is Cloud9 (Daniëls
and Arends, 2011). Even in this case, due to the lim-
itations of current VCSs, a new copy of the data is
made per user. The user then accesses and commits
changes only to this personal copy. This imposes an unnecessary burden on the cloud infrastructure, requir-
ing many copies of the same repository to exist, well
beyond what is required for fault tolerance purposes.
As mentioned earlier, all the VCSs seen so far re-
quire users to explicitly take charge of versioning by defining the changes they made to the code and committing them to the VCS. This pro-
cess of committing change-sets has been automated
in domains other than software development. Google
Docs (Google, 2006) and Zoho Writer (Zoho, 2005)
are two examples where a history is maintained trans-
parently to the users in the realm of office produc-
tivity tools: users write their documents online, and
revisions are automatically created upon save. This
history is limited to a single branch, and merges are
always resolved immediately. Unfortunately, for soft-
ware development, merges cannot yet be resolved au-
tomatically, and branches are very important.
Our cloud versioning tool positions itself between
the two approaches just described: between automatic
versioning used in office productivity, and explicit
revision management used in software engineering.
History is to be written automatically by the system
upon save, and updated transparently as the user con-
tinues to modify the source code. The close resem-
blance to traditional version control tools enables our
system to support branches and the other mechanisms typical of software engineering. Automatic tools
can harvest the history and change it to present it to
users in a better format; users can also specify how re-
visions should be created, if needed. In particular, the
extreme flexibility at the basis of our design enables
new automatic tools to play an important role in the
future.
The field of software engineering has been vibrant
with new ideas on accurate automatic detectors of rel-
evant change-sets. For example, new tools are be-
ing proposed to discover code changes systematically
based on the semantics of the source code (Kim and
Notkin, 2009; Duley et al., 2010). These tools are
complementary to our work, and can use the cloud
VCS to refine the history to something that can bet-
ter describe the changes made to the code, without
requiring user intervention.
3 DESIRED REQUIREMENTS
After studying in detail current solutions for software revision control and the structure of cloud computing, we realized that the two did not match. We
therefore analyzed the features of a VCS residing in
the cloud, and consolidated them in the following six
areas.
Security. Content stored in the system may be con-
fidential and may represent an important asset for a
company. In a cloud environment, content will be ex-
posed to possible read, write and change operations
by unauthorized people. This naturally poses secu-
rity concerns for developers and companies using this
VCS.
We identified three key issues for security: ac-
cess control, privacy, and data deletion. When new
content is created, appropriate access control policies
must be enforced, either automatically or manually,
to limit what operations different users can perform
on the data. When existing content is updated, these
policies may need to be updated accordingly, for ex-
ample by revoking access to an obsolete release of a
library. Privacy is a closely related issue, and it refers
to the need to protect user assets against both unau-
thorized access, such as from other users of the cloud,
and illegal access, such as from malicious parties or
administrators. Finally, some data stored in the VCS
may violate local government regulations or com-
pany policies, in which case all references must be
permanently destroyed.
Fault Tolerance. Fault tolerance is always a bene-
fit claimed for cloud-based offerings. Current VCSs
do not support fault tolerance by themselves. Instead,
backups of the systems where they reside are used to provide resilience to failures. We foresee the
need to make fault tolerance an integral part of the
VCS itself. Issues to be addressed in this process of
integration include the methodology to create repli-
cas of the repositories transparently, the granularity at
which repositories are partitioned, the speed of prop-
agation of updates, and the location of the copies re-
tained.
Scalability. As the number of developers using the
VCS changes, the cloud needs to adapt by increasing
or reducing the amount of resources allocated. In par-
ticular, code bases and their revisions can easily grow to the extent that more nodes are needed even to hold a single project. A distributed storage architecture is
therefore a natural option. Furthermore, scalability
entails the efficient use of the allocated resources. If
this was not the case, and too many resources were
wasted, the scalability as a whole could be severely
hindered. For example, if the development team of
a project is geographically distributed, it would be
beneficial to respond to requests from servers located
nearby the specific users, and not from a single server.
History Refinement. As explained earlier, after changes
are made to the source code, and these have been
committed to the repository, developers may find it
useful or necessary to improve their presentation. For
example, if a bug was corrected late in the development of a feature, showing the fix explicitly to reviewers is an unnecessary complication; instead, it could be integrated at the point in time where the bug first appeared. Naturally, the original history is also impor-
tant, and it should be preserved alongside the mod-
ified one. In traditional VCSs this is impossible to
accomplish, and even the simpler action of rewriting
the history (without maintaining the old one) may not
be easy to accomplish.
Accountability. Every operation performed in the
system should be accounted for, be it the creation
of a new version, the modification of access permis-
sions, the modification of the history, or the deletion
of protected data. In traditional VCSs, only the first
two are correctly accounted for, and only if the system
is not tampered with. All the others are ignored; for
example, if the history is modified, no trace is left of
who modified it or why. Clearly, even who has access
to the accounting information needs to be correctly
regulated.
Concurrent Editing. Support for simultaneous writes is highly desirable in cloud versioning. When
many developers of a project are working on the same
items, it is likely that they will commit to the reposi-
tory at the same time, especially if commits are auto-
mated, say every auto-save. In this case, there will be
many simultaneous writes to the same objects on dif-
ferent branches. (Note that here we are not trying to
automatically resolve merge conflicts, since different
users will never be sharing the same branch. Merges
are still resolved manually.) There are tradeoffs be-
tween whether we want a highly available system or
a highly consistent one. The key, as we shall see,
is to determine the minimum consistency required to
match user expectations, and provide therefrom the
highest availability possible.
4 SYSTEM DESIGN
To address the requirements described above for our
cloud VCS, we divided the design into three layers
as shown in Figure 1. At the topmost level the ac-
cess API is responsible for coordinating the interac-
tion between system and external applications. At the
core of the VCS, the logical structure defines how the
data structures containing the users’ data and the rel-
ative privileges are implemented. At the lowest level
the physical storage layer is responsible for mapping
the logical structure onto the physical servers that em-
body the cloud infrastructure; it takes care of issues
like replication and consistency. Each shall be de-
scribed in the remainder of this section.
Figure 1: System design layout.
4.1 Access API
This layer is the one visible to other tools interacting
with the VCS. Its correct definition is essential to enable third parties to easily build powerful tools.
Version Control Commands. Like any VCS, our cloud solution has a standard API to access the stored
versions and parse the (potentially multiple) histories.
This API is designed to allow not only traditional op-
erations like creation of new commits and branches,
but also more unconventional operations, like the cre-
ation of a new version of the history between two
given commits.
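As a concrete illustration, the following Python sketch shows the flavor of such an API. All names (CloudVCS, commit, refine_history) are hypothetical, since the concrete interface is still being designed, and the in-memory bookkeeping stands in for the logical and physical layers described below.

import hashlib
import json
import time

class CloudVCS:
    """In-memory sketch of the access API; names and semantics are
    illustrative assumptions, not a released interface."""

    def __init__(self):
        self.commits = {}    # commit id -> record
        self.branches = {}   # branch name -> head commit id

    def commit(self, branch, snapshot, message=""):
        """Traditional operation: record a new snapshot on a branch."""
        record = {"parent": self.branches.get(branch),
                  "snapshot": snapshot, "message": message,
                  "time": time.time(), "tags": {"created-by": "user"}}
        cid = hashlib.sha1(
            json.dumps(record, sort_keys=True).encode()).hexdigest()
        self.commits[cid] = record
        self.branches[branch] = cid
        return cid

    def refine_history(self, first, last, new_snapshots):
        """Unconventional operation: create a refined sequence of
        versions between two given commits.  The original commits are
        left in place and remain addressable, for accountability."""
        parent = first
        for snap in new_snapshots:
            record = {"parent": parent, "snapshot": snap,
                      "message": "refined", "time": time.time(),
                      "tags": {"created-by": "tool", "replaces": last}}
            cid = hashlib.sha1(
                json.dumps(record, sort_keys=True).encode()).hexdigest()
            self.commits[cid] = record
            parent = cid
        return parent  # head of the refined segment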
File System Access. Files in the cloud are generally
accessed through web interfaces that do not require access to all the files in a project. For example, in a project
with one thousand files, a developer may be editing
only a dozen of them. The other files do not need
to occupy explicit space in the cloud storage—other
than that in the VCS itself. Therefore, the cloud VCS
provides a POSIX-like interface to operate as a reg-
ular file system. This empowers the cloud to avoid
unnecessary replications when files are only needed
for a very short amount of time, such as during the
compilation phase when executables are produced.
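The sketch below illustrates the idea behind such an interface, assuming a fetch callback into the VCS; an actual deployment would expose this through a real POSIX layer (for instance a FUSE module), which we omit here.

class LazyCheckout:
    """Sketch of a working copy that materializes file contents only
    when a file is actually opened (all names are assumptions)."""

    def __init__(self, fetch):
        self._fetch = fetch  # fetch(path) -> bytes, from the latest snapshot
        self._cache = {}     # only files actually touched occupy space

    def open(self, path):
        if path not in self._cache:
            self._cache[path] = self._fetch(path)  # fetched on demand
        return self._cache[path]

# Example: in a thousand-file project, only the opened file is materialized.
checkout = LazyCheckout(lambda p: b"contents of " + p.encode())
data = checkout.open("src/main.c")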
4.2 Logical Structure
The logical structure of the system is responsible for the correct functionality of the VCS. It
defines how data is laid out in the system to form the
revision histories, and how users can access projects
and share them.
Data Layout. Our VCS is based on the concept of
snapshot: each version represents an image of all the
files at the moment the version was created. The
basis of the system is a multi-branching approach,
where each user is in charge of his own branches,
and he has exclusive write access to them. This im-
plies that during the update process, when new revi-
sions are created, the number of conflicts is reduced,
allowing easier implementation of concurrent edit-
ing. Given the lack of a traditional master or trunk
branch, any branch is a candidate to become a release
or a specially tagged branch. Privileges and organi-
zation ranks will determine the workflow of informa-
tion. Due to the requirement of history refining and
accountability for its changes, each component stored
in the system is updatable inline, without implying a
change in its identifying key. In addition, to enable
automatic tools, as well as humans, to better interact
with the system, each node has tags associated, which
describe how the node was created and what privileges the node itself grants to different tools and users.
An important feature of the data layout is the capa-
bility to protect its integrity. In this context, we need
only to worry about detecting corruption in the system, since data correction can be performed with help from the lower layer, and in particular from the data replica-
tion mechanism. Corruption may occur due either to
bugs in the implementation of the API at the layer
above, or to malfunctioning of the hardware devices.
The latter problem can be solved by adding simple
error detection codes like SHA or CRC. As for the
former, automatic tools will check the status of the
system after risky operations to ensure the integrity
of the system. In all cases, when a corruption has
occurred, another copy will be used to maintain the
cloud service active, and an appropriate notification
will be issued.
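For the hardware side, a minimal sketch of such a detection check follows, assuming each node stores a SHA-256 digest next to its payload (the field names are assumptions):

import hashlib
import json

def node_digest(node):
    """Recompute the detection code over a node's payload; the digest
    is stored beside the payload, not used as the identifying key,
    because nodes are updatable inline."""
    payload = json.dumps({"snapshot": node["snapshot"],
                          "tags": node["tags"]}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def verify(node):
    """True if the stored digest matches; on a mismatch, the physical
    layer switches to a replica and issues a notification."""
    return node.get("digest") == node_digest(node)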
Privileges. We define each individual user as a de-
veloper. Each developer can create his own projects
and, at the same time, participate in collaborative
projects. He can also link other projects to use as ex-
ternal libraries or plugins. Each collaborative project,
or simply project, may involve a large number of de-
velopers, each with his own set of privileges. Some
developers may be actively developing the software,
and therefore be granted full access. Others may be
simply reviewing the code or managing releases; in
this case they will be granted read-only access.
When a project is made available for others to use
as a library, a new set of privileges can be expressed.
For example, the inherited code may be read-only, writable but without repackaging capability, or fully accessible. Moreover, some developers
may also be allowed to completely delete some portion of the stored data, or to change the access privileges of other developers or of the project as a whole. Fi-
nally, automatic tools, even though not directly devel-
opers, have their specific access privileges. For ex-
ample, they may be allowed to harvest anonymous in-
formation about the engineering process for statistical
analysis, or propose new revision histories to better
present the project to reviewers.
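One possible encoding of these privilege sets is sketched below; the specific flags are assumptions drawn from the examples above, not a fixed design.

from enum import Flag, auto

class Privilege(Flag):
    READ = auto()       # reviewers, release managers
    WRITE = auto()      # active developers
    REPACKAGE = auto()  # may redistribute an inherited library
    DELETE = auto()     # may permanently destroy protected data
    ADMIN = auto()      # may change other developers' privileges
    HARVEST = auto()    # automatic tools: anonymous statistics

DEVELOPER = Privilege.READ | Privilege.WRITE
REVIEWER = Privilege.READ

def allowed(acl, who, op):
    """acl maps a developer or tool id to its Privilege flags."""
    return op in acl.get(who, Privilege(0))

acl = {"alice": DEVELOPER | Privilege.ADMIN,
       "bob": REVIEWER,
       "stats-bot": Privilege.HARVEST}
assert allowed(acl, "bob", Privilege.READ)
assert not allowed(acl, "bob", Privilege.WRITE)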
4.3 Physical Storage
The underlying physical storage design is critical to
the overall performance, especially to support a large
scale VCS in the cloud, where failures are common.
Our design considers the various aspects of a dis-
tributed storage architecture: scalability, high avail-
ability, consistency, and security.
Scalability. Having a single large centralized stor-
age for all the repositories would lead to poor per-
formance and no scalability. Therefore, our design
contemplates a distributed storage architecture. User
repositories will be partitioned, and distributed across
different storage nodes. As the number of users
grows, new partitions can be easily added on-demand,
thus allowing for quick scale out of the system. Natu-
rally, the principles driving the partitioning of repositories need careful design, so that when
new storage nodes are added, the data movement nec-
essary to recreate a correct partitioning layout is min-
imized.
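One standard technique with this property is consistent hashing, which remaps only a small fraction of the repositories when storage nodes join or leave. The sketch below illustrates the principle; it is not the committed partitioning design.

import bisect
import hashlib

class ConsistentHashRing:
    """Sketch: map repository ids to storage nodes so that adding a
    node moves only about 1/N of the repositories."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted (hash point, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _h(key):
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):  # virtual nodes smooth the load
            bisect.insort(self._ring, (self._h(f"{node}#{i}"), node))

    def node_for(self, repo_id):
        i = bisect.bisect(self._ring, (self._h(repo_id), ""))
        return self._ring[i % len(self._ring)][1]  # wrap around the ring

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("projects/foo.git")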
Replication. Users expect continuous access to the
cloud VCS. However, in cloud environments built
with commodity hardware, failures occur frequently,
leading to both disk storage unavailability and net-
work splits—where certain geographical regions be-
come temporarily inaccessible. To ensure fault toler-
ance and high availability, user repositories are repli-
cated across different storage nodes and across data
centers. Therefore, even with failures and some stor-
age nodes unreachable, other copies can still be ac-
cessed and used to provide continuous service. The
replication scheme eliminates the single point of fail-
ure present in other centralized, or even distributed,
architectures. In our scheme, there are no master and
slave storage nodes.
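A minimal sketch of such masterless placement follows, in the flavor of rendezvous hashing: every node ranks itself for a repository, and the top-ranked nodes in distinct data centers hold the copies. The details are assumptions for illustration.

import hashlib

def place_replicas(repo_id, nodes, copies=3):
    """nodes: iterable of (node name, data center) pairs.  All chosen
    nodes are peers; none of them is a master."""
    def rank(node):
        name, _dc = node
        return hashlib.sha1((repo_id + "/" + name).encode()).hexdigest()

    chosen, used_dcs = [], set()
    for name, dc in sorted(nodes, key=rank):
        if dc not in used_dcs:  # spread copies across data centers
            chosen.append(name)
            used_dcs.add(dc)
        if len(chosen) == copies:
            break
    return chosen

nodes = [("a1", "dc-east"), ("a2", "dc-east"),
         ("b1", "dc-west"), ("c1", "dc-asia")]
replicas = place_replicas("projects/foo.git", nodes)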
Consistency. Software development involves fre-
quent changes and updates to the repositories. In a tra-
ditional scenario, changes are immediately reflected
globally (e.g., a git push command atomically updates
the remote repository). Consistency becomes more
complicated when the repository is distributed as mul-
tiple copies on different storage nodes, and multiple
users have access to the same repository and perform
operations simultaneously. If strong consistency is required, the system's availability suffers (Brewer,
2010). In many distributed cloud storage systems,
eventual consistency has been used to enable high
availability (meaning that if no new updates are made
to an object, all accesses will eventually return the
last updated copy of that object). In our scheme, we
guarantee “read-your-writes” consistency for any par-
ticular user. This means that old copies of an object
will never be seen by a user after he has updated that
object. We also guarantee “monotonic write consis-
tency”: for a particular process all the writes are seri-
alized, so that later writes supersede previous writes.
Finally, we guarantee “causal consistency”, meaning
1) that if an object X written by process A is read by
process B, which later writes object Y, then any other
process C that reads Y must also read X; and 2) that
once process A has informed process B about the up-
date of object Z, B cannot read the old copy of Z any-
more.
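These session guarantees can be enforced with per-session version bookkeeping. The sketch below shows the read-your-writes case; the toy store and its version counters are assumptions standing in for the replicated storage nodes.

class StaleRead(Exception):
    pass

class Store:
    """Toy store: each key has a monotonically increasing version;
    a lagging replica would return an older version."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def write(self, key, value):
        version = self._data.get(key, (0, None))[0] + 1
        self._data[key] = (version, value)
        return version

    def read(self, key):
        return self._data.get(key, (0, None))

class Session:
    """Read-your-writes: remember the versions this session wrote and
    refuse any replica answer older than those."""

    def __init__(self, store):
        self._store = store
        self._floor = {}  # key -> minimum acceptable version

    def write(self, key, value):
        self._floor[key] = self._store.write(key, value)

    def read(self, key):
        version, value = self._store.read(key)
        if version < self._floor.get(key, 0):
            raise StaleRead(key)  # pick another replica and retry
        return value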
Encryption. Source code is valuable to developers, and it should be unreadable to users who have not been explicitly granted access to it. We en-
vision encryption deployed at all levels of the stor-
age system to protect users' information and source code. At the user level, users create credentials to prevent unauthorized access (for example in the form of RSA keys). At the source code level, files are protected with an encryption algorithm before be-
ing written to persistent storage. At the hardware
level, individual disks are encrypted to further prevent
loss or disclosure of users’ data.
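As an illustration of the source code level, the sketch below uses authenticated encryption (AES-GCM) through the Python cryptography package; key management is elided here, and would tie into the user-level credentials.

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_blob(key, plaintext, repo_id):
    """Encrypt a source file before it reaches persistent storage;
    the repository id is bound in as associated data."""
    nonce = os.urandom(12)  # must be unique per encryption
    sealed = AESGCM(key).encrypt(nonce, plaintext, repo_id.encode())
    return nonce + sealed

def decrypt_blob(key, blob, repo_id):
    nonce, sealed = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, sealed, repo_id.encode())

key = AESGCM.generate_key(bit_length=256)
blob = encrypt_blob(key, b"int main() { return 0; }", "projects/foo.git")
assert decrypt_blob(key, blob, "projects/foo.git").startswith(b"int main")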
5 CONCLUSIONS
In this paper we discussed how online collaboration,
a key technology to enable productive software de-
velopment and timely delivery to market, can be
enhanced to meet user needs. We surveyed current
tools available for collaboration, and how these have
shortcomings that can hinder their effectiveness. We
proposed a new solution based on a cloud version
control system, where the entire repository is hosted
securely in the cloud. Users can access their code
from anywhere without requiring pre-installation of
specific software, and seamlessly from different de-
vices. Moreover, the cloud VCS is planned with flex-
ibility at its foundation, thus enabling automatic tools,
as well as humans, to modify the state of the history
in a consistent manner to provide better insight into how
the software has been evolving.
REFERENCES
Brewer, E. (2010). A certain freedom: thoughts on the CAP
theorem. In Proceedings of the 29th ACM SIGACT-
SIGOPS symposium on Principles of distributed com-
puting, PODC ’10, pages 335–335, New York, NY,
USA. ACM.
Daniëls, R. and Arends, R. (2011). Cloud9 IDE -
Development-as-a-Service. http://cloud9ide.com.
Duley, A., Spandikow, C., and Kim, M. (2010). A pro-
gram differencing algorithm for Verilog HDL. In Pro-
ceedings of the IEEE/ACM international conference
on Automated software engineering, ASE ’10, pages
477–486, New York, NY, USA. ACM.
Geeknet (1999). SourceForge. http://sourceforge.net.
GitHub (2008). GitHub - social coding. http://github.com/.
Google (2006). Google docs & spreadsheets. Google Press
Center. http://www.google.com/intl/en/press/annc/
docsspreadsheets.html.
Grune, D. (1986). Concurrent versions system, a method
for independent cooperation. Technical report, IR 113,
Vrije Universiteit.
Kim, M. and Notkin, D. (2009). Discovering and repre-
senting systematic code changes. In Proceedings of
the 31st International Conference on Software Engi-
neering, ICSE ’09, pages 309–319, Washington, DC,
USA. IEEE Computer Society.
Loeliger, J. (2009). Version Control with Git. O’Reilly Me-
dia, Inc., 1st edition.
Mackall, M. (2006). Towards a better SCM: Revlog and
Mercurial. In Proceedings of the Linux Symposium
(Ottawa, Ontario, Canada), volume 2, pages 83–90.
Pilato, M. C., Collins-Sussman, B., and Fitzpatrick, B. W.
(2004). Version Control with Subversion. O’Reilly
Media, Inc., 1 edition.
Zoho (2005). Zoho writer. http://writer.zoho.com/.