SOFTWARE VERSIONING IN THE CLOUD
Towards Automatic Source Code Management
Filippo Gioachin, Qianhui Liang, Yuxia Yao and Bu-Sung Lee
Hewlett-Packard Laboratories Singapore, Singapore, Republic of Singapore
Nanyang Technological University, Singapore, Republic of Singapore
Keywords:
Software development, Cloud computing, Version control system, Revisions, Collaboration.
Abstract:
With the introduction of cloud computing and Web 2.0, many applications are moving to the cloud environment. Version control systems have also taken a first step in this direction. Nevertheless, existing systems are either client-server oriented or completely distributed, and neither model matches the nature of the cloud well. In this paper we propose a new cloud version control system focusing on the requirements imposed by cloud computing, which we identified as: concurrent editing, history refinement, accountability, scalability, security, and fault tolerance. Our plan is to tackle these issues in a systematic way, and we present in this paper an overview of the solutions, organized in three separate layers: access API, logical structure, and physical storage.
1 INTRODUCTION
With the advent of cloud computing, and the ubiq-
uitous integration of network-based devices, online
collaboration has become a key component in many
aspects of people’s life and their work experience.
Users expect their data to be accessible from every-
where, and proximity to other users in the real world
is no longer essential for collaboration. In fact, many
projects and industries are set up such that different
people are not geospatially co-located, but rather dis-
tributed around the globe. Projects range from simple
asynchronous discussions on certain topics of interest,
to the creation of books or articles, to the development
of multipurpose software.
Software development, in particular, is a complex process composed of many phases, including analysis, coding, integration, and testing. A large number of programmers usually collaborate on the development of a product, so efficient and effective collaboration tools are key to improving productivity, raising the quality of the developed software, and shortening time-to-market. During the coding phase, programmers normally use version control systems (VCS) to record the changes made to the code. This is useful for multiple reasons. First of all, it allows different programmers to better coordinate independent changes to the code, and it simplifies the task of integrating these changes together, in particular if the integration is performed or augmented by code reviewers. Furthermore, during the testing phase, it helps to detect more quickly why or how certain bugs have been introduced, and therefore to correct them more easily.
In order to be effective for the tasks described, the
revision history needs to be clean and clear. If too
few revisions are present, each exposing a very large
set of modifications, the reader may have a hard time understanding exactly which changes produce certain results. On the other hand, when changes are committed very often and in small amounts, the history may be
cluttered with irrelevant changes that have already
been reverted or further modified. Although the revi-
sion history is believed by many to be an immutable
entity, allowing the freedom to refine it can be benefi-
cial to better understand the code at later stages of the
software development process, thus increasing pro-
ductivity. These refinements to the history can be sug-
gested either by the user explicitly, or by automatic
tools that inspect the existing history and detect better
sets of changes.
Clearly, history refinement needs to be organized
and regulated appropriately, and we shall describe
some principles in the later sections. For the moment,
we would like to highlight the fact that changes to the
history ought not to affect users during their routine
tasks of developing the software. This entails that the
tools, automatic or otherwise, ought to have a com-
plete view of all the revisions existing in the system.
A cloud environment is the most favorable for such
tools since all the data is known and always accessi-
ble, and users are not storing part of it in local disks.
The storage and management of many project
repositories in the cloud also requires special atten-
tion. This management does not affect the user di-
rectly, but can help to significantly improve system
performance and reduce the resource allocation of
cloud providers. For example, if a repository and
all its copies in use by different users can share the
same underlying resource, the space requirement can
be reduced. Ideally, the cloud should reduce resource allocation to the minimum necessary to guarantee to its users the correct retention of data in case
of failures.
This paper continues by describing related work
in Section 2, and how our tool positions itself among
them. Subsequently, Section 3 will focus on the re-
quirements we deem important for our cloud version-
ing system, and Section 4 will give an overview of the
system design. Final remarks will conclude the paper.
2 RELATED WORK
Version control systems are heavily used in software
development, and many solutions are available to this end, starting from the oldest Source Code Con-
trol System (SCCS) and Concurrent Versions System
(CVS) (Grune, 1986), which are based on a central-
ized client-server model. From that time till today, the
definition of change-sets has been delegated to the user, forcing him to spend time recording the changes,
while the system merely keeps track of the different
versions. The third generation of centralized systems
is currently embodied by Subversion (Pilato et al.,
2004). All these systems are based on a central repos-
itory storing all the revisions; any operation to up-
date the history is performed with an active connec-
tion to the server. Once a change-set has been com-
mitted to the server, and becomes part of it, it cannot
be modified anymore. This implies that history is im-
mutable, and can only progress forward. This is gen-
erally viewed as an advantage, but it forces mistakes to be corrected with additional commits once they have been pushed to the central repository. Given their structure, these systems also
impose a limit of one commit operation performed at
a time on a repository, thus limiting the potential scal-
ability of the infrastructure.
Moreover, in Subversion, there is a single point
of failure due to the presence of a local .svn direc-
tory in the user’s local disk. If this directory is tam-
pered with or deleted, the local modifications may be-
come impossible to commit, or, in the worst scenario,
corruption could be propagated to the central reposi-
tory. Given the presence of complete snapshots of the
repository on local disks, the central server does not
have access to all the knowledge. In particular, in our
view, this limits the applicability of automatic tools
capable of refining the history.
More recently, distributed repositories like
Git (Loeliger, 2009) or Mercurial (Mackall, 2006)
have started to appear as a substitute for centralized
repositories, and have become extremely popular.
A key aspect of their success is their capability
to perform many operations locally, even when
connectivity is not available. This approach has the
disadvantage that tracking what users are doing has
become even harder since users can now maintain
locally not only their most recent modifications, but
a much larger number of change-sets. This hinders
even more what automatic tools can do to improve
the revision history.
As mentioned earlier, in all versioning systems the history is supposed to be immutable, and projects should only move forward. In practice, however, the history can be arbitrarily changed by contributors: for example, they can apply a new set of changes to an old version, and then override the repository safety checks.
When the history is modified in such a way, no trace
will be left of the original history once the new one
has propagated. This poses a serious problem for the
accountability of changes for future reviewing pro-
cess.
Several additions to distributed systems have en-
abled them to be seamlessly shared in the cloud.
GitHub (GitHub, 2008) or SourceForge (Geeknet,
1999) are two such examples. These also alleviate
another problem of distributed VCSs, which is the
management and coordination of all the clones of the
same repository. On top of these hosting facilities,
other efforts are integrating repositories with online
software development, where users edit their code di-
rectly in the cloud. One example is Cloud9 (Daniëls
and Arends, 2011). Even in this case, due to the lim-
itations of current VCSs, a new copy of the data is
made per user. The user then accesses and commits
changes only to this personal copy. This imposes an unnecessary burden on the cloud infrastructure, requir-
ing many copies of the same repository to exist, well
beyond what is required for fault tolerance purposes.
As mentioned earlier, all the VCSs seen so far re-
quire users to explicitly take charge of versioning by defining the changes they made to the code and committing them to the VCS. This pro-
cess of committing change-sets has been automated
in domains other than software development. Google
Docs (Google, 2006) and Zoho Writer (Zoho, 2005)
are two examples where a history is maintained trans-
parently to the users in the realm of office produc-
tivity tools: users write their documents online, and
revisions are automatically created upon save. This
history is limited to a single branch, and merges are
always resolved immediately. Unfortunately, for soft-
ware development, merges cannot yet be resolved au-
tomatically, and branches are very important.
Our cloud versioning tool positions itself between
the two approaches just described: between automatic
versioning used in office productivity, and explicit
revision management used in software engineering.
History is to be written automatically by the system
upon save, and updated transparently as the user con-
tinues to modify the source code. The close resem-
blance to traditional version control tools enables our
system to support branches and the other mechanisms typical of software engineering. Automatic tools
can harvest the history and change it to present it to
users in a better format; users can also specify how re-
visions should be created, if needed. In particular, the
extreme flexibility at the basis of our design enables
new automatic tools to play an important role in the
future.
The field of software engineering has been vibrant
with new ideas on accurate automatic detectors of rel-
evant change-sets. For example, new tools are be-
ing proposed to discover code changes systematically
based on the semantics of the source code (Kim and
Notkin, 2009; Duley et al., 2010). These tools are
complementary to our work, and can use the cloud
VCS to refine the history to something that can bet-
ter describe the changes made to the code, without
requiring user intervention.
3 DESIRED REQUIREMENTS
After studying in detail current solutions for software revision control and the structure of cloud computing, we realized that the two did not match. We
therefore analyzed the features of a VCS residing in
the cloud, and consolidated them in the following six
areas.
Security. Content stored in the system may be con-
fidential and may represent an important asset for a
company. In a cloud environment, content will be ex-
posed to possible read, write and change operations
by unauthorized people. This naturally poses secu-
rity concerns for developers and companies using this
VCS.
We identified three key issues for security: ac-
cess control, privacy, and data deletion. When new
content is created, appropriate access control policies
must be enforced, either automatically or manually,
to limit what operations different users can perform
on the data. When existing content is updated, these
policies may need to be updated accordingly, for ex-
ample by revoking access to an obsolete release of a
library. Privacy is a closely related issue, and it refers
to the need to protect user assets against both unau-
thorized access, such as from other users of the cloud,
and illegal access, such as from malicious parties or
administrators. Finally, some data stored in the VCS
may violate local government regulations or com-
pany policies, in which case all references must be
permanently destroyed.
Fault Tolerance. Fault tolerance is always a bene-
fit claimed for cloud-based offerings. Current VCSs
do not support fault tolerance by themselves. Instead,
backups of the systems where they reside are used to provide resilience to failures. We foresee the
need to make fault tolerance an integral part of the
VCS itself. Issues to be addressed in this process of
integration include the methodology to create repli-
cas of the repositories transparently, the granularity at
which repositories are partitioned, the speed of prop-
agation of updates, and the location of the copies re-
tained.
Scalability. As the number of developers using the
VCS changes, the cloud needs to adapt by increasing
or reducing the amount of resources allocated. In par-
ticular, code bases and their revisions can easily grow to the extent that more nodes are needed even to hold a single project. A distributed storage architecture is
therefore a natural option. Furthermore, scalability
entails the efficient use of the allocated resources. If
this was not the case, and too many resources were
wasted, the scalability as a whole could be severely
hindered. For example, if the development team of
a project is geographically distributed, it would be
beneficial to respond to requests from servers located
nearby the specific users, and not from a single server.
History Refinement. As explained earlier, after changes
are made to the source code, and these have been
committed to the repository, developers may find it
useful or necessary to improve their presentation. For
example, if a bug was corrected late in the development of a feature, showing the fix explicitly to reviewers is an unnecessary complication; instead, it could be integrated at the point in time where the bug first appeared. Naturally, the original history is also impor-
tant, and it should be preserved alongside the mod-
ified one. In traditional VCSs this is impossible to
accomplish, and even the simpler action of rewriting
the history (without maintaining the old one) may not
be easy to accomplish.
Accountability. Every operation performed in the
system should be accounted for, be it the creation
of a new version, the modification of access permis-
sions, the modification of the history, or the deletion
of protected data. In traditional VCSs, only the first
two are correctly accounted for, and only if the system
is not tampered with. All the others are ignored; for
example, if the history is modified, no trace is left of
who modified it or why. Clearly, even who has access
to the accounting information needs to be correctly
regulated.
Concurrent Editing. Support for simultaneous writes is highly desirable in cloud versioning. When
many developers of a project are working on the same
items, it is likely that they will commit to the reposi-
tory at the same time, especially if commits are auto-
mated, say every auto-save. In this case, there will be
many simultaneous writes to the same objects on dif-
ferent branches. (Note that here we are not trying to
automatically resolve merge conflicts, since different
users will never be sharing the same branch. Merges
are still resolved manually.) There are tradeoffs be-
tween whether we want a highly available system or
a highly consistent one. The key, as we shall see,
is to determine the minimum consistency required to
match user expectations, and provide therefrom the
highest availability possible.
4 SYSTEM DESIGN
To address the requirements described above for our
cloud VCS, we divided the design into three layers
as shown in Figure 1. At the topmost level the ac-
cess API is responsible for coordinating the interac-
tion between system and external applications. At the
core of the VCS, the logical structure defines how the
data structures containing the users’ data and the rel-
ative privileges are implemented. At the lowest level
the physical storage layer is responsible for mapping
the logical structure onto the physical servers that em-
body the cloud infrastructure; it takes care of issues
like replication and consistency. Each shall be de-
scribed in the remainder of this section.
Figure 1: System design layout.
4.1 Access API
This layer is the one visible to other tools interacting
with the VCS. Its correct definition is essential to enable third parties to easily build powerful tools.
Version Control Commands. Like any VCS, our cloud solution has a standard API to access the stored
versions and parse the (potentially multiple) histories.
This API is designed to allow not only traditional op-
erations like creation of new commits and branches,
but also more unconventional operations, like the cre-
ation of a new version of the history between two
given commits.
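As a concrete illustration, the following Python sketch shows the flavor of such an API. All names (CloudVCS, commit, refine_history) are hypothetical, since the concrete interface is still being designed, and the in-memory bookkeeping stands in for the logical and physical layers described below.

import hashlib
import json
import time

class CloudVCS:
    """In-memory sketch of the access API; names and semantics are
    illustrative assumptions, not a released interface."""

    def __init__(self):
        self.commits = {}    # commit id -> record
        self.branches = {}   # branch name -> head commit id

    def commit(self, branch, snapshot, message=""):
        """Traditional operation: record a new snapshot on a branch."""
        record = {"parent": self.branches.get(branch),
                  "snapshot": snapshot, "message": message,
                  "time": time.time(), "tags": {"created-by": "user"}}
        cid = hashlib.sha1(
            json.dumps(record, sort_keys=True).encode()).hexdigest()
        self.commits[cid] = record
        self.branches[branch] = cid
        return cid

    def refine_history(self, first, last, new_snapshots):
        """Unconventional operation: create a refined sequence of
        versions between two given commits.  The original commits are
        left in place and remain addressable, for accountability."""
        parent = first
        for snap in new_snapshots:
            record = {"parent": parent, "snapshot": snap,
                      "message": "refined", "time": time.time(),
                      "tags": {"created-by": "tool", "replaces": last}}
            cid = hashlib.sha1(
                json.dumps(record, sort_keys=True).encode()).hexdigest()
            self.commits[cid] = record
            parent = cid
        return parent  # head of the refined segment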
File System Access. Files in the cloud are generally
accessed through web interfaces that do not require access to all the files in a project. For example, in a project
with one thousand files, a developer may be editing
only a dozen of them. The other files do not need
to occupy explicit space in the cloud storage—other
than that in the VCS itself. Therefore, the cloud VCS
provides a POSIX-like interface to operate as a reg-
ular file system. This empowers the cloud to avoid
unnecessary replications when files are only needed
for a very short amount of time, such as during the
compilation phase when executables are produced.
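The sketch below illustrates the idea behind such an interface, assuming a fetch callback into the VCS; an actual deployment would expose this through a real POSIX layer (for instance a FUSE module), which we omit here.

class LazyCheckout:
    """Sketch of a working copy that materializes file contents only
    when a file is actually opened (all names are assumptions)."""

    def __init__(self, fetch):
        self._fetch = fetch  # fetch(path) -> bytes, from the latest snapshot
        self._cache = {}     # only files actually touched occupy space

    def open(self, path):
        if path not in self._cache:
            self._cache[path] = self._fetch(path)  # fetched on demand
        return self._cache[path]

# Example: in a thousand-file project, only the opened file is materialized.
checkout = LazyCheckout(lambda p: b"contents of " + p.encode())
data = checkout.open("src/main.c")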
4.2 Logical Structure
The logical structure of the system is responsible for the correct functionality of the VCS. It
defines how data is laid out in the system to form the
revision histories, and how users can access projects
and share them.
Data Layout. Our VCS is based on the concept of
snapshot: each version represents an image of all the
files at the moment the version was created. The
basis of the system is a multi-branching approach,
where each user is in charge of his own branches,
and he has exclusive write access to them. This im-
plies that during the update process, when new revi-
sions are created, the number of conflicts is reduced,
allowing easier implementation of concurrent edit-
ing. Given the lack of a traditional master or trunk
branch, any branch is a candidate to become a release
or a specially tagged branch. Privileges and organi-
zation ranks will determine the workflow of informa-
tion. Due to the requirement of history refining and
accountability for its changes, each component stored
in the system is updatable inline, without implying a
change in its identifying key. In addition, to enable
automatic tools, as well as humans, to better interact
with the system, each node has tags associated, which
describe how the node was created and what privileges the node itself grants to different tools and users.
An important feature of the data layout is the capa-
bility to protect its integrity. In this context, we need
only to worry about detecting corruption in the system, since data correction can be performed with help from the lower layer, and in particular from the data replica-
tion mechanism. Corruption may occur due either to
bugs in the implementation of the API at the layer
above, or to malfunctioning of the hardware devices.
The latter problem can be solved by adding simple
error detection codes like SHA or CRC. As for the
former, automatic tools will check the status of the
system after risky operations to ensure the integrity
of the system. In all cases, when a corruption has
occurred, another copy will be used to maintain the
cloud service active, and an appropriate notification
will be issued.
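For the hardware side, a minimal sketch of such a detection check follows, assuming each node stores a SHA-256 digest next to its payload (the field names are assumptions):

import hashlib
import json

def node_digest(node):
    """Recompute the detection code over a node's payload; the digest
    is stored beside the payload, not used as the identifying key,
    because nodes are updatable inline."""
    payload = json.dumps({"snapshot": node["snapshot"],
                          "tags": node["tags"]}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def verify(node):
    """True if the stored digest matches; on a mismatch, the physical
    layer switches to a replica and issues a notification."""
    return node.get("digest") == node_digest(node)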
Privileges. We define each individual user as a de-
veloper. Each developer can create his own projects
and, at the same time, participate in collaborative
projects. He can also link other projects to use as ex-
ternal libraries or plugins. Each collaborative project,
or simply project, may involve a large number of de-
velopers, each with his own set of privileges. Some
developers may be actively developing the software,
and therefore be granted full access. Others may be
simply reviewing the code or managing releases; in
this case they will be granted read-only access.
When a project is made available for others to use
as a library, a new set of privileges can be expressed.
For example, the inherited code may be read-only, writable but without repackaging capability, or fully accessible. Moreover, some developers
may also be allowed to completely delete some portion of the stored data, or to change the access privileges of other developers or of the project as a whole. Fi-
nally, automatic tools, even though not directly devel-
opers, have their specific access privileges. For ex-
ample, they may be allowed to harvest anonymous in-
formation about the engineering process for statistical
analysis, or propose new revision histories to better
present the project to reviewers.
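One possible encoding of these privilege sets is sketched below; the specific flags are assumptions drawn from the examples above, not a fixed design.

from enum import Flag, auto

class Privilege(Flag):
    READ = auto()       # reviewers, release managers
    WRITE = auto()      # active developers
    REPACKAGE = auto()  # may redistribute an inherited library
    DELETE = auto()     # may permanently destroy protected data
    ADMIN = auto()      # may change other developers' privileges
    HARVEST = auto()    # automatic tools: anonymous statistics

DEVELOPER = Privilege.READ | Privilege.WRITE
REVIEWER = Privilege.READ

def allowed(acl, who, op):
    """acl maps a developer or tool id to its Privilege flags."""
    return op in acl.get(who, Privilege(0))

acl = {"alice": DEVELOPER | Privilege.ADMIN,
       "bob": REVIEWER,
       "stats-bot": Privilege.HARVEST}
assert allowed(acl, "bob", Privilege.READ)
assert not allowed(acl, "bob", Privilege.WRITE)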
4.3 Physical Storage
The underlying physical storage design is critical to
the overall performance, especially to support a large
scale VCS in the cloud, where failures are common.
Our design considers the various aspects of a dis-
tributed storage architecture: scalability, high avail-
ability, consistency, and security.
Scalability. Having a single large centralized stor-
age for all the repositories would lead to poor per-
formance and no scalability. Therefore, our design
contemplates a distributed storage architecture. User
repositories will be partitioned, and distributed across
different storage nodes. As the number of users
grows, new partitions can be easily added on-demand,
thus allowing for quick scale out of the system. Natu-
rally, the principles driving the partitioning of repositories need careful design, so that when
new storage nodes are added, the data movement nec-
essary to recreate a correct partitioning layout is min-
imized.
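One standard technique with this property is consistent hashing, which remaps only a small fraction of the repositories when storage nodes join or leave. The sketch below illustrates the principle; it is not the committed partitioning design.

import bisect
import hashlib

class ConsistentHashRing:
    """Sketch: map repository ids to storage nodes so that adding a
    node moves only about 1/N of the repositories."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted (hash point, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _h(key):
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):  # virtual nodes smooth the load
            bisect.insort(self._ring, (self._h(f"{node}#{i}"), node))

    def node_for(self, repo_id):
        i = bisect.bisect(self._ring, (self._h(repo_id), ""))
        return self._ring[i % len(self._ring)][1]  # wrap around the ring

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("projects/foo.git")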
Replication. Users expect continuous access to the
cloud VCS. However, in cloud environments built
with commodity hardware, failures occur frequently,
leading to both disk storage unavailability and net-
work splits—where certain geographical regions be-
come temporarily inaccessible. To ensure fault toler-
ance and high availability, user repositories are repli-
cated across different storage nodes and across data
centers. Therefore, even with failures and some stor-
age nodes unreachable, other copies can still be ac-
cessed and used to provide continuous service. The
replication scheme eliminates the single point of fail-
ure present in other centralized, or even distributed,
architectures. In our scheme, there are no master and
slave storage nodes.
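A minimal sketch of such masterless placement follows, in the flavor of rendezvous hashing: every node ranks itself for a repository, and the top-ranked nodes in distinct data centers hold the copies. The details are assumptions for illustration.

import hashlib

def place_replicas(repo_id, nodes, copies=3):
    """nodes: iterable of (node name, data center) pairs.  All chosen
    nodes are peers; none of them is a master."""
    def rank(node):
        name, _dc = node
        return hashlib.sha1((repo_id + "/" + name).encode()).hexdigest()

    chosen, used_dcs = [], set()
    for name, dc in sorted(nodes, key=rank):
        if dc not in used_dcs:  # spread copies across data centers
            chosen.append(name)
            used_dcs.add(dc)
        if len(chosen) == copies:
            break
    return chosen

nodes = [("a1", "dc-east"), ("a2", "dc-east"),
         ("b1", "dc-west"), ("c1", "dc-asia")]
replicas = place_replicas("projects/foo.git", nodes)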
Consistency. Software development involves fre-
quent changes and updates to the repositories. In a tra-
ditional scenario, changes are immediately reflected
globally (e.g., a git push command atomically updates
the remote repository). Consistency becomes more
complicated when the repository is distributed as mul-
tiple copies on different storage nodes, and multiple
users have access to the same repository and perform
operations simultaneously. If strong consistency is required, the system's availability suffers (Brewer,
2010). In many distributed cloud storage systems,
eventual consistency has been used to enable high
availability (meaning that if no new updates are made
to an object, all accesses will eventually return the
last updated copy of that object). In our scheme, we
guarantee “read-your-writes” consistency for any par-
ticular user. This means that old copies of an object
will never be seen by a user after he has updated that
object. We also guarantee “monotonic write consis-
tency”: for a particular process all the writes are seri-
alized, so that later writes supersede previous writes.
Finally, we guarantee “causal consistency”, meaning
1) that if an object X written by process A is read by
process B, which later writes object Y, then any other
process C that reads Y must also read X; and 2) that
once process A has informed process B about the up-
date of object Z, B cannot read the old copy of Z any-
more.
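These session guarantees can be enforced with per-session version bookkeeping. The sketch below shows the read-your-writes case; the toy store and its version counters are assumptions standing in for the replicated storage nodes.

class StaleRead(Exception):
    pass

class Store:
    """Toy store: each key has a monotonically increasing version;
    a lagging replica would return an older version."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def write(self, key, value):
        version = self._data.get(key, (0, None))[0] + 1
        self._data[key] = (version, value)
        return version

    def read(self, key):
        return self._data.get(key, (0, None))

class Session:
    """Read-your-writes: remember the versions this session wrote and
    refuse any replica answer older than those."""

    def __init__(self, store):
        self._store = store
        self._floor = {}  # key -> minimum acceptable version

    def write(self, key, value):
        self._floor[key] = self._store.write(key, value)

    def read(self, key):
        version, value = self._store.read(key)
        if version < self._floor.get(key, 0):
            raise StaleRead(key)  # pick another replica and retry
        return value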
Encryption. Source code is valuable to developers, and it should be unreadable to users who have not been explicitly granted access to it. We en-
vision encryption deployed at all levels of the stor-
age system to protect users' information and source code. At the user level, users create credentials to prevent unauthorized access (for example in the form of RSA keys). At the source code level, files are protected with an encryption algorithm before be-
ing written to persistent storage. At the hardware
level, individual disks are encrypted to further prevent
loss or disclosure of users’ data.
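As an illustration of the source code level, the sketch below uses authenticated encryption (AES-GCM) through the Python cryptography package; key management is elided here, and would tie into the user-level credentials.

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_blob(key, plaintext, repo_id):
    """Encrypt a source file before it reaches persistent storage;
    the repository id is bound in as associated data."""
    nonce = os.urandom(12)  # must be unique per encryption
    sealed = AESGCM(key).encrypt(nonce, plaintext, repo_id.encode())
    return nonce + sealed

def decrypt_blob(key, blob, repo_id):
    nonce, sealed = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, sealed, repo_id.encode())

key = AESGCM.generate_key(bit_length=256)
blob = encrypt_blob(key, b"int main() { return 0; }", "projects/foo.git")
assert decrypt_blob(key, blob, "projects/foo.git").startswith(b"int main")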
5 CONCLUSIONS
In this paper we discussed how online collaboration,
a key technology to enable productive software de-
velopment and timely delivery to market, can be
enhanced to meet user needs. We surveyed current
tools available for collaboration, and how these have
shortcomings that can hinder their effectiveness. We
proposed a new solution based on a cloud version
control system, where the entire repository is hosted
securely in the cloud. Users can access their code
from anywhere without requiring pre-installation of
specific software, and seamlessly from different de-
vices. Moreover, the cloud VCS is planned with flex-
ibility at its foundation, thus enabling automatic tools,
as well as humans, to modify the state of the history
in a consistent manner to provide better insight into how
the software has been evolving.
REFERENCES
Brewer, E. (2010). A certain freedom: thoughts on the CAP
theorem. In Proceedings of the 29th ACM SIGACT-
SIGOPS symposium on Principles of distributed com-
puting, PODC ’10, pages 335–335, New York, NY,
USA. ACM.
Daniëls, R. and Arends, R. (2011). Cloud9 IDE -
Development-as-a-Service. http://cloud9ide.com.
Duley, A., Spandikow, C., and Kim, M. (2010). A pro-
gram differencing algorithm for Verilog HDL. In Pro-
ceedings of the IEEE/ACM international conference
on Automated software engineering, ASE ’10, pages
477–486, New York, NY, USA. ACM.
Geeknet (1999). SourceForge. http://sourceforge.net.
GitHub (2008). GitHub - social coding. http://github.com/.
Google (2006). Google docs & spreadsheets. Google Press
Center. http://www.google.com/intl/en/press/annc/
docsspreadsheets.html.
Grune, D. (1986). Concurrent versions system, a method
for independent cooperation. Technical report, IR 113,
Vrije Universiteit.
Kim, M. and Notkin, D. (2009). Discovering and repre-
senting systematic code changes. In Proceedings of
the 31st International Conference on Software Engi-
neering, ICSE ’09, pages 309–319, Washington, DC,
USA. IEEE Computer Society.
Loeliger, J. (2009). Version Control with Git. O’Reilly Me-
dia, Inc., 1st edition.
Mackall, M. (2006). Towards a better SCM: Revlog and
Mercurial. In Proceedings of the Linux Symposium
(Ottawa, Ontario, Canada), volume 2, pages 83–90.
Pilato, M. C., Collins-Sussman, B., and Fitzpatrick, B. W.
(2004). Version Control with Subversion. O’Reilly
Media, Inc., 1 edition.
Zoho (2005). Zoho writer. http://writer.zoho.com/.