members with institutional review board approval can
access the data. Data analysis scripts and manuscript
text files are available to all team members on pub-
lic code hosting services. The linkage between the
data and the code is made possible by a feature of
the Git version control software: submodules. A 40-
character hexadecimal sequence (SHA-1 hash) allows
us to share the version of the data source publicly
without compromising the data itself.
The objective of this manuscript is to present a
workflow that we developed to 1) protect sensitive
data from unauthorized access, 2) allow multiple
authors, including those with and without data access
rights, to contribute to a single set of files, and 3)
minimize the financial commitment to hardware and
software.
Our primary focus is on the use of the Git version
control system and, specifically, Git submodules. We
will note other software tools and programs used in
our workflow, but similar software can often be
substituted for them.
2 WORKFLOW OVERVIEW
Dynamic document authoring is a key component of
the overall reproducible research paradigm. Varia-
tions on literate programming (Knuth, 1984) are ideal
for this purpose. The R package knitr (Xie, 2015),
an evolution of R’s Sweave (Leisch, 2002),
provides a structured process for authoring a
manuscript using a literate programming paradigm.
knitr was highlighted several times at a recent
workshop supported by the National Academies
of Sciences, Engineering, and Medicine (National
Academies of Sciences, Engineering, and Medicine,
Division on Engineering and Physical Sciences,
Board on Mathematical Sciences and Their Applica-
tions, Committee on Applied and Theoretical Statis-
tics, 2016).
We typically perform data analysis with the
statistical language R³ and rely on either markdown
or LaTeX for markup. The desired format of our
deliverables dictates the markup language selection.
Weaving R code with a markup language is well
described (Gandrud, 2015).
Our team manages collaborative projects using a
distributed version control system, Git⁴. Git is free
to use and is supported on all major operating
systems. Distributed version control systems are
becoming more common than centralized systems, although
some distributed version control projects, including
many of ours, have a centralized design (De Alwis
and Sillito, 2009).
3 https://www.r-project.org/
4 https://git-scm.com/
In the simplest centralized design, a Git server
hosts the repository and each team member pushes
to, and pulls from, that server. It is possible to
have the individual team members’ repositories di-
rectly linked, but we did not use this option because
of network security concerns. Another option is to
have a bare repository on a network drive act as the
central code repository. We use that design for a mi-
nority of projects with unusually sensitive data. For
most projects, our team takes advantage of the inte-
grated issue tracker, web editing interface, and addi-
tional read/write permissions provided by a Git server.
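As a rough illustration of the bare-repository design described above, the following shell sketch creates a central bare repository and has two team members exchange commits through it. It is only a sketch under stated assumptions: a temporary directory stands in for the network drive, and the member names, file names, and the branch name `main` are hypothetical.

```shell
#!/bin/sh
set -eu

# Assumption: a temporary directory stands in for the network drive.
work=$(mktemp -d)
central="$work/network-drive/project.git"
mkdir -p "$work/network-drive"

# A bare repository has no working tree; it only receives pushes and serves pulls.
git init --quiet --bare "$central"
# Point the bare repository's HEAD at "main" so that later clones check it out.
git --git-dir="$central" symbolic-ref HEAD refs/heads/main

# First (hypothetical) team member starts the project and pushes it.
git init --quiet "$work/member1"
cd "$work/member1"
git symbolic-ref HEAD refs/heads/main
git config user.name "Member One"
git config user.email "member1@example.org"
echo 'summary(cars)' > analysis.R
git add analysis.R
git commit --quiet -m "Add analysis script"
git remote add origin "$central"
git push --quiet origin main

# Second team member clones from the central repository and contributes.
git clone --quiet "$central" "$work/member2"
cd "$work/member2"
git config user.name "Member Two"
git config user.email "member2@example.org"
echo '# Manuscript' > manuscript.md
git add manuscript.md
git commit --quiet -m "Add manuscript draft"
git push --quiet origin main

# First team member pulls the new commit from the central repository.
cd "$work/member1"
git pull --quiet --ff-only origin main
git log --oneline
```

No team member's repository is linked directly to another's in this sketch; every exchange passes through the bare repository, which mirrors the centralized design on a shared drive.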
Several public Git repository hosting sites exist. We
chose to use Atlassian’s Bitbucket⁵ to host our
repositories. At the time this choice was made,
Bitbucket allowed academic account holders unlimited
private repositories and unlimited collaborators.
Recently, GitHub has offered similar packages.
Code repositories solved the problems of dynamic
document authoring and collaboration, but we also
needed to track data set versions and limit data ac-
cess to approved team members without preventing
collaboration.
The solution was to use Git submodules. “Sub-
modules allow you to keep a Git repository as a subdi-
rectory of another Git repository. This lets you clone
another repository into your project and keep your
commits separate.” (Chacon and Straub, 2014). Also,
while the data files within the submodule exist in a
subdirectory and are visible in the working directory,
only the SHA-1 hash of the submodule’s commit is
stored in the primary project repository. Thus, when
the manuscript repository is pushed to bitbucket.org,
the only reference to the data is a 40-digit
hexadecimal number. The data never leaves the team
members’ machines, but the version of the data is
shared and documented among team members.
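The behavior described above can be sketched with a few Git commands. This is a minimal example under stated assumptions, not the authors' actual setup: the repository names (`data`, `manuscript`, `collaborator`), the file `records.csv`, and the identities are hypothetical, and temporary local directories stand in for the secure data store and the hosted repository. The `protocol.file.allow=always` setting is included because recent Git versions restrict submodules added from local paths.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)

# Hypothetical data repository; in the workflow it would live on secure storage.
git init --quiet "$work/data"
cd "$work/data"
git config user.name "Data Steward"
git config user.email "steward@example.org"
printf 'id,value\n1,42\n' > records.csv
git add records.csv
git commit --quiet -m "Add data set"
data_sha=$(git rev-parse HEAD)

# Manuscript repository that links to the data repository as a submodule.
git init --quiet "$work/manuscript"
cd "$work/manuscript"
git config user.name "Analyst"
git config user.email "analyst@example.org"
git -c protocol.file.allow=always submodule --quiet add "$work/data" data
git commit --quiet -m "Link data set as a submodule"

# The manuscript repository records the submodule as a commit entry
# (mode 160000) holding the data commit's SHA-1, not the data files.
git ls-tree HEAD data
echo "data commit: $data_sha"

# A collaborator who clones only the manuscript repository gets an empty
# data directory; a leading "-" in the status marks it as uninitialized.
git clone --quiet "$work/manuscript" "$work/collaborator"
git -C "$work/collaborator" submodule status
```

The hash printed by `git ls-tree` matches the data repository's commit, so collaborators can tell which version of the data an analysis used without holding the data. Note that `git submodule add` also writes a `.gitmodules` file recording the submodule's path and URL, so the location of the data repository, though not its contents, is visible in the manuscript repository.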
3 INFRASTRUCTURE
Below we describe how we have used existing in-
frastructure, open source software, and free hosting
services to generate reproducible reports while
protecting sensitive data. We designed the workflow
so that sensitive data is stored on a secure network
hard drive or on a whole-disk encrypted personal machine.
Data transfer between the network drive and a team
member’s machine only occurs on the institution’s
5 https://bitbucket.org
Collaborative Reproducible Reporting - Git Submodules as a Data Security Solution