Process Modeling for Privacy-Conformant Biobanking:

Case Studies on Modeling in UMLsec

Ralph Herkenhöner

Dept. for Computer Science, Christian-Albrechts-University, 24098 Kiel, Germany

Abstract. Continuing progress in human genetics research is sharply increasing the demand for large collections of data and biospecimens from voluntary donors. This new scale of acquiring and providing data and biospecimens has given rise to a new quality of biobanking. Automated handling of data and biospecimens, combined with modern communication channels such as the World Wide Web, poses new challenges for protecting donors' privacy. One result emerging from current discussions on privacy and data protection is the need to audit privacy and data protection within biobanks. For this purpose, finding a proper way to describe biobanks for a data protection audit is a vital issue. This paper presents how modeling in UMLsec can improve the description of biobanks with the objective of performing a data protection audit. It demonstrates the use of UMLsec for describing security characteristics regarding data protection on the basis of two case studies.

1 Introduction

Research on human genetics has advanced significantly during the last decades. This progress is primarily due to the ability to fully sequence human DNA. The complexity of acquisition and research has reached a scale that requires highly specialized acquisition and provision architectures. As a result, a new generation of biobanks is arising all over the world.

A biobank stores biospecimens and sensitive medical information of voluntary donors. For research, this medical information is enriched by genetic information, so-called genotypes, gained from the stored biospecimens. The possibility of directly combining medical, genetic, and identifying data demands appropriate safeguards for protecting the privacy of the donors.

In order to strengthen the integrity of biobanks and the trust of donors, new efforts are needed to audit biobanks regarding the soundness of their privacy protection measures. To obtain such an audit, the biobank discloses its internal data and biospecimen management, along with the corresponding protection measures, to an external, independent, accredited entity.

For performing a data protection audit, it is necessary to describe how data protection is integrated into the processes of a biobank. However, a textual description of a biobank as a complex system is very difficult to write in an understandable, complete, and consistent way. Therefore, it is good practice to use graphical modeling to get a better understanding of complex systems (e.g. in software development and process management). Nevertheless, graphical modeling cannot replace textual description.

Herkenhöner R. (2008).
Process Modeling for Privacy-Conformant Biobanking: Case Studies on Modeling in UMLsec.
In Proceedings of the 6th International Workshop on Security in Information Systems, pages 3-12
DOI: 10.5220/0001732900030012
SciTePress

For modeling biobank processes with the objective of performing a data protection

audit, the following requirements arise. An appropriate modeling language should:

– be easy to understand,

– require as little a priori knowledge as possible,

– allow a fairly complete description of all relationships between processes, and

– be described by a formal grammar.

Further, an appropriate modeling language should be able to describe:

– processes, roles, and their relations,

– data and control ﬂow, and

– security characteristics regarding privacy and data protection issues.

A common language for graphical modeling is the Unified Modeling Language (UML) [1]. Originally, UML was intended for use in software engineering. As software engineering integrates process automation into real-world processes, UML unifies modeling techniques for describing program architectures, real-world processes, and real-world environments. Nowadays, UML is used for a wide range of modeling tasks, including process management. Therefore, UML is considered a good candidate for modeling biobank processes.

Generally, processes, roles, and their relations can be modeled by UML use case diagrams, while data and control flow are modeled by UML activity diagrams. As UML by itself does not support the modeling of security characteristics, an appropriate extension is necessary. UMLsec is a prominent candidate that meets this requirement.

This paper presents two case studies demonstrating the use of UMLsec for describing biobanks with the objective of performing a data protection audit. Section 2 gives a brief overview of related work. Section 3 then introduces the UMLsec approach to modeling security requirements. Sections 4 and 5 present the case studies, modeling the GENOMatch and the popgen biobanks in UMLsec. Finally, conclusions and an outlook are given.

2 Related Work

UML has already been used for business process modeling, as presented by Kreische [2]. Further, the modeling of security characteristics was introduced by Jürjens, who extended UML to UMLsec [3]. There are also alternative approaches to process modeling. In its current version, the Business Process Modeling Notation (BPMN) [4] uses constructs similar to UML activity diagrams [5]. Further, the Event-driven Process Chain (EPC) introduced by Keller et al. [6] uses its own semi-formal modeling language.

Currently, several projects all over the world deal with auditing and standardizing privacy and data protection in biobanks (e.g. by the US National Cancer Institute [7], the UK Information Commissioner's Office [8], the German TMF [9] [10], and the Swiss Academy of Medical Sciences (SAMW) [11]). However, as all these projects focus on determining criteria for privacy and data protection, they do not address how a biobank should be described.

3 Modeling Security Characteristics Regarding Data Protection

The focus of this paper is the modeling of biobanks with respect to a data protection audit. For this purpose, it is necessary to describe how data protection is integrated into the processes of a biobank. As mentioned before, process modeling can be done in UML using use case and activity diagrams. Further, UMLsec enables the modeling of security characteristics using the extension mechanisms of UML. In a data protection audit, the achieved degree of data protection is evaluated according to the presence or absence of security characteristics within the process flows. Therefore, an important question is whether modeling in UMLsec can help to improve the description of biobanks for the objective of a data protection audit. Concerning the modeling of processes, roles, and their relations in a UMLsec-enriched use case diagram, we think it can.

In the following, three security characteristics that can be described by UMLsec are

examined, and subsequently their use is illustrated for two case studies.

3.1 Non-Repudiation

In general, non-repudiation is the assurance that no participant in an action can deny its participation. In UMLsec, this security characteristic is represented by the stereotype ≪provable≫. This stereotype extends a UML object with the property that certain use cases or activities inside the object are provable, and therefore undeniable. Concerning data protection in biobanks, this characteristic targets the demand for transparency and traceability of data and biospecimen handling, commonly achieved by recording all handling activities.

Formally, the stereotype ≪provable≫ requires three parameters, which describe the use cases that must be provable, the proof, and an adversary. In the following, these parameters are omitted from the diagrams because all use cases shown within an object are provable whenever the stereotype ≪provable≫ is present. For this stereotype, the adversary is always an insider threat with the same privileges as the involved actor.
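As an illustration of what "recording all handling activities" could look like in practice, the following minimal Python sketch keeps an append-only log in which each entry is chained to its predecessor by a SHA-256 digest. The class `AuditLog`, its methods, and the hash chaining are assumptions made for this example only; neither UMLsec nor the audited biobanks prescribe this mechanism.

```python
import hashlib
import json


class AuditLog:
    """Append-only record of handling activities (hypothetical, for illustration)."""

    def __init__(self):
        self.entries = []

    @staticmethod
    def _digest(actor, use_case, prev):
        # Hash the entry together with the digest of its predecessor.
        body = {"actor": actor, "use_case": use_case, "prev": prev}
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def record(self, actor, use_case):
        prev = self.entries[-1]["digest"] if self.entries else ""
        self.entries.append({
            "actor": actor,
            "use_case": use_case,
            "prev": prev,
            "digest": self._digest(actor, use_case, prev),
        })

    def verify(self):
        # Recompute the chain; any later modification breaks a digest.
        prev = ""
        for entry in self.entries:
            expected = self._digest(entry["actor"], entry["use_case"], prev)
            if entry["prev"] != prev or entry["digest"] != expected:
                return False
            prev = entry["digest"]
        return True
```

A chained log of this kind only makes later tampering detectable. Protecting the log itself against an insider with the actor's privileges would need further measures that are outside this sketch.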

3.2 Role-Based Access Control

Generally, role-based access control restricts access to systems or environments to authorized individuals. In UMLsec, this security characteristic is represented by the stereotype ≪rbac≫. This stereotype extends a UML object with the property that certain use cases or activities inside the object are restricted to certain actors only. Concerning data protection in biobanks, this characteristic addresses the requirement to limit access to data and biospecimens according to identity, duration, and amount.

Formally, the stereotype ≪rbac≫ requires parameters that describe the actors having access and the use cases or activities they access. In the following, these parameters are omitted due to their complexity.
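The two kinds of parameters (actors, and the use cases they may access) can be pictured as a simple mapping. The sketch below is a hedged illustration in Python; the table `PERMISSIONS` and the function `is_permitted` are hypothetical names, and the role and use case labels are borrowed from the GENOMatch case study purely as sample data.

```python
# Hypothetical assignment of roles to the use cases they may perform.
PERMISSIONS = {
    "Sample Code Exchanger": {"removing sample tubes", "storing sample tubes"},
}


def is_permitted(role, use_case, permissions=PERMISSIONS):
    """Return True only if the role is explicitly granted the use case."""
    return use_case in permissions.get(role, set())
```

Access is denied by default: a role or use case that is missing from the table is never permitted.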

3.3 Secured Communication

Secured communication is the concept of communicating only via secured links that fulfill requirements concerning confidentiality and integrity. In UMLsec, this security characteristic is represented by the stereotype ≪secure links≫. This stereotype extends a UML object with the property that interactions between certain use cases or activities inside the object and other use cases, activities, or actors use secured communication links only. For this, the following stereotypes are attached to the edges between the interacting objects (e.g. actor to use case, use case to use case):

– ≪secrecy≫ (communication is encrypted),
– ≪integrity≫ (communication is signed), and
– ≪high≫ (communication is signed and encrypted).

Concerning data protection in biobanks, this characteristic targets the need to keep control of biospecimen and information flow.

Formally, the stereotype ≪secure links≫ requires a parameter describing an adversary. Again, this parameter is omitted because the adversary is always an outsider threat with no privileges within the biobank.
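The rule that every interaction crossing into a ≪secure links≫ object must carry one of the three edge stereotypes lends itself to a simple mechanical check. The following Python sketch is only a syntactic illustration under that reading; it does not implement the formal UMLsec semantics with its adversary model. The names `Edge` and `unsecured_edges` are hypothetical, and the sample data is taken from the GENOMatch case study in Section 4.

```python
from dataclasses import dataclass
from typing import Optional

# Edge stereotypes named in the text above.
LINK_STEREOTYPES = {"secrecy", "integrity", "high"}


@dataclass(frozen=True)
class Edge:
    source: str                       # actor or use case
    target: str                       # actor or use case
    stereotype: Optional[str] = None  # None means the edge is untagged


def unsecured_edges(edges, secure_members):
    """Return edges that cross the border of a secure-links object untagged."""
    return [
        e for e in edges
        # Exactly one endpoint lies inside the secure-links object.
        if (e.source in secure_members) != (e.target in secure_members)
        and e.stereotype not in LINK_STEREOTYPES
    ]


# Use cases inside the SIM Center Database (sample data).
sim_center_db = {"generating BC2", "registering BC1-BC2 link"}
edges = [
    Edge("Sample Code Exchanger", "generating BC2", "secrecy"),
    Edge("Sample Code Exchanger", "registering BC1-BC2 link", "high"),
]
```

For a well-formed diagram, `unsecured_edges(edges, sim_center_db)` returns an empty list. Any edge it does return points at an interaction that an auditor would want to question.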

4 Case Study 1: The GENOMatch Biobank

As a first evaluation step, the GENOMatch biobank of Bayer Schering AG was modeled as a use case diagram. The evaluation aimed at determining whether modeling in UMLsec improves the description of biobanks with the objective of a data protection audit. The report on the data protection audit of the GENOMatch biobank in 2003 provides a basis for this, as it specifies all actors along with their activities. A brief report on this audit was published by the Independent State Center for Privacy Protection Schleswig-Holstein [14]. Excerpts of the full report were made public by Luttenberger et al. [12] [13]. In the following, the final process step within the pseudonymization (as defined by Pfitzmann and Hansen [15]) of the tubes storing the biospecimens, referred to below as sample tubes, is presented as an example from this case study.

Fig. 1 shows a UMLsec-enriched use case diagram of the second step of the pseudonymization within the storing process of sample tubes. In this step, the sample tubes are relabeled before they are finally stored within the biobank.

There are two participating facilities in this process: the external data custodian, represented by the SIM Center, which stores the pseudonym links of the biospecimen tube labels, and the biobank itself, known as the Central Sample Repository. The Central Sample Repository is divided into three areas of accountability. The first area, Safety Zone 1, is responsible for the biospecimen transfer from the Clinical Trial Site to the biobank and for removing the identifier labels of the Clinical Trial Site, which is the first step of pseudonymization. The second area, Safety Zone 2, is responsible for relabeling the sample tubes, which is the second step of pseudonymization. Storage and handling of the biospecimens for analysis and research take place within the third area, Safety Zone 3.

[Figure 1: use case diagram "2nd Step of Pseudonymisation" (package Schering). Biobank (Central Sample Repository): Safety Zone 3 (≪provable≫) with the subsystems Freezer 1 (≪rbac≫, use case "removing sample tubes") and Freezer 2 (≪rbac≫, use case "storing sample tubes"); Safety Zone 2 (≪provable≫) with the actor Sample Code Exchanger and the use cases "re-labeling sample tubes", "removing BC1 label", "adding BC2 label", and "destroying BC1 label". Data Custodian (SIM Center, ≪provable≫): subsystem SIM Center Database (≪secure links≫, ≪rbac≫) with the use cases "generating BC2" and "registering BC1-BC2 link". Edge stereotypes: ≪high≫, ≪secrecy≫.]

Fig. 1. UMLsec-enriched use case diagram of the second step of pseudonymization at GENOMatch (Safety Zone 1 is outside of this diagram).

As the SIM Center is a fully automatic, PET-protected database, the only actor in this use case is the Sample Code Exchanger within the biobank, who is responsible for relabeling the sample tubes in Safety Zone 2.

Relabeling the sample tubes includes the following activities: getting the sample tubes from Freezer 1 in Safety Zone 3, getting a new pseudonym from the SIM Center, saving the link between the new and the exchanged pseudonym at the SIM Center, and storing the relabeled sample tubes in Freezer 2 in Safety Zone 3.
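The pseudonym exchange at the core of this step can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the class `SimCenter`, its methods, the random token format, and the dictionary representation of a tube are hypothetical and do not describe the actual GENOMatch implementation.

```python
import secrets


class SimCenter:
    """Stand-in for the external data custodian (hypothetical interface)."""

    def __init__(self):
        self._links = {}  # maps new pseudonym (BC2) to old pseudonym (BC1)

    def generate_bc2(self):
        # A random token stands in for a freshly generated pseudonym.
        return secrets.token_hex(8)

    def register_link(self, bc1, bc2):
        self._links[bc2] = bc1


def relabel(tube, sim_center):
    """Replace the tube's BC1 label by a new BC2 label and register the link."""
    bc2 = sim_center.generate_bc2()
    sim_center.register_link(tube["label"], bc2)
    # The returned tube no longer carries the old label.
    return {**tube, "label": bc2}
```

After relabeling, the link between old and new label exists only inside the `SimCenter` object, which mirrors the separation between the biobank and the data custodian described above.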

All these activities involve interaction with areas outside the Sample Code Exchanger's area of accountability, Safety Zone 2. This implies the existence of interfaces between Safety Zone 2 and Safety Zone 3 and between Safety Zone 2 and the SIM Center. In Figure 1, these interfaces are visible as edges crossing the border of Safety Zone 2.
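Finding such interfaces amounts to listing the edges whose endpoints lie in different areas of accountability. The Python sketch below shows this under stated assumptions: the assignment `AREA_OF` and the edge list `EDGES` are reconstructed from the description in the text, not read from the original diagram, and the function name `interfaces` is hypothetical.

```python
# Area of accountability for each actor or use case (reconstructed from the text).
AREA_OF = {
    "Sample Code Exchanger": "Safety Zone 2",
    "re-labeling sample tubes": "Safety Zone 2",
    "removing sample tubes": "Safety Zone 3",
    "storing sample tubes": "Safety Zone 3",
    "generating BC2": "SIM Center",
    "registering BC1-BC2 link": "SIM Center",
}

# Interactions of the Sample Code Exchanger (assumed edge list).
EDGES = [
    ("Sample Code Exchanger", "re-labeling sample tubes"),
    ("Sample Code Exchanger", "removing sample tubes"),
    ("Sample Code Exchanger", "storing sample tubes"),
    ("Sample Code Exchanger", "generating BC2"),
    ("Sample Code Exchanger", "registering BC1-BC2 link"),
]


def interfaces(edges, area_of):
    """Return the sorted pairs of areas that are connected by at least one edge."""
    pairs = set()
    for a, b in edges:
        if area_of[a] != area_of[b]:
            pairs.add(tuple(sorted((area_of[a], area_of[b]))))
    return sorted(pairs)
```

With the sample data, `interfaces(EDGES, AREA_OF)` yields the two interfaces named in the text: Safety Zone 2 with Safety Zone 3, and Safety Zone 2 with the SIM Center.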

As the security-sensitive sample tubes are stored within Safety Zone 3, every activity within this zone must be recorded. In the diagram, this requirement is represented by the UMLsec stereotype ≪provable≫ attached to Safety Zone 3. It indicates that the removal and storage of sample tubes by the Sample Code Exchanger must be recorded.

Beyond that, access to the freezers is restricted by PET-enforced role-based access control. In the diagram, this is represented by the UMLsec stereotype ≪rbac≫ attached to both Freezer 1 and Freezer 2.

Even more sensitive are the pseudonym links stored in the SIM Center Database. Analogous to Safety Zone 3, every activity within the SIM Center must be recorded (indicated by the UMLsec stereotype ≪provable≫) and requires authentication and authorization by PET-enforced role-based access control (indicated by the UMLsec stereotype ≪rbac≫). Unlike the freezers, however, the SIM Center Database is not part of the biobank. For this reason, data exchange must take place via PET-secured links. In Figure 1, the use of secured links is indicated by the UMLsec stereotype ≪secure links≫ attached to the SIM Center Database. This means that every communication link to actors outside the SIM Center Database must provide a defined level of security. The most sensitive activity in this use case is saving the pseudonym link. This activity requires a highly secured link that meets requirements concerning both confidentiality and integrity. In the diagram, this is represented by the UMLsec stereotype ≪high≫ at the edge that links the Sample Code Exchanger to the use case associated with this activity. In contrast, for the generation of a new pseudonym, a confidential link, indicated by the UMLsec stereotype ≪secrecy≫, meets the security requirements.

5 Case Study 2: The Popgen Biobank

As a next evaluation step, the popgen biobank of the University Medical Center Schleswig-Holstein was modeled, taking the report on data management at popgen by Eller-Eberstein et al. [16] as a basis. This report describes the flow of data and biospecimens within the processes of collection, sampling, storing, and research in the popgen architecture. In this paper, the process steps regarding anonymization and research are presented as an example from this case study.

Footnote: PET stands for Privacy Enhancing Technology.

Figure 2 presents a UMLsec-enriched use case diagram of merging and statistical analysis at popgen. There are four participating facilities in this process:

– the Pseudonymization Center (providing a Pseudonymization Service for forwarding and pseudonymizing data and biospecimens and for storing the pseudonym links; it acts as an intermediary for every communication between the other facilities),
– the Study Center (responsible for recruitment, data and biospecimen collection, and providing the phenotypes),
– the Analysis Labor (extracting DNA from the biospecimens, genotyping, and providing the genotypes), and
– the Statistical Research Center (merging pheno- and genotypes, anonymizing, and providing statistical analysis).

As the Pseudonymization Service at the Pseudonymization Center, the Phenotype Database at the Study Center, and the Genotype Database at the Analysis Labor are fully automatic and PET-protected, the only actors in this diagram are the Data Custodian and the Statistical Analyst at the Statistical Research Center.

The Statistical Analyst is responsible for statistically analyzing the pheno- and genotypes for correlations. This analysis provides a basis for research at popgen.

Prior to statistical analysis, the necessary pheno- and genotypes must be provided and, in conformance with the data protection policy at popgen, the provided data must be anonymized. These activities lie within the accountability of the Data Custodian. He requests the necessary pheno- and genotypes, merges them by PsID identifier (that is, per donor), removes all pseudonyms (which, in the case of popgen, anonymizes the data), and forwards the merged and anonymous data to the Statistical Analyst.
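The Data Custodian's merge-then-strip step can be pictured as a join on the PsID followed by dropping the join key. The Python sketch below is a simplified illustration; the function name `merge_and_anonymize` and the record layout (dictionaries keyed by PsID) are assumptions, not the popgen data format.

```python
def merge_and_anonymize(phenotypes, genotypes):
    """Join phenotype and genotype records per donor, then drop the PsID.

    Both arguments map a PsID to a record. Only donors present in both
    inputs are merged. The returned records carry no pseudonym.
    """
    merged = []
    for ps_id in sorted(phenotypes.keys() & genotypes.keys()):
        merged.append({
            "phenotype": phenotypes[ps_id],
            "genotype": genotypes[ps_id],
        })
    return merged
```

Whether removing the pseudonym is sufficient for anonymity depends on the data itself. The text states that it is for popgen; the sketch does not check this.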

Requesting the necessary pheno- and genotypes in popgen is supported by the fully automatic and PET-protected Pseudonymization Service. This service acts as an intermediary between the Data Custodian and both the Phenotype Database at the Study Center and the Genotype Database at the Analysis Labor. Further, the Pseudonymization Service exchanges the pseudonyms on pheno- and genotypes to enable merging per donor. As this service provides access to pheno- and genotypes that are linked per donor, every activity in the Pseudonymization Center must be recorded, as indicated by the UMLsec stereotype ≪provable≫. In addition, usage of the Pseudonymization Service is protected by role-based access control, indicated by the UMLsec stereotype ≪rbac≫, and limited to the Data Custodian. Communication is limited to highly secured links, indicated by the UMLsec stereotype ≪secure links≫ attached to the Pseudonymization Service and the UMLsec stereotype ≪high≫ at the edges linking internal use cases with the outside.

Analogously, activities within the Analysis Labor and the Study Center must be recorded, and access to the Phenotype and Genotype Databases takes place under PET-enforced role-based access control and via PET-secured links.

[Figure 2: use case diagram "Merging and Statistically Analyzing" (package popgen). Pseudonymization Center (RZ UK-SH, ≪provable≫) with the subsystems Pseudonymization Service (≪secure links≫, ≪rbac≫) and Pseudonym Database; use cases: "mapping LabC identifier to RC identifier", "mapping RC identifier to PsID identifier", "exchanging LabC identifier with PsID identifier", "exchanging RC identifier with PsID identifier", "forwarding request on genotypes with PsID identifier", "forwarding request on phenotypes with PsID identifier". Statistical Research Center (IMIS) with the actors Data Custodian and Statistical Analyst; use cases: "merging phenotype and genotype data by PsID identifier", "anonymize merged data", "analyse pheno- and genotypes". Analysis Labor (IKMB, ≪provable≫) with the subsystem Genotype Database (≪rbac≫, ≪secure links≫); use case: "providing genotypes with LabC identifier". Study Center (KAIM, ≪provable≫) with the subsystem Phenotype Database (≪rbac≫, ≪secure links≫); use case: "providing phenotypes with RC identifier". Edge stereotype: ≪high≫.]

Fig. 2. UMLsec-enriched use case diagram of merging and statistically analyzing at popgen.

6 Conclusions

To summarize, the case studies presented in this paper show that it is useful to describe processes, roles, and their relations within biobanks using UMLsec-enriched use case diagrams. Furthermore, they demonstrate how the modeling supports the description of biobanks for a data protection audit:

1. Interfaces between different areas of accountability are visible as edges crossing their borders. This makes hidden data flow easier to detect.
2. The use of recording, role-based access control, and secured communication is modeled by UMLsec stereotypes. This enriches the model with important information necessary for the evaluation of data flow.

Thus, modeling processes in biobanks using UMLsec-enriched use case diagrams significantly improves the description of a biobank with the objective of a data protection audit.

7 Future Work

Even though use case diagrams meet the requirements for describing biobanks for data protection audits rather well, the following questions arise:

1. What about activity diagrams? As activity diagrams are used for modeling data and control flows, they might be a powerful complement for describing biobanks in terms of data protection with respect to responsibilities.
2. Is it possible to describe other characteristics concerning data protection (e.g. characteristics of pseudonymization and anonymization, and responsibilities for data and control flow)? If so, is it necessary and possible to extend UMLsec significantly for this purpose? UMLsec allows such extensions. Experience from the case studies suggests that the responsibilities for data and control flow could be modeled by a newly defined stereotype ≪responsible≫.
3. As UMLsec provides a basis for proving that security is achieved, could UMLsec serve as a basis for a formal argument within the data protection audit that meets the standard of proving security in safety-critical systems presented by Jürjens?

References

1. Unified Modeling Language: Superstructure. Version 2.1.1 (formal/2007-02-03). Object Management Group. http://www.omg.org/docs/formal/07-02-03.pdf
2. Kreische, D.: Geschäftsprozessmodellierung mit der "Unified Modeling Language (UML)" (in German). Dissertation at the University Erlangen-Nürnberg (2004). http://deposit.ddb.de/cgi-bin/dokserv?idn=972544232
3. Jürjens, J.: Secure systems development with UML. Springer-Verlag, Berlin Heidelberg New York (2005)
4. Business Process Modeling Notation Specification. Final Adopted Specification (dtc/2006-02-01). Object Management Group. http://www.bpmn.org/Documents/OMG%20Final%20Adopted%20BPMN%201-0%20Spec%2006-02-01.pdf
5. White, S. A.: Process Modeling Notations and Workflow Patterns. Object Management Group, Business Process Management Initiative (2004). http://www.bpmn.org/Documents/Notations%20and%20Workflow%20Patterns.pdf
6. Keller, G., Nüttgens, M., Scheer, A.-W.: Semantische Prozeßmodellierung auf der Grundlage "Ereignisgesteuerter Prozeßketten (EPK)" (in German). Scheer, A.-W. (Hrsg.): Veröffentlichungen des Instituts für Wirtschaftsinformatik, Nr. 89. Saarbrücken (1992)
7. Best Practices for Biospecimen Resources. National Cancer Institute (2007). http://biospecimens.cancer.gov/practices/
8. Data Protection - Complete Audit Guide. The Information Commissioner's Office, UK. http://www.ico.gov.uk/upload/documents/library/data_protection/detailed_specialist_guides/data_protection_complete_audit_guide.pdf
9. Reng, C.-M., Dembold, P., Specker, Ch., Pommerening, K.: Generische Lösungen zum Datenschutz für die Forschung in der Medizin (in German). Medizinisch Wissenschaftliche Verlagsgesellschaft, Berlin (2006)
10. Pommerening, K., Schröder, M., Petrov, D., Schlösser-Faßbender, M., Semler, S.C., Drepper, J.: Pseudonymization Service and Data Custodians in Medical Research Networks and Biobanks. GI Jahrestagung (1) 2006: 715-721
11. Biobanks: Obtainment, preservation and utilisation of human biological material. Swiss Academy of Medical Sciences (SAMS), Basel, Switzerland (2006). http://www.samw.ch/docs/Richtlinien/e_RL_Biobanken.pdf
12. Luttenberger, N., Reischl, J., Schröder, M., Stürzebecher, C.S.: Datenschutz in der pharmakogenetischen Forschung - eine Fallstudie (in German). DuD Datenschutz und Datensicherheit 28(6) (2004)
13. Luttenberger, N., Stürzebecher, C.S., Reischl, J., Schröder, M.: Der elektronische Datentreuhänder (in German). DIGMA Zeitschrift für Datenrecht und Informationssicherheit 5, 1, pages 24-29, 3 2005
14. Brief Report on the Data Protection Audit. Independent State Centre for Privacy Protection Schleswig-Holstein (2003). https://www.datenschutzzentrum.de/audit/kurzgutachten/a0303/a0303_engl.htm
15. Pfitzmann, A., Hansen, M.: Anonymity, Unlinkability, Undetectability, Unobservability, Pseudonymity, and Identity Management - A Consolidated Proposal for Terminology. http://dud.inf.tu-dresden.de/Anon_Terminology.shtml
16. v. Eller-Eberstein, H., Gundermann, L., Krawczak, M., Schreiber, S., Wolf, A.: Datenmanagement bei popgen (in German). GI Jahrestagung (1) 2006: 729-735