DATA INTEGRATION THROUGH THE CLOUD
How to Combine Internal and External Data Sources – A Design Study
Patrik Hitzelberger, Paulo da Silva Carvalho and Fernand Feltz
Centre de Recherche Public Gabriel Lippmann, 41 rue du Brill, L-4422 Belvaux, Luxembourg
Keywords: Data Integration, Open Data, Cloud Computing.
Abstract: This short paper focuses on the application of cloud computing principles and solutions to the domain of
data integration. After an introduction to the topic, data integration is briefly discussed, and some quality
criteria for data integration solutions, including infrastructure and the organizational context, are presented.
Afterwards, cloud computing and possible cloud-based data integration scenarios are discussed. The
aforementioned quality criteria are revisited, especially with respect to public cloud deployment scenarios. Finally, a
design study for the examination of cloud-based data integration that focuses on open data integration for an
environmental data management application is proposed.
1 INTRODUCTION
The integration of information systems is a
fundamental task for software architects and
developers. Academia and industry have been
working in this field for at least two decades now
(cf. e.g. Hasselbring, 2000). This has resulted in a
plethora of bespoke software architecture patterns
(Berbner et al., 2005) and products for information
systems integration (Bernstein and Haas, 2008) that
cover certain aspects of integration.
This short paper focuses on data integration as
one of these aspects, and examines opportunities and
challenges of the cloud paradigm applied to this
field. We will argue that cloud-based data
integration is probably not a panacea for all
integration needs, and that it must be scrutinized
which technical architecture best fits an
organization and its requirements.
The paper consists of two main parts. We first
briefly analyse the state of the art of data
integration and data integration in the cloud, and
examine some ongoing work. In the second part, we
present research questions and our recently started
research project in the domain of cloud-supported
data integration with a focus on open data
integration (Miller et al., 2008).
2 DATA INTEGRATION
Lenzerini (2002) defines data integration as “the
problem of combining data residing at different
sources, and providing the user with a unified view
of these data”.
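As a minimal illustration of this definition, the following Python sketch maps two sources with different schemas onto one unified view. The record layouts, the field names and the `Measurement` type are hypothetical and serve only as an example; they are not taken from any system discussed in this paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Measurement:
    """Unified schema presented to the user (hypothetical)."""
    site: str
    parameter: str
    value: float
    source: str


def from_internal(rows: List[Dict]) -> List[Measurement]:
    # Assumed internal layout: site_id / param / val
    return [Measurement(r["site_id"], r["param"], float(r["val"]), "internal")
            for r in rows]


def from_external(rows: List[Dict]) -> List[Measurement]:
    # Assumed external (open data) layout: station / variable / reading
    return [Measurement(r["station"], r["variable"], float(r["reading"]), "external")
            for r in rows]


def unified_view(internal: List[Dict], external: List[Dict]) -> List[Measurement]:
    """Combine both sources into one list that follows the unified schema."""
    return from_internal(internal) + from_external(external)


if __name__ == "__main__":
    view = unified_view(
        [{"site_id": "S1", "param": "pH", "val": "6.4"}],
        [{"station": "S2", "variable": "pH", "reading": 6.6}],
    )
    for measurement in view:
        print(measurement)
```

The sketch only covers schema mapping; semantic differences between sources (units, code lists, identifiers) would need additional, source-specific rules.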
Hasselbring (2000) has identified three
dimensions for describing the integration
problem: distribution, autonomy and
heterogeneity of the underlying technical systems,
organisations and data. Data to be integrated can
exist on geographically distributed systems run by
different organisations, in different formats.
Advanced data integration technologies and tools
that tackle the inherent integration problems are a
prerequisite for application domains like BI, CRM
and Master Data Management. The focus of such
tools used to be on read-only applications, meaning
that the integrated data view does not change the
data. Read-and-write scenarios, where integrated
data is also modified, seem to be becoming more
important, however (Yahanna and Gilpin, 2012).
Data integration increasingly concerns
electronic information that is available and shared on
the Internet, coming from different origins and
sources: private companies (e.g. annual reports),
public entities and governments. The data concerned
is either an implicit result of the growth of the
Internet, or explicitly fostered by open data and
similar initiatives that aim at using the Internet for
public data access (Dekkers et al., 2006).
There are different scenarios for the combination
of organisation-internal data with external, public or
non-public sources. Such a combination can, for
example, foster or permit the cooperation between
enterprises, government agencies and other private
or public entities (Halevy et al., 2006). In unilateral
scenarios, organisations can integrate external data
in order to improve internal analytical or reporting
processes. Other projects focus on the external
integration of open data for public re-use, and tackle
the paramount challenge of integrating
(semi-)structured, but semantically different data
(Böhm et al., 2010). As an example, we focus on the
application of data integration in environmental
research, where the integration of internal and
external data is becoming more and more important.
Hitzelberger P., da Silva Carvalho P. and Feltz F.
DATA INTEGRATION THROUGH THE CLOUD - How to Combine Internal and External Data Sources – A Design Study.
DOI: 10.5220/0003961001770182
In Proceedings of the 2nd International Conference on Cloud Computing and Services Science (CLOSER-2012), pages 177-182
ISBN: 978-989-8565-05-1
Copyright © 2012 SCITEPRESS (Science and Technology Publications, Lda.)
The original objective of data integration is to
generate data sets with new, added value, by
homogenising and cleaning the original data sources
(Halevy et al., 2006). Cloud computing extends this
idea by virtualizing the necessary IT infrastructure.
The general advantages of such a virtualization are
known (Al-Zoube, 2009). We focus in the next
paragraphs on some of the goals and quality criteria
for data-centric integration solutions. These criteria
are based on the assumption that data is being
regarded as an asset by organisations that have to
report a “single version of the truth” (Khatri and
Brown, 2010). Industry markets its own data
integration solutions under names such as
Information as a Service and Data Virtualization,
with numerous applications in all sectors. In this
paper, we stick to
the notion of data integration, because all terms
converge on the integration of heterogeneous data
sources.
Theoretically, the data from these sources can be
copied, migrated or accessed on-line. If external data
is integrated, it might be inevitable to persist and
copy this data, because it is either required to
snapshot the actual state of data that changes over
time, or because the (long-term) online availability
of external sources cannot be guaranteed. The more
external sources are integrated, the more this issue
becomes relevant. Given that, there are obviously
integration scenarios where scalable data persistence
mechanisms are indispensable for many or all data
sources, and where online access and integration at
runtime are not sufficient. Cloud computing seems to
offer such mechanisms.
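The need to persist the state of an external source at a given time can be sketched as follows. This is a minimal example under stated assumptions: the in-memory `store` dictionary stands in for a scalable persistence layer, and the function name `snapshot`, the record layout and the source identifier are hypothetical.

```python
import hashlib
from datetime import datetime, timezone
from typing import Dict, List, Optional


def snapshot(store: Dict[str, List[dict]], source: str, payload: bytes,
             taken_at: Optional[datetime] = None) -> dict:
    """Persist a timestamped copy of external data unless it is unchanged."""
    taken_at = taken_at or datetime.now(timezone.utc)
    digest = hashlib.sha256(payload).hexdigest()
    versions = store.setdefault(source, [])
    # Skip the copy when the content equals the most recent snapshot.
    if versions and versions[-1]["sha256"] == digest:
        return versions[-1]
    record = {
        "taken_at": taken_at.isoformat(),
        "sha256": digest,
        "payload": payload,
    }
    versions.append(record)
    return record


if __name__ == "__main__":
    store: Dict[str, List[dict]] = {}
    snapshot(store, "open-data/soil.csv", b"site,ph\nS1,6.4\n")
    snapshot(store, "open-data/soil.csv", b"site,ph\nS1,6.4\n")  # unchanged
    snapshot(store, "open-data/soil.csv", b"site,ph\nS1,6.6\n")  # new version
    print(len(store["open-data/soil.csv"]), "versions kept")
```

Keeping a content hash next to each copy avoids storing identical versions twice, which matters as the number of external sources grows.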
In the next paragraphs, we will present some
important criteria for evaluating data integration
solutions. The list is not exhaustive, but tries to
present some key issues that we will examine in our
further work.
2.1 Data Quality
There is rich literature about information and data
quality models and criteria. Naumann and Rolker
(2000) have already emphasised that “Information
quality (IQ) is one of the most important aspects of
information integration on the Internet” (which is
only partly our objective, but applies to any
integration scenario). They identified 22 information
quality criteria, classified into subject (e.g.
relevancy, comprehensibility), object (e.g.
completeness, price) and process criteria (e.g.
accuracy, availability). Khatri and Brown (2010),
who are less exhaustive and more focused on
organisational data governance, enumerate
accuracy, timeliness, completeness, and credibility.
The quality and heterogeneity of the data sources,
and the number of sources to integrate, obviously
affect the resulting integration solutions.
Furthermore, the criteria might differ from one
source to another.
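Two of the criteria named above, completeness and timeliness, can be made measurable per source. The following sketch uses simple formulas that are our own illustrative assumptions, not definitions taken from Naumann and Rolker (2000) or Khatri and Brown (2010); the function names and the `max_age_days` threshold are hypothetical.

```python
from datetime import date
from typing import Dict, List


def completeness(records: List[Dict], required_fields: List[str]) -> float:
    """Fraction of required field values that are present and non-empty."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 0.0  # nothing to assess; treated as not complete (assumption)
    filled = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "")
    )
    return filled / total


def timeliness(last_updated: date, today: date, max_age_days: int) -> float:
    """Linear score: 1.0 for data updated today, 0.0 at or beyond max_age_days."""
    age_days = (today - last_updated).days
    return max(0.0, 1.0 - age_days / max_age_days)


if __name__ == "__main__":
    rows = [{"site": "S1", "ph": 6.4}, {"site": "S2", "ph": None}]
    print(completeness(rows, ["site", "ph"]))
    print(timeliness(date(2012, 1, 1), date(2012, 1, 11), max_age_days=20))
```

Computing such scores separately for each source makes it visible when an external source falls short of the criteria that are applied to internal data.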
2.2 Technical Infrastructure
Data can only be accessed on sound information
infrastructures. As for data quality, there are
different metrics and requirements for assessing IT
infrastructures, such as:
- reliability, with criteria like fail-safety and
  redundancy, backup and long-term archiving
  features and so on,
- performance and scalability of hardware,
  software and network infrastructure elements.
Data integration solutions require specific IT
infrastructures. They can e.g. produce enormous data
volumes to manage, resulting in extended
requirements for storage, backup and network
capacities.
A detailed discussion of all related IT
infrastructure standards, such as ISO 27001 and ITIL,
and the related literature is outside the scope of this
short paper.
2.3 Process and Compliance
The quality of IT solutions is a result of technical,
but also procedural and organisational measures.
Quality criteria can only be defined relative to
domain and application specific requirements.
Regarding requirements that are specific to data
integration, compliance and legal frameworks can
define which data has to be stored and managed in
what way, e.g. with respect to data retention times,
data life-cycles, data location, and data protection
provisions. Reporting obligations become more and
more important in
CLOSER2012-2ndInternationalConferenceonCloudComputingandServicesScience
178
sectors like finance and health (Anderson et al.,
2011).
The measures, processes and organisational
decisions necessary to reach and maintain
compliance with the sector-specific frameworks can
be denoted as Data Governance (Khatri and Brown,
2010). Replacing crucial parts of existing data
integration solutions has a major impact on it.
2.4 Costs
Naumann and Rolker (2000) define price as a data
quality criterion (see above). Any information
system technology assessment must take into
account and compare investment and operational
costs of the solutions. Given that cost reduction is
one of the central arguments in the marketing of
cloud offers, research must focus on costs and the
related payment models and their potential impact
on data integration. The objective is to evaluate
cloud-based data integration in comparison to
conventional solutions.
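A first, deliberately simplified way to compare the two payment models is a break-even calculation. The sketch below assumes a one-off investment plus a fixed monthly operating cost on premises, and a purely usage-based monthly price in the cloud; all function names, parameters and figures are hypothetical, and real cost models (including integration and migration costs) are more complex.

```python
from typing import Optional


def on_premises_cost(investment: float, monthly_operation: float, months: int) -> float:
    """Cumulative cost: one-off investment plus fixed monthly operation."""
    return investment + monthly_operation * months


def cloud_cost(monthly_units: float, price_per_unit: float, months: int) -> float:
    """Cumulative cost of a purely demand-oriented payment model."""
    return monthly_units * price_per_unit * months


def break_even_month(investment: float, monthly_operation: float,
                     monthly_units: float, price_per_unit: float,
                     horizon_months: int = 120) -> Optional[int]:
    """First month in which cumulative cloud cost reaches the on-premises cost.

    Returns None if this does not happen within the horizon.
    """
    for month in range(1, horizon_months + 1):
        cloud = cloud_cost(monthly_units, price_per_unit, month)
        local = on_premises_cost(investment, monthly_operation, month)
        if cloud >= local:
            return month
    return None


if __name__ == "__main__":
    # Invented figures, for illustration only.
    print(break_even_month(investment=12000, monthly_operation=500,
                           monthly_units=1000, price_per_unit=1.0))
```

If the function returns None, the usage-based model stays cheaper over the whole horizon under the assumed figures.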
2.5 Other Criteria
Quality and assessment criteria for data integration
solutions can be detailed further, as mentioned
above. Process-related criteria like speed and ease of
development, maintenance costs etc. (D'Agostino et
al., 2010) should be taken into account. If the focus
of the solution is external collaboration and/or data
integration, the weighting of the criteria changes
considerably (Doelitzscher et al., 2011).
In the next chapter we discuss some of the
presented criteria with respect to cloud data
integration solutions.
3 CLOUD DATA INTEGRATION
3.1 Cloud Technologies
There are many definitions of cloud computing and
cloud technologies. We follow Baars and Kemper
(2010) by abstracting from the details and looking at
cloud computing as “a distributed, net-based
architecture where resources can be dynamically
rearranged”. It seems commonly agreed that:
- the technical access to this architecture is
  service-oriented,
- infrastructure, platform and software (IaaS,
  PaaS and SaaS) are the main layers for
  accessing it,
- public and private clouds are the primary
  deployment models, while hybrid and community
  forms are possible (Mell and Grance, 2011),
- payment and scalability are demand-oriented.
Technology-wise, cloud solutions offer potential
for:
- a complete virtualization of data integration by
  migrating all relevant data sources and
  infrastructure to the cloud,
- mixed scenarios where only parts and/or copies
  of the internal data to integrate are put into the
  cloud, and an on-premises infrastructure is
  kept,
- on-premises integration where cloud
  technologies are used in a private cloud. In
  such deployment models, data does not leave
  the organisation.
Some available SaaS applications address typical
data-integration driven domains, like BI or CRM,
offered either in public or private clouds. They offer
supporting tools and standard connectors for
migration of and/or the interfacing to internal
applications, but are restricted to specific application
domains and organisational data. There are also
more versatile tools that are marketed as universal
data integration platforms, coming from well-
established business players in this domain.
It seems that, independently of buy-or-build
decisions when investing in the cloud, the crucial
data-integration issue is the chosen
deployment model. If public or hybrid cloud models
are part of the solution, some or all data has to leave
the concerned organisation(s), either as a copy or after
migration. Initially, this disintegrates existing IT
processes and infrastructures. In-house private cloud
solutions seem to be less disintegrating, and are
considered by some authors as possible transitional
solutions on the way to public cloud solutions
(Géczy et al., 2012).
Both scenarios require the development of
service-oriented access to the integrated data. This
results in a “Data-as-a-service” view on data assets.
3.2 Data Integration by and in the Cloud
In the following, we briefly discuss the general
criteria and characteristics of data integration that
have been introduced in chapter 2. We argue that it
is vital to understand the potential and risks of using
cloud computing for data integration. Generally, the
literature confirms that it is at least questionable
whether every organization should move all or parts
of its
DATAINTEGRATIONTHROUGHTHECLOUD-HowtoCombineInternalandExternalDataSources-ADesign
Study
179
data assets into the cloud. Kim (2009) asks, for example:
- Which information must be moved?
- Which information cannot be moved?
- What is the availability of that kind of system?
- Is the system performance affected?
- What is the level of security of that kind of
  system? Is it trustworthy?
The next subsections sketch some of these
issues.
3.2.1 Data Quality
A cloud computing approach does not modify the
requirements for data quality. There are proven and
mature tools that permit data cleansing for duplicate
detection, data fusion etc. (Halevy et al., 2006). As
soon as external data sources are integrated, data
quality and lineage become more difficult to attain.
Data access tools and layers have to be
redesigned and adapted to new languages and
service-oriented methods, and data models
might change.
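The duplicate detection and data fusion steps mentioned in this subsection can be illustrated with a small sketch. It is not one of the mature tools referred to above: matching relies on a naive normalisation of key fields, and the fusion rule (first non-empty value wins) as well as all names are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple


def normalise(value: object) -> str:
    """Lower-case the value and collapse whitespace before comparing keys."""
    return " ".join(str(value).lower().split())


def fuse_duplicates(records: Iterable[Dict], key_fields: List[str]) -> List[Dict]:
    """Group records by normalised key and fuse each group into one record.

    Fusion rule (assumption): the first non-empty value per field is kept.
    """
    fused: Dict[Tuple[str, ...], Dict] = {}
    for record in records:
        key = tuple(normalise(record[field]) for field in key_fields)
        target = fused.setdefault(key, {})
        for field, value in record.items():
            if value in (None, ""):
                continue  # an empty value never overwrites anything
            if target.get(field) in (None, ""):
                target[field] = value
    return list(fused.values())


if __name__ == "__main__":
    rows = [
        {"site": "Belvaux  North", "ph": None, "origin": "internal"},
        {"site": "belvaux north", "ph": 6.5, "origin": "open data"},
    ]
    print(fuse_duplicates(rows, ["site"]))
```

With external sources, the origin of each fused value would additionally have to be recorded in order to preserve data lineage.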
3.2.2 Technical Infrastructure
One of the central arguments of cloud providers is
the fact that clouds offer the dematerialization of the
management of data assets by replacing complex
and expensive internal infrastructures by on-demand
cloud solutions. This is the case for migration
solutions into public clouds.
Especially for solutions with high volumes of
data to integrate, organizations must be aware of the
fact that the current public cloud solutions depend
on the public internet as transport layer. Current
internal SAN solutions for storage reach 1-10 GB/s
guaranteed network performance, which is orders
of magnitude faster than the vast majority of internet
connections available, especially for small and
medium enterprises and organisations.
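The effect of this gap can be estimated with simple arithmetic. The sketch below compares the lower SAN figure quoted above (1 GB/s) with an internet uplink of 100 Mbit/s; the uplink rate and the data volume of 1 TB are assumptions chosen for illustration, and protocol overhead is ignored.

```python
def mbit_to_gb_per_s(rate_mbit_per_s: float) -> float:
    """Convert megabits per second to gigabytes per second (decimal units)."""
    return rate_mbit_per_s / 8 / 1000


def transfer_hours(volume_gb: float, rate_gb_per_s: float) -> float:
    """Hours needed to move volume_gb gigabytes at a sustained rate."""
    if rate_gb_per_s <= 0:
        raise ValueError("rate must be positive")
    return volume_gb / rate_gb_per_s / 3600


if __name__ == "__main__":
    volume_gb = 1000.0  # 1 TB to integrate (assumed)
    san = transfer_hours(volume_gb, 1.0)                       # internal SAN
    uplink = transfer_hours(volume_gb, mbit_to_gb_per_s(100))  # assumed uplink
    print(f"SAN: {san:.2f} h, internet uplink: {uplink:.2f} h")
```

Under these assumed figures, moving 1 TB takes roughly 17 minutes over the SAN and roughly 22 hours over the uplink, which is relevant whenever large volumes have to be copied or migrated to a public cloud.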
Also, it seems to be difficult, based on the SLA
models of the available public cloud providers, to
define a clear level of reliability for end-to-end
processes when cloud infrastructure elements are
part of the solutions (Géczy et al., 2012). Recently,
data loss and outages of large public cloud providers
have been reported (Blodget, 2011).
3.2.3 Process and Compliance
The externalization of data to a public cloud requires
adapting (or re-integrating) elementary parts of an
existing IT infrastructure and the related
organizational and process frameworks.
Accountability and compliance with legal and
sector-specific frameworks might involve implicit or
explicit provisions that hamper or prevent cloud
projects. This is mainly due to the fact that public
cloud solutions represent two fundamental
modifications of existing on-premises solutions
(Adkinson-Orellana et al., 2011):
- Data is moved or copied outside the
  organisation.
- Data is managed or stored by a third party.
In contrast to conventional outsourcing solutions
where dedicated resources are run and managed by
third parties, the business model of cloud providers
is by definition focused on their internal economies
of scale, meaning that data location and sharing
behind the service-oriented data access level are
hidden from their customers. Physical data locations,
the number of copies, backup strategies and so on
are normally not part of SLAs for clouds. In terms of
legal requirements, this can lead to conflicts of law,
because “data may move from one jurisdiction into
another in milliseconds” (Spies, 2011). In some
countries it is explicitly forbidden to export certain
kinds of data.
Table 1 summarizes the discussed issues with
data integration and compares them to on-premises
strategies, including private clouds.
Table 1: Comparison of public cloud and on-premises data
integration characteristics.
| Criterion | Public clouds | On-premises (incl. private clouds) |
|---|---|---|
| Data quality | Same restrictions and technologies apply. Possible impact of new data access methods. | Proven technologies and tools available. In-house development, COTS solutions. |
| Technical infrastructure | Disintegration or migration of own infrastructure; public network and vendor dependency. Potential bandwidth issue. | Internal infrastructures with definable reliability and characteristics. |
| Data management | External vendor must be integrated in compliance and legal context; trust model changes. | Internally managed procedures and accountability. Data stays within organisation boundaries. |
| Costs | Potential cost savings through on-demand cost models. Economies of scale at provider side (Armbrust et al., 2009). Integration costs. | Traditional IT cost models (own, leased, outsourced soft- and hardware costs). Maintenance and development costs. |
| Online collaboration and sharing | Collaboration and sharing within the external cloud over well-defined services. | Bespoke bilateral information exchange with infrastructure on both sides. |
| Speed and ease of deployment | Standardized products, quick on-demand setup of test and staging servers, etc. (D'Agostino et al., 2010). | Depends on buy-or-build decisions and customization effort. |
4 DESIGN STUDY: CLOUD INTEGRATING OPEN DATA
Based on the discussed literature and practical
experience, we argue that the migration of data
integration applications into (in particular: public)
clouds disintegrates existing information system
architectures. Redesigning the system comes
at a cost, and might even be impossible in a
given setting. The risks and costs must be balanced
against the potential advantages of a comprehensive
data virtualization as detailed above.
In order to judge and evaluate cloud data
integration solutions and offers, we will conduct a
design study (Hevner et al., 2004) that will try to
find answers to the following principal challenges:
- How to design and build a generalized cloud
  data virtualization application that can
  integrate internal organization data and
  external open data?
- What are the necessary technical and
  organizational prerequisites?
- How to re-integrate and adapt existing data
  integration architectures?
- Can private cloud integration solutions serve as
  a transitional solution for public cloud data
  integration solutions later on?
Based on the methodological approach and these
questions, the main design artefact will be a software
prototype with the following preliminary global
specification:
- Migration of an existing database application
  for soil data management into a private cloud
  solution based on open source software
- Integration of available open environmental
  data into this cloud
- Adaptation and/or redevelopment of the existing
  data access and management software tiers
The study will hopefully yield more insights into
the concrete questions of what kind of data can be
migrated, and how to efficiently handle on-premises
and cloud data integration. We hope to identify
further issues because the prototype responds
to an actual and relevant requirement in the
environmental department of the authors’ institute.
In the natural sciences, there is an increasing
demand for data integration solutions that enable
interdisciplinary research. The emerging
availability of open governmental data
(Murray-Rust, 2008) in this area can foster this. Furthermore,
mobility of researchers and long-term persistence
challenges for scientific data are additional reasons
for examining cloud solutions in this domain.
Given this, the prototype is an exemplary
application of data integration, and will allow us to
scrutinize the application of cloud computing
principles in this domain.
REFERENCES
Adkinson-Orellana, L., Rodríguez-Silva, D. A.,
González-Castaño, F. J., 2011. Sharing Secure
Documents in the Cloud. CLOSER 2011 -
International Conference on Cloud Computing and
Services Science.
Al-Zoube, M., 2009. E-Learning on the Cloud -
International Journal of Virtual and Personal Learning
Environments.
Anderson, J., Bagnall, R., Smythe, M., 2011. Position
Reporting Obligations - Investment Advisers
Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz,
R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A.,
Stoica, I., Zaharia, M., 2009. Above the Clouds: A
Berkeley View of Cloud Computing.
Baars, H. and Kemper, H. G., 2010. Business Intelligence
in the Cloud?. PACIS 2010 Proceedings
Berbner, R., Grollius, T., Repp, N., 2005. An approach for
the Management of Service-oriented Architecture
(SoA) based Application Systems - Enterprise
Modelling and Information Systems Architectures.
Bernstein, P., Haas, L., 2008. Information Integration in
the Enterprise.
Blodget, H., 2011. Amazon's Cloud Crash Disaster
Permanently Destroyed Many Customers' Data.
Retrieved January 21, 2012, from
http://articles.businessinsider.com/2011-04-28/tech/29958976_1_amazon-customer-customers-data-data-loss#ixzz1l21WHWng
Böhm, C., Naumann, F., Freitag, M., George, S., Höer,
N., Köppelmann, M., Lehmann, C., Mascher, A.,
Schmidt, T., 2010. Linking Open Government Data:
What Journalists Wish They Had Known -
Proceedings of the 6th International Conference on
Semantic Systems.
D'Agostino, S., Ahronovitz, M., Armstrong, J., Ahmad,
R., Davalbhakta, N., Gogulapati, R., Lau, E., Luster,
E., A. M. Matsui, A., Mohammed, A., Moskowitz, D.,
DATAINTEGRATIONTHROUGHTHECLOUD-HowtoCombineInternalandExternalDataSources-ADesign
Study
181
Nolan, M., Plunkett, T., Porwal, S., Raj
Radhakrishnan, A., Richet, J.L., Prasad Rimal, B.,
Russell, D., B. Sigler, M., Sreenivasan, K., Stratton,
P., Syputa, R., Tidwell, D., Venkatraman, K., Versace,
M., 2010. Moving to the Cloud. Cloud Computing Use
Cases Discussion Group.
Dekkers, M., Polman, F., Velde, R. T., Vries, M. D., 2006.
Measuring European Public Sector Information
Resources. Part 1: Description, overview of results and
conclusions. Retrieved January 20, 2012, from
http://ec.europa.eu/information_society/policy/psi/docs/pdfs/mepsir/final_report.pdf.
Doelitzscher, F., Sulistio, A., Reich, C., Kuijs, H., Wolf,
D., 2011. Private Cloud for Collaboration and e-
Learning Services: from IaaS to SaaS. Journal
Computing - Cloud Computing archive Volume 91
Issue 1.
Géczy, P., Izumi, N., Hasida, K., 2012. Cloudsourcing:
Managing Cloud Adoption - Global Journal of
Business Research
Halevy, A., Rajaraman, A., Ordille, J., 2006. Data
Integration: The Teenage Years. Proceeding VLDB
'06 Proceedings of the 32nd international conference
on Very large data bases.
Hasselbring, W., 2000. Information system integration -
Communications of the ACM (Vol. 43, pp. 32-38).
Hevner, A. R., Ram, S., March, S. T., 2004. Design
Science in Information Systems Research - MIS
Quarterly.
Khatri, V., Brown, C., 2010. Designing Data Governance -
Magazine Communications of the ACM.
Kim, W., 2009. Cloud Computing: Today and Tomorrow.
Journal of Object Technology (JOT).
Lenzerini, M., 2002. Data Integration: A Theoretical
Perspective - Symposium on Principles of Database
Systems.
Mell, P., Grance, T., 2011. The NIST Definition of Cloud
Computing. Retrieved January 24, 2012, from
http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf.
Miller, P., Styles, R., Heath, T., 2008. Open Data
Commons, a License for Open Data. 1st Workshop
on Linked Data on the Web.
Murray-Rust, P., 2008. Open Data in Science. Unilever
Centre, Department of Chemistry, University of
Cambridge.
Naumann, F. and Rolker, C. 2000. Assessment methods
for information quality criteria. Proceedings of 5th
International Conference on Information Quality
Spies, A., 2011. Global Data Protection: Whose Rules
Govern? Retrieved January 28, 2012, from
http://www.bingham.com/Media.aspx?MediaID=12931
Yahanna, N. and Gilpin, M., 2012. The Forrester Wave:
Data Virtualization, Q1, 2012. Retrieved January 28,
2012, from
http://www.forrester.com/rb/Research/wave%26trade%3B_data_virtualization%2C_q1_2012/q/id/60746/t/2
CLOSER2012-2ndInternationalConferenceonCloudComputingandServicesScience
182