onomies, as well as the quality of data sources, e.g., their coverage and error rate. It is essential to make use of extensive expertise in the areas of information management for biodiversity data (Lotz et al., 2012), the semantic web (Nadrowski et al., 2013), and user-generated annotations (Gohr et al., 2010; Gohr et al., 2011). Due to the sheer size of many data sources, they can only be evaluated efficiently if the evaluating programs are brought to the locations of the data (function shipping) rather than, as is common today, copying the data to the respective processing locations. This could be achieved using declarative data description and transformation languages. Examples of such languages are SQL, SPARQL, Datalog, and MapReduce. We envision scientific workflows that compose large data queries and transformation jobs. Using today's database infrastructure, this would result in jobs consisting of multiple dependent SQL queries or MapReduce jobs. In the text-mining domain, such architectures are used in the TopicExplorer system (Hinneburg et al., 2012). For biodiversity science, the spectrum of data sources ranges from molecular biological data such as genome, transcriptome, proteome, and metabolome data, to observed data (species, traits, habitats, etc.), remote sensing data, and climate data.
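To illustrate the function-shipping idea, the following sketch (plain Python with the standard sqlite3 module; the data source, table, and column names are purely hypothetical) pushes a small quality-evaluation query to the data location, so that only an aggregated summary rather than the raw records is transferred:

    import sqlite3

    # Hypothetical quality evaluation shipped to the data source: the
    # aggregation runs where the data lives, and only a small summary
    # (record count and rate of missing coordinates per taxon) is returned.
    QUALITY_QUERY = """
        SELECT taxon,
               COUNT(*) AS n_records,
               AVG(CASE WHEN latitude IS NULL OR longitude IS NULL
                        THEN 1.0 ELSE 0.0 END) AS missing_coordinate_rate
        FROM occurrences
        GROUP BY taxon
    """

    def evaluate_source(db_path):
        """Run the evaluation at the data location; return only the summary."""
        with sqlite3.connect(db_path) as conn:
            return conn.execute(QUALITY_QUERY).fetchall()

    for taxon, n_records, missing_rate in evaluate_source("occurrences.db"):
        print(taxon, n_records, round(missing_rate, 3))

Depending on where the source resides, the same evaluation step could equally be expressed as a SPARQL query against a remote endpoint or as a MapReduce job.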
4.4 Combining the Layers
The building blocks to be developed at the three layers are to be merged into a software architecture that supports scientists in the whole process of knowledge discovery and that records the analysis and decision-making processes as follows. First, the system searches for the appropriate data, programs, and their parameters. This is followed by finding a suitable combination of multiple data sources and by solving the resulting big data problems. Finally, the system selects a suitable way to visualize the results. During this procedure, all steps are documented and stored so that they can finally be visualized graphically. This way, the origin of the data is traceable and the results are reproducible. The results can be exported in various ways for further processing or archiving. In addition, it is possible to repeat an explorative analysis with minimal human effort and to easily integrate new scientific workflows. Note that explorative visual analysis goes beyond mere result presentation: it allows interactive explanation and justification of results. Therefore, fast and easy design and composition of visualization views are necessary to provide interfaces that allow users to explore analysis results, relate them to the observed data, and understand their impact.
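As a minimal sketch of how such recording could look (no particular workflow system is assumed; all names are illustrative), each workflow step logs its inputs, parameters, and outputs, so that a finished analysis can later be exported, replayed, or rendered as a provenance graph:

    import json
    import time

    class RecordedWorkflow:
        """Illustrative provenance recorder: every step logs which data
        sources, programs, and parameters were used."""

        def __init__(self):
            self.log = []

        def run_step(self, name, func, inputs, **params):
            result = func(inputs, **params)
            self.log.append({
                "step": name,
                "inputs": inputs,        # identifiers of the data sources used
                "parameters": params,    # parameters chosen for this program
                "timestamp": time.time(),
            })
            return result

        def export(self, path):
            # The stored record documents the origin of every result and can
            # be visualized graphically or re-executed later.
            with open(path, "w") as f:
                json.dump(self.log, f, indent=2)

    # Hypothetical usage with two toy steps:
    wf = RecordedWorkflow()
    combined = wf.run_step("combine", lambda sources: sum(sources, []),
                           [[1, 2], [3]])
    squared = wf.run_step("analyse", lambda xs, power: [x ** power for x in xs],
                          combined, power=2)
    wf.export("provenance.json")

Repeating an explorative analysis then amounts to re-running the recorded steps, and the exported log provides the material for the graphical documentation of the analysis described above.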
5 CONCLUDING REMARKS
Biodiversity research has a high societal and economic relevance. Many key questions of this discipline can only be answered using big data. However, up to now, support for big data in this field has been limited. Existing approaches address individual aspects, but not the problem as a whole. We believe that an innovative computer science approach is needed here. For this purpose, we proposed a three-layer architecture connecting data sources and function implementations to scientific workflows that support domain-specific problem solving.
REFERENCES
Berg, C. and Zimmermann, W. (2014). Evaluierung von Möglichkeiten zur Implementierung von Semantischen Analysen für Domänenspezifische Sprachen. In Software-Engineering 2014, Workshopband Arbeitstagung Programmiersprachen ATPS 2014, volume 1129, pages 112–128. CEUR Workshop Proceedings.
Blei, D. M. (2012). Probabilistic topic models. Commun.
ACM, 55(4):77–84.
Catapano, T., Hobern, D., Lapp, H., Morris, R. A., Morrison, N., Noy, N., Schildhauer, M., and Thau, D. (2011). Recommendations for the use of knowledge organization systems by GBIF. Global Biodiversity Information Facility (GBIF), Copenhagen. Available at http://www.gbif.org/orc.
Fortmeier, O., Bücker, H. M., Fagginger Auer, B. O., and Bisseling, R. H. (2013). A new metric enabling an exact hypergraph model for the communication volume in distributed-memory parallel applications. Parallel Computing, 39(8):319–335.
Freytag, A., Rodner, E., Bodesheim, P., and Denzler, J.
(2012). Rapid uncertainty computation with Gaussian
processes and histogram intersection kernels. In Proc.
Asian Conf. Comput. Vis., pages 511–524.
Goff, S. A., Vaughn, M., McKay, S., Lyons, E., Stapleton, A. E., Gessler, D., Matasci, N., Wang, L., Hanlon, M., Lenards, A., Muir, A., Merchant, N., Lowry, S., Mock, S., Helmke, M., Kubach, A., Narro, M., Hopkins, N., Micklos, D., Hilgert, U., Gonzales, M., Jordan, C., Skidmore, E., Dooley, R., Cazes, J., McLay, R., Lu, Z., Pasternak, S., Koesterke, L., Piel, W. H., Grene, R., Noutsos, C., Gendler, K., Feng, X., Tang, C., Lent, M., Kim, S.-j., Kvilekval, K., Manjunath, B., Tannen, V., Stamatakis, A., Sanderson, M., Welch, S. M., Cranston, K., Soltis, P., Soltis, D., O'Meara, B., Ane, C., Brutnell, T., Kleibenstein, D. J., White, J. W., Leebens-Mack, J., Donoghue, M. J., Spalding, E. P., Vision, T. J., Myers, C. R., Lowenthal, D., Enquist, B. J., Boyle, B., Akoglu, A., Andrews, G., Ram, S., Ware, D., Stein, L., and Stanzione, D. (2011). The iPlant collaborative: Cyberinfrastructure for plant biology. Frontiers in Plant Science, 2(34).
Gohr, A., Hinneburg, A., Schult, R., and Spiliopoulou, M.
(2009). Topic evolution in a stream of documents.