Are we Ready for Science 2.0?

Tim W. Nattkemper

Biodata Mining Group, Faculty of Technology, Bielefeld University, PO Box 100131, D-33501, Bielefeld, Germany

Keywords: Science 2.0, Knowledge Discovery, Data Mining, Information Visualization, Information Sharing, Semantic Annotation, Cooperative Data Analysis, Web 2.0.

Abstract:

In this position paper the impact of web development on knowledge discovery and information sharing in

natural sciences and humanities is discussed. While on the one hand the potential of moving data analysis

to the web is huge, one has to deal with fundamental obstacles on both levels: administrative/political and

scientiﬁc/algorithmic. Some recent trends in Science 2.0 applications and tools in scientiﬁc research are sum-

marized and discussed. Afterwards the reasons for limitations in the Science 2.0 progress are identiﬁed. The

paper concludes with the opinion that information sciences in general and the fields of data mining, visualization, statistical learning and applied computer sciences (such as bioinformatics or medical informatics) have not kept pace with the development and should reconsider some of their research foci.

1 INTRODUCTION

The world wide web (WWW) is continuously and

dynamically changing regarding its technical fea-

tures, its structure and (consequently) its content.

Many aspects of this change relate to each other

(for instance they are based on one and the same

technical development) and are in sum termed Web 2.0 (see footnotes 1, 2). And although this term is only loosely defined it has become common language in the last

decade. A newly introduced service or web application is referred to as a Web 2.0 service if it has a subset of the following features: User-centered Design, Rich Internet Application (RIA),

Dynamic Content (DC), Collaboration/Cooperation

(CC), Software as a Service (SAAS), Decentralisa-

tion of Management/Power/Administration, Crowd-

sourcing, Web and Rich User Experience.

Of course, this development in the WWW towards

Web 2.0 applications itself created new large collec-

tions of structured data, semi-structured data or non-

structured data and stimulated many knowledge dis-

covery and data mining research projects to search

these new data collections for hidden relationships

and patterns (Fayyad et al., 1996; Cooley et al., 1997;

Nasraoui et al., 2008; Gloor et al., 2009; Munibalaji

and Balamurugan, 2012).

Footnote 1: http://oreilly.com/web2/archive/what-is-web-20.html
Footnote 2: http://www.techpluto.com/web-20-services/

But since scientists were massive users of the WWW from its beginning at CERN, this was not the

only reaction of science to the web development in

general and to the Web 2.0 development in particular.

One of the main observations in the advent of the Web

2.0 was that web-based technologies became a major

driving force for the collection of user-generated con-

tent. And parallel to that, science became more and

more quantiﬁed and digitized as well. In the natural

sciences, measurement is nowadays carried out with

sensors directly connected to a PC so quantiﬁcation

is straightforward. This fact has a strong inﬂuence on

almost all ﬁelds of natural sciences, especially in life

sciences. There, the rapid development of new tech-

nologies for genomic sequencing led to a huge gap

between the large data collections and the computa-

tional methods to analyze the data and to extract in-

formation that can be analyzed and understood by a

user (Pennisi, 2011). But there is no doubt that the problem of “drowning in data and starving for knowledge” will be faced in many more areas of natural sciences.

Even in sciences like marine biology and ecol-

ogy, ﬁeld studies are nowadays carried out in highly

standardized routines recording time series data with

permanently increasing resolution in time and dimension. Especially when images and videos are recorded, the fraction of the data volume that can be manually analyzed, i.e. annotated with semantics, is shrinking more and more, leaving a growing mountain of unlabeled and unannotated data. This has serious consequences for the significance of the conclusions drawn from a study because the majority of the data has not been considered, so automatic labeling and annotation of data has been proposed (Culverhouse et al., 2003; Lebart et al., 2003; Pizarro et al., 2009). It is easy to foresee that in humanities like social sciences and psychology, automated digital recording of large data collections (like video observations or audio streams) will become standard as well, and these disciplines will soon experience their own bottleneck problem of data analysis.

[Figure 1 diagram; recoverable labels: www.biigle.de, Internet, Computer Vision API, Laser Point Detection, Illumination Correction, Statistics / Data Mining.]

Figure 1: The central element of BIIGLE’s architecture is the database which contains the images themselves, user-generated label data and results from the computer-vision modules. The data are made available through the rich internet application served by www.biigle.de.

Nattkemper, T. W. Are we Ready for Science 2.0? DOI: 10.5220/0004169903020306. In Proceedings of the International Conference on Knowledge Management and Information Sharing (KMIS-2012), pages 302-306. ISBN: 978-989-8565-31-0. © 2012 SCITEPRESS (Science and Technology Publications, Lda.)

Nevertheless, since automatic semantic annota-

tion of complex semi- or non-structured data such as

images or video is sometimes not perfectly achievable, the recent developments of the WWW, e.g. Web 2.0 services, led some authors to argue for paradigm shifts in scientific practice. The ability to access the same data from different locations

through common computer hardware promised to sig-

niﬁcantly lower the hurdles for contributing to online

science communities. Consequently, these authors proposed Web 2.0 tools for the scientific community and coined the phrase “Science 2.0” (Shneiderman, 2008; Waldrop, 2008). It was clear that this new term was much more than a new “buzz word”, since it appeared as the perfect reaction to the trend that progress and success in science are more and more dependent on collaboration in teams of growing size, as reported by Wuchty et al. (2007).

2 FROM WEB 2.0 TO SCIENCE 2.0

The fact that the term Science 2.0 is only vaguely defined is not surprising and follows directly from the

loose deﬁnition of the term Web 2.0. Interestingly,

the term seems to have two faces like a Janus statue.

2.1 The Face of Politics

The first face is its interpretation from the perspectives of administration and politics. From this perspective, the term Science 2.0 covers in some sense all non-scientific questions like “Should results be freely exchangeable via the web?” or “How should the process of publishing be reconsidered?”. Of course, these are interesting questions, and the open access development definitely has a strong impact on the scientific landscape already. But it is also surprising that some communities (for instance image processing, medical imaging or bioimaging) do not participate much in that development although they would benefit immensely from it, e.g. considering the unlimited size of supplementary image material which could be associated with their papers. Another point is that sharing and publishing data through the web is practiced only by a small set of researchers from the life sciences, since these are forced to do so by the national or international funding agencies supporting their research. In other scientific disciplines, researchers still consider their data as their “precious” and show no clear tendency towards sharing data.

2.2 The Face of Science

Nevertheless, the second face of Science 2.0 seems more interesting in the context of this conference. This perspective is determined by the question “How does Web 2.0 change the way research and development is carried out?”. In other words, which developments in algorithms and software are necessary to accelerate data analysis and increase the significance of scientific studies by tackling the bottleneck problem of understanding huge amounts of complex and semi-/non-structured data? This includes not only “classic” data mining methodology like clustering, dimension reduction, applied statistics or association rule mining. Another very important aspect is sharing data and collaboration via the web (see Web 2.0

definition above). Here, new approaches for sharing data and (maybe more importantly) annotating and discussing data via the web have been proposed just recently, for instance in the context of molecular biology, in particular for metabolomics data (Neuweger et al., 2010), for transcriptomics data (Dondrup et al., 2009) and for bioimage / microscopy data (Kvilekval et al., 2010; Loyek et al., 2011). In marine biology (see above) two systems have been proposed to open data for a larger scientific community and to support collaborative semantic annotation: the NEPTUNE project in Canada (Pirenne and Guillemot, 2009; Leslie et al., 2010) and the BIIGLE system (Ontrup et al., 2009; Bergmann et al., 2011) (see Figure 1). Some of these systems even support data mining by offering algorithms for clustering and dimension reduction in a software as a service (SaaS) framework. One example is the WHIDE visualization for complex bioimages (Kölling et al., 2012), which is computationally expensive but can be applied easily due to a SaaS framework via the BIOIMAX website. The technical concept, referred to as TICAL (i.e. how the job is carried out by a web server, a compute cluster and a data server), is straightforward and shown in Figure 2.
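The flow described in the caption of Figure 2 (a web server issues an XML-RPC call, the XML-RPC server starts a cluster job, the result is stored as a JSON file for later visualization) can be illustrated with a minimal Python sketch. This is not the TICAL implementation: the function names submit_job and job_finished, the toy one-dimensional grouping that stands in for the clustering software, the temporary directory that stands in for the DB server, and the polling loop that replaces the e-mail notification are all assumptions made for illustration. Only the Python standard library is used.

```python
import json
import tempfile
import threading
import time
from pathlib import Path
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

RESULT_DIR = Path(tempfile.mkdtemp())  # assumption: stands in for the DB server
JOBS = {}  # job id -> worker thread (stands in for the compute cluster queue)


def run_cluster_job(job_id, values, k):
    """Toy cluster job: split sorted 1-D values into k groups, store prototypes as JSON."""
    ordered = sorted(values)
    size = max(1, len(ordered) // k)
    groups = [ordered[i * size:(i + 1) * size] for i in range(k - 1)]
    groups.append(ordered[(k - 1) * size:])
    prototypes = [sum(g) / len(g) for g in groups if g]
    result = {"job": job_id, "k": k, "prototypes": prototypes}
    (RESULT_DIR / f"{job_id}.json").write_text(json.dumps(result))


def submit_job(values, k):
    """XML-RPC entry point: start the job in the background and return its id at once."""
    job_id = f"job{len(JOBS)}"
    worker = threading.Thread(target=run_cluster_job, args=(job_id, values, k))
    JOBS[job_id] = worker
    worker.start()
    return job_id


def job_finished(job_id):
    """XML-RPC entry point: polling replaces the e-mail notification in this sketch."""
    return not JOBS[job_id].is_alive()


def demo():
    """Play the role of the web server: submit a job, wait, load the JSON result."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(submit_job)
    server.register_function(job_finished)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    proxy = ServerProxy(f"http://127.0.0.1:{server.server_address[1]}")
    job_id = proxy.submit_job([1.0, 1.2, 0.8, 9.0, 9.5, 10.1], 2)
    while not proxy.job_finished(job_id):
        time.sleep(0.05)
    result = json.loads((RESULT_DIR / f"{job_id}.json").read_text())
    server.shutdown()
    server.server_close()
    return result


if __name__ == "__main__":
    print(demo())
```

The point of the design is that the call returns immediately with a job identifier, so the web front end stays responsive while the expensive computation runs elsewhere.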

But although the political arguments are well motivated and the hardware and software concepts needed to establish the technical level of Science 2.0 are well known, the author does not observe that something like Science 2.0 is really taking shape. The majority of data is not shared or open to the public, the majority of high impact publications are still published in a traditional way, and only a small number of Web 2.0 web services exist for data mining or knowledge discovery. What are the reasons for that?

3 WHY IT DOES NOT REALLY WORK

To find the answers to the above question we have to look at the two faces again. On the political face the reasons can be seen very easily. Researchers put much effort into designing studies, collecting and recording data, investing in new hardware and teaching students and assistants. Consequently, they are reluctant to share data, since even if they do not consider their own career (i.e. writing high impact papers as a PI) they are responsible for the careers of their students. The WWW has gained some bad reputation in the light of illegal copying and exchange of media data, so it will need some pressure to make researchers move their data to the web as long as they see no benefit that outweighs the risks. And this leads us to the technical face.

[Figure 2 diagram; recoverable labels: RIA (Rich Internet App) TICAL, Webserver, XML RPC, XML RPC Server, Cluster-Job, Compute Servers, Result, DB Server, JSON, RIA (Rich Internet App) WHIDE.]

Figure 2: The TICAL/WHIDE architecture consists of three layers. First, the user submits a request to the web server, which triggers an XML-RPC call. Second, the call is received by the XML-RPC server which starts the execution of the clustering software on the high performance compute servers using the parameters entered by the user. Third, when the algorithms have finished, the user is notified by an email. The clustering result (usually a set of prototypes and a cluster map) is written to a file and stored in a database, together with additional meta information (time, cluster parameters, user, data set etc.). By requesting to view the result in another web application in BIOIMAX (such as WHIDE), the corresponding JSON file is loaded and the user can explore a visualization in a web browser through the BIOIMAX system.

On the technical face of our Science 2.0 Janus statue metaphor, the reasons are quite heterogeneous. Let us first have a look at the humanities. There, the digitization of scientific methods is more or less in an infant stage. Researchers are just starting to record data digitally with a view to a sophisticated subsequent data analysis. But in the natural sciences and medicine we see a different problem. From the point of view of the author, the development of algorithms and software is merely oriented towards making the established methods available through the web (like the WHIDE system, see above). So the overall gain regarding reasoning, knowledge and insight is limited.

For instance, in machine learning research the best groups work on finding new methods for dimension reduction and projection that outperform the standard methods regarding topology preservation (like ISOMAP, LLE, t-SNE etc.) and report progress continuously. But the methods are getting more and more computationally expensive, so they are not applicable in many contexts with large data volume, and an application through the web does not make sense either since the user needs to wait for hours until the results have been computed.

The author concludes that the areas of data mining and knowledge discovery can contribute much more so that the potential of Science 2.0 can unfold.

4 WHAT CAN BE DONE?

From the point of view of the author, the most reason-

able thing to do would be to invent new paradigms for

knowledge discovery in a Web 2.0 framework. This

starts with implementing some aspects of social networks so that ideas and conclusions are exchanged rapidly and safely. This way, the quality of data annotations

would be improved rapidly. Another point would

be the collaborative analysis of data. Collaborators

would use the same tool to derive information graph-

ics from their data (scatter plots, histograms, pseudo-

color maps, ...) or to carry out statistical tests which

would provide a good basis for discussing the data.

But one may find the next step in data analysis, data mining, much more interesting. How should a selected standard data mining procedure be re-shaped if it is part of a Web 2.0 / Science 2.0 framework? If one considers clustering, for example, the idea would be to work on new online clustering methods which perform rapidly, since users are used to getting results instantaneously after “pressing the button”. One could, for instance, present a first estimate of a clustering result while the real clustering is performed in the background and the result is updated continuously. As a consequence, the whole idea of a clustering algorithm could be reconsidered. The primary goal would no longer be to find the clustering which achieves the best clustering indices (i.e. clustering quality regarding intracluster variance and intercluster distance) but to find a clustering which achieves at least a minimum cluster quality in a given (short) time window. The steepness of the cluster index over time (for instance the Index I, the Calinski-Harabasz index or the Davies-Bouldin index) could therefore be more interesting in the Science 2.0 context.
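The idea of presenting a first estimate and refining it in the background can be sketched as a generator that yields an intermediate result after every iteration until a time budget is spent. The sketch below is only an illustration under stated assumptions: it uses plain Lloyd-style k-means rather than any method proposed in this paper, the name anytime_kmeans and its parameters are hypothetical, the initial centroids are simply picked at evenly spaced positions in the input, and the within-cluster sum of squared distances stands in for the cluster indices named above.

```python
import time


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def anytime_kmeans(points, k, budget_s):
    """Yield (iteration, centroids, sse) after every k-means iteration.

    Stops when the assignment no longer changes or when the time budget
    (in seconds) is used up, so at least one estimate is always produced.
    """
    deadline = time.monotonic() + budget_s
    # Assumption: evenly spaced input points serve as initial centroids.
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = None
    iteration = 0
    while True:
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster runs empty
                centroids[c] = mean(members)
        sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, new_labels))
        iteration += 1
        yield iteration, list(centroids), sse
        if new_labels == labels or time.monotonic() >= deadline:
            return
        labels = new_labels


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    for step, cents, quality in anytime_kmeans(data, k=2, budget_s=0.5):
        # A web front end would redraw the visualization here.
        print(step, cents, round(quality, 3))
```

Recording the yielded quality values against elapsed time gives exactly the kind of curve whose steepness the paragraph above proposes as a criterion.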

Another point is that users usually do not have an idea about the number of clusters, but they would accept choosing between different results. So maybe the question of how the best number of clusters k is to be set and which metric d(x_i, x_j) is to be applied to quantify the similarity or the distance of two items i and j with their n-dimensional features x_i, x_j is not the only one of interest to data mining developers in the Science 2.0 context. It would be interesting to find good algorithmic foundations for clustering data with a flexible number of clusters and for visualizing the result dynamically, so that the user can interactively explore the clustering results to gain a mental model of her/his data. And it would be interesting here to further explore the connections between the algorithmic foundations and the development of graphics standards (HTML5, 3D) in the WWW.
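Letting the user choose between results for different numbers of clusters can be illustrated by computing several candidate clusterings and attaching a quality index to each. The following sketch is an assumption-laden illustration, not a method from this paper: the helper names kmeans and candidate_clusterings are hypothetical, a simple k-means with evenly spaced initial centroids is used, d(x_i, x_j) is taken to be the squared Euclidean distance, and the index is the Calinski-Harabasz index in its standard form, (B / (k - 1)) / (W / (n - k)), with B the between-cluster and W the within-cluster sum of squares.

```python
def sq_dist(a, b):
    """Squared Euclidean distance, used here as d(x_i, x_j)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def kmeans(points, k, iterations=20):
    """Simple k-means; returns one cluster label per point."""
    # Assumption: evenly spaced input points serve as initial centroids.
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels


def calinski_harabasz(points, labels):
    """Calinski-Harabasz index; requires at least two non-empty clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    n, k = len(points), len(clusters)
    overall = mean(points)
    between = sum(len(m) * sq_dist(mean(m), overall) for m in clusters.values())
    within = sum(sq_dist(p, mean(m)) for m in clusters.values() for p in m)
    if within == 0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


def candidate_clusterings(points, ks):
    """Return {k: {"labels": ..., "index": ...}} so a user can pick among results."""
    results = {}
    for k in ks:
        labels = kmeans(points, k)
        results[k] = {"labels": labels, "index": calinski_harabasz(points, labels)}
    return results


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 0.0), (10.0, 1.0), (11.0, 0.0),
            (0.0, 10.0), (0.0, 11.0), (1.0, 10.0)]
    for k, res in candidate_clusterings(data, [2, 3, 4]).items():
        print(k, round(res["index"], 2), res["labels"])
```

In an interactive setting the index would be shown next to each candidate, and the choice of k would remain with the user.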

But these were just some examples, and it seems natural to the author that it could be interesting to reconsider many KDD methods along these lines.

5 CONCLUSIONS

The author finally concludes that Science 2.0 still has unexploited potential, but that the role of KDD should be reconsidered. The main goal is to develop new data analysis methods that offer a substantial advantage to the users so that they are more motivated to move their research to the web.

REFERENCES

Bergmann, M., Langwald, N., Ontrup, J., Soltwedel, T., Schewe, I., Klages, M., and Nattkemper, T. (2011). Megafaunal assemblages from two shelf stations west of Svalbard. Marine Biology Research, 7(6):588–602.

Cooley, R., Mobasher, B., and Srivastava, J. (1997). Web mining: information and pattern discovery on the world wide web. In Tools with Artificial Intelligence, 1997. Proceedings., Ninth IEEE International Conference on, pages 558–567.

Culverhouse, P. F., Williams, R., Reguera, B., Herry, V., and Gonzalez-Gil, S. (2003). Do experts make mistakes? A comparison of human and machine identification of dinoflagellates. Marine Ecology Progress Series, 247:17–25.

Dondrup, M., Albaum, S. P., Griebel, T., Henckel, K., Juenemann, S., Kahlke, T., Kleindt, C., Kuester, H., Linke, B., Mertens, D., Mittard-Runte, V., Neuweger, H., Runte, K., Tauch, A., Tille, F., Puehler, A., and Goesmann, A. (2009). EMMA 2 - a MAGE-compliant system for the collaborative analysis and integration of microarray data. BMC Bioinformatics, 10(50).

Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996).

From data mining to knowledge discovery: An

overview. In Fayyad, U., Piatetsky-Shapiro, G.,

Smyth, P., and Uthurusamy, R., editors, Advances in

Knowledge Discovery and Data Mining, pages 495–

515. AAAI press/The MIT Press.

Gloor, P., Krauss, J., Nann, S., Fischbach, K., and Schoder, D. (2009). Web science 2.0: Identifying trends through semantic social network analysis. In Computational Science and Engineering, 2009. CSE ’09. International Conference on, volume 4, pages 215–222.

Kölling, J., Langenkämper, D., Abouna, S., Khan, M., and Nattkemper, T. W. (2012). WHIDE - a web tool for visual data mining colocation patterns in multivariate bioimages. Bioinformatics, 28(8):1143–1150.

Kvilekval, K., Fedorov, D., Obara, B., Singh, A., and Man-

junath, B. (2010). Bisque: a platform for bioimage

analysis and management. Bioinformatics, 26(4):544–

52.

Lebart, K., Smith, C., Trucco, E., and Lane, D. (2003). Automatic indexing of underwater survey video: algorithm and benchmarking method. IEEE Journal of Oceanic Engineering, 28(4):673–86.

Leslie, M., Scott, N., Guillemot, E., and Auger, V. (2010). Video acquisition, archiving, annotation and analysis: NEPTUNE Canada’s real-time georeferenced library of deep sea video. In OCEANS 2010, pages 1–9.

Loyek, C., Rajpoot, N., Khan, M., and Nattkemper, T. W.

(2011). Bioimax: A web 2.0 approach for easy

exploratory and collaborative access to multivariate

bioimage data. BMC Bioinformatics, 12(1):297.

Munibalaji, T. and Balamurugan, C. (2012). Analysis of

link algorithms for web mining. International Journal

of Engineering and Innovative Technology (IJEIT),

1(2):83–86.

Nasraoui, O., Soliman, M., Saka, E., Badia, A., and Germain, R. (2008). A web usage mining framework for mining evolving user profiles in dynamic web sites. Knowledge and Data Engineering, IEEE Transactions on, 20(2):202–215.

Neuweger, H., Bonte, A., Stutz, L., Niehaus, K., Stoye, J., and Goesmann, A. (2010). MeltDB 2.0 - advances of the metabolomics software system. Bioinformatics, 24(23):2726–32.

Ontrup, J., Ehnert, N., Bergmann, M., and Nattkemper, T. W. (2009). BIIGLE - web 2.0 enabled labelling and exploring of images from the arctic deep-sea observatory HAUSGARTEN. In Proc. of OCEANS’09 IEEE, Bremen, Germany.

Pennisi, E. (2011). Will computers crash genomics? Sci-

ence, 331(6018):666–668.

Pirenne, B. and Guillemot, E. (2009). The data management system for the VENUS and NEPTUNE cabled observatories. In OCEANS 2009 - EUROPE, pages 1–4.

Pizarro, O., Williams, S., and Colquhoun, J. (2009).

Topic-based habitat classiﬁcation using visual data.

In Proc. of IEEE OCEANS’09, Bremen (Germany),

pages 1–8.

Shneiderman, B. (2008). Science 2.0. Science, 319:1349.

Waldrop, M. M. (2008). Science 2.0: Great new tool, or

great risk? Scientiﬁc American.

Wuchty, S., Jones, B. F., and Uzzi, B. (2007). The increas-

ing dominance of teams in production of knowledge.

Science, 316:1036–1039.

KMIS2012-InternationalConferenceonKnowledgeManagementandInformationSharing

306