JAR2ONTOLOGY - A TOOL FOR AUTOMATIC EXTRACTION OF

SEMANTIC INFORMATION FROM JAVA OBJECT CODE

Nicolás Marín, Clara Sáez-Árcija∗ and M. Amparo Vila

Department of Computer Science and Artiﬁcial Intelligence, University of Granada, Granada, Spain

Keywords:

Object code analysis, Semantics extraction, Java, OWL.

Abstract:

We present here a novel approach (and its implementation) for the automatic extraction of semantic knowledge

from Java libraries.

We want to match software libraries, so we need to obtain as much information as possible to use it in the

matching process. For this purpose, this approach extracts information about the structure of the classes (i.e.,

name, ﬁelds and hierarchy), as well as information about the behavior of the classes (i.e., methods).

To our knowledge, only lightweight approaches to the extraction of this kind of information from Java object code can be found in the literature. The approach is implemented in an automatic extraction tool (called Jar2Ontology) that has been developed as a plug-in of the Protégé Ontology Editor and Knowledge Acquisition System. Jar2Ontology extracts the semantics from Java libraries and translates it into OWL (Web Ontology Language).

1 INTRODUCTION

Information systems integration is a problem of increasing interest in many areas, such as Business Intelligence, Customer Relationship Management, Enterprise Information Portals, E-Commerce or E-Business. Interested readers can find remarkable surveys of research in this area in (Kalfoglou and Schorlemmer, 2003; Rahm and Bernstein, 2001; Shvaiko and Euzenat, 2005; Wache et al., 2001).

Often, software libraries are part of information systems and, for this reason, information systems integration should entail the integration of software libraries. Nevertheless, there is no integration system that performs software library matching. In this work, we focus on the representation of software libraries with the aim of integrating them.

There are many definitions of information integration. The following one, taken from (Sáez-Árcija et al., 2009), is a compendium of those definitions.

Definition 1. Information Integration is the task that aims at building a global system which provides unified access to the information from many information sources. Those information sources are distributed (placed in different places), autonomous (independently managed) and heterogeneous (with different software, hardware, data model, etc.).

∗ Corresponding author

This work is devoted to the ﬁrst step of the infor-

mation integration process, i.e., the semantics extrac-

tion. Our goal is to integrate software libraries, and

for this reason, we have to extract the semantics of

the libraries and model them.

We present here a novel approach (and its implementation) for the automatic extraction of semantic knowledge from Java object code. We have chosen ontologies to represent the obtained knowledge, because most integration systems are based on ontology matching techniques. The Java data model has been chosen because the Java programming language is one of the most widespread general-purpose object-oriented languages.

The proposed approach has been implemented as Jar2Ontology, a plug-in for the widely used Protégé Ontology Editor and Knowledge Acquisition System. This tool extracts information directly from Java jar files, that is, from Java object code, so we do not need the source code to model the semantics of the library.

Most tools for conceptual modeling allow expressing information about the structural component of the concepts, and integration systems use this type of knowledge in the matching process. In our case, the conceptual model is the Java data model, where the concepts are the library classes and their structural component refers to the names and fields of the classes, as well as to the class hierarchy.

http://protege.stanford.edu

Marín N., Sáez-Árcija C. and Amparo Vila M..

JAR2ONTOLOGY - A TOOL FOR AUTOMATIC EXTRACTION OF SEMANTIC INFORMATION FROM JAVA OBJECT CODE.

DOI: 10.5220/0003510302670276

In Proceedings of the 13th International Conference on Enterprise Information Systems (ICEIS-2011), pages 267-276

ISBN: 978-989-8425-53-9

© 2011 SCITEPRESS (Science and Technology Publications, Lda.)

Nevertheless, in some conceptual models (e.g., the object-oriented data model), the semantics consists of two types of knowledge: structural and behavioral knowledge, which are related to the structural component and the behavioral component of the model, respectively. In our case, the behavioral component refers to the information about the methods of the classes. The use of this behavioral information can enrich the matching process thanks to additional criteria concerning the behavioral component.

It is important to point out that, though our research has been oriented to the data integration context, semantic knowledge extraction from Java libraries can be useful not only for data integration, but also for many other applications, such as automatic generation of code documentation or reverse engineering.

The paper is structured as follows. Section 2 presents a brief state of the art on semantic knowledge extraction and code analysis. In section 3 we focus on how to extract semantic information from a jar file and how to write it in the form of an OWL (Web Ontology Language) ontology. Next, in section 4 we present the different types of ontologies that can be obtained after the semantic extraction process. Section 5 introduces some implementation issues of the proposed approach and some example experiments. Finally, in section 6, concluding remarks end the paper.

2 RELATED WORK

Many research efforts have been made in the field of automatic semantic knowledge extraction in recent years. There are many works that aim to obtain a formal representation of the semantics that underlies a variety of sources, such as plain text (e.g. (Buitelaar et al., 2008; Wimalasuriya and Dou, 2010)), semi-structured documents (e.g. (DuL, ; Thiam et al., 2009)) or relational database schemas (e.g. (Curino et al., 2009; Myroshnichenko and Murphy, 2009)).

In the code analysis area, we can also find many interesting works. Code analysis provides support for many applications, such as program understanding (e.g. (Jakobac et al., 2005)), hardware design (e.g. (Martino et al., 2002)), software metrics (e.g. (Wong and Gokhale, 2005)), security testing (e.g. (Herbold et al., 2009; Hong et al., 2009; Letarte and Merlo, 2009; Spoto et al., 2010)), software design (e.g. (Amey, 2002)) and reengineering (e.g. (Herbold et al., 2009; Kawrykow and Robillard, 2009)). Most code analyzers examine C/C++ (e.g. (Martino et al., 2002; Spinellis, 2010; Wong and Gokhale, 2005)) and Java (e.g. (Jakobac et al., 2005; Kawrykow and Robillard, 2009)) source code, but we have also found works about PHP (e.g. (Letarte and Merlo, 2009)) and SPARK (e.g. (Amey, 2002)). Most of these approaches take source code as input data.

Although information extraction from source code is straightforward, in many cases it is not possible, simply because the source code is not available. Therefore, if we want to develop a tool that is as general as possible, we have to face the difficult task of analyzing object code in detail.

With respect to object code, we can cite (Hong

et al., 2009; Jackson and Waingold, 1999; Spoto et al.,

2010) as examples of Java Byte Code analysis. Nev-

ertheless, only one of these approaches (Jackson and

Waingold, 1999) is focused on the representation of

the semantic knowledge of the analyzed object code,

and it only represents the structural component of the

model in a UML (Uniﬁed Modeling Language) dia-

gram. The main goal of this work is to extract the

structural component from the analyzed object code,

as well as the behavioral one.

3 SEMANTIC MODEL EXTRACTION

We have carried out the task of developing an ap-

proach to semantically model the structure and the be-

havior of the classes embedded in a jar ﬁle. It means

going one step beyond the traditional structural con-

ceptual modeling.

We will distinguish two kinds of ontologies obtained after the semantic model extraction process, depending on the information that we want to use. The first kind of ontology (we call it Data Ontology) models only structural knowledge from the Java library. Classes from the library are modeled as ontology classes and are organized in a class hierarchy that is analogous to the library class hierarchy. The other kind of ontology (we call it Metadata Ontology) is more comprehensive, because it models both structural and behavioral knowledge from the Java library.

In this section we explain the processes of Extrac-

tion of a Structural Model and Extraction of a Com-

prehensive Model that obtain as a result a data ontol-

ogy and a metadata ontology, respectively.

ICEIS 2011 - 13th International Conference on Enterprise Information Systems


[Figure 1: class diagram. Library classes: Employee, ByThehourEmployee, WageEarningEmployee, CommissionBasedEmployee, Director, EnterpriseManagement, Enterprise, EnterpriseException. Superclasses outside the library: java.lang.Object, java.lang.Throwable, java.lang.Exception.]

Figure 1: Enterprise management class diagram.

Let us present here an example to illustrate the explained processes of information extraction. It is a small library that contains 8 class files representing information related to an enterprise and its employees. In figure 1 we can see the class hierarchy embedded in the jar file. Classes in white background are the classes related to the .class files of the library, and classes in dark background are the superclasses of these classes.

3.1 Extraction of a Structural Model

In this subsection we explain in more detail the process of extracting the structural model from a Java library and organizing it in the form of an OWL ontology. This process provides as a result a Data Ontology.

In (Gómez-Pérez et al., 2004) we can find the most widespread way to express an object-oriented model by means of ontologies and, specifically, using OWL. On the basis of the ideas presented in (Gómez-Pérez et al., 2004), we have modeled the library classes as OWL classes, and their fields as properties of these OWL classes. Each field is modeled as an OWL datatype property if the type of the field is a datatype, or as an OWL object property if the field type is a class. In this last case, this class is also modeled in the ontology. Each class is included in the ontology with its superclasses, thus obtaining a class hierarchy.
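The field mapping just described can be illustrated with a minimal Java sketch based on standard reflection. It is not the Jar2Ontology implementation: the classes Employee and Enterprise, the helper names, and the decision to treat primitives, their wrappers and String as datatypes are assumptions made only for this example.

```java
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldMappingSketch {

    // Hypothetical stand-ins for library classes (not taken from the paper's jar).
    public static class Enterprise { String name; }
    public static class Employee {
        String name;
        double salary;
        Enterprise employer;
        Employee[] subordinates;
    }

    /** Strips array dimensions, e.g. Employee[] -> Employee. */
    public static Class<?> elementType(Class<?> type) {
        while (type.isArray()) {
            type = type.getComponentType();
        }
        return type;
    }

    /** Assumption: primitives, their wrappers and String are treated as datatypes. */
    public static boolean isDatatype(Class<?> type) {
        return type.isPrimitive()
                || type == String.class
                || type == Boolean.class
                || type == Character.class
                || Number.class.isAssignableFrom(type);
    }

    /** Maps each declared field name to the kind of OWL property it would become. */
    public static Map<String, String> classifyFields(Class<?> javaClass) {
        Map<String, String> kinds = new LinkedHashMap<>();
        for (Field field : javaClass.getDeclaredFields()) {
            if (field.isSynthetic()) {
                continue; // skip compiler-generated fields
            }
            Class<?> type = elementType(field.getType());
            kinds.put(field.getName(), isDatatype(type) ? "DatatypeProperty" : "ObjectProperty");
        }
        return kinds;
    }

    public static void main(String[] args) {
        System.out.println(classifyFields(Employee.class));
    }
}
```

In this sketch, a field classified as an object property is also the signal that its type has to be added to the ontology as a class.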

Figure 2 shows the classes of the ontology that is generated taking into account only the classes of the example (figure 1) that are embedded in the jar file, together with their superclasses.

We can see in figure 3 the properties of these classes, which correspond to the fields of the Java classes. Properties in dark background are those that represent fields whose type is a class (OWL object properties). Properties in white background represent fields whose type is a datatype (OWL datatype properties).

Figure 2: Classes in the data ontology that models the example library class hierarchy.

In figure 4 we can see the process to create the ontology. This process is as follows: Initially, a class hierarchy that represents the library classes is created in the ontology. Then, the hierarchy is explored with the aim of adding the fields of each class. This step often entails the creation of new classes, because the types of the added fields are classes that do not exist in the hierarchy. Thus, a recursive process is executed. It consists of adding fields to the new classes and adding new classes when necessary. At the end of the process, the class hierarchy has grown remarkably, because we add to the ontology the classes of the fields of the initial hierarchy together with the classes created by the subsequent addition of fields.

Let us formalize the above introduced Structural

Model Extraction Process.

Definition 2. Structural Model Extraction Process is the transformation of a Java library L to an ontology O that satisfies:

• For each class $JC_i$ in the library L, there exists a corresponding class $OC_i$ in the ontology O.

• For each superclass $JC_j$ of a class $JC_i$, there exists a corresponding class $OC_j$ in O, where $OC_j$ is the superclass of $OC_i$.

• For each field $f_k$ of a class $JC_i$, there exists a corresponding property $p_k$ of the class $OC_i$ in O.

• For each field $f_k$, if its range is a class $JC_l$, there exists a corresponding class $OC_l$ in O.

Figure 3: Properties of the classes in the data ontology that models the example library.

[Figure 4: flow diagram. Steps: "Class Hierarchy Creation"; "Adding Fields to the new Classes"; decision "Any field whose type is a class out of the current hierarchy?"; if yes, "Add Class to the Hierarchy" and repeat; if no, "End".]

Figure 4: Structural model extraction flow diagram.

Figure 2 shows the ontology at the beginning of

this process for our example. This ontology has only

11 classes, but when the recursive process ends, the

ontology contains 120 classes. These results have

been obtained using Jar2Ontology, the tool that im-

plements this approach.
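The recursive growth of the hierarchy can be sketched as a worklist algorithm. The following Java example is only an illustration under stated assumptions: it uses standard reflection instead of the tool's object code analysis, the classes Company, Person and Address are hypothetical, and primitives, wrappers and String are assumed to be datatypes that never become ontology classes.

```java
import java.lang.reflect.Field;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;

public class StructuralClosureSketch {

    // Hypothetical library classes used only for illustration.
    public static class Address { String street; }
    public static class Person { String name; Address home; }
    public static class Company { Person director; Person[] staff; }

    static Class<?> elementType(Class<?> type) {
        while (type.isArray()) {
            type = type.getComponentType();
        }
        return type;
    }

    // Assumption: these types are datatypes and are never added as classes.
    static boolean isDatatype(Class<?> type) {
        return type.isPrimitive()
                || type == String.class
                || type == Boolean.class
                || type == Character.class
                || Number.class.isAssignableFrom(type);
    }

    /** Adds classes until no field refers to a class outside the hierarchy. */
    public static Set<Class<?>> expand(Set<Class<?>> libraryClasses) {
        Set<Class<?>> ontologyClasses = new LinkedHashSet<>();
        Deque<Class<?>> pending = new ArrayDeque<>(libraryClasses);
        while (!pending.isEmpty()) {
            Class<?> current = pending.pop();
            if (isDatatype(current) || !ontologyClasses.add(current)) {
                continue; // a datatype, or a class already in the hierarchy
            }
            if (current.getSuperclass() != null) {
                pending.push(current.getSuperclass()); // keep the hierarchy complete
            }
            for (Field field : current.getDeclaredFields()) {
                pending.push(elementType(field.getType())); // field types may be new classes
            }
        }
        return ontologyClasses;
    }

    public static void main(String[] args) {
        Set<Class<?>> start = new LinkedHashSet<>();
        start.add(Company.class);
        System.out.println(expand(start));
    }
}
```

Starting from the single class Company, the sketch also reaches Person, Address and java.lang.Object, which mirrors on a small scale how the example ontology grows during the recursive process.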

We have shown how we can create an ontology that models the structural component of a Java library following the representation of object-oriented data models that is most widespread in the current literature. This approach provides a data ontology whose class hierarchy is analogous to the original Java class hierarchy embedded in the library. However, this data ontology cannot represent all the structural metadata of a Java class. For this reason, we propose here a new approach that solves this problem.

3.2 One Step Beyond: Extraction of a Comprehensive Model

As we have already said, so far we have not taken into account all the semantic information that we can obtain from a Java class. For example, the data ontology is not able to represent whether a Java class implements a certain interface, whether it is public, private or protected, or which package it belongs to. Furthermore, a conceptual model represented by means of an OWL ontology cannot represent information about the behavioral component of the classes (i.e., the set of methods of each Java class).

In this subsection we explain a more comprehensive approach that handles all the class metadata that we are able to obtain from the object code: first, we carry out a more exhaustive study of the structure and, additionally, we incorporate the behavioral information. This approach obtains as a result a Metadata Ontology.

To this end, we create an OWL ontology that models the metadata of the conceptual model, i.e., a container in OWL to store information from conceptual models. Thus, when we want to represent the semantics of a conceptual model, we create instances of the ontology classes, i.e., we insert semantic information into the container. Let us see this in detail.

3.2.1 A Container in OWL to Store Information from a Conceptual Model

In order to represent all this semantic information in OWL, we have defined 4 classes in the ontology (Class, Method, Local Variable, and Field). Later, we will create instances of these classes to capture the extracted information.

In figure 5 we can see these four classes with their properties and the types of those properties. Let us see them in more detail.

[Figure 5: diagram of the four container classes and their properties.]

ClassClass
-Name : String
-SuperClass : ClassClass
-Package : String
-Modifiers : String Set
-Fields : FieldClass Set
-Methods : MethodClass Set
-Interfaces : String Set

MethodClass
-Signature : String
-Name : String
-Modifiers : String Set
-Parameters : String Set
-ReturnType : String
-LocalVariables : LocalVariableClass Set
-InvokedMethods : String Set
-InvokedMethodsInOntology : MethodClass Set
-Exceptions : String Set
-Code : String

FieldClass
-Name : String
-Type : String
-TypeInOntology : ClassClass
-Modifiers : String Set
-IsArray : Boolean

LocalVariableClass
-Name : String
-Type : String
-TypeInOntology : ClassClass
-IsArray : Boolean

Figure 5: OWL Classes used to model the metadata extracted from Java libraries.

• ClassClass: Each instance of this class models a Java class. The set of properties of the ClassClass is the following:

– Name. It is a string which contains the name of

the class.

– SuperClass. This property relates the class to

its direct superclass.

– Package. This is a string with the name of the

package of the class.

– Modiﬁers. It is a list of the modiﬁers of the

class. These modiﬁers indicate if the class is

abstract, ﬁnal, private, protected, public, static

or strictfp. Furthermore, we can indicate that it

is an interface.

– Fields. It is a set of instances of the Field-

Class that model the structural component of

the class.

– Methods. This property is a set of instances of

MethodClass that model the behavior compo-

nent of the class.

– Interfaces. It is a set of strings with the names of the interfaces that are implemented by the represented class.

• FieldClass: It is used to model class ﬁelds. The

properties of this class are detailed next.

– Name. This property contains the name of the

ﬁeld in a string.

– Type. It represents the type of the ﬁeld using

a string. If the ﬁeld is an array, this property

indicates the name of the type without the []

symbols.

– TypeInOntology. This property relates the ﬁeld

to its type when the type of the ﬁeld is a class

that is modeled in the ontology.

– Modiﬁers. It contains the ﬁeld modiﬁers, that

indicate if the ﬁeld is ﬁnal, private, protected,

public, static, transient or volatile.

– IsArray. It is a boolean that indicates whether the field is an array, i.e., whether the field cardinality can be more than 1.

• MethodClass: Methods are modeled with this

class. The set of properties that deﬁnes this class

is the following.

– Name. This is a string to express the name of

the method.

– Signature. This property contains the complete signature of the method. The parts of the signature are also modeled in other properties, as we can see here.

– Modiﬁers. It is a list with the modiﬁers of

the method. In this case, the modiﬁers indi-

cate if the method is abstract, ﬁnal, native, pri-

vate, protected, public, static, strictfp or syn-

chronized.

– Parameters. This is a list with the types of the method parameters.

– ReturnType. This is a string with the return type

of the method.

– LocalVariables. This property relates the

method to its local variables, that are modeled

as instances of the LocalVariableClass.

– InvokedMethods. It is a set with the names of the methods that are invoked by the method that is being modeled.

– InvokedMethodsInOntology. This property is very close to the previous one. It is a set of MethodClass instances that represent the invoked methods that are represented in the ontology. The set of the InvokedMethods property contains more elements than the set of the InvokedMethodsInOntology property when the method invokes methods that are not modeled in the ontology. This happens frequently.

– Exceptions. It is a set of strings with the names

of the exceptions that the method throws.

– Code. This is a string with the byte code (object

code) of the method.

• LocalVariableClass: This class is used to represent information about the local variables of the methods. The properties of this class are the same as those of FieldClass, except for the Modifiers property, as we can see.

– Name. It represents the name of the local variable in a string.

– Type. It indicates the type of the local variable in a string. If the local variable is an array, this property does not include the [] symbols.

– TypeInOntology. When the type of the local variable is a class that is represented in the ontology, this property is the instance that represents that class.

– IsArray. It is a boolean that indicates whether the local variable is an array, i.e., whether its cardinality can be more than 1.

[Figure 6: flow diagram. Steps: "Class Hierarchy Creation"; "Adding Fields and Methods to the new Classes"; decision "Any field whose type is a class out of the current hierarchy?"; if yes, "Add Class to the Hierarchy"; decision "Any local variable whose type is a class out of the current hierarchy?"; if yes, "Add Class to the Hierarchy"; decision "Has been created any new class after adding fields step?"; if yes, repeat; otherwise "End".]

Figure 6: Structural and behavior model extraction flow diagram.

As we can see, by means of the previous OWL containers, our approach obtains an ontology that represents more semantic information about classes. Each class created in the ontology includes all the information obtained by means of the structural model extraction process, plus additional structural and behavioral information.
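To make the shape of these containers concrete, the following sketch mirrors them as plain Java data classes. It only illustrates the properties listed above; the Java names (ClassInfo, FieldInfo, MethodInfo, LocalVariableInfo and the helper field) are hypothetical and do not come from Jar2Ontology, which stores this information as OWL instances.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class MetadataContainersSketch {

    /** Mirrors ClassClass. */
    public static class ClassInfo {
        public String name;
        public ClassInfo superClass;
        public String packageName;
        public Set<String> modifiers = new LinkedHashSet<>();
        public List<FieldInfo> fields = new ArrayList<>();
        public List<MethodInfo> methods = new ArrayList<>();
        public Set<String> interfaces = new LinkedHashSet<>();
    }

    /** Mirrors FieldClass. */
    public static class FieldInfo {
        public String name;
        public String type;              // base type name, without the [] symbols
        public ClassInfo typeInOntology; // null when the type is not modeled
        public Set<String> modifiers = new LinkedHashSet<>();
        public boolean isArray;
    }

    /** Mirrors LocalVariableClass: the same as FieldInfo, without modifiers. */
    public static class LocalVariableInfo {
        public String name;
        public String type;
        public ClassInfo typeInOntology;
        public boolean isArray;
    }

    /** Mirrors MethodClass. */
    public static class MethodInfo {
        public String signature;
        public String name;
        public Set<String> modifiers = new LinkedHashSet<>();
        public List<String> parameters = new ArrayList<>();
        public String returnType;
        public List<LocalVariableInfo> localVariables = new ArrayList<>();
        public Set<String> invokedMethods = new LinkedHashSet<>();
        public Set<MethodInfo> invokedMethodsInOntology = new LinkedHashSet<>();
        public Set<String> exceptions = new LinkedHashSet<>();
        public String code;
    }

    /** Splits a declared type such as "Employee[]" into the Type and IsArray properties. */
    public static FieldInfo field(String name, String declaredType) {
        FieldInfo info = new FieldInfo();
        info.name = name;
        info.isArray = declaredType.endsWith("[]");
        info.type = declaredType.replace("[]", "");
        return info;
    }

    public static void main(String[] args) {
        FieldInfo staff = field("staff", "Employee[]");
        System.out.println(staff.type + " isArray=" + staff.isArray);
    }
}
```

Keeping both a string-valued Type and an optional TypeInOntology reference, as the containers do, lets a field point to a class even when that class has not been modeled in the ontology.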

3.2.2 Inserting Semantic Information into the Container

Now, we will explain the process to obtain all that

information. It is similar to the structural model ex-

traction process, although it is more complex.

The process, shown in figure 6, is as follows: Initially, we add to the ontology instances of the ClassClass corresponding to the Java classes of the hierarchy embedded in the input library. In this first step, we add all the information about each class, except the fields and the methods. After that, each class of the hierarchy is analyzed in depth, focusing on its fields and methods. This step usually entails the creation of new ClassClass instances, for two reasons: some types of the added fields are classes that are not yet represented in the OWL ontology, and some types of the local variables of the added methods are classes that are not yet represented in the ontology. This process is executed recursively until no new class is added.

At the end of the process, the number of classes represented in the OWL ontology is usually larger than the number of classes obtained by the approach explained in the previous subsection, because we now add to the ontology information regarding the classes that appear in the Java methods.
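Part of the method metadata described in this section can be obtained with standard Java reflection, as the following minimal sketch shows. It is an illustration under stated assumptions, not the Jar2Ontology code: the classes MethodInfo and Payroll and the method describe are hypothetical, and the properties that require inspecting the method body (LocalVariables, InvokedMethods and Code) are left out because they need a bytecode library such as BCEL.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

public class MethodMetadataSketch {

    /** Subset of the MethodClass properties that plain reflection can fill. */
    public static class MethodInfo {
        public String name;
        public String signature;
        public String modifiers;
        public List<String> parameters = new ArrayList<>();
        public String returnType;
        public List<String> exceptions = new ArrayList<>();
    }

    // Hypothetical library class used only for illustration.
    public static class Payroll {
        public static double pay(String employee, int hours) throws IllegalArgumentException {
            if (hours < 0) {
                throw new IllegalArgumentException("negative hours");
            }
            return hours * 10.0;
        }
    }

    /** Copies the reflective description of a method into a MethodInfo. */
    public static MethodInfo describe(Method method) {
        MethodInfo info = new MethodInfo();
        info.name = method.getName();
        info.signature = method.toString();
        info.modifiers = Modifier.toString(method.getModifiers());
        for (Class<?> parameter : method.getParameterTypes()) {
            info.parameters.add(parameter.getName());
        }
        info.returnType = method.getReturnType().getName();
        for (Class<?> exception : method.getExceptionTypes()) {
            info.exceptions.add(exception.getName());
        }
        return info;
    }

    /** Describes the first declared method with the given name, or returns null. */
    public static MethodInfo describe(Class<?> owner, String methodName) {
        for (Method method : owner.getDeclaredMethods()) {
            if (method.getName().equals(methodName)) {
                return describe(method);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        MethodInfo info = describe(Payroll.class, "pay");
        System.out.println(info.signature);
    }
}
```

Reflection requires the class to be loaded, whereas the approach in this paper reads the object code directly; the sketch is only meant to show which pieces of information populate a MethodClass instance.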

Let us formalize the Comprehensive Model Extraction Process introduced above.

Definition 3. Let C be the set of containers shown in figure 5 and represented in an ontology O, we define the Comprehensive Model Extraction Process as the transformation of a Java library L in a set of instances of the O ontology classes that satisfies:

• For each class $JC_i$ in the library L, there exists a corresponding instance $CI_i$ of the ClassClass in the ontology O.

• For each superclass $JC_j$ of a class $JC_i$, there exists a corresponding instance $CI_j$ of the ClassClass in O, where $CI_j$ represents the superclass of $CI_i$.

• For each field $f_k$ of a class $JC_i$, there exists a corresponding instance $FI_k$ of the FieldClass in O.

• For each method $m_k$ of a class $JC_i$, there exists a corresponding instance $MI_k$ of the MethodClass in O.

• For each local variable $lv_n$ of a method $m_k$, there exists a corresponding instance $LVI_n$ of the LocalVariableClass in O.

• For each field $f_k$ or local variable $lv_n$, if its range is a class $JC_l$, there exists a corresponding instance $CI_l$ of the ClassClass in O.

When we use our example library as input of Jar2Ontology and we use this comprehensive approach, we obtain an ontology with 4 classes (ClassClass, FieldClass, MethodClass and LocalVariableClass). Their properties are those shown in figure 5. This ontology holds a large amount of information: for each class that we model, we have one instance of the ClassClass, some FieldClass instances related to its fields, some MethodClass instances related to its methods and, for each one of its methods, some LocalVariableClass instances related to its local variables.

Figure 7 shows the instances of the ClassClass at the beginning of this process. We can see that the classes are the same as those obtained at the beginning of the structural model extraction process (in figure 2). Figure 8 contains the instances of the FieldClass related to the classes shown in figure 7. We can observe that these fields correspond to the OWL properties that we obtained using the structural model extraction process (in figure 3). We do not show the instances of the MethodClass or of the LocalVariableClass, because there are too many.

Figure 7: Classes in the metadata ontology that models the example library class hierarchy.

Figure 8: Properties of the classes in the metadata ontology that models the example library.

At the beginning of the comprehensive model extraction process, the ontology only has 11 instances of the ClassClass, 17 instances of the FieldClass, 91 of the MethodClass and 188 of the LocalVariableClass. When the extraction process has finished, there are 129 ClassClass instances, and the number of instances of the other classes increases accordingly.

4 THE OBTAINED MODELS

As we have seen in section 3, the information extraction process can produce a big model that represents many more classes than those that are directly embedded in the library. However, this exhaustive recursion is not always necessary (for example, when we want to use these ontologies as input of a matching process). For this reason, we distinguish three kinds of models with different recursion degrees, depending on the desired comprehensiveness of the resulting model:

• Lite Model. This model only captures the classes

that directly appear in the jar ﬁle as well as their

superclasses. No additional classes are included

in the model. Figures 2, 7 and 8 correspond to

this kind of model, for our example.

• Full Model. This model contains all the classes that appear in the jar file together with all the classes found during the analysis of methods and fields. This type of model stores the largest amount of information that can be extracted from the jar file.

• Pruned Model. This type of model is a compromise between the full and lite models. It contains information regarding the classes from the jar file class hierarchy and the classes added in the first iteration of the analysis of methods and fields. Figure 9 shows the data ontology obtained as pruned model. Classes in darker background are those classes that have been added in the first iteration of the process.
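The three kinds of models can be read as the same expansion stopped after a different number of iterations. The sketch below illustrates this reading under several assumptions: it uses reflection on hypothetical classes (Company, Person, Address), it only follows field types (the approach described in this paper also analyzes methods), and the names ModelKind and build are invented for the example.

```java
import java.lang.reflect.Field;
import java.util.LinkedHashSet;
import java.util.Set;

public class ModelDepthSketch {

    public enum ModelKind { LITE, PRUNED, FULL }

    // Hypothetical library classes used only for illustration.
    public static class Address { String street; }
    public static class Person { Address home; }
    public static class Company { Person director; }

    static Class<?> elementType(Class<?> type) {
        while (type.isArray()) {
            type = type.getComponentType();
        }
        return type;
    }

    // Assumption: these types are datatypes and are never added as classes.
    static boolean isDatatype(Class<?> type) {
        return type.isPrimitive() || type == String.class || type == Boolean.class
                || type == Character.class || Number.class.isAssignableFrom(type);
    }

    static void addWithSuperclasses(Class<?> type, Set<Class<?>> target) {
        for (Class<?> c = type; c != null; c = c.getSuperclass()) {
            target.add(c);
        }
    }

    public static Set<Class<?>> build(Set<Class<?>> jarClasses, ModelKind kind) {
        // Iteration 0: the jar classes and their superclasses (the lite model).
        Set<Class<?>> model = new LinkedHashSet<>();
        for (Class<?> c : jarClasses) {
            addWithSuperclasses(c, model);
        }
        int maxIterations = kind == ModelKind.LITE ? 0
                : kind == ModelKind.PRUNED ? 1 : Integer.MAX_VALUE;
        Set<Class<?>> frontier = new LinkedHashSet<>(model);
        for (int i = 0; i < maxIterations && !frontier.isEmpty(); i++) {
            Set<Class<?>> discovered = new LinkedHashSet<>();
            for (Class<?> c : frontier) {
                for (Field field : c.getDeclaredFields()) {
                    Class<?> type = elementType(field.getType());
                    if (!isDatatype(type)) {
                        addWithSuperclasses(type, discovered);
                    }
                }
            }
            discovered.removeAll(model); // keep only classes that are really new
            model.addAll(discovered);
            frontier = discovered;       // the next iteration analyzes only new classes
        }
        return model;
    }

    public static void main(String[] args) {
        Set<Class<?>> jar = new LinkedHashSet<>();
        jar.add(Company.class);
        for (ModelKind kind : ModelKind.values()) {
            System.out.println(kind + ": " + build(jar, kind).size() + " classes");
        }
    }
}
```

For the hypothetical Company class, the lite model holds Company and java.lang.Object, the pruned model adds Person, and the full model also reaches Address.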

The next section shows the results of executing Jar2Ontology with different software libraries. We will see the different numbers of classes, fields, methods and local variables represented in each kind of model.

5 IMPLEMENTATION AND EXPERIMENTATION ISSUES

The Protégé Ontology Editor and Knowledge Acquisition System has been developed in Java, is extensible and provides a programming framework based on plug-ins. Furthermore, Protégé is backed by a large community of active users and developers.


Figure 9: Classes in the data ontology obtained as pruned

model of the example library.

We have developed Jar2Ontology, a set of Protégé plug-ins, to implement the approaches described above. As Protégé is developed in Java, we have used some libraries apart from the standard Java ones. The core Protégé API and the Protégé-OWL API have been used to develop the plug-ins on top of Protégé and to create and manipulate OWL ontologies. Additionally, in order to deal directly with Java object code, we have used the BCEL API (Dahm and Van Zyl, 2006; Dahm, 2001) and Java reflection.

Jar2Ontology has been tested with many libraries. Some of them are the following:

• EnterpriseModel.jar. It is a small library created with the aim of illustrating the implemented approaches. In figure 1 we can see the class hierarchy that the jar file contains, together with the superclasses of its classes.

• flickrapi.jar. It is the library of the flickr Java API. This library contains 110 class files.

• htmlcleaner2 1.jar. HtmlCleaner is an open-source HTML parser written in Java. This library contains 40 class files.

• MozillaParser.jar. It is a Java package that enables you to parse HTML pages into a Java Document object. This library contains 14 class files.

Core Protégé API: http://protege.stanford.edu/protege/3.4/docs/api/core
Protégé-OWL API: http://protege.stanford.edu/protege/3.4/docs/api/owl
BCEL: http://jakarta.apache.org/bcel
Java reflection: http://java.sun.com/docs/books/tutorial/reflect
flickr API: http://www.flickr.com/services/api
HtmlCleaner: http://htmlcleaner.sourceforge.net/
MozillaParser: http://mozillaparser.sourceforge.net/

Table 1: Number of classes obtained in the data ontology, i.e., when the structural model extraction approach is used.

Library          Lite Model  Full Model  Pruned Model
EnterpriseModel  11          120         16
flickrapi        121         241         143
HtmlCleaner      47          159         60
MozillaParser    18          143         23

We want to recall that the input of Jar2Ontology is neither a UML diagram nor the source code that results from its implementation. The input of Jar2Ontology is the jar file that contains the Java object code files corresponding to the implemented classes of the library.
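As a hedged illustration of what taking a jar file as input involves, the following sketch lists the classes contained in a jar using only the standard java.util.jar API. The class and method names (JarClassListerSketch, toClassName, classNames) are hypothetical, and the sketch stops at the names of the classes; analyzing each class file, as Jar2Ontology does with BCEL and reflection, is not shown.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class JarClassListerSketch {

    private static final String CLASS_SUFFIX = ".class";

    /** Converts a jar entry name such as "com/acme/Employee.class" to "com.acme.Employee". */
    public static String toClassName(String entryName) {
        String withoutSuffix = entryName.substring(0, entryName.length() - CLASS_SUFFIX.length());
        return withoutSuffix.replace('/', '.');
    }

    /** Returns the fully qualified names of all class files stored in the jar. */
    public static List<String> classNames(String jarPath) throws IOException {
        List<String> names = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                String entryName = entries.nextElement().getName();
                if (entryName.endsWith(CLASS_SUFFIX)) {
                    names.add(toClassName(entryName));
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java JarClassListerSketch path/to/library.jar
        for (String name : classNames(args[0])) {
            System.out.println(name);
        }
    }
}
```

The number of names returned by such a listing corresponds to the number of class files of a library, which is the starting point for the lite model.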

In table 1 we can see the number of classes of the ontologies obtained by means of our extraction tool when we use the structural model extraction approach. The number of classes of the lite model is always a bit higher than the number of class files of the library, because the superclasses of the library classes are included. Thus, we can note that although there are 8 classes in the EnterpriseModel library (classes with white background in figure 1), the lite model includes 3 more classes, corresponding to the superclasses of the library classes (java.lang.Object, java.lang.Throwable and java.lang.Exception).

We can see that, as we said when we explained the obtained models in section 4, the full model has many more classes than the initial class hierarchy, and that the pruned model is a compromise solution between the other two.

In table 2 we can see the number of instances of

each of the four classes created in the ontology when

we apply our comprehensive model extraction ap-

proach using the above mentioned example libraries.

For the sake of space, we do not show here the ob-

tained ontology instances, but a summary of the num-

ber of them.

We draw the same conclusions from this table as from the previous one. Furthermore, we can see that the growth of the number of classes represented in the ontologies is higher, because we add classes obtained during the deep analysis of the behavioral component.

6 CONCLUSIONS

In this work we have presented a novel approach for

the automatic extraction of semantic knowledge from

Java object code. The approach takes into account

both the structural and the behavioral knowledge, go-

ing one step beyond the traditional structural concep-

tual modeling.

In addition to this, the proposed approach has been implemented, obtaining as a result the Jar2Ontology tool. It is a set of Protégé plug-ins that generate OWL models from the object code of Java libraries. We have tested it with toy examples, as well as with real-world libraries.

Table 2: Number of classes (C), fields (F), methods (M) and local variables (LV) obtained in the comprehensive ontology, i.e., when the comprehensive model extraction approach is used.

Library          Model   C    F     M     LV
EnterpriseModel  Lite    11   17    91    188
EnterpriseModel  Full    129  599   1440  188
EnterpriseModel  Pruned  25   65    334   188
flickrapi        Lite    121  694   236   217
flickrapi        Full    297  1696  2447  217
flickrapi        Pruned  163  941   990   217
HtmlCleaner      Lite    47   177   532   681
HtmlCleaner      Full    212  976   2609  681
HtmlCleaner      Pruned  60   217   837   681
MozillaParser    Lite    18   51    58    76
MozillaParser    Full    144  693   1513  76
MozillaParser    Pruned  23   82    218   76

Our research has been oriented to the data integration context.
Nevertheless, semantic knowledge extraction from Java libraries can be
useful in many other applications, such as the automatic generation of
code documentation or reverse engineering.

ACKNOWLEDGEMENTS

Part of the work reported in this paper was supported by Spain's
Ministry of Science and Innovation (Ministerio de Ciencia e
Innovación) under grant TIN2006-15041-C04-01.

Part of the research reported in this paper is supported by the
Andalusian Government (Junta de Andalucía, Consejería de Economía,
Innovación y Ciencia) under project P07-TIC-03175 "Representación y
Manipulación de Objetos Imperfectos en Problemas de Integración de
Datos: Una Aplicación a los Almacenes de Objetos de Aprendizaje".


ICEIS 2011 - 13th International Conference on Enterprise Information Systems
