knowledge in the matching process. In our case, the
conceptual model is the Java data model, where the
concepts are the library classes and the structural
component refers to the class names and fields, as
well as to the class hierarchy.
Nevertheless, in some conceptual models (e.g., the
object-oriented data model), the semantics consists
of two types of knowledge, structural and behavioral,
which are related to the structural component and the
behavioral component of the model, respectively. In
our case, the behavioral component refers to the
information about the methods of the classes. Using
this behavioral information can enrich the matching
process with additional criteria concerning the
behavioral component.
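To make the distinction between the two components concrete, the following minimal sketch lists the structural information (class name, superclass, fields) and the behavioral information (methods) of a class. It relies on the standard java.lang.reflect API purely for illustration and is not claimed to be the extraction technique used in this work; the class ComponentSketch, its methods structural and behavioral, and the example class Point are hypothetical names introduced here.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class ComponentSketch {

    /** Structural component: class name, superclass and declared fields. */
    static List<String> structural(Class<?> c) {
        List<String> out = new ArrayList<>();
        out.add("class " + c.getName());
        Class<?> parent = c.getSuperclass(); // null for Object and for interfaces
        if (parent != null) {
            out.add("extends " + parent.getName());
        }
        for (Field f : c.getDeclaredFields()) {
            if (f.isSynthetic()) {
                continue; // skip compiler-generated fields
            }
            out.add("field " + f.getName() + " : " + f.getType().getName());
        }
        return out;
    }

    /** Behavioral component: declared methods with their return types. */
    static List<String> behavioral(Class<?> c) {
        List<String> out = new ArrayList<>();
        for (Method m : c.getDeclaredMethods()) {
            if (m.isSynthetic()) {
                continue; // skip compiler-generated methods
            }
            out.add("method " + m.getName() + " -> " + m.getReturnType().getName());
        }
        return out;
    }

    /** Hypothetical library class used only as an example. */
    static class Point {
        int x;
        int y;

        int getX() {
            return x;
        }
    }

    public static void main(String[] args) {
        structural(Point.class).forEach(System.out::println);
        behavioral(Point.class).forEach(System.out::println);
    }
}
```

A matcher restricted to the structural component would only see the output of structural, whereas the additional criteria mentioned above would draw on the output of behavioral.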
It is important to point out that, although our
research has been oriented to the data integration
context, semantic knowledge extraction from Java
libraries can be useful not only for data integration
but also for many other applications, such as
automatic generation of code documentation or reverse
engineering.
The paper is structured as follows. Section 2 presents
a brief state of the art on semantic knowledge
extraction and code analysis. In section 3 we focus on
how to extract semantic information from a jar file
and how to write it in the form of an OWL (Web
Ontology Language) ontology. Next, in section 4 we
present the types of ontologies that can be obtained
after the semantic extraction process. Section 5
introduces some implementation issues of the proposed
approach and some example experiments. Finally,
section 6 ends the paper with concluding remarks.
2 RELATED WORK
Many research efforts have been made in the field
of automatic semantic knowledge extraction in recent
years. Many works aim to obtain a formal
representation of the semantics that underlies a
variety of sources, such as plain
text (e.g. (Buitelaar et al., 2008; Wimalasuriya and
Dou, 2010)), semi-structured documents (e.g. (DuL,
; Thiam et al., 2009)) or relational database schema
(e.g. (Curino et al., 2009; Myroshnichenko and Mur-
phy, 2009)).
In the area of code analysis, we can also find many
interesting works. Code analysis provides support
for many applications, such as program understanding (e.g.
(Jakobac et al., 2005)), hardware design (e.g. (Mar-
tino et al., 2002)), software metrics (e.g. (Wong and
Gokhale, 2005)), security testing (e.g. (Herbold et al.,
2009; Hong et al., 2009; Letarte and Merlo, 2009;
Spoto et al., 2010)), software design (e.g. (Amey,
2002)) and reengineering (e.g. (Herbold et al., 2009;
Kawrykow and Robillard, 2009)). Most code ana-
lyzers examine C/C++ (e.g. (Martino et al., 2002;
Spinellis, 2010; Wong and Gokhale, 2005)) and Java
(e.g. (Jakobac et al., 2005; Kawrykow and Robil-
lard, 2009)) source code, but we have also found
works about PHP (e.g. (Letarte and Merlo, 2009))
and SPARK (e.g. (Amey, 2002)). Most of these ap-
proaches take source code as input data.
Although information extraction from source code
is straightforward, in many cases it is not possible,
simply because the source code is not available.
Therefore, if we want to develop a tool that is as
general as possible, we have to face the difficult
task of analyzing object code in detail.
With respect to object code, we can cite (Hong
et al., 2009; Jackson and Waingold, 1999; Spoto et al.,
2010) as examples of Java bytecode analysis.
Nevertheless, only one of these approaches (Jackson
and Waingold, 1999) focuses on representing the
semantic knowledge of the analyzed object code, and
it only represents the structural component of the
model in a UML (Unified Modeling Language) diagram.
In contrast, the main goal of our work is to extract
the behavioral component from the analyzed object
code as well as the structural one.
3 SEMANTIC MODEL EXTRACTION
We have developed an approach to semantically model
the structure and the behavior of the classes
embedded in a jar file. This means going one step
beyond traditional structural conceptual modeling.
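As a starting point for any such analysis, the classes embedded in a jar file have to be located. The following minimal sketch enumerates the .class entries of a jar with the standard java.util.jar API and converts each entry path to a class name. It only covers this discovery step and says nothing about how the bytecode itself is analyzed; the class JarClassLister and its methods are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class JarClassLister {

    private static final String SUFFIX = ".class";

    /** Converts an entry path such as "a/b/C.class" to the class name "a.b.C". */
    static String toClassName(String entryName) {
        String noSuffix = entryName.substring(0, entryName.length() - SUFFIX.length());
        return noSuffix.replace('/', '.');
    }

    /** Returns the names of all classes stored in the given jar file. */
    static List<String> listClasses(String jarPath) throws IOException {
        List<String> names = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                if (!entry.isDirectory() && entry.getName().endsWith(SUFFIX)) {
                    names.add(toClassName(entry.getName()));
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java JarClassLister <path-to-jar>");
            return;
        }
        for (String name : listClasses(args[0])) {
            System.out.println(name);
        }
    }
}
```

Each class name obtained this way identifies one concept whose structure and behavior would then have to be extracted.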
We distinguish two kinds of ontologies obtained
after the semantic model extraction process,
depending on the information that we want to use.
The first kind of ontology (we call it Data Ontology)
models only structural knowledge from the Java
library. Classes from the library are modeled as
ontology classes and are organized in a class
hierarchy that is analogous to the library class
hierarchy. The other kind of ontology (we call it
Metadata Ontology) is more comprehensive, because it
models both structural and behavioral knowledge from
the Java library.
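The mapping from a library class hierarchy to an analogous ontology class hierarchy can be illustrated with a minimal sketch. It walks the superclass chain of a class and emits one owl:Class declaration per class, linked to its parent with rdfs:subClassOf. Several choices here are assumptions made only for illustration and are not taken from the approach described in this paper: the Turtle serialization, the example namespace http://example.org/lib#, the naming scheme that replaces '.' and '$' with '_', and the use of reflection to read the hierarchy. The class DataOntologySketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class DataOntologySketch {

    /** Prefix declarations; the "lib" namespace is a placeholder. */
    static final String PREFIXES =
            "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n"
            + "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
            + "@prefix lib: <http://example.org/lib#> .\n";

    /** Maps a Java class name to a local name that is safe in Turtle. */
    static String localName(Class<?> c) {
        return c.getName().replace('.', '_').replace('$', '_');
    }

    /** One owl:Class statement per class in the superclass chain of c. */
    static List<String> classAxioms(Class<?> c) {
        List<String> axioms = new ArrayList<>();
        for (Class<?> k = c; k != null; k = k.getSuperclass()) {
            StringBuilder sb = new StringBuilder();
            sb.append("lib:").append(localName(k)).append(" a owl:Class");
            Class<?> parent = k.getSuperclass();
            if (parent != null) {
                // The ontology hierarchy mirrors the Java "extends" relation.
                sb.append(" ;\n    rdfs:subClassOf lib:").append(localName(parent));
            }
            sb.append(" .");
            axioms.add(sb.toString());
        }
        return axioms;
    }

    public static void main(String[] args) {
        System.out.print(PREFIXES);
        for (String axiom : classAxioms(java.util.ArrayList.class)) {
            System.out.println(axiom);
        }
    }
}
```

A Metadata Ontology in the sense described above would extend such output with statements about the methods of each class, which this sketch deliberately leaves out.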
In this section we explain the processes of
Extraction of a Structural Model and Extraction of a
Comprehensive Model, which yield a data ontology and
a metadata ontology, respectively.
ICEIS 2011 - 13th International Conference on Enterprise Information Systems