An Ontology for Portability and Interoperability Digital Documents
An Approach in Document Engineering using Ontologies
Erika Guetti Suca and Flávio Soares Corrêa da Silva
Institute of Mathematics and Statistics, University of São Paulo, São Paulo, Brazil
Keywords:
Document Engineering, Document Interoperability, Document Portability, OOXML (Office Open XML),
ODF (Open Document Format), Ontologies.
Abstract:
Organizations need to exchange information simply and efficiently, at the lowest possible cost. Such information is usually presented as documents with pre-defined content. Documents that are equivalent, or almost equivalent, may nevertheless take quite distinct forms in different organizations, and the same document can differ depending on its historical context. Moreover, organizations do not always use the same technology to generate their documents. The purpose of this work is to enable interoperability and achieve portability of digital documents through the reuse of content and format in different plausible combinations. We propose characterizing digital documents using ontologies as a solution to the lack of interoperability among implementations of document formats. As a proof of concept, we consider portability between the OOXML and ODF document formats.
1 INTRODUCTION
Governments are interested in developing policies, processes and standards in Information and Communication Technology (ICT), and in setting up structures dedicated to achieving interoperability. In digital preservation especially, interoperability is not only a technical challenge but also a social and institutional one, since it depends on institutions that undergo changes of direction, mission, administration and funding sources. For this reason, the Brazilian government established a set of interoperability standards called e-PING (Padrões de Interoperabilidade de Governo Eletrônico, Electronic Government Interoperability Standards)¹. Regarding the storage of information, e-PING adopts the Open Document Format (ODF) for transmitting government information between the public and private sectors while maintaining privacy and security. Digital documents are considered official records and are managed according to laws and standards that cover the entire lifecycle of these materials (Bretas and do Socorro Ferreira Mesquita, 2010).
Choosing a common standard format for exchanging information is an excellent step, but it still leaves organizations dependent on a specific format, even if that format is free. The main problem with digital documents is ensuring access to them in the long term, so it is necessary to overcome the technical barriers associated with document formats. The content, structure and context of documents must be associated with software features that preserve their representations and relationships and enable their reconstruction. The purpose of this work is to facilitate the distribution of documents by overcoming the problem of the formats in which they were created. In addition, we aim to enable document interoperability and achieve simple document portability through the reuse of content and formats in different plausible combinations.
¹ http://www.governoeletronico.gov.br/
Ontologies can assist us in this work. They provide a shared understanding of terms, which allows interoperability and offers a means for the intelligent integration of information (Uschold, 1998). This work uses ontologies as a solution to the lack of interoperability among implementations of document formats. The proposal is to represent digital documents based on two ontologies: (1) a format ontology, which characterizes the structure and presentation of digital documents independently of the specific encoding of each software product, and (2) a context ontology, which represents the information contexts of businesses. Figure 1 shows our proposal. Documents are produced from centralized generic representations, through mediators between the ontologies and the document presentation and editing systems. As a proof of concept, we consider interoperability between the ODF and OOXML document formats.
Guetti Suca E. and Soares Corrêa da Silva F..
An Ontology for Portability and Interoperability Digital Documents - An Approach in Document Engineering using Ontologies.
DOI: 10.5220/0004547503730380
In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval and the International Conference on Knowledge
Management and Information Sharing (KMIS-2013), pages 373-380
ISBN: 978-989-8565-75-4
Copyright © 2013 SCITEPRESS (Science and Technology Publications, Lda.)
[Figure 1 diagram; recoverable labels: Format Ontology, Context Ontology, Document Ontology, ODF Translator, OOXML Translator, PDF Translator, Renderer, Text]
Figure 1: Document generation from the format ontology and the context ontology.
The paper is organized as follows: in Section 2 we explain the problem of preserving digital documents; in Section 3 we discuss XML-based document formats; in Section 4 we introduce the fundamental concepts of document engineering; in Section 5 we summarize some related work; and in Section 6 we explain our proposed method. In Section 7 we present the results. Finally, in Section 8 we offer some closing remarks and directions for future work.
2 PRESERVATION OF DIGITAL
DOCUMENTS
Doing business by exchanging documents is now natural and intuitive. Documents are interfaces for people and business processes. However, many file formats are incompatible with one another. Even documents written and stored in the same format can become unreadable or inoperable, or may need to have their data migrated after some time. Organizations therefore need to manage their knowledge effectively in order to preserve their intellectual capital, and they need to provide documents independently of the software that created them. Hence, it is important to enable document interoperability within business processes, i.e., to enable the coherent exchange of information based on the adoption of rules and communication standards that allow communication between heterogeneous systems.
When documents are exchanged, it is very important to ensure their authenticity. A document is authentic if it can be shown that a set of properties considered significant has been correctly preserved over time. To achieve authenticity, it is fundamental to record the provenance of the document correctly: to contextualize its existence, describe its custodial history and attest that its integrity has not been compromised (Ferreira, 2006).
Preserving digital information sometimes consists in deliberately modifying or transforming the digital document that carries the message. So that this transformation does not produce a disproportionately degraded message, it is essential to define which properties of the message must be preserved during the transformation process. In summary, enabling document interoperability means not losing the data that explain the semantics of the document, while document portability means not losing the formatting settings that govern the presentation of the document (Ferreira, 2006).
3 DOCUMENT FORMATS BASED ON XML
A document format encapsulates a complete description for storing digital documents, for instance by organizing elements such as text, fonts, graphics and other information needed to display a document. Among the frequently used document formats are Office Open XML (OOXML) and Open Document Format (ODF), which are the main open, XML-based standards for document formats. However, no implementation offers 100% portability between the two, not even the dominant implementations, Microsoft Office and OpenOffice.org (Shah and Kesan, 2009). To illustrate some of the differences between them, Figure 2 shows a simple text encoded in the OOXML and ODF standards.
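To make the structural difference concrete, the following minimal Python sketch builds the same one-line paragraph as a WordprocessingML (OOXML) fragment and as an ODF text fragment. It is an illustration only, not a reproduction of Figure 2: the function names are hypothetical, and the fragments omit the surrounding document elements, styles and packaging that real files require.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the two standards for word-processing content.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"  # OOXML
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"             # ODF

ET.register_namespace("w", W_NS)
ET.register_namespace("text", TEXT_NS)


def ooxml_paragraph(content: str) -> ET.Element:
    """Build <w:p><w:r><w:t>content</w:t></w:r></w:p> (paragraph, run, text)."""
    paragraph = ET.Element(f"{{{W_NS}}}p")
    run = ET.SubElement(paragraph, f"{{{W_NS}}}r")
    text = ET.SubElement(run, f"{{{W_NS}}}t")
    text.text = content
    return paragraph


def odf_paragraph(content: str) -> ET.Element:
    """Build <text:p>content</text:p> (the text sits directly in the paragraph)."""
    paragraph = ET.Element(f"{{{TEXT_NS}}}p")
    paragraph.text = content
    return paragraph


if __name__ == "__main__":
    sample = "STUDENT SHEET"
    print(ET.tostring(ooxml_paragraph(sample), encoding="unicode"))
    print(ET.tostring(odf_paragraph(sample), encoding="unicode"))
```

Even for a single unstyled line, OOXML nests the text inside a run inside a paragraph, whereas ODF places it directly in the paragraph element, so a translator cannot map elements one to one.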
4 DOCUMENT ENGINEERING
Documents are purposeful representations and organizations of information, but they exhibit great variety. Document engineering provides analysis and design methods that yield precise specifications for the information and rules that business processes require. This means developing models that emphasize document requirements and patterns of information exchange, focusing on the separation of the content and the presentation of document information as an inherent and desirable property. A document must be represented using a shared, common conceptual model (Glushko and McGrath, 2008).
Document engineering defines models of different types of documents in a rigorous and unambiguous way, so that their processing or exchange within or between applications can be automated. It needs to be diligent and precise when defining the meaning of any information produced and consumed by business applications. As electronic documents are ubiquitous, document engineering emphasizes that a document must be defined in a technology-neutral way as a purposeful and self-contained collection of information. When businesses exchange documents, they must agree on what the documents mean and on the business processes they expect each other to carry out with them, but they do not need to agree on the technology they use (Glushko and McGrath, 2008).
Figure 2: Sample text encoded in the OOXML and ODF formats.
[Figure 3 diagram; recoverable labels. Classes: Document, Meta, Body, Paragraph, Table, Cell, Text, Graphics, Font, Style, StyleParagraph, StyleText, StyleGraphics, StyleTable, StyleTableColumn, StyleCell. Object properties: has, contains, isComposedOf, displayContent. Other relations: SubClassOf, domain, range.]
Figure 3: Format Ontology.
Many applications need to support different physical interfaces, which implies many-to-many mappings between the input and output interfaces of each application. These many-to-many mappings can be avoided by mapping all physical interfaces to a common conceptual model. The best way to facilitate interoperability is therefore to allow the participants to share the same conceptual model. A common metamodel helps to align different models. Basing the user and application interfaces on a common conceptual model ensures that the documents they process are interoperable (Glushko and McGrath, 2008).
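The saving that a common conceptual model brings can be illustrated with a small count. The sketch below is only an arithmetic illustration with hypothetical function names: it compares the number of mappings needed when every input interface is mapped directly to every output interface with the number needed when each interface is mapped once to a shared model.

```python
def direct_mappings(n_inputs: int, n_outputs: int) -> int:
    """Many-to-many: every input interface is mapped to every output interface."""
    return n_inputs * n_outputs


def shared_model_mappings(n_inputs: int, n_outputs: int) -> int:
    """Common conceptual model: every interface is mapped once to the model."""
    return n_inputs + n_outputs


if __name__ == "__main__":
    for n_inputs, n_outputs in [(2, 2), (4, 3), (10, 10)]:
        print(
            n_inputs,
            n_outputs,
            direct_mappings(n_inputs, n_outputs),
            shared_model_mappings(n_inputs, n_outputs),
        )
```

With four input and three output interfaces, for example, direct mapping needs 12 mappings while the shared model needs 7, and adding one more interface costs a single new mapping instead of one per counterpart.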
[Figure 4 diagram; recoverable labels. Classes: Student, Professor, Institute, Curse, Subject, Qualification, Defense, Committee. Object properties: study, belongs, make, isEnrolledIn, hasAdvisor, work, has, participates, formedIn. Other relations: domain, range.]
Figure 4: Context Ontology for Student Sheet.
Document engineering entails the need for standardization of the syntax, structure and semantics of business documents and their reusable components. It seeks to achieve document interoperability and to improve document portability. Document interoperability is the ability of business applications to extract the information contained in various kinds of documents and transform it into standardized XML structures. These XML data files can then be exchanged between the various systems and processed further (Schmidt et al., 2006). Using standardized XML structures saves effort and yields more consistent, compatible and successful designs. Document portability, in turn, refers to the exchangeability of documents as a whole, i.e., with all the information they contain, including formatting settings and graphic information and, crucially, all stylistic and graphical data (Schmidt et al., 2006). This information can be grouped into four components: content, logical structure, layout structure and presentation (Barron, 1996).
Document portability addresses visual fidelity to an original and sets requirements in terms of optical appearance, stylistic elements and similar matters. Document interoperability, on the other hand, is concerned exclusively with the exchange of the business information contained in documents. It could enable business applications to communicate directly with a wide range of eGovernment services, platforms and administration applications. Document interoperability should allow business processes to be generated from office applications and then integrated into the corresponding eGovernment processes (Schmidt et al., 2006).
5 RELATED WORKS
One notable work in the literature is based on the UN/CEFACT CCTS² (Core Component Technical Specification) framework, applied to European e-customs solutions. CCTS is a conceptual framework for modeling document components in a syntax-neutral and technology-independent manner. It permits handling the different document configurations imposed by divergent national legislations. The conceptual model is transferred to XML Schema, serving as a basis for collaborative Web services for eGovernment (Vogel et al., 2008).
Another interesting work designs XML documents from conceptual schemas and workload information, so that they comply with the consensual information of specific domains. The research presents a conversion approach that considers the data and query workload estimated for XML applications in order to generate an XML schema from a conceptual schema. The workload information is used to produce XML schemas that respond well to the main queries of an XML application. The approach is evaluated through a case study carried out on a native XML database. The experimental results demonstrate that the XML schemas generated by the proposed methodology yield better query performance than related approaches (Schroeder and Mello, 2009).
Lastly, Universal Business Language (UBL), developed by OASIS, is a library of standard electronic XML business documents such as purchase orders and invoices. UBL is designed to provide a universally understood and recognized commercial syntax for legally binding business documents, and to operate within a standard business framework such as ISO 15000 (ebXML) to provide a complete, standards-based infrastructure that can extend the benefits of existing EDI systems to businesses of all sizes. The UBL library is based on a conceptual model of information components known as Business Information Entities (BIEs). These components are assembled into specific document models. A document is a set of information components that are interchanged as part of a business transaction, for example when placing an order. This approach facilitates the creation of UBL-based document types beyond those specified in the UBL release itself (Bosak et al., 2011).
² http://www.unece.org/cefact/index.html
Our proposal uses ontologies implemented in OWL (Web Ontology Language). Although the previous models are more robust and more mature than our proposal, using ontologies offers several advantages. First, OWL is a standard semantic markup language for publishing and sharing ontologies on the World Wide Web and the Semantic Web. Second, because the context ontology is independent of the format ontology, we are free to reuse it in other services alongside a document generation service. Finally, we can harness the power of ontologies to infer new domain knowledge.
6 USING ONTOLOGIES FOR
MODELING DIGITAL
DOCUMENTS
Ontologies are designed to enable the sharing and reuse of knowledge about some domain, in a form that can be communicated between people and computers. To enable this sharing and reuse, a formal specification of concepts is necessary. Ontologies define rules for the relationships between concepts, which make it possible to query and infer knowledge (Gruber, 1995).
Documents present information according to their purpose, and this information can be presented in different ways. Our proposal is to build a model that considers the essential qualities of a digital document. Following the document engineering approach, a document is considered a combination of information components and presentation components, where the information is independent of how it is presented. The model is based on the integration of two ontologies: one that represents the presentation structure and another that represents the information according to the business context, i.e., a format ontology and a context ontology. These ontologies are independent of each other. The objective of the format ontology is to achieve simple document portability. The purpose of the context ontology is to achieve document interoperability. The mapping from the format ontology and the context ontology to their physical interfaces occurs through translators.
6.1 Format Ontology
The format ontology characterizes the visual structure of a document, i.e., its formatting settings and graphic information. It formally specifies the document components, i.e., the presentation structure, including metadata, paragraphs, texts, tables, lists, enumerations, images, styles, etc. The metadata is the information associated with the document, for example: creation date, last modified date, text language, document author, number of pages, etc. The document's layout is based on tables. Tables are composed of cells, and cells can contain paragraphs with images, text or another table. Each paragraph of text has a presentation style, i.e., color, font color, font, size, horizontal alignment, vertical alignment, etc. The format ontology is shown in Figure 3. From the format ontology, a document can be created in a format appropriate for its purpose.
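The containment structure described above can be sketched with plain Python data classes. This is a simplified illustration, not the OWL ontology itself: the class names follow Figure 3, but the field names, the default style values and the helper functions `collect_text` and `build_header` are assumptions made for the example, and the sample layout is much smaller than the real student sheet header.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Style:
    font_type: str = "Times New Roman"
    size: int = 12
    horizontal_pos: str = "LEFT"


@dataclass
class Text:
    content: str
    style: Style = field(default_factory=Style)


@dataclass
class Graphics:
    address_image: str


@dataclass
class Paragraph:
    # A paragraph displays text, an image, or a nested table.
    items: List[Union[Text, Graphics, "Table"]] = field(default_factory=list)


@dataclass
class Cell:
    paragraphs: List[Paragraph] = field(default_factory=list)


@dataclass
class Table:
    rows: int
    columns: int
    cells: List[Cell] = field(default_factory=list)


@dataclass
class Document:
    metadata: Dict[str, str] = field(default_factory=dict)
    body: List[Paragraph] = field(default_factory=list)


def collect_text(node) -> List[str]:
    """Return every text string found below a node, in document order."""
    if isinstance(node, Text):
        return [node.content]
    if isinstance(node, Graphics):
        return []
    if isinstance(node, Paragraph):
        children = node.items
    elif isinstance(node, Cell):
        children = node.paragraphs
    elif isinstance(node, Table):
        children = node.cells
    elif isinstance(node, Document):
        children = node.body
    else:
        raise TypeError(f"unsupported node: {type(node).__name__}")
    return [text for child in children for text in collect_text(child)]


def build_header() -> Document:
    """Build a reduced header: one table with a logo cell and a title cell."""
    logo = Paragraph([Graphics("images_sheet_student/brasaoUSP.gif")])
    university = Paragraph([Text("University of São Paulo")])
    title = Paragraph([Text("STUDENT SHEET", Style(size=14, horizontal_pos="CENTER"))])
    table = Table(rows=1, columns=2, cells=[Cell([logo]), Cell([university, title])])
    return Document(metadata={"author": "Erika Guetti Suca"}, body=[Paragraph([table])])


if __name__ == "__main__":
    print(collect_text(build_header()))
```

Because the layout is table-based, even a short header becomes a nested tree, which is consistent with the observation that many instances are needed to represent one document.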
6.2 Context Ontology
A business context is a scope in which a specialized vocabulary is employed, so the business context defines the type of information in a document. The context is used to organize and analyze the requirements and rules for presenting information. For example, a student sheet, a medical record and a rental contract represent different scenarios. As a proof of concept, this work takes the context of a student sheet. The main objective of a student sheet is to provide academic information, i.e., grades, attendance, subjects, advisor, date of birth, date of admission, etc. The context ontology was created based on that context and is shown in Figure 4.
6.3 Generating Documents
Documents are generated from the combination of the format ontology and the context ontology. Figure 1 summarizes the document generation process. The number of instances needed to represent a document is usually large. For example, Figure 5 shows a tree of format ontology instances that characterizes the text shown in Figure 6. Figure 7 shows the interaction between an instance of the Text concept of the format ontology and an instance of the Institute concept of the context ontology.
[Figure 5 diagram: instance tree rooted at the Student Sheet document, with its metadata and body; paragraph instances par1_TblLogoHeader, par2_TblPersonalData, par3_space and par4_TblAcademicData; table, cell, text, image, style and font instances; connected by the properties contains, has, isComposedOf and displayContent. Literal values include "System Administrative Graduate", "University of São Paulo", "Institute of Mathematics and Statistics" and "STUDENT SHEET".]
Figure 5: Tree of format ontology instances corresponding to the header of the student sheet.
Figure 6: Sample header of the student sheet.
A document is composed of paragraphs, and instances of the Text concept are always inside a paragraph. The string in Figure 7, #institute_EN_#nameInstitution, is a reference to an instance of Institute and one of its attributes, separated by the symbol #. The instance name is institute_EN and the referenced attribute is nameInstitution. The Text concept renders in the document the value of the attribute nameInstitution of the instance institute_EN, which in the example is University of São Paulo.
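The reference mechanism can be sketched as a small resolver. This is a hedged illustration rather than the implementation used in the experiment, which was written in Java on top of OWL: the pattern "#<instance>_#<attribute>" is inferred from the single example in the text, and the dictionary standing in for the context ontology instances, along with the function name `resolve`, is hypothetical.

```python
import re

# Assumed reference shape, inferred from the one example in the text:
# "#<instance>_#<attribute>", e.g. "#institute_EN_#nameInstitution".
REFERENCE = re.compile(r"^#(?P<instance>.+)_#(?P<attribute>[^#]+)$")

# Hypothetical stand-in for context ontology instances, keyed by instance name.
CONTEXT_INSTANCES = {
    "institute_EN": {"nameInstitution": "University of São Paulo"},
}


def resolve(content: str, instances: dict) -> str:
    """Return the referenced attribute value, or the content itself if it is literal."""
    match = REFERENCE.match(content)
    if match is None:
        return content  # plain text, nothing to look up
    instance = instances[match.group("instance")]  # KeyError if the instance is unknown
    return instance[match.group("attribute")]      # KeyError if the attribute is unknown


if __name__ == "__main__":
    print(resolve("#institute_EN_#nameInstitution", CONTEXT_INSTANCES))
    print(resolve("STUDENT SHEET", CONTEXT_INSTANCES))
```

Keeping only the reference in the format instance is what lets the same layout be filled from different context instances, for example an institute described in another language.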
Instances of the format ontology refer to instances of the context ontology. The translators receive instances of the format and context ontologies and map them to objects that represent components of the docx and odt formats. Finally, based on these docx and odt representations, a document is rendered in the appropriate format.
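The role of the translators can be sketched as one function per target format, all consuming the same generic representation. The sketch below is an assumption-laden illustration in Python, not the Java translators built with ODFDOM and Apache POI: the generic representation is reduced to a list of paragraph strings, the function and registry names are hypothetical, and the output is only the XML body fragment, whereas real .docx and .odt files are ZIP packages with additional parts.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

ET.register_namespace("w", W_NS)
ET.register_namespace("office", OFFICE_NS)
ET.register_namespace("text", TEXT_NS)


def to_docx_body(paragraphs: List[str]) -> str:
    """Map generic paragraphs to a WordprocessingML <w:body> fragment."""
    body = ET.Element(f"{{{W_NS}}}body")
    for content in paragraphs:
        paragraph = ET.SubElement(body, f"{{{W_NS}}}p")
        run = ET.SubElement(paragraph, f"{{{W_NS}}}r")
        ET.SubElement(run, f"{{{W_NS}}}t").text = content
    return ET.tostring(body, encoding="unicode")


def to_odt_body(paragraphs: List[str]) -> str:
    """Map generic paragraphs to an ODF <office:text> fragment."""
    body = ET.Element(f"{{{OFFICE_NS}}}text")
    for content in paragraphs:
        ET.SubElement(body, f"{{{TEXT_NS}}}p").text = content
    return ET.tostring(body, encoding="unicode")


# One translator per physical format; adding a format means adding one entry.
TRANSLATORS: Dict[str, Callable[[List[str]], str]] = {
    "docx": to_docx_body,
    "odt": to_odt_body,
}


def render(paragraphs: List[str], target: str) -> str:
    """Dispatch the generic representation to the translator for the target format."""
    try:
        translator = TRANSLATORS[target]
    except KeyError:
        raise ValueError(f"no translator registered for {target!r}") from None
    return translator(paragraphs)


if __name__ == "__main__":
    header = ["University of São Paulo", "STUDENT SHEET"]
    print(render(header, "docx"))
    print(render(header, "odt"))
```

The generic representation never changes when a format is added or a translator is rewritten, which is the property the centralized model is meant to provide.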
7 RESULTS
This study tested a small subset of what is needed for multiple interoperable implementations, namely basic word-processing features. It does not attempt to test extremely complex elements, only elements that are routinely used.
Figure 7: OWL code of an instance of the Text concept of the format ontology and an instance of the Institute concept of the context ontology.
The experiment was implemented in Java using OWL API³ 3.2, OpenDocument ODFDOM⁴ 0.8.7 and Apache POI⁵ 3.8. The documents generated in ODT were tested using OpenOffice.org 3, and the OOXML documents were tested using Microsoft Office 2010. The ontologies were created using Protégé⁶ 4.2.0.
The main difficulty in the experiment was maintaining fidelity to the aesthetic characteristics of the document. We did not consider features that are incompatible between the OOXML and ODF standards. It is not always possible to achieve 100% translatability from DOCX documents to ODT and vice versa, owing to the unique characteristics of each standard (Eckert et al., 2009).
The objective was to enable documents to be shared while keeping the integrity of their information, i.e., to achieve document interoperability while allowing simple portability.
The format ontology and the context ontology overcome the problem of preserving digital documents by eliminating the dependence on particular technologies and enabling the central storage of documents. The context ontology can be reused in other applications and takes full advantage of the expressive power of ontologies. The format ontology can generate documents regardless of the context ontology. Figures 8 and 9 show a student sheet represented in our model and rebuilt as a document in each format. Although 100% of the stylistic features could not be reproduced identically in both formats, the integrity of the student information was not compromised. A minimal loss of, or difference in, document style may be tolerable, but a loss of student information is not. The two student sheets continue to serve their purpose of reporting the student's data and performance. The documents maintained their authenticity.
Figure 8: Version generated for DOCX format.
Figure 9: Version generated for ODT format.
³ http://owlapi.sourceforge.net/
⁴ http://incubator.apache.org/odftoolkit/odfdom/index.html
⁵ http://poi.apache.org/
⁶ http://protege.stanford.edu/
8 CONCLUSIONS
This work has shown that ontologies and simple mappings provide a good foundation for the creation of digital documents, allowing document interoperability and simple portability. In addition, because documents are based on centralized representations, it is possible to change one physical interface without changing all the other physical interfaces mapped to the same model. Moreover, new physical interfaces can be added, perhaps for new display or output devices, without changing the conceptual model. If technologies change, the translators will also change, but the underlying conceptual models will not.
Using the same underlying conceptual model, multiple publications and formats can be created, and different document assemblies can share common structures or patterns. The idea is to reuse standard schema components wherever possible. We can easily imagine applications that reuse documents and processes in ways not anticipated by their creators.
For future work, we will improve the implementation of the translators and include other formats, such as PDF and HTML. A future application might be the generation of forms filled in with predetermined information. To obtain a complete document interoperability flow, we will need to create instances of the format ontology by reading existing documents. This implies creating a tool that helps to automate the creation and reading of instances. We would then have a complete document interoperability flow, in which the information represented in our conceptual model could be updated with information from documents.
Finally, we will also develop a document interoperability use case applied to electronic government, where the format ontology and the context ontology should play an important role in preserving and distributing digital documents efficiently.
REFERENCES
Barron, D. W. (1996). Portable documents: problems and partial solutions. Department of Electronics and Computer Science, University of Southampton, 8:343–367.
Bosak, J., McGrath, T., and Holman, G. K. (2011). Universal Business Language v2.1. Organization for the Advancement of Structured Information Standards (OASIS), Standard.
Bretas, N. L. and do Socorro Ferreira Mesquita, C. (2010). Panorama da Interoperabilidade no Brasil. Ministério do Planejamento, Orçamento e Gestão.
Eckert, K.-P., Ziesing, J., and Ishionwu, U. (2009). Document Interoperability: Open Document Format and Office Open XML. Fraunhofer Verlag, Germany, FOKUS edition.
Ferreira, M. (2006). Introdução à preservação digital: Conceitos, estratégias e actuais consensos. Escola de Engenharia da Universidade do Minho, electronic edition.
Glushko, R. J. and McGrath, T. (2008). Document Engineering: Analyzing and Designing Documents for Business Informatics and Web Services. Massachusetts Institute of Technology.
Gruber, T. R. (1995). Toward principles for the design of ontologies used for knowledge sharing. Int. J. Hum.-Comput. Stud., 43(5-6):907–928.
Schmidt, K.-U., Fox, O., Henckel, L., Holzmann-Kaiser, U., Martin, P., and Tschichholz, M. (2006). Document Interoperability for Use in eGovernment: Integration of XML-based Document Content in Public Administration Processes. FOKUS.
Schroeder, R. and Mello, R. D. S. (2009). Designing XML documents from conceptual schemas and workload information. Multimedia Tools Appl., 43(3):303–326.
Shah, R. and Kesan, J. (2009). Interoperability challenge for open standards: ODF and OOXML as examples. Proceedings of the 10th International Digital Government Research Conference.
Uschold, M. (1998). Knowledge level modelling: concepts and terminology. The Knowledge Engineering Review, 13(1):5–29.
Vogel, T., Schmidt, A., Lemm, A., and Österle, H. (2008). Service and document based interoperability for European eCustoms solutions. J. Theor. Appl. Electron. Commer. Res., 3(3):17–37.
KMIS2013-InternationalConferenceonKnowledgeManagementandInformationSharing
380