CHANNELS TO THE FUTURE

Gábor Magyar

Dept. of Telecommunications and Media Informatics, Budapest University of Technology and Economics

Budapest, Hungary

Keywords: Archiving documents, originality, metadata, Dublin Core, semantics.

Abstract: The long-term archiving of digital documents is a very challenging task, because of policy, legal,

intellectual property rights, metadata, semantic support and other issues. This paper merges technical and

sociotechnical approaches. As more research disciplines and societal sectors have come to rely on data-

driven models and observational data, the archiving problem is growing, the shortcomings of current

technologies have become apparent and the need to preserve historical material has become imperative. The

variety and complexity of digital documents as information technology objects bring up a basic question:

is it necessary to preserve the variety and complexity of the original objects? Our answer in general is

’no’: essential attributes of a document are preserved when the document is transformed to different

platforms. There are many reasons to change the format of a document. We use the categories of physical,

logical, and conceptual layers in order to define generic properties that are true of all digital documents.

This approach gives an overall framework for general preserving strategy managing technical obsolescence

and semantic mutations.

1 DIGITAL DOCUMENTS

There is no single conventional definition of the

category “digital document” (Buckland, 1998).

In the “PC world” this term was originally used

for a file created with a word processor, i.e. the term

referred to a textual object. This conception has

totally changed: documents can contain graphics,

charts, and other objects (images, embedded sounds,

animations, videos). A digital document can have an

exactly described structure of its elements – this

structure itself can have extra meaning. A word

processing application can produce graphics, tables,

hyperlinks, XML output and a graphics application

can produce words. This trend has accelerated with

technologies that allow an application to combine

many components. What’s more, many forms of

digital information cannot be expressed in traditional

hard-copy or analog media; for example, interactive

Web pages, geographic information systems, and

virtual reality models. Consequently, the term

’document’ is used more and more to describe any

file produced by an application. There is no single

definition or model of a digital document that would

be valid in all cases. Information technologists

model digital documents in very different ways: a

digital document can be a sequence of expressions in

natural language characters or a sequence of scanned

page images, a directed graph whose nodes are

pages (as on the Web), and so on. How

documents are managed, and therefore how they are

preserved, depends on the model that is applied.

Unlike a paper record where the symbols which

encode the information are directly accessible to a

reader, digital data is stored as a series of bits on a

storage medium in a form which is not intelligible

(and frequently not even visible) to the human being. So a

machine is required to retrieve the binary patterns

from the device, and then a program is required to

interpret the encoding format of the bit stream and

render it in a human-intelligible form. The

information is interpreted as the result of the action

of these human made hardware/software tools on the

data. The precise form in which the information is

made available to a user depends as much on the

action of these technological intermediaries as it

does on the data on which they operate.
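The dependence of the rendered information on the interpreting software can be illustrated with a minimal sketch. It assumes a short byte sequence written as UTF-8 text and then read back under two different encoding rules; the sample string and the choice of encodings are illustrative assumptions, not part of the model.

```python
# The same stored bytes, interpreted under two different encoding rules.
stored_bytes = "Gábor".encode("utf-8")       # what is inscribed on the medium

as_utf8 = stored_bytes.decode("utf-8")       # the rule used when writing
as_latin1 = stored_bytes.decode("latin-1")   # a different, plausible guess

print(as_utf8)               # the intended text
print(as_latin1)             # same bits, different information for the reader
print(as_utf8 == as_latin1)  # the two renditions disagree
```

The bit stream is identical in both cases; only the interpreting rule differs, which is why the encoding rules have to be preserved together with the data.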

Two different software tools may render the

same data in ways that give the user different views

of that data as information. This may be true even

for the same tool running in different environments

or with different parameters. Typical word

Magyar G. (2007).

CHANNELS TO THE FUTURE.

In Proceedings of the Ninth International Conference on Enterprise Information Systems, pages 369-374

 SciTePress

processing programmes incorporate multiple ‘view’

options which present quite different screen

renditions of the same data. The user’s experience of

information becomes a complex product of the base

data itself and the processes performed on that data.

An extreme but simple example is the difference

between the views of the same textual content made

by the same word processing application but using

different file format outputs (e.g. Rich Text Format

and simple text).

This process of mediation also means that the

user does not know the physical structure of a digital

record (the way in which digital units of a

representation are physically arranged on the storage

medium). This will certainly change when a single

representation is transferred from one medium to

another, sometimes even on transfer from one

instance to another of the same medium.

2 ORIGINALITY OF A DOCUMENT

What is meant by ‘original’ in the space of digital

documents?

Almost any ‘use’ of a selected representation of a

digital record involves the making of a copy in some

way. If a user accesses a representation which is

held on a networked server, a copy of that

representation is transmitted to their client computer,

and indeed multiple users could perform the same

action simultaneously on that same representational

form. Copies in the digital world can have exactly

the same properties as their source. (The problems of

versions and modifications are outside the scope of this paper.

Here, different versions are identified as

different documents. Version management is a

practical issue from this point of view. However, the

three-layer model described in this paper can handle

this issue.)

The variety and complexity of digital documents

as IT objects bring up a basic question: is it

necessary to preserve the variety and complexity of

the original objects?

The answer in general is ’no’: essential attributes of

a document are preserved when the document is

transformed to different platforms. There are many

reasons to change the format of a document,

crossing technological boundaries (e.g. platforms,

operating systems, applications). For example, to

ensure that written documents retain their original

appearance, authors translate them from the word

processing format in which they were created to

Adobe's PDF format.

This paper gives a generic model of properties of

digital documents. We use the categories of

physical, logical, and conceptual layers in order to

define generic properties that are true of all digital

documents.

3 PHYSICAL LAYER

At the physical layer, a digital document is an

inscription of signs on a medium. There are specific

(coding) rules in any technical environment to

determine the matching between a system of signs

and the physical storage of bits. Those conventions

vary with the type of the physical medium, media

types and other factors. The physical layer of the

model deals with physical files. The physical

inscription of bits is independent of the meaning of

the inscribed bits.

It is well known that today’s digital media

solutions are not durable over long periods of time.

Digital storage media degrade relatively

quickly, when compared with the known durability

of paper. This problem can be addressed through

copying digital information to new media. Media

refreshment or migration adds, of course, a new cost

element of digital preservation to the whole life

cycle. However, this does not mean continuous

growth of the total costs, because digital storage

effectiveness, especially recording density, increases

while costs decrease. Repeated copying of digital

data to new media over time reduces per-unit costs.

Based on the traditional rate of growth of storage

densities, media migration yields a net reduction in

operational costs. (Moore, 2000) In this context, the

durability of the medium is only one variable in the

cost equation: the medium needs to be reliable only

for the length of time that it is economically

advantageous to keep the data on it.
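The cost argument can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that holding a fixed collection on each new media generation costs half as much as on the previous one; the function name, the cost units and the halving ratio are hypothetical and are not taken from (Moore, 2000).

```python
def cumulative_storage_cost(initial_cost, cost_ratio, migrations):
    """Cost of keeping one fixed collection across successive media generations.

    initial_cost: cost of holding the collection on the first medium (assumed units)
    cost_ratio:   per-generation cost multiplier; below 1 means storage gets cheaper
    migrations:   number of times the collection is copied to a newer medium
    """
    costs = [initial_cost * cost_ratio ** k for k in range(migrations + 1)]
    return costs, sum(costs)


costs, total = cumulative_storage_cost(initial_cost=1000.0, cost_ratio=0.5, migrations=4)
print(costs)   # cost per generation, falling geometrically
print(total)   # cumulative cost over all generations
print(total < 1000.0 / (1 - 0.5))  # stays below the geometric-series limit
```

Under this assumption the per-generation cost falls geometrically, so the cumulative cost of any number of migrations stays below initial_cost / (1 - cost_ratio). With a ratio of one or more, the same function shows the total growing without bound.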

The physical preservation strategy must also

include a reliable method for maintaining data

integrity in storage and in any change to storage,

including any updating of the storage system,

moving data from inactive storage to a server or

from a server to a client system, moving data

between the elements of a distributed storage

architecture or delivering information to a customer

via the Internet, as well as in any media migration.

Physical preservation is necessary, but not

sufficient in the archiving process.
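One common way to maintain data integrity across a change of storage is to record a checksum before the copy and verify it afterwards. The following is a minimal sketch of such a fixity check using SHA-256 from the Python standard library; the file names and the helper functions are hypothetical, and a production archive would also store the recorded checksum independently of the data.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_with_fixity_check(source, target):
    """Copy source to target and fail loudly if the bit stream changed."""
    recorded = sha256_of(source)
    shutil.copyfile(source, target)
    if sha256_of(target) != recorded:
        raise IOError("fixity check failed: copy differs from source")
    return recorded


# Simulate a migration between two 'media' inside a temporary directory.
workdir = tempfile.mkdtemp()
old_copy = os.path.join(workdir, "old_medium.bin")
new_copy = os.path.join(workdir, "new_medium.bin")
with open(old_copy, "wb") as handle:
    handle.write(b"archived bit stream")

checksum = migrate_with_fixity_check(old_copy, new_copy)
print(checksum == sha256_of(new_copy))
```

The check confirms only that the bits are unchanged. It says nothing about whether they can still be interpreted, which is the concern of the logical layer.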

ICEIS 2007 - International Conference on Enterprise Information Systems

4 LOGICAL LAYER

A digital document can be recognized as a logical

object (or a specific set of logical objects) according

to the logic of certain application software. The

conventions of composing logical objects are

independent of how the data are written on a

physical medium. As we described before, at the

storage level the interpretation of the bits is not

defined. Similarly, at the logical layer the grammar

is independent of physical inscription. Once data are

read into memory, the type of medium and the way

the data were inscribed on the medium are of no

consequence. The rules that apply at the logical layer

determine how information is encoded in bytes and

how different encodings are translated to other

formats; how the input stream is transformed into

the system's memory and output for presentation.

The technologies of storage media, the machine

tools which read those media, the encoding

formats, and the programs which interpret those formats

are permanently changing, driven by scientific

advances, user requirements for improved cost

effectiveness, and the commercial imperative of the

suppliers. A file created using Microsoft Word, and

stored in Word format can only be read by another

version of that same program, or by a program

which is able to decode the Word format. If the file

is transferred to an environment where such a tool is

not available, or if it is not accessed over a period of

time and during that time the program becomes

obsolete, then the information content of that file

becomes inaccessible. This constant change is

certain to continue: there is no evidence to suggest

that a plateau of technological stability will ever be

attained.

A logical object is a unit recognized by some

application software. To preserve digital information

as logical objects, we have to know the requirements

for correct processing of each object's data type and

what software can perform correct processing.
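Knowing an object's data type is the first step towards knowing what software can process it. A minimal sketch of type identification by leading-byte signatures follows; the table covers only four widely documented formats, and the function name and the returned labels are illustrative assumptions. A real archive would rely on a maintained format registry rather than a hand-written table.

```python
# Leading-byte signatures of a few widely documented formats.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"{\\rtf", "Rich Text Format"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify_format(data):
    """Return a format label for a byte string, or 'unknown' if nothing matches."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return "unknown"


print(identify_format(b"%PDF-1.4\n..."))
print(identify_format(b"{\\rtf1\\ansi Hello}"))
print(identify_format(b"plain text without a signature"))
```

An "unknown" result is itself useful preservation information: it marks a logical object for which no processing requirement has been recorded yet.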

5 CONCEPTUAL LAYER

At the conceptual layer we see documents as they are

handled in the real world: documents are meaningful

objects, such as books, reports, proceedings,

photographs, contracts, maps - but in the digital

space a document can be a mix of different media

types.

The properties of the documents at the

conceptual layer are those that are significant in the

real world. A book has author(s), title, etc. A report

has an author, a title, an intended audience, and a

defined subject and scope. A proceedings volume has an editor,

authors, titles, etc. These properties of documents

have meaning to human beings. Data elements of

documents may have structure. Actually all

meaningful textual documents have structure –

words compose sentences, there are paragraphs,

chapters, etc. Sometimes this structure is (at least

partially) pre-defined: a good example is e-mail,

which has a header (including ’to’, ’from’, ’subject’

and other fields) and an (unstructured) message body. Web

pages have links as structural elements. The

information content of a table cannot be recognized

without the structure of this table. An Excel file is a

set of tables, having cross-references between tables.

This structured set of data can be seen as a database.

Some of the properties are tagged and stored as

metadata and metadata elements are organized into

data-schema(s). Advanced word processing

applications have metadata management facilities.

Metadata often serves as an integration platform.

Formalized metadata is on the rise, leading to

significantly better data management and

exploitation capabilities. Metadata will make it

much easier for machines to process data

automatically, and it is exactly this capability that

can drive many other benefits: interoperability, cost-

cutting, better data quality, transparency, better

decision support and new business opportunities.

Metadata standardization is key to interoperability.

Dublin Core Metadata for Resource Discovery

appears to be the most relevant candidate for a

common metadata platform across different media types

(DC, 1998).
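As an illustration, a simple Dublin Core description can be serialized as XML with the Python standard library. The sketch below uses the Dublin Core element set namespace http://purl.org/dc/elements/1.1/ and five of its elements; the wrapper element name "record", the helper function and the sample values are assumptions made for this example, not a prescribed encoding.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build an XML element holding one dc:* child per (name, value) pair."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        element = ET.SubElement(record, "{%s}%s" % (DC_NS, name))
        element.text = value
    return record


record = dublin_core_record({
    "title": "Channels to the Future",
    "creator": "Magyar, Gábor",
    "date": "2007",
    "type": "Text",
    "language": "en",
})
print(ET.tostring(record, encoding="unicode"))
```

The same five properties could describe a scanned image, a PDF or a word processing file of the paper, which is what makes such a record usable across media types.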

Metadata tags will also aid search engines and

content processing intelligence software. They can

carry additional information that can be used as

additional hooks for searching, whether based on

keywords or taxonomy. And the tags can facilitate

better categorization and predictive analysis.

Google’s Froogle, for example, requires catalog

suppliers to tag their content according to the W3C’s

RDF (RDF, 2006).

There are many problematic issues concerning

metadata. First, metadata tends to be scattered and

there are often conflicting approaches for describing

the same things. Second, creating coherent metadata

can be difficult and expensive. And third, it is not

easy to interpret different proprietary metadata

schemes in various application modules. For

metadata to become more cost-effective, it must be

shared and reused – not only within one organization

but within the whole circle of co-operating

institutions.

The content and structure of conceptual

document properties must be contained somehow in

the logical document(s) that represent that document

in digital form. However, the same conceptual

content can be represented in very different digital

encodings, and the conceptual structure may differ

substantially from the structure of the logical

document. The content of a document, for example,

may be encoded digitally as a page image or in a

character-oriented word processing document. There

are different metadata schemas. The conceptual

structure of a report - e.g., title, author, date, and

introduction - may be reflected only in digital codes

indicating differences in presentation features such

as type size or underscoring, or they could be

matched by markup tags that correspond to each of

these elements.

Can we state that one of the possible digital

formats (Microsoft Word, Adobe PDF, WordPerfect,

HTML, a scanned image, etc.) is the true or correct

logical representation of the document? As the

archivist’s ultimate aim is to preserve the document

exactly as it was created, the most basic criterion is

whether the document that is produced when the

digital file is processed by the right software is

identical to the original. In fact, each of these

encodings, when processed by software that

recognizes its data type, will display or print the

document in the format in which it was created. So if

the requirement is to maintain the content, the

structure, and the appearance of the original

document, any of these digital formats is suitable.

Since we have a variety of digital formats that

are equally suitable for preserving the conceptual

object(s), this rule can be extended to more complex

types of documents, including databases and

electronic transactions as well, where the documents

are not necessarily presented to human beings but

are found only at the interface of two computer

applications.

6 THE RICH RELATIONS OF THE THREE LAYERS

The complex nature of a digital document having

distinct physical, logical, and conceptual properties

gives rise to considerations for digital preservation.

To preserve a digital document, the relationships

between layers must be known or knowable. To

retrieve a paper archived as master and

subdocuments, we should know that it is stored in

this way and we must know the identities of all the

logical components. To retrieve a specific

certification of examination results for a student, you

don’t need to know where all of the data for that

student’s educational activities are stored in the

database. You only need to know how to locate the

relevant data, given the logical structure of the

database.

In general: to preserve a digital document, we

must be able to identify and retrieve all its digital

components – including its meaningful structure.

The digital components of a document are the

logical and physical objects that are necessary to

reconstitute the conceptual object. These

components are not necessarily limited to the objects

that contain the contents of a document. Digital

components may contain data necessary for the

structure or presentation of the conceptual object,

like style sheets, form specifications and more

complex ones, like name spaces.
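The relationships between the three layers can be sketched as a small data structure. The class names, fields, roles and file locations below are hypothetical and serve only to illustrate the model: a conceptual object lists its logical components, each logical component lists the physical files that carry it, and a document counts as retrievable only if every component can be resolved.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalFile:
    location: str   # where the bits are inscribed (hypothetical path)


@dataclass
class LogicalObject:
    name: str
    data_type: str  # what the processing software must recognize
    role: str       # "content", "structure" or "presentation"
    files: list = field(default_factory=list)


@dataclass
class ConceptualObject:
    title: str
    components: list = field(default_factory=list)


def unresolved_components(document, available_locations):
    """Return names of logical components whose physical files cannot all be found."""
    return [
        component.name
        for component in document.components
        if not all(f.location in available_locations for f in component.files)
    ]


report = ConceptualObject(
    title="Annual report",
    components=[
        LogicalObject("body", "text/html", "content",
                      [PhysicalFile("store/a/report.html")]),
        LogicalObject("style sheet", "text/css", "presentation",
                      [PhysicalFile("store/b/report.css")]),
    ],
)

# Only the body survived in this hypothetical store.
print(unresolved_components(report, {"store/a/report.html"}))
```

In this example the content is intact but the presentation component is missing, so the conceptual object cannot be reconstituted as it was created.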

To identify and retrieve the digital components,

one must process them correctly. Digital

preservation is not a simple process of preserving

physical objects but one of preserving the ability to

reproduce the objects. The success of digital

preservation can be proved only by re-creating the

document in some form that is appropriate for

human use or for computer system applications.

Recall the first general question of this

paper: what is “original” in the space of digital

documents? Is it necessary to preserve the variety

and complexity of the original objects? In the

context of the three-layer model we should ask: is

it necessary to preserve the physical and logical

components of a digital document and also their

interrelationship, without any alteration?

No. You can change the way a conceptual object

is encoded in logical objects and stored in physical

objects without having any negative impact on its

preservation. For example, in a university’s repository of

staff data, CVs are defined as text files with

embedded image (photo) files. The

photographs of staff members can be stored in

separate image files (ensuring single storage

for multiple applications),

with only links in the CV files to the

appropriate image file. However, the image file

could be embedded in the word processing file

without altering the CV as such. One can also produce

a PDF version of this CV.

At first sight, change and preservation are

opposite categories. On second thought, the

possibility of preserving a digital document while

changing its logical encoding or physical inscription

promises benefits, especially in long-term

preservation. Technology creates the possibilities for

change, but we should determine what changes are

permissible, beneficial, necessary, or harmful.

To make such determinations, we have to

consider the ultimate purpose of preservation. What

is the goal of digital preservation of documents?

For libraries, archives, and other organizations

that are responsible for the preservation of digital documents over

time, the ultimate outputs are authentic preserved

documents. According to the previous parts of this

paper, the output of a preservation process must be

identical, in all essential aspects, to what went into

that process. Identical - in all essential aspects.

The ideal preservation system would be a

communications channel for transmitting

information to the future. This channel should not

corrupt or change the messages transmitted in any

way. The process of preserving digital documents is

essentially different from that of preserving physical

objects such as traditional books on paper. To access

any digital object, we have to retrieve the stored

data, reconstituting, if necessary, the logical

components by extracting or combining the bytes

from physical files, reestablishing any relationships

among logical components, interpreting any

syntactic or presentation marks or codes, and

outputting the object in a form appropriate for use by

a person or a business application. We don’t want to

preserve a digital document as a physical object;

instead we need to ensure the ability to reproduce

the document for future users. The preservation of

an information object in digital form is complete

only when the object is successfully reconstructed.

In fact, the original document is not retrieved, but

„copied”, as it is reproduced by processing the

physical and logical components using software that

recognizes and properly handles the files and data.

Paper degrades, ink fades. (Lorie, 2000) In

general we are not able to assert with complete

assurance that no substitution or alteration of the

object has occurred over time. Authentication of

preserved objects is ultimately a matter of trust.

There are ways to reduce the risk entailed by trusting

someone, but ultimately, you need to trust some

person, some organization, or some system or

method that exercises control over the transmission

of information over space, time, and technological

boundaries.

Can an object change and still remain authentic?

Common sense suggests that something either is or

is not authentic, but authenticity is not absolute.

Authenticity depends on use. (Thibodeau, 2001) The

criteria for authenticity depend on the intended use

of the object.

A document known to be in someone’s

handwriting, but containing text he copied from a

book, does not reveal his thoughts. Conversely, the

final testimonial of a person can be written down by

his secretary. Authenticating something as

someone’s writing depends on how we define that

concept.

There are contexts in which the intended use of

preserved information objects is well-known. For

example, many corporations preserve records for

long times for taxation purposes. It is a clear case:

we know the exact aim of the preservation and the

intended use as well. Libraries and public archives,

however, usually cannot prescribe or predict the

future use of their collection. Such institutions

generally maintain their collections for access by

anyone, for whatever reason. Since users and their

behaviors are not known in advance, you must

assume that any valid intended use must be

somehow consonant with the original nature and use

of the document. Anyway, given that a digital

document is not something that is preserved as an

inscription on a physical medium, but something

that can only be constructed or reconstructed by

using software to process stored inscriptions, it is

necessary to have an explicit model that is

independent of the stored object and that provides a

criterion, or at least a benchmark, for assessing the

authenticity of the reconstructed object.

7 CHANNELS TO THE FUTURE

A preservation system will act as a communications

channel for transmitting information to the future

only if it systematically supports the preservation of

the original context of the document(s). That is why

you should manage the semantics of the document,

which can be done in the model described above – at

the Conceptual Layer.

We use metadata for contextual description.

(Magyar, 2004) The contextual information serves to

provide a more complete understanding of the

document(s). The most important method for

contextual description is taxonomy. Taxonomy is a

classification of information components and their

interrelationships that supports the discovery of and

access to information. Metadata and taxonomies can

work together to identify information and its

features, and then organize it for access, navigation

and retrieval.

Terms and taxonomies are subjective.

Different users and different applications may use a

variety of terms for the same concept. Humans can

also easily relate the meanings of two terms as

"similar", "the same", "more general" or "more

specific". (Magyar, 2005) Computers can handle

only strings, not concepts. The semantics of a

domain model is machine-usable only if it is

expressed using an agreed-upon vocabulary and

syntax. The heterogeneity of the information

environment also relates to the coexistence of

structured, semi-structured and unstructured

information. (There are, of course, essential

differences between these categories in terms of

their conceptual schemas and their inherent

machine- and human-understandable representations.)

(Magyar, 2004)

In the case of structured databases, thesauri add

lexical relations to the semantic

relationships already defined by the schema. In the

class of semi-structured documents, in addition to

their classical role for assigning and searching

metadata, thesauri can be used for searching and

retrieving free text or integrated into data mining,

knowledge extraction, or other related applications.

(NISO, 1993) Thesauri can be used for assigning

metadata to different kinds of materials, such as

images, videos, sounds, etc. in rich multimedia

documents. Thesauri can serve for linking all

different types of information and their

representations in the domain of digital documents.

(Kosovac, 1998) Thesaurus services can be used to

complement the modeling technology by extending

mechanisms for achieving semantic interoperability

to the level of human-understandable information

representations.
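How a thesaurus lets software relate strings that denote the same or related concepts can be sketched with two small lookup tables. All terms, the table layout and the function names below are illustrative assumptions. A thesaurus built along the lines of (NISO, 1993) would carry many more relation types and far larger vocabularies.

```python
# A toy thesaurus: every entry term maps to a preferred term, and a preferred
# term may point to one broader term. All terms are illustrative.
PREFERRED = {
    "e-mail": "electronic mail",
    "email": "electronic mail",
    "electronic mail": "electronic mail",
    "letter": "correspondence",
    "correspondence": "correspondence",
}
BROADER = {"electronic mail": "correspondence"}


def normalize(term):
    """Map a user's term to its preferred term, or None if it is not listed."""
    return PREFERRED.get(term.strip().lower())


def relate(term_a, term_b):
    """Describe how the concept behind term_a relates to the one behind term_b."""
    a, b = normalize(term_a), normalize(term_b)
    if a is None or b is None:
        return "unknown"
    if a == b:
        return "the same"
    if BROADER.get(a) == b:
        return "more specific"
    if BROADER.get(b) == a:
        return "more general"
    return "unrelated"


print(relate("Email", "e-mail"))
print(relate("email", "letter"))
print(relate("letter", "email"))
```

The strings "Email" and "e-mail" differ, yet the lookup resolves both to one preferred term, which is the step from handling strings to handling concepts.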

8 CONCLUSION

A comprehensive model of long-term archiving of

digital documents was presented. The model is

applicable to all kinds of rich content multimedia

documents. The importance of preserving the semantics

of documents was emphasized.

Semantic approaches try to construct and use

formalized imprints of human conceptual

interpretation results. Semantic tools tend to be part

of the so-called content-infrastructure. Using

such semantic resources, conceptual objects could

be interpreted not only by human beings but perhaps

by software agents as well.

As a final conclusion, for future long-term

archival systems we must assume that semantic

interoperability needs integrated solutions at all

three layers of the model described in this paper.

REFERENCES

Buckland, 1998; Michael Buckland: What is a "digital

document"? Document Numérique (Paris) 2, no. 2

(1998): 221-230

DC, 1998; Dublin Core Metadata for Resource Discovery.

IETF #2413 / Weibel, S.; Kunze, J.; Lagoze, C.; Wolf,

M. The Internet Society, September

1998. (http://purl.org/DC/index.htm)

Lorie, 2000; Lorie, Raymond A. 2000. The Long-Term

Preservation of Digital Information.

http://www.si.umich.edu/CAMILEON/Emulation%20papers%20and%20publications/Lorie.pdf.

Thibodeau, 2001; Kenneth Thibodeau: Building the

Archives of the Future. D-Lib Magazine. February

2001. Volume 7 Number 2. ISSN 1082-9873

Kosovac, 1998; Kosovac, B (1998). Internet/Intranet and

Thesauri, Canadian Institute for Scientific and

Technical Information, Internal Report, National

Research Council Canada, Ottawa, Canada.

<http://www.nrc.ca/irc/thesaurus/roofing/report_b.htm>

Magyar, 2001; Magyar, G., Szakadát: Metadata System of

National Audiovisual Archive in Hungary. Invited

Paper. 20th Conference of the Audio Engineering

Society: Archiving: Restoration and New Methods of

Recording. Budapest, 5-7 October 2001. Mira Digital

Publishing Inc., 2001

Magyar, 2004; Tikk D., Kardkovacs Z, and G. Magyar,

“The Hungarian deep web searcher project,”

International Journal on Information Technology, vol. I,

pp. 191–197, Dec. 2004.

Magyar, 2005; Tikk D., Szidaroszky F. P., Kardkovács

Zs., Magyar G.: Entity Recognizer in Hungarian

Question Processing. In: Lecture Notes in Computer

Science, 2005, Publisher: Springer-Verlag GmbH,

ISSN: 0302-9743

Moore, 2000; R. Moore et al. Collection-Based Persistent

Digital Archives, D-Lib Magazine, March 2000,

Volume 6 Number 3 [Part 1]

<http://www.dlib.org/dlib/april00/moore/04moore-pt2.html>.

NISO, 1993; National Information Standards Organization

(1993). ANSI/NISO Z.39.19-1993. Guidelines for the

Construction, Format, and Management of

Monolingual Thesauri, Bethesda, MD: NISO Press.

RDF, 2006; Resource Description Framework (RDF) /

W3C Semantic Web Activity.

http://www.w3.org/RDF
