up till now, related work in the literature does not deliver
something comparable for data formats in general, as
we will see later in Section 2.
Moreover, the initial factors of a translation often
do not remain constant. Data formats are revised, get
extended or have to be adapted to be interoperable
with deviating implementations. Instead of just reading
data, we may need to write it as well. We may
have to port parts of an implementation to a J2ME mobile
phone or an embedded system. We may even need to change
the programming language for better integration with
other components. With every change, we again have
to heavily adapt the previous source code or even
retranslate it by manual labour.
Automating the translation of data format
knowledge into source code thus becomes attractive.
For that purpose, we need explicit data format knowledge
to be present in a machine-processable form.
Yet, no suitable model of data formats exists in the
literature, so for data format knowledge, we are currently
limited to human processing of semi-formal, textual
specifications or source code of other implementations.
1.2 Restating the Problem
The intention of a data format is to provide a lossless
transport representation of information for the pur-
pose of storage and transmission for each of its data
format instances. A data format instance is a bijec-
tive mapping between structured, semantic literals as
information and a finite sequence of bits, or bitstream,
as transport representation, with an example shown
in Figure 1. A data format is a finite definition of a
possibly infinite set of data format instances, which is
analogous in definition to that of a formal language
(Mateescu and Salomaa, 1997).
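To make the notion of a bijective mapping concrete, the following minimal sketch encodes two literals into a bitstream and recovers them again. The toy layout (a 16-bit big-endian width, followed by an 8-bit length and an ASCII name) and the names encode, decode and to_bits are hypothetical and chosen for illustration only; they are not taken from any real data format or from the model presented in this paper.

```python
import struct

# Hypothetical toy format (an assumption for illustration only):
# 16-bit big-endian width, 8-bit name length, then the ASCII name.

def encode(width: int, name: str) -> bytes:
    """Map structured literals to their transport representation."""
    raw = name.encode("ascii")
    if not 0 <= width <= 0xFFFF or len(raw) > 0xFF:
        raise ValueError("literal not representable in this toy format")
    return struct.pack(">HB", width, len(raw)) + raw

def decode(stream: bytes) -> tuple[int, str]:
    """Inverse mapping: recover the literals from the bitstream."""
    if len(stream) < 3:
        raise ValueError("truncated header")
    width, length = struct.unpack(">HB", stream[:3])
    if len(stream) != 3 + length:
        raise ValueError("length field does not match payload")
    return width, stream[3:].decode("ascii")

def to_bits(stream: bytes) -> str:
    """Render the byte sequence as a finite sequence of bits."""
    return "".join(f"{byte:08b}" for byte in stream)

literals = (640, "img")
stream = encode(*literals)

# Both directions round-trip, so this one instance is lossless.
assert decode(stream) == literals
assert encode(*decode(stream)) == stream
print(to_bits(stream))
```

Within this toy layout, every representable pair of literals has exactly one bitstream and every valid bitstream has exactly one pair of literals, which is the property the definition above requires of a data format instance.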
Existing research in the literature does not provide
a complete, generally applicable and machine-processable
model for describing arbitrary data formats
such as those found in practice. Just as a data format
is composed of its instances, we need to be able
to describe these instances before we can describe a
data format as a whole. Since such a model of data
format instances is a necessary prerequisite that does
not exist either, it is the main subject of this paper.
1.3 Implications
Due to the lack of suitable models, common engineering
tasks with respect to data formats basically remain
manual processes carried out by human engineers, who make
mistakes. This introduces follow-up problems:
• Defining or reverse-engineering a data format
has no standard representation suited for automated
documentation and exchange. Ensuring
correctness, completeness, consistency and unambiguity
of semi-formal, textual specifications
is entirely in the hands of human judgement and
therefore hard to guarantee.
• Designing and implementing format-compliant
components for typical purposes such as parsing,
in-memory representation and serialization for a
specific environment is a complex task in itself.
Since the complexity of a data format is usually
present in its design and implementation as
well, this manual task becomes even more complex
and therefore error-prone. Until discovered
and patched, errors in the implementation can
lead to security issues such as buffer overflows or
break interoperability with other implementations,
which is often hard to trace back to its cause. Moreover, the
resulting implementation is bound to its intended
data format, environment and purpose, where any
non-trivial change to any of these requires sub-
stantial adaptation or even redevelopment. It can
be assumed that the arising development cost lim-
its the diversity of existing, reusable implementa-
tions.
• As long as access to and navigation of data in
a specific data format directly depend upon the
existence of suitable format-compliant implementations
that have to be developed manually, data
remains tightly coupled with these implementations.
This is a hard problem for Digital Preservation
efforts of libraries, as the obsolescence of
applications over time threatens a large body of
digitally born data (Ross and Hedstrom, 2005) on
an individual, corporate and national scale (Wet-
tengel, 1998), much of which comprises our digi-
tal cultural heritage.
1.4 Contribution
In this paper, we present Bitstream Segment Graphs
(BSG) as a complete, generally applicable and
machine-processable model of data format instances,
serving as a stepping stone towards a later corresponding
model of data formats as a whole.
We start by taking a look at related work in the
literature (Section 2) and develop a more distinct
notion of completeness (Section 3). Based on that
notion, the formalism of Bitstream Segment Graphs is
defined (Section 4.1) and an algorithm for its composition
is given, together with a visual representation
(Sections 4.2 and 4.3). Using our model, we
finally present a practical example (Section 5) that ex-