scheme both make use of the ordering of un-encoded
numbers.
When conditions (1)-(3) are met, one has a valid
encoding function. However, the encoding could be
inefficient. Consider the trivial encoding of
$[0..2^{8n}]$, which uses exactly n-byte strings to
represent the number. The string is composed of all
the bytes starting from the most significant byte,
even if they are 0 bytes. This is called identical
encoding. It satisfies the three required conditions,
but it is not efficient in terms of space. With n=2,
large databases are not covered. With n=8, there is
a great deal of wasted space on long object IDs in
small databases. For example, ID=0x21 will be
encoded as 0000000000000021. We cannot
truncate leading zero bytes since the resulting
encoding would not satisfy the required conditions.
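
The trade-off can be made concrete with a minimal Python sketch of
identical encoding. Since conditions (1)-(3) are stated earlier, the
sketch assumes that one of them requires bytewise comparison of
encoded strings to agree with the numeric order of the IDs. The
function names are illustrative and not part of the engine.

```python
def identical_encode(x: int, n: int) -> bytes:
    """Fixed-width big-endian encoding: always exactly n bytes."""
    return x.to_bytes(n, "big")  # raises OverflowError if x >= 2**(8*n)


def truncated_encode(x: int, n: int) -> bytes:
    """Identical encoding with leading zero bytes stripped (for contrast)."""
    return identical_encode(x, n).lstrip(b"\x00") or b"\x00"


small, large = 0x21, 0x0100

# n = 8 spends seven zero bytes on a small ID.
padded = identical_encode(small, 8).hex()  # "0000000000000021"

# Fixed width keeps bytewise order equal to numeric order.
order_kept = identical_encode(small, 8) < identical_encode(large, 8)

# Truncation breaks it: b"\x21" sorts after b"\x01\x00".
order_broken = truncated_encode(small, 8) > truncated_encode(large, 8)

# n = 2 cannot represent IDs of 65536 or more.
try:
    identical_encode(70_000, 2)
    fits_in_two_bytes = True
except OverflowError:
    fits_in_two_bytes = False
```

Under the stated assumption, the truncated variant is exactly the
case the text rules out: the smaller ID compares as the larger string.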
The choice of encoding function along with the
object ID size plays a major role in achieving a
successful semantic binary engine design. It is also
important to determine the space requirements for
object IDs and the way object IDs are generated and
disposed of. Existing ID encoding conventions
might be utilized. One example is Globally Unique
Identifiers (GUIDs), where each object ID is a
16-byte integer; an algorithm is available for
generating GUIDs, as is a procedure for discarding
them when objects are disposed of.
2.2 Efficiency of Encoding
The following properties are desirable features of an
efficient object ID system for the semantic binary
engine.
(I) Encoded object IDs should be short for small
numbers. For a database with several thousand
objects, a maximum object ID that is 8 bytes long
would be considered a waste of space. More
importantly, even large databases usually have
relatively small schemas. A category ID or relation
ID is present in every fact in the database. Therefore,
if all relations and categories in the schema have
short encoded object IDs, the overhead for storing
them would be low.
(II) The space of object IDs should cover large
datasets. It is possible to design a system with
unlimited-size object IDs, but the system would be
excessively complex and inefficient. Given a limit
on the length of encoded strings, the space of object
IDs should be as large as possible. If the length of
encoded strings is limited to n bytes, the largest
space that object IDs can cover is $2^{8n}$ objects,
achieved by identical encoding.
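
Properties (I) and (II) pull in opposite directions for identical
encoding, since a larger n enlarges the space but lengthens every ID.
As a hedged illustration only, the sketch below uses a hypothetical
length-prefixed encoding (one length byte followed by the minimal
big-endian bytes of the number). It is not claimed to be the encoding
adopted by the engine; the name `length_prefixed_encode` and the limit
of 7 body bytes are assumptions made for this example.

```python
def length_prefixed_encode(x: int, max_body: int = 7) -> bytes:
    """Hypothetical encoding: 1 length byte + minimal big-endian body."""
    if x < 0:
        raise ValueError("object IDs are non-negative")
    body_len = (x.bit_length() + 7) // 8  # 0 bytes for x == 0
    if body_len > max_body:
        raise OverflowError("ID does not fit in the allowed length")
    return bytes([body_len]) + x.to_bytes(body_len, "big")


samples = [0, 1, 0x21, 0xFF, 0x100, 70_000, 2**32, 2**56 - 1]
encoded = [length_prefixed_encode(x) for x in samples]

# Small IDs get short strings; the largest ID uses all 8 bytes.
lengths = [len(e) for e in encoded]

# Bytewise order of the strings matches numeric order of the samples:
# a longer body means a larger length byte, and equal-length bodies
# compare as big-endian numbers.
order_kept = encoded == sorted(encoded)
```

With strings of at most 8 bytes, this sketch covers $2^{56}$ IDs
rather than the $2^{64}$ of identical encoding, in exchange for
2-byte encodings of all non-zero IDs below 256.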
(III) The encoding ($\varphi$) and decoding ($\varphi^{-1}$)
algorithms should not be computationally complex.
Though it is hard to beat the speed of identical
encoding, which could likely be implemented as a
single CPU instruction, the encoding algorithm
should come as close as possible.
(IV) $\varepsilon^{-1}(\varphi(\cdot))$ should preserve ordering, and
$\varepsilon^{-1}(\varphi([0..N]))$ should be dense, preferably
containing only a few interval holes. This is similar
to treating encoded strings as numbers without
decoding. As an illustration, consider the following
approach. While the encoding/decoding function is
conventionally run every time an object ID is stored
or retrieved from a fact, it is possible to avoid this.
Most of the engine operations can be performed on
encoded IDs. One can apply $\varepsilon^{-1}$ to encoded object
IDs, making numbers out of them without decoding;
more precisely, $\varepsilon^{-1} \circ \varphi : [0..N] \to [0..M]$.
The function $\varepsilon^{-1}$ is fast and, by (IV), the ordering is
preserved. Applying the fast encoding $\varepsilon$ to the
resulting numbers recovers the original coded strings.
If property (IV) is satisfied, we can use
$\varepsilon^{-1}(\varphi([0..N])) = \Phi \subset [0..M]$ inside the engine as
object IDs. This would save time by eliminating
encoding and decoding in most operations.
However, in this case, it is desirable that this set be
dense in $[0..M]$, since the database engine uses
bitmaps for efficient indexes on attributes that have
only a few possible values, such as Boolean attributes
or the flag of belonging to a category. If $\Phi$ is not
dense enough, space in a bitmap is wasted on
nonexistent object IDs.
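
Property (IV) and the density requirement can be illustrated with a
small Python sketch. It assumes that $\varepsilon$ is the fixed-width
big-endian (identical) encoding of $[0..M]$, and it uses a
deliberately artificial 2-byte $\varphi$ whose low byte only takes the
values 0..199, so that the image $\Phi$ has visible interval holes.
Neither function is taken from the engine.

```python
def phi(x: int) -> bytes:
    """Toy 2-byte encoding (hypothetical): low byte limited to 0..199."""
    if not 0 <= x < 200 * 256:
        raise ValueError("ID outside the toy encoding's range")
    return bytes([x // 200, x % 200])


def eps_inv(s: bytes) -> int:
    """Read an encoded string as a number without decoding it."""
    return int.from_bytes(s, "big")


def eps(m: int, width: int = 2) -> bytes:
    """Fast fixed-width encoding, the inverse of eps_inv."""
    return m.to_bytes(width, "big")


N = 999
internal = [eps_inv(phi(x)) for x in range(N + 1)]  # the set Phi, in order

# Ordering is preserved, so comparisons can be done on internal numbers.
order_kept = all(a < b for a, b in zip(internal, internal[1:]))

# Applying eps to an internal number recovers the original coded string.
round_trip = all(eps(m) == phi(x) for x, m in enumerate(internal))

# Density of Phi in [0..M]: a bitmap over [0..M] has M + 1 bits,
# but only N + 1 of them can ever be set.
M = internal[-1]
density = (N + 1) / (M + 1)
wasted_bits = (M + 1) - (N + 1)
```

Here 1000 IDs occupy a range of 1224 numbers, so 224 bitmap positions
(four holes of 56 numbers each) can never correspond to an object.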
In fact, $\Phi$ might have structure rather than just
being dense. As an illustration, consider the
compression of bitmaps used in the engine. Bitmaps
are broken into blocks; if a whole block contains
only zeroes, it is compressed out. This is a simple
and fast compression. Typically, a block of object
IDs is given to a category and new objects in that
category get IDs from the block. This ensures that
compression of the bitmaps works well since almost
every set consists of objects from the same category.
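
The effect of block-wise ID allocation on this compression can be
sketched as follows. The block size of 1024 bits, the dictionary
representation, and the function names are assumptions chosen for the
example; the engine's actual bitmap layout is not described here.

```python
BLOCK_BITS = 1024  # assumed block size, for illustration only


def compress_bitmap(ids, block_bits: int = BLOCK_BITS) -> dict:
    """Return {block_index: bitmask}; all-zero blocks are never stored."""
    blocks = {}
    for i in ids:
        idx, bit = divmod(i, block_bits)
        blocks[idx] = blocks.get(idx, 0) | (1 << bit)
    return blocks


def contains(blocks: dict, i: int, block_bits: int = BLOCK_BITS) -> bool:
    """Membership test against the compressed bitmap."""
    idx, bit = divmod(i, block_bits)
    return bool((blocks.get(idx, 0) >> bit) & 1)


# 500 objects of one category, with IDs drawn from a contiguous range.
clustered = compress_bitmap(range(5000, 5500))

# 500 objects whose IDs are spread out, one per block.
scattered = compress_bitmap(range(0, 500 * 2048, 2048))

blocks_clustered = len(clustered)  # blocks that must be stored
blocks_scattered = len(scattered)
```

Both sets hold 500 objects, but the clustered set touches 2 blocks
while the scattered set touches 500, which is why assigning ID blocks
per category helps the compression.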
(V) It is not desirable to have different encodings
for databases of different space sizes. If $N_1 < N_2$,
$\varphi_1$ is the encoding for $[0..N_1]$, and $\varphi_2$ is a
version of the same encoding for $[0..N_2]$, then
$\forall x < N_1:\ \varphi_1(x) = \varphi_2(x)$. This ensures that if the ID space
expands over time, the database would remain
compatible. Properties (I) and (II) provide an
alternative solution for functions that do not satisfy
(V). If the ID space is chosen large enough from the
very beginning, small databases would have short
IDs and therefore there will be no waste in terms of
database space.
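
As a brief check of property (V), the Python sketch below compares two
versions of identical encoding, taking n = 2 for the smaller space and
n = 8 for the larger one. The names `phi_1` and `phi_2` mirror the
notation above; the choice of widths is an assumption for the example.

```python
def identical_encode(x: int, n: int) -> bytes:
    """Fixed-width big-endian encoding: always exactly n bytes."""
    return x.to_bytes(n, "big")


def phi_1(x: int) -> bytes:
    """Version for the smaller space [0..2**16 - 1]."""
    return identical_encode(x, 2)


def phi_2(x: int) -> bytes:
    """Version for the larger space [0..2**64 - 1]."""
    return identical_encode(x, 8)


x = 0x21
short_form = phi_1(x).hex()  # "0021"
long_form = phi_2(x).hex()   # "0000000000000021"

# Property (V) would require phi_1(x) == phi_2(x) for every x in the
# smaller space; for identical encoding this already fails at x = 0.
satisfies_v = all(phi_1(v) == phi_2(v) for v in range(2**16))
```

Identical encoding therefore lacks property (V): a database that
outgrows n = 2 would have to re-encode every stored ID, which is the
situation the alternative of fixing a large ID space from the start
is meant to avoid.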
An ID space of $2^{56}$ covers all currently existing
databases and the databases of the foreseeable
future. Let us review the encodings we have discussed
so far and see if they satisfy the three requirements
and have the five properties discussed above.