MESSAGE INDEXING FOR DOCUMENT-ORIENTED
INTEGRATION PROCESSES
Matthias Boehm, Uwe Wloka
Dresden University of Applied Sciences, Database Group, 01069 Dresden, Germany
Dirk Habich, Wolfgang Lehner
Dresden University of Technology, Database Technology Group, 01062 Dresden, Germany
Keywords:
Enterprise Application Integration, Message Indexing, MIX, Algorithms, Data Structures, Performance.
Abstract:
The integration of heterogeneous systems is still an evolving research area. Due to the complexity of integration processes, their optimization poses several challenges. Message-based integration systems, like EAI servers and workflow process engines, are mostly document-oriented, using XML technologies to achieve suitable data independence from the different, typically proprietary data representations of the supported external systems. However, such an approach causes large costs for single-value evaluations within the integration processes. At this point, message indexing, adapting extended database technologies, can be applied in order to achieve better performance. In this paper, we introduce our message indexing structure MIX and discuss and evaluate immediate as well as deferred indexing concepts.
1 INTRODUCTION
The horizontal integration of systems by message-
based communication via middleware products is a
widely used method of application integration to en-
sure a loose coupling of participating systems and ap-
plications. In order to reach the required data inde-
pendence, most integration platforms—like the SAP
Exchange Infrastructure (XI) or TransConnect—work
in a document-oriented way. There, external data
is translated into XML and can thus be processed in a uniform way. Hence, the term message refers
to XML messages with additional attributes and bi-
nary attachments. This claim is also true for the ma-
jority of workflow management systems. Due to the
document-oriented approach, access to single values
of a message (e.g., in ASSIGN or SWITCH operators)
is cost-intensive, where those costs increase with the
data size. Imagine data-centric integration processes
with several attribute evaluations for control-flow de-
termination; here, the document-oriented approach is
obviously inefficient. In order to eliminate these dis-
advantages, message indexing can be used. This yields a performance improvement that grows with the increasing data size. Hence, in this paper, we
present a new message indexing approach in order
to enable efficient read and write access to single message values that are required during process execution. The remainder of the paper is structured as follows.
In Section 2, we motivate our approach and give
a detailed problem description. Further, we present
MIX—a transient message indexing structure—and
its processing concepts in Section 3. In Section 4,
we evaluate our approach with a number of experi-
ments showing the impact of the different concepts
and techniques. Finally, in Sections 5 and 6, we give
an overview of related work, summarize our results,
and conclude the paper. For a full version of the
paper—including all algorithms—see (Böhm et al., 2008).
Figure 1: MOM Reference Architecture.
2 PROBLEM DESCRIPTION
The indexing of messages for document-oriented in-
tegration processes is very different from indexing statically stored data. Such approaches often assume that data is stored statically with a low update rate and that read optimization is the primary goal. However, in our application
area, high message throughput results in a high update
rate. Hence, the write performance is very important.
Architectural Considerations. In order to general-
ize document-oriented integration processes, Figure 1
introduces a Message-Oriented Middleware (MOM)
reference architecture. There, proprietary mes-
sages are received from external systems with pas-
sively listening inbound adapters. These inbound
adapters transform the incoming messages into an
internal XML message format. Those messages are
put into specific message queues, which separate
the inbound side from the outbound side. Here, mes-
sages are made persistent for scalability and recov-
ery purposes. Further, the core Process Engine
uses specific outbound adapters to actively inter-
act (pull and push) with external systems. Also,
the outbound adapters realize translations between
XML and the proprietary data format of the target
system. Thus, the right places for index maintenance
for incoming messages are the inbound and outbound
adapters because here, all attributes are scanned and
transformed into the internal XML format. The in-
dexed attributes are then used within the different in-
tegration processes (e.g., in ASSIGN and SWITCH op-
erators). Fundamentally, the specification of which
attributes should be indexed can be explicitly defined
by a user or implicitly derived from the integration
process workflow description.
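As an illustration, the following Java sketch shows how an inbound adapter could hand the attributes selected for indexing over to the index manager while it translates an incoming message; all class and method names here are hypothetical and do not denote an actual product API.

import java.util.List;
import java.util.Map;

// Hypothetical facade of the index manager (names are illustrative only).
interface IndexManagerFacade {
    void indexAttribute(long msgId, String xpath, String value);
}

// Sketch of an inbound adapter that indexes selected attributes while it
// translates an incoming message into the internal XML format.
class InboundAdapterSketch {
    private final IndexManagerFacade indexManager;
    private final List<String> indexedXPaths;  // defined by a user or derived from the process

    InboundAdapterSketch(IndexManagerFacade indexManager, List<String> indexedXPaths) {
        this.indexManager = indexManager;
        this.indexedXPaths = indexedXPaths;
    }

    // called once per incoming message after translation to XML
    void onMessage(long msgId, Map<String, String> xpathToValue) {
        for (String xpath : indexedXPaths) {
            String value = xpathToValue.get(xpath);
            if (value != null) {
                indexManager.indexAttribute(msgId, xpath, value);
            }
        }
    }
}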
Characteristics of Message Indexing. Based on the
architectural considerations, there are specific charac-
teristics of message indexing, which allow the usage
of context knowledge and thus more efficient index
maintenance. Each message includes a sequence-generated message ID. Thus, the message IDs form an ascending (though, due to concurrency, not gap-free) sequence of BIGINT values and are therefore well suited as index tree key criteria. Further,
transformation operations within the process cause
the creation of new messages with new message IDs.
Thus, index scans must be aware of dynamic message
ID changes and dynamic attribute name changes.
Next, a high update rate (insert, update and delete)
is caused by the fact that indexed attributes are often
read only once. Finally, integration processes have a
throughput-oriented optimization goal, which offers
the possibility of asynchronous processing, where additional latency is acceptable.
3 MIX: MESSAGE INDEX
The message indexing structure MIX uses context
knowledge to speed up the index maintenance. In
this section, we define the core index structure and
describe our techniques for index maintenance.
3.1 Index Structure
The index structure is an extended B+ tree. Indexed attributes are exclusively stored within the leaf nodes, while all other nodes contain key values. Based on this core model, we define our indexing rules, which differ from the original B+ tree. The core index tree
comprises three types of nodes. First, each index node
comprises n-1 search keys sorted in ascending order
and n (node size) pointers to child nodes. Here, all
types of nodes could be children of index nodes and
the message ID is used as key criterion. Second, all
indexed attributes are stored at the lowest tree level
within the leaf data nodes. Such a node contains a
hashmap of several indexed attributes related to one
message ID. In this hashmap, (1) the hash values of
the XPath expression for the specific attribute and (2)
the attribute values are stored. Third, the leaf node pointer is used similarly to a leaf data node, but it does not contain any materialized data; instead, it holds a pointer to either a leaf data node or another leaf node pointer.
Definition 1. We define that the MIX structure is not required to be balanced under all circumstances. However, the tree
works in a self-organizing way, which lets it tend to-
wards a balanced state. Leaf data nodes are always
present on the lowest tree level.
Definition 2. Let f be the fill factor of the tree. We
define that the root node and all other index nodes of
the tree must have at least one pointer to a child node.
The index nodes are allowed to have an f of 1.0 with
n pointers to child nodes. The leaf data nodes must
include at least one (XPath expression, value) tuple
and the leaf node pointer must reference exactly one
leaf data node or leaf node pointer.
Definition 3. Let n be the node size; then, an index
node may include at most n-1 search keys and n node pointers. We define that n may vary between nodes, allowing n to be changed adaptively at runtime.
Definition 4. We define that single elements or index
nodes and thus complete subtrees may be inserted into
and deleted from the index tree.
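To make the structure more tangible, the following Java sketch outlines the three node types as we describe them above; all class and field names are illustrative and do not correspond to the original implementation.

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the three MIX node types described above.
abstract class MixNode { }

// Index node: up to n-1 sorted message-ID keys and n child pointers.
class IndexNode extends MixNode {
    long[] keys;          // search keys (message IDs), sorted in ascending order
    MixNode[] children;   // n child pointers; children may be of any node type
    int numChildren;      // number of currently used child pointers
    IndexNode(int n) { keys = new long[n - 1]; children = new MixNode[n]; }
}

// Leaf data node: the indexed attributes of one message, keyed by the
// hash value of the XPath expression that identifies the attribute.
class LeafDataNode extends MixNode {
    long msgId;
    Map<Integer, String> attributes = new HashMap<>();   // hash(XPath) -> value
    void put(String xpath, String value) { attributes.put(xpath.hashCode(), value); }
    String get(String xpath) { return attributes.get(xpath.hashCode()); }
}

// Leaf node pointer: no materialized data, only a reference to a leaf data
// node or to another leaf node pointer.
class LeafNodePointer extends MixNode {
    long msgId;
    MixNode target;       // LeafDataNode or LeafNodePointer
}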
Figure 2 illustrates the message index macro ar-
chitecture. Basically, there is the core index
tree, which works according to the definitions presented above.
Figure 2: Message Index Macro Architecture.
Furthermore, all accesses to the
core index tree are synchronized using an exter-
nal lock table. For that, only the states NO LOCK,
READ LOCK and WRITE LOCK are supported. The
locking mechanism is based on hierarchical locks
for the specific message ID. Aside from the struc-
tural organization, there are four types of interac-
tions with the MIX: IXPUT (inserting indexed values
and subtrees), IXUPDATE (updating indexed values),
IXREMOVE (deleting indexed values and subtrees) and
IXSCAN (reading indexed values). Here, only IXSCAN
works in a synchronous and thus blocking way. All
other interactions may be used both synchronously
(immediate) or asynchronously (deferred). For the
case of deferred processing, we provide an insert
queue as well as a delete queue. However, a syn-
chronous IXSCAN causes the queues to be flushed.
This structure is by design write-optimized (deferred
techniques) and allows for dynamic updates of in-
dexed values with low costs.
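To summarize these interactions, the following Java sketch outlines the lock states and the four MIX operations; the signatures reflect our reading of the description above and are not the original API.

// Illustrative sketch of the lock states and the four MIX interactions.
enum LockState { NO_LOCK, READ_LOCK, WRITE_LOCK }

interface MessageIndex {
    // write interactions: may be executed immediately (synchronous)
    // or deferred (buffered in the insert/delete queues)
    void ixPut(long msgId, String xpath, String value, boolean deferred);
    void ixUpdate(long oldMsgId, long newMsgId,
                  String oldXPath, String newXPath, boolean deferred);
    void ixRemove(long msgId, boolean deferred);

    // read interaction: always synchronous; flushes the queues before reading
    String ixScan(long msgId, String xpath);
}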
An example MIX index structure is illustrated by
Figure 3 and shows a balanced index tree with n = 3
and nine leaf nodes. Especially the leaf node pointers
are specific to our approach. There is a pointer from
leaf node (msgID = 104) to leaf node (msgID = 103).
In conclusion, every IXSCAN(104) implies an IXSCAN(103) with very low overhead. Furthermore, there is also an
indirect leaf node pointer (msgID=117) that refers to
another leaf node pointer.
3.2 Inserting Indexed Values
The insertion of indexed values is realized with the
interaction IXPUT. Here, we distinguish two tech-
niques: the immediate IXPUT and the deferred
IXPUT. The algorithm for immediate IXPUT is real-
ized in the specific node type Index Node. The mes-
sage ID, an XPath expression, an instance value and
the current depth (decremented for each index level)
are given. Basically, there is a loop over all node
pointers of the current index node, searching the right
key. If it is found, three different cases may occur.
First, the current node pointer could be NULL; then,
a new child index node is created, depending on the
given index depth. Second, the pointer could refer-
ence a leaf node. If the msgID is equivalent, the new
attribute can simply be inserted; otherwise, the cur-
rent leaf node has to be reordered or split to a new
index node. Third, the pointer is not a leaf node; then
the IXPUT event is recursively pushed down, or if the
child node is already full, it also has to be split.
Figure 3: Example Index Tree Structure.
Especially the sequence-generated message ID, in connection with the tree that is not balanced by definition, allows for optimization. If the index tree is full,
which means that it is balanced and it has a fill factor
of 1.0, the next IXPUT would result in an index split.
In order to use the context of sequence-generated
keys, we use the symmetric index grow-up. Here, the
root index node is checked to see if it is full. If it is, a
new index node with n node pointers is created. After
that, the whole old root node is inserted into the new root node. As a result, provided that all message IDs are inserted in sequence (the default case), no index node splits occur any longer, which makes the IXPUT operation very efficient.
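A minimal Java sketch of this grow-up step, reusing the IndexNode sketch from Section 3.1, could look as follows; the helper is our own simplification of the described technique.

// Sketch of the symmetric index grow-up: if the root is full, a new root is
// created and the whole old root becomes its first subtree, so purely
// sequential inserts no longer trigger node splits.
class GrowUpSketch {
    static IndexNode growUp(IndexNode root, int n) {
        if (root.numChildren < n) {
            return root;                      // root not yet full: nothing to do
        }
        IndexNode newRoot = new IndexNode(n); // fresh root with n empty pointers
        newRoot.children[0] = root;           // whole old root becomes first child
        newRoot.keys[0] = root.keys[n - 2];   // separator: largest key of old root
        newRoot.numChildren = 1;
        return newRoot;
    }
}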
However, there is further optimization potential.
If we know that all inserted values are (nearly) in se-
quence, the tree path search for each new value can
be optimized by a deferred algorithm which reduces
the tree path searches. The precondition for that is the
introduced insert queue, where all insert requests
are asynchronously stored. In case the QUEUE SIZE
is reached, a specific thread locks the queue and sorts
the entries with the quick-sort algorithm. Now, we
create a new index node tmp with n key slots. Fur-
ther, we know the last inserted message ID (lastID).
For each request, it is tested whether or not the mes-
sage ID is larger than lastID. If it is, the request is
inserted into tmp; otherwise, it is out of sequence and has to be inserted directly into the root index. Finally, the whole
subtree tmp could be inserted into the root index and
the queue can be unlocked. The deferred IXPUT
algorithm works like the immediate IXPUT, except
for the fact that complete subtrees can be inserted. It
has to be checked if the complete subtree, with given
minID and maxID, can be inserted into the match-
ing node pointer. If so, the algorithm is very similar
to immediate IXPUT. Otherwise, a regular subtree split would be required. Because such splitting is cost-intensive, the leaf data nodes of the subtree are instead simply inserted individually with IXPUT.
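The following Java sketch condenses the deferred IXPUT queue handling described above; the request class and the tree facade are our own simplifications, not the original implementation.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class InsertRequest {
    long msgId; String xpath; String value;
    InsertRequest(long msgId, String xpath, String value) {
        this.msgId = msgId; this.xpath = xpath; this.value = value;
    }
}

class DeferredPutQueue {
    static final int QUEUE_SIZE = 10;                 // queue size used in the evaluation
    private final List<InsertRequest> queue = new ArrayList<>();
    private long lastId = Long.MIN_VALUE;             // last message ID inserted into the tree

    synchronized void enqueue(InsertRequest r, MixTreeSketch tree) {
        queue.add(r);
        if (queue.size() >= QUEUE_SIZE) {
            flush(tree);
        }
    }

    synchronized void flush(MixTreeSketch tree) {
        // sort the buffered requests by message ID (the paper uses quick-sort)
        queue.sort(Comparator.comparingLong(r -> r.msgId));
        List<InsertRequest> subtree = new ArrayList<>();   // stands in for the tmp node
        for (InsertRequest r : queue) {
            if (r.msgId > lastId) {
                subtree.add(r);             // in sequence: goes into the tmp subtree
                lastId = r.msgId;
            } else {
                tree.putImmediate(r);       // out of sequence: direct root insert
            }
        }
        tree.putSubtree(subtree);           // single insertion of the whole subtree
        queue.clear();
    }
}

// Assumed tree facade with immediate and subtree insertion (hypothetical names).
interface MixTreeSketch {
    void putImmediate(InsertRequest r);
    void putSubtree(List<InsertRequest> sortedRequests);
}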
3.3 Updating Indexed Values
The update of indexed values is a very specific prob-
lem, related to integration processes. There, the
update is realized with the interaction IXUPDATE.
In analogy to the insertion of indexed values, we
contribute algorithms for two main techniques: the
immediate IXUPDATE and the deferred IXUPDATE.
Basically, an IXUPDATE ensures that multiple message
IDs may reference one leaf data node. Therefore, the
IXUPDATE is conceptually separated into an IXSCAN,
the creation of the leaf node pointer to the scanned
node, and an IXPUT of the leaf node pointer, which is
also a leaf node. Although this could be realized sim-
ply on the root index level, this would be inefficient.
Our main approach here is that the update request is
recursively pushed down along the index tree until the
old message ID and the new message ID are no longer contained within the same pointer slot. At this point, the up-
date event is separated into a tree scan and a tree put.
Thus, the conceptual separation of events is realized
at the lowest possible index level. The algorithm of
immediate IXUPDATE searches for the right key slot
using the old message ID. If a slot is determined, the
new message ID is checked to see if it also could be
applied to the slot. If so, the update request is recur-
sively applied on the slot pointer. Otherwise, the low-
est possible update level in the index has been reached
and the separation of events occurs. In addition to the
dynamic message ID update, an update request can
also initiate dynamic attribute changes (with new and
old XPath expressions). The leaf node pointer then
also contains a mapping for these attributes. Thus,
both types of dynamic updates could be realized with
one generic algorithm. The deferred IXUPDATE is
closely related to the deferred IXPUT. There is a single insert queue for both deferred insert and deferred
update requests. The enhancement is achieved by the
substitution of update requests with insert requests.
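The following sketch illustrates how such a leaf node pointer with an attribute mapping could look, reusing the node sketches from Section 3.1; the class name and fields are our own illustration of the described mechanism.

import java.util.HashMap;
import java.util.Map;

// Illustrative leaf node pointer that additionally maps new XPath expressions to
// old ones, so that dynamic message ID changes and dynamic attribute name changes
// are handled by one generic lookup.
class MappingLeafNodePointer extends MixNode {
    long newMsgId;
    LeafDataNode target;                                   // scanned old leaf data node
    Map<Integer, Integer> xpathMapping = new HashMap<>();  // hash(newXPath) -> hash(oldXPath)

    String get(String newXPath) {
        Integer oldHash = xpathMapping.get(newXPath.hashCode());
        if (oldHash != null) {
            return target.attributes.get(oldHash);         // renamed attribute
        }
        return target.get(newXPath);                       // unchanged attribute name
    }
}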
3.4 Removing Indexed Values
There are as many deletion requests as insert re-
quests within the index. Thus, an efficient IXREMOVE
is required. Here, we again distinguish two techniques: the immediate IXREMOVE and the deferred IXREMOVE. The deferred IXREMOVE can be designed to be lazier in order to achieve the best performance. The core
algorithm of immediate IXREMOVE works as fol-
lows. It has two parameters: the message ID msgID
and a Boolean flag force. Note that if force is set to false,
then this immediate algorithm is only used for setting
delete flags as a precondition of the deferred algo-
rithm. Mainly, the algorithm comprises the search
for the right key slot. If the right one is found, it is
checked whether or not force is set, whether
the child node is a leaf node and whether the child
node is unused. This latter check is caused by pend-
ing leaf node pointers (back-reference counter). In case the
check has been successful, the pointer is simply set
to NULL. Otherwise, the remove event is recursively
pushed down. With this algorithm, only leaf nodes are
eliminated. All free index nodes have to be deleted by
a deferred IXREMOVE. The deferred IXREMOVE is
based on the described flagging of unused data nodes.
Figure 4 shows this concept using tree annotations
(+,-). So, if a node is flagged as deletable, it gets a
’-’, otherwise it has a ’+’. A parent node is ’-’ if all
its children are set to ’-’. The deferred IXREMOVE
only operates on the root index level, trying to cut off
complete subtrees. In conclusion, only a maximum of
n-1 node pointers and keys have to be shifted to the
left. The deferred IXREMOVE algorithm is mainly
separated into two parts: (1) the removal of subtrees
and (2) the reordering of subtrees. So, in conclusion,
the current index tree could be seen as a sliding win-
dow over all indexed values from past to future.
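The following Java sketch illustrates the flag-based deferred removal at the root level, again using the node sketches from Section 3.1; storing the flags in an external set is our own simplification of the +/- annotations.

import java.util.Set;

class RemoveSketch {
    // a node is deletable ('-') if it is flagged itself or if it is an index node
    // whose used children are all deletable
    static boolean isDeletable(MixNode node, Set<MixNode> flagged) {
        if (node == null || flagged.contains(node)) {
            return true;
        }
        if (node instanceof IndexNode) {
            IndexNode idx = (IndexNode) node;
            for (int i = 0; i < idx.numChildren; i++) {
                if (!isDeletable(idx.children[i], flagged)) {
                    return false;
                }
            }
            return true;
        }
        return false;   // unflagged leaf data node or leaf node pointer
    }

    // deferred removal at the root level: cut off deletable subtrees and shift
    // the remaining pointers and keys to the left (at most n-1 of them)
    static void deferredRemove(IndexNode root, Set<MixNode> flagged) {
        int target = 0;
        for (int i = 0; i < root.numChildren; i++) {
            if (!isDeletable(root.children[i], flagged)) {
                root.children[target] = root.children[i];
                if (target > 0) {
                    root.keys[target - 1] = root.keys[i - 1];
                }
                target++;
            }
        }
        for (int i = target; i < root.numChildren; i++) {
            root.children[i] = null;   // cut off the deletable subtree
        }
        root.numChildren = target;
    }
}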
3.5 Reading Indexed Values
The reading of indexed values can be processed exclu-
sively with an immediate IXSCAN. There, the com-
plexity of scanning an attribute is O(log(n)). It com-
prises the search for the right key slot. If it is found,
an IXSCAN is recursively processed on the child node.
So, implicitly, three actions may occur. First, if the
child node is an index node, the algorithm is called in
a recursive way. Second, in case of a leaf data node,
the scan is transformed into a simple hashmap lookup
using the given XPath expression. Third, in case of a
leaf node pointer, the referenced node is scanned.
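A condensed Java sketch of this recursion, based on the node sketches from Section 3.1, could look as follows; the slot search is simplified to the essentials.

// Sketch of the recursive IXSCAN with its three cases (index node,
// leaf data node, leaf node pointer).
class ScanSketch {
    static String ixScan(MixNode node, long msgId, String xpath) {
        if (node == null) {
            return null;                                   // message ID not indexed
        }
        if (node instanceof IndexNode) {                   // case 1: recurse into child
            IndexNode idx = (IndexNode) node;
            int slot = 0;
            while (slot < idx.numChildren - 1 && msgId > idx.keys[slot]) {
                slot++;                                    // find the right key slot
            }
            return ixScan(idx.children[slot], msgId, xpath);
        }
        if (node instanceof LeafDataNode) {                // case 2: hashmap lookup
            return ((LeafDataNode) node).get(xpath);
        }
        // case 3: leaf node pointer, scan the referenced node
        return ixScan(((LeafNodePointer) node).target, msgId, xpath);
    }
}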
Figure 4: Deferred Index Removal.
4 EVALUATION EXPERIMENTS
The message indexing structure MIX and its opera-
tions IXPUT, IXUPDATE, IXREMOVE, and IXSCAN were
evaluated with a large set of experiments with dif-
ferent scale factors. Here, we illustrate only the main experiments. When doing so, we also compare
MIX with the alternatives, such as a B+ tree implementation as well as DOM and SAX message scans.
The experimental setup for the process type execution
comprises a local computer system (CS [Dual Gen-
uine Intel T2400, 1.5GB RAM]) and a Java imple-
mentation of the MIX index structure, which works in
a transient way. This means the single values only re-
side in the main memory. We chose Java due to its
relevance for document-oriented integration systems.
SAX/DOM Comparison. The index usage has to be
compared with the direct access to single values of
document-oriented messages. The difference between the two is the performance impact of applying message indexing to integration systems. We conducted exper-
iments to measure the performance of MIX as well
as of the XML technologies Document Object Model
(DOM) and Simple API for XML (SAX). The result is
shown in Figure 6a. Obviously, using the index, sig-
nificant performance improvements are possible. The
message index has a scanning time that is independent of the message size. In contrast, the scanning time for DOM grows exponentially, while the SAX scanning time increases linearly with the message size. The message index is also nearly independent of the used fill factor due to its logarithmic complexity.
B+ Tree Comparison. Then, we compared MIX to a default Java B+ tree implementation. In order to evaluate this issue, we conducted two experiments. The
first experiment’s focus was on the maintenance of
dynamic message ID changes. MIX can use the leaf
node pointers, while two B+ trees (values, lookup) are
required. The second experiment simply compared
the insert, delete and read performance of both index
structures. Figure 5 shows the results of both experi-
ments. Obviously, the MIX IXSCAN is always superior
to the B+ tree find and also scales better with an increasing number of pointers. Although the B+ tree
insert outperforms the MIX IXUPDATE for a small
number of pointers, MIX scales much better due to
the usage of update request forwarding. In summary,
MIX is more efficient than the B+ tree in managing node pointer references. Further, all MIX operations are slightly superior to the B+ tree operations.
Deferred Technique Evaluation. In a next step, we
compared immediate and deferred techniques, where
the latter use context knowledge to achieve much
better performance.
Figure 5: MIX / B+ Tree Comparison.
Figure 6: MIX Operation Comparisons.
Thus, we conducted an experiment comparing the immediate and deferred IXPUT
as well as the immediate and deferred IXREMOVE.
The deferred techniques both use a queue size of
10. The performance was examined in dependence
on the index fill factor. Figure 6b illustrates the re-
sults. The deferred techniques are superior in nearly
all situations compared to their immediate counter-
parts. There is a large difference between immediate
IXPUT and deferred IXPUT. Although the differ-
ence between immediate IXREMOVE and deferred
IXREMOVE is not as large, it is still a good im-
provement. However, when using the deferred
IXREMOVE, performance peaks are possible.
Optimal Configurations. The chosen queue size and
the node size have measurable impact on the index
performance. We conducted two experiments for the
MIX operations in order to determine optimal config-
uration properties. First, we experimented with differ-
ent queue sizes, which are used for the deferred write
operations. There, a node size of 3, a fill factor of
100,000 and a sequence factor of 0.99 were used. Fig-
ure 6c shows the performance plot. From a queue size of 10 upwards, performance is fairly constant, with marginal differences for increasing queue sizes but no significant trend towards better performance.
However, there are performance peaks, caused by the
characteristics of deferred processing. Second, we
tried to choose the optimal node size with the imme-
diate operations. The node size was expected to be a
compromise between high write performance (large
node size) and high read performance (small node
size). Figure 6d shows the result. From a node size of 8 upwards, an almost linear performance trace can be observed for increasing node sizes. The performance of read and write operations behaves similarly for small node sizes. However, from a node size of 26 upwards, the expected trend becomes visible.
5 RELATED WORK
Basically, we survey four fields of indexing which
are close to our approach: XML indexing techniques,
B tree optimization, B+ tree optimization, and ex-
isting work in the field of information integration.
Due to the fact that we index single message val-
ues, identified by XPath expressions, we want to distinguish our approach from XML indexing techniques. These
could be classified into the groups structural indexing
((Chung et al., 2002), (Grust, 2002), (Haustein et al.,
2005), (Kaushik et al., 2002), and (Qun et al., 2003)),
value indexing ((Bruno et al., 2002), (Rao and Moon,
2004)) and hybrid indexing (where information re-
trieval techniques are used). Typically, when applying
such indexing techniques, multiple indexes (not appli-
cable in our context) are built, indexing all single val-
ues of a document. We adapt MIX to the context knowledge of integration processes. Comparable approaches—using workload characteristics—have also been applied to B
tree indexes ((Graefe, 2004), (Graefe, 2006), (Graefe
and Larson, 2001), and (Lomet, 2001)). There, spe-
cific techniques (e.g., buffering) are provided for op-
timizing B trees for high update rates or special hard-
ware setups. In contrast to XML indexing techniques
and B tree indexing, our index structure is very simi-
lar to well-known B+ tree indexes, where all data re-
side in the leaf nodes. In particular, we want to point
out (Chen et al., 2001) and (Chen et al., 2002), where
internal jump-pointers from the current leaf node to
the following leaf node are used in order to speed up
range scans by pre-fetching. However, due to the se-
mantic context and the type of usage, there are ma-
jor differences to our approach. In the area of inte-
gration of heterogeneous systems, there is only little
work on indexing. A very exciting approach is the
adaptation of information retrieval methods for index-
ing dataspaces (Dong and Halevy, 2007). Such an
inverted list (like the Hier-ATIL) would also be ap-
plicable for message indexing using the message IDs
as instance identifiers and the XPath expression as
keywords. However, in order to adapt to the context
knowledge, a B+ tree extension is more efficient.
6 SUMMARY AND
CONCLUSIONS
Our intent was to optimize integration processes by
applying message indexing using context knowledge
about the specific characteristics of message-based
and document-oriented integration processes. There-
fore, we developed the message indexing structure
MIX, which is able to handle the dynamic message
ID changes and dynamic attribute name changes in
a suitable way. Furthermore, we take advantage of
the integration process characteristics, namely the sequence-generated message IDs, the high update rate, and the throughput-oriented optimization goal, by introducing deferred index maintenance techniques.
REFERENCES
Böhm, M., Habich, D., Lehner, W., and Wloka, U. (2008).
Message indexing for document-oriented integration
processes. Technical report, Dresden University of
Applied Sciences.
Bruno, N., Koudas, N., and Srivastava, D. (2002). Holistic
twig joins: optimal xml pattern matching. In SIG-
MOD.
Chen, S., Gibbons, P. B., and Mowry, T. C. (2001). Im-
proving index performance through prefetching. In
SIGMOD.
Chen, S., Gibbons, P. B., Mowry, T. C., and Valentin, G.
(2002). Fractal prefetching btrees: optimizing both
cache and disk performance. In SIGMOD.
Chung, C.-W., Min, J.-K., and Shim, K. (2002). Apex: an
adaptive path index for xml data. In SIGMOD.
Dong, X. and Halevy, A. Y. (2007). Indexing dataspaces. In
SIGMOD.
Graefe, G. (2004). Write-optimized b-trees. In VLDB.
Graefe, G. (2006). B-tree indexes for high update rates.
SIGMOD Record, 35(1).
Graefe, G. and Larson, P.-Å. (2001). B-tree indexes and cpu
caches. In ICDE.
Grust, T. (2002). Accelerating xpath location steps. In SIG-
MOD.
Haustein, M. P., Härder, T., Mathis, C., and Wagner, M.
(2005). Deweyids - the key to fine-grained manage-
ment of xml documents. In SBBD.
Kaushik, R., Bohannon, P., Naughton, J. F., and Korth, H. F.
(2002). Covering indexes for branching path queries.
In SIGMOD.
Lomet, D. B. (2001). The evolution of effective b-tree:
Page organization and techniques: A personal ac-
count. SIGMOD Record, 30(3).
Qun, C., Lim, A., and Ong, K. W. (2003). D(k)-index: An
adaptive structural summary for graph-structured data.
In SIGMOD.
Rao, P. and Moon, B. (2004). Prix: Indexing and querying
xml using Prüfer sequences. In ICDE.