ing is better than pipeline processing in all experiments.
The system active time decreases as the number of RelayNodes
increases, for both synthetic and realistic documents. In
addition, the greater the document depth and the larger the
number of XML tags, the longer the system active time for
both document types. Useless processing is more pronounced
at each RelayNode for the synthetic documents with greater
depth, because the RelayNodes cannot match many of the tags
within the document parts allocated to them. Hence, the
EndNode is left with a large amount of processing to
complete. Useless processing may be reduced by dividing the
document according to its structure and the grammar checking
rules. In addition, node activity is more sensitive to the
number of RelayNodes in parallel processing than in pipeline
processing. Regarding task complexity, node activity results
in pipeline processing are similar regardless of the task
performed.
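The remark in the preceding paragraph on dividing the document according to its structure can be illustrated with a minimal Python sketch. It is not the partitioning scheme used in our experiments: the sample document, the split points, and the helper names `tags` and `unmatched` are hypothetical and chosen only for illustration. The sketch counts, for each document part, the tags whose partner lies in another part and therefore cannot be matched locally.

```python
import re

# Matches start tags, end tags, and self-closing tags (simplified; no comments/CDATA).
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>")

def tags(doc):
    """Return (name, is_end_tag) pairs, skipping self-closing tags."""
    return [(m.group(2), bool(m.group(1)))
            for m in TAG.finditer(doc) if not m.group(3)]

def unmatched(part):
    """Count tags in one document part whose partner lies outside that part."""
    stack, count = [], 0
    for name, is_end in part:
        if not is_end:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()              # matched locally
        else:
            count += 1               # end tag whose start tag is in an earlier part
    return count + len(stack)        # start tags whose end tag is in a later part

# Hypothetical document with a deep first subtree.
DOC = "<r><a><b><c></c></b></a><d></d></r>"
t = tags(DOC)

naive = [t[:5], t[5:]]       # equal-sized split that ignores structure
aligned = [t[:7], t[7:]]     # split at the boundary between subtrees <a> and <d>

print(sum(unmatched(p) for p in naive))    # 6
print(sum(unmatched(p) for p in aligned))  # 2
```

In this toy case the structure-aligned split leaves only the root element's tags unmatched, whereas the equal-sized split leaves six tags to be resolved elsewhere.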
Processing time is also similar for parallel and pipeline
processing. The results show that the extra active time in
pipeline processing is due to additional sending/receiving
thread time. Generally, the system processing time decreases
as the number of RelayNodes increases. However, in some
cases (e.g., synthetic doc06), fewer RelayNodes are more
efficient, owing to the specific document partition and
processing allocation.
Regarding the parallelism efficiency ratio, parallel
processing is more efficient than pipeline processing in
every case because, in the parallel case, more threads
operate concurrently on different parts of the document.
According to the parallelism efficiency ratio, synthetic
documents are processed more efficiently than realistic
documents. Moreover, in synthetic document processing, the
efficiency ratio is better with 4 RelayNodes than with
2 RelayNodes for parallel processing only. In realistic
document processing, the efficiency ratio improves with
4 RelayNodes as compared with 2 RelayNodes for both pipeline
and parallel processing.
For convenience, we summarize our performance
characterization results in Table 4.

Table 4: Summary of Experimental Results.

                              Synthetic docs (smaller docs)          Realistic docs (larger docs)
                              PAR vs PIP     number of RN            PAR vs PIP     number of RN
Job execution time            PAR is better  4 RN is better in PAR   PAR is better  4 RN is better in both PIP and PAR
System active time
System processing time        No difference  4 RN is better          No difference  4 RN is better
Parallelism efficiency ratio  PAR is better  4 RN is better in PAR   PAR is better

4 RELATED WORK
XML parallel processing has recently been addressed in
several papers. (Lu and Gannon, 2007) proposes a
multi-threaded XML parallel processing model in which
threads steal work from each other to support load balancing
among threads. They exemplify their approach with a
parallelized XML serializer. (Head and Govindaraju, 2007)
focuses on parallel XML parsing, evaluating the performance
of multi-threaded parsers versus thread communication
overhead. (Head and Govindaraju, 2009) introduces a parallel
processing library to support parallel processing of large
XML data sets. They explore speculative execution
parallelism by decomposing Deterministic Finite Automata
(DFA) processing into DFA plus Non-Deterministic Finite
Automata (NFA) processing on symmetric multi-processing
systems. To our knowledge, our work is the first to evaluate
and compare parallel against pipeline distributed XML
processing.
5 CONCLUSIONS
In this paper, we have studied two models of distributed XML
document processing, parallel and pipeline, in two types of
environments, virtual and real machines, using XML documents
with various characteristics. In general, virtual and real
environments behave similarly in streaming data processing
of XML documents. Virtual machines are flexible in resource
allocation, providing more efficient resource utilization.
Regarding processing models, pipeline processing is less
efficient than parallel processing for both document types,
because parts of the document that are not to be processed
at a specific node need to be received and relayed to other
nodes, consuming node resources and increasing processing
overhead. Regardless of the distributed model, the
efficiency of distributed processing depends on the XML
document characteristics, as well as on its task partition.
Optimal partitioning of XML documents for efficient
distributed processing is part of ongoing research.
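The relay overhead described in the preceding paragraph can be made concrete with a toy model. The sketch below is illustrative only and rests on stated assumptions: in the pipeline model every RelayNode is assumed to receive the whole document and forward it, while in the parallel model each RelayNode is assumed to receive only its own part. The function name `relayed_bytes` and the part sizes are hypothetical.

```python
def relayed_bytes(part_sizes, model):
    """Bytes each RelayNode receives but does not process itself (toy model).

    part_sizes[i] is the size of the document part processed by RelayNode i.
    """
    total = sum(part_sizes)
    if model == "pipeline":
        # Assumption: each node in the chain receives the full document
        # and relays everything that is not its own part.
        return [total - own for own in part_sizes]
    if model == "parallel":
        # Assumption: each node receives only the part it processes.
        return [0 for _ in part_sizes]
    raise ValueError("model must be 'pipeline' or 'parallel'")

parts = [100, 200, 300, 400]  # hypothetical part sizes in bytes
print(relayed_bytes(parts, "pipeline"))  # [900, 800, 700, 600]
print(relayed_bytes(parts, "parallel"))  # [0, 0, 0, 0]
```

Under these assumptions, the data that a pipeline node handles without processing grows with the size of the rest of the document, which is consistent with the additional sending/receiving time observed for pipeline processing.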
In this work, we have focused on distributed well-formedness
checking and validation of single XML documents. Other XML
processing tasks, such as filtering and XML transformations,
as well as multiple XML document processing, can also be
studied. We are also interested in