4.2 Algorithm Optimization
For the 1 GB three-dimensional XML file in
our experiment, the basic algorithm discussed
in section 3 produces about 126,360,000 Map output
records in the first round, but only approximately
35,101,440 Reduce output records.
About 72% of the records are merged in the Reducers. For the
1 GB six-dimensional XML file, the number of
Map output records is approximately 1,075,164,192,
but the number of Reduce output records is only
51,9156,510. The gap for the 5 GB files is even
more pronounced. As the dimension increases, the
output of Map grows more rapidly. For a
D-dimensional XML file with n initial Data Nodes,
the number of records output directly by Map
is $2^{D} \times n$. In the basic algorithm, the first-round
workload therefore becomes too large as the dimension
grows, and too much data is transferred from the Mapper
to the Reducer. In a cloud environment with limited
capacity, this may increase the risk of jobs failing.
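The arithmetic behind these figures can be checked with a short sketch. The function names are illustrative and do not come from the paper; the record counts are those reported above for the 1 GB three-dimensional file, and the value of n used in the growth table is a hypothetical placeholder.

```python
def map_output_records(d: int, n: int) -> int:
    """First-round Map output of the basic algorithm: 2^D x n records."""
    return (2 ** d) * n


def merged_fraction(map_records: int, reduce_records: int) -> float:
    """Fraction of Map output records that are merged away in the Reducers."""
    return 1.0 - reduce_records / map_records


# Record counts reported for the 1 GB three-dimensional file.
ratio_3d = merged_fraction(126_360_000, 35_101_440)
print(f"merged in Reducers: {ratio_3d:.0%}")

# Growth of the first-round Map output with the dimension D,
# for a hypothetical n = 1,000 initial Data Nodes (assumed value).
growth = {d: map_output_records(d, 1_000) for d in range(3, 7)}
print(growth)
```

Each additional dimension doubles the first-round Map output for the same n, which is the growth the optimization below is meant to avoid.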
In the basic algorithm, all Additional Nodes come
from Data Nodes. However, some nodes can also be
generated from Link Nodes: Link Nodes with
longer paths can produce Additional Nodes with
shorter paths. In Fig.4, nodes in all − D can be
generated not only by all − A − B − C − D (Data Nodes)
but also by six other types of Link
Nodes, such as all − A − B − D and all − B − D. Although
the information in all − A − B − D or all − B − D is
not as complete as that in the Data Nodes, it is sufficient for
all − D. The significant advantage of using
shorter-path nodes to generate Link Nodes is the reduction in
I/O between the Mapper and the Reducer. For example,
suppose there are only two Data Nodes, with paths
all − a1 − b1 − c1 − d1 and all − a2 − b1 − c1 − d1.
The Link Node in all − B − D is then all − b1 − d1.
It is more efficient to generate the Link Node in
all − D from the Link Node all − b1 − d1 than
from the Data Nodes, because the Mapper then produces
only one output record, whereas the Data Nodes
would produce two.
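The two-node example above can be reproduced with a minimal in-memory sketch. This is not the paper's Hadoop implementation: the functions `map_project` and `reduce_merge` and the per-node `count` measure are hypothetical stand-ins used only to count how many records the Map step would emit.

```python
from collections import Counter


def map_project(nodes, keep):
    """Map step (sketch): emit one record per input node, keeping only
    the dimensions listed in `keep`."""
    emitted = []
    for path, count in nodes.items():
        key = tuple((dim, val) for dim, val in path if dim in keep)
        emitted.append((key, count))
    return emitted


def reduce_merge(emitted):
    """Reduce step (sketch): merge records with the same path, summing counts."""
    merged = Counter()
    for key, count in emitted:
        merged[key] += count
    return merged


# The two Data Nodes from the text; the count of 1 per node is an assumption.
data_nodes = Counter({
    (("A", "a1"), ("B", "b1"), ("C", "c1"), ("D", "d1")): 1,
    (("A", "a2"), ("B", "b1"), ("C", "c1"), ("D", "d1")): 1,
})

# Link Node in all - B - D, built once from the Data Nodes.
link_bd = reduce_merge(map_project(data_nodes, {"B", "D"}))

# Two ways to obtain the Link Node in all - D.
from_data = map_project(data_nodes, {"D"})  # Map emits one record per Data Node
from_link = map_project(link_bd, {"D"})     # Map emits one record per Link Node

print(len(from_data), len(from_link))
print(reduce_merge(from_data) == reduce_merge(from_link))
```

Both routes yield the same merged node for all − D, but the route through the Link Node sends fewer records from the Mapper to the Reducer.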
Fig.7 is an example of generating Link Nodes
by using an optimized algorithm for five-dimensional
data. As discussed in section 3, all of the pos-
sible Link Nodes for five-dimensional Data Nodes
(all − A − B − C − D − E) depend on combinations
of the path between all and E exclusively, which is
A − B − C − D. In the first round, by removing one
constraint in the path A − B − C − D, we get $C_4^1$ kinds
of Link Nodes shown in column 1 in Fig.7. The arrow
above the node is the indicator of a BreakPoint. The
arrow pointing to empty means the BreakPoint is at
all. Any Link Node whose BreakPoint does not point
to all can be used to generate new Link Nodes in the
next round.
Figure 7: Using Link Nodes for Generation.
In the next round, another constraint of
the Link Nodes is removed, and this constraint
must be a node at or before the BreakPoint. As shown
in Fig.7, A − B − C in column 1 stands for the Link
Nodes whose path is all − A − B − C − E. The BreakPoint
is at C, so in the next round, shown in column 2,
successively removing the nodes up to and including
C, which are C, B, and A, we obtain A − B, A − C, and B − C.
The BreakPoint moves to the node immediately
before the node that has been removed. If the BreakPoint
points at all, as for B − C, the node does not participate in the
generation work of future rounds, but its calculation
phase continues. In this way, we obtain all of
the combinations after the removal of two constraints
from the path, which is $C_4^2$ in total. The Link Nodes
are generated according to the rules described above
until all BreakPoints point to all. The total number is
$C_4^1 + C_4^2 + C_4^3 + C_4^4 = 15$, which is the same number
produced by the basic algorithm.
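The BreakPoint rule described above can be sketched as a small round-by-round enumeration. This is an illustration under stated assumptions, not the authors' code: the function name `generate_rounds` and the encoding of the BreakPoint as an integer `bp` (the number of leading nodes that may still be removed, with 0 meaning the BreakPoint points to all) are hypothetical.

```python
def generate_rounds(path):
    """Enumerate Link Node paths round by round using the BreakPoint rule.

    Each state is (nodes, bp): `nodes` is the remaining path between all and
    the last dimension, and `bp` is the number of leading nodes that may
    still be removed (bp == 0 means the BreakPoint points to all).
    """
    frontier = [(tuple(path), len(path))]  # initially every node is removable
    rounds = []
    while frontier:
        next_frontier = []
        for nodes, bp in frontier:
            # Remove each node at or before the BreakPoint, right to left.
            for i in reversed(range(bp)):
                # The new BreakPoint is the node just before the removed one.
                next_frontier.append((nodes[:i] + nodes[i + 1:], i))
        if not next_frontier:
            break
        rounds.append([nodes for nodes, _ in next_frontier])
        frontier = next_frontier
    return rounds


# Path between all and E for the five-dimensional example.
rounds = generate_rounds("ABCD")
for number, link_nodes in enumerate(rounds, start=1):
    print(number, ["-".join(nodes) or "(empty)" for nodes in link_nodes])
```

Because a node can only be removed at or before the BreakPoint, every combination is reached along exactly one removal order, so no Link Node path is generated twice.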
Assuming the dimension is D, in the Nth round,
the number of Additional Nodes for each Link Node
is $C_{D-1}^{N}$. It would appear that the I/O should increase
when N is close to $\frac{D-1}{2}$. However, it does not, because
the Additional Nodes are produced by Link Nodes
instead of Data Nodes, and after several rounds of mergers,
the number of Link Nodes decreases significantly.
The total outputs from the Mapper after N rounds are
still smaller than the first-round output in our experiment.
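To see where the binomial term peaks, the following sketch tabulates $C_{D-1}^{N}$ for each round using Python's `math.comb`. The function name is hypothetical, and the values are counts of path kinds, not record counts from the experiment.

```python
from math import comb


def kinds_per_round(d: int) -> list[int]:
    """C(D-1, N) for rounds N = 1 .. D-1."""
    return [comb(d - 1, n) for n in range(1, d)]


for d in (5, 6):
    kinds = kinds_per_round(d)
    peak_round = kinds.index(max(kinds)) + 1  # first round reaching the maximum
    print(f"D={d}: {kinds}, peak at N={peak_round}, total={sum(kinds)}")
```

The term is largest in the middle rounds, which is why the merging of Link Nodes between rounds matters for keeping the actual I/O low.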
Fig.8 shows the different efficiencies obtained
by employing these two algorithms to construct the
XDCM using the six-dimensional XML data. The
improvement in performance is significant. For
the 1 GB six-dimensional XML file, the total
improvement is approximately 64%. For the 5 GB
six-dimensional XML file, the improvement is
approximately 27%. Fig.9 shows the performance of the
two algorithms using the 5 GB XML data from
CityBikes.
WEBIST 2014 - International Conference on Web Information Systems and Technologies