In general, our results improve upon prior systems
in all respects. The main advantages of our results are
better recall in the harder cases and no
over-segmentation of the different sets.
Table 2: Experimental Results for Record Retrieving.

URL                  Tree Size  Cnt  Frequent Similar Trees  DEPTA
                                     Corr     Time           Corr
www.amazon.com            3589   16    16     5.67           0
forums.gentoo.org         2833   25    25     4.24           18(4)
forums.sun.com            1729   15    15     0.97           15(3)
shoutwire.com             3543   20    20      4.6           20
messages.yahoo.com        2017   38    38     1.25           37
www.gateway.com           1461    6     6     1.14           0
shop.ebay.com             3664   50    50     4.53           40
www.google.com             848   11     8     0.17           0
www.abt.com               5408   40    40    10.62           40(9)
www.alibris.com           3225   25    25     3.73           25
bobsdiscountmarine        1318   16    16     1.69           0
www.cameraworld.          2115   25    25     1.47           25
www.compusa.com           2884   18    16     2.99           18(5)
www.cooking.com           2199   23    23     1.67           23
www.dealtime.com          1388   11    11     0.51           11(3)
www.drugstore.com         1572   42    42     0.67           42(14)
magazinesofamerica         759    6     6      2.2           6(2)
www.nextag.com            5351   30    30     8.72           30
nothingbutsoftware        3047   24    24     4.25           24(6)
www.refurbdepot.com       2890   15    15     4.81           10(5)
rochesterclothing.c       1820   16    16     0.98           16(4)
www.smartbargains.c       3095   24    24     3.06           0
www.tigerdirect.c         1527   20    20     0.93           20(5)

Sum / Average             2534  516   511     3.08           420
Recall                                99%                    81%
Only for successful                                          95%
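
The recall figures in Table 2 can be checked from the per-site counts. The following is a minimal sketch, not part of the original evaluation: the tuples are transcribed from the Cnt and Corr columns, the parenthesized numbers in the DEPTA column are ignored, and "Only for successful" is assumed to mean the sites where DEPTA returned at least one correct record. Under that assumption the three percentages in the table are reproduced.

```python
# (Cnt, Frequent Similar Trees Corr, DEPTA Corr) per site, in Table 2 row order.
ROWS = [
    (16, 16, 0), (25, 25, 18), (15, 15, 15), (20, 20, 20), (38, 38, 37),
    (6, 6, 0), (50, 50, 40), (11, 8, 0), (40, 40, 40), (25, 25, 25),
    (16, 16, 0), (25, 25, 25), (18, 16, 18), (23, 23, 23), (11, 11, 11),
    (42, 42, 42), (6, 6, 6), (30, 30, 30), (24, 24, 24), (15, 15, 10),
    (16, 16, 16), (24, 24, 0), (20, 20, 20),
]


def recall(correct, total):
    """Fraction of the records on the pages that were retrieved correctly."""
    return correct / total


total = sum(cnt for cnt, _, _ in ROWS)
fst_recall = recall(sum(fst for _, fst, _ in ROWS), total)
depta_recall = recall(sum(depta for _, _, depta in ROWS), total)

# Assumption: a site counts as "successful" for DEPTA if it found any record.
successful = [(cnt, depta) for cnt, _, depta in ROWS if depta > 0]
depta_successful_recall = recall(
    sum(depta for _, depta in successful),
    sum(cnt for cnt, _ in successful),
)

print(f"{fst_recall:.0%} {depta_recall:.0%} {depta_successful_recall:.0%}")
# prints: 99% 81% 95%
```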
5.3 Results for SiSTeR in Detecting
Similarity in Forums
This section evaluates SiSTeR, the variant of RTDM
that takes similar sibling subtrees into account when
computing the similarity between DOM trees. In
forum DOM trees in particular, the number of lines
or quotes varies from post to post, which makes
otherwise similar posts look very different. Posts
should be recognized as similar even when they differ
widely in the number of lines or in the number of
quotes from other posts. We tested our algorithm
using SiSTeR against the same algorithm using
RTDM, checking the similarity between different
posts in each of the following forums:
forums13.itrc.hp.com, forums.oracle.com,
hackquest.com, fdt.powerflasher.com/forum, and
forum.projecteuler.net. In all of these examples, using
our similarity measure as the distance metric
significantly reduced the number of clusters
discovered. For example, in the projecteuler.net
forum the RTDM-based algorithm output 10 different
clusters of posts and left another 20% of the posts in
no cluster, while the SiSTeR-based algorithm reduced
this to 4 clusters and 10% unclustered posts.
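
The intuition behind treating similar sibling subtrees as one can be illustrated with a toy example. The sketch below is not the SiSTeR algorithm: it uses exact equality in place of a similarity measure, and the tree encoding and the names node, collapse_siblings, and post are hypothetical. It only shows why two posts with different numbers of lines and quotes can be made to compare as the same structure.

```python
def node(tag, *children):
    """A tree is a (tag, children) pair; children is a tuple of trees."""
    return (tag, tuple(children))


def collapse_siblings(tree):
    """Merge each run of adjacent, identical sibling subtrees into one subtree.

    Assumption: exact equality stands in for a real similarity test.
    """
    tag, children = tree
    collapsed = []
    for child in children:
        c = collapse_siblings(child)
        if not collapsed or collapsed[-1] != c:
            collapsed.append(c)
    return (tag, tuple(collapsed))


def post(n_lines, n_quotes):
    """Hypothetical forum post: a header, some quotes, then some text lines."""
    quotes = [node("blockquote", node("p")) for _ in range(n_quotes)]
    lines = [node("p") for _ in range(n_lines)]
    return node("div", node("span"), *quotes, *lines)


short_post = post(n_lines=2, n_quotes=1)
long_post = post(n_lines=7, n_quotes=3)

print(short_post == long_post)
# prints: False
print(collapse_siblings(short_post) == collapse_siblings(long_post))
# prints: True
```

A measure that compares node by node charges the longer post for every extra line and quote, whereas one that is aware of repeated siblings does not, which is consistent with the smaller number of clusters reported above.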
REFERENCES
Abe, K., Kawasoe, S., Asai, T., Arimura, H., and Arikawa,
S. (2002). Optimized substructure discovery for semi-
structured data. In PKDD ’02, pages 1–14, London,
UK. Springer-Verlag.
Bille, P. (2005). A survey on tree edit distance and related
problems. Theor. Comput. Sci., 337(1-3):217–239.
Chi, Y., Muntz, R. R., Nijssen, S., and Kok, J. N. (2005).
Frequent subtree mining - an overview. Fundamenta
Informaticae, 66:161–198.
Koontz, W. L. G., Narendra, P. M., and Fukunaga, K.
(1976). A graph-theoretic approach to nonparamet-
ric cluster analysis. IEEE Trans. Comput., 25(9):936–
944.
Liu, B., Grossman, R., and Zhai, Y. (2003). Mining data
records in web pages. In KDD ’03, pages 601–606.
Lu, S. (1984). A tree-matching algorithm based on node
splitting and merging. IEEE Trans. Pattern Anal.
Mach. Intell., 6(2):249–256.
Luccio, F., Enriquez, A., Rieumont, P., and Pagli, L. (2004).
Bottom-up subtree isomorphism for unordered labeled
trees. Technical Report TR-04-13, Università
di Pisa.
Luccio, F., Enriquez, A. M., Rieumont, P. O., and Pagli, L.
(2001). Exact rooted subtree matching in sublinear
time. Technical Report TR-01-14, Università di Pisa.
Park, J. and Barbosa, D. (2007). Adaptive record extraction
from web pages. In WWW ’07, pages 1335–1336.
Reis, D. C., Golgher, P. B., Silva, A. S., and Laender, A.
(2004). Automatic web news extraction using tree edit
distance. In WWW ’04, pages 502–511.
Selkow, S. M. (1977). The tree-to-tree editing problem. Inf.
Process. Lett., 6(6):184–186.
Tai, K.-C. (1979). The tree-to-tree correction problem. J.
ACM, 26(3):422–433.
Zaki, M. J. (2002). Efficiently mining frequent trees in a
forest. In KDD ’02, pages 71–80.
Zaki, M. J. (2004). Efficiently mining frequent embedded
unordered trees. Fundam. Inf., 66(1-2):33–52.
Zhai, Y. and Liu, B. (2005). Web data extraction based on
partial tree alignment. In WWW ’05, pages 76–85.
Zhang, K. (1995). Algorithms for the constrained edit-
ing distance between ordered labeled trees and related
problems. Pattern Recognition, 28(3):463–474.
Zhang, K. (1996). A constrained edit distance between un-
ordered labeled trees. Algorithmica, 15(3):205–222.
Zhang, K. and Shasha, D. (1989). Simple fast algorithms for
the editing distance between trees and related prob-
lems. SIAM J. Comput., 18(6):1245–1262.
A NEW FREQUENT SIMILAR TREE ALGORITHM MOTIVATED BY DOM MINING - Using RTDM and its New
Variant — SiSTeR