A NEW FREQUENT SIMILAR TREE ALGORITHM MOTIVATED BY DOM MINING - Using RTDM and its New Variant — SiSTeR

Barkol Omer, Bergman Ruth, Golan Shahar

Abstract

The importance of recognizing repeating structures in web applications has generated a large body of work on algorithms for mining the HTML Document Object Model (DOM). A restricted tree edit distance metric, called the Restricted Top Down Metric (RTDM), is the most suitable for DOM mining and is less computationally expensive than the general tree edit distance. Given two trees of sizes n1 and n2, current methods take O(n1 · n2) time to compute RTDM. Consider, however, looking for patterns that form subtrees within a web page with n elements. The RTDM must be computed for all pairs of subtrees, and the running time becomes O(n^4). This paper proposes a new algorithm that computes the distance between all the subtrees in a tree in O(n^2) time, which enables us to obtain better quality as well as better performance on a DOM mining task. In addition, we propose a new tree edit distance, SiSTeR (Similar Sibling Trees aware RTDM). This variant of RTDM handles the case where repetitious (very similar) subtrees appear in different quantities in two trees that should nevertheless be considered similar.
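To make the complexity claims concrete, here is a minimal Python sketch of a Selkow-style top-down tree edit distance. It is a simplified stand-in for RTDM, not the authors' algorithm: the `Node` class, the unit relabel cost, and the subtree-size insert/delete cost are all assumptions made for illustration. Because results are memoized per pair of nodes, and the work for a pair (u, v) is proportional to the product of their child counts, filling the table for every pair of nodes in one tree with n nodes costs O(n^2) in total in this sketch.

```python
from dataclasses import dataclass, field
from functools import lru_cache


@dataclass(eq=False)  # eq=False keeps identity hashing, so nodes can be cache keys
class Node:
    """Hypothetical DOM-like node: a tag label plus ordered children."""
    label: str
    children: list = field(default_factory=list)


@lru_cache(maxsize=None)
def size(node):
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


@lru_cache(maxsize=None)
def distance(a, b):
    """Top-down distance between the subtrees rooted at a and b.

    Assumed costs: relabeling a root costs 1; inserting or deleting a
    whole child subtree costs its size; matched children recurse.
    """
    relabel = 0 if a.label == b.label else 1
    xs, ys = a.children, b.children
    # prev[j] = cost of aligning the first i children of a with the first j of b
    prev = [0]
    for y in ys:
        prev.append(prev[-1] + size(y))
    for x in xs:
        cur = [prev[0] + size(x)]
        for j, y in enumerate(ys, 1):
            cur.append(min(
                prev[j] + size(x),             # delete subtree x
                cur[j - 1] + size(y),          # insert subtree y
                prev[j - 1] + distance(x, y),  # match x with y
            ))
        prev = cur
    return relabel + prev[-1]


def item():
    """A small repeated record: <li><a/></li>."""
    return Node("li", [Node("a")])


three_items = Node("ul", [item(), item(), item()])
two_items = Node("ul", [item(), item()])
print(distance(three_items, two_items))
```

The two lists differ only in how many times the same record repeats, yet this plain top-down measure charges the full size of the extra record. Repeated sibling subtrees in differing quantities are the situation the abstract says SiSTeR is designed for; its definition is not given here, so it is not implemented in this sketch.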

Paper Citation


in Harvard Style

Barkol O., Bergman R. and Golan S. (2011). A NEW FREQUENT SIMILAR TREE ALGORITHM MOTIVATED BY DOM MINING - Using RTDM and its New Variant — SiSTeR. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011) ISBN 978-989-8425-79-9, pages 230-235. DOI: 10.5220/0003658102380243


in Bibtex Style

@conference{kdir11,
author={Barkol, Omer and Bergman, Ruth and Golan, Shahar},
title={A NEW FREQUENT SIMILAR TREE ALGORITHM MOTIVATED BY DOM MINING - Using RTDM and its New Variant — SiSTeR},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)},
year={2011},
pages={230-235},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003658102380243},
isbn={978-989-8425-79-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)
TI - A NEW FREQUENT SIMILAR TREE ALGORITHM MOTIVATED BY DOM MINING - Using RTDM and its New Variant — SiSTeR
SN - 978-989-8425-79-9
AU - Barkol, Omer
AU - Bergman, Ruth
AU - Golan, Shahar
PY - 2011
SP - 230
EP - 235
DO - 10.5220/0003658102380243