IDENTIFYING CLONES IN DYNAMIC WEB SITES USING SIMILARITY THRESHOLDS

Andrea De Lucia, Giuseppe Scanniello, Genoveffa Tortora

Abstract

We propose an approach to automatically detect duplicated pages in dynamic Web sites. The approach is based on the analysis of both the page structure, implemented by specific sequences of HTML tags, and the displayed content. In addition, for each pair of dynamic pages we also consider the similarity degree of their scripting code. The similarity degree of two pages is computed using a different similarity metric for each part of a Web page, all based on the Levenshtein string edit distance. We have implemented a prototype that automates the clone detection process for Web applications developed using JSP technology, and we used it to validate our approach in a case study.
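
As a rough illustration of the idea in the abstract, the sketch below computes the Levenshtein edit distance between two sequences of HTML tags and turns it into a similarity degree that is compared against a threshold. The abstract does not give the exact metrics, so the normalization by the longer sequence, the threshold value of 0.9, and the names `levenshtein`, `similarity`, and `are_clones` are all assumptions made for this example, not the authors' definitions.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Similarity degree in [0, 1]; this normalization is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sequences are treated as identical
    return 1.0 - levenshtein(a, b) / longest


def are_clones(a: Sequence[str], b: Sequence[str], threshold: float = 0.9) -> bool:
    """Hypothetical clone test: similarity at or above an assumed threshold."""
    return similarity(a, b) >= threshold


# Two page structures that differ only in the cell tag they use.
page_a = ["html", "body", "table", "tr", "td", "/td", "/tr", "/table", "/body", "/html"]
page_b = ["html", "body", "table", "tr", "th", "/th", "/tr", "/table", "/body", "/html"]

print(levenshtein(page_a, page_b))      # 2 substitutions
print(are_clones(page_a, page_b, 0.7))  # True under the looser threshold
```

The same functions could be applied separately to the tag sequence, the displayed text, and the scripting code of each page pair, with one threshold per part; how the paper combines the per-part results is not stated in the abstract.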


Paper Citation


in Harvard Style

De Lucia A., Scanniello G. and Tortora G. (2004). IDENTIFYING CLONES IN DYNAMIC WEB SITES USING SIMILARITY THRESHOLDS. In Proceedings of the Sixth International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 972-8865-00-7, pages 391-396. DOI: 10.5220/0002597303910396


in Bibtex Style

@conference{iceis04,
author={Andrea De Lucia and Giuseppe Scanniello and Genoveffa Tortora},
title={IDENTIFYING CLONES IN DYNAMIC WEB SITES USING SIMILARITY THRESHOLDS},
booktitle={Proceedings of the Sixth International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2004},
pages={391-396},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002597303910396},
isbn={972-8865-00-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Sixth International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - IDENTIFYING CLONES IN DYNAMIC WEB SITES USING SIMILARITY THRESHOLDS
SN - 972-8865-00-7
AU - De Lucia A.
AU - Scanniello G.
AU - Tortora G.
PY - 2004
SP - 391
EP - 396
DO - 10.5220/0002597303910396