# ON STOCHASTIC TREE DISTANCES AND THEIR TRAINING VIA EXPECTATION-MAXIMISATION

### Martin Emms

#### Abstract

Continuing a line of work initiated in (Boyer et al., 2007), the generalisation of stochastic string distance to a stochastic tree distance is considered. We point out some hitherto overlooked modifications to the Zhang/Shasha tree-distance algorithm that are necessary for the all-paths and Viterbi variants of this stochastic tree distance. A strategy towards an EM cost-adaptation algorithm for the all-paths distance, suggested by (Boyer et al., 2007), is shown to overlook necessary ancestry-preservation constraints, and an alternative EM cost-adaptation algorithm for the Viterbi variant is proposed. Experiments are reported in which a distance-weighted kNN categorisation algorithm is applied to a corpus of categorised tree structures. We show that a 67.7% baseline using standard unit costs can be improved to 72.5% by the EM cost-adaptation algorithm.
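As an illustration of the categorisation step mentioned in the abstract, the following is a minimal sketch of a distance-weighted kNN vote. It assumes the linear weighting of Dudani (1976); the abstract does not state which weighting the experiments use, so this choice, along with the names `dudani_weights` and `weighted_knn`, is hypothetical. The `distance` argument is any callable on two items, so a tree distance could be plugged in where the usage example uses a toy numeric distance.

```python
from collections import defaultdict


def dudani_weights(dists):
    """Weights for neighbours listed in order of increasing distance.

    Assumed scheme (Dudani, 1976): the nearest neighbour gets weight 1,
    the k-th gets weight 0, and the rest are interpolated linearly.
    """
    d1, dk = dists[0], dists[-1]
    if dk == d1:
        # All neighbours are equally distant: fall back to equal votes.
        return [1.0] * len(dists)
    return [(dk - d) / (dk - d1) for d in dists]


def weighted_knn(query, examples, distance, k):
    """Return the label with the largest summed neighbour weight.

    examples: non-empty list of (item, label) pairs.
    distance: callable taking (query, item) and returning a number.
    """
    if not examples:
        raise ValueError("examples must be non-empty")
    scored = sorted(
        ((distance(query, item), label) for item, label in examples),
        key=lambda pair: pair[0],
    )[:k]
    weights = dudani_weights([d for d, _ in scored])
    votes = defaultdict(float)
    for (_, label), w in zip(scored, weights):
        votes[label] += w
    return max(votes, key=votes.get)


# Toy usage with numbers standing in for trees.
examples = [(0.0, "a"), (1.0, "a"), (5.0, "b"), (6.0, "b"), (7.0, "b")]
label = weighted_knn(0.5, examples, lambda x, y: abs(x - y), k=5)
```

In this toy case three of the five neighbours carry label `b`, but the two close `a` neighbours outweigh them, which is the behaviour that separates a distance-weighted vote from a plain majority vote.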

#### References

- Benedí, J.-M. and Sánchez, J.-A. (2005). Estimation of stochastic context-free grammars and their use as language models. Computer Speech and Language, 19(3):249-274.
- Bernard, M., Boyer, L., Habrard, A., and Sebban, M. (2008). Learning probabilistic models of tree edit distance. Pattern Recogn., 41(8):2611-2629.
- Bilenko, M. and Mooney, R. J. (2003). Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), pages 39-48.
- Boyer, L., Habrard, A., and Sebban, M. (2007). Learning metrics between tree structured data: Application to image recognition. In Proceedings of the 18th European Conference on Machine Learning (ECML 2007), pages 54-66.
- CCG (2001). Corpus of classified questions by Cognitive Computation Group, University of Illinois. l2r.cs.uiuc.edu/~cogcomp/Data/QA/QC.
- Dalvi, N., Bohannon, P., and Sha, F. (2009). Robust web extraction: an approach based on a probabilistic tree-edit model. In SIGMOD 09: Proceedings of the 35th SIGMOD international conference on Management of data, pages 335-348, New York, NY, USA. ACM.
- Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society, B 39:1-38.
- Dudani, S. (1976). The distance-weighted k-nearest neighbor rule. IEEE Transactions on Systems, Man and Cybernetics, SMC-6:325-327.
- Emms, M. (2011). Tree-distance code and datasets reported on in experiments. www.scss.tcd.ie/Martin.Emms/TreeDist.
- Emms, M. and Franco-Penya, H.-H. (2011). On order equivalences between distance and similarity measures on sequences and trees. In Proceedings of ICPRAM 2012 International Conference on Pattern Recognition Application and Methods.
- Judge, J. (2006a). Corpus of syntactically annotated questions http://www.computing.dcu.ie/jjudge/qtreebank/.
- Judge, J. (2006b). Adapting and Developing Linguistic Resources for Question Answering. PhD thesis, Dublin City University.
- Judge, J., Cahill, A., and van Genabith, J. (2006). Questionbank: creating a corpus of parse-annotated questions. In ACL 06: Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the ACL, pages 497-504, Morristown, NJ, USA. Association for Computational Linguistics.
- Kuboyama, T. (2007). Matching and Learning in Trees. PhD thesis, Graduate School of Engineering, University of Tokyo.
- Ristad, E. S. and Yianilos, P. N. (1998). Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):522-532.
- Tai, K.-C. (1979). The tree-to-tree correction problem. Journal of the ACM (JACM), 26(3):433.
- Takasu, A., Fukagawa, D., and Akutsu, T. (2007). Statistical learning algorithm for tree similarity. In ICDM 07: Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, pages 667-672, Washington, DC, USA. IEEE Computer Society.
- Wagner, R. A. and Fischer, M. J. (1974). The string-to-string correction problem. Journal of the Association for Computing Machinery, 21(1):168-173.
- Zhang, K. and Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18:1245-1262.

#### Paper Citation

#### in Harvard Style

Emms M. (2012). **ON STOCHASTIC TREE DISTANCES AND THEIR TRAINING VIA EXPECTATION-MAXIMISATION**. In *Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM*, ISBN 978-989-8425-98-0, pages 144-153. DOI: 10.5220/0003864901440153

#### in Bibtex Style

@conference{icpram12,

author={Martin Emms},

title={ON STOCHASTIC TREE DISTANCES AND THEIR TRAINING VIA EXPECTATION-MAXIMISATION},

booktitle={Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},

year={2012},

pages={144-153},

publisher={SciTePress},

organization={INSTICC},

doi={10.5220/0003864901440153},

isbn={978-989-8425-98-0},

}

#### in EndNote Style

TY - CONF

JO - Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM

TI - ON STOCHASTIC TREE DISTANCES AND THEIR TRAINING VIA EXPECTATION-MAXIMISATION

SN - 978-989-8425-98-0

AU - Emms M.

PY - 2012

SP - 144

EP - 153

DO - 10.5220/0003864901440153