between the tree-pairs. A direction for further work
is the investigation of such a model of aligned trees,
and how it relates to some other recent proposals
concerning adaptive tree measures, such as (Takasu et
al., 2007) and (Dalvi et al., 2009).
ACKNOWLEDGEMENTS
This research is supported by the Science Foundation
Ireland (Grant 07/CE/I1142) as part of the Centre for
Next Generation Localisation (www.cngl.ie) at Trinity
College Dublin.
REFERENCES
Benedí, J.-M. and Sánchez, J.-A. (2005). Estimation of
stochastic context-free grammars and their use as language
models. Computer Speech and Language,
19(3):249–274.
Bernard, M., Boyer, L., Habrard, A., and Sebban, M.
(2008). Learning probabilistic models of tree edit distance.
Pattern Recognition, 41(8):2611–2629.
Bilenko, M. and Mooney, R. J. (2003). Adaptive duplicate
detection using learnable string similarity measures.
In Proceedings of the Ninth ACM SIGKDD International
Conference on Knowledge Discovery and Data
Mining (KDD-2003), pages 39–48.
Boyer, L., Habrard, A., and Sebban, M. (2007). Learning
metrics between tree structured data: Application to
image recognition. In Proceedings of the 18th European
Conference on Machine Learning (ECML 2007),
pages 54–66.
CCG (2001). Corpus of classified questions by Cognitive
Computation Group, University of Illinois.
l2r.cs.uiuc.edu/~cogcomp/Data/QA/QC.
Dalvi, N., Bohannon, P., and Sha, F. (2009). Robust web
extraction: an approach based on a probabilistic tree-edit
model. In SIGMOD '09: Proceedings of the 35th
SIGMOD international conference on Management of
data, pages 335–348, New York, NY, USA. ACM.
Dempster, A., Laird, N., and Rubin, D. (1977). Maximum
likelihood from incomplete data via the EM algorithm.
J. Royal Statistical Society, B 39:1–38.
Dudani, S. (1976). The distance-weighted k-nearest neighbor
rule. IEEE Transactions on Systems, Man and Cybernetics,
SMC-6:325–327.
Emms, M. (2011). Tree-distance code and datasets reported
on in experiments. www.scss.tcd.ie/Martin.Emms/TreeDist.
Emms, M. and Franco-Penya, H.-H. (2011). On order
equivalences between distance and similarity measures
on sequences and trees. In Proceedings of
ICPRAM 2012 International Conference on Pattern
Recognition Applications and Methods.
Judge, J. (2006a). Corpus of syntactically annotated questions.
http://www.computing.dcu.ie/jjudge/qtreebank/.
Judge, J. (2006b). Adapting and Developing Linguistic Resources
for Question Answering. PhD thesis, Dublin
City University.
Judge, J., Cahill, A., and van Genabith, J. (2006).
Questionbank: creating a corpus of parse-annotated questions.
In ACL '06: Proceedings of the 21st International
Conference on Computational Linguistics and
the 44th annual meeting of the ACL, pages 497–504,
Morristown, NJ, USA. Association for Computational
Linguistics.
Kuboyama, T. (2007). Matching and Learning in Trees.
PhD thesis, Graduate School of Engineering, University
of Tokyo.
Ristad, E. S. and Yianilos, P. N. (1998). Learning string edit
distance. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 20(5):522–532.
Tai, K.-C. (1979). The tree-to-tree correction problem.
Journal of the ACM (JACM), 26(3):422–433.
Takasu, A., Fukagawa, D., and Akutsu, T. (2007). Statistical
learning algorithm for tree similarity. In ICDM '07:
Proceedings of the 2007 Seventh IEEE International
Conference on Data Mining, pages 667–672, Washington,
DC, USA. IEEE Computer Society.
Wagner, R. A. and Fischer, M. J. (1974). The string-to-string
correction problem. Journal of the Association for
Computing Machinery, 21(1):168–173.
Zhang, K. and Shasha, D. (1989). Simple fast algorithms for
the editing distance between trees and related problems.
SIAM Journal on Computing, 18:1245–1262.
APPENDIX
Proof Concerning Out of Table Costs
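Let $C^{\Delta}$ be a cost-table associated with a given label
alphabet $\Sigma$, let $T$ be a tree with $n$ symbols $\notin \Sigma$, and let
$k$ be a fixed, out-of-table cost for any $(x,u)$, where
$x \in \Sigma \cup \{\lambda\}$, $u \notin \Sigma$. Suppose $S$ is a tree whose labels are
in $\Sigma$. Every script $s \in E(S,T)$ involves exactly $n$ out-of-table events,
because each of the $n$ out-of-table nodes of $T$ is either inserted or
substituted for a node of $S$. Hence, for any other setting $k'$ of the
out-of-table cost, $cost_{k'}(s) = cost_{k}(s) + n \times (k' - k)$ for every
script $s$.
Suppose $V$ is the least-cost script under $k$, with cost $cost_{k}(V)$.
Now suppose that under a higher setting $k'$ for out-of-table
costs, some $V' \neq V$ is the least-cost script, so
$cost_{k'}(V') < cost_{k'}(V)$. But recosting according to $k$ subtracts the
same amount $n \times (k' - k)$ from both sides and
gives $cost_{k}(V') < cost_{k}(V)$, which contradicts the minimality
of $V$ under $k$. So the minimal script is invariant
to changes of $k$, and
$\Delta^{v}_{k'}(S,T) - \Delta^{v}_{k}(S,T) = n \times (k' - k)$.
Since this shift is the same for every such $S$, it follows that neighbour
ordering is invariant to changes of $k$.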
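
The invariance can be checked numerically with a small sketch. To stay short it uses plain sequence edit distance as a stand-in for tree distance, which is a simplification and not the tree-distance code used in the experiments. The function `edit_distance`, the toy alphabet, the cost values and the example sequences are all illustrative assumptions. As in the proof, the training items use only in-table symbols, while the test item contains n out-of-table symbols.

```python
def edit_distance(s, t, table, k, alphabet):
    """Least cost of an edit script turning sequence s into sequence t.

    table maps (x, y) pairs to costs, with "" standing for the empty symbol.
    Any event whose target symbol lies outside `alphabet` costs k instead.
    """
    def cost(x, y):
        if y != "" and y not in alphabet:
            return k                      # out-of-table event
        return table[(x, y)]

    rows, cols = len(s) + 1, len(t) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + cost(s[i - 1], "")          # deletions only
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + cost("", t[j - 1])          # insertions only
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + cost(s[i - 1], ""),           # delete s[i-1]
                d[i][j - 1] + cost("", t[j - 1]),           # insert t[j-1]
                d[i - 1][j - 1] + cost(s[i - 1], t[j - 1]), # match or swap
            )
    return d[-1][-1]


# Toy cost table over a two-symbol alphabet (illustrative values only).
alphabet = {"a", "b"}
table = {}
for x in alphabet:
    table[(x, "")] = 1.0
    table[("", x)] = 1.0
    for y in alphabet:
        table[(x, y)] = 0.0 if x == y else 1.5

train = ["ab", "abba", "b"]   # labels all inside the alphabet
test = "azbz"                 # "z" is out of table
n = sum(1 for u in test if u not in alphabet)


def ranking(k):
    """Training items ordered by distance to the test item under cost k."""
    return sorted(train, key=lambda s: edit_distance(s, test, table, k, alphabet))


for s in train:
    low = edit_distance(s, test, table, 2.0, alphabet)
    high = edit_distance(s, test, table, 5.0, alphabet)
    print(s, low, high, high - low)
```

Under these assumptions every distance should shift by n × (k′ − k) when the out-of-table cost is raised from k to k′, so `ranking(2.0)` and `ranking(5.0)` list the training items in the same order.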