implemented successfully in the Chinese language
as illustrated in our evaluation, the idea delineated
is certainly neither limited to nor tailor-made for any
particular language. Only a minor modification is
needed to apply the technique to other languages.
5 CONCLUSIONS
In this paper, we have illustrated a shallow technique
in which semantic labels are extracted in the form of
chunks of phrases or words using a two-phase
feature-enhanced string matching algorithm. The
first phase shortlists the potential trees in the
Treebank, and chunks are then tagged with semantic
labels in the second phase. Based on the linguist’s
conception of phrase structure, our approach does
not require a full syntactic parse to pursue semantic
analysis, and recursively embedded phrases can
also be identified without difficulty. This shallow
technique is inspired by research in the area of
bio-molecular sequence analysis, which holds that
high sequence similarity usually implies significant
functional or structural similarity. It is characteristic
of biological systems that objects have a certain
form that has arisen by evolution from related
objects of similar but not identical form. This
sequence-to-structure mapping is a tractable, though
partly heuristic, way to search for functional or
structural universality in biological systems.
Supported by the results shown, we conjecture that
this sequence-to-structure phenomenon also appears in
natural language sentences. The sentence sequence encodes and
reflects the more complex linguistic structures and
mechanisms described by linguists. While our
system does not claim to deal with all aspects of
language, we suggest an alternative, but plausible, way
to handle real corpora.
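
To make the shortlisting phase concrete, the following is a minimal sketch rather than the system evaluated in this paper. It assumes that each Treebank entry has been reduced to a sequence of part-of-speech tags and ranks entries by plain unit-cost edit distance, computed with the classic Wagner-Fischer dynamic programme; the feature-enhanced costs of our algorithm are not reproduced. The names edit_distance, shortlist and TOY_TREEBANK, and the tag sequences themselves, are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Unit-cost edit distance between two symbol sequences (Wagner-Fischer)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty sequence
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def shortlist(query: Sequence[str],
              treebank: List[Tuple[str, List[str]]],
              k: int = 2) -> List[str]:
    """Return the ids of the k entries whose tag sequences are closest to query."""
    ranked = sorted(treebank, key=lambda entry: edit_distance(query, entry[1]))
    return [tree_id for tree_id, _ in ranked[:k]]


# Hypothetical toy data: (tree id, part-of-speech tag sequence).
TOY_TREEBANK = [
    ("tree_a", ["N", "V", "N"]),
    ("tree_b", ["N", "ADV", "V", "N"]),
    ("tree_c", ["P", "N", "V"]),
    ("tree_d", ["V", "N", "N", "P", "N"]),
]

if __name__ == "__main__":
    print(shortlist(["N", "ADV", "V", "N"], TOY_TREEBANK))
```

Run as a script, the sketch prints ['tree_b', 'tree_a']: the exact match comes first, followed by the entry that is one insertion away. In the second phase, only the shortlisted trees would need to be examined when assigning semantic labels to chunks.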
ACKNOWLEDGEMENTS
The work described in this paper was partially
supported by grants from the Research Grants
Council of the Hong Kong Special Administrative
Region, China (Project Nos. CUHK4438/04H and
CUHK4706/05H).