4 CONCLUSIONS
The experiments above show that performance on old
term recognition is quite consistent across the ten
subsets of the testing corpora. For novel terms,
however, performance depends much more strongly on
sampling: different samples yield quite different recalls.
A likely reason is that novel terms are sparsely
scattered in the corpora. If a novel term never
appears in the training corpora, the CRF model has
no chance to learn its features and label it
accordingly. In that case, it is not tagged as a true
term in the testing corpora and is therefore not
retrieved.
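The separate scoring of old and novel terms can be sketched as follows. This is a minimal illustration, assuming that a term counts as "old" when it occurs in the training corpora and as "novel" otherwise, and that precision and recall are computed over sets of distinct term strings. The function names and the toy term sets are hypothetical and are not the evaluation code or data used in the experiments.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two sets of term strings."""
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


def evaluate_old_vs_novel(extracted, gold, training_terms):
    """Score extracted terms separately for old and novel terms.

    Assumption: a term is "old" if it occurs in the training corpora,
    and "novel" otherwise.
    """
    extracted, gold, seen = set(extracted), set(gold), set(training_terms)
    return {
        "old": precision_recall(extracted & seen, gold & seen),
        "novel": precision_recall(extracted - seen, gold - seen),
    }


# Toy data for illustration only (not taken from the experiments).
training_terms = {"insulin", "glucose"}
gold_terms = {"insulin", "glucose", "leptin", "ghrelin"}
extracted_terms = {"insulin", "glucose", "leptin"}

scores = evaluate_old_vs_novel(extracted_terms, gold_terms, training_terms)
print(scores)
```

In this toy case the single novel term that is extracted is correct, but one of the two novel gold terms is missed, so novel-term precision stays high while recall drops to one half.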
In summary, this research studies the performance
of the CRF framework on term extraction using two
kinds of unique syntactic information: syntactic
paths and term ratios. We conclude that syntactic
functions and syntactic paths can serve as effective
features under the CRF framework. The system
performs quite well on old term extraction. For novel
term extraction, the precisions are also promising,
though the recalls are quite low compared with old
term extraction. This limitation indicates that more
distinguishing features, such as semantic features of
potential novel terms, are needed to improve
performance on novel term extraction. This work
should also help other machine-learning-based term
extraction systems exploit effective syntactic
features.
ACKNOWLEDGEMENTS
The research described in this article was supported
in part by research grants from City University of
Hong Kong (Project Nos. 7002387, 7008002, and
9610126).