the automatic extraction of further features. In addition,
we are working on including constraints not only in the
learning phase of a Conditional Random Field, but also
in the inference step. We believe that this could further
improve labeling results.
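To illustrate what constraints in the inference step could look like, the following is a minimal sketch of Viterbi decoding for a linear-chain model in which each position may be restricted to a set of permitted labels. It is not the method proposed here: the function name `constrained_viterbi`, the `allowed` argument, the label indices and the toy scores are all assumptions made for illustration, and hard per-position restrictions are only one of several possible constraint types.

```python
NEG_INF = float("-inf")


def constrained_viterbi(emission, transition, allowed):
    """Return the best label sequence that respects per-position constraints.

    emission[t][y]   : log-score of label y at position t (hypothetical input)
    transition[p][y] : log-score of moving from label p to label y
    allowed[t]       : set of label indices permitted at position t
    """
    n_steps, n_labels = len(emission), len(emission[0])
    score = [[NEG_INF] * n_labels for _ in range(n_steps)]
    back = [[0] * n_labels for _ in range(n_steps)]

    # Forbidden labels keep a score of -inf, so no path can pass through them.
    for y in allowed[0]:
        score[0][y] = emission[0][y]

    for t in range(1, n_steps):
        for y in allowed[t]:
            best_prev = max(
                range(n_labels),
                key=lambda p: score[t - 1][p] + transition[p][y],
            )
            score[t][y] = (
                score[t - 1][best_prev] + transition[best_prev][y] + emission[t][y]
            )
            back[t][y] = best_prev

    last = max(range(n_labels), key=lambda y: score[-1][y])
    if score[-1][last] == NEG_INF:
        raise ValueError("no label sequence satisfies the constraints")

    # Follow the back-pointers from the last position to the first.
    path = [last]
    for t in range(n_steps - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy example with assumed labels: 0 = AUTHOR, 1 = YEAR, 2 = TITLE.
emission = [[0.0, -5.0, -5.0], [-1.0, -2.0, -3.0], [-5.0, -5.0, 0.0]]
transition = [[0.0] * 3 for _ in range(3)]
everything = {0, 1, 2}

unconstrained = constrained_viterbi(emission, transition, [everything] * 3)
# Suppose a constraint states that the second token must be a YEAR.
constrained = constrained_viterbi(emission, transition, [everything, {1}, everything])
print(unconstrained, constrained)  # [0, 0, 2] [0, 1, 2]
```

In this sketch the constraint only prunes the search space at decoding time; the model scores themselves are unchanged, which is what distinguishes it from adding constraints during learning.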
KDIR 2013 - International Conference on Knowledge Discovery and Information Retrieval