Table 3: Results of postprocessing.

No.  Journal             Start               End                 Months between
                         month      vol.     month      vol.     consecutive volumes
1.   Phys. Rev. B        Jan. 1970  1        Jan. 2009  79       6
2.   J. Am. Chem. Soc.   Jan. 1900  22       Jan. 2008  130      12
3.   Phys. Rev. Lett.    Jul. 1958  1        Jul. 2008  101      6
4.   J. Appl. Phys.      Jan. 1941  12       Jan. 1984  55       12
                         Jul. 1984  56       Jan. 2006  99       6
Rev.’.
Because reference strings usually contain only a publishing
year and no month, this method can detect changes in the
release schedule only to within one year. Likewise, we are
unable to detect release schedules that last less than 3 years,
because we used this resolution for our line segment
detection. Note also that the procedure requires a large data
set in which each journal title appears several times.
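
The last column of Table 3 follows from the end points of each detected line segment, and the same arithmetic shows why a year-only date is ambiguous. The following Python sketch illustrates this under stated assumptions: the names `Segment`, `months_per_volume` and `volumes_in_year` are hypothetical, and each end point is assumed to mark the first month of the given volume. It is not the implementation used for the experiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One line segment from Table 3 (hypothetical container)."""
    journal: str
    start_year: int
    start_month: int  # 1 = January
    start_vol: int
    end_year: int
    end_month: int
    end_vol: int


def months_per_volume(seg: Segment) -> float:
    """Months between consecutive volumes along the segment."""
    span = (seg.end_year - seg.start_year) * 12 + (seg.end_month - seg.start_month)
    return span / (seg.end_vol - seg.start_vol)


def volumes_in_year(seg: Segment, year: int) -> list[int]:
    """Volumes whose first month falls in `year` (assumed schedule)."""
    step = months_per_volume(seg)
    volumes = []
    for vol in range(seg.start_vol, seg.end_vol + 1):
        # Months elapsed since January of the segment's start year.
        elapsed = (seg.start_month - 1) + (vol - seg.start_vol) * step
        if seg.start_year + int(elapsed // 12) == year:
            volumes.append(vol)
    return volumes


# Segment end points copied from Table 3.
TABLE_3 = [
    Segment("Phys. Rev. B", 1970, 1, 1, 2009, 1, 79),
    Segment("J. Am. Chem. Soc.", 1900, 1, 22, 2008, 1, 130),
    Segment("Phys. Rev. Lett.", 1958, 7, 1, 2008, 7, 101),
    Segment("J. Appl. Phys.", 1941, 1, 12, 1984, 1, 55),
    Segment("J. Appl. Phys.", 1984, 7, 56, 2006, 1, 99),
]

if __name__ == "__main__":
    for seg in TABLE_3:
        print(seg.journal, months_per_volume(seg))
    # A reference dated only "2000" may belong to either volume.
    print(volumes_in_year(TABLE_3[2], 2000))
```

With a six-month schedule, two volumes start in the same calendar year, so a reference that gives only the year cannot be placed on the line more precisely than that.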
4 CONCLUSIONS AND FUTURE WORK
We proposed a new method of parsing references
with constraints that can easily be adapted to new
domains of data. The labeling results show that the
proposed method's performance is comparable to, or
even better than, that of other state-of-the-art
semi-supervised machine learning algorithms. We then
demonstrated how bibliographic references can be
clustered using the novel approach of analyzing the
correlation between journal title, publishing month
and release time span.
We want to concentrate future effort on the automatic
extraction of features and on using constraints in
the inference process in addition to the learning phase.
Although we used a method for the automatic extraction
of keywords for learning, we would also like to
integrate data from other web knowledge bases. We
would further like to investigate the possibility of
automatically categorizing citation data and then using
the optimal corresponding CRF for its labeling.
Additionally, we are going to improve our string-based
clustering methods, since typical Levenshtein-distance-based
metrics do not work well with abbreviations.
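
The problem with abbreviations can be seen in a small example. The sketch below is a standard dynamic-programming implementation of the Levenshtein distance, written here for illustration only; the function name `levenshtein` and the chosen journal strings are assumptions and are not taken from the clustering code.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    abbreviated = "Phys. Rev. Lett."
    # Same journal, written out in full.
    print(levenshtein(abbreviated, "Physical Review Letters"))  # 10
    # A different journal with a similar abbreviation.
    print(levenshtein(abbreviated, "Phys. Rev. B"))             # 5
```

The abbreviated title is further from its own full form than from the abbreviation of a different journal, so a plain edit-distance threshold would merge the wrong pair.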
KDIR 2012 - International Conference on Knowledge Discovery and Information Retrieval