that, we want to discourage mechanical use of the gold standard evaluation methodology, to start a discussion on evaluation methodology in NLP, and to encourage a shift towards evaluations driven by particular applications. At present no such discussion is taking place, and the gold standard methodology is usually accepted as dogma.
ACKNOWLEDGEMENTS
This work has been partly supported by the Grant Agency of the Czech Republic within project 15-13277S. The research leading to these results has received funding from the Norwegian Financial Mechanism 2009–2014 and the Ministry of Education, Youth and Sports under Project Contract no. MSMT-28477/2014 within the HaBiT Project 7F14047.