dependency parser (Nivre et al., 2007). In Table 1,
we show the results from the English analysis and in
Table 2, the results from the Swedish analysis.
Table 1: English Wikipedia statistics, gathered from the
semantic and coreference analyses.
Corpus size          7.6 GB
Articles             4,012,291
Sentences            61,265,766
Tokens               1,485,951,256
Predicates           272,403,215
Named entities       148,888,487
Coreference chains   236,958,381
Processing time      462 hours
Table 2: Swedish Wikipedia statistics, gathered from the
syntactic analysis.
Corpus size          1 GB
Articles             976,008
Sentences            6,752,311
Tokens               142,495,587
Processing time      14 minutes
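As a sanity check on the figures in Tables 1 and 2, the following minimal sketch converts the reported processing times into days and derives average throughput. The derived rates are not reported in the paper; they assume that the processing times are wall-clock times for the complete run, and the names ENGLISH, SWEDISH, days, tokens_per_second, and tokens_per_sentence are illustrative only.

```python
# Figures copied from Tables 1 and 2. Everything derived from them is an
# illustration under the stated assumption (wall-clock time for the whole
# run), not a result reported by the authors.
ENGLISH = {"tokens": 1_485_951_256, "sentences": 61_265_766, "hours": 462.0}
SWEDISH = {"tokens": 142_495_587, "sentences": 6_752_311, "hours": 14 / 60}


def days(stats):
    """Processing time expressed in days."""
    return stats["hours"] / 24


def tokens_per_second(stats):
    """Average number of tokens processed per second of processing time."""
    return stats["tokens"] / (stats["hours"] * 3600)


def tokens_per_sentence(stats):
    """Average sentence length in tokens."""
    return stats["tokens"] / stats["sentences"]


if __name__ == "__main__":
    for name, stats in (("English", ENGLISH), ("Swedish", SWEDISH)):
        print(
            f"{name}: {days(stats):.2f} days, "
            f"{tokens_per_second(stats):,.0f} tokens/s, "
            f"{tokens_per_sentence(stats):.1f} tokens/sentence"
        )
```

The 462 hours of the English analysis correspond to 19.25 days. The two throughput values are not directly comparable, since the English pipeline includes semantic and coreference analysis while the Swedish pipeline is syntactic only.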
8 CONCLUSIONS
In this paper, we have described a framework,
KOSHIK, for end-to-end parsing and querying of
documents containing unstructured natural language.
Koshik uses an annotation model that supports a large
set of NLP tools including prefilters, tokenizers, named
entity taggers, syntactic and semantic dependency
parsers, and coreference solvers. Using the framework,
we completed the semantic parsing of the English
edition of Wikipedia in less than 20 days and the
syntactic parsing of the Swedish one in less than 15
minutes. The source code for Koshik is available for
download at https://github.com/peterexner/KOSHIK/.
9 FUTURE WORK
While many high-precision NLP tools exist for the
analysis of English, resources for creating tools for
other languages are scarcer. Our aim is to use
Koshik to create parallel corpora in English and
Swedish. By annotating the English corpus semantically
and the Swedish corpus syntactically, we hope
to find syntactic-level features that may aid us in
training a Swedish semantic parser. We will also continue
to expand the number and variety of tools and the
language models offered by Koshik.
ACKNOWLEDGEMENTS
This research was supported by Vetenskapsrådet, the
Swedish research council, under grant 621-2010-4800
and has received funding from the European
Union’s seventh framework program (FP7/2007-2013)
under grant agreement 230902.
REFERENCES
Björkelund, A., Hafdell, L., and Nugues, P. (2009).
Multilingual semantic role labeling. In Proceedings of
CoNLL-2009, pages 43–48, Boulder.
Bohnet, B. (2010). Very high accuracy and fast dependency
parsing is not a contradiction. In Proceedings of the
23rd International Conference on Computational
Linguistics, pages 89–97. Association for Computational
Linguistics.
Bontcheva, K., Tablan, V., Maynard, D., and Cunningham,
H. (2004). Evolving GATE to meet new challenges in
language engineering. Natural Language Engineering,
10(3-4):349–373.
Buchholz, S. and Marsi, E. (2006). CoNLL-X shared task
on multilingual dependency parsing. In Proceedings
of the Tenth Conference on Computational Natural
Language Learning, pages 149–164. Association for
Computational Linguistics.
Dean, J. and Ghemawat, S. (2008). MapReduce: simplified
data processing on large clusters. Communications of
the ACM, 51(1):107–113.
Ferrucci, D. and Lally, A. (2004). UIMA: an architectural
approach to unstructured information processing
in the corporate research environment. Natural
Language Engineering, 10(3-4):327–348.
Ferrucci, D. A. (2012). Introduction to “This is
Watson”. IBM Journal of Research and Development,
56(3.4):1:1–1:15.
Finkel, J. R., Grenager, T., and Manning, C. (2005).
Incorporating non-local information into information
extraction systems by Gibbs sampling. In Proceedings of
the 43rd Annual Meeting on Association for
Computational Linguistics, pages 363–370. Association for
Computational Linguistics.
Grishman, R., Caid, B., Callan, J., Conley, J., Corbin, H.,
Cowie, J., DiBella, K., Jacobs, P., Mettler, M.,
Ogden, B., et al. (1997). TIPSTER Text Phase II architecture
design version 2.1 p 19 June 1996.
Gurevych, I. and Müller, M.-C. (2008). Information
extraction with the Darmstadt Knowledge Processing
Software Repository (extended abstract). In Proceedings
of the Workshop on Linguistic Processing Pipelines,
Darmstadt, Germany. No printed proceedings available.
Ide, N. and Véronis, J. (1994). Multext: Multilingual text
tools and corpora. In Proceedings of the 15th
conference on Computational linguistics-Volume 1, pages
588–592. Association for Computational Linguistics.