relevance of co-occurrences, and showed a small yet
consistent improvement over the original definition
of PMI².
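For reference, the standard PMI (Fano, 1961; Church and Hanks, 1990) and its normalized variant NPMI (Bouma, 2009) can be sketched from joint and marginal probabilities; this is a minimal illustration of the baseline measures, not of CSW itself, and the function names are ours:

```python
import math

def pmi(p_ab, p_a, p_b):
    """Pointwise mutual information (Fano, 1961; Church and Hanks, 1990)."""
    return math.log2(p_ab / (p_a * p_b))

def npmi(p_ab, p_a, p_b):
    """Normalized PMI (Bouma, 2009); bounded in [-1, 1]."""
    return pmi(p_ab, p_a, p_b) / -math.log2(p_ab)

# Independent words score 0; perfectly associated words reach NPMI = 1.
print(pmi(0.06, 0.2, 0.3))    # p(a,b) = p(a)p(b), so PMI is 0.0
print(npmi(0.1, 0.1, 0.1))    # a and b always co-occur, so NPMI is 1.0
```

The normalization bounds the score, which makes collocation rankings comparable across word pairs with very different frequencies.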
Although the results gained with CSW may seem very
corpus-specific, other similar datasets are likely to
benefit from it as well. Possible examples include
other historical corpora, discussion forum data
(which consists partly of quotes of previous
messages), and movie subtitle collections. In general,
the advantage of CSW is that it is more resistant to
duplicate or near-duplicate entries when the corpus
is poorly pre-processed.
We only discussed collocation analysis in this
paper, but an obvious path for future investigation
would be to apply CSW to word embeddings. Our
preliminary experiments indicate that cleaning word
vector representations with CSW does improve
results in word similarity tasks, but a more
comprehensive evaluation will be required before
drawing further conclusions.
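A word similarity evaluation of the kind referred to above typically ranks word pairs by the cosine similarity of their vectors and correlates that ranking with human relatedness judgements (e.g. the MTurk-771 dataset of Halawi et al., 2012) using Spearman's rank correlation. A minimal, stdlib-only sketch of the scoring step, with illustrative names and toy vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def spearman(xs, ys):
    """Spearman rank correlation (no tie correction; illustration only)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Toy evaluation: model similarities vs. human relatedness ratings
# for three hypothetical word pairs.
model_scores = [cosine([1.0, 0.2], [0.9, 0.3]),
                cosine([1.0, 0.2], [0.1, 0.9]),
                cosine([1.0, 0.2], [-0.8, 0.4])]
human_ratings = [9.1, 4.3, 1.2]
print(spearman(model_scores, human_ratings))  # 1.0 when rankings agree
```

In practice a library implementation with proper tie handling (e.g. `scipy.stats.spearmanr`) would be used; the sketch only shows what the reported score measures.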
ACKNOWLEDGEMENTS
We acknowledge funding from the Academy of
Finland for the Centre of Excellence in Ancient Near
Eastern Empires and from the University of Helsinki
for the Deep Learning and Semantic Domains in
Akkadian Texts Project (PI Saana Svärd for both).
We also thank Johannes Bach, Mikko Luukko and
Saana Svärd for their valuable feedback, Niek
Veldhuis for providing us with the JSON Oracc data
and Heidi Jauhiainen for pre-processing it.
REFERENCES
Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Pasca,
M., Soroa, A., 2009. A Study on Similarity and
Relatedness Using Distributional and WordNet-based
Approaches. In NAACL-HLT 2009.
Bach, J., 2020 (forthcoming). Untersuchungen zur
transtextuellen Poetik assyrischer herrschaftlich-
narrativer Texte. SAAS 30.
Bird, S., Loper, E., Klein, E., 2009. Natural Language
Processing with Python. O’Reilly Media Inc.
Blei, D. M., Ng, A. Y., Jordan, M. I. 2003. Latent Dirichlet
Allocation. In The Journal of Machine Learning
Research 3, pp. 993–1022.
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.,
2017. Enriching Word Vectors with Subword
Information. In TACL 5, pp. 135–146.
Bouma, G., 2009. Normalized (Pointwise) Mutual
Information in Collocation Extraction. In GSCL, pp.
31–40.
Broder, A. Z., Glassman, S. C., Manasse, M. S., Zweig, G.
1997. Syntactic Clustering of the Web. Digital Systems
Research Center.
Church, K., Hanks, P., 1990. Word association norms,
mutual information and lexicography. Computational
Linguistics 16, pp. 22–29.
Cifola, B., 1995. Analysis of Variants on the Assyrian
Royal Titulary. Istituto Universitario Orientale.
Citron, D. T., Ginsparg, P., 2015. Patterns of Text Reuse
in a Scientific Corpus. In Proceedings of the National
Academy of Sciences of the United States of America.
Clough, P., Gaizauskas, R., Piao, S. S. L., Wilks, Y. 2002.
Meter: Measuring Text Reuse. In the 40th Annual
Meeting of the ACL, pp. 152–159.
Daille, B., 1994. Approche mixte pour l’extraction
automatique de terminologie: statistiques lexicales et
filtres linguistiques. PhD thesis, Université Paris 7.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer,
T. K., Harshman, R. 1990. Indexing by Latent Semantic
Analysis. In JASIST.
Evert, S., 2005. The Statistics of Word Cooccurrences:
Word Pairs and Collocations. PhD thesis. IMS
Stuttgart.
Fano, R., 1961. Transmission of Information: A Statistical
Theory of Communications. MIT Press.
Gesche, P. D., 2001. Schulunterricht in Babylonien im
ersten Jahrtausend v. Chr. AOAT 275. Ugarit Verlag.
George, A., 2003. The Babylonian Gilgamesh Epic:
Critical Edition and Cuneiform Texts. Oxford
University Press.
Gipp, B., 2014. Citation-Based Plagiarism Detection:
Detecting Disguised and Cross-Language Plagiarism
using Citation Pattern Analysis. Springer.
Groneberg, B. 1996. Towards a Definition of Literariness
as Applied to Akkadian Literature. In Mesopotamian
Poetic Language: Sumerian and Akkadian, Edited by
M. Vogelzang, H. Vanstiphout. Groningen, pp. 59–84.
Halawi, G., Dror, G., Gabrilovich, E., Koren, Y. 2012:
Large-scale learning of word relatedness with
constraints. In KDD 2012, pp. 1406–1414.
http://www2.mta.ac.il/~gideon/mturk771.html
Huehnergard, J., Woods, C. 2008. Akkadian and Eblaite.
In The Ancient Languages of Mesopotamia, Egypt and
Aksum. Edited by R. D. Woodard. Cambridge
University Press.
Jungmaier, J., Kassner, N., Roth, B., 2020. Dirichlet-
Smoothed Word Embeddings for Low-Resource
Settings. In the 12th LREC 2020, pp. 3560–3565.
Kouwenberg, N. J. C., 2011. Akkadian in General. In Semitic
Languages. An International Handbook. Edited by S.
Weninger, G. Khan, M. P. Streck, J. C. E. Watson, pp.
330–339. De Gruyter Mouton.
Lambert, W. G., 2013. Babylonian Creation Myths.
Eisenbrauns.
Lee, J. 2007. A Computational Model of Text Reuse in
Ancient Literary Texts. In the 45th Annual Meeting of
the ACL, pp. 472–479.
Levy, O., Goldberg, Y. 2014. Neural Word Embedding as
Implicit Matrix Factorization. In Advances in Neural
Information Processing Systems, pp. 2177–2185.