Figure 5: Computational Time for varying number of sentences per instance (x-axis: Number of Sentences per Instance; y-axis: Time in seconds; series: TF-IDF+NB, Word2Vec+DLM, TF-IDF+SVM).
5 CONCLUSION
We conclude that between 5 and 10 sentences per instance provide sufficient information for fiction classification while maintaining adequate accuracy. Furthermore, we conclude that although SVM achieves higher classification accuracy, its computational runtime grows rapidly as the number of instances increases, and that Word2Vec yields a steady but unimpressive accuracy across all sentence counts. Of the approaches analysed, we therefore conclude that NB with TF-IDF is the most suitable approach for fiction classification.
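As an illustration of this comparison, the following is a minimal sketch of how the TF-IDF+NB and TF-IDF+SVM accuracy and runtime trade-off could be measured with scikit-learn; the function and variable names (compare_runtime_and_accuracy, texts, labels) are illustrative assumptions and do not come from the paper's implementation.

# Minimal sketch (assumption: scikit-learn is available; `texts` holds instances of
# 5-10 concatenated sentences each, `labels` holds their fiction-class labels).
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def compare_runtime_and_accuracy(texts, labels):
    """Time and score TF-IDF+NB against TF-IDF+SVM on the same instances."""
    for name, clf in [("TF-IDF+NB", MultinomialNB()), ("TF-IDF+SVM", LinearSVC())]:
        pipeline = make_pipeline(TfidfVectorizer(), clf)
        start = time.perf_counter()
        scores = cross_val_score(pipeline, texts, labels, cv=5)
        elapsed = time.perf_counter() - start
        print(f"{name}: mean accuracy={scores.mean():.3f}, time={elapsed:.1f}s")

In such a setup, the linear SVM typically edges out Naive Bayes on accuracy while its training time grows faster with the number of instances, which is the trade-off observed above.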
ACKNOWLEDGEMENTS
The authors would like to thank Gus Hahn-Powell,
Valerie Johnson, and Cynthia Mwenja for their valu-
able feedback and support.