able. To validate this work, we trained and tested an Encoder-Decoder network on this dataset, and although the results are not on par with state-of-the-art English captioning systems, they are considerably superior to the English-captioning → English-to-Arabic-translation baseline. Our current work includes adding vocalization to the Arabic training set, as well as increasing the size of the dataset.
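For concreteness, the sketch below shows a minimal merge-style encoder-decoder captioning model in Keras, of the kind referred to above: a projected CNN image feature vector is combined with an LSTM encoding of the partial caption to predict the next word. The layer sizes, vocabulary size, and feature dimension are illustrative assumptions only, not the configuration used in this work.

```python
# Minimal merge-style encoder-decoder captioning sketch (Keras).
# Assumes pre-extracted CNN image features; all hyperparameters are illustrative.
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

vocab_size = 10000   # Arabic caption vocabulary size (assumed)
max_len = 30         # maximum caption length in tokens (assumed)
feature_dim = 2048   # dimension of pooled CNN features (assumed)
embed_dim = 256
hidden_dim = 256

# Encoder branch: project the image feature vector into the decoder space.
img_in = Input(shape=(feature_dim,))
img_emb = Dense(hidden_dim, activation="relu")(Dropout(0.5)(img_in))

# Decoder branch: embed the partial caption and encode it with an LSTM.
seq_in = Input(shape=(max_len,))
seq_emb = Embedding(vocab_size, embed_dim, mask_zero=True)(seq_in)
seq_hid = LSTM(hidden_dim)(Dropout(0.5)(seq_emb))

# Merge the image and language representations and predict the next word.
merged = add([img_emb, seq_hid])
out = Dense(vocab_size, activation="softmax")(
    Dense(hidden_dim, activation="relu")(merged))

model = Model(inputs=[img_in, seq_in], outputs=out)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.summary()
```

At inference time such a model is run autoregressively: the caption is seeded with a start token, the most probable next word is appended, and decoding repeats until an end token or the maximum length is reached.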