applying adaptive Gumbel in the fully-connected layer of the CNN models is recommended. In general, CNN models that use the proposed activation functions improve prediction and convergence speed compared to models that rely exclusively on standard activation functions.
A series of experiments using CNN models trained on top of word2vec representations of the text data is performed to evaluate the proposed activation functions in a sentiment analysis application on the Movie Review benchmark. Our empirical results imply that using adaptive Gumbel as the activation function in the fully-connected layer and adaptive ReLU in the convolutional layers is strongly recommended. These observations are consistent with the findings of our experiments on the MNIST data. Also, a comparison between our best observations and the state-of-the-art results of (Kim, 2014) indicates that the accuracy we report with adaptive activation functions reproduces theirs. We believe that further hyper-parameter fine-tuning, and the use of more complex CNN variants together with our proposed activation functions, could improve on the existing results. To recap, our experiments on two well-known image and text benchmarks imply that, by virtue of adaptive activation functions, CNN models gain in both accuracy and convergence.
Learning the adaptation parameter is feasible by adding only one update equation to back-propagation. Computationally, letting the neurons of a layer choose their own activation function in this framework is equivalent to adding a single neuron to that layer (a minimal sketch is given below). This minor extra computation changes the flexibility of the network considerably, especially in shallow architectures. We focused only on the classic LeNet5 architecture, but there is potential to explore this methodology, with a wide variety of distribution functions, in portable architectures such as MobileNets (Howard et al., 2017), ProjectionNets (Ravi, 2017), SqueezeNets (Iandola et al., 2016), and QuickNets (Ghosh, 2017). The approach also has the potential to generalize quantized training methods (Hubara et al., 2018; Partovi Nia and Belbahri, 2018).
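To make the cost of this mechanism concrete, the following sketch (assuming a PyTorch setting, which is not prescribed by the paper) shows a layer-level activation with a single trainable adaptation parameter that is updated by the same back-propagation pass as the weights, i.e. roughly the cost of one extra neuron per layer. The functional form 1 − (1 + a·e^x)^(−1/a) used here is an illustrative Gumbel-flavoured family that recovers the logistic sigmoid at a = 1; it is a stand-in for, not the exact definition of, the adaptive Gumbel and adaptive ReLU units proposed in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveGumbelLike(nn.Module):
    """One trainable adaptation parameter shared by all neurons of a layer."""

    def __init__(self, init_a: float = 1.0):
        super().__init__()
        # Store an unconstrained value; softplus keeps the adaptation a > 0.
        inv_softplus = torch.log(torch.expm1(torch.tensor(float(init_a))))
        self.raw_a = nn.Parameter(inv_softplus)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = F.softplus(self.raw_a)
        # Stable evaluation of 1 - (1 + a*exp(x))^(-1/a):
        # log(1 + a*exp(x)) = logaddexp(0, x + log(a)).
        t = torch.logaddexp(torch.zeros_like(x), x + torch.log(a))
        return 1.0 - torch.exp(-t / a)


# Placement following the recommendation above: the adaptive unit replaces the
# fixed activation in the fully-connected block of a LeNet5-style classifier.
fc_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),
    AdaptiveGumbelLike(),
    nn.Linear(120, 84),
    AdaptiveGumbelLike(),
    nn.Linear(84, 10),
)
```

Because the adaptation parameter is an ordinary trainable parameter, no custom gradient is needed; the single extra update equation mentioned above is handled by automatic differentiation.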
REFERENCES
Agostinelli, F., Hoffman, M., Sadowski, P., and Baldi, P.
(2014). Learning activation functions to improve deep
neural networks. arXiv preprint arXiv:1412.6830.
Belbahri, M., Sari, E., Darabi, S., and Partovi Nia, V.
(2019). Foothill: A quasiconvex regularization for
edge computing of deep neural networks. In Interna-
tional Conference on Image Analysis and Recognition,
pages 3–14.
Box, G. E. and Cox, D. R. (1964). An analysis of trans-
formations. Journal of the Royal Statistical Society.
Series B (Methodological), pages 211–252.
Cho, Y. and Saul, L. K. (2010). Large-margin classifica-
tion in infinite neural networks. Neural Computation,
22(10):2678–2697.
Clevert, D.-A., Unterthiner, T., and Hochreiter, S.
(2015). Fast and accurate deep network learning
by exponential linear units (elus). arXiv preprint
arXiv:1511.07289.
Coles, S., Bawa, J., Trenner, L., and Dorazio, P. (2001). An
introduction to statistical modeling of extreme values.
Springer.
Darabi, S., Belbahri, M., Courbariaux, M., and Partovi
Nia, V. (2019). Regularized binary network training.
In Neural Information Processing Systems, Workshop
on Energy Efficient Machine Learning and Cognitive
Computing.
Dushkoff, M. and Ptucha, R. (2016). Adaptive activa-
tion functions for deep networks. Electronic Imaging,
2016(19):1–5.
Elfwing, S., Uchibe, E., and Doya, K. (2018). Sigmoid-
weighted linear units for neural network function ap-
proximation in reinforcement learning. Neural Net-
works.
Ghosh, T. (2017). Quicknet: Maximizing efficiency
and efficacy in deep architectures. arXiv preprint
arXiv:1701.02291.
Glorot, X. and Bengio, Y. (2010). Understanding the dif-
ficulty of training deep feedforward neural networks.
In Proceedings of the Thirteenth International Con-
ference on Artificial Intelligence and Statistics, pages
249–256.
Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse
rectifier neural networks. In Proceedings of the Four-
teenth International Conference on Artificial Intelli-
gence and Statistics, pages 315–323.
Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville,
A., and Bengio, Y. (2013). Maxout networks. arXiv
preprint arXiv:1302.4389.
Hashemi, H. B., Asiaee, A., and Kraft, R. (2016). Query
intent detection using convolutional neural networks.
In International Conference on Web Search and Data
Mining, Workshop on Query Understanding.
Hendrycks, D. and Gimpel, K. (2016). Gaussian error linear
units (gelus). arXiv preprint arXiv:1606.08415.
Hornik, K., Stinchcombe, M., and White, H. (1989). Multi-
layer feedforward networks are universal approxima-
tors. Neural networks, 2(5):359–366.
Hou, L., Samaras, D., Kurc, T., Gao, Y., and Saltz, J. (2017).
Convnets with smooth adaptive activation functions
for regression. In Artificial Intelligence and Statistics,
pages 430–439.
Hou, L., Samaras, D., Kurc, T. M., Gao, Y., and Saltz,
J. H. (2016). Neural networks with smooth adap-
tive activation functions for regression. arXiv preprint
arXiv:1608.06557.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.