Exploring Alternatives to Softmax Function

Kunal Banerjee 1,∗,a, Vishak Prasad C. 2, Rishi Raj Gupta 2,†, Kartik Vyas 2,†, Anushree H. 2,† and Biswajit Mishra 2

1 Walmart Global Tech, Bangalore, India
2 Intel Corporation, Bangalore, India

a https://orcid.org/0000-0002-0605-630X
∗ Work done when the author worked at Intel Corporation.
† Work done during internship at Intel Corporation.
Keywords: Softmax, Spherical Loss, Function Approximation, Classification.

Abstract: The softmax function is widely used in artificial neural networks for multiclass classification, multilabel classification, attention mechanisms, etc. However, its efficacy is often questioned in the literature. The log-softmax loss has been shown to belong to a more generic class of loss functions, called the spherical family, and its member log-Taylor softmax loss is arguably the best alternative in this class. Another approach, which tries to enhance the discriminative nature of the softmax function, proposes soft-margin softmax (SM-softmax) as the most suitable alternative. In this work, we investigate Taylor softmax, SM-softmax and our proposed SM-Taylor softmax, an amalgamation of the two earlier functions, as alternatives to the softmax function. Furthermore, we explore the effect of expanding Taylor softmax up to ten terms (the original work proposed expanding only to two terms), along with the ramifications of considering Taylor softmax to be a finite or infinite series during backpropagation. Our experiments on the image classification task on different datasets reveal that there is always a configuration of the SM-Taylor softmax function that outperforms the normal softmax function and its other alternatives.
1 INTRODUCTION

The softmax function is a popular choice in deep learning classification tasks, where it typically appears as the last layer. Recently, this function has found application in other operations as well, such as the attention mechanism (Vaswani et al., 2017). However, the softmax function has often been scrutinized in search of a better alternative (Vincent et al., 2015; de Brébisson and Vincent, 2016; Liu et al., 2016; Liang et al., 2017; Lee et al., 2018).
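For reference, the standard softmax maps a vector of logits $z \in \mathbb{R}^K$ to a probability distribution over the $K$ classes:

$$\operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K.$$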
Specifically, Vincent et al. explore the spherical loss family in (Vincent et al., 2015), which has the log-softmax loss as one of its members. Brébisson and Vincent further work on this family of loss functions and, in (de Brébisson and Vincent, 2016), propose log-Taylor softmax as a superior alternative to the others, including the original log-softmax loss.
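As a brief sketch of the idea, the second-order Taylor softmax replaces each exponential in the softmax with its second-order Taylor expansion around zero before normalizing:

$$f(z)_i = \frac{1 + z_i + \tfrac{1}{2} z_i^2}{\sum_{j=1}^{K} \left( 1 + z_j + \tfrac{1}{2} z_j^2 \right)},$$

where $K$ is the number of classes; since $1 + t + t^2/2 > 0$ for all real $t$, the outputs remain positive and sum to one.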
Liu et al. take a different approach to enhancing the softmax function by exploring alternatives that may improve the discriminative property of the final layer, as reported in (Liu et al., 2016). The authors propose large-margin softmax (LM-softmax), which tries to increase inter-class separation and decrease intra-class separation. LM-softmax is shown to outperform softmax on the image classification task across various datasets. This approach is further investigated by Liang et al. in (Liang et al., 2017), where they propose soft-margin softmax (SM-softmax), which provides finer control over the inter-class separation compared to LM-softmax. Consequently, SM-softmax is shown to be a better alternative than its predecessor LM-softmax (Liang et al., 2017).
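Roughly speaking, SM-softmax subtracts a non-negative margin $m$ from the logit of the ground-truth class $y$ during training (see (Liang et al., 2017) for the precise formulation), so the probability assigned to the target becomes

$$p_y = \frac{e^{z_y - m}}{e^{z_y - m} + \sum_{j \neq y} e^{z_j}},$$

which encourages the target logit to exceed the competing logits by at least the margin $m$.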
In this work, we explore the various alternatives proposed for the softmax function in the existing literature. Specifically, we focus on two contrasting approaches, based on the spherical loss and on the discriminative property, and choose the best alternative that each has to offer: log-Taylor softmax loss and SM-softmax, respectively. Moreover, we enhance these functions to investigate whether further improvements can be achieved. The contributions of this paper are as follows:
• We propose SM-Taylor softmax, an amalgamation of Taylor softmax and SM-softmax.
• We explore the effect of expanding Taylor softmax up to ten terms (the original work (de Brébisson and Vincent, 2016) proposed expanding only to two terms).