context of three NASA datasets. Our clustering algorithm
was developed to discover two groups based on software
complexity metrics. The proposed approach outperformed
the Gaussian mixture model, a classical baseline, in
terms of both data modeling capability and clustering
accuracy. The second
application was spam detection using the Spambase
dataset from the UCI repository. The ultimate goal of
our study is to develop a powerful classifier that acts
as a dedicated filter, accurately distinguishing spam
from legitimate emails so as to improve the blocking
rate of spam emails and decrease the misclassification
rate of legitimate emails. Spam filtering
solutions presented in this paper generate accurate
results in comparison with the Gaussian mixture model,
as our algorithm achieves higher precision and recall.
From these outcomes, we can infer that the multivariate
Beta mixture model could be a competitive modeling
approach for software defect and spam prediction
problems. In other words, our model produces improved
clustering results largely due to its flexibility.
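The precision and recall used in this comparison can be computed from true and predicted labels as in the following minimal sketch. The function name `precision_recall` and the toy labels are illustrative assumptions and are not taken from the experiments; label 1 is assumed to denote the positive class (spam, or a defective module).

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int],
                     positive: int = 1) -> Tuple[float, float]:
    """Return (precision, recall) for the class labelled `positive`."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    # True positives: positive items that were predicted positive.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: negative items that were predicted positive.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: positive items that were predicted negative.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels (hypothetical): 1 = spam, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In the spam setting, higher recall corresponds to a higher blocking rate of spam emails, while higher precision corresponds to fewer legitimate emails being flagged as spam.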