comparison, we also performed up-sampling and
down-sampling.
After splitting the dataset into 80% training data
and 20% validation data, 110 of the 135 samples
were used for training: 43 healthy non-smokers,
48 healthy smokers, and 19 COPD patients.
The upSample function of the “caret” package randomly
samples the dataset with replacement so that all
classes have 48 samples, while downSample randomly
samples the dataset so that all classes have 19 samples.
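The effect of the two resampling schemes on the class counts can be illustrated with a minimal Python sketch. It is not the “caret” implementation: the function names up_sample and down_sample, the label strings, and the choice to keep every original sample when up-sampling are assumptions made for illustration only.

```python
import random
from collections import Counter


def _indices_by_class(labels):
    """Group sample indices by their class label."""
    groups = {}
    for idx, label in enumerate(labels):
        groups.setdefault(label, []).append(idx)
    return groups


def up_sample(labels, rng):
    """Indices after random up-sampling to the majority-class size.

    Assumption: all original samples are kept and the shortfall is
    drawn with replacement from the same class.
    """
    groups = _indices_by_class(labels)
    target = max(len(idxs) for idxs in groups.values())
    chosen = []
    for idxs in groups.values():
        chosen.extend(idxs)
        chosen.extend(rng.choices(idxs, k=target - len(idxs)))
    return chosen


def down_sample(labels, rng):
    """Indices after random down-sampling to the minority-class size."""
    groups = _indices_by_class(labels)
    target = min(len(idxs) for idxs in groups.values())
    chosen = []
    for idxs in groups.values():
        chosen.extend(rng.sample(idxs, target))  # without replacement
    return chosen


# Training-set class sizes reported in the text: 43, 48 and 19.
labels = ["non_smoker"] * 43 + ["smoker"] * 48 + ["copd"] * 19
rng = random.Random(0)  # fixed seed only for reproducibility

up_counts = Counter(labels[i] for i in up_sample(labels, rng))
down_counts = Counter(labels[i] for i in down_sample(labels, rng))
print(up_counts)    # every class has 48 samples
print(down_counts)  # every class has 19 samples
```

Up-sampling enlarges the training set to 144 samples by repeating minority samples, whereas down-sampling shrinks it to 57 samples by discarding majority samples.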
Table 4 shows the AUC values of multiclass LR
for the different resampling methods. The AUCs of
SMOTE from “DMwR” and “smotefamily” are
quite similar, differing by only 0.8%.
Since the two packages give nearly identical
outcomes, either of the two SMOTE functions
can be used.
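As an illustration of how a single AUC value can be obtained for a three-class problem, the sketch below averages pairwise (one-vs-one) AUCs in the style of the Hand and Till M measure. Whether Table 4 was computed in exactly this way is an assumption; the function names and the data layout (one dictionary of class probabilities per sample) are hypothetical.

```python
from itertools import combinations


def pairwise_auc(pos_scores, neg_scores):
    """Probability that a positive scores higher than a negative.

    Ties count as 0.5. Both lists must be non-empty.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def multiclass_auc(y_true, probs, classes):
    """Average of pairwise AUCs over all unordered class pairs.

    probs[k][c] is the predicted probability of class c for sample k.
    Every class in `classes` must occur at least once in y_true.
    """
    pairs = list(combinations(classes, 2))
    total = 0.0
    for i, j in pairs:
        # Score of class i for samples of class i versus samples of class j.
        i_on_i = [probs[k][i] for k, y in enumerate(y_true) if y == i]
        i_on_j = [probs[k][i] for k, y in enumerate(y_true) if y == j]
        # Score of class j for samples of class j versus samples of class i.
        j_on_j = [probs[k][j] for k, y in enumerate(y_true) if y == j]
        j_on_i = [probs[k][j] for k, y in enumerate(y_true) if y == i]
        a_ij = pairwise_auc(i_on_i, i_on_j)
        a_ji = pairwise_auc(j_on_j, j_on_i)
        total += (a_ij + a_ji) / 2.0
    return total / len(pairs)
```

For two classes this reduces to the ordinary AUC when the two class probabilities sum to one, so the measure stays comparable with the binary case.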
For comparison, resampling the dataset with the
upSample function increased the AUC by 5%, while
downSample decreased it by 4.3%. However, the AUC
obtained with upSample is still lower than
that of SMOTE from either “DMwR” or
“smotefamily”. The models trained with SMOTE
outperformed the models without SMOTE in all
four evaluation metrics.
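The difference between random up-sampling and SMOTE is that SMOTE creates new points by interpolating between a minority sample and one of its nearest minority neighbours, rather than repeating existing samples. The following minimal Python sketch shows that core step. It is not the “DMwR” or “smotefamily” implementation; the function name smote_points, the default k = 5, and the toy two-dimensional data are assumptions for illustration.

```python
import math
import random


def smote_points(minority, n_synthetic, k=5, rng=None):
    """Generate synthetic minority samples by neighbour interpolation.

    minority: list of feature vectors (lists of floats), at least two.
    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            [b + gap * (n - b) for b, n in zip(base, neighbour)]
        )
    return synthetic


# Toy minority class with two features (hypothetical values).
minority = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [4.0, 4.0]]
new_points = smote_points(minority, n_synthetic=6, k=2,
                          rng=random.Random(0))
print(len(new_points))  # 6 synthetic samples
```

Because the synthetic points are new values rather than copies, a classifier sees a denser minority region instead of duplicated samples, which is one plausible reason SMOTE can outperform plain up-sampling.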
4 CONCLUSION
In this study, we used a microarray dataset to predict
the presence of COPD after first addressing the class
imbalance. Prior studies on this dataset tried to
predict the presence of COPD without accounting for
the class imbalance.
The proposed model can predict the presence
of COPD with an overall accuracy of 80% and an AUC
of 90%, based on 10-fold cross-validation repeated
10 times. The outcomes indicate that dealing with
class imbalance before applying machine learning
algorithms and regression analysis allows the
presence of COPD to be predicted more accurately.
The proposed method also has higher sensitivity and
specificity values than the models trained without
dealing with class imbalance. This shows that the
selected model can correctly identify both subjects
that belong to a given class and subjects that do
not belong to it. The proposed method can be used
to assist in determining better treatments to lower
the fatality rates caused by COPD.
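The per-class reading of sensitivity and specificity used above can be made concrete with a small one-vs-rest computation. The sketch below is a generic illustration in Python, not the evaluation code of this study; the function name and the toy labels are hypothetical.

```python
def per_class_sens_spec(y_true, y_pred, classes):
    """Return {class: (sensitivity, specificity)} in a one-vs-rest view.

    Sensitivity: share of subjects of the class that are recognised.
    Specificity: share of subjects outside the class that are rejected.
    Every class must have at least one member and one non-member.
    """
    result = {}
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        tn = sum(1 for t, p in pairs if t != c and p != c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        result[c] = (tp / (tp + fn), tn / (tn + fp))
    return result


# Toy example with three classes (hypothetical labels).
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
print(per_class_sens_spec(y_true, y_pred, ["a", "b", "c"]))
# {'a': (0.5, 0.75), 'b': (1.0, 0.75), 'c': (0.5, 1.0)}
```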
In future work, we plan to employ
more recent and advanced resampling methods to
achieve better performance.