LPE was also applied to twenty-one real-world
datasets. This work could be extended with a more
detailed simulation study that uses more balanced
types of datasets, especially larger ones, to better
understand the merits of LPE.
In sum, this paper provides the beginnings of a
better understanding of the relative strengths and
weaknesses of MDTs and of using DTs as their
component classifier. It is hoped that it will motivate
future theoretical and empirical investigations into
incomplete data and DTs, and perhaps reassure those
who are uneasy regarding the use of non-imputed data
in prediction.
ACKNOWLEDGEMENTS
The work was funded by the Faculty of Engineering
and the Built Environment at the Durban University
of Technology. The authors would like to thank Chris
Jones for his helpful comments.
REFERENCES
Agresti, A. (1990). Categorical Data Analysis. New York:
John Wiley.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984).
Classification and Regression Trees, Wadsworth.
Cestnik, B., Kononenko, I. and Bratko, I. (1987). Assistant
86: a knowledge-elicitation tool for sophisticated users.
In I. Bratko and N. Lavrac, editors, European Working
Session on Learning – EWSL87. Sigma Press,
Wilmslow, England, 1987.
Cheeseman, P., Kelly, J., Self, M., Stutz, J., Taylor, W., and
Freeman, D. (1988). Bayesian Classification. In
Proceedings of American Association of Artificial
Intelligence (AAAI), Morgan Kaufmann Publishers:
San Mateo, CA, 607-611.
Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977).
Maximum likelihood estimation from incomplete data
via the EM algorithm. Journal of the Royal Statistical
Society, Series B, 39, 1-38.
Hosmer, D.W. and Lemeshow, S. (1989). Applied Logistic
Regression. New York: Wiley.
Khosravi, P., Vergari, A., Choi, YJ. and Liang, Y. (2020).
Handling Missing Data in Decision Trees: A
Probabilistic Approach. https://arxiv.org/pdf/2006.16341.pdf
(accessed 16 September 2020).
Lakshminarayan, K., Harp, S.A., and Samad, T. (1999).
Imputation of Missing Data in Industrial Databases.
Applied Intelligence, 11, 259-275.
Little, R.J.A. and Rubin, D.B. (1987). Statistical Analysis
with Missing Data. New York: Wiley.
Long, J.S. (1998). Regression Models for Categorical and
Limited Dependent Variables. Advanced Quantitative
Techniques in the Social Sciences Number 7. Sage
Publications: Thousand Oaks CA.
McCullagh, P. and Nelder, J.A. (1990). Generalised Linear
Models, 2nd Edition, Chapman and Hall, London,
England.
MULTIPLE IMPUTATION SOFTWARE. Available from
<http://www.stat.psu.edu/jls/misoftwa.html,
http://methcenter.psu.edu/EMCOV.html>
Murphy, P. and Aha, D. (1992). UCI Repository of machine
learning databases [Machine-readable data repository].
The University of California, Department of
Information and Computer Science, Irvine, CA.
Quinlan, J.R. (1985). Decision trees and multi-level
attributes. In J. Hayes and D. Michie (Eds.), Machine
Intelligence, Vol. 11. Chichester, England: Ellis
Horwood.
Quinlan, J.R. (1993). C4.5: Programs for Machine
Learning. Los Altos, California: Morgan Kaufmann
Publishers, Inc.
Rubin, D.B. (1987). Multiple Imputation for Nonresponse
in Surveys. New York: John Wiley and Sons.
Rubin, D.B. (1996). Multiple Imputation After 18+ Years.
Journal of the American Statistical Association, 91,
473-489.
Schafer, J.L. (1997). Analysis of Incomplete Multivariate
Data. Chapman and Hall, London.
Schafer, J.L. and Graham, J.W. (2002). Missing data:
Our view of the state of the art. Psychological Methods,
7 (2), 147-177.
Tanner, M.A. and Wong, W.H. (1987). The Calculation
of Posterior Distributions by Data Augmentation (with
discussion). Journal of the American Statistical
Association, 82, 528-550.
Twala, B. (2005). Effective Techniques for Handling
Incomplete Data Using Decision Trees. Unpublished
PhD thesis, Open University, Milton Keynes, UK.
Twala, B., Jones, M.C. and Hand, D.J. (2008). Good
methods for coping with missing data in decision trees.
Pattern Recognition Letters, 29, 950-956.
Twala, B. and Cartwright, M. (2005). Ensemble Imputation
Methods for Missing Software Engineering Data. 11th
IEEE Intl. Metrics Symp., Como, Italy, 19-22
September 2005.
Twala, B., Cartwright, M., and Shepperd, M. (2005).
Comparison of Various Methods for Handling
Incomplete Data in Software Engineering Databases.
4th International Symposium on Empirical Software
Engineering, Noosa Heads, Australia, November 2005.
Wu, C.F.J. (1983). On the convergence of the EM
algorithm. The Annals of Statistics, 11, 95-103.
Zhang, X., Yining, W., Jiahui, H., and Chen, Y. (2020).
Predicting Missing Values in Medical Data via
XGBoost Regression. Journal of Healthcare
Informatics Research. https://doi.org/10.1007/s41666-020-00077-1
(accessed 16 September 2020).