Encoding Techniques for Handling Categorical Data in Machine Learning-Based Software Development Effort Estimation

Mohamed Hosni



Planning, controlling, and monitoring a software project rely primarily on estimates of the software development effort. These estimates are usually produced during the early stages of the software life cycle. At this stage, the available information about the software product is categorical in nature, and only a few numerical data points are available. Building an accurate effort estimator therefore begins with deciding how to process the categorical data that characterizes the software project. This paper aims to shed light on how categorical data in software development effort estimation (SDEE) datasets can be handled through encoding techniques. Four encoders were used in this study: the one-hot encoder, label encoder, count encoder, and target encoder. Four well-known machine learning (ML) estimators and a homogeneous ensemble were evaluated. The empirical analysis was conducted using four datasets. The datasets generated by the one-hot encoder appeared to suit the ML estimators best, as they resulted in more accurate estimates. The ensemble, which combined four variants of the same technique, each trained on a dataset generated by a different encoder, performed as well as or better than the single ML estimation technique. The overall results are promising and pave the way for a new approach to handling categorical data in SDEE datasets.
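
For readers unfamiliar with the four encoders named in the abstract, the following is a minimal sketch of what each one does to a single categorical column. It is not the paper's implementation: the toy `languages` and `effort` values, the function names, and the choice of plain Python are all assumptions made for illustration, and the target encoder shown here uses a simple per-category mean with no smoothing, which may differ from the variant used in the study.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: one categorical feature and an effort target.
# Values are illustrative only and are not taken from the paper's datasets.
languages = ["java", "cobol", "java", "c", "cobol", "java"]
effort = [100.0, 300.0, 140.0, 80.0, 260.0, 120.0]


def one_hot_encode(values):
    """Return one binary column per category (columns in sorted category order)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def label_encode(values):
    """Map each category to an integer index (sorted category order)."""
    index = {c: i for i, c in enumerate(sorted(set(values)))}
    return [index[v] for v in values]


def count_encode(values):
    """Replace each category with the number of times it occurs."""
    counts = Counter(values)
    return [counts[v] for v in values]


def target_encode(values, target):
    """Replace each category with the mean target value of that category.

    Assumption: plain per-category mean, computed on the same data,
    without smoothing or out-of-fold estimation.
    """
    sums = defaultdict(float)
    counts = Counter(values)
    for v, t in zip(values, target):
        sums[v] += t
    return [sums[v] / counts[v] for v in values]


if __name__ == "__main__":
    print("one-hot:", one_hot_encode(languages))
    print("label:  ", label_encode(languages))
    print("count:  ", count_encode(languages))
    print("target: ", target_encode(languages, effort))
```

Each function yields a different numeric view of the same column. Under the ensemble described in the abstract, one variant of the same estimator would be trained on each such view; how the variants' predictions are combined is not stated here, so it is left out of the sketch.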


Paper Citation

in Harvard Style

Hosni M. (2023). Encoding Techniques for Handling Categorical Data in Machine Learning-Based Software Development Effort Estimation. In Proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR; ISBN 978-989-758-671-2, SciTePress, pages 460-467. DOI: 10.5220/0012259400003598

in BibTeX Style

author={Mohamed Hosni},
title={Encoding Techniques for Handling Categorical Data in Machine Learning-Based Software Development Effort Estimation},
booktitle={Proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR},

in EndNote Style


JO - Proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR
TI - Encoding Techniques for Handling Categorical Data in Machine Learning-Based Software Development Effort Estimation
SN - 978-989-758-671-2
AU - Hosni M.
PY - 2023
SP - 460
EP - 467
DO - 10.5220/0012259400003598
PB - SciTePress