to analyze our model and how our attributes contribute
to predicting leishmaniasis cases. Based on our analysis,
maximum humidity is the most important feature in the
prediction, accounting for 18% of the importance. Average
temperature and minimum humidity contribute another 15%
and 14%, respectively, while precipitation has the lowest
importance at 11%. Humidity correlates well with
leishmaniasis cases, whereas precipitation correlates only
weakly with our target variable (cases); we suspect that
this weak correlation may be related to the meteorological
dataset.
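The correlation check described above can be illustrated with a short sketch. It is not the code used in this work: the helper name `pearson_r` and the toy series are hypothetical, chosen only to show how a linear correlation between a meteorological attribute and the case counts might be computed.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values, NOT taken from the study's dataset.
max_humidity = [70.0, 75.0, 80.0, 85.0, 90.0]
precipitation = [12.0, 3.0, 9.0, 1.0, 10.0]
cases = [2, 4, 5, 8, 9]

print("humidity vs cases:", round(pearson_r(max_humidity, cases), 3))
print("precipitation vs cases:", round(pearson_r(precipitation, cases), 3))
```

A coefficient close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one. Note that this is a different quantity from the Random Forest feature importance reported above, so the two rankings need not agree.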
5 CONCLUSION
In this work, we used a Random Forest (RF) model to
predict leishmaniasis cases and possible future outbreaks.
As an ensemble model, RF shows good results in predicting
cases. We performed several tests to obtain a model with
lower prediction error; these showed that our original
dataset was very small, which caused problems for the
model. Even with synthetic data, the prediction error
remained high, so an optimization process was necessary.
The optimizers gave good results: Random Search and a
genetic algorithm (TPOT) performed better than Grid
Search, reducing the MAE from 3.77 to 3.64-3.65.
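To put the reported figures in perspective, the following sketch computes the MAE and the relative reduction implied by the values quoted above (roughly 3%). The function names are our own and hypothetical; only the numbers 3.77, 3.64 and 3.65 come from the text.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_reduction(baseline, improved):
    """Fractional reduction of an error metric relative to its baseline."""
    return (baseline - improved) / baseline


# MAE values quoted in the text above.
baseline_mae = 3.77
for tuned_mae in (3.64, 3.65):
    pct = 100 * relative_reduction(baseline_mae, tuned_mae)
    print(f"MAE {baseline_mae} -> {tuned_mae}: {pct:.1f}% lower")
```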
Our first approach was to delete zero values because of
the noise they introduce, but we noticed that low values
in general cause trouble. We therefore proposed two
experiments in which low values were excluded. After
several tests, we conclude that keeping only cases greater
than 5 and greater than 7 yields better metric values, with
MAE improvements of 24% and 38%, respectively. The
resulting model is better at predicting high values but
worse at predicting low values. Finally, we noticed that
humidity and temperature are the most important
predictors.
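The thresholding experiments can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this work: the record layout, the helper names `filter_min_cases` and `mae_by_band`, and all values are hypothetical. It shows how records above a case threshold might be selected and how the error can be reported separately for low and high case counts.

```python
def filter_min_cases(records, threshold):
    """Keep only records whose case count is strictly greater than threshold."""
    return [r for r in records if r["cases"] > threshold]


def mae(pairs):
    """Mean absolute error over (observed, predicted) pairs; None if empty."""
    pairs = list(pairs)
    if not pairs:
        return None
    return sum(abs(obs - pred) for obs, pred in pairs) / len(pairs)


def mae_by_band(records, threshold):
    """MAE for low (<= threshold) and high (> threshold) case counts."""
    low = [(r["cases"], r["predicted"]) for r in records if r["cases"] <= threshold]
    high = [(r["cases"], r["predicted"]) for r in records if r["cases"] > threshold]
    return {"low": mae(low), "high": mae(high)}


# Hypothetical records, NOT taken from the study's dataset.
records = [
    {"cases": 0, "predicted": 4.0},
    {"cases": 3, "predicted": 6.5},
    {"cases": 6, "predicted": 7.0},
    {"cases": 9, "predicted": 8.0},
    {"cases": 14, "predicted": 12.5},
]

for threshold in (5, 7):
    kept = filter_min_cases(records, threshold)
    print(f"threshold {threshold}: {len(kept)} records kept,",
          mae_by_band(records, threshold))
```

Reporting the error per band makes the trade-off noted above visible: a model trained mostly on high counts can score well overall while still missing low counts.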
As an extension of this work, a larger and better dataset,
with correctly collected meteorological and
epidemiological data, is necessary to obtain a better
model. In addition, NTDs include several diseases that
cause problems in different parts of the world.
A Regression Based Approach for Leishmaniasis Outbreak Detection