IMPROVING THE PERFORMANCE OF THE RIPPER IN INSURANCE RISK CLASSIFICATION - A Comparitive Study using Feature Selection

Mlungisi Duma, Bhekisipho Twala, Tshilidzi Marwala, Fulufhelo V. Nelwamondo

Abstract

The Ripper algorithm is designed to generate rule sets for large datasets with many features. However, it has been shown that its classification performance suffers in the presence of missing data: the algorithm struggles to classify instances as data quality deteriorates with an increasing proportion of missing values. In this paper, feature selection techniques are used to improve the classification performance of the Ripper algorithm. Two techniques are considered, principal component analysis and evidence automatic relevance determination, and they are compared to see which improves the algorithm the most. Training datasets with completely observable data were used to construct the algorithm, and testing datasets with missing values were used to measure accuracy. The results showed that principal component analysis is the better feature selection technique for the Ripper: with it, classification performance improves significantly and the algorithm becomes more resilient to escalating missing data.
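The evaluation protocol described in the abstract (build the model on complete data, then measure accuracy on test data with growing amounts of missing values) can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the toy data, the ordered rule list standing in for a learned Ripper rule set, the helper names `inject_missing`, `classify` and `accuracy`, and the assumption that values go missing uniformly at random are all hypothetical. The feature selection step itself is not shown.

```python
import random


def inject_missing(rows, rate, rng):
    """Return a copy of rows in which each value is replaced by None
    with probability `rate` (assumed: missing uniformly at random)."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def classify(row):
    """Hypothetical ordered rule list standing in for a learned rule set.
    A rule cannot fire when the value it tests is missing."""
    age, claims = row
    if claims is not None and claims >= 2:
        return "high"
    if age is not None and age < 25:
        return "high"
    return "low"  # default rule when no other rule fires


def accuracy(rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(1 for row, y in zip(rows, labels) if classify(row) == y)
    return hits / len(rows)


# Made-up test set: (age, number of past claims) -> risk label.
test_rows = [[22, 0], [40, 3], [35, 0], [19, 2], [50, 1], [30, 0]]
test_labels = ["high", "high", "low", "high", "low", "low"]

rng = random.Random(0)  # fixed seed so the run is repeatable
results = {}
for rate in (0.0, 0.25, 0.5, 1.0):
    degraded = inject_missing(test_rows, rate, rng)
    results[rate] = accuracy(degraded, test_labels)

for rate, acc in results.items():
    print(f"missing rate {rate:.2f}: accuracy {acc:.2f}")
```

With no missing values the toy rules classify every instance correctly; once every value is missing, only the default rule can fire, so accuracy falls to the share of the default class. The intermediate rates show the kind of degradation curve against which a feature selection technique would be judged.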



Paper Citation


in Harvard Style

Duma M., Twala B., Marwala T. and Nelwamondo F. V. (2011). IMPROVING THE PERFORMANCE OF THE RIPPER IN INSURANCE RISK CLASSIFICATION - A Comparitive Study using Feature Selection. In Proceedings of the 8th International Conference on Informatics in Control, Automation and Robotics - Volume 1: ICINCO, ISBN 978-989-8425-74-4, pages 203-210. DOI: 10.5220/0003531902030210


in Bibtex Style

@conference{icinco11,
author={Mlungisi Duma and Bhekisipho Twala and Tshilidzi Marwala and Fulufhelo V. Nelwamondo},
title={IMPROVING THE PERFORMANCE OF THE RIPPER IN INSURANCE RISK CLASSIFICATION - A Comparitive Study using Feature Selection},
booktitle={Proceedings of the 8th International Conference on Informatics in Control, Automation and Robotics - Volume 1: ICINCO},
year={2011},
pages={203-210},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003531902030210},
isbn={978-989-8425-74-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 8th International Conference on Informatics in Control, Automation and Robotics - Volume 1: ICINCO
TI - IMPROVING THE PERFORMANCE OF THE RIPPER IN INSURANCE RISK CLASSIFICATION - A Comparitive Study using Feature Selection
SN - 978-989-8425-74-4
AU - Duma M.
AU - Twala B.
AU - Marwala T.
AU - Nelwamondo F. V.
PY - 2011
SP - 203
EP - 210
DO - 10.5220/0003531902030210