Israel Herraiz, Daniel M. German, Ahmed E. Hassan


Source code size is an estimator of software effort, and it is often used to calibrate the models and equations that estimate the cost of software. The literature has reported that source code file sizes follow a lognormal distribution. In this paper, we measure the size of a large collection of software (the Debian GNU/Linux distribution, version 5.0.2) and find that the statistical distribution of its source code file sizes follows a double Pareto distribution. Large files therefore occur more often than the lognormal distribution predicts, which implies that the previously proposed models underestimate the cost of software.
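The practical difference between the two distributions lies in the upper tail. The sketch below is illustrative only and is not the paper's fitting procedure: it compares the closed-form tail probability P(X > t) of a lognormal with that of a double Pareto whose density behaves like x^(beta-1) below a scale x0 and like x^(-alpha-1) above it. The parameter values (alpha, beta, x0) are hypothetical, not values estimated from the Debian data, and the lognormal is given the same log-scale variance so the comparison is roughly like for like.

```python
import math


def lognormal_tail(t, mu, sigma):
    """P(X > t) for a lognormal with log-mean mu and log-std sigma."""
    z = (math.log(t) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * math.erfc(z)


def double_pareto_tail(t, alpha, beta, x0=1.0):
    """P(X > t) for t >= x0 under a double Pareto distribution.

    The density is proportional to x**(beta - 1) below x0 and to
    x**(-alpha - 1) above x0, so the upper tail is a pure power law.
    """
    if t < x0:
        raise ValueError("this closed form only covers t >= x0")
    return beta / (alpha + beta) * (t / x0) ** (-alpha)


# Hypothetical parameters, chosen only for illustration.
ALPHA, BETA = 1.5, 1.5
# Give the lognormal the same variance on the log scale as the double Pareto,
# whose log is the difference of two exponentials with rates ALPHA and BETA.
MU = 0.0
SIGMA = math.sqrt(1.0 / ALPHA**2 + 1.0 / BETA**2)

for size in (10, 100, 1000):  # file size in units of x0
    ln_p = lognormal_tail(size, MU, SIGMA)
    dp_p = double_pareto_tail(size, ALPHA, BETA)
    print(f"size > {size:>4}: lognormal {ln_p:.3e}  double Pareto {dp_p:.3e}")
```

Under these assumed parameters the power-law tail assigns far more probability to very large files than the lognormal does, which is the mechanism behind the underestimation described in the abstract.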



Paper Citation

in Harvard Style

Herraiz I., German D. and Hassan A. (2011). ON THE DISTRIBUTION OF SOURCE CODE FILE SIZES. In Proceedings of the 6th International Conference on Software and Database Technologies - Volume 2: ICSOFT, ISBN 978-989-8425-77-5, pages 5-14. DOI: 10.5220/0003426200050014
