grams: they are probably less complex and smaller.
If the difference is due to this cause, it would mean
that double Pareto distributions are the signature of
the programming process, and that different programming
activities (scripting, coding of complex programs)
can be identified by different statistical distributions
of software size.
In any case, the double Pareto distribution already
has important practical implications for software
estimation. Previously proposed models (Zhang et al.,
2009) are based on the lognormal distribution, which
consistently and dangerously underestimates the size
of large files. It is true that large files are only a
minority in software projects (the so-called small
class/file phenomenon); however, they account for a
proportion of the total size as important as that of
small files. Therefore, the error introduced by the
lognormal assumption for large files will have a great
impact on the accuracy of the estimation of the size
of the overall system.
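As a rough illustration of this effect, the following Python sketch compares the probability of finding a file larger than a given size under a pure lognormal model and under a lognormal body joined to a power-law tail. The parameter values (MU, SIGMA, X_MIN, ALPHA) and the function names are hypothetical, chosen only for illustration and not fitted to the data of this study, and the piecewise tail is a simplification of the double Pareto distribution rather than its exact form.

```python
import math

# Hypothetical parameters, chosen only for illustration (not fitted to data).
MU = 5.0        # mean of log(size) for the lognormal body
SIGMA = 1.2     # standard deviation of log(size)
X_MIN = 1000.0  # size (e.g. lines of code) where the power-law tail starts
ALPHA = 1.5     # exponent of the power-law tail of the survival function


def lognormal_sf(x, mu=MU, sigma=SIGMA):
    """P(X > x) for a lognormal distribution."""
    z = (math.log(x) - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def pareto_tail_sf(x, x_min=X_MIN, alpha=ALPHA):
    """P(X > x) for a lognormal body with a power-law tail above x_min."""
    if x <= x_min:
        return lognormal_sf(x)
    # Above x_min the survival function decays as a power law,
    # scaled so that it is continuous at x_min.
    return lognormal_sf(x_min) * (x / x_min) ** (-alpha)


def underestimation_factor(x):
    """Ratio of the two tail probabilities at size x (1.0 means agreement)."""
    return pareto_tail_sf(x) / lognormal_sf(x)


if __name__ == "__main__":
    for size in (1000, 2000, 10000, 100000):
        print(size, lognormal_sf(size), pareto_tail_sf(size),
              underestimation_factor(size))
```

With these assumed parameters, the two models agree up to X_MIN, and the factor by which the lognormal model underestimates the frequency of large files grows with file size.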
REFERENCES
Baxter, G., Frean, M., Noble, J., Rickerby, M., Smith, H.,
Visser, M., Melton, H., and Tempero, E. (2006). Understanding
the shape of Java software. In Proceedings
of the ACM SIGPLAN Conference on Object-Oriented
Programming Systems, Languages, and Applications,
pages 397–412, New York, NY, USA. ACM.
Boehm, B. W. (1981). Software Engineering Economics.
Prentice Hall.
Clark, D. W. and Green, C. C. (1977). An empirical study
of list structure in Lisp. Communications of the ACM,
20(2):78–87.
Clauset, A., Shalizi, C. R., and Newman, M. E. J. (2007).
Power-law distributions in empirical data.
Concas, G., Marchesi, M., Pinna, S., and Serra, N. (2007).
Power-laws in a large object oriented software sys-
tem. IEEE Transactions on Software Engineering,
33(10):687–708.
Graves, T. L., Karr, A. F., Marron, J., and Siy, H. (2000).
Predicting fault incidence using software change his-
tory. IEEE Transactions on Software Engineering,
26(7):653–661.
Halstead, M. H. (1977). Elements of Software Science. El-
sevier, New York, USA.
Herraiz, I. (2009). A statistical examination of the evo-
lution and properties of libre software. In Proceed-
ings of the 25th IEEE International Conference on
Software Maintenance (ICSM), pages 439–442. IEEE
Computer Society.
Herraiz, I., Gonzalez-Barahona, J. M., and Robles, G.
(2007). Towards a theoretical model for software
growth. In International Workshop on Mining Soft-
ware Repositories, pages 21–30, Minneapolis, MN,
USA. IEEE Computer Society.
Jones, C. (1995). Backfiring: converting lines of code to
function points. Computer, 28(11):87–88.
Knuth, D. E. (1971). An empirical study of FORTRAN pro-
grams. Software Practice and Experience, 1(2):105–
133.
Louridas, P., Spinellis, D., and Vlachos, V. (2008). Power
laws in software. ACM Transactions on Software En-
gineering and Methodology, 18(1).
McCabe, T. J. (1976). A complexity measure. IEEE Trans-
actions on Software Engineering, SE-2(4):308–320.
Mitzenmacher, M. (2004a). A brief history of generative
models for power law and lognormal distributions. In-
ternet Mathematics, 1(2):226–251.
Mitzenmacher, M. (2004b). Dynamic models for file sizes
and double Pareto distributions. Internet Mathemat-
ics, 1(3):305–333.
R Development Core Team (2009). R: A Language and
Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria. ISBN 3-
900051-07-0.
Robles, G., Gonzalez-Barahona, J. M., Michlmayr, M.,
Amor, J. J., and German, D. M. (2009). Macro-
level software evolution: A case study of a large soft-
ware compilation. Empirical Software Engineering,
14(3):262–285.
Robles, G., Gonzalez-Barahona, J. M., and Michlmayr, M.
(2005). Evolution of volunteer participation in libre
software projects: evidence from Debian. In Pro-
ceedings of the 1st International Conference on Open
Source Systems, pages 100–107, Genoa, Italy.
Woodfield, S. N., Dunsmore, H. E., and Shen, V. Y. (1981).
The effect of modularization and comments on program
comprehension. In Proceedings of the 5th International
Conference on Software Engineering, ICSE
’81, pages 215–223, Piscataway, NJ, USA. IEEE
Press.
Zhang, H., Tan, H. B. K., and Marchesi, M. (2009). The
distribution of program sizes and its implications: An
Eclipse case study. In Proceedings of the 1st International
Symposium on Emerging Trends in Software Metrics
(ETSW 2009).
ICSOFT 2011 - 6th International Conference on Software and Data Technologies