
 
distance within groups and the average distance 
between groups is minimized as shown in criterion 
(10): 
4.3 Illustrative Example 
Phase one of the proposed approach takes Matrix (4) 
as input and rearranges its rows and columns to obtain 
a sequence of records. As shown in Matrix (5), the 
sequence produced is (1 3 7 4 8 5 9 2 6 10). Phase two 
takes the sequence and classifies data records into 
groups to minimize criterion (10). The resulting
grouping is {1,3,7}, {4,8}, {5,9}, {2}, {6}, {10}.
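The relationship between the sequence and the groups can be
sketched in a few lines of Python. Each reported group is a
run of consecutive records in the sequence, so the grouping
amounts to choosing where to cut it. This is only an
illustration: the function name split_sequence is
hypothetical, and the group sizes are read off the result
reported above rather than computed by minimizing
criterion (10).

```python
from typing import List, Sequence


def split_sequence(sequence: Sequence[int],
                   group_sizes: Sequence[int]) -> List[List[int]]:
    """Cut a record sequence into consecutive groups of the given sizes."""
    if sum(group_sizes) != len(sequence):
        raise ValueError("group sizes must cover the whole sequence")
    groups, start = [], 0
    for size in group_sizes:
        groups.append(list(sequence[start:start + size]))
        start += size
    return groups


# Sequence produced by phase one, as shown in Matrix (5).
sequence = [1, 3, 7, 4, 8, 5, 9, 2, 6, 10]
# Assumption: sizes taken from the reported grouping, not derived here.
group_sizes = [3, 2, 2, 1, 1, 1]

groups = split_sequence(sequence, group_sizes)
print(groups)  # [[1, 3, 7], [4, 8], [5, 9], [2], [6], [10]]
```

In the actual approach, phase two would select the cut
points itself; the sketch only shows that the reported
groups are consistent with the reported sequence.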
Unlike most existing methods, our approach does not
require any parameters. Both the generation of a
sequence of data records and the assignment of data
records into groups are automatic.
Based on the grouping result, we conclude that
records 2, 6, and 10, each of which forms a group of
its own, are outliers. A survey administrator will
examine these outliers and determine whether they
contain errors. Our example postulates that an
employee with a college degree, in a middle
management position, and with more years of
seniority should have a higher current salary.
Conversely, an employee without a college degree, in
a supervisor position, and with fewer years of
seniority should have a lower current salary.
However, record 2 indicates that the employee has a
college degree and is in a middle management
position but has a relatively low current salary.
Record 10 indicates that the employee has an
exceptionally high current salary for his/her position
(i.e., supervisor) and education (i.e., no degree).
Record 6 may or may not contain errors.
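The step from grouping result to outlier list can be
sketched as follows. The sketch assumes that an outlier is
any record left in a group by itself, which matches the
example (records 2, 6, and 10) but is not stated as a
general rule in the text. The function name
singleton_outliers is hypothetical.

```python
from typing import Iterable, List, Set


def singleton_outliers(groups: Iterable[Set[int]]) -> List[int]:
    """Return the records that sit alone in their group, in ascending order."""
    # Assumption: a group of size one marks its record as an outlier candidate.
    return sorted(next(iter(group)) for group in groups if len(group) == 1)


# Grouping result reported for the illustrative example.
groups = [{1, 3, 7}, {4, 8}, {5, 9}, {2}, {6}, {10}]

outliers = singleton_outliers(groups)
print(outliers)  # [2, 6, 10]
```

The flagged records are only candidates for review. As the
discussion of record 6 shows, a flagged record does not
necessarily contain an error.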
5 CONCLUSION 
In this work, we discussed the use of clustering 
algorithms for assessing the quality of Internet 
survey data. Limitations of some existing 
approaches were identified. To address these 
limitations, we first examined the nature of the 
quality problem of Internet surveys and then 
established that the problem is equivalent to a TSP. 
Our proposed approach exploits the underlying TSP 
structure. Although many algorithms may be used
for the TSP-quality problem, we adopt a genetic
algorithm for its computational efficiency.
Compared to the existing approaches, our approach
provides a better understanding of the nature of the
Internet survey problem and appears to offer
potential for improvement. However, the quality
model must still be implemented and tested to verify
these claims.
A TWO-STAGED APPROACH FOR ASSESSING FOR THE QUALITY OF INTERNET SURVEY DATA