distance within groups and the average distance
between groups is minimized as shown in criterion
(10):
4.3 Illustrative Example
Phase one of the proposed approach takes Matrix (4)
as input and rearranges its rows and columns to obtain
a sequence of records. As shown in Matrix (5), the
sequence produced is (1 3 7 4 8 5 9 2 6 10). Phase two
takes the sequence and classifies data records into
groups to minimize criterion (10). The resulting
grouping is {1,3,7}, {4,8}, {5,9}, {2}, {6},
{10}.
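The relationship between the two phases can be checked with a minimal sketch. It assumes, based only on the sequence and grouping reported above, that phase two cuts the phase-one sequence into consecutive segments; the function names `is_contiguous_partition` and `cut_points` are hypothetical and are not part of the proposed approach.

```python
# Sequence from phase one and grouping from phase two, as reported in the example.
sequence = [1, 3, 7, 4, 8, 5, 9, 2, 6, 10]
groups = [[1, 3, 7], [4, 8], [5, 9], [2], [6], [10]]


def is_contiguous_partition(sequence, groups):
    """Return True if concatenating the groups in order reproduces the sequence."""
    flattened = [record for group in groups for record in group]
    return flattened == list(sequence)


def cut_points(groups):
    """Return the positions in the sequence at which a new group starts."""
    cuts, position = [], 0
    for group in groups[:-1]:
        position += len(group)
        cuts.append(position)
    return cuts


print(is_contiguous_partition(sequence, groups))  # True
print(cut_points(groups))  # [3, 5, 7, 8, 9]
```

Under this assumption, a grouping of the ten records is fully described by its cut positions, so phase two only has to decide where to cut the sequence.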
Unlike most existing methods, our approach does
not require any parameters. Both the generation of
the sequence of data records and the assignment of
data records to groups are automatic.
Based on the grouping result, we conclude that
records 2, 6, and 10 are outliers. A survey
administrator would then examine these outliers and
determine whether they contain any errors. Our
example postulates that an employee with a college
degree, in a middle management position, and with
more years of seniority should have a higher current
salary, whereas an employee without a college
degree, in a supervisor position, and with fewer
years of seniority should have a lower current
salary. However, record 2 indicates that the employee
has a college degree and is at the middle management
level but has a relatively low current salary. Record 10
indicates that the employee has an exceptionally high
current salary for his or her position (supervisor) and
education (no degree). Record 6 may or may not
contain errors.
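The step from the grouping to the outlier list can be sketched as follows. The sketch assumes that a record is flagged as an outlier when it forms a group by itself, which is consistent with the result above but is our reading of the example rather than a rule stated explicitly; `singleton_outliers` is a hypothetical helper name.

```python
# Grouping reported in the illustrative example.
groups = [[1, 3, 7], [4, 8], [5, 9], [2], [6], [10]]


def singleton_outliers(groups):
    """Return, in ascending order, the records that form a group by themselves."""
    return sorted(group[0] for group in groups if len(group) == 1)


print(singleton_outliers(groups))  # [2, 6, 10]
```

The flagged records are only candidates for review: as with record 6, a singleton record may or may not contain an error, and the decision remains with the survey administrator.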
5 CONCLUSION
In this work, we discussed the use of clustering
algorithms for assessing the quality of Internet
survey data. Limitations of some existing
approaches were identified. To address these
limitations, we first examined the nature of the
quality problem of Internet surveys and then
established that the problem is equivalent to a TSP.
Our proposed approach exploits the underlying TSP
structure. Although many algorithms could be used
for the TSP-quality problem, we adopt a genetic
algorithm for its computational efficiency.
Compared to the existing approaches, our
approach provides a better understanding of the
nature of the Internet survey problem and appears to
offer potential for improvement. However, the quality
model must still be implemented and tested to verify
these claims.
A TWO-STAGED APPROACH FOR ASSESSING FOR THE QUALITY OF INTERNET SURVEY DATA