Figure 4: Croujaction Cluster Results.
and compress it to a proper size for a better showcase.
Figure 4 shows snapshots of the clustering results.
Red lines separate the different job name clusters.
We observe that as the average similarity threshold
increases, more groups are formed. A higher similarity
threshold yields better clustering performance: the
diagram for the 90% threshold shows a more accurate
job name clustering than the diagram for 50%.
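
The effect of the threshold on the number of groups can
be illustrated with a minimal sketch. It assumes plain
average-linkage agglomerative clustering over a
precomputed similarity matrix, which may differ from the
exact procedure used by Croujaction. The matrix SIM and
the function names are hypothetical and chosen only for
illustration.

```python
from itertools import combinations


def average_similarity(a, b, sim):
    """Mean pairwise similarity between the members of clusters a and b."""
    return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))


def cluster(sim, threshold):
    """Merge the most similar pair of clusters until no pair reaches threshold."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        # Pick the pair of clusters with the highest average similarity.
        x, y = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_similarity(clusters[p[0]], clusters[p[1]], sim),
        )
        if average_similarity(clusters[x], clusters[y], sim) < threshold:
            break  # no remaining pair is similar enough to merge
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x stays valid
    return clusters


# Hypothetical similarity matrix for five job names (illustrative values only).
SIM = [
    [1.0, 0.95, 0.6, 0.1, 0.1],
    [0.95, 1.0, 0.6, 0.1, 0.1],
    [0.6, 0.6, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.92],
    [0.1, 0.1, 0.1, 0.92, 1.0],
]

low = cluster(SIM, 0.5)   # looser threshold: fewer, larger groups
high = cluster(SIM, 0.9)  # stricter threshold: more, tighter groups
```

With these illustrative values, the 0.5 threshold merges
item 2 into the first group, while the 0.9 threshold
leaves it as a group of its own.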
6 CONCLUSION AND FUTURE WORK
In this paper, we presented a novel approach to text
clustering. We introduced the Croujaction algorithm
for Hadoop log file analysis. It helps overcome a
limitation of TF-IDF-based text clustering that arises
when the data are stored in different columns of a log
file. In our experimental evaluation on the data set,
we found a correlation between different columns and
grouped the job names of each user name into one
document. This provides an efficient foundation for
text clustering. The paper presents a methodology for
analyzing and clustering the text contents of log files
and details an approach that could be used to refine
correlations in column-formatted data.
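
The idea of treating all job names of one user as a
single document can be sketched as follows. This is a
minimal illustration under stated assumptions, not the
implementation evaluated in the paper: the rows in ROWS
are invented, the underscore delimiter is assumed, and
the weighting uses one common TF-IDF variant,
tf * log(N / df).

```python
import math
from collections import Counter, defaultdict

# Hypothetical (user name, job name) pairs; real Hadoop logs will differ.
ROWS = [
    ("alice", "daily_sales_report"),
    ("alice", "daily_sales_export"),
    ("bob", "daily_sales_backup"),
    ("carol", "image_resize_batch"),
]


def build_documents(rows):
    """Collect the job-name terms of each user into one document."""
    docs = defaultdict(list)
    for user, job in rows:
        docs[user].extend(job.split("_"))  # assumed delimiter
    return dict(docs)


def tf_idf(docs):
    """Return {user: {term: weight}} with weight = tf * log(N / df)."""
    n = len(docs)
    df = Counter(term for terms in docs.values() for term in set(terms))
    vectors = {}
    for user, terms in docs.items():
        tf = Counter(terms)
        vectors[user] = {
            t: (c / len(terms)) * math.log(n / df[t]) for t, c in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


vectors = tf_idf(build_documents(ROWS))
sim_alice_bob = cosine(vectors["alice"], vectors["bob"])
sim_alice_carol = cosine(vectors["alice"], vectors["carol"])
```

In this toy data set, alice and bob share the terms
"daily" and "sales", so their documents are more similar
to each other than either is to carol's.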
6.1 Limitation and Advantage
Our algorithm builds on TF-IDF, which we have found to
be simple and efficient for calculating the similarity
between texts. TF-IDF also has limitations. In
particular, it does not capture the relationship
between synonymous words. Our system is not affected
by this limitation because we do not need to consider
semantic synonyms: we treat every word as a string
object and compare words character by character.
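
A short sketch of what exact string comparison implies:
two terms match only if they are identical character by
character, so synonyms never count as shared terms. The
function name and the underscore delimiter are
assumptions made for this example.

```python
def shared_terms(job_a, job_b, delimiter="_"):
    """Return the terms that appear, character for character, in both job names."""
    return set(job_a.split(delimiter)) & set(job_b.split(delimiter))


# "report" and "summary" are synonyms for a human reader,
# but as strings they differ, so only "sales" is shared.
common = shared_terms("sales_report", "sales_summary")
```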
6.2 Future Work
Finally, we would like to outline a few directions for
future research. We have noticed that the most
important parts of text-based job name clustering are
1) data preprocessing, 2) building the vector space,
and 3) the clustering algorithm.
Data preprocessing could be improved, in particular by
a better replacement rule for meaningless characters.
This could significantly speed up the term delimiting
process and help build the vector space more precisely.
When the system calculates the vector coordinate of a
word in a job name, it could take more properties of
the word into account and apply a weight to each word.
This may make the similarity calculation more accurate.
In our current approach, the system uses only the
simplest hierarchical clustering method in the last
clustering step. We plan to integrate other data mining
algorithms into our system, which could improve the
time efficiency and memory usage of the clustering
process. From another perspective, we could deepen the
correlation analysis across more columns to complement
the vector space building of the TF-IDF algorithm.
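
One possible shape of such a replacement rule and of a
per-word weight is sketched below. Both are assumptions
made for illustration: the rule treats every run of
non-letter characters as a delimiter, and the weight
simply decays with the position of the word in the job
name. Neither is the rule used in our experiments.

```python
import re


def delimit_terms(job_name):
    """Replace each run of non-letter characters with a space, then split."""
    cleaned = re.sub(r"[^A-Za-z]+", " ", job_name)
    return cleaned.lower().split()


def weighted_terms(terms, decay=0.5):
    """Hypothetical weighting: a term at position i contributes decay ** i."""
    weights = {}
    for i, term in enumerate(terms):
        weights[term] = weights.get(term, 0.0) + decay ** i
    return weights


terms = delimit_terms("Daily-Sales_Report.2014#07")
weights = weighted_terms(terms)
```

Here the digits and punctuation are dropped during
delimiting, and earlier words in the job name receive
larger weights than later ones.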
ACKNOWLEDGEMENT
This work was financially supported by China Intelli-
gent Urbanization Co-Creation Center for High Den-
sity Region (CIUC2014004).
REFERENCES
Beil, F., Ester, M., and Xu, X. (2002). Frequent term-based
text clustering. In Proceedings of the eighth ACM
SIGKDD international conference on Knowledge dis-
covery and data mining, pages 436–442. ACM.
Chandola, V., Banerjee, A., and Kumar, V. (2009).
Anomaly detection: A survey. ACM Comput. Surv.,
41(3):15:1–15:58.
Cutting, D. R., Karger, D. R., Pedersen, J. O., and Tukey,
J. W. (1992). Scatter/gather: A cluster-based approach
to browsing large document collections. In Proceed-
ings of the 15th annual international ACM SIGIR con-
ference on Research and development in information
retrieval, pages 318–329. ACM.
Fu, Q., Lou, J.-G., Wang, Y., and Li, J. (2009). Execu-
tion anomaly detection in distributed systems through
unstructured log analysis. In Data Mining, 2009.
ICDM’09. Ninth IEEE International Conference on,
pages 149–158. IEEE.
Huang, Z. (1998). Extensions to the k-means algorithm for
clustering large data sets with categorical values. Data
mining and knowledge discovery, 2(3):283–304.
Croujaction - A Novel Approach to Text-based Job Name Clustering with Correlation Analysis