Finally, the last group consists of both TF-IDF methods
and the combination of the status method followed by
TF-IDF, all of which performed well. Both TF-IDF methods
performed very well on both datasets, with respectively
88% and 73% of the events labelled as irrelevant with
(almost) no loss of information. The contextual TF-IDF
method therefore barely improves on the accuracy of the
traditional TF-IDF on event logs. The only drawback of
these methods is their computation time, which went up
to one hour and a half for the irregular dataset.
Applying the status method before TF-IDF can
significantly reduce the computation time of TF-IDF
without impacting the accuracy of the method. The TF-IDF
scores also remained similar to the ones obtained without
the status method as a preprocessing step. The computation
time dropped from 28 minutes to 17 minutes on the regular
dataset and was reduced by 65 minutes (77%) on the
irregular dataset (from 84 minutes to 19 minutes).
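The reported savings can be checked with a few lines of arithmetic. The sketch below assumes that the 84-to-19-minute figure refers to the irregular dataset (whose TF-IDF run took about an hour and a half) and that the 28-to-17-minute figure refers to the regular one; the helper name `reduction` is ours.

```python
def reduction(before_min, after_min):
    """Return (minutes saved, percentage saved) for one run."""
    saved = before_min - after_min
    return saved, 100.0 * saved / before_min

# Timings taken from the text; the dataset assignment is an assumption.
timings = [("regular", 28, 17), ("irregular", 84, 19)]

for name, before, after in timings:
    saved, pct = reduction(before, after)
    print(f"{name}: {saved} min saved ({pct:.0f}%)")
```

Under these assumptions, the relative saving on the regular dataset is 11 minutes, i.e. roughly 39%.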
In conclusion, the Shannon index, compression, pattern,
contextual TF-IDF and Morning and evening methods have
not been found to be suited to industrial event logs. A
good way to label events as relevant or irrelevant is to
first apply a domain-based method, such as the status
method, to perform a quick initial labelling. The TF-IDF
method can then be used to help label the remaining
events. A statistical method can also be used for
irregular datasets. The two methods are complementary, as
statistical methods only focus on the correlation between
events and outages, while the TF-IDF method sorts the
events by their relevancy and uniqueness, without any
regard for the outages.
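The two-step labelling described above can be sketched as follows. This is a minimal illustration under several assumptions that are ours: events are `(document, event_type)` pairs where a document is one day of logs, status events are recognised through a hypothetical set `STATUS_TYPES`, each event type is scored by its highest TF-IDF value over all documents, and a hypothetical `threshold` separates relevant from irrelevant events.

```python
import math
from collections import Counter, defaultdict

# Hypothetical event types treated as pure status reports.
STATUS_TYPES = {"state_on", "state_off"}

def drop_status_events(events, status_types=STATUS_TYPES):
    """Domain-based first pass: discard events of a known status type."""
    return [(doc, ev) for doc, ev in events if ev not in status_types]

def tfidf_scores(events):
    """Return the highest TF-IDF value of each event type over all documents."""
    counts = defaultdict(Counter)          # document -> event type -> count
    for doc, ev in events:
        counts[doc][ev] += 1
    n_docs = len(counts)
    doc_freq = Counter(ev for c in counts.values() for ev in c)
    best = defaultdict(float)
    for c in counts.values():
        total = sum(c.values())
        for ev, n in c.items():
            score = (n / total) * math.log(n_docs / doc_freq[ev])
            best[ev] = max(best[ev], score)
    return dict(best)

def label_events(events, threshold):
    """Status filter first, then TF-IDF labelling of what remains."""
    scores = tfidf_scores(drop_status_events(events))
    return {ev: "relevant" if s > threshold else "irrelevant"
            for ev, s in scores.items()}

# Toy log: a heartbeat seen every day and a fault seen only once.
sample = [
    ("day1", "state_on"), ("day1", "heartbeat"), ("day1", "heartbeat"),
    ("day2", "state_off"), ("day2", "heartbeat"), ("day2", "grid_fault"),
]
print(label_events(sample, threshold=0.1))
```

Running the status filter first shrinks the input of the TF-IDF step, which is where the computation-time saving comes from; the event that occurs in every document gets a zero inverse document frequency and falls below any positive threshold.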
6 CONCLUSION AND FURTHER RESEARCH
In this paper, we have considered and evaluated 10
methods (from various research fields) to estimate event
relevancy in industrial event logs, in order to detect
irrelevant events that could be discarded during the
preprocessing of voluminous data. These methods have been
benchmarked on two datasets containing real industrial
event logs from two PV plants. We have found that a
combination of two methods (one removing the state events
and one applying TF-IDF) makes it possible to label up to
90% of the events as irrelevant with a reasonable
computation time.
For further research, we intend to evaluate other
score-based methods from the static index pruning field,
especially BM25, BB2 or the Rényi divergence used by
(Chen et al., 2015), and to benchmark them on industrial
event logs. In addition, the statistical method can also
be applied to device-specific datasets, i.e. datasets
containing the event logs of devices of the same type
from multiple plants. This may make it possible to create
a device-specific ranking that could then be applied to
all devices of that type without pre-processing of the
data. However, a thorough study of these scores would
need to be performed to assess, e.g., whether the location
of the device has an impact on these event scores.
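To give an idea of what a BM25-based event score could look like, the sketch below computes the standard Okapi BM25 weight of each event type within each document. How BM25 would actually be adapted to event logs is left open here, so everything specific is an assumption of ours: a document is one day of logs, `k1=1.2` and `b=0.75` are commonly used default values, and the non-negative inverse document frequency variant `log((N - n + 0.5) / (n + 0.5) + 1)` is used.

```python
import math
from collections import Counter

def bm25_scores(docs, k1=1.2, b=0.75):
    """Okapi BM25 weight of every event type in every document.

    docs maps a document key (a day, in this sketch) to the list of
    event types logged in it. Returns {(key, event_type): weight}.
    """
    if not docs:
        return {}
    n_docs = len(docs)
    avg_len = sum(len(evs) for evs in docs.values()) / n_docs
    # Number of documents in which each event type occurs at least once.
    doc_freq = Counter(ev for evs in docs.values() for ev in set(evs))
    scores = {}
    for key, evs in docs.items():
        if not evs:
            continue  # nothing to score, and avoids dividing by avg_len == 0
        # Length normalisation: longer-than-average documents are penalised.
        norm = k1 * (1 - b + b * len(evs) / avg_len)
        for ev, f in Counter(evs).items():
            n = doc_freq[ev]
            idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1)
            scores[(key, ev)] = idf * f * (k1 + 1) / (f + norm)
    return scores

# Toy example: "a" occurs in every document, "b" in only one.
toy = {"d1": ["a", "a", "b"], "d2": ["a"]}
for (key, ev), w in sorted(bm25_scores(toy).items()):
    print(key, ev, round(w, 3))
```

Compared with plain TF-IDF, the term-frequency part saturates as an event repeats within a document, which may matter for logs where a single device emits long bursts of the same event.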
ACKNOWLEDGEMENTS
This work was subsidised by the Region of Bruxelles-
Capitale - Innoviris.
REFERENCES
Billerbeck, B. and Zobel, J. (2004). Techniques for effi-
cient query expansion. In International Symposium
on String Processing and Information Retrieval, pages
30–42. Springer.
Bonchi, F., Giannotti, F., Mazzanti, A., and Pedreschi, D.
(2003). Exante: Anticipated data reduction in con-
strained pattern mining. In European Conference on
Principles of Data Mining and Knowledge Discovery,
pages 59–70. Springer.
Bose, R. J. C., Mans, R. S., and van der Aalst, W. M. (2013).
Wanna improve process mining results? In Compu-
tational Intelligence and Data Mining (CIDM), 2013
IEEE Symposium on, pages 127–134. IEEE.
Carmel, D., Cohen, D., Fagin, R., Farchi, E., Herscovici,
M., Maarek, Y. S., and Soffer, A. (2001). Static index
pruning for information retrieval systems. In Proceed-
ings of the 24th annual international ACM SIGIR con-
ference on Research and development in information
retrieval, pages 43–50. ACM.
Chen, R.-C., Lee, C.-J., and Croft, W. B. (2015). On diver-
gence measures and static index pruning. In Proceed-
ings of the 2015 International Conference on The The-
ory of Information Retrieval, pages 151–160. ACM.
Conforti, R., La Rosa, M., and ter Hofstede, A. H. (2016).
Filtering out infrequent behavior from business pro-
cess event logs. IEEE Transactions on Knowledge and
Data Engineering.
Cooley, R., Mobasher, B., and Srivastava, J. (1999). Data
preparation for mining world wide web browsing pat-
terns. Knowledge and information systems, 1(1):5–32.
Dagnely, P., Tsiporkova, E., Tourwe, T., Ruette, T., De Bra-
bandere, K., and Assiandi, F. (2015). A semantic
model of events for integrating photovoltaic monitor-
ing data. In Industrial Informatics (INDIN), 2015
IEEE 13th International Conference on, pages 24–30.
De Moura, E. S., dos Santos, C. F., Fernandes, D. R., Silva,
A. S., Calado, P., and Nascimento, M. A. (2005).
Improving web search efficiency via a locality based
static pruning method. In Proceedings of the 14th
international conference on World Wide Web, pages
235–244. ACM.
Fu, X., Ren, R., Zhan, J., Zhou, W., Jia, Z., and Lu,
G. (2012). LogMaster: mining event correlations in
logs of large-scale cluster systems. In Reliable Dis-
tributed Systems (SRDS), 2012 IEEE 31st Symposium
on, pages 71–80. IEEE.
Data-driven Relevancy Estimation for Event Logs Exploration and Preprocessing