umn “Value” in Figure 5 shows the estimates. Although
automatically assigning tags to a dataset would be
ideal, because of the limited accuracy of the recommendations,
we assume that users select the proper
tags from the recommended ones.
5 DISCUSSION
The frequency of some tags in Categories 1 to 4 is 0.
Multi-label classification cannot predict tags that
do not appear in the training data, and oversampling
methods cannot generate feature and label vectors for
such tags. One method that could be adapted to predict
such unseen tags is zero-shot learning.
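One way zero-shot learning could apply here is to score a tag that never occurs in the training data by its similarity to the dataset title in a shared embedding space. A minimal sketch of that idea, with toy hand-made word vectors standing in for real pretrained embeddings (the vectors, words, and function names are illustrative, not from the present study):

```python
from math import sqrt

# Toy word vectors standing in for pretrained embeddings
# (in practice these would come from word2vec, fastText, etc.).
VECTORS = {
    "budget":  [0.9, 0.1, 0.0],
    "finance": [0.8, 0.2, 0.1],
    "river":   [0.1, 0.9, 0.2],
    "water":   [0.2, 0.8, 0.3],
}

def embed(words):
    """Average the vectors of known words (zero vector if none are known)."""
    known = [VECTORS[w] for w in words if w in VECTORS]
    if not known:
        return [0.0] * 3
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def zero_shot_rank(title_words, candidate_tags):
    """Rank candidate tags (even unseen ones) by similarity to the title."""
    title_vec = embed(title_words)
    scored = [(cosine(title_vec, embed([t])), t) for t in candidate_tags]
    return [tag for score, tag in sorted(scored, reverse=True)]

# "finance" was never used as a tag in training, yet it can still be ranked.
ranking = zero_shot_rank(["budget"], ["finance", "water"])
```

The key property is that a tag needs only an embedding of its name, not any training examples, to receive a score.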
The accuracy of the tag recommendations also remains
to be evaluated. Since various multi-label classification
and oversampling methods have been proposed, we
need to compare our approach against them.
In the present study, we set the frequency threshold
separating infrequent tags from the others to 10 for
the datasets of DATA.GO.JP. Automatically determining
the threshold at which oversampling becomes effective
is an important open problem.
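For a fixed threshold, the split itself is straightforward: count how often each tag occurs across the dataset collection and treat tags below the threshold as infrequent. A minimal sketch (the threshold of 10 follows the study; the tag lists are invented for illustration):

```python
from collections import Counter

def split_by_frequency(tag_lists, threshold=10):
    """Partition tags into infrequent (count < threshold) and frequent ones."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    infrequent = {t for t, c in counts.items() if c < threshold}
    return infrequent, set(counts) - infrequent

# Hypothetical tag lists: "statistics" occurs 12 times, "fishery" 3 times.
tag_lists = [["statistics"]] * 12 + [["fishery"]] * 3
infrequent, frequent = split_by_frequency(tag_lists)
```

Automating the choice of `threshold` would mean searching over such splits and measuring where oversampling actually improves validation accuracy.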
Since multi-label classification cannot output tags
that do not appear in the training data, we need
to extract new tags from a dataset itself. We previously
proposed a method for extracting particular noun
phrases (Yamada et al., 2018; Yamada and Nakatoh,
2018) as characteristic phrases and words, which can
be used as new tags that represent the dataset.
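The cited extraction method is described in the earlier papers; purely as a rough illustration of the general idea (not the method of Yamada et al.), candidate phrases can be obtained by splitting a title at stopwords and punctuation and keeping the chunks that are not already tags:

```python
import re

# Minimal stopword list for illustration only; a real extractor would use
# a proper stopword list and part-of-speech patterns.
STOPWORDS = {"of", "the", "in", "on", "for", "and", "a", "an", "by"}

def candidate_phrases(title, existing_tags):
    """Split the title at stopwords; keep chunks not already used as tags."""
    words = re.findall(r"[A-Za-z]+", title.lower())
    chunks, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                chunks.append(" ".join(current))
                current = []
        else:
            current.append(w)
    if current:
        chunks.append(" ".join(current))
    return [c for c in chunks if c not in existing_tags]

phrases = candidate_phrases("Population Census of Okinawa Prefecture",
                            existing_tags={"population census"})
```

Chunks that survive the filter are candidates for entirely new tags that no classifier trained on the existing tag set could produce.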
6 CONCLUSION
This paper proposed a tag recommendation system
for DATA.GO.JP, the data catalog site of the
Japanese government. The system uses multi-label
classification to recommend tags. SMOTE, an
oversampling method, is applied to augment the training
data of infrequent tags. Given the title of a dataset,
the system recommends several tags using a classifier
constructed in advance.
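The SMOTE step mentioned above creates a synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbours (Chawla et al., 2002). A minimal sketch of that core idea; the feature vectors and parameters here are illustrative, and the multi-label variants cited earlier (e.g. MLSMOTE) refine how neighbours and labels are chosen:

```python
import random

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def smote_sample(minority, k=2, rng=random.Random(0)):
    """Create one synthetic sample: pick a minority point and interpolate
    toward one of its k nearest minority neighbours."""
    base = rng.choice(minority)
    neighbours = sorted((p for p in minority if p is not base),
                        key=lambda p: euclidean(base, p))[:k]
    nb = rng.choice(neighbours)
    lam = rng.random()  # interpolation factor in [0, 1)
    return [b + lam * (n - b) for b, n in zip(base, nb)]

# Three minority-class feature vectors (invented for illustration).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# The synthetic point lies on the segment between two minority points,
# so each coordinate stays within the minority points' range.
synthetic = smote_sample(minority)
```

Repeating this until the infrequent tags reach the desired frequency yields the augmented training set used to fit the classifier.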
The work reported herein is part of ongoing research.
Important remaining tasks include improving
the accuracy of predicting infrequent tags and generating
new tags that do not appear in existing datasets.
Data catalog sites of other countries, such as Data.gov
of the U.S. government, are another target; developing
a system for such sites is also future work.
ACKNOWLEDGEMENTS
This work was supported by JSPS KAKENHI Grant
Number JP19K12715.
REFERENCES
Babbar, R. and Schölkopf, B. (2017). DiSMEC: Distributed
sparse machines for extreme multi-label classification.
In Proceedings of the 10th ACM International Confer-
ence on Web Search and Data Mining, pages 721–729.
ACM.
Charte, F., Rivera, A. J., del Jesus, M. J., and Herrera, F.
(2015). MLSMOTE: Approaching imbalanced mul-
tilabel learning through synthetic instance generation.
Knowledge-Based Systems, 89:385–397.
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer,
W. P. (2002). SMOTE: Synthetic minority over-
sampling technique. J. Artif. Int. Res., 16(1):321–357.
Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue,
H., and Bing, G. (2017). Learning from class-
imbalanced data: Review of methods and applica-
tions. Expert Systems With Applications, 73:220–239.
Jain, H., Prabhu, Y., and Varma, M. (2016). Extreme
multi-label loss functions for recommendation, tag-
ging, ranking & other missing label applications. In
Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Min-
ing, pages 935–944. ACM.
Liu, B., Blekas, K., and Tsoumakas, G. (2022). Multi-
label sampling based on local label imbalance. Pattern
Recognition, 122:108294.
Liu, B. and Tsoumakas, G. (2019). Synthetic oversam-
pling of multi-label data based on local label distribu-
tion. In Machine Learning and Knowledge Discovery
in Databases: European Conference, ECML PKDD,
page 180–193. Springer-Verlag.
Manning, C. D., Raghavan, P., and Schütze, H. (2008).
Introduction to Information Retrieval. Cambridge
University Press.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
Thirion, B., Grisel, O., Blondel, M., Prettenhofer,
P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,
A., Cournapeau, D., Brucher, M., Perrot, M., and
Duchesnay, E. (2011). Scikit-learn: Machine learning
in Python. Journal of Machine Learning Research,
12:2825–2830.
Rao, K. N. and Reddy, C. S. (2020). A novel under sam-
pling strategy for efficient software defect analysis of
skewed distributed data. Evolving Systems, 11:119–
131.
Schultheis, E. and Babbar, R. (2022). Speeding-
up one-versus-all training for extreme classification
via mean-separating initialization. Mach. Learn.,
111(11):3953–3976.
Wu, T., Huang, Q., Liu, Z., Wang, Y., and Lin, D. (2020).
Distribution-balanced loss for multi-label classifica-
tion in long-tailed datasets. In Computer Vision –