Tag Recommendation System for Data Catalog Site of Japanese

Government

Yasuhiro Yamada

Institute of Science and Engineering, Academic Assembly, Shimane University,

1060 Nishikawatsu-cho, Matsue-shi, Shimane, 690-8504, Japan

Keywords:

e-Government, Open Government Data, Tag Recommendation, Multi-Label Classiﬁcation, Machine

Learning.

Abstract:

This paper proposes a tag recommendation system for a data catalog site of the Japanese government. The

site publishes datasets that include ﬁles containing statistical data, government documents, and other ﬁles of

the Japanese government. These datasets also each include the title, description, publication date, and tags,

where a tag is a single-word or compound term which represents the content of a dataset. The system uses

multi-label classiﬁcation in machine learning to recommend tags for the datasets; multi-label classiﬁcation is

a method that outputs multiple tags for each input dataset. There are many tags already in datasets hosted on

the site that appear infrequently. It is difﬁcult to predict such infrequent tags from the datasets by multi-label

classiﬁcation. To deal with this problem, we use an existing oversampling approach which increases the data

of infrequent tags in a training dataset for the learning process of the multi-label classiﬁcation.

1 INTRODUCTION

Various government organizations have recently pub-

lished datasets, each of which includes ﬁles contain-

ing, for example, statistics data and government doc-

uments, on their Web sites. Such datasets are referred

to as open government data, and such sites are called

data catalog sites. For example, the U.S. govern-

ment publishes datasets on the site “Data.gov

.” This

site had 196,587 datasets in September 2017. The

Japanese government publishes datasets on the Web

site “DATA.GO.JP

.” The site was hosting 27,169

datasets as of February 2021.

In addition to ﬁles, a dataset contains metadata

such as the title of the dataset, description, publica-

tion date, and tags. This paper focuses on the tags of

a dataset. A tag is a single-word or compound term

which represents the content of a dataset. Examples

of tags for DATA.GO.JP are “maps,” “budgets” and

“statistics survey result.” A dataset often has multiple

tags.

Tags are useful for getting a broad understanding

https://www.data.gov

This site was updated in March 2023. The new

Japanese government data catalog site is called “e-Gov Data

Portal” (https://data.e-gov.go.jp/info/ja/top). This paper de-

scribes our research for DATA.GO.JP.

of the content of a dataset before reading the dataset

ﬁles. If the ﬁles of a dataset are large, or if a dataset

has many ﬁles, then the tags are more important for

understanding the dataset. The tags are also useful

for users to search for the datasets they want. By in-

putting or selecting a tag on a search site, a list of

datasets associated with the tag can be obtained.

In order for users to gain this beneﬁt, the same tag

must be assigned to datasets with the same or simi-

lar content. It is difﬁcult to assign proper tags man-

ually to a dataset because government employees or

site managers need to ﬁrst understand the content of

the dataset in detail. Also, they need to remember the

tags assigned to other datasets in the past.

Data catalog sites use many tags. Tags with gen-

eral meanings appear in many datasets, whereas tags

with detailed meanings are rare in the datasets of a

site. The infrequent tags tend to express the con-

crete content of a dataset, and these tags are espe-

cially important for understanding the content of such

a dataset. However, it is especially difﬁcult to manu-

ally assign such tags to a dataset.

We propose a system which recommends tags for

a dataset for DATA.GO.JP. To select appropriate tags

for a dataset from among the tags which have been

assigned to previously tagged datasets, we use multi-

label classiﬁcation in machine learning. Labels in

Yamada, Y.

Tag Recommendation System for Data Catalog Site of Japanese Government.

DOI: 10.5220/0012260000003598

In Proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2023) - Volume 3: KMIS, pages 325-331

ISBN: 978-989-758-671-2; ISSN: 2184-3228

325

multi-label classiﬁcation correspond to tags. How-

ever, multi-label classiﬁcation has difﬁculty predict-

ing infrequent labels. We apply an existing oversam-

pling technique which increases datasets with infre-

quent labels artiﬁcially in training data for learning in

multi-label classiﬁcation.

The Japanese government adopted four categories

of tags: important data category of G8, category of

basic tags of e-government information, category for

business classiﬁcation in e-government action plan,

and prioritized ﬁelds in “Roadmap for Promotion of

Open Data in Electronic Administration.” The pol-

icy of the Japanese government in assigning tags is to

select proper tags from those in these four categories.

Additionally, government employees or site managers

manually assign tags outside these four categories. In

the present paper, we choose classiﬁers for predicting

tags in each category by multi-label classiﬁcation.

The remainder of the present paper is organized

as follows. Section 2 describes related research. Sec-

tion 3 shows the tags of datasets of open government

data on DATA.GO.JP. Section 4 describes a tag rec-

ommendation system. Section 5 discusses the prob-

lem of the proposed system. Finally, our conclusions

are presented in Section 6.

2 RELATED WORK

2.1 Multi-Label Classiﬁcation for

Imbalanced Data

We deﬁne multi-label classiﬁcation. Let L =

, l

, . . . , l

} be a set of labels which correspond to

tags of datasets of open government data, and D =

{(x

, Y

), (x

, Y

), . . . (x

, Y

)} be a set of training ex-

amples, where x

∈ R

is a d-dimensional feature vec-

tor and Y

⊆ L. The learning task of multi-label clas-

siﬁcation is to make a classiﬁer for L from D. Then,

given an unlabeled example x

x ∈ R

, the classiﬁer pre-

dicts labels for x

Multi-label classiﬁcation algorithms have been

proposed for imbalanced data (Xu et al., 2016; Bab-

bar and Sch

olkopf, 2017; Jain et al., 2016; Liu and

Tsoumakas, 2019; Wu et al., 2020; Schultheis and

Babbar, 2022; Yu et al., 2019). Imbalanced data in-

clude both frequent and infrequent labels in training

data. These algorithms address the difﬁculty of pre-

dicting infrequent labels.

2.2 Resampling Training Data in

Multi-Label Classiﬁcation

Various methods for resampling training data for im-

balanced data have been proposed. The oversampling

method increases data of infrequent labels in training

data by making feature vectors artiﬁcially (Chawla

et al., 2002; Charte et al., 2015; Liu et al., 2022).

This paper utilizes SMOTE (Chawla et al., 2002) for

the oversampling. A brief explanation of SMOTE

is given in Section 4. On the other hand, the un-

dersampling method decreases data with frequent la-

bels (Haixiang et al., 2017; Rao and Reddy, 2020).

2.3 Our Previous Work

We applied three multi-label classiﬁcation methods

for the tag recommendation (Yamada and Nakatoh,

2018) to 196,587 datasets on Data.gov, which is the

data catalog site of the U.S. government. Speciﬁ-

cally, we compared the following methods: support

vector machine, random forest, and multinomial naive

Bayes. The random forest method has obtained good

results (Yamada and Nakatoh, 2018). However, our

previous work did not deal with the prediction of in-

frequent tags.

3 TAGS OF JAPANESE OPEN

GOVERNMENT DATA

This section describes tags used by the data cata-

log site “DATA.GO.JP” of the Japanese government.

We collected 27,169 datasets from DATA.GO.JP in

February 2021.

The tag assignment policy of this site is to select

tags from those in each of the following four cate-

gories:

Category 1. important data category of G8 (16 tags),

Category 2. category of basic tags of e-government

information (16 tags),

Category 3. category for business classiﬁcation in e-

government action plan (31 tags),

Category 4. prioritized ﬁelds in “Roadmap for Pro-

motion of Open Data in Electronic Administra-

tion.” (7 tags).

Tables 1 to 4 in appendix A show the tags in these

four categories and their frequencies in the datasets.

We see that the frequency of some tags is 0. In Cate-

gories 1 to 4, Japanese and English tags with the same

meaning are assigned to the same dataset.

KMIS 2023 - 15th International Conference on Knowledge Management and Information Systems

326

Also, the site managers assign appropriate tags for

each dataset outside the four categories. Category 5

comprises such tags. Table 5 in appendix A shows

the top 20 most frequent tags with their frequencies.

Figure 1 shows the frequencies of all the tags in Cat-

egory 5. The vertical axis is in log scale. We can

see that a small number of tags appear frequently in

the datasets. We can also see that most of the tags

appear once or twice in the datasets. Tags with gen-

eral meanings tend to be assigned many times. On

the other hand, tags with concrete meanings tend to

appear rarely in the datasets.

Figure 1: Frequencies of tags.

4 TAG RECOMMENDATION

SYSTEM FOR DATA.GO.JP

This section describes a tag recommendation system

for DATA.GO.JP (see Figure 2). This system utilizes

multi-label classiﬁcation which outputs multiple tags

from a dataset.

The system learns a classiﬁer for each of the four

categories. Multi-label classiﬁcation cannot predict

the tags whose frequency in the training data is zero.

Therefore, the system cannot predict some tags in

Categories 1 to 4.

Figure 3: SMOTE.

Category 5 has many infrequent tags (see Fig-

ure 1). We use an oversampling approach to make

a classiﬁer for Category 5. The oversampling in-

creases feature and label vectors of infrequent labels

in the training data. We utilize SMOTE (Chawla et al.,

2002) for the oversampling, whose process is illus-

trated in Figure 3. Circles in the ﬁgure express feature

vectors. Given training data D = {(x

, Y

), (x

, Y

. . . (x

, Y

)}, SMOTE executes oversampling as the

following steps:

Figure 2: Tag recommendation system for DATA.GO.JP.

Tag Recommendation System for Data Catalog Site of Japanese Government

327

Figure 4: Input of tag recommendation system.

Figure 5: Output of tag recommendation system.

1. Select a feature vector x

randomly from the fea-

ture vectors with an infrequent label l as a center

vector c

c (red circle).

2. Calculate the k-nearest feature vectors (blue cir-

cles) in D for c

c by Euclidean distance.

3. Select a feature vector x

near

(top left blue circle)

from the k vectors randomly. Make a new feature

vector (green circle) between c

c and x

near

4. Assigns the new feature vector to label l.

SMOTE repeats the process until the number of

feature vectors of the infrequent label achieves the

threshold.

The number of collected datasets is 27,169. The

feature vector of each dataset was generated by nouns

in the title and description of the dataset and their fre-

quencies in the dataset. The feature and label vectors

of infrequent tags in Category 5 were increased by

SMOTE until the number of the vectors of each infre-

quent tag that appears fewer than 10 times in training

data reached 10. The number of all datasets in the

training data after the oversampling was 67,416.

We use the multinomial naive Bayes

method (Manning et al., 2008) for multi-label

classiﬁcation. We implemented this method using

scikit-learn

(Pedregosa et al., 2011).

Users input the title of a dataset and select a cate-

gory of tags. Figure 4 shows a screen capture of the

input of the system.

Figure 5 shows a screen capture of the output

of the system for the input title “the documents of

weather prediction, weather forecast and typhoon.”

We use the function predict

proba() in scikit-learn

which outputs probability estimates for tags. The col-

http://scikit-learn.org/stable/

KMIS 2023 - 15th International Conference on Knowledge Management and Information Systems

328

umn “Value” in Figure 5 is the estimates. Although

automatically assigning tags to a dataset would be

ideal, due to the accuracy problem of recommenda-

tions, it is assumed that the users select the proper

tags from the recommended tags.

5 DISCUSSION

The frequency of some tags in Categories 1 to 4 is 0.

Multi-label classiﬁcation cannot predict tags which

do not appear in the training data. Also, oversampling

methods cannot increase feature and label vectors of

such tags. One method that could be adapted to pre-

dict such tags is zero-shot learning.

It remains to check the accuracy of the tag rec-

ommendations. Various multi-label classiﬁcation and

oversampling methods have been proposed, so we

need to conduct comparisons with such methods.

In the present study, we set the threshold of the

frequency between infrequent tags and others to 10 in

the datasets of DATA.GO.JP. It is important to auto-

matically determine the threshold at which the over-

sampling is effective.

Since multi-label classiﬁcation cannot output tags

which do not appear in the training data, we need

to extract new tags from a dataset. We previ-

ously proposed a method to extract particular noun

phrases (Yamada et al., 2018; Yamada and Nakatoh,

2018) as characteristic phrases and words to use as

new tags that represent the dataset.

6 CONCLUSION

This paper proposed a tag recommendation system

for DATA.GO.JP which is the data catalog site of the

Japanese government. The system utilizes multi-label

classiﬁcation in machine learning to recommend tags.

SMOTE, which is an oversampling method, is used

to increase the data of infrequent tags in the train-

ing data. The system is given the title of a dataset,

and it recommends several tags using a classiﬁer con-

structed in advance.

The work reported herein is part of ongoing re-

search. Important remaining work includes increasing

the accuracy of predicting infrequent tags and making

new tags which are not included in previous datasets.

Other data catalog sites of countries include Data.gov

of the U.S. government. Our future work is also to

develop a system for such sites.

ACKNOWLEDGEMENTS

This work was supported by JSPS KAKENHI Grant

Numbers JP19K12715.

REFERENCES

Babbar, R. and Sch

olkopf, B. (2017). DiSMEC: Distributed

sparse machines for extreme multi-label classiﬁcation.

In Proceedings of the 10th ACM International Confer-

ence on Web Search and Data Mining, pages 721–729.

ACM.

Charte, F., Rivera, A. J., del Jesus, M. J., and Herrera, F.

(2015). MLSMOTE: Approaching imbalanced mul-

tilabel learning through synthetic instance generation.

Knowledge-Based Systems, 89:385–397.

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer,

W. P. (2002). SMOTE: Synthetic minority over-

sampling technique. J. Artif. Int. Res., 16(1):321–357.

Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue,

H., and Bing, G. (2017). Learning from class-

imbalanced data: Review of methods and applica-

tions. Expert Systems With Applications, 73:220–239.

Jain, H., Prabhu, Y., and Varma, M. (2016). Extreme

multi-label loss functions for recommendation, tag-

ging, ranking & other missing label applications. In

Proceedings of the 22nd ACM SIGKDD International

Conference on Knowledge Discovery and Data Min-

ing, pages 935–944. ACM.

Liu, B., Blekas, K., and Tsoumakas, G. (2022). Multi-

label sampling based on local label imbalance. Pattern

Recognition, 122:108294.

Liu, B. and Tsoumakas, G. (2019). Synthetic oversam-

pling of multi-label data based on local label distribu-

tion. In Machine Learning and Knowledge Discovery

in Databases: European Conference, ECML PKDD,

page 180–193. Springer-Verlag.

Manning, C. D., Raghavan, P., and Sch

utze, H. (2008). In-

troduction to Information Retrieval. Cambridge Uni-

versity Press.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,

Thirion, B., Grisel, O., Blondel, M., Prettenhofer,

P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,

A., Cournapeau, D., Brucher, M., Perrot, M., and

Duchesnay, E. (2011). Scikit-learn: Machine learning

in Python. Journal of Machine Learning Research,

12:2825–2830.

Rao, K. N. and Reddy, C. S. (2020). A novel under sam-

pling strategy for efﬁcient software defect analysis of

skewed distributed data. Evolving Systems, 11:119–

131.

Schultheis, E. and Babbar, R. (2022). Speeding-

up one-versus-all training for extreme classiﬁcation

via mean-separating initialization. Mach. Learn.,

111(11):3953–3976.

Wu, T., Huang, Q., Liu, Z., Wang, Y., and Lin, D. (2020).

Distribution-balanced loss for multi-label classiﬁca-

tion in long-tailed datasets. In Computer Vision –

Tag Recommendation System for Data Catalog Site of Japanese Government

329

ECCV 2020, pages 162–178. Springer International

Publishing.

Xu, C., Tao, D., and Xu, C. (2016). Robust extreme

multi-label learning. In Proceedings of the 22nd

ACM SIGKDD International Conference on Knowl-

edge Discovery and Data Mining, pages 1275–1284.

ACM.

Yamada, Y., Himeno, Y., and Nakatoh, T. (2018). Weighting

of noun phrases based on local frequency of nouns. In

Recent Advances on Soft Computing and Data Min-

ing - Proceedings of the 3rd International Conference

on Soft Computing and Data Mining, pages 436–445.

Springer.

Yamada, Y. and Nakatoh, T. (2018). Tag recommendation

for open government data by multi-label classiﬁcation

and particular noun phrase extraction. In Proceedings

of the 10th International Joint Conference on Knowl-

edge Discovery, Knowledge Engineering and Knowl-

edge Management (IC3K 2018), Vol. 3, pages 83–91.

Yu, H.-F., Zhong, K., Dhillon, I. S., Wang, W.-C., and Yang,

Y. (2019). X-bert: extreme multi-label text classiﬁca-

tion using bidirectional encoder representations from

transformers. In NeurIPS 2019 Workshop on Science

Meets Engineering of Deep Learning.

APPENDIX A

This appendix shows the frequencies of tags in each

of ﬁve categories. Note that tags in the following ta-

bles are originally written in Japanese.

Table 1: Category 1: Important data category of G8.

Tag #

statistics 9,711

budgets 2,163

education 1,574

energy and environment 1,075

maps 947

health 909

ﬁnance and contracts 808

crime and justice 492

science and research 438

government accountability and democracy 434

companies 304

social mobility and welfare 176

global development 122

earth observation 47

elections 40

transport and infrastructure 10

Table 2: Category 2: Category of basic tags of e-

government information.

Tag #

statistics survey result 6,848

budgets and account settlement 2,023

white paper and annual report 1,423

information disclosure 994

evaluation and result 556

procurement 551

organization and institution 163

press release 98

council and working group 65

laws and regulations and notices 56

application and notiﬁcation procedure 25

juridical competent authority 3

public comment 1

bill submitted to the diet 0

prior conﬁrmation procedures and rules 0

press conference of minister 0

Table 3: Category 3: Category for business classiﬁcation in

e-government action plan.

Tag #

environment 2,603

population and households 2,553

government except elsewhere classiﬁed 1,919

security 1,892

disaster 1,760

international 1,654

agriculture and forestry 1,621

land 1,033

business and household and economy 849

construction 614

climate 597

labor 523

justice 364

education and learning support 359

scientiﬁc research and professional and tech-

nical services

306

ﬁsheries 252

medical and health care and welfare 167

sightseeing 167

manufacturing 161

mining and quarrying of stone 129

ﬁnance and insurance 126

accommodations and eating and drinking

services

125

transport and postal services 117

information and communications 41

wholesale and retail trade 37

services not elsewhere classiﬁed 34

living-related and personal services and

amusement services

industries unable to classify 28

compound services 3

electricity and gas and heat supply and water 3

real estate and goods rental and leasing 0

KMIS 2023 - 15th International Conference on Knowledge Management and Information Systems

330

Table 4: Category 4: Prioritized ﬁelds in “Roadmap for Pro-

motion of Open Data in Electronic Administration”.

Tag #

statistics 9,711

budgets and ﬁnal accounts and procurement 4,816

white paper and annual report 2,490

geospatial 1,499

disaster prevention and mitigation 1,324

transportation and tourism 396

code 103

Table 5: Top 20 most frequent tags in Category 5 (other

tags).

Tag #

ﬁnance 2,705

economy 1,772

2000012010019 1,711

5000012080001 1,701

energy 1,613

industry 1,572

expenditure 1,271

supply 1,244

contract 1,020

study 936

trafﬁc 923

society 779

manufacture 684

manufacturing industry 616

law 588

observation 518

medical 471

strong economy that generates hope 443

ship 437

crime 426

gas 389

Tag Recommendation System for Data Catalog Site of Japanese Government

331