A knowledge-based similarity metric is a semantic
similarity measure that employs semantic network
data to determine the degree to which words are
similar.
The most recent hybrid approaches extract
semantic knowledge from WordNet's structural
representation as well as from Internet statistical
data. Kim et al. (2014) suggested TF-IDF, a new
linked data metric based on a hybrid semantic
similarity measure.
3 REVIEW OF EXISTING METHODS
A number of deep learning methods have been
employed as a result of recent breakthroughs in the
field (Hong et al., 2015). However, despite their
qualified successes, homonym identification methods
designed for a specific context, such as author names,
are of limited use in common texts, which lack
related metadata such as citations and co-author
information. As a result, many approaches for
detecting homonyms in common texts have been
devised. One approach used a self-developed list of
confusing words to detect typographical errors and
homonyms by adjusting the distance and applying a
naive Bayes classifier (Hong et al., 2015).
The aforementioned investigations, however, used
rule-based or statistical methods that required an
answer set, rather than relying on the semantic
meaning of the word. Such methods cannot be applied
to a broad text domain, since the rules must be
tailored to each text domain in order to get reliable
results. As a result, a novel homonym-detection
technique has been presented that uses a contextual
word-embedding method to take into account the
semantic meaning of a word (Hong et al., 2015). In
Natural Language Processing, there are various ways
of detecting word and sentence similarity
(Buchta et al., 2017).
4 EXPERIMENTAL DESIGN
The concept-based similarity metric is based on three
key factors. First, the analysed tagged terms are the
concepts that represent each sentence's semantic
structure. Second, the frequency of a concept is used
to evaluate both the concept's contribution to the
sentence's meaning and the main points of the
document. Third, when assessing similarity, the
number of documents that contain the analysed
concepts is used to discriminate among documents.
These factors are evaluated by the proposed
concept-based similarity measure, which uses the ctf
measure to assess the significance of each concept at
the sentence level, the tf measure at the document
level, and the df measure at the corpus level.
The following aspects affect the similarity
measure:
1. the total number of matching concepts, $m$, in the
verb argument structures of the given document $d$
2. the total number of sentences, $sn$, in document
$d$ that contain the matching concept $c_i$
3. the total number of labeled verb argument
structures, $v$, in each sentence $s$
4. the $ctf_i$ of each concept $c_i$ in sentence $s$,
where $i = 1, 2, \ldots, m$, for each document $d$
5. the $tf_i$ of each concept $c_i$ in each document
$d$
6. the $df_i$ of each concept $c_i$
7. the length, $L_v$, of each verb argument structure
that contains a matched concept
8. the total number of documents, $N$, in the corpus
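The three frequency measures in the list above can be sketched on a toy corpus. This is a minimal illustration, not the authors' implementation: the nested-list input format, the function names, and the reading of ctf as "number of occurrences of a concept across the verb argument structures of one sentence" are all assumptions made for the example, and the concepts are taken to be already labeled.

```python
from collections import Counter

# Hypothetical input format (an assumption for this sketch):
# document -> list of sentences
# sentence -> list of verb argument structures
# structure -> list of already-labeled concept strings

def ctf_per_sentence(sentence):
    """Assumed ctf: occurrences of each concept across the
    verb argument structures of a single sentence."""
    return Counter(c for structure in sentence for c in structure)

def tf_per_document(document):
    """tf: occurrences of each concept in the whole document."""
    return Counter(
        c for sentence in document for structure in sentence for c in structure
    )

def df_over_corpus(corpus):
    """df: number of documents that contain each concept at least once."""
    df = Counter()
    for document in corpus:
        seen = {c for sentence in document for structure in sentence for c in structure}
        df.update(seen)
    return df

# Toy corpus with two documents (illustrative data only).
doc_a = [[["bank loan", "interest"], ["bank loan"]], [["interest"]]]
doc_b = [[["river bank"], ["interest"]]]
corpus = [doc_a, doc_b]

sentence_level = ctf_per_sentence(doc_a[0])
document_level = tf_per_document(doc_a)
corpus_level = df_over_corpus(corpus)
```

Keeping the three counters separate mirrors the sentence, document, and corpus levels named in the text; a real pipeline would first need a semantic role labeler to produce the verb argument structures.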
The concept-based similarity measure between
homonym words is calculated using the ctf.
Concept-based matching uses either an exact match or
a partial match between two concepts. An exact match
occurs when both concepts share identical homonym
words. A partial match occurs when one concept
contains all of the words found in the other concept.
Consider the following concepts,
$c_1$ = "$w_1 w_2 w_3$" and $c_2$ = "$w_1 w_2$",
where $c_1$, $c_2$ are concepts and $w_1$, $w_2$, $w_3$ are
individual words.
After removing stop words, if $c_2 \subset c_1$, then $c_1$
holds more conceptual information than $c_2$. In this
case, the length of $c_1$ is used in the similarity measure
between $c_1$ and $c_2$.
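The exact and partial matching rules described above can be sketched as follows. This is a hedged illustration under stated assumptions: the stop-word set, the function names, and the choice to compare concepts as unordered sets of lowercase words are not taken from the paper.

```python
# Illustrative stop-word list (an assumption; the paper does not give one).
STOP_WORDS = {"the", "a", "an", "of", "in", "to"}

def content_words(concept):
    """Lowercase the concept and drop stop words, keeping a set of words."""
    return {w for w in concept.lower().split() if w not in STOP_WORDS}

def match_concepts(c1, c2):
    """Return (match_type, length).

    'exact'   : both concepts contain the same words
    'partial' : one concept contains all words of the other;
                the length of the longer concept is returned
    'none'    : neither condition holds
    """
    s1, s2 = content_words(c1), content_words(c2)
    if not s1 or not s2:
        return ("none", 0)
    if s1 == s2:
        return ("exact", len(s1))
    if s1 < s2 or s2 < s1:  # proper subset in either direction
        return ("partial", max(len(s1), len(s2)))
    return ("none", 0)

partial = match_concepts("w1 w2 w3", "w1 w2")
exact = match_concepts("the w1 w2", "w1 w2")
unrelated = match_concepts("w1 w4", "w1 w2")
```

With the concepts from the text, `match_concepts("w1 w2 w3", "w1 w2")` is a partial match whose length comes from the longer concept, which matches the statement that the length of the containing concept is used.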
The concept length is used only to compare two
concepts; it plays no role in determining the
importance of a concept in terms of sentence
semantics. The ctf is used to identify the concepts
that are relevant to sentence semantics, alongside
the term frequency (tf).
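The concept-based similarity between two
documents, $d_1$ and $d_2$, is calculated by:

\[
\mathrm{sim}(d_1, d_2) = \sum_{i=1}^{m} \max\left(\frac{l_{i1}}{L_{v_{i1}}}, \frac{l_{i2}}{L_{v_{i2}}}\right) \times weight_{i1} \times weight_{i2}
\]

where the weight of each concept is:

\[
weight_i = \left(tfweight_i + ctfweight_i\right) \times \log\left(\frac{N}{df_i}\right)
\]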
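The weight and similarity formulas can be sketched in code. The sketch assumes the similarity is a sum over the m matched concepts of the larger length ratio (concept length over verb argument structure length) multiplied by the concept's weight in each document, and that each weight is (tf weight + ctf weight) × log(N / df). The class and function names are hypothetical, the tf and ctf weights are taken as precomputed inputs, and the natural logarithm is an assumption, since the text does not state the base.

```python
import math
from dataclasses import dataclass

@dataclass
class ConceptStats:
    """Statistics of one matched concept in one document (hypothetical type)."""
    l: int             # length of the concept, in words
    Lv: int            # length of the verb argument structure containing it
    tf_weight: float   # document-level weight, assumed precomputed
    ctf_weight: float  # sentence-level weight, assumed precomputed

def concept_weight(stats, df, n_docs):
    """weight = (tf weight + ctf weight) * log(N / df), natural log assumed."""
    return (stats.tf_weight + stats.ctf_weight) * math.log(n_docs / df)

def concept_similarity(matches, n_docs):
    """Sum over matched concepts.

    matches: list of (stats_in_d1, stats_in_d2, df) tuples,
             one tuple per matched concept.
    """
    total = 0.0
    for s1, s2, df in matches:
        length_ratio = max(s1.l / s1.Lv, s2.l / s2.Lv)
        total += (
            length_ratio
            * concept_weight(s1, df, n_docs)
            * concept_weight(s2, df, n_docs)
        )
    return total

# Illustrative values only: one matched concept in a corpus of 10 documents.
in_d1 = ConceptStats(l=2, Lv=4, tf_weight=0.5, ctf_weight=0.5)
in_d2 = ConceptStats(l=3, Lv=4, tf_weight=0.5, ctf_weight=0.5)
score = concept_similarity([(in_d1, in_d2, 1)], n_docs=10)
```

A concept that appears in every document gets log(N / df) = 0 and therefore contributes nothing to the score, which is the usual discriminating effect of the df term.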
AI4IoT 2023 - First International Conference on Artificial Intelligence for Internet of things (AI4IOT): Accelerating Innovation in Industry