2.2 Pre-Processing
During the preparation phase, the text is tokenized, stop words are removed, and each word is mapped to its vector in semantic space. The vector representation of every word in a phrase is thus obtained by tokenizing the text and removing stop words (Bhardwaj et al., 2018).
2.3 Tokenization
Tokenization is the process of breaking phrases down into tokens and removing unnecessary punctuation and other extraneous characters. Documents are represented using the Vector Space Model with tf-idf weighting.
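The pre-processing and tokenization steps described above can be sketched as follows. The regular expression and the small stop-word list are illustrative assumptions; a real pipeline would use a full stop-word list (e.g. NLTK's).

```python
import re

# Illustrative stop-word list; an assumption for this sketch, not the
# list used by the authors.
STOP_WORDS = {"the", "a", "an", "is", "of", "in", "and", "to"}

def preprocess(text):
    """Lowercase, tokenize, strip punctuation, and remove stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The cat sat in the hat."))
```

The remaining tokens are the terms that enter the Vector Space Model with tf-idf weighting.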
The following points describe the calculation of the conceptual term frequency ‘ctf’ of a concept ‘c’ in a sentence ‘s’, where the document is denoted as ‘d’:
Calculating ctf of Concept c in Sentence s
The term ‘ctf’ indicates how frequently a concept ‘c’ appears in the verb argument structures of a sentence ‘s’. A concept ‘c’ that recurs in different verb argument structures of the same sentence ‘s’ contributes principally to the meaning of that sentence. In this sense, ctf is a local measure at the sentence level.
Calculating ctf of Concept c in Document d
A concept c can have many ctf values in different
sentences in the same document d. Thus, the value for
ctf for concept c in the given document d is calculated
as:
ctf = (∑ ctf_s) / sn

where sn is the number of sentences in document d that contain the concept c, and ctf_s is the ctf of c in sentence s. The overall contribution of concept c to the meaning of its sentences in document d is thus measured by averaging the ctf values of c over those sentences. In the same way, the total importance of each concept to the semantics of a document, as expressed through its sentences, is obtained by averaging the ctf values.
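The sentence-level and document-level ctf computations can be sketched as follows. Representing each sentence as a list of verb argument structures, each a set of concept strings, is an assumption made for illustration.

```python
def ctf_in_sentence(concept, verb_arg_structures):
    """ctf of a concept in one sentence: the number of the sentence's
    verb argument structures that contain the concept."""
    return sum(1 for vas in verb_arg_structures if concept in vas)

def ctf_in_document(concept, sentences):
    """Average the sentence-level ctf over the sn sentences of the
    document that contain the concept; 0.0 if none do."""
    values = [ctf_in_sentence(concept, vas_list)
              for vas_list in sentences
              if any(concept in vas for vas in vas_list)]
    return sum(values) / len(values) if values else 0.0

# Hypothetical document: 2 sentences, each a list of verb argument
# structures represented as concept sets (assumed encoding).
doc = [[{"engine"}, {"engine", "fuel"}], [{"fuel"}]]
```

Here `ctf_in_document("engine", doc)` averages over only the first sentence, since the second does not contain the concept.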
Algorithm 1: Proposed Clustering-based Similarity
Measure.
1. ddoci is a new document
2. L is an empty list (L is the matched-concept list)
3. sdoci is a new sentence in document ddoci
4. Build the concept list Cdoci from sentence sdoci
5. for each concept ci ∈ Cdoci do
6. calculate ctfi of ci in ddoci
7. calculate tfi of ci in ddoci
8. calculate dfi of ci in ddoci
9. dk is a previously seen document, where k = {0, 1, . . . , doci−1}
10. sk is a sentence in document dk
11. Build the concept list Ck from sentence sk
12. for each concept cj ∈ Ck do
13. if (ci == cj) then
14. update dfi of ci
15. calculate ctfweight = average (ctfi, ctfj)
16. add the matched concept to L
17. end if
18. end for
19. end for
20. Output the matched concepts in list L
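Algorithm 1 can be sketched in Python as follows. Representing each document as a dict mapping concepts to their ctf values is an assumption made for the sketch, and df is approximated here as the number of previously seen documents that contain the concept.

```python
def match_concepts(new_doc, seen_docs):
    """Sketch of Algorithm 1.

    new_doc: dict concept -> ctf in the new document (assumed encoding).
    seen_docs: list of such dicts, one per previously seen document.
    Returns the list L as a dict concept -> (ctfweight, df).
    """
    L = {}
    for ci, ctf_i in new_doc.items():          # line 5
        df_i = 0
        ctf_weights = []
        for doc in seen_docs:                  # lines 9-12
            if ci in doc:                      # line 13: ci == cj
                df_i += 1                      # line 14: update dfi
                ctf_weights.append((ctf_i + doc[ci]) / 2)  # line 15
        if ctf_weights:                        # line 16: add match to L
            L[ci] = (sum(ctf_weights) / len(ctf_weights), df_i)
    return L                                   # line 20
```

For example, a concept that matches in two seen documents gets the average of its pairwise ctfweight values and a df of 2.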
The concept-based measure algorithm describes the process of calculating ctf, tf and df for the matched concepts in the text. The procedure starts with a new document (line 1) that has clearly specified text boundaries. Each sentence is assigned a semantic label. For the concept-based similarity calculation, the lengths of the matched concepts and of their verb argument structures are saved (Buchta et al., 2018).
The concept-based similarity between words, including homonyms, is calculated using the conceptual term frequency (ctf).
Consider the following concepts:
c1 = ‘‘w1w2w3’’ and c2 = ‘‘w1w2’’
where c1, c2 are concepts and w1, w2, w3 are individual words. After removing stop words, if c2 ⊂ c1, then c1 carries more conceptual information than c2. In this case, the length of c1 is used in the similarity measure between c1 and c2.
The concept length is used only to compare two concepts; it plays no role in determining the importance of a concept to sentence semantics. It is the ctf, like tf, that identifies the concepts relevant to sentence semantics.
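The containment test and the choice of concept length can be illustrated as follows. The word-set comparison is a simplification of the containment relation c2 ⊂ c1, and the example concepts are hypothetical.

```python
def concept_contains(c_long, c_short):
    """Check whether every word of c_short occurs in c_long.
    Word-set containment is used here as a simplification of c2 ⊂ c1."""
    return set(c_short.split()) <= set(c_long.split())

# Hypothetical concepts for illustration (w1 w2 w3 vs. w1 w2):
c1 = "engine fuel pump"
c2 = "engine fuel"
if concept_contains(c1, c2):
    # c1 holds more conceptual information, so its length (3 words)
    # is the one used in the similarity measure between c1 and c2.
    concept_length = len(c1.split())
```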
The concept-based similarity between two documents, d1 and d2, is calculated by:

sim(d1, d2) = ∑ max(l_i / Lv1, l_i / Lv2) × weight_i1 × weight_i2

where l_i is the length of matched concept i, and Lv1 and Lv2 are the lengths of the verb argument structures containing the matched concept in d1 and d2, respectively. The weight of concept i is given by:

weight_i = tfweight_i + ctfweight_i × log(N / df_i)

where N is the number of documents and df_i is the document frequency of concept i.
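A sketch of the weight and similarity computations, assuming the term grouping weight_i = tfweight_i + ctfweight_i × log(N/df_i) as read from the text, and per-document verb-argument-structure lengths Lv1 and Lv2 passed as arguments (the argument shapes are assumptions for illustration):

```python
import math

def concept_weight(tf_weight, ctf_weight, N, df):
    """weight_i = tfweight_i + ctfweight_i * log(N / df_i)."""
    return tf_weight + ctf_weight * math.log(N / df)

def concept_sim(d1_matches, d2_matches, Lv1, Lv2):
    """sim(d1, d2) = sum over matched concepts i of
       max(l_i / Lv1, l_i / Lv2) * weight_i1 * weight_i2.

    d1_matches / d2_matches: aligned lists of (l_i, weight_i) pairs
    for the matched concepts (assumed encoding).
    """
    total = 0.0
    for (l_i, w_i1), (_, w_i2) in zip(d1_matches, d2_matches):
        total += max(l_i / Lv1, l_i / Lv2) * w_i1 * w_i2
    return total
```

A single matched concept of length 2 with weights 1.0 and 0.5, and verb-argument-structure lengths 4 and 2, contributes max(2/4, 2/2) × 1.0 × 0.5 to the similarity.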