The Impact of Clustering for Learning Semantic Categories
Mário Antunes¹, Diogo Gomes¹,² and Rui L. Aguiar¹,²
¹Instituto de Telecomunicações, Universidade de Aveiro, Aveiro, Portugal
²DETI, Universidade de Aveiro, Aveiro, Portugal
Keywords: Clustering, Text Mining, Machine Learning, IoT, M2M, Context Awareness.
Abstract: The ever-growing number of small devices with sensing capabilities produces massive amounts of diverse data. However, it becomes increasingly difficult to manage all these new data sources. Currently, there is no single way to represent, share, and understand IoT data, leading to information silos that hinder the realization of complex IoT/M2M scenarios. IoT/M2M scenarios will only achieve their full potential when the devices become intelligent: communicate, work and learn together with minimal human intervention. Pursuing these goals, we developed methods to estimate semantic similarity based on distributional profiles. Clustering algorithms were used to learn semantic categories and improve the model's accuracy. In this paper, we discuss the impact of the clustering algorithm and of the heuristics used to estimate its input parameters on the task of learning semantic categories. Our evaluation has shown that K-means combined with the silhouette method achieved the best results.
1 INTRODUCTION
Mining information from data has been the main task of computers over the past years. Yet, the focus is shifting: computers of the near future will be used to extract knowledge from information. This shift can be explained by the ever-growing number of devices, a direct consequence of the Internet of Things (IoT) (Wortmann et al., 2015) and machine-to-machine (M2M) communications (Chen and Lien, 2014).
The sheer volume and complexity of data make it difficult to extract potentially relevant information and knowledge. Data gathered by these devices has no value in its raw state; it must be analysed, interpreted and understood. Context-aware computing plays an important role in tackling this issue (Perera et al., 2014), and is an intrinsic property of IoT and M2M scenarios.
In IoT and M2M scenarios, it is possible to learn/detect an entity's context by combining data from multiple sources. An entity's context can then be used to provide added value: improve efficiency, optimize resources and detect anomalies. However, recent projects still follow a vertical approach (Fantacci et al., 2014; Robert et al., 2016; Datta et al., 2016).
Devices/manufacturers do not share context informa-
tion, or share it with a different structure, leading
to low interoperability and information silos respec-
tively. This has hindered interoperability and the re-
alization of even more powerful IoT scenarios, which
are being pursued mostly by large corporations with a
dominant market position and considerable resources
(e.g. Google, Amazon and Microsoft). Another important issue is the need for a new way to manage, store and process such diverse machine-made context information, unconstrained and without limiting structures. The full potential of IoT and M2M scenarios will only be achieved when we overcome these limitations.
In previous publications, we addressed the above-
mentioned issues and developed an organization
model that relies on machine learning features to
organize context information. At the time we ex-
plored two types of features: semantic (Antunes et al.,
2017a) and stream similarity (Jesus et al., 2017).
Regarding semantic similarity, we developed a
model named Distributional Profile of Word Cate-
gories (DPWC) and an unsupervised method to learn
the model from common web services, such as search
engines. The learning method uses clustering al-
gorithms to identify word categories automatically
(word category is closely related to concepts and the
possible meaning of a word). Our prototype relied on
k-means to cluster the distributional profile into cate-
gories. However, the algorithm has two main disadvantages: it is not deterministic and it requires the num-
ber of clusters a priori. In this paper, we explore other
clustering methods and study their impact on category
learning. Since the DPWC learning method is unsu-
pervised, we are interested in automatic methods to
estimate the clustering parameters.
The remainder of the paper is organized as fol-
lows. In Section 2 we detail our context organization
model. Section 3 summarizes the DPWC semantic
model. An overview of clustering algorithms is given
in Section 4. In Section 5 we discuss the clustering
algorithms and parameters’ estimation heuristic used.
Section 6 gives important details about our prototype.
The results of our evaluation are given in Section 7.
Finally, discussion and conclusions are presented in
Section 8.
2 CONTEXT ORGANIZATION
MODEL
Context information is an enabler for further data
analysis, potentially exploring the integration of an
increasing number of information sources. Com-
mon definitions of context information (Abowd et al.,
1999; Dey, 2001) do not provide any insight about its
structure. In fact, each device can share context infor-
mation with a different structure. One important ob-
jective of context representation is to standardize the
process of sharing and understanding context infor-
mation. However, nowadays no widely accepted con-
text representation scheme exists; instead, there are
several approaches to deal with context information.
These can be divided into three categories: i) adopt/
create a new context representation, ii) normalize the
storing process through ontologies, iii) accept the di-
versity of context representations.
We accepted the diversity of context representa-
tion as a consequence of economic pressures and de-
vised a bottom-up model (Antunes et al., 2016) to or-
ganize context information without enforcing a spe-
cific representation. Our organization model is di-
vided into four main parts, as depicted in Figure 1.
The first two components represent the structured part
of our model and account for the source ID and fixed
d-dimensions respectively. These d-dimensions allow human users to select information based on time, location or even other dimensions, and can be understood as an OLAP cube that helps in the process of filtering information. The remaining components of our
model extract information from the content itself and
organize it based on semantic and stream similarity.
This paper focuses on semantic features, especially
the impact of clustering on category detection.
Figure 1: Context organization model based on semantic
and stream similarity.
3 DISTRIBUTIONAL PROFILE
OF WORD CATEGORIES
Semantic distance/similarity is a property of lexical units, typically words, but the notion can be generalized to larger units such as phrases, sentences, etc. Two words are considered semantically close if there is a lexical semantic relation between them. There are two types of lexical relations: classical relations (such as synonymy, antonymy and hypernymy) and ad-hoc non-classical relations (such as cause-and-effect). If the closeness in meaning is due to a certain classical relation, then the terms are said to be semantically similar. On the other hand, semantic relatedness is the term used to describe the more general form of semantic closeness caused by any semantic relation. For instance, the nouns liquid and water are both semantically similar and related, whereas the nouns water and boat are semantically related but not similar.
Semantic features allow us to estimate similarity between concepts. This similarity allows us to organize, extract and cluster information based on concepts rather than on sub-strings or regular expressions. In other words, the devices are able to autonomously learn concepts and not only strings. These concepts provide latent knowledge to the underlying information and do not depend on human users or context representation. This is especially important for IoT/M2M scenarios, where devices share highly diverse data.
Given a target word u, we use public web services, namely search engines, to gather a potentially relevant corpus and extract the distributional profile of u. The profile is built based on proximity: if a word w occurs within the neighbourhood of the target word u, it is processed and extracted. This distributional profile of a word (DPW) is defined as
DPW(u) = \{ w_1, f(u, w_1);\ \ldots;\ w_n, f(u, w_n) \} \qquad (1)
where u is the target word, w_i are words that occur within the neighbourhood of u and f stands for the co-occurrence frequency (it can be generalized to any strength-of-association metric). A distributional profile can also be interpreted as a vector that represents a point in a high-dimensional space: each word w_i represents a dimension and f(u, w_i) represents its value in that dimension. From this point onward we will refer to words inside a DPW as dimensions. We evaluate the similarity between two DPWs with the cosine similarity:
S(u, v) = \frac{\sum_{i=1}^{n} f(u, w_i) \times f(v, w_i)}{\sqrt{\sum_{i=1}^{n} f(u, w_i)^2} \times \sqrt{\sum_{i=1}^{n} f(v, w_i)^2}} \qquad (2)
Other similarity measures could be used; however, cosine is invariant to scale, which means it does not take into account the vectors' magnitudes, only their directions. This property is important for unbalanced corpora, such as corpora in M2M scenarios or corpora gathered from web services (due to the ranking algorithms used by web services).
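To make Equation 2 concrete, the following minimal sketch represents a DPW as a Python dictionary mapping dimensions to co-occurrence frequencies (the prototype itself is written in Java; Python and the toy profiles below are purely illustrative):

    import math

    def cosine_similarity(dpw_u, dpw_v):
        # Equation 2: cosine of the angle between two sparse profiles,
        # each a mapping {word: co-occurrence frequency}.
        common = dpw_u.keys() & dpw_v.keys()  # only shared dimensions contribute
        dot = sum(dpw_u[w] * dpw_v[w] for w in common)
        norm_u = math.sqrt(sum(f * f for f in dpw_u.values()))
        norm_v = math.sqrt(sum(f * f for f in dpw_v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    # Invented toy profiles: shared dimensions make the similarity non-zero.
    dpw_liquid = {"water": 12, "flow": 7, "drink": 3}
    dpw_water = {"liquid": 10, "drink": 5, "flow": 4, "boat": 1}
    print(cosine_similarity(dpw_liquid, dpw_water))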
Although public web services offer some impor-
tant advantages, they also have some disadvantages.
Distributional profiles can be noisy and contain several dimensions with low relevance. A dimension with low relevance is a dimension with a low value of the co-occurrence frequency f(u, w_n). The combined weight of several low-relevance dimensions can change the direction of the word vector and damage the cosine similarity. Also, a profile can contain several senses of the target word (sense conflation). Multiple word senses in a single profile may also change the word vector's direction and decrease accuracy, limiting the potential of this method.
In order to minimize the above-mentioned issues,
we propose using clustering (overview of clustering
methods in Section 4) on the distributional profile to
identify categories/word senses. The rationale is that
dimensions belonging to the same category are closer
to each other than words from other categories. Clus-
tering methods require a distance metric in order to
group similar elements. Co-occurrence was used as
a distance metric to cluster the dimensions into cate-
gories.
It is important to mention that the clusters do
not represent word senses from a Roget-style the-
saurus (Jarmasz and Szpakowicz, 2004). This means that there is no one-to-one relation between the clusters and words in a thesaurus. Conceptually,
the clusters are more similar to categories in latent se-
mantic analysis, and may not have a correspondence
to our human perception. Since a cluster may not rep-
resent a classical word sense, from this point onward
we will refer to them as categories. One implication
of this statement is that some clusters represent high
relevance categories, while others represent low rel-
evance categories. Consider the following scenario: two target words u and v are not related but may end up with the same low-relevance category. These categories match each other and produce a false positive.
In order to minimize this issue, our model incor-
porates an affinity value between the target word and
each category (which can be understood as a bias) that measures the natural tendency of a word to be used in a specific category. The affinity is computed as
the average similarity between the target word and all
the cluster’s elements. After the clustering and com-
puting the affinity of the target word to each cluster,
the distributional profile of word categories (DPWC) is extracted from the DPW and grouped ac-
cording to the clusters obtained. After computing all
the affinity values, they are normalized to the interval ]0, 1] with the following expression:

a'_i = \frac{a_i}{\max(a)} \qquad (3)
The profile is defined as follows:

DPWC(u) = \begin{cases} a_1;\ DPW_1(u) \\ \ldots \\ a_m;\ DPW_m(u) \end{cases} \qquad (4)

where u is the target word, DPW_i(u) is the distributional profile for the word category i and a_i is the affinity between u and that word category.
Finally, the similarity between two DPWC is given by the following expression:

S(u, v) = \max\left( cosine(u_c, v_c) \times \frac{a_{u_c} + a_{v_c}}{2} \right) \qquad (5)

where u_c and v_c represent a specific category from u and v respectively, and a represents the category's affinity. Our final similarity measure is the maximum similarity over all possible category pairs, weighted by the average of the two categories' affinities. By incorporating affinities, our model minimizes the effect of low-relevance categories.
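To make Equation 5 concrete as well, the hypothetical helper below (reusing the cosine_similarity sketch above) models a DPWC as a list of (affinity, profile) pairs, one per learned category, and returns the maximum cosine over all category pairs weighted by the average affinity:

    def dpwc_similarity(dpwc_u, dpwc_v):
        # Equation 5: maximum similarity over all category pairs,
        # weighted by the average affinity of the two categories.
        best = 0.0
        for a_u, cat_u in dpwc_u:      # (affinity, profile) per category of u
            for a_v, cat_v in dpwc_v:  # (affinity, profile) per category of v
                score = cosine_similarity(cat_u, cat_v) * (a_u + a_v) / 2
                best = max(best, score)
        return best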
4 BACKGROUND AND RELATED
WORK
Machine learning provides methods that automati-
cally learn from data and accurately predict future
data based on what we learn from current obser-
vations (generalization). In classification tasks, we
have a labelled set of training points, and we learn
the grouping structure in a supervised way. In other words, classification tries to predict the label of (unlabelled) data. However, when a labelled set is not available, it is possible to group data into "natural" categories without class labels. This task is named clustering.
Clustering algorithms deal with finding some structure in a collection of unlabelled data and, as such, are considered unsupervised learning. A loose definition of clustering could be the process of "organizing objects into groups whose members are similar in some way". A cluster is a collection of objects which are "similar" to each other and "dissimilar" to the objects belonging to other clusters. In simple terms, the aim is to segregate groups with similar traits and assign them to clusters.
Clustering algorithms can be organized into several taxonomies; one of the most common characterizations is by the underlying model. In the following list, we present four classes of clustering algorithms; a short illustrative sketch follows the list.
1. Connectivity-based clustering: Also known as hierarchical clustering, these models are based on the notion that closer data points exhibit more similarity to each other than points lying farther away. These models can follow two approaches. In the first approach, they start by classifying all data points into separate clusters and then aggregate them as the distance decreases. In the second approach, all data points are classified as a single cluster and then partitioned as the distance increases. The choice of distance function is subjective. These models are very easy to interpret but lack scalability for handling big datasets. Examples of these models are the hierarchical clustering algorithm (Ward, 1963) and its variants.
2. Centroid-based models: These are iterative clustering algorithms in which the notion of similarity is derived from the closeness of a data point to the centroid of the clusters. The number of clusters is required beforehand, which makes it important to have prior knowledge about the dataset. The K-means clustering algorithm (Lloyd, 1982) is a popular algorithm that falls into this category.
3. Distribution-based models: These clustering models are based on the notion of how probable it is that all data points in the cluster belong to the same distribution (e.g. Gaussian). A popular example of these models is the expectation-maximization algorithm (Dempster et al., 1977), which uses multivariate normal distributions.
4. Density-based models: These models search the data space for areas with a high density of data points. They isolate them into different density regions and assign the data points within these regions to the same cluster. Popular examples of density models are DBSCAN (Ester et al., 1996) and OPTICS (Ankerst et al., 1999).
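The sketch below is that illustration, assuming scikit-learn is available (Python is used only for brevity; the paper's prototype is written in Java). It runs one representative algorithm from each family on the same synthetic data:

    from sklearn.datasets import make_blobs
    from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    labels = {
        "connectivity": AgglomerativeClustering(n_clusters=3).fit_predict(X),
        "centroid": KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X),
        "distribution": GaussianMixture(n_components=3, random_state=42).fit_predict(X),
        "density": DBSCAN(eps=1.0, min_samples=5).fit_predict(X),
    }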
In Section 5 we discuss the algorithms considered
and the parameters’ estimation heuristics evaluated.
5 LEARN SEMANTIC
CATEGORIES THROUGH
CLUSTERING
As previously stated, we use clustering algorithms to
learn semantic categories from distributional profiles.
This minimizes the issue of sense-conflation and im-
proves the semantic similarity estimation. However,
most clustering algorithms require fine-tuning in or-
der to achieve good results.
In this paper, we discuss the impact of the clustering algorithm and of the methods used to estimate its initial parameters on the task of learning categories.
In our previous work (Antunes et al., 2017a) we used K-means++ (Arthur and Vassilvitskii, 2007) with a method for estimating the number of clusters similar to the gap statistic (Tibshirani et al., 2001), but without having to generate reference features to compare the clustering against a uniform sample. The framework used was proposed in (Pham et al., 2005).
Although this combination provided good results, it is unstable. Since the k-means algorithm is not deterministic, each execution can lead to different results. One way to deal with this issue is to run the algorithm several times and only keep the set of clusters with the lowest distortion; the obvious drawback is the performance penalty.
With this in mind, we intended to explore alternative clustering algorithms. The ideal clustering algorithm should be reasonably fast (if possible, deterministic) and require few or no initial parameters. Hierarchical-based and distribution-based clustering are not ideal techniques for this task. Hierarchical-based clustering is computationally expensive and requires a method to extract clusters from the hierarchy. DPW models are highly dimensional by nature, meaning that parameter estimation is rather difficult. Distribution-based clustering requires a distribution model to fit the data points; as mentioned, the high dimensionality associated with DPW makes it difficult to fit a distribution. Taking this into account, we selected three algorithms for this evaluation: K-means++, DBSCAN and Fast Greedy Clustering (FGC).
K-means++ only requires the number of clusters a priori. Several algorithms can be used to estimate this parameter; we evaluated the following approaches. The elbow method computes the distortion for each set of clusters and takes into account the percentage of variance explained as a function of the number of clusters: one should choose a number of clusters such that adding another cluster does not give much better modelling of the data. More precisely, one plots the average distortion against the number of clusters; the first clusters add much information, but at some point the marginal gain drops, giving an angle in the graph (named the "elbow"). We used the Kneedle algorithm (Satopaa et al., 2011) to identify the elbow. The average silhouette method measures the quality of a clustering by determining how well each object lies within its cluster. It uses a metric named silhouette (Rousseeuw, 1987); a high average silhouette width indicates a good clustering. The average silhouette method computes the average silhouette for different values of k, and the optimal number of clusters k is the one that maximizes the average silhouette. Finally, the gap statistic can be used to estimate the ideal number of clusters; as mentioned, we used an alternative framework to achieve the same result.
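As an illustration of the average silhouette method, the sketch below (assuming scikit-learn; names and bounds are ours) runs k-means for a range of k and keeps the k with the highest average silhouette:

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def best_k_by_silhouette(X, k_max=10, seed=0):
        # The silhouette needs at least 2 and at most n_samples - 1 clusters.
        best_k, best_score = 2, -1.0
        for k in range(2, min(k_max, len(X) - 1) + 1):
            labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
            score = silhouette_score(X, labels)  # mean silhouette over all points
            if score > best_score:
                best_k, best_score = k, score
        return best_k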
DBSCAN clusters data points based on point density; one important advantage is that it can produce clusters of any shape. It requires two parameters, ε and minPts: ε is the maximum distance used to search for neighbouring data points, and minPts is the minimum number of neighbours required to form a dense region. Unfortunately, to the best of our knowledge, there is no method or heuristic to automatically identify these two parameters. As a rule of thumb, a minimum minPts can be derived from the number of dimensions D in the dataset; however, the nature of DPW makes it difficult to select the ideal minPts. Although distributional profiles are highly dimensional, there are few data points. In fact, each data point is also a dimension: the number of data points is equal to the number of dimensions. However, it is possible to select the ideal ε based on the chosen minPts. The value for ε can then be chosen by using a k-distance graph, plotting the average distance to the k nearest neighbours ordered from the largest to the smallest value (Schubert et al., 2017). Good values of ε are where this plot shows an "elbow"; again, we used the Kneedle algorithm to automatically identify it. In our evaluation, we tested several values of minPts (from 3 to 5; more than 5 is not relevant since almost no data point had 6 neighbours) and returned the set of clusters with the lowest distortion.
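The sketch below illustrates this heuristic under our assumptions: it builds the k-distance curve with scikit-learn and picks the elbow as the point farthest from the chord joining the curve's endpoints, a simplified stand-in for the Kneedle detector used in the paper:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def estimate_eps(X, min_pts):
        # Average distance to the min_pts nearest neighbours, sorted from
        # the largest to the smallest value (the k-distance curve).
        nn = NearestNeighbors(n_neighbors=min_pts + 1).fit(X)
        dist, _ = nn.kneighbors(X)  # column 0 is the point itself
        k_dist = np.sort(dist[:, 1:].mean(axis=1))[::-1]
        x = np.arange(len(k_dist), dtype=float)
        # Elbow = point with the maximum distance to the endpoint chord.
        x1, y1, x2, y2 = x[0], k_dist[0], x[-1], k_dist[-1]
        num = np.abs((y2 - y1) * x - (x2 - x1) * k_dist + x2 * y1 - y2 * x1)
        return k_dist[np.argmax(num / np.hypot(x2 - x1, y2 - y1))]

    # Usage (X holds the profile dimensions as points):
    # eps = estimate_eps(X, min_pts=4)
    # labels = sklearn.cluster.DBSCAN(eps=eps, min_samples=4).fit_predict(X)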
FGC is a clustering algorithm developed for a specific scenario (Antunes et al., 2017b): clustering GPS data points to identify potholes. Algorithm 1 describes the inner workings of the method in detail. It is a cross between DBSCAN and K-means: it iterates over the points a single time to identify potential pairs, but only merges them if the resulting cluster does not violate the radius parameter. It has two advantages: it automatically identifies the number of clusters and it only requires a single parameter, a maximum radius for the clusters. This parameter controls the growth of each cluster. We evaluated two different heuristics to estimate the radius. The first was similar to the DBSCAN method: we fixed a number of neighbours and selected the ideal elbow from a k-distance graph; we tested several numbers of neighbours and selected the one that generated clusters with the smallest distortion. The second estimated the radius based on threshold algorithms; these algorithms separate data points into two groups (high and low). Such methods are used in image processing to convert a grey-scale image into black and white. The algorithm used to estimate the ideal threshold was IsoData (Ridler and Calvard, 1978); due to its simplicity, this algorithm can be adapted to several use cases.
Algorithm 1: Fast Greedy Clustering Algorithm.
function CLUSTERING(points, radius)
    kdTree ← KDTree.init(points, radius)
    pairs ← {}
    for each p ∈ points do
        neighbours ← kdTree.near(p, radius)
        pairs ← pairs + neighbours
    end for
    clusters ← {}
    for each p ∈ pairs do
        c ← findClosestCluster(clusters, p)
        if maxRadius(c + p) < radius then
            c ← c + p
        else
            clusters ← clusters + newCluster(p)
        end if
    end for
    return clusters
end function
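For illustration, a compact runnable version of the greedy merge logic is sketched below (the paper's implementation speeds up the neighbour search with a k-d tree; a linear scan and Euclidean distance are used here for brevity, and the helper names are ours):

    import math

    def max_radius(cluster):
        # Largest distance from the cluster's centroid to any of its points.
        centroid = [sum(col) / len(cluster) for col in zip(*cluster)]
        return max(math.dist(p, centroid) for p in cluster)

    def fast_greedy_clustering(points, radius):
        # Single pass: add each point to the closest cluster whose radius
        # stays below the limit, otherwise start a new cluster.
        clusters = []
        for p in points:
            closest = min(
                clusters,
                key=lambda c: min(math.dist(p, q) for q in c),
                default=None,
            )
            if closest is not None and max_radius(closest + [p]) < radius:
                closest.append(p)
            else:
                clusters.append([p])
        return clusters

    print(fast_greedy_clustering([(0, 0), (0.1, 0), (5, 5)], radius=1.0))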
6 IMPLEMENTATION
In this section, we discuss some relevant details about
our prototype implementation. Our prototype is di-
vided into 5 different components as depicted in Fig-
ure 2. All the components were written in Java.
The first component (corpus extraction) bridges
our solution with web search engines. Given a target
Figure 2: Proposed DP extraction system’s architecture.
word u, our prototype uses web search engines to extract its DPW(u) and DPWC(u). It can be used with any search engine; currently, it uses three: Faroo¹, Yacy² and Searx³. The basic function of this component is to extract a corpus from search engines. The
corpus is composed of snippets returned by search-
ing for the target word. In a previous work (Antunes
et al., 2017a) we compared the impact of using only
snippets against the full web-pages. We observed that
snippets contain enough information to build reliable
DPWs.
The second component (text processing) implements a preprocessing pipeline that cleans the corpus and divides it into tokens. The various stages of the pipeline are depicted in Figure 3. First, the snippets are tokenized and the resulting tokens are filtered using a stop-word filter. Stop words are deemed irrelevant because they occur frequently in the language and provide little information; we used the MySQL stop-word list⁴. For the exact same reason, we also remove tokens that are too big or too small: any token with fewer than 3 or more than 14 characters (9 being the average word length in English) was removed from the pipeline.
Figure 3: Text processing pipeline.
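A minimal sketch of this pipeline (the stop-word list here is a tiny illustrative subset, not the full MySQL list used by the prototype):

    import re

    STOP_WORDS = {"the", "and", "for", "with", "that", "this", "of"}

    def preprocess(snippet):
        # Tokenize, drop stop words, then drop tokens outside the
        # 3-14 character range.
        tokens = re.findall(r"[a-z]+", snippet.lower())
        return [t for t in tokens
                if t not in STOP_WORDS and 3 <= len(t) <= 14]

    print(preprocess("The water sensor reports the flow of the liquid"))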
¹ http://www.faroo.com/hp/api/api.html
² http://yacy.net/en/index.html
³ https://searx.me/
⁴ https://dev.mysql.com/doc/refman/5.1/en/fulltext-stopwords.html
The DPW extraction component analyses the output of the pipeline and extracts the DPW of the target word u. After extracting the DPW, we cluster the profile dimensions based on co-occurrence similarity. The methods discussed in Section 5 were implemented and used in turn for the evaluation. Finally, the DPWC component uses the DPW and the clusters to return the DPWC(u) of the target word; this component also computes the affinity between the target word and each category.
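For illustration, the sketch below counts co-occurrences within a fixed window around the target word, yielding the DPW of Equation 1 (the window size and token list are invented for the example):

    from collections import Counter

    def extract_dpw(tokens, target, window=5):
        # Count every word occurring within +-window positions of the target.
        profile = Counter()
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    profile[tokens[j]] += 1
        return dict(profile)

    tokens = ["water", "flows", "liquid", "water", "fills", "boat"]
    print(extract_dpw(tokens, "water", window=2))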
7 PERFORMANCE EVALUATION
We evaluated our model against the Miller-Charles dataset (Miller and Charles, 1991), the reference dataset for semantic similarity evaluation. It is composed of 30 word pairs rated by a group of 38 human subjects. The word pairs are rated on a scale from 0 (no similarity) to 4 (perfect synonymy).
To the best of our knowledge, there is no semantic dataset specifically for IoT/M2M available. In order to evaluate our semantic features against IoT vocabulary, we devised one. We mined a popular IoT platform⁵ to extract the most commonly used terms (ranked by term frequency). The 20 most used terms were collected and organized into 30 word pairs. Each pair was rated on a scale from 0 to 4 by five fellow researchers. Although not as comprehensive as the Miller-Charles dataset, ours still reaches a 0.8 correlation amongst the human classifications. In future work, we intend to further explore and improve our dataset. The final similarity of each pair is the average of the previously stated rates. This dataset is publicly available⁶ and can be used by other researchers.
Correlation between two sets of data is a measure of how well they are related. The correlation r can range from −1 to 1. An r of −1 indicates a perfect negative linear relationship between variables, an r of 0 indicates no linear relationship between variables, and an r of 1 indicates a perfect positive linear relationship between variables. In short, the highest correlation indicates the most accurate solution.
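Concretely, the evaluation reduces to computing Pearson's r between the human ratings and the model's similarity scores for the same word pairs; a sketch with invented numbers, assuming SciPy is available:

    from scipy.stats import pearsonr

    human = [3.9, 0.4, 2.7, 3.5, 1.1]       # illustrative human ratings (0-4)
    model = [0.82, 0.10, 0.55, 0.70, 0.33]  # illustrative model scores

    r, _ = pearsonr(human, model)
    print(f"r = {r:.2f}")  # a higher r indicates a more accurate solution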
We evaluated the performance of DPW(u) and
DPWC(u) (using multiple clustering algorithms).
Our models were learned from a corpus formed from
the top 300 snippets returned by three search engines:
Faroo, Yacy and Searx.
The results of the evaluation using the Miller-Charles and IoT datasets are listed in Table 1 and Table 2 respectively.
⁵ ThingSpeak: https://thingspeak.com/
⁶ https://github.com/mariolpantunes/ml/blob/master/src/main/resources/en-iot-30.csv
The optimal neighbourhood size for the Miller-Charles and IoT datasets appears to be 7 and 5
respectively. K-means with the silhouette method outperforms the other clustering algorithms. In our tests, it proved to be more stable and robust than the statistic framework, which appears to be less robust to the non-deterministic behaviour of the clustering algorithm. Through manual verification, we observed that when the clusters better represent semantic categories, all three methods performed well. However, the overall quality of the clusters decreased when the elbow and statistic methods had difficulty in identifying the correct number of clusters. Both DBSCAN and FGC generated a single cluster with all the data points. As mentioned, the high dimensionality associated with DPW profiles leads to a classical curse-of-dimensionality problem. Since each data point is also a dimension, it becomes a highly dimensional clustering problem with few examples. In short, the heuristics used to estimate the optimal distances (either ε or radius) selected a distance large enough to group all the data points into a single cluster.

In a future publication, we will address this issue and devise new methods to correctly estimate the distance for DBSCAN and FGC in this specific scenario. In the meantime, k-means combined with the average silhouette method appears to be the ideal method to learn semantic categories from distributional profiles.
Table 1: Performance evaluation on Miller-Charles dataset.

                               Neighbourhood size
Methods                          3     5     7
DPW                            0.32  0.37  0.45
DPWC (k-means elbow)           0.37  0.27  0.25
DPWC (k-means silhouette)      0.34  0.45  0.61
DPWC (k-means statistic)       0.28  0.41  0.46
DPWC (DBSCAN elbow)            0.32  0.37  0.45
DPWC (FGC elbow)               0.32  0.37  0.45
DPWC (FGC IsoData)             0.32  0.37  0.45
Table 2: Performance evaluation on IoT dataset.

                               Neighbourhood size
Methods                          3     5     7
DPW                            0.25  0.38  0.32
DPWC (k-means elbow)           0.35  0.32  0.39
DPWC (k-means silhouette)      0.37  0.51  0.39
DPWC (k-means statistic)       0.15  0.19  0.29
DPWC (DBSCAN elbow)            0.25  0.38  0.32
DPWC (FGC elbow)               0.25  0.38  0.32
DPWC (FGC IsoData)             0.25  0.38  0.32
8 CONCLUSIONS
The number of IoT devices is increasing at a steady pace, each of them capable of generating massive amounts of diverse data. However, each device/manufacturer shares context information with a different structure, hindering interoperability in IoT and M2M scenarios.
In a previous publication, we developed a model
named Distributional Profile of Word Categories
(DPWC) and an unsupervised method to learn the
model from common web services, such as search
engines. The learning method uses clustering al-
gorithms to identify word categories automatically
(word category is closely related to concepts and the
possible meaning of a word). The implementation re-
lied on k-means to cluster the distributional profile
into categories. However, there are several cluster-
ing algorithms, each one of them with specific advan-
tages. It is important to mention the high dimension-
ality associated with DPW distributional profiles. In
this paper, we explore the impact of clustering on cat-
egory detection. Since the DPWC learning method
is unsupervised, we are also interested in automatic
methods to estimate the clustering parameters.
For our evaluation, we selected k-means, DBSCAN and FGC as good candidates to learn semantic categories. Each of them has at least one heuristic to estimate its ideal parameters. All of them were evaluated using two different datasets: the Miller-Charles dataset (Miller and Charles, 1991) and an IoT semantic dataset. Based on our results, we can state that k-means combined with the silhouette method appears to be the ideal combination to learn semantic categories. The other two clustering algorithms appear to suffer from the curse of dimensionality and output a single cluster with all the data points. This implies that the heuristics used were not able to capture the relevant characteristics of the data.
There is still room for improvement: hypernyms can be used to learn more abstract dimensions, improving performance and decreasing the size of each distributional profile. Non-negative matrix factorization can also be used to discover latent semantic information in distributional profiles and increase accuracy (by estimating the value of dimensions with value zero). Furthermore, other elbow detection methods can be devised and tested; these algorithms have a crucial role in selecting the ideal ε and radius for DBSCAN and FGC respectively. We intend to explore several of the previously mentioned optimizations and improve our model. Nevertheless, our model was able to learn distributional profiles from a small corpus, achieving a relatively high accuracy on both datasets.
ACKNOWLEDGEMENTS
The present study was developed in the scope of
the Smart Green Homes Project [POCI-01-0247-
FEDER-007678], a co-promotion between Bosch
Termotecnologia S.A. and the University of Aveiro.
It is financed by Portugal 2020 under the Competitive-
ness and Internationalization Operational Program,
and by the European Regional Development Fund.
This work was also partially supported by research
grant SFRH/BD/94270/2013.
REFERENCES
Abowd, G. D., Dey, A. K., Brown, P. J., Davies, N., Smith,
M., and Steggles, P. (1999). Towards a better under-
standing of context and context-awareness. In Proc.
of the 1st international symposium on Handheld and
Ubiquitous Computing, pages 304–307.
Ankerst, M., Breunig, M. M., Kriegel, H.-P., and Sander, J.
(1999). OPTICS: ordering points to identify the clus-
tering structure. ACM SIGMOD Record, 28(2):49–60.
Antunes, M., Gomes, D., and Aguiar, R. L. (2016). Scalable
semantic aware context storage. Future Generation
Computer Systems, 56:675–683.
Antunes, M., Gomes, D., and Aguiar, R. L. (2017a). To-
wards IoT data classification through semantic fea-
tures. Future Generation Computer Systems.
Antunes, M., Gomes, D., Barraca, J. P., and Aguiar, R. L.
(2017b). Vehicular dataset for road assessment condi-
tions. In Proceedings of the third IEEE Annual International Smart Cities Conference (ISC2 2017).
Arthur, D. and Vassilvitskii, S. (2007). K-means++: The
advantages of careful seeding. In Proceedings of the
Eighteenth Annual ACM-SIAM Symposium on Dis-
crete Algorithms, SODA ’07, pages 1027–1035. So-
ciety for Industrial and Applied Mathematics.
Chen, K.-C. and Lien, S.-Y. (2014). Machine-to-machine
communications: Technologies and challenges. Ad
Hoc Networks, 18:3–23.
Datta, S. K., Bonnet, C., Costa, R. P. F. D., and Härri, J.
(2016). Datatweet: An architecture enabling data-
centric iot services. In 2016 IEEE Region 10 Sym-
posium (TENSYMP), pages 343–348.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977).
Maximum likelihood from incomplete data via the
EM algorithm. Journal of the Royal Statistical So-
ciety, Series B, 39(1):1–38.
Dey, A. K. (2001). Understanding and using context. Per-
sonal and Ubiquitous Computing, 5(1):4–7.
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996).
A density-based algorithm for discovering clusters a
density-based algorithm for discovering clusters in
large spatial databases with noise. In Proceedings of
the Second International Conference on Knowledge
Discovery and Data Mining, KDD’96, pages 226–
231. AAAI Press.
Fantacci, R., Pecorella, T., Viti, R., and Carlini, C. (2014).
Short paper: Overcoming iot fragmentation through
standard gateway architecture. In 2014 IEEE World
Forum on Internet of Things (WF-IoT), pages 181–
182.
Jarmasz, M. and Szpakowicz, S. (2004). Roget’s thesaurus
and semantic similarity. In Recent Advances in Nat-
ural Language Processing III, page 111. John Ben-
jamins Publishing Company.
Jesus, R., Antunes, M., Gomes, D., and Aguiar, R. (2017).
Extracting knowledge from stream behavioural pat-
terns. In Proceedings of the 2nd International Con-
ference on Internet of Things, Big Data and Security.
SCITEPRESS - Science and Technology Publications.
Lloyd, S. (1982). Least squares quantization in pcm. In-
formation Theory, IEEE Transactions on, 28(2):129–
137.
Miller, G. A. and Charles, W. G. (1991). Contextual corre-
lates of semantic similarity. Language and Cognitive
Processes, 6(1):1–28.
Perera, C., Zaslavsky, A., Christen, P., and Georgakopoulos,
D. (2014). Context aware computing for the internet
of things: A survey. IEEE Communications Surveys
Tutorials, 16(1):414–454.
Pham, D. T., Dimov, S. S., and Nguyen, C. D. (2005). Se-
lection of k in k-means clustering. Proceedings of the
Institution of Mechanical Engineers, Part C: Journal
of Mechanical Engineering Science, 219(1):103–119.
Ridler, T. and Calvard, S. (1978). Picture thresholding using
an iterative selection method. IEEE Transactions on
Systems, Man, and Cybernetics, 8(8):630–632.
Robert, J., Kubler, S., Traon, Y. L., and Främling, K. (2016).
O-mi/o-df standards as interoperability enablers for
industrial internet: A performance analysis. In IECON
2016 - 42nd Annual Conference of the IEEE Industrial
Electronics Society, pages 4908–4915.
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to
the interpretation and validation of cluster analysis.
Journal of Computational and Applied Mathematics,
20:53–65.
Satopaa, V., Albrecht, J., Irwin, D., and Raghavan, B.
(2011). Finding a "kneedle" in a haystack: Detecting
knee points in system behavior. In 2011 31st Interna-
tional Conference on Distributed Computing Systems
Workshops. IEEE.
Schubert, E., Sander, J., Ester, M., Kriegel, H. P., and Xu,
X. (2017). DBSCAN revisited, revisited. ACM Trans-
actions on Database Systems, 42(3):1–21.
Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimat-
ing the number of clusters in a data set via the gap
statistic. Journal of the Royal Statistical Society: Se-
ries B (Statistical Methodology), 63(2):411–423.
Ward, J. H. (1963). Hierarchical grouping to optimize an
objective function. Journal of the American Statistical
Association, 58(301):236–244.
Wortmann, F., Flüchter, K., et al. (2015). Internet of
things. Business & Information Systems Engineering,
57(3):221–224.