set. As seen in Figure 5, Decision Tree has the highest
recall (92.86%), followed by Naïve Bayes and KNN,
while the rule-based algorithm performs the worst
(47.42%). According to these results, Decision Tree
can find the key attributes better than Naïve Bayes,
KNN and the rule-based algorithm. One reason Decision
Tree yields the best recall is that it builds its tree
by selecting, at each node, the feature that best
separates the data. In this case, the attribute name is
the root of the tree, which coincides with our
observations in Section 3.
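
The feature selection described above can be sketched
with an information-gain criterion. This is only an
illustration: the text does not state which split
criterion was used, so ID3-style information gain is an
assumption, and the rows, the feature names `name` and
`type`, and the label `is_key` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, feature, label="is_key"):
    """Entropy reduction obtained by splitting rows on one feature."""
    base = entropy([r[label] for r in rows])
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[label] for r in rows if r[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical toy attributes, NOT the data set used in the paper.
rows = [
    {"name": "temperature", "type": "number", "is_key": True},
    {"name": "humidity", "type": "number", "is_key": True},
    {"name": "id", "type": "string", "is_key": False},
    {"name": "timestamp", "type": "number", "is_key": False},
    {"name": "temperature", "type": "number", "is_key": True},
    {"name": "id", "type": "number", "is_key": False},
]

# The feature with the largest gain would become the root of the tree.
gains = {f: information_gain(rows, f) for f in ("name", "type")}
root = max(gains, key=gains.get)
print(root, gains)
```

On this toy data the attribute name separates the two
classes perfectly, so it is chosen as the root, which
mirrors the behaviour reported above.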
The precision expresses the proportion of the
attributes labeled by an algorithm as key attributes
that are actual key attributes, calculated using (2). As
illustrated in Figure 5, KNN has the highest precision
(33.91%) and the runner-up is the rule-based
algorithm (19.74%), while Decision Tree and Naïve
Bayes perform similarly. KNN yields the best
precision because it runs with K=1. This means KNN
labels an attribute as a key attribute only when its
single nearest neighbor is a true key attribute. In
our case, KNN tries to find the closest attribute name,
the closest attribute type, the closest meta data type,
and the closest repetition. From our observations,
several key attributes have similar but not identical
names. This is why the rule-based algorithm performs
worse than KNN: it checks for an exact match of
certain attribute names, together with the data type,
meta data type and repetition.
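
The contrast between exact matching and a K=1 nearest
neighbor lookup can be sketched as follows. This is a
simplified illustration, not the implementation that was
evaluated: the training rows are hypothetical, only the
attribute name and data type are used (meta data type and
repetition are left out), and the similarity function
based on `difflib.SequenceMatcher` is an assumption,
since the text does not specify the distance measure.

```python
from difflib import SequenceMatcher

# Hypothetical labelled training attributes: (name, data type, is_key).
TRAINING = [
    ("temperature", "number", True),
    ("humidity", "number", True),
    ("device_id", "string", False),
    ("timestamp", "number", False),
]


def rule_based(name, dtype):
    """Exact-match rule: key only if (name, type) equals a known key attribute."""
    return any(is_key and name == n and dtype == t
               for n, t, is_key in TRAINING)


def similarity(a, b):
    """Assumed similarity: name string ratio plus 1.0 when the types are equal."""
    name_score = SequenceMatcher(None, a[0].lower(), b[0].lower()).ratio()
    type_score = 1.0 if a[1] == b[1] else 0.0
    return name_score + type_score


def one_nn(name, dtype):
    """K=1 nearest neighbor: copy the label of the most similar training row."""
    nearest = max(TRAINING, key=lambda row: similarity((name, dtype), row))
    return nearest[2]


# An unseen attribute whose name is similar, but not identical, to a key one.
unseen = ("air_temperature", "number")
print(rule_based(*unseen), one_nn(*unseen))
```

In this sketch the exact-match rule rejects
`air_temperature`, whereas the nearest neighbor lookup
finds `temperature` and labels the attribute as key.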
We also compute the accuracy values for the rule-
based algorithm, Decision Tree, Naïve Bayes and
KNN, using (3). As shown in Figure 5, KNN has the
best accuracy (87.15%), while the rule-based
algorithm has the worst accuracy (83.91%). Decision
Tree and Naïve Bayes achieve similar accuracy.
The false positive rate (FPR) is calculated using
(4). A false positive is a non-key attribute that is
classified as a key attribute. A higher FPR implies
worse usability because the system shows the value of
a non-significant attribute. As illustrated in Figure 5,
the rule-based algorithm has the highest false positive
rate (13.53%), while KNN has the lowest FPR
(11.29%). Decision Tree and Naïve Bayes have
similar false positive rates.
[Figure: the accuracy rate of the rule-based algorithm,
Decision Tree, Naïve Bayes and KNN, in percent.]
The false negative rate (FNR) is computed using
(5). As shown in Figure 5, the rule-based algorithm
yields the worst false negative rate (52.58%), while
Decision Tree has the lowest false negative rate
(7.14%). Both Naïve Bayes and KNN have similar
FNR, 30.56% and 31.30% respectively. The rule-based
algorithm performs badly because its rules do not
cover all cases of key attributes, so it misses many
actual key attributes.
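
The five measures discussed in this section can be
computed from a confusion matrix as sketched below. The
sketch assumes that the referenced equations follow the
standard definitions; the function name and the counts
are hypothetical and are not taken from the experiments.
Because recall and FNR share the same denominator, they
sum to 100%, which agrees with the reported pairs
(92.86% and 7.14% for Decision Tree, 47.42% and 52.58%
for the rule-based algorithm).

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix measures, returned as percentages.

    tp/fn: actual key attributes labelled key / non-key.
    fp/tn: actual non-key attributes labelled key / non-key.
    """
    return {
        "recall": 100 * tp / (tp + fn),
        "precision": 100 * tp / (tp + fp),
        "accuracy": 100 * (tp + tn) / (tp + fp + tn + fn),
        "fpr": 100 * fp / (fp + tn),
        "fnr": 100 * fn / (tp + fn),
    }


# Hypothetical counts chosen only for illustration.
m = classification_metrics(tp=30, fp=70, tn=880, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.2f}%")
```

With few key attributes among many non-key ones, as in
these made-up counts, accuracy can stay high even when
precision is low, so the measures are best read together.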
In summary, according to the results in Figure 5,
KNN outperforms the other algorithms, namely the
rule-based algorithm, Decision Tree and Naïve Bayes.
KNN yields the highest precision and accuracy
and the lowest false positive rate. Its recall and
false negative rate are moderate, close to those of
Naïve Bayes. In contrast, the rule-based algorithm has
the worst recall, the lowest accuracy and the highest
false negative rate. The rule-based algorithm seems to
perform the worst because its rules are defined
statically and cannot adapt to unseen data, resulting
in high false positive and false negative rates.
Although Decision Tree has the highest recall, it also
yields a very low precision. This suggests that the
constructed decision tree is overfitting.
7 CONCLUSION
Integration of data and information exchange
among various IoT devices and systems is one of
the core problems in providing a pervasive computing
environment. The proliferation of APIs and IoT
devices in heterogeneous environments requires
different systems to integrate and utilize various API
services. In this paper, we propose a technique that
utilizes recent developments in machine learning to
facilitate the key integration point and allow systems
to automatically identify and utilize key data
attributes from heterogeneous sources. Different
machine learning approaches have been evaluated as
an alternative to manual integration of heterogeneous
data, with the aim of reducing the time for new
services to be implemented and integrated with
existing sources of information. From our experiments,
KNN is the most promising algorithm for classifying a
key attribute, which is essential to data verification.
The rule-based algorithm seems to perform the worst
because its rules are rigid and static and rely on
exact matches, so they fail on unseen attribute names.
In contrast, KNN has more flexibility to find a key
attribute by using the training data as a guide.