system by collecting evidence from heterogeneous sources using a statistical approach. The candidate concepts were extracted and organized into 'is-a' relations using a chi-squared co-occurrence significance score. In comparison with (Wohlgenannt, 2015), we use a structured machine learning approach that can be applied to unseen datasets. In (Wohlgenannt, 2015), all evidence was integrated into a large semantic network and the spreading activation method was used to find the most important candidate concepts. The candidate concepts were then manually evaluated before being added to an ontology. In contrast, our approach identifies latent features from the data, e.g. context and polysemy features, to train a machine learning classifier, and therefore exploits richer data characteristics than (Wohlgenannt, 2015).
Furthermore, ours is a probability-based classifier and can be applied to any new data to extract and classify important concepts effectively. The only manual intervention involved in our approach is assigning labels to the n-grams included in the training data. Finally, the model proposed by (Wohlgenannt, 2015) is deterministic in nature and does not consider the notion of context; it is therefore difficult to see how such a model could be generalized to extract concepts used in different contexts in new data.
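For illustration, the following is a minimal sketch of how a chi-squared co-occurrence significance score can be computed from a 2x2 contingency table of term co-occurrence counts; the windowing scheme and all counts are assumptions made here for the example, not the setup used by (Wohlgenannt, 2015).

# Illustrative sketch: chi-squared co-occurrence significance of two terms,
# computed from a 2x2 contingency table over context windows.
def chi_squared_cooccurrence(n_ab, n_a, n_b, n_total):
    """n_ab: windows containing both terms; n_a, n_b: windows containing
    each term individually; n_total: total windows (all counts assumed)."""
    # Observed contingency table
    o11 = n_ab
    o12 = n_a - n_ab
    o21 = n_b - n_ab
    o22 = n_total - n_a - n_b + n_ab
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    # Sum of (observed - expected)^2 / expected over the four cells
    chi2 = 0.0
    for o, e in [(o11, row1 * col1 / n_total),
                 (o12, row1 * col2 / n_total),
                 (o21, row2 * col1 / n_total),
                 (o22, row2 * col2 / n_total)]:
        if e > 0:
            chi2 += (o - e) ** 2 / e
    return chi2

# Example: two terms co-occur in 40 of 10,000 windows.
print(chi_squared_cooccurrence(n_ab=40, n_a=120, n_b=300, n_total=10_000))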
(Doing-Harris et al., 2015) makes use of cosine similarity, TF-IDF, a C-value statistic, and POS to extract the candidate concepts for constructing an ontology. This work follows a combined statistical and linguistic approach. The key difference between our work and the one proposed in (Doing-Harris et al., 2015) is that ours is a principled machine learning model, which makes our system scalable to extracting and classifying multi-gram terms from industrial-scale new data without manual intervention. Linguistic features such as POS exploit syntactic information for a better understanding of the text.
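As a brief illustration, the sketch below computes the classic C-value statistic for a multi-word term candidate (following the standard Frantzi et al. formulation); the exact variant used by (Doing-Harris et al., 2015) may differ, and the terms and counts are placeholders.

import math

# Sketch of the classic C-value statistic for multi-word term candidates.
def c_value(term, freq, nesting_freqs):
    """term: candidate multi-word term (tuple of tokens, length >= 2);
    freq: its corpus frequency;
    nesting_freqs: frequencies of longer candidates that contain it."""
    length_weight = math.log2(len(term))
    if not nesting_freqs:
        return length_weight * freq
    # Discount occurrences explained by longer nesting terms
    return length_weight * (freq - sum(nesting_freqs) / len(nesting_freqs))

# Example: 'gas turbine' occurs 50 times and is nested in
# 'gas turbine engine' (30 times) and 'gas turbine blade' (10 times).
print(c_value(("gas", "turbine"), 50, [30, 10]))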
(Yosef et al., 2012) constructs a hierarchical ontology by employing a support vector machine (SVM). The SVM model relies heavily on part-of-speech (POS) tags as the primary feature to determine the classification hyperplane boundary. In comparison to (Yosef et al., 2012), our approach uses POS as one of the features, but we also consider additional features, such as context, polysemy, and word embeddings, to establish the context of unigram and multi-gram concepts. Moreover, we perform two rounds of active learning to further boost the classifier's performance. As word embedding features are not considered by (Yosef et al., 2012), it is difficult to envisage how the context associated with each concept was taken into account during extraction.
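To make the feature-combination idea concrete, the sketch below concatenates heterogeneous features (a POS one-hot vector, a polysemy count, and an averaged word embedding) into a single vector per candidate and trains a probabilistic classifier; the feature dimensions, the random data, and the choice of logistic regression are purely illustrative assumptions, not the configuration used in our system.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative sketch: one feature vector per n-gram candidate, built by
# concatenating POS one-hot, polysemy count, and averaged word embedding.
def build_feature_vector(pos_onehot, polysemy_count, embedding):
    return np.concatenate([pos_onehot, [polysemy_count], embedding])

rng = np.random.default_rng(0)
X = np.stack([build_feature_vector(rng.integers(0, 2, 5),   # POS one-hot (assumed 5 tags)
                                   rng.integers(1, 6),      # polysemy count
                                   rng.normal(size=50))     # averaged embedding
              for _ in range(200)])
y = rng.integers(0, 2, 200)                # 1 = important concept, 0 = not
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:3]))            # posterior probabilities per candidate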
(Pembeci, 2016) evaluates the effectiveness of word2vec features in ontology construction. Statistics based on 1-gram and 2-gram counts were used to extract the candidate concepts; however, the actual ontology was then constructed manually. In our work, we not only train a word2vec model to derive word embedding based context features, but also use other critical features, such as POS and polysemy features, to train a robust probabilistic machine learning model. The word embedding features included in our approach dominate the statistical features; therefore, additional statistical features are not used.
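A minimal sketch of training a word2vec model to obtain embedding-based context features is shown below, using gensim; the toy corpus and the hyperparameters are placeholders and do not reflect the settings used in our experiments.

from gensim.models import Word2Vec

# Minimal sketch: train skip-gram word2vec on a small tokenised corpus.
sentences = [["replace", "fuel", "pump"],
             ["inspect", "fuel", "filter"],
             ["pump", "seal", "leaking"]]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

# Each unigram gets an embedding vector; multi-gram candidates can, for
# example, average the vectors of their constituent tokens.
print(model.wv["pump"][:5])
print(model.wv.similarity("pump", "filter"))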
(Ahmad and Gillam, 2005) constructs an ontology using the 'weirdness' statistic. Collocation analysis was performed, along with a domain expert verification process, to construct the final ontology. There are two key differences between our approach and the one proposed by (Ahmad and Gillam, 2005). Firstly, in our approach labelled training data, different features, and stop words are used to train a classification model, whereas in their approach the 'weirdness' and 'peakedness' statistics are used to extract the candidate concepts. Secondly, their work relies heavily on domain experts to verify and curate the newly constructed ontology. With our approach, no such manual intervention is needed during the concept extraction or classification stages. Hence, our system can be deployed as a standalone tool to learn an ontology from unseen data.
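For reference, the 'weirdness' statistic is the ratio of a term's relative frequency in the domain corpus to its relative frequency in a general reference corpus. A minimal sketch follows; the counts are illustrative, and the add-one smoothing is an assumption made here to avoid division by zero, not part of the original formulation.

# Sketch of the 'weirdness' statistic (Ahmad and Gillam).
def weirdness(domain_count, domain_total, general_count, general_total):
    domain_rel = domain_count / domain_total
    general_rel = (general_count + 1) / (general_total + 1)  # smoothing (assumption)
    return domain_rel / general_rel

# A domain-specific term is far more frequent in the domain corpus
# than in general English, yielding a high weirdness score.
print(weirdness(domain_count=500, domain_total=100_000,
                general_count=50, general_total=10_000_000))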
In our work, we also propose a new approach to
disambiguate abbreviations. There are several related
works. (Stevenson et al., 2009) extract features, such as concept unique identifiers, and then build a classification model. (HaCohen-Kerner et al., 2008) identify context-based features to train a classifier, but they assume that an ambiguous phrase has only one correct expansion within the same article. (Li et al., 2015) propose a word embedding based approach that selects, from all possible expansions, the one with the largest embedding similarity. There are two major differences between our approach and these works. First, we propose a new model that seamlessly combines a statistical approach (TF-IDF) with a machine learning model (a Naïve Bayes classifier). That is, we measure the importance of each concept in terms of TF-IDF and then estimate the posterior probability of each possible expansion. Alternative approaches either apply only a machine learning model or simply calculate the statistical similarity between an abbreviation and its possible expansions. Second, these works make the strong assumption that each abbreviation has only a single expansion within the same article and that the features are therefore conditionally independent. No such assumption is made in our approach, which makes it more robust.
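The general idea of combining TF-IDF with a Naïve Bayes classifier for expansion selection can be sketched as below; this is only an illustration of the combination, not our exact model, and the abbreviation contexts and expansion labels are invented placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative sketch: TF-IDF weights represent the context of an abbreviation,
# and Naive Bayes estimates the posterior probability of each candidate expansion.
contexts = ["pressure drop across the hx core",
            "hx scheduled for overhaul next cycle",
            "patient history hx of hypertension",
            "hx notes recorded by attending physician"]
expansions = ["heat exchanger", "heat exchanger", "history", "history"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(contexts, expansions)

new_context = "replace the hx inlet gasket"
print(model.predict([new_context]))          # most probable expansion
print(model.predict_proba([new_context]))    # posterior over all expansions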