ing samples. Second, an assumption is made that it
is known which value corresponds to each feature,
for example, that small and large are values of the
size feature. Third, it is also assumed that a single
training sample contains information about all the
relevant features of the concept, which are known in
advance and pre-selected by the user. In practice,
robots would need to learn from information extracted
from the sensors, and would have to solve the symbol
grounding problem by relating their observations with
verbal cues provided by a human user. The learning
information is also not structured as complete feature-
value arrays, but rather comes from natural means of
communication (such as speech or visual cues), which
by nature cannot express information in this way. A
human user might only mention that an object is large
or small, but not that these values relate to the object’s
size. In addition, there is no guarantee that a human
would enumerate all the attributes of an object of
interest to provide complete information.
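The contrast above can be illustrated with a small sketch (all names here are hypothetical, chosen only for illustration): a classical concept-learning sample is a complete feature-value array, whereas a verbal cue carries only values, leaving the robot to infer the corresponding features and everything the speaker left unsaid.

```python
# Hypothetical illustration of the gap between structured training data
# and natural verbal cues (feature names and values are assumptions).

# Complete, structured training sample: every feature is named and valued.
structured_sample = {"size": "large", "color": "red", "shape": "ball"}

# Natural interaction: the user just says "large" -- the robot must infer
# that "large" is a value of the (unmentioned) "size" feature.
verbal_cue = ["large"]

def missing_information(sample, cue):
    """Return the features whose values the cue leaves unspecified."""
    return {feat: val for feat, val in sample.items() if val not in cue}

print(missing_information(structured_sample, verbal_cue))
# -> {'color': 'red', 'shape': 'ball'}
```

The verbal cue covers only one of three feature values; a learner restricted to such cues must accumulate the rest incrementally over many interactions, which is the motivation for the developmental approach taken here.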
This paper takes a developmental approach to
concept learning, in order to address the above lim-
itations. The hypothesis is that robots need to be able
to learn from visual and auditory cues during interac-
tions with human teachers, in an incremental fashion,
in a manner similar to how young children acquire
new concepts.
The remainder of the paper is structured as fol-
lows: Section 2 discusses related research, Section 3
describes our approach, Section 4 presents our results
and Section 5 gives a summary of our work.
2 RELATED WORK
Concept learning is a significant research problem,
which has been addressed in computer science, cogni-
tive science, neuroscience and psychology. This sec-
tion presents related research in these areas.
The goal of psychology and child development re-
search, as it relates to concept learning, is to under-
stand the mechanisms that underlie the formation of
concepts in children and humans in general. Various
aspects of this problem have been explored. (Schyns
et al., 1998) explore the interplay between high-level
cognitive processes and the perceptual system, which
gives rise to new concept formation. (Feldman,
2003) proposes a principle that indicates that people
induce the simplest categories consistent with a given
set of examples and introduces an algebra for repre-
senting human concept learning. (Kaplan and Mur-
phy, 2000) evaluate the effect of prior knowledge on
category learning and suggest that the category exem-
plars as well as prior knowledge about the category’s
domain influence the learning process. The concept
learning approach proposed in this paper aims to build
a feature space for representing the concepts. The is-
sue of category dimensionality has been examined in
(Hoffman and Murphy, 2006), supporting the moti-
vation to address this problem at the level of the ob-
ject feature space. This approach is consistent with
findings in child psychology research, which indicate
that children start by learning the individual features
and only form a single category after more exten-
sive familiarization (Horst et al., 2005). (Mayor and
Plunkett, 2008) present work closely related to ours,
as they learn to associate spoken words with the image
of the associated object in order to model early word
learning. In contrast, our method attempts to associate
words with whatever they refer to within an image,
whether that be the entire object or a piece, position,
or characteristic of the object.
This paper takes the view that a robot should learn
by using both language and vision input, a strategy
that studies in neuroscience and psychology suggest
human children also follow (Scholl, 2005), (Pinker, 2007). The
simple comparison of sights and sounds may allow an
infant to develop a world model, and development re-
lies on interaction with people and the environment.
For more information on developmental robotics see
(Lungarella et al., 2003). Most previous work done
with images and text has been done in data mining.
For example, images from the internet can be auto-
matically associated with labels, such as those on
websites like Flickr, or web pages related to keywords can be
retrieved. Usually the focus of these works is not to
learn the meaning of the words but to accurately label
the images so that a user may find them quickly with a
text search. However, this is essentially the same problem,
and many of these techniques may be applied here, es-
pecially methods used to eliminate poor labels which
are common in internet databases (Brodley and Friedl,
1999). There has been much work done purely on im-
ages or on text, such as in the cases of document re-
trieval and content-based image retrieval, which rely
on word features or image features, not on both. For
a more complete review of existing methods see (Lew
et al., 2006).
The field of computer vision provides a wide spec-
trum of approaches to this problem. (Huang and Le-
Cun, 2006) propose a combination of support vector
machines (SVMs) and convolutional nets to charac-
terize objects under variable illumination conditions
and from multiple viewpoints. (Yang and
Kuo, 2000) use content as the relevant feature to
categorize images. (Einhauser and Konig, 2002)
demonstrate how a hierarchical neural network
evolves structures invariant to features such as
ICINCO 2013 - 10th International Conference on Informatics in Control, Automation and Robotics