hts for human-designed representations and features. Recently, representation learning, as deep learning is sometimes called, has emerged as a new area of machine learning research that attempts to automatically learn good latent features. Deep learning attempts to learn multiple levels of representation of increasing complexity and abstraction. The goal of this approach is to explore how computers can take advantage of data to develop features and representations appropriate for complex interpretation tasks. The central idea behind early deep learning models was to pre-train neural networks layer by layer in an unsupervised fashion, which allows a hierarchy of features to be learned one level at a time. Because such pre-training is purely unsupervised, researchers can take advantage of vast amounts of unlabeled data. This makes deep learning particularly well suited to image and natural language processing tasks, where unlabeled images and texts abound. Additionally, deep features can be used as input to standard supervised machine learning methods.
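To make the layer-by-layer idea concrete, the following is a minimal sketch of greedy layer-wise pre-training with autoencoders; the layer sizes, optimizer, and training schedule are illustrative assumptions, not settings taken from the literature.

import torch
import torch.nn as nn

def pretrain_stack(data, layer_sizes, epochs=50, lr=1e-3):
    """Greedily pre-train each layer as an autoencoder on the codes of the previous one."""
    encoders, inputs, in_dim = [], data, data.shape[1]
    for out_dim in layer_sizes:
        encoder = nn.Linear(in_dim, out_dim)
        decoder = nn.Linear(out_dim, in_dim)
        opt = torch.optim.Adam(
            list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
        for _ in range(epochs):
            hidden = torch.relu(encoder(inputs))
            # unsupervised objective: reconstruct the layer's own input
            loss = nn.functional.mse_loss(decoder(hidden), inputs)
            opt.zero_grad()
            loss.backward()
            opt.step()
        encoders.append(encoder)
        # codes of this level become the training data for the next level
        inputs = torch.relu(encoder(inputs)).detach()
        in_dim = out_dim
    return encoders  # the stack can later be fine-tuned with labeled data

# Example usage with random placeholder data:
# encoders = pretrain_stack(torch.randn(1000, 784), [256, 64])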
Deep architectures are mainly neural networks (recurrent, convolutional, deep belief) and can be summarized as the composition of three elements: (1) an input layer holding raw sensory inputs (e.g. words, or the red-green-blue values of the pixels in an image); (2) hidden layers, which learn more abstract, non-obvious representations/features; and (3) an output layer that predicts the target (LeCun et al., 2015).
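As a minimal illustration of this three-part composition, here is a feedforward sketch; the dimensions (784 raw pixel values, 10 target classes) are assumptions chosen only for the example.

import torch.nn as nn

model = nn.Sequential(
    # (1) input layer: raw sensory values, e.g. flattened pixel intensities
    nn.Linear(784, 256), nn.ReLU(),
    # (2) hidden layers: progressively more abstract representations
    nn.Linear(256, 128), nn.ReLU(),
    # (3) output layer: scores used to predict the target class
    nn.Linear(128, 10),
)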
Recently, deep learning approaches have achieved very high performance across many different NLP tasks. Such systems can often be trained with a single end-to-end model and do not require traditional, task-specific feature engineering. The most attractive quality of these techniques is that they can perform well without any external hand-designed resources or time-intensive feature engineering. Moreover, it has been shown that a unified architecture and learning algorithm can be applied to several common NLP tasks such as part-of-speech tagging, named entity recognition, and semantic role labeling (Collobert et al., 2011). Such an end-to-end system is capable of learning internal representations directly from unlabeled data, allowing researchers to move away from task-specific, hand-crafted features.
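In the spirit of that unified architecture, and only as a hedged sketch rather than Collobert et al.'s exact model, one shared encoder can feed several task-specific output heads; the vocabulary size, dimensions, and tag-set sizes below are illustrative assumptions.

import torch
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    """One shared word representation, separate heads for POS, NER, and SRL."""
    def __init__(self, vocab_size=50000, emb_dim=50, hidden=300, n_tags=None):
        super().__init__()
        n_tags = n_tags or {"pos": 45, "ner": 9, "srl": 67}  # assumed tag-set sizes
        self.embed = nn.Embedding(vocab_size, emb_dim)       # shared across all tasks
        self.encoder = nn.Conv1d(emb_dim, hidden, kernel_size=5, padding=2)
        self.heads = nn.ModuleDict({t: nn.Linear(hidden, n) for t, n in n_tags.items()})

    def forward(self, token_ids, task):
        x = self.embed(token_ids).transpose(1, 2)            # (batch, emb, seq)
        h = torch.relu(self.encoder(x)).transpose(1, 2)      # (batch, seq, hidden)
        return self.heads[task](h)                           # per-token tag scores

Because the embedding and encoder are shared, supervision from any one task improves the representation used by all of them, which is the point of the unified design.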
Similar insights are found in image classification and detection problems. In order to learn about an enormous number of objects from millions of images, we need a model with a large learning capacity. However, the immense complexity of the object recognition task means that the problem cannot be solved merely by building such a huge training data set; the model should also have substantial prior knowledge to compensate for all the data we don't have. In particular, a deep convolutional neural network can achieve reasonable performance on hard visual recognition and categorization tasks, matching or exceeding human performance in some domains.
A Convolutional Neural Network (CNN) is a powerful machine learning technique from the field of deep learning. CNNs are trained using large collections of diverse images. Their capacity can be controlled by varying their depth and breadth, and they also make strong and mostly correct assumptions about the nature of images (namely, stationarity of statistics and locality of pixel dependencies). Thus, compared to standard feedforward neural networks with similarly sized layers, CNNs have far fewer connections and parameters and so are easier to train, while their theoretically best performance is likely to be only slightly worse (Goodfellow et al., 2016). From these large collections, CNNs can learn rich feature representations for a wide range of images, and these representations often outperform hand-crafted features such as HOG, LBP, or SURF. An easy way to leverage the power of CNNs, without investing time and effort into training, is to use a pre-trained CNN as a feature extractor for a multiclass linear SVM. This approach to image category classification follows the standard practice of training an off-the-shelf classifier on features extracted from images. For example, the Image Category Classification Using Bag of Features example uses SURF features within a bag-of-features framework to train a multiclass SVM; the difference here is that instead of image features such as HOG or SURF, the features are extracted by a CNN. Despite the attractive qualities of CNNs, and despite the relative efficiency of their local architecture, they have still been prohibitively expensive to apply at large scale to high-resolution images, and training interestingly large CNNs usually demands grids of GPUs.
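A sketch of the feature-extractor recipe described above, assuming a recent torchvision with a pre-trained ResNet-18 standing in for the CNN and scikit-learn providing the linear SVM; the images and labels here are random placeholders the reader would replace with real, normalized data.

import torch
import torchvision.models as models
from sklearn.svm import LinearSVC

# Placeholder data; substitute real images (normalized 224x224 RGB) and labels.
train_images = torch.randn(32, 3, 224, 224)
train_labels = torch.randint(0, 5, (32,)).numpy()

# Freeze a pre-trained CNN and reuse its penultimate activations as features.
cnn = models.resnet18(weights="IMAGENET1K_V1")
cnn.fc = torch.nn.Identity()   # drop the 1000-way classifier, keep the features
cnn.eval()

with torch.no_grad():
    features = cnn(train_images).numpy()   # one 512-d deep feature vector per image

# Train an off-the-shelf multiclass linear SVM on the extracted features.
svm = LinearSVC()
svm.fit(features, train_labels)

No CNN weights are updated, so this avoids the GPU-heavy training cost noted above while still benefiting from features learned on a large image collection.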
Researchers have demonstrated steady progress in computer vision by validating their work against ImageNet (http://www.image-net.org/), an academic benchmark for computer vision. Successive models continue to show improvements, each time achieving a new state-of-the-art result. The ImageNet Large Scale Visual Recognition Challenge is a standard task in computer vision in which models try to classify entire images into 1000 classes, such as "Zebra", "Dalmatian", and "Dishwasher". In 2012, an ensemble of CNNs achieved the best results on the ImageNet classification benchmark (Krizhevsky et al., 2012). The authors of the winning method trained a large, deep convolutional neural network to classify