Image-based Object Classification of Defects in Steel using Data-driven Machine Learning Optimization

Fabian Bürger, Christoph Buck, Josef Pauli and Wolfram Luther

Lehrstuhl Intelligente Systeme, Universität Duisburg-Essen, Bismarckstraße 90, 47057 Duisburg, Germany

Lehrstuhl Computergrafik und Wissenschaftliches Rechnen, Universität Duisburg-Essen, Lotharstr. 63, 47057 Duisburg, Germany

Keywords:

Object Classiﬁcation, Algorithm Recommendation, Data-driven Hyper-parameter Optimization.

Abstract:

In this paper we study the optimization process of an object classification task for an image-based steel quality measurement system. The goal is to distinguish hollow from solid defects inside of steel samples by using texture and shape features of reconstructed 3D objects. In order to optimize the classification results we propose a holistic machine learning framework that should automatically answer the question “How well do state-of-the-art machine learning methods work for my classification problem?” The framework consists of three layers, namely feature subset selection, feature transform and classifier, which successively reduce the data dimensionality. A system configuration is defined by feature subset, feature transform function, classifier concept and corresponding parameters. In order to find the configuration with the highest classifier accuracy, the user only needs to provide a set of feature vectors and ground truth labels. The framework performs a fully data-driven optimization using a partly heuristic grid search. We incorporate several popular machine learning concepts, such as Principal Component Analysis (PCA), Support Vector Machines (SVM) with different kernels, random trees and neural networks. We show that with our framework even non-experts can automatically generate a ready-to-use classifier system with a significantly higher accuracy compared to a manually arranged system.

1 INTRODUCTION

High quality steel is a versatile material that has many

demanding applications such as pipeline tubes, con-

struction and automotive engineering. In order to im-

prove the cleanliness – the quality – of the steel, the

amount of non-metallic inclusions (NMIs) inside of

the product has to be reduced. These inclusions oc-

cur during the production process and usually contain

materials like oxides, sulﬁdes or nitrides.

At ﬁrst, it is necessary to measure the cleanliness

of steel samples to identify possible sources of con-

tamination in the production line. Currently avail-

able measurement techniques are microscope obser-

vation, sulfur prints, SEM-scans, x-ray and ultrasonic

sound analysis. Usually a high measurement preci-

sion in combination with information about the chem-

ical composition of the NMIs is very time-consuming

and therefore cost-intensive. In (Herwig et al., 2012)

an automated measurement system based on opti-

cal scanning of milled steel surfaces is described.

Thereby a milling machine slices thin layers of 10µm

of a raw steel sample and each layer is imaged with

a resolution of 2.5 – 20µm per pixel. Within these

surface images, sections of different kinds of objects,

namely NMIs, pores, cracks, shrink holes and milling

artifacts can be found. After segmentation of the de-

fects the binary 2D masks on consecutive layers are

connected to obtain a 3D voxel-based reconstruction

of the objects.

The distinction of the different object classes is

crucial since only solid NMIs are relevant for the

cleanliness analysis. Hollow objects like pores –

which occur during treatments of the molten steel

with gases (e.g. argon purging) – and cracks usually

disappear after metal forming of the raw steel prod-

uct. Shape features of the 3D objects in combination

with machine learning can be used to classify objects

into globular objects, cracks and artifacts with an accuracy of 98 – 99% (Bürger et al., 2013). At first,

it was believed that all detected globular defects are

solid NMIs but microscopy analyses revealed that a

signiﬁcant number of these objects are hollow (Buck

et al., 2013). The main challenge is that solid NMIs


Bürger F., Buck C., Pauli J. and Luther W..

Image-based Object Classiﬁcation of Defects in Steel using Data-driven Machine Learning Optimization.

DOI: 10.5220/0004737101430152

In Proceedings of the 9th International Conference on Computer Vision Theory and Applications (VISAPP-2014), pages 143-152

ISBN: 978-989-758-004-8

 2014 SCITEPRESS (Science and Technology Publications, Lda.)


Figure 1: Texture and 3D voxels of 4 exemplary pores.

Figure 2: Texture and 3D voxels of 4 exemplary solid NMIs.

and hollow pores have a very similar appearance in

the images of the proposed measurement system as

the image resolution is at least an order of magnitude lower than

the microscopy resolution (≤ 1µm per pixel) used for

manual veriﬁcation.

The task of designing a suitable classification system is difficult as it is not even known whether the desired accuracy can be achieved with the given sensor

data. First, it is not clear which texture or shape fea-

tures make a distinction possible. Secondly, a suitable

classiﬁer concept with adequate parameters has to be

chosen to achieve good recognition rates and gen-

eralization abilities. Additionally, dimension reduc-

tion techniques such as Principal Component Analy-

sis may increase the performance of the system. Usu-

ally, a time-consuming development process has to be

done in order to ﬁnd a good classiﬁer system. This pa-

per targets the problem to automatically obtain such a

classiﬁer system using a holistic machine learning op-

timization framework which realizes a reliable image-

based distinction between NMIs and pores.

2 PROBLEM DESCRIPTION

After the segmentation process a list of $n_{Obj}$ objects is obtained and each object $o_i$ with $1 \le i \le n_{Obj}$ is described with a binary voxel set $V_i(x, y, z) \in \{0, 1\}$ and the 3D gray-value texture array $T_i(x, y, z) \in [0, 255]$. The object size is $D_x \times D_y \times D_z$ with $1 \le x \le D_x$, $1 \le y \le D_y$ and $1 \le z \le D_z$. Example objects of pores

can be found in ﬁgure 1 and objects showing NMIs

are depicted in ﬁgure 2.

The variation of the object’s appearance in the

class NMIs is especially huge because the compo-

sition of the materials and their size are unknown.

While the differences in the 3D shape are hardly no-

ticeable, the most signiﬁcant difference is the “tail”

of the solid objects. It can be seen in the texture

below the actual object (see arrows in ﬁgure 2). It

originates from the milling cutter crushing the solid

material along the milling grooves. But this tail is

not clearly appearing at every NMI – hence even steel

experts cannot perfectly distinguish pores from solid

NMIs when only seeing their texture and shape at the

system’s relatively low resolution.

2.1 Machine Learning Challenges

The following steps are usually performed in a ma-

chine learning approach to distinguish objects (Jain

et al., 2000). First, a set of objects is labeled by experts as ground truth. Then a set of meaningful

features is derived that describes especially the dif-

ferences between the classes that should be separated.

Finally, a classiﬁer is trained with the feature vectors

and the ground truth labels.

Our classification task bears typical problems occurring in many machine learning applications. Due

to the high cost for labeling ground truth data (in our

case manual microscopic inspection) only few labels

are available (72 pores and 52 solid inclusions). As it

is not clear which feature set works best on the de-

scribed problems, we use a set of problem-speciﬁc

features combined with well-known standard descrip-

tors which are presented in section 4.

The combination of many features is usually high-

dimensional which leads to the curse of dimension-

ality. As only few training samples are available,

high dimensional feature spaces are sampled only

very sparsely and are therefore almost empty (Jain

et al., 2000). Canonical distance measures, like the

Euclidean distance, become less meaningful in high

dimensional spaces (Beyer et al., 1999) which can

lead to a degradation of classiﬁer performance. An-

other effect is the negative inﬂuence of irrelevant

and noisy features on the classiﬁcation performance

which is known as the peaking phenomenon. To over-

come these dimensionality related problems, dimen-

sion reduction techniques have been proposed. The

most common approaches are feature subset selection

or feature transforms, such as Principal Component

Analysis (PCA) (Jain et al., 2000). The feature sub-

set selection approach removes irrelevant dimensions

while a feature transform generates a compressed data

representation based on the geometric properties of

the original data.

The choice of a classiﬁer concept itself determines

the adaptation and generalization abilities (also called

VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications

144

bias and variance) of the system. Many classiﬁers ex-

ist, such as naive Bayes classiﬁers, k-Nearest Neigh-

bor classiﬁers (kNN), Multilayer Perceptrons (MLP),

Support Vector Machines (SVM) or random trees.

But there is not a single machine learning concept that

performs best on all problems – this is referred to as

the no-free-lunch theorem (Wolpert, 1996). Addition-

ally, most machine learning concepts have various pa-

rameters such as kernel widths or network sizes (see

section 3.3) which have a great influence on the classifier performance. In order to produce optimal results

these parameters have to be tuned for each new clas-

siﬁcation problem.

2.2 Related Work

In real world machine learning tasks it is usually

the case that any combination of the aforementioned

problems can occur. It is time-consuming manual

work to evaluate all possible combinations of solu-

tions to achieve the optimal classiﬁer performance.

This is a general hindrance to use classiﬁer systems

in practice as each new data set, a novel feature or a

different sensor may make a revision of the whole sys-

tem necessary. These observations motivate a holis-

tic view on machine learning with an automatic opti-

mization process of the components.

In recent literature there are two categories of ap-

proaches in the context of classiﬁer framework opti-

mization, namely search-based and metalearning al-

gorithms. In search algorithms, a classiﬁer system

is optimized by trying and evaluating a set of the

system’s hyper-parameters. Usually, the ﬁnal clas-

siﬁer accuracy based on the training data is used as

the objective function. Different search strategies and framework components have been incorporated within this

optimization process. In (Bergstra and Bengio, 2012)

grid search and random search methods for classi-

ﬁer parameter optimization are compared. Further-

more, an additional feature selection component has

been incorporated using genetic algorithms (Huang

and Wang, 2006), particle swarms (Lin et al., 2008b)

and simulated annealing (Lin et al., 2008a). In (So-

morjai et al., 2004) a larger framework for biomed-

ical spectra classiﬁcation is proposed that incorpo-

rates data visualization, preprocessing, feature extrac-

tion and selection, classiﬁer development and aggre-

gation. The authors use several strategies to optimize

the hyper-parameters and conﬁgurations. However,

their approach is focused especially on the develop-

ment of spectral features and interpretability in the

diagnostics ﬁeld.

Metalearning is an emerging ﬁeld in the machine

learning community and deals with the questions

Figure 3: Overview of the layers of the proposed machine learning framework (feature selection layer, feature transform layer, classifier layer). The dimensionality is successively reduced by the feature selection and feature transform layers until the classifier returns the final one-dimensional class label y.

what knowledge can be extracted from specific learning tasks. An all-encompassing survey of this field

can be found in (Lemke et al., 2013). One core as-

pect of metalearning is algorithm recommendation

based on previous learning tasks. The idea is to es-

tablish a connection between the feature data char-

acteristics and algorithm performance without trying

all possible algorithms. In (Reif et al., 2012) a re-

cent review of automatic classiﬁer selection systems

based on metalearning can be found. They focus on

the simplicity for non-experts to automatically choose

a proper classiﬁcation pipeline with optimized fea-

ture selection, classiﬁer and parameters. These ap-

proaches use meta-features of the base features to pre-

dict the performance of classiﬁers on the dataset. Pop-

ular meta-features contain statistical properties of the

base feature vectors such as the number of samples

and classes as well as entropy based distribution met-

rics. Furthermore, so-called landmarking features are

applied as well that use the performance of very sim-

ple classiﬁers as a descriptor. The approach of met-

alearning is certainly faster than exhaustive search for

the best algorithm, but it relies on the ability of meta-

features to describe the “kind” of data distribution.

Finally, the performance of the meta-classiﬁer may

also suffer from the same aforementioned challenges

which can lead to a bad performance of the suggested

algorithm on the base classiﬁcation task.

3 DATA-DRIVEN MACHINE

LEARNING FRAMEWORK

3.1 System Concept

In order to overcome the problems mentioned in the previous sections, we propose a data-driven and holistic machine learning framework that covers all necessary fields in machine learning, namely feature subset

selection, feature transform, classiﬁer concept, classi-

ﬁer parameters and evaluation with cross-validation.

The coarse system structure is depicted in ﬁgure 3.

The main principle of the framework is successive di-

mension reduction which is realized in three layers,

Image-basedObjectClassificationofDefectsinSteelusingData-drivenMachineLearningOptimization

145

Table 1: Framework components on the different layers.

Layer | Components / algorithms
Feature Selection | dimension reduction using feature subset $F_{sub} \subseteq F_{all}$
Feature Transform | {no PCA transform, PCA with $n_{PCA}$ dimensions}
Classifier | {naive Bayes, random tree, k-nearest neighbors (kNN), multilayer perceptron (MLP), Support Vector Machine (SVM) with linear and Gaussian kernel}

namely feature selection, feature transform and ﬁnally

the classiﬁer layer. Each layer contains one or more

components which are listed in table 1. A system con-

ﬁguration is deﬁned by the selection of one compo-

nent on each layer as well as the components’ speciﬁc

parameters. First we describe the forward direction

of the framework when the system conﬁguration is

known and a new instance should be classiﬁed. The

optimization of the framework conﬁguration can be

seen as training or learning process and is described

in section 3.4.

3.2 Dimension Reduction

The input for the framework is a feature group which is a set $F_{all} = \{F_1, F_2, \ldots, F_N\}$ containing $N$ feature channels. A feature channel $F_i$ is a vector $x_i \in \mathbb{R}^{1 \times n_i}$ with $n_i \ge 1$ dimensions. This definition allows handling single feature values as well as higher dimensional descriptors which occur e.g. in texture features (see section 4). The concatenated feature vector of a feature group $F_{(\cdot)}$ is defined as

$$x_{(\cdot)} = [x_1, x_2, \ldots, x_{|F_{(\cdot)}|}] \in \mathbb{R}^{1 \times \hat{n}}, \quad \hat{n} = \sum_{F_i \in F_{(\cdot)}} n_i. \tag{1}$$

Note that these concatenated feature vectors could di-

rectly be used as an input for classiﬁers. Feature scal-

ing should be applied to all feature channels as it

improves the classiﬁer performance (Juszczak et al.,

2002). We scale the value domains to a range of [0, 1].

In the ﬁrst layer the feature subset selection removes

irrelevant features that could disturb the further process. The resulting feature subset $F_{sub} \subseteq F_{all}$ is used to form a concatenated vector $x_{sub}$ that is passed to the feature transform layer.
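As a minimal sketch of the forward path through the first layer, the following Python snippet scales each feature channel to [0, 1] and concatenates the channels of a chosen subset into one vector per object. The channel names, the per-dimension min-max scaling and the helper names are illustrative assumptions and not the authors' implementation.

```python
def minmax_scale(rows):
    """Scale each column of a list of equally long rows to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    # A constant column has span 0; use 1.0 instead to avoid division by zero.
    spans = [(max(c) - lo) or 1.0 for c, lo in zip(cols, lows)]
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]


def concat_subset(channels, subset):
    """Concatenate the scaled channels named in `subset` into one vector per object.

    channels: dict mapping a channel name to a list of per-object vectors.
    subset:   ordered, non-empty list of channel names (the feature subset).
    """
    scaled = {name: minmax_scale(channels[name]) for name in subset}
    n_objects = len(scaled[subset[0]])
    return [
        [v for name in subset for v in scaled[name][i]]
        for i in range(n_objects)
    ]


# Hypothetical feature group: a 1-D channel and a 2-D channel for 3 objects.
channels = {
    "tail": [[0.5], [1.0], [0.75]],
    "gray_stats": [[10.0, 2.0], [30.0, 4.0], [20.0, 6.0]],
}
x_sub = concat_subset(channels, ["tail", "gray_stats"])
```

Selecting whole channels rather than single dimensions keeps multi-dimensional descriptors such as histograms intact.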

Feature transforms are a powerful tool to reduce the dimensionality by estimating low dimensional manifolds inside of the high dimensional feature space. The aim is that simple classifiers (e.g. with

linear kernels) may perform better on the transformed

data. An overview of several common linear and non-

linear techniques can be found in (Van der Maaten

et al., 2009). The feature transform layer contains

the Principal Component Analysis (PCA) algorithm.

This is a linear transform that uses a new coordinate

system whose base vectors are ordered by the high-

est statistical variance of the original data. The PCA

can only handle linear correlations of the feature di-

mensions but it has been shown that it works well in

many cases and allows a simple out-of-sample exten-

sion that transforms new feature vectors into the new

coordinate space. If the PCA component is chosen by the framework training, the new feature vector $x_{trans}$ is formed by using the first $n_{PCA}$ dimensions of the PCA transformed vector of $x_{sub}$. Otherwise, when the PCA is deactivated, the vector $x_{sub}$ is passed directly to the classifier and $x_{trans} = x_{sub}$.
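The PCA step can be written compactly in standard PCA notation. This is only a sketch: the symbols $\mu$ (mean of the training vectors) and $W$ (matrix whose columns are the eigenvectors of the training covariance matrix, sorted by decreasing eigenvalue), as well as the subscripts "sub" and "trans" for the vector before and after the transform, are assumptions introduced here for illustration.

```latex
x_{trans} = \left[ (x_{sub} - \mu)\, W \right]_{1:n_{PCA}}
```

Because $\mu$ and $W$ are estimated once from the training data, the same expression serves as the out-of-sample extension for new feature vectors.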

3.3 Classiﬁer Layer

Finally, the classifier layer contains a set of 5 popular classifier concepts which take the feature vector $x_{trans}$ and return a class label y for the given instance. A

review of recent classiﬁers in machine learning can be

found in (Jain et al., 2000). Table 1 lists the classiﬁer

concepts that are available in the classiﬁer layer of our

framework.

The naive Bayes classiﬁer is based on the Bayes’

theorem in combination with a Gaussian normal dis-

tribution that assigns the most probable class to an

instance. This classiﬁer has no parameters and works

well for simple classiﬁcation tasks.

The random tree classiﬁer consists of a tree struc-

ture of decision nodes that check a threshold of single

variables inside of the feature data.

The k-nearest neighbor classiﬁer stores all feature

vectors in the training phase. To classify a new in-

stance it performs a search to ﬁnd the k closest in-

stances according to a distance metric which can be

e.g. Euclidean or Mahalanobis distance. The parame-

ter k varies the smoothness of the decision boundary.

The multilayer perceptron (MLP) is a feedforward neural network that consists of several layers, each of which contains a number of neurons. The layers are connected with weights that are trained using backpropagation so that the network output – the class label – matches the ground truth label of the input feature vector. The network

size is determined by the number of hidden layers and

the number of neurons per layer.

The Support Vector Machine (SVM) classiﬁer es-

timates a maximum margin decision boundary in the

feature vector space to minimize the risk of misclassi-

ﬁcation. It is a powerful concept that can also handle

non-linear classiﬁcation using kernel functions such

as the Gaussian kernel. We use the ν-SVM (Chang

and Lin, 2011) which has a parameter ν that penal-

izes misclassiﬁed instances. Furthermore, we use the

VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications

146

linear kernel with no additional parameters as well as

the Gaussian kernel with the kernel width γ that con-

trols the smoothness of the decision boundary. Note

that only one classiﬁer is used to classify instances in

a trained framework. Compared to previous frame-

works we investigate the concept of the SVM with

different kernels more deeply.
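For reference, the Gaussian kernel in the LIBSVM formulation cited above is $k(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$. The small Python function below evaluates it; the function name is illustrative.

```python
import math


def gaussian_kernel(x, y, gamma):
    """Gaussian (RBF) kernel exp(-gamma * ||x - y||^2) for two equally long vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

A small γ makes the kernel value decay slowly with distance, which yields a smoother decision boundary, while a large γ makes each training sample influence only its close neighborhood.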

3.4 Training of the Framework

Conﬁguration

A framework conﬁguration consists of the choice of

active components on each layer (see table 1) as well

as the choice of the corresponding parameters. The

forward direction to classify a new instance is fast

as only one, namely the trained conﬁguration has to

be evaluated. Clearly, it is a challenge to ﬁnd the

best framework conﬁguration for a given classiﬁca-

tion problem as the number of all possible combina-

tions is exponential and there is no straightforward

way to “guess” the best conﬁguration. Furthermore,

an automatic optimization or training process is de-

sired that does not require expert knowledge. The

goal is that experts may focus on the development of

better features or sensor techniques rather than time-

consuming classiﬁer optimization.

In our framework training process the only thing

the user has to provide is a number of annotated

ground truth instances with their feature sets. As ma-

chine learning in general is an ill-posed problem, one

can describe our training approach as totally data-

driven regularization of all system hyper-parameters.

In order to ﬁnd a solution in reasonable time, we pro-

pose a partly-heuristic grid search approach with the

classiﬁer performance as objective function. To de-

ﬁne an efﬁcient optimization strategy we use the black

box principle and introduce a machine learning black

box and a classiﬁer black box which are depicted in

ﬁgure 4. We separate these to use a different search

strategy for each black box.

3.5 Classiﬁer Black Box

The classiﬁer black box (see ﬁgure 4) is the core

component that optimizes the feature transform, the

classiﬁer and its parameters and evaluates the per-

formance of the current system conﬁguration. First,

the classifier black box investigates the effect of the PCA and performs a grid search over its parameter $n_{PCA} \in \{1, 2, 3, 4, 5, 10\}$. In order to save computation time, we choose this list of values as in most feature sets the intrinsic dimensionality is rather low and the first few PCA dimensions are usually sufficient.

As an additional conﬁguration, the PCA transform is

Figure 4: Black box structure for the training phase of the framework (input: features and ground truth labels; machine learning black box: feature subset selection; classifier black box: feature transform, classifier and system performance evaluation; output: best system configuration). The loop arrows on the right indicate the investigated method and parameter variations.

skipped and the feature vector $x_{sub}$ is directly passed

to the classiﬁer. The classiﬁer is the most important

part of the classiﬁer black box. In every round of

the feature transform grid search we perform a grid

search for each classifier model and its corresponding parameters that are listed in table 2. Grid search is still a popular method in hyper-parameter optimization for classifiers with up to two parameters, though

random search methods work better for higher dimen-

sional search spaces (Bergstra and Bengio, 2012).
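The nested search of the classifier black box can be sketched as follows. This is a simplified illustration under stated assumptions: the `evaluate` callback stands for the cross-validated quality metric, and all names and the toy scoring function are hypothetical.

```python
from itertools import product


def grid_search(classifier_grids, pca_options, evaluate):
    """Return (best_score, best_config) over all transform/classifier/parameter combinations.

    classifier_grids: dict mapping a classifier name to a dict {parameter: list of values}.
    pca_options:      list of PCA settings, with None meaning "no PCA transform".
    evaluate:         callable(pca, classifier, params) -> quality metric (higher is better).
    """
    best_score, best_config = float("-inf"), None
    for pca in pca_options:
        for name, grid in classifier_grids.items():
            keys = list(grid)
            # Cartesian product of all sampled parameter values of this classifier;
            # a classifier without parameters yields exactly one empty combination.
            for values in product(*(grid[k] for k in keys)):
                params = dict(zip(keys, values))
                score = evaluate(pca, name, params)
                if score > best_score:
                    best_score, best_config = score, (pca, name, params)
    return best_score, best_config


# Toy usage with a made-up scoring function instead of real cross-validation.
grids = {"naive_bayes": {}, "knn": {"k": [1, 3, 5]}}
toy_evaluate = lambda pca, name, params: params.get("k", 0) * 10 + (pca or 0)
best_score, best_config = grid_search(grids, [None, 1, 2], toy_evaluate)
```

Since every call to `evaluate` is independent of the others, the loop body is a natural unit for parallel execution.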

3.6 System Quality Metric

The system performance evaluation is done for each

feature transform, classiﬁer and parameter combi-

nation using k-fold cross-validation. This method

achieves a good estimation of the classiﬁers’ bias and

variance at still reasonable computation costs (Jain

et al., 2000). We choose k ≤ 10 such that at least 15

samples remain in the evaluation set. The minimum

size constraint for the evaluation set is necessary for

small sample sizes to avoid undeﬁned values for ac-

curacy, precision and recall. There are several ways to

deﬁne a quality metric for the whole classiﬁer system

that serves as objective function for the optimization

process. It is possible to use the overall F-measure

(harmonic mean of precision and recall) or e.g. the accuracy of a specific class. In many frameworks in literature (see section 2.2) the overall accuracy value is

used. However, we use the average value of accuracy,

precision and recall of all cross-validation rounds to

reward a solution with a high value in all of the three

base metrics. Precision and recall are also very im-

portant quality metrics in many classiﬁcation tasks.
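A minimal sketch of this quality metric is given below. It assumes a two-class problem with one class treated as positive, and it reads the constraint on k as "each evaluation fold holds at least 15 samples"; both are interpretations, and the function names are hypothetical.

```python
def fold_metrics(tp, fp, tn, fn):
    """Accuracy, precision and recall of one fold from its confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)  # raises ZeroDivisionError if nothing was predicted positive
    recall = tp / (tp + fn)     # raises ZeroDivisionError if the fold has no positive sample
    return accuracy, precision, recall


def quality_metric(folds):
    """Average of accuracy, precision and recall over all cross-validation folds."""
    per_fold = [sum(fold_metrics(*counts)) / 3 for counts in folds]
    return sum(per_fold) / len(per_fold)


def choose_k(n_samples, k_max=10, min_eval=15):
    """Largest k <= k_max such that a fold of about n_samples / k has >= min_eval samples."""
    return max(2, min(k_max, n_samples // min_eval))
```

With the 124 labeled objects mentioned in section 2.1, this reading gives `choose_k(124) == 8`. The division errors noted in the comments correspond to the undefined precision and recall values that the minimum fold size is meant to avoid.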

Image-basedObjectClassificationofDefectsinSteelusingData-drivenMachineLearningOptimization

147

Table 2: Classifier concepts and parameter ranges.

Classifier | Parameter ranges (steps)
Naive Bayes | -
Random Tree (Matlab Statistics Toolbox) | - (self-optimizing with pruning)
k-nearest neighbors | k ∈ [1, 15] (15), distance metric ∈ {Euclidean, Mahalanobis}
MLP with backpropagation | number of hidden layers ∈ [0, 2] (3), number of neurons per layer ∈ [2, 20] (6)
ν-SVM with linear kernel | ν ∈ [0.1, 0.9] (9)
ν-SVM with Gaussian kernel | ν ∈ [0.1, 0.9] (9), γ ∈ [ 1000 ] (5)

3.7 Feature Subset Selection

The classiﬁer black box is used to perform feature

subset selection within the machine learning black

box. The classiﬁer black box is already performing

feature transforms to reduce the dimensionality, but

feature selection is an alternative, powerful approach

to optimize the classiﬁcation performance (Kohavi

and John, 1997). In our experiments in section 5

we show that the performance can be signiﬁcantly in-

creased using additional feature selection. We choose

wrapper approaches (rather than filters) for feature selection because the final classifier performance is used

directly. As an exhaustive feature subset selection has

exponential complexity with the number of features,

heuristics help to efﬁciently decrease the computa-

tional costs. We use simple Sequential Forward and

Backward Selection (SFS and SBS) as well as their

more sophisticated ﬂoating extensions to ﬁnd a good

feature subset $F_{sub}$. SFS starts with an empty subset and consecutively adds the feature which achieves

the best quality metric when combined with the cur-

rent subset. SBS starts with the full set and sequen-

tially removes features. Sequential Floating Forward

and Backward Selection (SFFS and SFBS) use these

two simple strategies to obtain backtracking capabili-

ties and achieve near-optimal solutions at reasonably

higher costs (Jain et al., 2000). As each feature channel of the feature group $F_{all} = \{F_1, F_2, \ldots, F_N\}$ is not

limited to one dimension (see section 4), the feature

subset selection only selects whole channels to save

computational time.
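The plain SFS strategy described above can be sketched in a few lines of Python. The `score` callback stands for a call to the classifier black box, and the channel names and gains in the usage example are made up for illustration. The floating variants would add a conditional backward step after each inclusion, which is not shown here.

```python
def sequential_forward_selection(channels, score):
    """Greedy SFS: repeatedly add the channel that maximizes score(subset).

    channels: list of channel names.
    score:    callable(list of names) -> quality metric (higher is better).
    Returns the best subset seen along the path and its score.
    """
    selected, remaining = [], list(channels)
    best_subset, best_score = [], float("-inf")
    while remaining:
        # Evaluate every one-channel extension of the current subset.
        cand_score, cand = max(
            ((score(selected + [c]), c) for c in remaining),
            key=lambda pair: pair[0],
        )
        selected.append(cand)
        remaining.remove(cand)
        if cand_score > best_score:
            best_score, best_subset = cand_score, list(selected)
    return best_subset, best_score


# Toy usage: each hypothetical channel contributes a fixed gain to the score.
gain = {"tail": 5, "lbp": 3, "noise": -2}
subset, value = sequential_forward_selection(list(gain), lambda s: sum(gain[c] for c in s))
```

In this toy run the "noise" channel is added last and lowers the score, so the returned subset excludes it.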

3.8 Training Complexity

We start with the complexity of the classifier black box optimization. Let $C_i$ be the $i$th classifier of the classifier set $C$. The complexities of the classifier training and evaluation are denoted as $f_{Train}(\cdot)$ and $f_{Eval}(\cdot)$, respectively. Of course, they also depend on the training set and the current feature subset $F_{sub}$. Using k-fold cross-validation, the complexity for a single classifier evaluation and one combination of parameters $\vec{P}$ becomes

$$f_{Param}(C_i, \vec{P}) = k \cdot \left( f_{Train}(C_i, \vec{P}) + f_{Eval}(C_i, \vec{P}) \right). \tag{2}$$

The classifier $C_i$ has the set of $P_i$ parameters and each parameter has a sampling set $P_{i,j}$ of values. Testing all parameter combinations, the Cartesian product has to be considered and the complexity for all classifiers and all of their parameters is

$$f_{Classifiers}(C) = \sum_{i=1}^{|C|} \prod_{j=1}^{|P_i|} |P_{i,j}| \cdot f_{Param}(C_i, \vec{P}). \tag{3}$$

Additionally, the tier of feature transforms with $n_{FT}$ combinations leads to a total complexity of the classifier black box of

$$f_{ClassifierBlackbox}(C) = n_{FT} \cdot f_{Classifiers}(C). \tag{4}$$

Using the classiﬁers and parameters of table 2 as well

as the preprocessing steps, a total amount of 728 com-

binations is tested. As these steps are independent of

each other, they can easily be parallelized.
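The count of 728 combinations can be checked from the step counts in table 2 and the seven feature transform options (six PCA settings plus "no PCA"). The short Python calculation below assumes that the MLP grid is the full Cartesian product of its two parameters; the variable names are illustrative.

```python
from math import prod

# Number of sampled values per parameter, read off table 2.
# A classifier without tunable parameters has an empty list (one configuration).
grid_sizes = {
    "naive Bayes": [],
    "random tree": [],
    "kNN": [15, 2],              # k, distance metric
    "MLP": [3, 6],               # hidden layers, neurons per layer
    "nu-SVM linear": [9],        # nu
    "nu-SVM Gaussian": [9, 5],   # nu, gamma
}

# Feature transform options: n_PCA in {1, 2, 3, 4, 5, 10} plus "no PCA transform".
n_transforms = 6 + 1

per_classifier = {name: prod(sizes) for name, sizes in grid_sizes.items()}
n_classifier_configs = sum(per_classifier.values())  # 1 + 1 + 30 + 18 + 9 + 45 = 104
total = n_transforms * n_classifier_configs          # 7 * 104 = 728
```

Each of these combinations additionally costs k training and evaluation runs because of the cross-validation.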

Finally, the feature subset selection step of $n$ features is performed. Testing all combinations would lead to a total complexity of $O(2^n)$ and is only feasible for small feature sets. The simple selection strategies SFS and SBS have a determined number of iterations of $\frac{1}{2}(n^2 + n)$. Using floating methods the

number of iterations depends on the actual quality re-

sults but is higher (though still polynomial) due to the

combination of SFS and SBS.
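To give a feeling for the difference, the following snippet compares the number of subset evaluations of an exhaustive search with that of a full SFS or SBS pass. It assumes that each of the 17 features of section 4 is selected as one unit; the function names are illustrative.

```python
def exhaustive_evaluations(n):
    """Number of non-empty feature subsets of n features."""
    return 2 ** n - 1


def sequential_evaluations(n):
    """Subset evaluations of a full SFS or SBS pass: n + (n - 1) + ... + 1."""
    return n * (n + 1) // 2


n_features = 17
exhaustive = exhaustive_evaluations(n_features)   # 131071
sequential = sequential_evaluations(n_features)   # 153
```

Every one of these evaluations is a complete run of the classifier black box, so the gap translates directly into training time.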

4 OBJECT FEATURES

In order to distinguish pores from NMI objects, we

derive 17 texture and shape features while the idea

is to combine problem-speciﬁc features with stan-

dard state-of-the-art descriptors. The feature subset

selection chooses the most promising feature subset.

Therefore, the classiﬁcation results most likely bene-

ﬁt from a large feature pool, though the computation

time for the feature selection increases.

4.1 Texture Features

The tail of solid objects (described in section 2) is a

promising feature that we describe in the following

way. First, the main direction of the milling pattern is

estimated using the 2D Fourier transform. The spec-

trum shows a wedge shaped area of high magnitudes

VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications

148

Figure 5: Areas for texture analysis of the tail indicator for (a) a pore and (b) an NMI with tail. The arrows indicate the main pattern direction. For many NMIs (b) the texture in $R_2$ is darker than in $R_1$.

which can be used to robustly estimate the main angle

of the pattern even in presence of noise (Herwig et al.,

2012). As depicted in figure 5, two rectangular areas are defined containing the texture before ($R_1$) and after ($R_2$) the object in direction of the milling pattern. The tail indicator is calculated by dividing the average gray value inside of area $R_2$ by the one of area $R_1$. This value is very close to 1 for pores and significantly lower (between 0.5 and 0.8) for NMIs with a

visible tail.
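The tail indicator can be sketched as a ratio of two mean gray values. The snippet assumes that the image has already been rotated so that the milling direction is axis-aligned, that the regions are given as rectangular index ranges, and that the region after the object forms the numerator (a darker tail then gives a value below 1). These are assumptions for illustration, as are the function names.

```python
def mean_gray(image, region):
    """Average gray value of `image` (list of rows) inside region.

    region = (row_start, row_end, col_start, col_end), end indices exclusive.
    """
    r0, r1, c0, c1 = region
    values = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    return sum(values) / len(values)


def tail_indicator(image, region_before, region_after):
    """Ratio of the mean gray value after the object to the one before it."""
    return mean_gray(image, region_after) / mean_gray(image, region_before)


# Toy texture: bright area before the object (left), darker tail after it (right).
image = [
    [200, 200, 0, 100, 100],
    [200, 200, 0, 100, 100],
]
indicator = tail_indicator(image, (0, 2, 0, 2), (0, 2, 3, 5))
```

For the toy texture the indicator is 0.5, which would fall into the range reported for NMIs with a visible tail.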

Additionally, we use standard texture features,

namely the mean and standard deviation of the gray

values inside of the object. Local binary patterns

(LBP) are a popular state-of-the-art description of tex-

tures. Generally, LBP descriptors analyze the bright-

ness differences of a pixel and its direct neighbor-

hood. The pattern is encoded to a binary string and

aggregated to a descriptor histogram counting the

frequencies of the pattern conﬁgurations. Several

variants have been introduced (Doshi and Schaefer,

2012). We use the rotation invariant, uniform LBP

with 8 neighbors and a radius of 1 which yields a rel-

atively low descriptor dimension of 10 bins. We com-

pute a histogram for each layer of the object and use

the average histogram as a feature.
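The 10 bins of the rotation invariant, uniform LBP follow from its definition: uniform patterns (at most two 0/1 transitions around the ring) are coded by their number of set bits (0 to 8), and all non-uniform patterns share one extra bin. The sketch below is a simplification that uses the 8 direct neighbors of a 3×3 patch without interpolating the diagonal neighbors onto a circle; the function names are illustrative.

```python
def lbp_riu2(patch):
    """Rotation invariant uniform LBP code (8 neighbors) of a 3x3 patch.

    Returns 0..8 (number of neighbors >= center) for uniform patterns
    and 9 for all non-uniform patterns, which gives 10 histogram bins.
    """
    center = patch[1][1]
    # Neighbors in circular order around the center pixel.
    ring = [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
            patch[2][2], patch[2][1], patch[2][0], patch[1][0]]
    bits = [1 if v >= center else 0 for v in ring]
    transitions = sum(bits[i] != bits[(i + 1) % 8] for i in range(8))
    return sum(bits) if transitions <= 2 else 9


def lbp_histogram(image):
    """Normalized 10-bin LBP histogram over all interior pixels of a 2D image."""
    hist = [0] * 10
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            patch = [row[c - 1:c + 2] for row in image[r - 1:r + 2]]
            hist[lbp_riu2(patch)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

Averaging such per-layer histograms over all layers of an object yields a single 10-dimensional feature channel.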

4.2 Shape Features

As NMI objects have different material properties than pores, their shape appears less round and more “jagged” (see figures 1 and 2). To find a proper description we use several features derived from the binary voxel array V(x, y, z). The Minkowski functionals are a set of motion-invariant, additive and continuous functionals which form a complete system on the set of objects that are unions of a finite number of convex bodies. A single voxel can be considered a convex body, so this definition can be applied to a connected voxel set like our 3D objects because it is a union of convex bodies. For the three-dimensional case the Minkowski functionals are the volume V, the surface area S, the mean curvature M (which is proportional to the mean breadth for convex particles) and the Euler number χ. Roughly speaking, the Euler number is the number of connected components minus the number of tunnels, plus the number of cavities. For the calculation of these features the reader is referred to (Buck et al., 2013). We define the density of a Minkowski functional as the ratio of that functional to the total number of voxels of the minimal bounding box of the defect. This normalizes the functionals. Furthermore, the densities can give interesting insights into the structure of the defects. For example, of two defects occupying the same volume, the one with the rougher surface, i.e. the one that consists of more voxel configurations with diagonal neighbors, has the higher surface area density, which approaches infinity for a fractal structure. This might be a useful feature for automatic classification. Based on the Minkowski functionals we can calculate the isoperimetric shape factors given by $f_1 = 6\sqrt{\pi}\, V / \sqrt{S^3}$, $f_2 = 48\pi^2\, V / M^3$ and $f_3 = 4\pi\, S / M^2$, which are measures for the sphericity of 3D objects. The values of $f_1$, $f_2$ and $f_3$ are close to 1 for spheres.
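As a plausibility check of the sphericity measures, the following sketch evaluates the three shape factors for an ideal sphere and an ideal cube. It assumes the standard isoperimetric definitions f1 = 6√π·V/√(S³), f2 = 48π²·V/M³ and f3 = 4π·S/M², with M taken as the integral of mean curvature; the closed-form values of V, S and M for the two bodies are textbook geometry, not measurements from the paper, and the function names are hypothetical.

```python
import math


def shape_factors(volume, surface, mean_curvature):
    """Isoperimetric shape factors (assumed standard definitions, see lead-in)."""
    f1 = 6.0 * math.sqrt(math.pi) * volume / math.sqrt(surface ** 3)
    f2 = 48.0 * math.pi ** 2 * volume / mean_curvature ** 3
    f3 = 4.0 * math.pi * surface / mean_curvature ** 2
    return f1, f2, f3


def sphere_functionals(r):
    # Volume, surface area and integral of mean curvature of a ball of radius r.
    return 4.0 / 3.0 * math.pi * r ** 3, 4.0 * math.pi * r ** 2, 4.0 * math.pi * r


def cube_functionals(a):
    # Volume, surface area and integral of mean curvature of a cube with edge a.
    return a ** 3, 6.0 * a ** 2, 3.0 * math.pi * a


sphere = shape_factors(*sphere_functionals(2.5))
cube = shape_factors(*cube_functionals(2.5))
print("sphere:", sphere)
print("cube:  ", cube)
```

All three factors are scale-invariant, so the radius and edge length chosen here do not matter; less round bodies such as the cube yield values below 1.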

Furthermore, we derive the box dimension of the

voxel set for each object. The box dimension is an

empirical estimation of the upper bound of the Haus-

dorff dimension (Falconer, 2003). The Hausdorff di-

mension can be used to describe the intrinsic dimen-

sion or “porosity” of fractal objects.
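A box-counting estimate of this kind can be sketched as follows. The sketch assumes the object is given as a collection of integer voxel coordinates and estimates the dimension as the least-squares slope of log N(s) over log(1/s), where N(s) is the number of occupied boxes of edge length s. The box sizes and function names are illustrative assumptions, not the settings used in the paper.

```python
import math


def box_count(voxels, size):
    """Number of grid boxes with edge length `size` that contain at least one voxel."""
    return len({(x // size, y // size, z // size) for x, y, z in voxels})


def box_dimension(voxels, sizes=(1, 2, 4)):
    """Least-squares slope of log N(s) against log(1/s)."""
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(box_count(voxels, s)) for s in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Toy voxel sets: a solid cube, a flat plate and a straight line of voxels.
cube = [(x, y, z) for x in range(8) for y in range(8) for z in range(8)]
plate = [(x, y, 0) for x in range(8) for y in range(8)]
line = [(x, 0, 0) for x in range(8)]
print(box_dimension(cube), box_dimension(plate), box_dimension(line))
```

For these idealized sets the estimate recovers the topological dimension; jagged or porous objects would be expected to give non-integer values in between.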

We also use a voxel conﬁguration histogram of

the object border shape. In a 2×2×2 neighborhood

there are 256 different binary voxel conﬁgurations.

The configuration can be described in a similar way to LBP using the binary string (Ohser and Mücklich, 2000)

$$
\begin{aligned}
g(x, y, z) ={} & 2^0 \cdot V(x, y, z) + 2^1 \cdot V(x+1, y, z) \\
 & + 2^2 \cdot V(x, y+1, z) + 2^3 \cdot V(x+1, y+1, z) \\
 & + 2^4 \cdot V(x, y, z+1) + 2^5 \cdot V(x+1, y, z+1) \\
 & + 2^6 \cdot V(x, y+1, z+1) + 2^7 \cdot V(x+1, y+1, z+1).
\end{aligned}
\tag{5}
$$

Obviously, there are 256 different values whose frequencies are, similar to LBP, stored in a histogram. This histogram can be compressed by combining symmetric cases to obtain a 22-dimensional, rotation-invariant histogram (Toriwaki and Yoshida, 2009). Another way of describing the coarseness of objects is to compute a variant of the morphological gradient. We use a 3D opening operation with a ball-shaped structuring element of radius 2 and count the additional voxels compared to the initial binary volume. To obtain a size-invariant metric, we normalize this number by the total number of voxels.
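The 2×2×2 configuration code of equation (5) and the resulting 256-bin histogram can be sketched as follows. The sketch assumes the binary volume is stored as a nested list indexed as V[x][y][z] with entries 0 or 1, assigns the bit weights 2^0 to 2^7 in the order the neighbors are listed in the equation, and counts every 2×2×2 cell of the volume; restricting the count to border cells, as the text suggests, would be a further filtering step. The function names are hypothetical.

```python
def configuration_code(V, x, y, z):
    """Code of the 2x2x2 cell whose corner with the lowest indices is (x, y, z)."""
    # Neighbor order follows the terms of equation (5); bit k has weight 2**k.
    offsets = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0),
               (0, 0, 1), (1, 0, 1), (0, 1, 1), (1, 1, 1)]
    return sum((1 << k) * V[x + dx][y + dy][z + dz]
               for k, (dx, dy, dz) in enumerate(offsets))


def configuration_histogram(V):
    """256-bin histogram over all 2x2x2 cells of the binary volume V."""
    hist = [0] * 256
    nx, ny, nz = len(V), len(V[0]), len(V[0][0])
    for x in range(nx - 1):
        for y in range(ny - 1):
            for z in range(nz - 1):
                hist[configuration_code(V, x, y, z)] += 1
    return hist


# Toy volume: a single foreground voxel in the center of a 3x3x3 array.
single = [[[0] * 3 for _ in range(3)] for _ in range(3)]
single[1][1][1] = 1
hist = configuration_histogram(single)

# A completely filled 2x2x2 cell sets all eight bits.
full = [[[1] * 2 for _ in range(2)] for _ in range(2)]
```

The single voxel is seen once at each of the eight corners of a cell, so each power-of-two code occurs exactly once in `hist`.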

Image-basedObjectClassificationofDefectsinSteelusingData-drivenMachineLearningOptimization

149

Table 3: Best classification results of the classifier black box without automatic feature selection.

| Feature set | Classifier, parameters, PCA | Accuracy / precision / recall |
|---|---|---|
| only tail indicator | SVM Gauss, ν=0.5, γ=0.001, no PCA transform | 0.912 / 0.899 / 0.911 |
| all texture features: gray-value mean, gray-value std, tail indicator, LBP histogram | SVM linear, ν=0.2, PCA, n_PCA = 10 | 0.920 / 0.936 / 0.895 |
| all shape features: volume, morphological gradient, surface area, surface density, mean breadth, Euler number (6 neighborhood), Euler number (26 neighborhood), sphericity f1, sphericity f2, sphericity f3, voxel configuration 256, voxel configuration 22, box dimension | random tree, no PCA transform | 0.654 / 0.653 / 0.641 |
| all 17 features | random tree, no PCA transform | 0.831 / 0.823 / 0.822 |

5 EXPERIMENTS AND RESULTS

5.1 Classiﬁer Black Box Evaluation

First, we evaluate the classiﬁer black box alone with-

out the feature selection layer (see ﬁgure 4). We per-

form a “manual” feature subset selection and com-

pare the classiﬁer optimization results in table 3. The

promising tail indicator alone already reaches an ac-

curacy of 0.912 which shows that the tail is a key de-

scriptor for the NMIs. The accuracy increases only

slightly when the other texture features are used, too.

The shape features alone perform poorly when they

are concatenated. Finally, when using the simple

combination of all 17 presented features, the accu-

racy drops signiﬁcantly compared to the case when

only the texture features are used. The inﬂuence of

the PCA is also marginal in this scenario as it is only

chosen once for the texture features. These effects

indicate that some shape features contain irrelevant information that disturbs the dimension reduction and classifier algorithms.

The SVM concept performs best on the texture

features, but when shape features are used, the ran-

dom tree classiﬁer achieves the best results. This

shows that the texture features can be separated

“smoothly” and the complete set of shape features

needs a rather complex decision boundary as espe-

cially the voxel conﬁguration histograms are high-

dimensional. When all 17 features are used the total

feature vector space has 302 dimensions.
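The total of 302 dimensions can be reproduced by summing the dimensions of the 17 features. In the sketch below, the histogram sizes (10, 256 and 22) are stated in the text, while treating every other feature as a single scalar is an assumption that is consistent with the reported total.

```python
# Dimension of each of the 17 features (scalars assumed to have dimension 1).
feature_dims = {
    # texture features
    "gray-value mean": 1,
    "gray-value std": 1,
    "tail indicator": 1,
    "LBP histogram": 10,
    # shape features
    "volume": 1,
    "morphological gradient": 1,
    "surface area": 1,
    "surface density": 1,
    "mean breadth": 1,
    "Euler number (6 neighborhood)": 1,
    "Euler number (26 neighborhood)": 1,
    "sphericity f1": 1,
    "sphericity f2": 1,
    "sphericity f3": 1,
    "voxel configuration 256": 256,
    "voxel configuration 22": 22,
    "box dimension": 1,
}

total_dims = sum(feature_dims.values())
histogram_dims = sum(d for d in feature_dims.values() if d > 1)
print(len(feature_dims), "features,", total_dims, "dimensions,",
      histogram_dims, "of them from histograms")
```

Under this assumption the three histograms account for 288 of the 302 dimensions, which illustrates why they dominate the concatenated feature vector.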

5.2 Feature Subset Selection

In the next experiment, all features are inside of the

selection pool for the feature subset selection of the

machine learning black box (see ﬁgure 4). We com-

pare SFS, SBS and the ﬂoating versions SFFS and

SFBS. The best classiﬁcation results can be found

in table 4. The performance increases in every case

when feature selection is used. However, the results

of the different selection strategies show noticeable

differences. The highest performance is achieved by

the SFFS selection strategy with an accuracy of 0.967.

In most metrics, the forward selection strategies per-

form better than the backward selection algorithms.

The selected subsets differ both in size (5 to 10 features) and in composition. This indicates that the subset selection strategies may get stuck in local optima, though the differences in accuracy are not extreme. The SVM dominates the best solutions, though its kernel parameters vary. The MLP and the kNN classifiers appear in the top 10 result rankings, too. Here the feature transform layer has more influence and the PCA is chosen in many configurations, which supports the order of our framework structure in which the feature selection is applied before the feature transform.
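The difference between plain and floating forward selection can be illustrated with a small sketch. It is not the framework's implementation: the `score` callback stands in for one classifier black box optimization that returns the quality metric of a subset, and the toy score table, feature names and function names are invented purely to show a case in which SFS gets stuck while SFFS recovers.

```python
def sfs(features, score):
    """Sequential forward selection: greedily add the feature that scores best."""
    selected, best_subset, best_score = [], [], float("-inf")
    remaining = list(features)
    while remaining:
        best_add = max(remaining, key=lambda f: score(selected + [f]))
        selected = selected + [best_add]
        remaining.remove(best_add)
        current = score(selected)
        if current > best_score:
            best_subset, best_score = list(selected), current
    return best_subset, best_score


def sffs(features, score):
    """Sequential floating forward selection with conditional backward steps."""
    selected = []
    best_by_size = {}  # subset size -> (score, subset)

    def record(subset):
        # Remember the subset if it beats the best known subset of its size.
        s = score(subset)
        if s > best_by_size.get(len(subset), (float("-inf"), None))[0]:
            best_by_size[len(subset)] = (s, list(subset))
            return True
        return False

    while len(selected) < len(features):
        remaining = [f for f in features if f not in selected]
        best_add = max(remaining, key=lambda f: score(selected + [f]))
        selected = selected + [best_add]
        record(selected)
        # Floating step: drop features while that yields a new best smaller subset.
        while len(selected) > 2:
            best_drop = max(selected, key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != best_drop]
            if not record(reduced):
                break
            selected = reduced
    best_score, best_subset = max(best_by_size.values(), key=lambda t: t[0])
    return best_subset, best_score


# Hypothetical quality metric: "a" is best alone, but "b" and "c" work best together.
TOY_SCORES = {
    frozenset("a"): 0.6, frozenset("b"): 0.5, frozenset("c"): 0.5,
    frozenset("ab"): 0.7, frozenset("ac"): 0.7, frozenset("bc"): 0.9,
    frozenset("abc"): 0.8,
}


def toy_score(subset):
    return TOY_SCORES[frozenset(subset)]


sfs_subset, sfs_score = sfs(["a", "b", "c"], toy_score)
sffs_subset, sffs_score = sffs(["a", "b", "c"], toy_score)
print("SFS: ", sorted(sfs_subset), sfs_score)
print("SFFS:", sorted(sffs_subset), sffs_score)
```

In this toy case SFS can never discard the early choice "a", whereas the backward step of SFFS removes it once "b" and "c" are both present. The price is additional score evaluations, each of which corresponds to a full black box optimization in the framework.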

We can use the selected feature subsets with the

top ranks (table 4 only shows rank 1) to obtain a fea-

ture relevance ranking. In table 5 the occurrences of

each feature in the best 20 subsets of all selection

strategies are counted. This ranking strategy is es-

pecially interesting because the feature combinations

and the ﬁnal classiﬁer quality are considered. In our

example, the tail indicator has been selected every

time and the Minkowski functionals have also been

chosen very frequently. The high-dimensional histograms of the texture and the voxel configurations almost never appear in the best solutions, so using them degrades the classifier performance and they should be omitted.
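The counting scheme behind this relevance ranking can be sketched as follows. The subsets in the example are invented for illustration; in the setting described above, the input would be the pooled top-ranked subsets of all selection strategies. The function and parameter names are hypothetical.

```python
from collections import Counter


def relevance_ranking(top_subsets, all_features):
    """Percentage of top-ranked subsets in which each feature occurs, best first."""
    counts = Counter(f for subset in top_subsets for f in set(subset))
    n = len(top_subsets)
    ranking = [(f, 100.0 * counts[f] / n) for f in all_features]
    # Sort by descending percentage, ties broken alphabetically.
    return sorted(ranking, key=lambda item: (-item[1], item[0]))


# Invented example: four top-ranked subsets over four candidate features.
all_features = ["tail indicator", "surface density", "volume", "LBP histogram"]
top_subsets = [
    ["tail indicator", "surface density"],
    ["tail indicator"],
    ["tail indicator", "volume"],
    ["tail indicator", "surface density"],
]
ranking = relevance_ranking(top_subsets, all_features)
for rank, (feature, percent) in enumerate(ranking, start=1):
    print(rank, feature, f"{percent:.2f}")
```

If the 20 best subsets of each of the four strategies are pooled, there are 80 subsets in total, so a value such as 97.50 % would correspond to 78 occurrences.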

Table 4: Best classification results of the machine learning black box with different feature selection strategies.

| Selection strategy | Selected feature subset | Classifier, parameters, PCA | Accuracy / precision / recall |
|---|---|---|---|
| SFS | box dimension, Euler number (6 neighborhood), sphericity f, sphericity f, surface density, tail indicator | SVM linear, ν=0.2, PCA, n_PCA = 3 | 0.944 / 0.953 / 0.944 |
| SBS | box dimension, Euler number (6 neighborhood), Euler number (26 neighborhood), gray-value std, mean breadth, morphological gradient, sphericity f, surface density, tail indicator, volume | SVM Gauss, ν=0.3, γ=0.1, PCA, n_PCA = 5 | 0.943 / 0.945 / 0.946 |
| SFFS | box dimension, Euler number (6 neighborhood), mean breadth, sphericity f, sphericity f, surface area, surface density, tail indicator, volume | SVM Gauss, ν=0.4, γ=0.001, no PCA transform | 0.967 / 0.970 / 0.962 |
| SFBS | mean breadth, sphericity f, surface density, tail indicator, volume | SVM Gauss, ν=0.3, γ=0.002, PCA, n_PCA = 2 | 0.952 / 0.954 / 0.953 |

Figure 6: Progress and processing time of the classification quality depending on the subset selection strategy (x-axis: subset selection iteration; y-axis: best quality metric). The best solution is found after 1.07 hours with SFS, 2.96 hours with SBS, 3.73 hours with SFFS and 7.30 hours with SFBS.

As the theoretical complexity of the approach is relatively high, we evaluated the processing times and feature selection steps. Using an Intel Xeon CPU with 6 × 2.50 GHz and Matlab, the average computation time for a single classifier black box optimization is 74.3 ± 38.4 seconds. The feature selection uses this optimization, so the number of iterations needed is important. Figure 6 shows the progress of the best quality metric (see section 3.5) during the feature selection. The forward strategies (SFS and SFFS) need far fewer iterations to select a good subset, which shows that only relatively few features are needed to solve the example classification task. Reasonably good subsets can be selected using SFS within around one hour of computation time, while the fine-tuning of the SFFS needs almost 4 hours to find the overall best combination.

Table 5: Feature relevance ranking by counting the percentage of occurrences in the top 20 results of the feature selection strategies (SFS, SBS, SFFS, SFBS).

| Rank | Feature | % |
|---|---|---|
| 1 | tail indicator | 100.00 |
| 2 | surface density | 97.50 |
| 3 | Euler number (6 neighborhood) | 80.00 |
| 4 | mean breadth | 71.25 |
| 5 | box dimension | 62.50 |
| 6 | sphericity f | 57.50 |
| 7 | volume | 55.00 |
| 8 | sphericity f | 52.50 |
| 9 | Euler number (26 neighborhood) | 47.50 |
| 10 | surface area | 46.25 |
| 11 | gray-value std | 42.50 |
| 12 | sphericity f | 33.75 |
| 13 | morphological gradient | 28.75 |
| 14 | gray-value mean | 11.25 |
| 15 | LBP histogram | 1.25 |
| 16 | voxel configuration 256 | 0.00 |
| 17 | voxel configuration 22 | 0.00 |

6 CONCLUSIONS

We presented a holistic machine learning framework that incorporates key machine learning components, namely classifier and parameters, feature transform and feature subset selection. The framework automatically finds the best configuration using a totally data-driven search strategy. The user just needs to provide a set of labeled ground truth feature vectors. We could successfully show the benefits of the

proposed approach with an image-based object clas-

siﬁcation task. By passing feature vectors and ground

truth labels, the optimization framework returns the

best classiﬁer pipeline that can be directly used. As

a “byproduct” the proposed feature relevance ranking

provides insight into the most distinctive descriptors

of the classiﬁcation problem. This knowledge can be

used to develop better features. The framework can

also be used as a feasibility study for classiﬁcation

problems to check if the measured data and the fea-

tures are able to achieve a required classiﬁcation ac-

curacy. Using this framework, the machine learning

component becomes a black box so that experts can

focus on the development of task-speciﬁc features.

The framework uses partly heuristic grid search, so the complexity of the proposed approach is relatively high. On the other hand, our experiments showed that

it is computationally feasible on today’s computer ar-

chitectures for up to 20 features. The framework is

in every case faster than a manual optimization by an

expert. A fully exhaustive search of all feature selection combinations is clearly infeasible. Given the huge parameter search space of the framework, strategies such as evolutionary or random optimization have great potential in this approach. In future work,

the framework will be applied to a larger variety of

classiﬁcation tasks to show its universality. Further-

more, the best classiﬁers can be combined to obtain

an ensemble classiﬁer (Jain et al., 2000) with a poten-

tially greater predictive performance. Additionally,

the framework can easily be extended with the newest

state-of-the-art classiﬁers, feature transform and se-

lection algorithms as “plugins”.
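The infeasibility of an exhaustive subset search can be made concrete with a rough estimate. The sketch below assumes that every non-empty subset of the 17 features would require one classifier black box optimization and that each takes the reported average of 74.3 seconds; in practice the time varies with the subset, so the result only indicates an order of magnitude.

```python
N_FEATURES = 17
SECONDS_PER_OPTIMIZATION = 74.3  # reported average for one black box optimization

# Every feature is either in or out of a subset; the empty subset is excluded.
n_subsets = 2 ** N_FEATURES - 1
total_seconds = n_subsets * SECONDS_PER_OPTIMIZATION
total_days = total_seconds / (24 * 3600)

print(f"{n_subsets} subsets, roughly {total_days:.1f} days of computation")
```

Under these assumptions the exhaustive search would run for several months, compared with a few hours for the sequential strategies, and each additional feature would double that time.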

REFERENCES

Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13(1):281–305.

Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. (1999). When is "nearest neighbor" meaningful? In Beeri, C. and Buneman, P., editors, Database Theory - ICDT'99, volume 1540 of Lecture Notes in Computer Science, pages 217–235. Springer Berlin Heidelberg.

Buck, C., Bürger, F., Herwig, J., and Thurau, M. (2013). Rapid inclusion and defect detection system for large steel volumes. ISIJ International, 53, No. 11. Accepted.

Bürger, F., Herwig, J., Thurau, M., Buck, C., Luther, W., and Pauli, J. (2013). An auto-adaptive measurement system for statistical modeling of non-metallic inclusions through image-based analysis of milled steel surfaces. In Bosse, H. and Schmitt, R., editors, ISMTII 2013, 11th International Symposium on Measurement Technology and Intelligent Instruments. Apprimus Wissenschaftsverlag.

Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27.

Doshi, N. and Schaefer, G. (2012). A comparative analysis of local binary pattern texture classification. In Visual Communications and Image Processing (VCIP), 2012 IEEE, pages 1–6.

Falconer, K. (2003). Fractal geometry: mathematical foundations and applications. Wiley, 2nd edition.

Herwig, J., Buck, C., Thurau, M., Pauli, J., and Luther, W. (2012). Real-time characterization of non-metallic inclusions by optical scanning and milling of steel samples. In Proc. of SPIE Vol, volume 8430, pages 843010–1.

Huang, C.-L. and Wang, C.-J. (2006). A GA-based feature selection and parameters optimization for support vector machines. Expert Systems with Applications, 31(2):231–240.

Jain, A., Duin, R. P. W., and Mao, J. (2000). Statistical pattern recognition: a review. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(1):4–37.

Juszczak, P., Tax, D., and Duin, R. (2002). Feature scaling in support vector data description. In Proc. ASCI, pages 95–102. Citeseer.

Kohavi, R. and John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1–2):273–324.

Lemke, C., Budka, M., and Gabrys, B. (2013). Metalearning: a survey of trends and technologies. Artificial Intelligence Review, pages 1–14.

Lin, S.-W., Lee, Z.-J., Chen, S.-C., and Tseng, T.-Y. (2008a). Parameter determination of support vector machine and feature selection using simulated annealing approach. Applied Soft Computing, 8(4):1505–1512. Soft Computing for Dynamic Data Mining.

Lin, S.-W., Ying, K.-C., Chen, S.-C., and Lee, Z.-J. (2008b). Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Systems with Applications, 35(4):1817–1824.

Ohser, J. and Mücklich, F. (2000). Statistical analysis of microstructures in materials science. John Wiley, New York.

Reif, M., Shafait, F., Goldstein, M., Breuel, T., and Dengel, A. (2012). Automatic classifier selection for non-experts. Pattern Analysis and Applications, pages 1–14.

Somorjai, R. L., Alexander, M. E., Baumgartner, R., Booth, S., Bowman, C., Demko, A., Dolenko, B., Mandelzweig, M., Nikulin, A. E., Pizzi, N., Pranckeviciene, E., Summers, A. R., and Zhilkin, P. (2004). A data-driven, flexible machine learning strategy for the classification of biomedical data. In Dubitzky, W. and Azuaje, F., editors, Artificial Intelligence Methods And Tools For Systems Biology, volume 5 of Computational Biology, pages 67–85. Springer Netherlands.

Toriwaki, J. and Yoshida, H. (2009). Fundamentals of three-dimensional digital image processing. Springer.

Van der Maaten, L., Postma, E., and Van Den Herik, H. (2009). Dimensionality reduction: A comparative review. Journal of Machine Learning Research, 10:1–41.

Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341–1390.

VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications

152