Minimum Modal Regression
Koichiro Yamauchi¹ and Vanamala Narasimha Bhargav²
¹Department of Computer Science, Chubu University, Matsumoto-cho 1200 Kasugai, Japan
²Indian Institute of Technology, Guwahati, Assam, India
Keywords: Modal Regression, Kernel Distribution Estimator, Incremental Learning on a Budget, Kernel Machines,
Projection Method.
Abstract: The recent development of microcomputers enables the execution of complex software in small embedded
systems. Artificial intelligence is one form of software to be embedded into such devices. However, almost
all embedded systems still have restricted storage space. One of the authors has previously proposed an
incremental learning method for regression that works within a fixed storage space; however, this method
cannot support the multivalued functions that frequently appear in real-world problems. One way to support
multivalued functions is to use modal regression with a kernel density estimator. However, this
method assumes that all sample points are recorded as kernel centroids, which is not suitable for small
embedded systems. In this paper, we propose a minimum modal regression method that reduces the number
of kernels using a projection method. The conditions required to maintain accuracy are derived through
theoretical analysis. The experimental results show that our method reduces the number of kernels while
maintaining a specified level of accuracy.
1 INTRODUCTION
The recent development of microcomputers enables
the embedding of complex software into small
devices. Machine learning algorithms are one
example of such software. One of the authors has
previously proposed a learning algorithm for kernel
regression in embedded systems (Yamauchi, 2014),
but this general regression method estimates the
conditional expectation of the dependent variable (Y)
given the independent variables (X=x). In contrast,
modal regression (Einbeck et al., 2006) estimates the
conditional modes of Y given X=x. This strategy
enables the learning machine to predict a portion of
the missing variables from the other known variables
according to the given sample distribution. This
property is quite different from that of other typical
regression methods.
To estimate the conditional modes, the partial mean
shift (PMS) method is a reliable approach. PMS first
constructs the joint kernel density estimate and then
climbs it toward the conditional modes by gradient
ascent. However, the number of samples grows
without bound over time. We therefore propose
minimum modal regression, which maintains the joint
kernel density estimate within a fixed budget by
projecting each new sample onto the existing kernels,
replacing an old kernel, or adding a new kernel for the
sample. The equation for PMS is then modified
accordingly.
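As a concrete illustration, the following is a minimal sketch of partial mean shift with Gaussian kernels (not the authors' budgeted implementation): for a fixed input x, the estimate of y is repeatedly replaced by a kernel-weighted average of the sample outputs, which is the standard fixed-point iteration for ascending the conditional density.

```python
import numpy as np

def gauss(u, h):
    # Isotropic Gaussian kernel value (unnormalized is sufficient here).
    return np.exp(-0.5 * (u / h) ** 2)

def partial_mean_shift(x, y0, X, Y, h=0.3, iters=100, tol=1e-6):
    """Ascend the conditional density of y given x, starting from y0.

    X, Y: 1-D arrays of sample inputs/outputs (the kernel centroids).
    A fixed point of the iteration is a conditional mode of the KDE.
    """
    y = y0
    wx = gauss(x - X, h)          # input-side weights stay fixed
    for _ in range(iters):
        w = wx * gauss(y - Y, h)  # joint kernel weights
        y_new = np.sum(w * Y) / np.sum(w)
        if abs(y_new - y) < tol:
            break
        y = y_new
    return y

# Toy multivalued data: two output branches, y ~ +1 and y ~ -1, near x = 0.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 0.5, 100), rng.normal(0, 0.5, 100)])
Y = np.concatenate([rng.normal(1.0, 0.05, 100), rng.normal(-1.0, 0.05, 100)])

# Starting from different y0, PMS converges to different conditional modes.
m1 = partial_mean_shift(0.0, 0.8, X, Y)   # roughly +1
m2 = partial_mean_shift(0.0, -0.8, X, Y)  # roughly -1
```

Ordinary kernel regression at x = 0 would return a value near 0, the conditional mean, which lies between the two branches; PMS instead recovers each branch separately.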
2 MODAL REGRESSION
Modal regression approximates a multivalued
function by searching for the local peaks of a given
sample distribution. It consists of a kernel density
estimator combined with the PMS method.
2.1 Kernel Density Estimator
The kernel density estimator (KDE) is a variation of
the Parzen window (Parzen, 1962).
Let \(\{x_i\}\) be the set of learning samples, and let
\(S_t\) denote the support set, i.e., the set of kernel
centroids retained at time \(t\). The estimator
approximates the probability density function using
the kernels in the support set:
\[
\hat{p}(x) = \frac{1}{|S_t|} \sum_{i \in S_t} K_h(x - x_i),
\]
where \(K_h\) is a Gaussian kernel with bandwidth \(h\).
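To make the estimator concrete, here is a minimal one-dimensional sketch of the Gaussian kernel density estimator described above (the normalization assumes 1-D data; the function name and toy data are illustrative, not from the paper):

```python
import numpy as np

def kde(x, centroids, h):
    """Gaussian kernel density estimate at point x.

    centroids: array of retained sample points (the support set S_t);
    h: the kernel bandwidth.
    """
    u = (x - centroids) / h
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel
    return k.sum() / (len(centroids) * h)

# Samples drawn around two modes yield a bimodal density estimate:
rng = np.random.default_rng(1)
S = np.concatenate([rng.normal(-2.0, 0.3, 200), rng.normal(2.0, 0.3, 200)])
at_mode = kde(-2.0, S, 0.2)  # high density at a mode
between = kde(0.0, S, 0.2)   # low density between the modes
```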
Yamauchi, K. and Bhargav, V.
Minimum Modal Regression.
DOI: 10.5220/0006601304480455
In Proceedings of the 7th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2018), pages 448-455
ISBN: 978-989-758-276-9
Copyright © 2018 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved