A Generalization of the CMA-ES Algorithm for Functions with Matrix Input
Simon Konzett
Faculty of Mathematics, University of Vienna, Oskar-Morgenstern-Platz 1, Vienna, Austria
Keywords:
Evolutionary strategies, Evolutionary algorithms, Adaptation, High-dimensional, Covariance matrix, Mutation distribution, Self-adaptation.
Abstract:
This paper proposes a novel modification of the covariance matrix adaptation evolution strategy (CMA-ES) introduced by (Hansen and Ostermeier, 1996) for a special problem setting, namely the case where the function to be optimized takes a matrix as input. An approach is presented in which matrices are sampled directly, without vectorizing, and column-wise and row-wise covariance matrices are adapted in each iteration of the proposed evolution strategy to adapt the mutation distribution. The method appears able to capture correlations between the entries of the considered matrix and to adapt the corresponding covariance matrices accordingly. Numerical tests are performed to show advantages and disadvantages of the proposed method.
1 INTRODUCTION
Evolution strategies are used to minimize non-linear objective functions, usually mapping a subspace $X \subseteq \mathbb{R}^n$ to $\mathbb{R}$. The iterative strategy is to select good points and then to vary these points in the most promising way in order to evolve towards a population of points of higher quality. Selection is done by comparing function values of the points in each generation. Variation means a suitable recombination of the population found so far and a mutation of these points to gain more insight.
Mutation can simply mean adding normally distributed random vectors, but often more sophisticated methods are necessary. Usually the mutation distribution is characterized by a covariance matrix which should reflect the shape and size of the distribution. Several approaches have been proposed to adapt these covariance matrices in a promising way; one of the first ideas was proposed by (Schwefel, 1977). In this work, however, a modification of the CMA-ES (Covariance Matrix Adaptation Evolution Strategy) by (Hansen and Ostermeier, 1996), (Hansen, 1998) and (Hansen and Ostermeier, 2001) is proposed. The CMA-ES collects information about successful search steps and stores this information in so-called evolution paths. The gained information is used to adapt the covariance matrix and to slowly derandomise the mutation distribution. Many modifications of the CMA-ES have been proposed since, for example (Jastrebski and Arnold, 2006), (Igel et al., 2007), (Igel et al., 2006) and many more.
In this paper a modification of the CMA-ES is proposed for functions $f : \mathbb{R}^{n \times m} \to \mathbb{R}$ with matrices as input. This kind of problem appeared during my research; in particular, the task was to determine suitable parameter matrices for a state space model. Obviously the original CMA-ES method can treat this kind of problem by simply vectorizing the input matrix, although the dimension of such a problem quickly becomes large for reasonably sized matrices. In higher dimensions the original CMA-ES method needs large covariance matrices to characterize the mutation distribution, and as a result the adaptation of these covariance matrices has to be done carefully and slowly. As in the original CMA-ES, the new method makes use of past successful steps and adapts covariance matrices related to the rows and columns, respectively, of the matrix-valued mutation distribution. The matrices to adapt in the proposed method are much smaller, so the hope is that the method is more flexible and faster in certain cases.
The remainder of the paper is organised as follows. In the next section the original CMA-ES is briefly discussed, following (N. Hansen, 2011). In Section 3 the proposed modification of the CMA-ES is described. Section 4 then shows some computational results and Section 5 gives a short review and conclusion.
2 CMA-ES
In this section a brief outline of the CMA-ES is given.
The problem to solve is generally
$$\min_{x \in \mathbb{R}^n} f(x) \qquad (1)$$
where $f$ is a non-linear real-valued function.
The algorithm needs initially a starting point $m_0 \in \mathbb{R}^n$, a characterization of the initial mutation distribution given by a covariance matrix $C_0 \in \mathbb{R}^{n \times n}$ and an initial global step length $\sigma_0 > 0$. Then the algorithm in each iteration is the following.
First a new sample of $\lambda > 0$ candidate points is taken,
$$z_k \sim \mathcal{N}(0, I) \qquad (2)$$
$$y_k = B D z_k \sim \mathcal{N}(0, C) \qquad (3)$$
$$x_k = m + \sigma y_k \sim \mathcal{N}(m, \sigma^2 C) \qquad (4)$$
where $m$ is the best point at this time, the matrices $B$ and $D$ are given by an eigendecomposition of the covariance matrix $C = B D D B^T$ and the value $\sigma > 0$ is the global step length at this time. So one has a set of sample points $\{x_k\}_{k=1,\dots,\lambda}$ and these points are compared to each other. The indices of the points $x_k, y_k, z_k$ are reordered such that $f(x_k) \le f(x_{k+1})$ and a new best point is determined by
$$m = m + \sigma \sum_{k=1}^{\mu} w_k y_k = m + \sigma \langle y \rangle_w \qquad (5)$$
where $\mu \le \lambda$ is the parental population size and the $w_k$ are weights such that better points have more influence on the adaptation.
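To make this sampling and recombination step concrete, the following sketch (Python with NumPy; the function and parameter names are illustrative and not a reference implementation) carries out equations (2)-(5) for one generation, assuming a symmetric positive definite covariance matrix $C$ and given recombination weights $w_k$.

```python
import numpy as np

def sample_and_recombine(f, m, C, sigma, lam, mu, weights, rng):
    """One sampling/recombination step following equations (2)-(5).

    Illustrative sketch: assumes C is symmetric positive definite and that
    `weights` holds the mu recombination weights w_k.
    """
    eigvals, B = np.linalg.eigh(C)            # C = B D D B^T
    D = np.diag(np.sqrt(eigvals))

    z = rng.standard_normal((lam, m.size))    # z_k ~ N(0, I)
    y = z @ (B @ D).T                         # y_k = B D z_k ~ N(0, C)
    x = m + sigma * y                         # x_k = m + sigma * y_k

    order = np.argsort([f(xk) for xk in x])   # f(x_1) <= f(x_2) <= ...
    y_sel = y[order[:mu]]                     # the mu selected steps
    y_w = weights @ y_sel                     # <y>_w = sum_k w_k y_k
    m_new = m + sigma * y_w                   # equation (5)
    return m_new, y_w, y_sel
```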
Second, the global step size control is done,
$$p_\sigma = (1 - c_\sigma)\, p_\sigma + \sqrt{c_\sigma (2 - c_\sigma)\, \mu_{\mathrm{eff}}}\; B D^{-1} B^T \langle y \rangle_w \qquad (6)$$
$$\sigma = \sigma \cdot \exp\!\left(\frac{1}{d}\left(\frac{\|p_\sigma\|}{E\|\mathcal{N}(0, I)\|} - 1\right)\right) \qquad (7)$$
where $p_\sigma$ is called the conjugate evolution path. The conjugate evolution path is normalized such that $p_\sigma \sim \mathcal{N}(0, I)$ under purely random selection, and then its length is compared with the expected length under purely random selection. For more details about the normalization and the step length adaptation I refer to (Hansen and Ostermeier, 1996), (Hansen, 1998), (Hansen and Ostermeier, 2001) or (N. Hansen, 2011). Basically the idea is that if good steps $y_k$ point in a similar direction, the accumulation of these steps leads to a larger path length $\|p_\sigma\|$ compared with its expectation $E\|\mathcal{N}(0, I)\|$ under purely random selection, and as a consequence the global step length $\sigma > 0$ shall grow. Similarly, if good steps $y_k$ cancel each other out, the global step length $\sigma > 0$ gets smaller. The real parameter $d > 0$ controls how fast the global step length can change.
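A corresponding sketch of the step-size control (6)-(7) is given below; the approximation of $E\|\mathcal{N}(0, I)\|$ used here is the one commonly found in the CMA-ES literature and is an assumption of this sketch, as are the variable names.

```python
import numpy as np

def expected_norm(n):
    # Approximation of E||N(0, I_n)|| commonly used for CMA-ES step-size control
    # (an assumption here, not taken from the paper itself).
    return np.sqrt(n) * (1.0 - 1.0 / (4.0 * n) + 1.0 / (21.0 * n * n))

def update_step_size(p_sigma, sigma, B, D, y_w, c_sigma, mu_eff, d):
    """Conjugate evolution path and global step length, equations (6)-(7)."""
    D_inv = np.diag(1.0 / np.diag(D))
    p_sigma = (1.0 - c_sigma) * p_sigma \
        + np.sqrt(c_sigma * (2.0 - c_sigma) * mu_eff) * (B @ D_inv @ B.T @ y_w)
    sigma = sigma * np.exp((1.0 / d) * (np.linalg.norm(p_sigma) / expected_norm(p_sigma.size) - 1.0))
    return p_sigma, sigma
```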
The last part in each iteration of the algorithm is the covariance adaptation,
$$p_c = (1 - c_c)\, p_c + \sqrt{c_c (2 - c_c)\, \mu_{\mathrm{eff}}}\; \langle y \rangle_w \qquad (8)$$
$$C = (1 - c_1 - c_\mu)\, C + c_1\, p_c p_c^T + c_\mu \sum_{k=1}^{\mu} y_k y_k^T. \qquad (9)$$
The evolution path $p_c$ is accumulated from the non-normalized successful steps, and the covariance matrix $C$ is updated by a rank-one update of the evolution path $p_c$ and a rank-$\mu$ update of the selected steps $y_k$. If the covariance matrix $C$ is too large to perform eigendecompositions, then equation (9) can be replaced by
$$C = (1 - c_1 - c_\mu)\, C + c_1\, \mathrm{diag}(p_c^2) + c_\mu\, \mathrm{diag}\!\left(\sum_{k=1}^{\mu} y_k^2\right) \qquad (10)$$
where $\mathrm{diag}(x)$ denotes the diagonal matrix with the vector $x$ on its diagonal and the square of a vector is meant component-wise here.
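The covariance adaptation (8)-(10) can be sketched as follows; the sketch follows the formulas exactly as printed above (in particular, the rank-$\mu$ term is unweighted here), with illustrative names only.

```python
import numpy as np

def update_covariance(C, p_c, y_w, y_sel, c_c, c_1, c_mu, mu_eff, diagonal_only=False):
    """Covariance adaptation following equations (8)-(10).

    `y_sel` holds the mu selected steps y_k as rows; `diagonal_only=True`
    corresponds to the diagonal variant (10).
    """
    p_c = (1.0 - c_c) * p_c + np.sqrt(c_c * (2.0 - c_c) * mu_eff) * y_w      # (8)
    if diagonal_only:
        C = (1.0 - c_1 - c_mu) * C \
            + c_1 * np.diag(p_c ** 2) \
            + c_mu * np.diag(np.sum(y_sel ** 2, axis=0))                     # (10)
    else:
        C = (1.0 - c_1 - c_mu) * C \
            + c_1 * np.outer(p_c, p_c) \
            + c_mu * (y_sel.T @ y_sel)                                       # (9)
    return C, p_c
```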
The change rates $c_\sigma, c_c, c_1, c_\mu > 0$ are assumed to be chosen appropriately. In general it can be said that the choice of suitable change rates is crucial for the success of the algorithm. Larger change rates allow better adaptation and fast change, but there are restrictions due to the finite amount of information which is gained through selection. If the change rates are too large the procedure becomes random, and if they are too small the system is rigid and adapts too slowly, which slows the algorithm down. A discussion of suitable choices of these parameters, as well as of a suitable population size $\lambda > 0$, can be found in (Hansen and Ostermeier, 1996), (Hansen, 1998) and (Hansen and Ostermeier, 2001). The parameter $\mu_{\mathrm{eff}} > 0$ is determined such that $p_\sigma \sim \mathcal{N}(0, I)$ and $p_c \sim \mathcal{N}(0, C)$ under purely random selection. It is easy to see that the adaptation of the mutation distribution is based on the path of successful steps, usually called the evolution path.
In the next section the procedure is generalized for
matrices instead of vectors as points of the population.
3 A GENERALIZATION FOR
MATRICES
As the outline of the algorithm remains the same, first a matrix distribution and a sampling procedure need
ECTA2014-InternationalConferenceonEvolutionaryComputationTheoryandApplications
338
to be introduced. The random matrix
$$X \sim \mathcal{MN}(M, U, V) \qquad (11)$$
is said to be matrix normally distributed, where $M \in \mathbb{R}^{m \times n}$, $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$. The distribution is defined by
$$\mathrm{vec}(X) \sim \mathcal{N}(\mathrm{vec}(M),\, V \otimes U) \qquad (12)$$
and for the matrix normal distribution
$$E(X) = M \qquad (13)$$
$$E\!\left[(X - M)(X - M)^T\right] = \mathrm{tr}(V) \cdot U \qquad (14)$$
$$E\!\left[(X - M)^T(X - M)\right] = \mathrm{tr}(U) \cdot V \qquad (15)$$
hold, where $\mathrm{tr}(X)$ denotes the trace of a square matrix $X$. The expression $V \otimes U$ denotes the Kronecker product of two matrices and $\mathrm{vec}(X)$ denotes the vectorization of a matrix $X$. For more about matrix distributions I refer to (Dawid, 1981). Because of the properties (14) and (15), the matrices $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ can be interpreted as column-wise and row-wise covariances. The matrices $U$ and $V$ have to be real-valued, symmetric and positive definite. Further, the distribution $\mathcal{MN}(M, U, V)$ is sampled easily by
$$Y = B_u D_u X D_v B_v^T \qquad (16)$$
where each entry of the matrix $X \in \mathbb{R}^{m \times n}$ is standard normally distributed and $U = B_u D_u D_u B_u^T$ and $V = B_v D_v D_v B_v^T$ are eigendecompositions.
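As an illustration, a minimal sketch of sampling from $\mathcal{MN}(M, U, V)$ via equation (16) could look as follows (Python with NumPy; names are illustrative, and $U$, $V$ are assumed symmetric positive definite). The empirical second moments of many such samples can be checked against the properties (14) and (15).

```python
import numpy as np

def matnorm_sample(M, U, V, rng):
    """Draw one sample from MN(M, U, V) via equation (16).

    U = B_u D_u D_u B_u^T and V = B_v D_v D_v B_v^T are eigendecompositions;
    U and V are assumed symmetric positive definite.
    """
    m, n = M.shape
    eu, Bu = np.linalg.eigh(U)
    ev, Bv = np.linalg.eigh(V)
    Du = np.diag(np.sqrt(eu))
    Dv = np.diag(np.sqrt(ev))
    Z = rng.standard_normal((m, n))          # entries i.i.d. standard normal
    return M + Bu @ Du @ Z @ Dv @ Bv.T       # zero-mean sample (16), shifted by M
```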
Thus the first step of the algorithm is
$$Z_k \sim \mathcal{MN}(0, I_m, I_n) \qquad (17)$$
$$Y_k = B_u D_u Z_k D_v B_v^T \sim \mathcal{MN}(0, U, V) \qquad (18)$$
$$X_k = M + \sigma Y_k \sim \mathcal{MN}(M, \sigma^2 U, V). \qquad (19)$$
Note that
$$\mathcal{MN}(M, \sigma^2 U, V) = \mathcal{MN}(M, U, \sigma^2 V)$$
is true.
Next, as in the outline of the original CMA-ES, the indices of the population points are reordered by their corresponding function values such that $f(X_k) \le f(X_{k+1})$. Then the best point $M \in \mathbb{R}^{m \times n}$ is updated by
$$M = M + \sigma \sum_{k=1}^{\mu} w_k Y_k = M + \sigma \langle Y \rangle_w. \qquad (20)$$
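Equations (17)-(20) then translate into a sampling and mean-update step such as the following sketch (again with illustrative names; $\langle Z \rangle_w$ is also accumulated here because it is needed for the path update (21) below).

```python
import numpy as np

def sample_generation(f, M, U, V, sigma, lam, mu, weights, rng):
    """One generation of sampling, ranking and mean update, equations (17)-(20)."""
    m, n = M.shape
    eu, Bu = np.linalg.eigh(U)
    ev, Bv = np.linalg.eigh(V)
    Du, Dv = np.diag(np.sqrt(eu)), np.diag(np.sqrt(ev))

    Z = [rng.standard_normal((m, n)) for _ in range(lam)]  # Z_k ~ MN(0, I_m, I_n)
    Y = [Bu @ Du @ Zk @ Dv @ Bv.T for Zk in Z]             # Y_k ~ MN(0, U, V)
    X = [M + sigma * Yk for Yk in Y]                       # X_k ~ MN(M, sigma^2 U, V)

    order = np.argsort([f(Xk) for Xk in X])                # f(X_1) <= f(X_2) <= ...
    Y_sel = [Y[i] for i in order[:mu]]
    Z_sel = [Z[i] for i in order[:mu]]
    Y_w = sum(w * Yk for w, Yk in zip(weights, Y_sel))     # <Y>_w
    Z_w = sum(w * Zk for w, Zk in zip(weights, Z_sel))     # <Z>_w, needed in (21)
    M_new = M + sigma * Y_w                                # equation (20)
    return M_new, Y_w, Z_w, Y_sel
```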
Next the global step length $\sigma > 0$ has to be adapted. First the conjugate evolution path is updated,
$$p_\sigma = (1 - c_\sigma)\, p_\sigma + \sqrt{c_\sigma (2 - c_\sigma)\, \mu_{\mathrm{eff}}}\; B_u \langle Z \rangle_w B_v^T \qquad (21)$$
where
$$\langle Z \rangle_w = \sum_{k=1}^{\mu} w_k Z_k \qquad (22)$$
$$\phantom{\langle Z \rangle_w} = B_u D_u^{-1} B_u^T \langle Y \rangle_w B_v D_v^{-1} B_v^T. \qquad (23)$$
By using the same arguments as in the original CMA-ES, the conjugate evolution path $p_\sigma$ is standard normally distributed in each row and each column under purely random selection, and as a consequence the vectorized evolution path $\mathrm{vec}(p_\sigma)$ is also standard normally distributed under purely random selection. So the global step length $\sigma > 0$ is adapted in the same manner as in the previous section,
$$\sigma = \sigma \cdot \exp\!\left(\frac{1}{d}\left(\frac{\|\mathrm{vec}(p_\sigma)\|}{E\|\mathcal{N}(0, I)\|} - 1\right)\right). \qquad (24)$$
The last step in each iteration is again the covariance adaptation. Here both the column-wise and row-wise covariance matrices have to be adapted. The equations are
$$p_c = (1 - c_c)\, p_c + \sqrt{c_c (2 - c_c)\, \mu_{\mathrm{eff}}}\; \langle Y \rangle_w \qquad (25)$$
$$\mathrm{tr}(V) \cdot U = (1 - c_1 - c_\mu)\, \mathrm{tr}(V) \cdot U + c_1\, p_c p_c^T + c_\mu \sum_{k=1}^{\mu} Y_k Y_k^T \qquad (26)$$
$$\mathrm{tr}(U) \cdot V = (1 - c_1 - c_\mu)\, \mathrm{tr}(U) \cdot V + c_1\, p_c^T p_c + c_\mu \sum_{k=1}^{\mu} Y_k^T Y_k. \qquad (27)$$
Again the accumulation of the evolution path leads to $p_c \sim \mathcal{MN}(0, U, V)$ under purely random selection, by the same arguments as in the original CMA-ES. As is easily seen, the outline of the algorithm is basically the same as in the original CMA-ES, but in the last covariance adaptation step one has to be careful when updating the left and right covariance matrices. For reasons of simplicity the system of equations (26) and (27) is not solved; instead the values $\mathrm{tr}(U)$ and $\mathrm{tr}(V)$ are taken from the previous iteration.
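Under this convention, the adaptation step (25)-(27) can be sketched as below; taking $\mathrm{tr}(U)$ and $\mathrm{tr}(V)$ from the previous iteration is implemented by evaluating the traces before the update. All names are illustrative.

```python
import numpy as np

def update_row_col_covariances(U, V, p_c, Y_w, Y_sel, c_c, c_1, c_mu, mu_eff):
    """Adaptation of the column- and row-wise covariances, equations (25)-(27).

    tr(U) and tr(V) are evaluated before the update, i.e. taken from the
    previous iteration as described in the text.
    """
    p_c = (1.0 - c_c) * p_c + np.sqrt(c_c * (2.0 - c_c) * mu_eff) * Y_w          # (25)
    trU, trV = np.trace(U), np.trace(V)
    U = ((1.0 - c_1 - c_mu) * trV * U
         + c_1 * (p_c @ p_c.T)
         + c_mu * sum(Yk @ Yk.T for Yk in Y_sel)) / trV                          # (26)
    V = ((1.0 - c_1 - c_mu) * trU * V
         + c_1 * (p_c.T @ p_c)
         + c_mu * sum(Yk.T @ Yk for Yk in Y_sel)) / trU                          # (27)
    return U, V, p_c
```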
As already mentioned in the beginning, in this proposed algorithm the covariance matrices $U$ and $V$ which characterize the distribution of the mutation operator have fewer degrees of freedom than the covariance matrix $C$ in the original CMA-ES. By using a matrix normal distribution for the mutation distribution we have $\frac{m(m+1) + n(n+1)}{2}$ degrees of freedom in the covariance matrices $U$ and $V$, whereas in the original CMA-ES the covariance matrix $C$ has $\frac{mn(mn+1)}{2}$ degrees of freedom. So the vectorized problem has a lot more degrees of freedom for adaptation. In high dimensions it is suggested to use the CMA-ES only with diagonal covariance matrices, because computing an eigendecomposition in high dimensions is costly. Then the number of degrees of freedom is only $m \cdot n$, which is comparable to $\frac{m(m+1) + n(n+1)}{2}$. However, using a diagonal covariance matrix to characterize the distribution of the mutation means considering only the variance of each coefficient. Thus no correlation between different coefficients is considered any more and as a consequence it is also not reflected in the mutation distribution. So the benefit of the proposed method should be that even in high dimensions correlations between different variables can be identified.
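For a concrete comparison with the matrix size used in the tests of Section 4 ($m = 15$, $n = 20$), the counts are
$$\frac{m(m+1) + n(n+1)}{2} = \frac{15 \cdot 16 + 20 \cdot 21}{2} = 330
\qquad\text{vs.}\qquad
\frac{mn(mn+1)}{2} = \frac{300 \cdot 301}{2} = 45150,$$
and $m \cdot n = 300$ for a purely diagonal covariance matrix $C$.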
Next the proposed method is tested on some test
functions.
4 NUMERICAL RESULTS
In this section the original CMA-ES method is compared with the modification proposed here. The CMA-ES method is executed with the change rates proposed in (N. Hansen, 2011). No eigendecompositions of the covariance matrix are computed in the CMA-ES because the dimension of the considered problems is too high; so only diagonal covariance matrices are considered in the original CMA-ES. The change rates for the method proposed here are adapted from the original method. Numerical tests suggested that the proposed method benefits from larger change rates and fewer sampling points in each iteration.
Two versions of the original CMA-ES algorithm are compared with the modification proposed here. The algorithm denoted by CMA-ES1 uses the population size $\lambda > 0$ initially suggested in (Hansen and Ostermeier, 2001) or (N. Hansen, 2011), and the algorithm denoted by CMA-ES2 uses a smaller population size $\lambda > 0$ in each iteration, the same as the method proposed in this paper. The method introduced in this paper is denoted by matrix-CMA-ES in the figures below.
The first test problem considered here is just the spectral norm for matrices,
$$f_1(A) = \|A - B\|_2 \qquad (28)$$
where $B \in \mathbb{R}^{15 \times 20}$ is a randomly chosen matrix. Obviously the minimizer of this function is $B$ with $f_1(B) = 0$. The dimension of the problem obtained by vectorizing the matrix $A$ for the original CMA-ES algorithm is then 300.
Figure 1: Performance of matrix-CMA-ES compared with CMA-ES1 and CMA-ES2 on the function $f_1$. Shown is the current best function value (y-axis) against the number of function evaluations (x-axis).
In Figure 1 one can easily see that the method proposed here does not do well on this problem. As the algorithm continues, the progress gets smaller and smaller. In contrast, the original CMA-ES method works quite well and is able to solve the problem in a reasonable time. The reason for the bad behaviour is that there is no linear correlation between different variables to capture, so the original CMA-ES algorithm, where only the variances of each entry are adapted, does a lot better. The behaviour for other kinds of norm operators is very similar.
The second problem considered is
$$f_2(A) = \|C \cdot (A - B) \cdot D\| \qquad (29)$$
where $B \in \mathbb{R}^{15 \times 20}$, $C \in \mathbb{R}^{6 \times 15}$ and $D \in \mathbb{R}^{20 \times 5}$ are randomly chosen matrices. The optimal function value is again $f_2 = 0$, but the problem has no unique solution. The dimension of the problem obtained by vectorizing the matrix $A$ for the original CMA-ES algorithm is again 300. This problem obviously has some linear correlation added by the matrices $C$ and $D$. In Figure 2 one can see that the method proposed here performs a lot better than the original CMA-ES method. The reason for the better performance is that the modified procedure is capable of capturing the linear structure of the problem, whereas the original CMA-ES struggles to do that. The original CMA-ES method even becomes random after some time, where no real progress with respect to the function value can be seen.
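For reference, the first two test functions could be set up as in the following sketch. The random matrices below are stand-ins generated from a fixed seed, since the instances actually used in the experiments are not published, and the unspecified norm in (29) is taken to be the spectral norm here as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the randomly chosen matrices of (28) and (29); the instances
# used in the paper's experiments are not published, so these are assumptions.
B = rng.standard_normal((15, 20))
C = rng.standard_normal((6, 15))
D = rng.standard_normal((20, 5))

def f1(A):
    """Spectral norm of A - B, equation (28)."""
    return np.linalg.norm(A - B, 2)

def f2(A):
    """Norm of C (A - B) D, equation (29); the spectral norm is assumed here."""
    return np.linalg.norm(C @ (A - B) @ D, 2)
```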
The third function on which numerical tests were performed is
$$f_3(A) = \|A - B\|_2 + \|A - C\|_2 \qquad (30)$$
where $B \in \mathbb{R}^{15 \times 20}$ and $C \in \mathbb{R}^{15 \times 20}$ are randomly chosen matrices. The dimension of the problem obtained by vectorizing the matrix $A$ for the original CMA-ES algorithm is again 300.
ECTA2014-InternationalConferenceonEvolutionaryComputationTheoryandApplications
340
Figure 2: Performance of matrix-CMA-ES compared with CMA-ES1 and CMA-ES2 on the function $f_2$. Shown is the current best function value (y-axis) against the number of function evaluations (x-axis).
Figure 3: Performance of matrix-CMA-ES compared with CMA-ES1 and CMA-ES2 on the function $f_3$. Shown is the current best function value (y-axis) against the number of function evaluations (x-axis).
Although the third problem looks similar to the first problem, the performance of the modified algorithm is much better. The proposed algorithm does not outperform the original CMA-ES, but it is a little better than the original versions. The reason for the better performance is that here correlation between the coefficients has to be captured.
The last function I consider is
$$f_4(A) = \log\!\left(|\det(C \cdot (A - B) \cdot D)| + 1\right) \qquad (31)$$
where $B \in \mathbb{R}^{10 \times 12}$, $C \in \mathbb{R}^{6 \times 12}$ and $D \in \mathbb{R}^{10 \times 6}$ are randomly chosen matrices. This problem has no unique solution, but the minimal function value is $f_4 = 0$. In Figure 4 one can see that the method proposed here drastically outperforms the conventional methods on this problem. The newly proposed method adapts better to the inner structure of the problem, whereas the conventional algorithms cannot adapt to this structure.
Figure 4: Performance of matrix-CMA-ES compared with CMA-ES1 and CMA-ES2 on the function $f_4$. Shown is the current best function value (y-axis) against the number of function evaluations (x-axis).
5 CONCLUSION
In conclusion, this paper has introduced a modification to the many existing CMA-ES algorithms. The goal was to be able to apply a CMA-ES algorithm to a function whose input is not a vector but a matrix, without vectorizing this matrix. The method proposed here makes this possible, and the paper also investigates the advantages and disadvantages of vectorizing the matrix input when a CMA-ES algorithm is applied. The problem with vectorizing matrices for the purpose of optimization with CMA-ES is that vectorizing even fairly small matrices quickly leads to a high-dimensional problem. In higher dimensions it is very costly and therefore unreasonable to perform eigendecompositions all the time during the algorithm. Therefore linear correlations between the variables can no longer be captured by the conventional CMA-ES algorithm. The method proposed here considers covariance matrices of smaller size, and it is therefore no problem to do these eigendecompositions in each iteration. This results in a better way of capturing the inner structure of the considered problem. The disadvantage is the loss of some degrees of freedom in the characterization of the distribution of the mutation operator. If the characterization of the mutation distribution has too few degrees of freedom, as in the first example in the previous section, the behaviour of the optimization procedure becomes random.
In Section 4, where some numerical tests were performed, one can see that the more complicated the function to minimize is, the better the method proposed here works, whereas for the easiest kind of problem, where the inner structure is very smooth, the conventional method performs a lot better. The modified algorithm also seems to allow larger change rates and is therefore more flexible in capturing the distribution of the mutation operator.
The method seems applicable for problems which are not too smooth and where strong linear correlations have to be captured. For high-dimensional problems where the covariance matrix in the conventional CMA-ES is too large to perform eigendecompositions, it is also conceivable to rewrite the vector in the form of a suitably sized matrix, such that linear correlations between the variables can be captured by the applied method.
In further research, methods to overcome the difficulties when the problem's structure is very smooth have to be found. Here I think it is possible to look at some other modifications of the CMA-ES algorithm. Moreover, the change rates are crucial to the success of a CMA-ES algorithm, so finding solid and good change rates for the algorithm will further improve the proposed method. In addition, more numerical tests have to be performed.
REFERENCES
Dawid, A. P. (1981). Some matrix-variate distribution the-
ory: notational considerations and a Bayesian appli-
cation. Biometrika, 68(1):265–274.
Hansen, N. (1998). Verallgemeinerte individuelle Schrit-
tweitenregelung in der Evolutionsstrategie. Mensch &
Buch Verlag, Berlin.
Hansen, N. and Ostermeier, A. (1996). Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation. In Proceedings of the 1996 IEEE International Conference on Evolutionary Computation, pages 312–317. IEEE.
Hansen, N. and Ostermeier, A. (2001). Completely deran-
domized self-adaptation in evolution strategies. Evo-
lutionary computation, 9(2):159–195.
Igel, C., Suttorp, T., and Hansen, N. (2006). A compu-
tational efficient covariance matrix update and a (1+
1)-CMA for evolution strategies. In Proceedings of
the 8th annual conference on Genetic and evolution-
ary computation, pages 453–460. ACM.
Igel, C., Hansen, N., and Roth, S. (2007). Covariance matrix adaptation for multi-objective optimization. Evolutionary computation, 15(1):1–28.
Jastrebski, G. and Arnold, D. (2006). Improving evolution strategies through active covariance matrix adaptation. In Proceedings of the 2006 IEEE Congress on Evolutionary Computation (CEC 2006), pages 2814–2821. IEEE.
Hansen, N. (2011). The CMA Evolution Strategy: A Tutorial.
Schwefel, H. (1977). Numerische Optimierung von Computer-Modellen mittels der Evolutionsstrategie. Birkhäuser Verlag, Basel.
ECTA2014-InternationalConferenceonEvolutionaryComputationTheoryandApplications
342