INCLUDING IMPROVEMENT OF THE EXECUTION TIME IN A SOFTWARE ARCHITECTURE OF LIBRARIES WITH SELF-OPTIMISATION
Luis-Pedro García
Servicio de Apoyo a la Investigación Tecnológica, Universidad Politécnica de Cartagena, 30203 Cartagena, Spain

Javier Cuenca
Departamento de Ingeniería y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, Spain

Domingo Giménez
Departamento de Informática y Sistemas Informáticos, Universidad de Murcia, 30071 Murcia, Spain
Keywords: Auto-tuning, performance modelling, self-optimisation, hierarchy of libraries.
Abstract: The design of hierarchies of libraries helps to obtain modular and efficient sets of routines to solve problems of specific fields. An example is ScaLAPACK's hierarchy in the field of parallel linear algebra. To facilitate the efficient execution of these routines, the inclusion of self-optimisation techniques in the hierarchy has been analysed. The routines at one level of the hierarchy use information generated by routines from lower levels. Sometimes, however, the information generated at one level is not accurate enough to be used satisfactorily at higher levels, and a remodelling of the routines is necessary. A remodelling phase is proposed and analysed with a Strassen matrix multiplication.
1 INTRODUCTION
Important tuning systems exist that attempt to adapt software automatically to the conditions of the execution platform. These include FFTW for discrete Fourier transforms (Frigo, 1998), ATLAS (Whaley et al., 2001) for the BLAS kernels, sparse kernels (Dongarra and Eijkhout, 2002; Vuduc et al., 2001), SPIRAL (Singer and Veloso, 2000) for signal and image processing, MPI collective communications (Vadhiyar et al., 2000) and linear algebra routines (Chen et al., 2004; Katagiri et al., 2005). Furthermore, the development of automatically tuned software facilitates the efficient utilisation of the routines by non-expert users. For any tuning system, the main goal is to minimise the execution time of the routine to tune, but with the important restriction of not increasing the installation time of that routine too much.
A number of auto-tuning approaches focus on modelling the execution time of the routine to optimise. Once the model has been obtained theoretically and/or experimentally, given a problem size and execution environment, it is used to obtain the values of some adjustable parameters that minimise the execution time. The approach chosen by FAST (Caron et al., 2005) is an extensive benchmark followed by a polynomial regression applied to find optimal parameters for different routines. (Vuduc et al., 2001) apply polynomial regression in their methodology to decide the most appropriate version among variants of a routine. They also introduce a black-box pruning method to reduce the enormous implementation spaces. In the approach of FIBER (Katagiri et al., 2003) the execution time of a routine is approximated by fixing one parameter and varying the other. A set of polynomial functions of degrees 1 to 5 is generated and the best one is selected. The values provided by these functions for different problem sizes are then used to generate another function, in which the second parameter is fixed and the first one varied. (Tanaka et al., 2006) introduce a new method, named Incremental Performance Parameter Estimation, in which the estimation of the theoretical model by polynomial regression starts from the fewest sampling points and is incremented dynamically to improve accuracy. Initially, they apply it on sequential platforms and with just one algorithmic parameter to seek.
(Lastovetsky et al., 2006) reduce the number of sampling points by starting from a previous shape of the curve that represents the execution time. They also introduce the concept of a "speed band" as the natural way to represent the inherent fluctuations in the speed due to changes in load over time.
In this context, our approach has used dense linear algebra software for message-passing systems as the routines to be tuned. In our previous studies (Cuenca et al., 2004) a hierarchical approach to performance modelling was proposed. The basic idea is to exploit the natural hierarchy existing in linear algebra programs. The execution time models of lower-level routines are constructed first, from code analysis. After that, to model a higher-level routine, its execution time is estimated by injecting into its model the information from the lower-level routines it invokes.
Figure 1: Life-cycle of a SOLAR.
In this work, a technique for redesigning the model from a series of sampling points by means of polynomial regression has been included in our original methodology. The basic idea is to start from the hierarchical model, using the information from lower-level routines to model the higher-level ones without experimenting with them. However, if for a concrete routine this information is not accurate enough, its model is built again from the beginning, using a series of experimental executions and polynomial regression applied appropriately.
The rest of the paper is organised as follows: in section 2 the improved architecture of a self-optimised routine is analysed; in section 3 some experimental results are described with their special features; and finally, in section 4 the conclusions are summarised and possible future research is outlined.
2 CREATION AND UTILISATION OF A SELF-OPTIMISED LINEAR ALGEBRA ROUTINE
In this section the design, installation and execution of a Self-Optimised Linear Algebra Routine (SOLAR) are described step by step, following the scheme in figure 1. This life-cycle of a routine, as described below, is a modification/extension of that originally proposed in (Cuenca et al., 2004). The main difference is that in our previous work, during the installation of a routine, its theoretical model and information from lower-level routines were used to complete a theoretical-practical model. Now this process is extended with a testing sub-phase of this model, which compares, over a series of sampling points (different sets of problem sizes plus algorithmic parameters), the distance between the modelled time and the experimental one. If this distance is not small enough, a remodelling sub-phase starts, where a new model of the routine is built from zero, using benchmarking and polynomial regression in different ways.
2.1 Design
This process is performed only once by the designer
of the Linear Algebra Routine (LAR), when this LAR
is being created. The tasks to be done are:
Create the LAR: The LAR is designed and pro-
grammed if it is new. Otherwise, the code of the LAR
does not have to be changed.
Model the LAR theoretically: The complexity of the LAR is studied, obtaining an analytical model of its execution time as a function of the problem size, n, the System Parameters (SP) and the Algorithmic Parameters (AP): T_exec = f(SP, AP, n).
The SP describe how the hardware and the lower-level routines affect the execution time of the LAR. They are the costs of performing arithmetic operations using lower-level routines, for example BLAS routines (Dongarra et al., 1988), and communication parameters (start-up time, t_s, and word-sending time, t_w).
The AP represent the possible decisions to be taken in order to execute the LAR. These are the block size, b, in block-based algorithms, and the parameters defining the logical topology of the processes grid or the data distribution in parallel algorithms.
Select the parameters for testing the model: The most significant values of n and AP are provided by the LAR designer. They will be used to test the theoretical model during the installation process. These AP values will also be used at execution time to select their theoretical optimum values when the run-time problem size is known.
Create the SOLAR_manager: The SOLAR_manager is also programmed by the LAR designer. This subroutine is the engine of the SOLAR. It is in charge of managing all the information inside this software architecture. At installation time, it tests the theoretical model and, if necessary, improves it. At execution time, it tunes the model according to the situation of the platform. Then, using the tuned model, it decides the appropriate values for the AP, and finally calls the LAR.
2.2 Installation
In the installation process of a SOLAR the tasks performed by the SOLAR_manager are:
Execute the LAR: The LAR is executed using the different values of n and AP contained in the structure Model_Testing_Values.
Test the model: The obtained execution times are compared with the theoretical times provided by the model for the same values of n and AP, using the SP values returned by the previously installed lower-level SOLARs. (When a SOLAR receives a request from an upper-level SOLAR about its execution time for a specific combination of (n, AP), it substitutes these values in its tested model, obtaining the corresponding theoretical time, which it sends back to the requesting SOLAR.) If the average distance between theoretical and experimental times exceeds an error quota, the SOLAR_manager remodels the LAR.
2.2.1 Remodelling the LAR
If the installation process reaches this step, it is because the SP information obtained from lower-level SOLARs has not been accurate enough for the current SOLAR model, and so the SOLAR_manager has to build an improved model by itself.
This new model is built from the original model by designing a polynomial scheme for the different combinations of n and AP in Model_Testing_Values. The coefficients of each term of this polynomial scheme must be calculated. To determine these coefficients, four different methods can be applied, in the following order: FIxed Minimal Executions (FI-ME), VAriable Minimal Executions (VA-ME), FIxed Least Square (FI-LS) and VAriable Least Square (VA-LS). When one of them provides enough accuracy, the remaining methods are not used.
FI-ME: The experimental time function is ap-
proximated by a single polynomial. The coefficients
are obtained with the minimum number of executions
needed.
VA-ME: The experimental time function is approximated by a set of p polynomials corresponding to p intervals of combinations of (n, AP). For each of these intervals the above method is applied.
FI-LS: In this method, as in the FI-ME method, the experimental time function is approximated by a single polynomial. Now a least squares method that minimises the distance between the experimental and the theoretical time for a number of combinations of (n, AP) is applied to obtain the coefficients.
VA-LS: As in the VA-ME method, the execution time function is divided into p intervals, applying the FI-LS method in each one.
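As an illustration, the following is a minimal sketch of how the FI-LS and VA-LS methods could be realised. Python/NumPy, the function names and the degree-3 default are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def fi_ls(sizes, times, degree=3):
    """FI-LS sketch: fit one polynomial t(n) to all sampling points;
    np.polyfit minimises the squared residuals (least squares)."""
    return np.polyfit(sizes, times, degree)

def va_ls(sizes, times, p=2, degree=3):
    """VA-LS sketch: split the sampling points into p intervals and
    apply FI-LS independently in each one."""
    chunks = zip(np.array_split(np.asarray(sizes), p),
                 np.array_split(np.asarray(times), p))
    return [np.polyfit(s, t, degree) for s, t in chunks]

# Usage: coeffs = fi_ls(ns, ts); np.polyval(coeffs, 3000) then gives
# the modelled execution time for n = 3000.
```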
2.3 Execution
When a SOLAR is called to solve a problem of size n_R, the following tasks are carried out:
Collecting system information: The SOLAR_manager collects the information that the NWS (Wolski et al., 1999) (or any other similar tool installed on the platform) provides about the current state of the system (CPU load and network load between each pair of nodes).
Tuning the model: The SOLAR_manager tunes the Tested_Model according to the system conditions. Basically, the arithmetic parts of the model are scaled inversely to the availability of the CPU, and likewise the communication parts with respect to the availability of the network.
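A minimal sketch of this tuning step, assuming the availabilities are reported as fractions in (0, 1] (the function and parameter names are illustrative):

```python
def tune_model(t_arith, t_comm, cpu_avail, net_avail):
    """Scale the arithmetic part of the modelled time inversely to the
    available CPU fraction and the communication part inversely to
    the available network fraction."""
    return t_arith / cpu_avail + t_comm / net_avail
```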
Selection of the Optimum_AP: Using the Tuned_Model, with n = n_R, the SOLAR_manager looks for the combination of AP values that provides the lowest execution time.
Execution of the LAR: Finally, the SOLAR_manager calls the LAR with the parameters (n_R, Optimum_AP).
3 SELF-OPTIMISED LINEAR ALGEBRA ROUTINE SAMPLES
In this section the procedure described in section 2 is applied to the Strassen algorithm. The idea is to show, through this particular application, that it is possible to automatically obtain the best possible execution time for given input parameters (n, AP), as well as the scheme followed by a SOLAR_manager to build (if necessary) a new model for a LAR.
3.1 Strassen Matrix-Matrix Multiplication
(Strassen, 1969) introduced an algorithm to multiply n × n matrices with lower complexity than the classical O(n^3). It is based on a scheme for the product C = AB of two 2 × 2 block matrices:
$$\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \times \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$$

where each block is square. In the ordinary algorithm there are eight multiplications and four additions. Strassen discovered that the original algorithm can be reorganized to compute C with just seven multiplications and eighteen additions:
Pre-additions and recursive calls:
  S1 = A11 + A22    T1 = B11 + B22    P1 = S1 × T1
  S2 = A21 + A22    T2 = B12 − B22    P2 = S2 × B11
  S3 = A11 + A12    T3 = B21 − B11    P3 = A11 × T2
  S4 = A21 − A11    T4 = B11 + B12    P4 = A22 × T3
  S5 = A12 − A22    T5 = B21 + B22    P5 = S3 × B22
                                      P6 = S4 × T4
                                      P7 = S5 × T5

Post-additions:
  C11 = P1 + P4 − P5 + P7    C12 = P3 + P5
  C21 = P2 + P4              C22 = P1 + P3 − P2 + P6
The Strassen algorithm can be applied to each of the half-size block multiplications associated with the P_i. Thus, if the original A and B are square matrices of dimension n = 2^q, the algorithm can be applied recursively. As a result, Strassen's algorithm has a complexity of O(n^{log_2 7}) ≈ O(n^{2.807}). Variations of this algorithm available in the literature (Douglas et al., 1994; Huss-Lederman et al., 1996) allow us to handle matrices of arbitrary size.
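For concreteness, a compact sketch of the recursive scheme above for n = 2^q matrices is given below. Python/NumPy is an illustrative choice; in the paper the base-case products are performed by DGEMM:

```python
import numpy as np

def strassen(A, B, level):
    """Recurse `level` times with Strassen's scheme, then fall back to
    the ordinary matrix product for the base case."""
    if level == 0:
        return A @ B
    m = A.shape[0] // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    P1 = strassen(A11 + A22, B11 + B22, level - 1)   # S1 x T1
    P2 = strassen(A21 + A22, B11, level - 1)         # S2 x B11
    P3 = strassen(A11, B12 - B22, level - 1)         # A11 x T2
    P4 = strassen(A22, B21 - B11, level - 1)         # A22 x T3
    P5 = strassen(A11 + A12, B22, level - 1)         # S3 x B22
    P6 = strassen(A21 - A11, B11 + B12, level - 1)   # S4 x T4
    P7 = strassen(A12 - A22, B21 + B22, level - 1)   # S5 x T5
    return np.block([[P1 + P4 - P5 + P7, P3 + P5],
                     [P2 + P4, P1 + P3 - P2 + P6]])
```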
In order to build the analytical model of the run time, we identify different parts of the Strassen algorithm with different basic routines:

- The BLAS3 function DGEMM is used to compute the seven P_i at the lowest level. If the recursion level used is l, the matrix multiplications are made on blocks of size n/2^l.
- At recursion level l, with l = 1, 2, 3, ..., there are 7^{l-1} × 18 matrix additions of size n/2^l. The BLAS1 function DAXPY can be used to compute the S_i, T_i and C_ij.
The execution time for this algorithm can be modelled by the formula:

$$T_S = 7^l\, t_{mult}\left(\frac{n}{2^l}\right) + 18 \sum_{i=1}^{l} 7^{i-1}\, t_{add}\left(\frac{n}{2^i}\right) \qquad (1)$$

where t_mult(n/2^l) is the theoretical execution time for one matrix multiplication of size n/2^l, and t_add(n/2^i) is the theoretical execution time for a matrix addition of size n/2^i.
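The following is a direct transcription of equation (1), assuming t_mult and t_add are callables (for instance the fitted polynomial models of the next section); best_level sketches how the AP, here the recursion level l, would then be selected:

```python
def t_strassen(n, l, t_mult, t_add):
    """Equation (1): modelled Strassen time at recursion level l."""
    return (7**l * t_mult(n / 2**l)
            + 18 * sum(7**(i - 1) * t_add(n / 2**i)
                       for i in range(1, l + 1)))

def best_level(n, t_mult, t_add, levels=(1, 2, 3)):
    """Choose the recursion level that minimises the modelled time."""
    return min(levels, key=lambda l: t_strassen(n, l, t_mult, t_add))
```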
3.2 Experimental Results
The experiments have been performed on two different systems: a Linux Intel Xeon 3.0 GHz workstation (Xeon) and a Unix HP Alpha 1.0 GHz workstation (Alpha). Optimised versions of BLAS (provided by the manufacturers) have been used for the basic routines. In both cases the experiments are repeated several times, obtaining an average value.
In the experiments on Xeon and Alpha we found that, for the matrix multiplication, a third-degree polynomial function was the best fit, and for the matrix addition a sixth-degree polynomial function. Therefore these numbers of coefficients are used for the corresponding SOLARs. The coefficients can be calculated using any of the four methods described in section 2.2.1, but on Xeon and Alpha good results were obtained using the FI-LS method.
For the matrix multiplication the values for the structure Model_Testing_Values on Xeon and Alpha are: matrix sizes n = {500, 1000, 1500, 2000, ..., 10000}; and for the matrix addition: matrix sizes n = {64, 128, 192, 256, ..., 2000}.
On Xeon the polynomial functions that model the basic routines are:

Matrix multiplication: 1.338 × 10^{-1} − 2.261 × 10^{-4} n + 1.039 × 10^{-7} n^2 + 3.963 × 10^{-10} n^3.

Matrix addition: 1.507 × 10^{-3} − 2.952 × 10^{-5} n + 1.521 × 10^{-7} n^2 − 1.970 × 10^{-10} n^3 + 1.614 × 10^{-13} n^4 − 6.367 × 10^{-17} n^5 + 9.687 × 10^{-21} n^6.
and on Alpha:

Matrix multiplication: 9.517 × 10^{-2} + 2.128 × 10^{-4} n − 9.079 × 10^{-8} n^2 + 1.136 × 10^{-9} n^3.

Matrix addition: 9.983 × 10^{-4} + 1.683 × 10^{-5} n − 6.700 × 10^{-8} n^2 + 1.624 × 10^{-10} n^3 − 1.284 × 10^{-13} n^4 + 4.698 × 10^{-17} n^5 − 6.509 × 10^{-21} n^6.
The SOLAR_manager_strassen uses the theoretical model shown in equation (1), and sends requests for the theoretical execution times to the SOLAR_managers at the lower levels: SOLAR_manager_mult and SOLAR_manager_add for t_mult(n/2^l) and t_add(n/2^i) respectively. For the Strassen algorithm the values for the structure Model_Testing_Values on Xeon and Alpha are: l = {1, 2, 3} and n = {3072, 4096, 5120, 6144}.
For the SOLAR_manager to calculate an error quota for the Theoretical_Model, the function Φ_err is defined:

$$\Phi_{err} = \left| \frac{\sum (\text{Theoretical\_Model} - \text{LAR\_Execution})}{\sum \text{LAR\_Execution}} \right| \times 100$$

where both sums run over the sampling points in Model_Testing_Values.
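A minimal sketch of this quota, following the reading of the formula reconstructed above (names are illustrative); applied to the modelled and measured columns of Table 1 it reproduces the 8.38% and 0.87% values reported below:

```python
def phi_err(modelled, measured):
    """Phi_err: magnitude of the aggregate signed deviation of the
    modelled times, as a percentage of the total measured time over
    the points in Model_Testing_Values."""
    diff = sum(m - e for m, e in zip(modelled, measured))
    return abs(diff) / sum(measured) * 100.0
```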
Table 1: Comparison between measured execution times (in seconds) and modelled times for the Strassen algorithm, on Xeon and Alpha. The optimum modelled and experimental times for each matrix size are marked with *.

                     Xeon                        Alpha
  n     l    mod.    exp.    dev.(%)    mod.     exp.    dev.(%)
 3072   1   11.75*  12.86*     8.58     29.96    29.70     0.89
 3072   2   13.90   13.63      1.99     28.54    27.82     2.57
 3072   3   37.04   15.76    135.06     17.55*   27.61*   36.46
 4096   1   27.21*  29.71*     8.41     69.85    70.85     1.43
 4096   2   28.59   30.10      5.02     66.04    64.55     2.30
 4096   3   48.76   33.34     46.26     57.82*   62.56*    7.58
 5120   1   53.14*  56.83      6.51    135.03   134.67     0.26
 5120   2   53.53   56.43*     5.13    125.76   123.38     1.92
 5120   3   71.08   60.19     18.09    118.12*  118.45*    0.28
 6144   1   96.48   96.32      0.17    229.79   232.27     1.07
 6144   2   95.39*  93.69*     1.82    211.10   210.88     0.11
 6144   3  110.40   98.39     12.21    201.15*  199.33*    0.92
Table 1 shows the theoretical execution time provided by the model (mod.), the experimental execution time (exp.) and the deviation (dev.), computed as |t_mod − t_exp| / t_exp, for the values in Model_Testing_Values. The optimum experimental and theoretical times are marked. On both Xeon and Alpha the SOLAR_manager makes a correct prediction of the level of recursion with which the optimal execution times are obtained. Only on Xeon and for matrix size 5120 are the recursion levels different; in this case the execution time obtained with the values provided by the model is about 0.71% higher than the optimum experimental time. The deviation, however, ranges from 0.17% to 135.06% on Xeon and from 0.11% to 36.46% on Alpha, with aggregate errors Φ_err = 8.38% for Xeon and Φ_err = 0.87% for Alpha. This means that the SOLAR_manager has to build an improved model by itself for Xeon. On Alpha it is not necessary to remodel the LAR, and so the Theoretical_Model becomes the Tested_Model.
3.2.1 Remodelling Strassen
On Xeon it is necessary to build a new model for the Strassen algorithm. The scheme consists of defining a set of third-degree polynomial functions derived from Strassen's theoretical model. This set of polynomials is:

$$T_S(n, l) = 2 \cdot 7^l \left(\frac{n}{2^l}\right)^{3} M(l) + \frac{18}{4}\, n^2\, A(l) \sum_{i=1}^{l} \left(\frac{7}{4}\right)^{i-1} \qquad (2)$$

where the coefficients M(l) and A(l) must be calculated. The parameter l (the recursion level) is fixed and the problem size n varies. For each l a set of Strassen executions is carried out, and the values M(l) and A(l) are obtained using the least squares method (FI-LS).
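Since, for fixed l, T_S(n, l) in equation (2) is linear in the unknown coefficients M(l) and A(l), this FI-LS step reduces to an ordinary linear least-squares solve. A sketch, with Python/NumPy as an illustrative choice:

```python
import numpy as np

def fit_M_A(sizes, times, l):
    """Recover M(l) and A(l) from measured Strassen times at fixed l."""
    n = np.asarray(sizes, dtype=float)
    col_M = 2.0 * 7**l * (n / 2**l)**3                    # multiplies M(l)
    geom = sum((7.0 / 4.0)**(i - 1) for i in range(1, l + 1))
    col_A = (18.0 / 4.0) * n**2 * geom                    # multiplies A(l)
    X = np.column_stack([col_M, col_A])
    (M_l, A_l), *_ = np.linalg.lstsq(X, np.asarray(times), rcond=None)
    return M_l, A_l
```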
Table 2 shows the values obtained for M(l) and A(l) on Xeon with l = {1, 2, 3, 4} and n = {512, 1024, 1536, 2048, 2560, 3072, 3584, 4096, 4608}.
Table 2: Values of M(l) and A(l) on Xeon.

  l       M(l)               A(l)
  1   2.222 × 10^{-10}   3.890 × 10^{-8}
  2   2.244 × 10^{-10}   3.033 × 10^{-8}
  3   1.986 × 10^{-10}   3.025 × 10^{-8}
  4   3.477 × 10^{-10}   1.528 × 10^{-8}
The A(l) values can be approximated by a first-degree polynomial and the M(l) values by a second-degree polynomial, with the coefficients obtained by least squares. Finally, the formulae for M(l) and A(l) are:

M(l) = 1.907 × 10^{-10} + 4.580 × 10^{-11} l − 1.445 × 10^{-11} l^2.
A(l) = 4.378 × 10^{-8} − 5.131 × 10^{-9} l.

Thus, we have a single theoretical model for any combination of (n, AP).
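The second-stage fits can be sketched with the Table 2 data; depending on rounding and weighting, the coefficients obtained this way need not match the paper's figures exactly:

```python
import numpy as np

# M(l) and A(l) values from Table 2 (Xeon).
l_vals = np.array([1.0, 2.0, 3.0, 4.0])
M_vals = np.array([2.222e-10, 2.244e-10, 1.986e-10, 3.477e-10])
A_vals = np.array([3.890e-8, 3.033e-8, 3.025e-8, 1.528e-8])

# Second-degree least-squares fit for M(l), first-degree for A(l).
M_poly = np.polyfit(l_vals, M_vals, 2)
A_poly = np.polyfit(l_vals, A_vals, 1)

def M(l): return np.polyval(M_poly, l)
def A(l): return np.polyval(A_poly, l)
```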
Now the SOLAR_manager_strassen verifies the new Theoretical_Model for the Strassen algorithm. The values for the structure Model_Testing_Values are l = {1, 2, 3} and n = {2688, 3200, 5120, 5632}.
Table 3 shows that the model successfully predicts the recursion level with which the optimum execution times are obtained, except for matrix size 5120. In this case the execution time obtained with the values provided by the model is about 3.49% higher than the optimum experimental time. The relative error ranges from 0.17% to 15.52%, and the aggregate error is much smaller, Φ_err = 0.37%. This means that the new Theoretical_Model becomes the Tested_Model.
Table 3: Comparison of the experimental execution times (in seconds) with modelled times for the Strassen algorithm with remodelling, on Xeon.
n l mod. exp. dev. (%)
2688 1 7.87 8.80 11.92
2688 2 8.40 9.67 15.23
2688 3 10.28 10.52 2.38
3200 1 13.02 14.51 11.92
3200 2 13.56 15.51 14.38
3200 3 16.00 16.30 1.87
5120 1 56.80 56.71 0.17
5120 2 56.44 57.01 1.00
5120 3 60.04 55.09 8.25
5632 1 75.78 74.92 1.12
5632 2 73.50 74.56 1.45
5632 3 71.70 70.97 1.03
The values of the algorithmic parameters vary for
different systems and problem sizes, but with the
model and with the inclusion of the possibility of remodelling, a satisfactory selection of the parameters is made in all cases, enabling us to take appropriate decisions about their values prior to execution.
4 CONCLUSIONS AND FUTURE WORK
The use of modelling techniques can contribute to improving the decisions taken to reduce the execution time of the routines. The modelling allows us to introduce information about the behaviour of the routine into the tuning process, thereby guiding it.
The modelling time must be kept small, because at least part of this process could be carried out in each installation of the routines. Therefore, different ways of reducing it have been studied here, and the results have been satisfactory.
Our research group is currently working on the inclusion of meta-heuristic techniques in the modelling (Martínez-Gallar et al., 2006), and on applying the same methodology to other types of routines and algorithmic schemes (Cuenca et al., 2005; Carmo-Boratto et al., 2006).
ACKNOWLEDGEMENTS
This work has been partially supported by the Consejería de Educación de la Región de Murcia, Fundación Séneca 02973/PI/05.
REFERENCES
Carmo-Boratto, M. D., Giménez, D., and Vidal, A. M. (2006). Automatic parametrization on divide-and-conquer algorithms. In Proceedings of the International Congress of Mathematicians.
Caron, E., Desprez, F., and Suter, F. (2005). Parallel exten-
sion of a dynamic performance forecasting tool. Scal-
able Computing: Practice and Experience, 6(1):57–
69.
Chen, Z., Dongarra, J., Luszczek, P., and Roche, K. (2004). LAPACK for clusters project: An example of self adapting numerical software. In Proceedings of HICSS'04, page 90282.1.
Cuenca, J., Giménez, D., and González, J. (2004). Architecture of an automatically tuned linear algebra library. Parallel Computing, 30(2):187–220.
Cuenca, J., Giménez, D., and Martínez-Gallar, J. P. (2005). Heuristics for work distribution of a homogeneous parallel dynamic programming scheme on heterogeneous systems. Parallel Computing, 31:735–771.
Dongarra, J., Croz, J. D., and Duff, I. S. (1988). A set of
level 3 basic linear algebra subprograms. ACM Trans.
Math. Software, 14:1–17.
Dongarra, J. and Eijkhout, V. (2002). Self-adapting numerical software for next generation applications. ICL Technical Report ICL-UT-02-07.
Douglas, C. C., Heroux, M., Slishman, G., and Smith, R. M.
(1994). GEMMW: A portable level 3 BLAS Wino-
grad variant of Strassen’s matrix–matrix multiply al-
gorithm. J. Comp. Phys., 110:1–10.
Frigo, M. (1998). FFTW: An adaptive software architecture for the FFT. In Proceedings of the ICASSP Conference, volume 3, pages 1381–1384.
Huss-Lederman, S., Jacobson, E. M., Tsao, A., Turnbull, T., and Johnson, J. R. (1996). Implementation of Strassen's algorithm for matrix multiplication. In Proceedings of Supercomputing '96, page 32.
Katagiri, T., Kise, K., Honda, H., and Yuba, T. (2003).
FIBER: A generalized framework for auto-tuning
software. Springer LNCS, 2858:146–159.
Katagiri, T., Kise, K., Honda, H., and Yuba, T. (2005). ABCLib_DRSSED: A parallel eigensolver with an auto-tuning facility. Parallel Computing, 32:231–250.
Lastovetsky, A., Reddy, R., and Higgins, R. (2006). Building the functional performance model of a processor. In Proceedings of SAC'06, pages 23–27.
Martínez-Gallar, J. P., Almeida, F., and Giménez, D. (2006). Mapping in heterogeneous systems with heuristical methods. In Proceedings of PARA'06.
Singer, B. and Veloso, M. (2000). Learning to predict performance from formula modeling and training data. In Proceedings of the 17th International Conference on Machine Learning, pages 887–894.
Strassen, V. (1969). Gaussian elimination is not optimal. Numerische Mathematik, 13(4):354–356.
Tanaka, T., Katagiri, T., and Yuba, T. (2006). d-Spline based incremental parameter estimation in automatic performance tuning. In Proceedings of PARA'06.
Vadhiyar, S. S., Fagg, G. E., and Dongarra, J. J. (2000). Automatically tuned collective operations. In Proceedings of Supercomputing 2000, pages 3–13.
Vuduc, R., Demmel, J. W., and Bilmes, J. (2001). Statistical models for automatic performance tuning. In Proceedings of ICCS'01, LNCS, volume 2073, pages 117–126.
Whaley, R. C., Petitet, A., and Dongarra, J. J. (2001). Au-
tomated empirical optimizations of software and the
ATLAS project. Parallel Computing, 27(1-2):3–35.
Wolski, R., Spring, N. T., and Hayes, J. (1999). The network weather service: a distributed resource performance forecasting service for metacomputing. Future Generation Computer Systems, 15(5–6):757–768.