AUTOMATED STAR/GALAXY DISCRIMINATION IN

MULTISPECTRAL WIDE-FIELD IMAGES

Jorge de la Calleja

Instituto Nacional de Astrof

ısica,

Optica y Electr

onica, Santa Mar

ıa Tonantzintla, Puebla, M

exico

Olac Fuentes

Computer Science Department, University of Texas at El Paso, El Paso, Texas, U.S.A.

Keywords:

Image Analysis, Machine Learning, Imbalanced Datasets.

Abstract:

In this paper we present an automated method for classifying astronomical objects in multi-spectral wide-

ﬁeld images. The classiﬁcation method is divided into three main stages. The ﬁrst one consists of locating

and matching the astronomical objects in the multi-spectral images. In the second stage we create a compact

representation of each object applying principal component analysis to the images. In the last stage we classify

the astronomical objects using locally weighted linear regression and a novel oversampling algorithm to deal

with the unbalance that is inherent to this class of problems. Our experimental results show that our method

performs accurate classiﬁcation using small training sets and in the presence of signiﬁcant class unbalance.

1 INTRODUCTION

Currently, several multi-band astronomical surveys,

such as the Sloan Digital Sky Survey

, the Two Mi-

cron All Sky Survey

, and the Digitized Palomar Ob-

servatory Sky Survey

, are producing enormous im-

age databases that require automated tools for any

kind of analysis. Analyzing wide-ﬁeld images has

been and still is of great importance in astrophysics:

from studies of the structure and dynamics of our

Galaxy, to galaxy formation and evolution, to the

large scale structure of the Universe (Andreon et al.,

2000).

Recently, there has been a great deal of inter-

est from astronomers in applying computer vision

and pattern recognition techniques to solve astronom-

ical image analysis problems. Examples of these

works include classiﬁcation of galaxies (Bazell and

Aha, 2001; DelaCalleja and Fuentes, 2004; Lahav,

1996; Naim et al., 1995; Owens et al., 1996; Storrie-

Lombardi et al., 1992), stars (Bailer-Jones et al.,

1998), binary stars (Weaver, 2000), star/galaxy dis-

crimination (Andreon, 1999; M ¨ahonen and Hakala,

http://www.sdss.org

http://www.ipac.caltech.edu/2mass

http://dposs.ncsa.uiuc.edu

1995; Philip et al., 2002) and many others. Some

works have started using multi-spectral images, for

example, Zhang and Zhao used data from the optical,

X-ray, and infrared bands to classify active galactic

nuclei (AGN), stars, and galaxies, using learning vec-

tor quantization, support vector machines and single-

layer perceptrons (Zhang and Zhao, 2004).

In general, the ﬁrst steps in analyzing astronom-

ical images are the detection and classiﬁcation of

the objects in them. This presents signiﬁcant chal-

lenges in the case of multi-spectral images, as as-

tronomical objects have widely varying appearances

at different wavelengths. In this work we propose

a method to classify astronomical objects in multi-

spectral wide-ﬁeld images in a fully automated man-

ner. This method ﬁrst locates and matches the ob-

jects in the different multi-spectral images. Then, it

creates a new representation for each object using its

multi-spectral images. After that, it employs princi-

pal component analysis to reduce the dimensionality

of the data and to ﬁnd relevant information. Finally,

it uses locally weighted linear regression to classify

the objects. Since most astronomical problems in-

volve large degrees of class unbalance (for example in

our galaxy/star discrimination problem, galaxies out-

number stars by a factor of 10), we also introduce a

method to deal the problem of imbalanced data sets.

155

de la Calleja J. and Fuentes O. (2007).

AUTOMATED STAR/GALAXY DISCRIMINATION IN MULTISPECTRAL WIDE-FIELD IMAGES.

In Proceedings of the Second International Conference on Computer Vision Theory and Applications - IU/MTSV, pages 155-160

 SciTePress

Figure 1: An astronomical wide-ﬁeld image taken in the

infrared band.

The paper is organized as follows: in Section 2 we

give a brief introduction about multi-spectral wide-

ﬁeld images in Astronomy. In Section 3 we describe

the method for classifying the astronomical objects,

including its main three components. Next, in Sec-

tion 4, we introduce our proposed method for dealing

with imbalanced data sets. Section 5 presents exper-

imental results. Some conclusions and directions of

future research are presented in Section 6.

2 MULTI-SPECTRAL

WIDE-FIELD IMAGES

An astronomical wide-ﬁeld image (see Figure 1) nor-

mally contains from tens to thousands of objects.

These objects may be stars, galaxies, nebulas, or

quasars, among others. Multi-spectral imaging refers

to acquiring several images of the same scene using

different spectral bands. The spectral distribution of

celestial sources carries essential information about

the physical processes that take place in these ob-

jects (Nuzillard and Bijaoui, 2000). Generally, multi-

spectral images provide more information than a sin-

gle wavelength one.

The ﬁnal goal in processing multi-spectral wide-

ﬁeld images is usually to construct catalogues that

contain astrometric, geometric, morphological and

photometric parameters for each object in the image

(Andreon, 1999).

3 THE CLASSIFICATION

METHOD

The method we present to classify the astronomical

objects is divided into three main stages. The ﬁrst

one consists of locating and matching the astronom-

ical objects among the multi-spectral images. In the

second stage we create a new representation for each

object using its multi-spectral images, and also we

ﬁnd a set of features using principal component analy-

sis. Finally, in the last task we classify the astronomi-

cal objects using locally weighted linear regression in

combination with an oversampling algorithm to deal

with the class unbalance inherent to this problem. The

following subsections describe in detail these tasks.

3.1 Location and Matching of the

Astronomical Objects

First, for each multi-spectral image I we separate the

objects from background applying a threshold as fol-

lows

S(i, j) =



I(i, j), if I(i, j) ≥ threshold;

0, otherwise.

(1)

where S is the segmented image, and (i, j) repre-

sent a position in the image.

Then, we locate the astronomical objects in the

segmented images. We can locate an object by the

coordinates (x

) of the bounding rectangle

that encloses it. Our algorithm for locating objects

is based on the ﬂood-ﬁll algorithm (Hearn and Baker,

1997). The purpose of ﬂood ﬁll is to color an entire

area of connected pixels with the same color.

We assume that all the pixels that appear different

from the background correspond to astronomical ob-

jects. Thus, our algorithm proceeds as follows: We

examine every location in the image, then when we

ﬁnd a pixel different from the background, we search

its neighbors, i.e. we search its neighbors in the right,

left, up and down directions, and mark these pixels.

We store the positions of these neighbors to examine

again their neighbors. When we ﬁnd a left neighbor

and its position x

is smaller than the previous one,

we will change it. We do the same for y

, x

and

, but considering up, right and down, respectively.

This process is repeated until the image is fully ex-

amined. In the end we will have located each object

in each of the multi-spectral images by its coordinates

). In Table 1 we outline the algorithm to

locate objects, and in Figure 2 we show an example

of location of objects.

We select some of the located objects according to

their size, i.e. if the objects are larger than a threshold

we will select them. We do this selection process be-

cause many objects are very small, two or three pixels

of size.

Figure 2: The located objects are marked by a rectangle.

Table 1: The algorithm to locate the astronomical objects.

S is the segmented image, with r rows and c columns

N is the set of neighbors, initially empty

For i = 1 to r

For j = 1 to c

If S(i, j) 6= background and is not marked

- x

= j

- y

= i

- x

= j

- y

= i

- N = (i, j)

while N 6= ⊘ do:

- Find the right, left, up

and down neighbors of (i, j)

- Add to N the neighbors of (i, j)

- if it is necessary

Modify (x

)

- Mark S(i, j) and its neighbors

- Erase marked neighbors in N

endwhile

endfor

Most of the time a single object does not appear

in exactly the same location in the different multi-

spectral images. This can be due to slight misalign-

ments in the imaging system, tracking errors in the

telescope, or actual differences in appearance of the

object at different wavelengths. Also, these objects

may appear in different sizes or they may not even ap-

pear at all at some wavelengths. Therefore, we have

to devise an algorithm to robustly identify the same

object in the different multi-spectral images.

The idea of our algorithm to match objects is the

following: We search the largest object among the

multi-spectral images. Then, we use its coordinates

) to ﬁnd the same object, but in the other

images. Almost always the object may not appear ex-

actly in the same position, then we use the Euclidean

distance to match the closest object to our interest

point. Sometimes we may not match any object be-

(a)(b)

Figure 3: Column (a) shows an object in the different multi-

spectral images. We can see that this one is not located in

the same position. Column (b) shows the same object after

matching.

cause in the images may appear only black pixels; in

these cases we assign as match point the location of

the largest object. This process is repeated until all

the objects have been matched. Because we match

an object in ﬁve multi-spectral images, we will have

ﬁve representations of the same object. Table 2 sum-

marizes our algorithm to match objects and Figure 3

shows an example of matching.

Table 2: The algorithm to match the astronomical objects.

Γ is the set of multi-spectral images

while Γ contains objects to match do:

- Find G, the largest object in Γ

- Obtain p

, the coordinate where G was found

- Obtain r

, the region that encloses G

- Obtain Γ

, the image where G was found

- Q = Γ − Γ

- Search G in Q

- Let q = (x

) be the coordinates

of the objects in Q

- Find the closest objects to p

with q using the Euclidean distance

- ∆ = the matched objects

- Erase the matched objects in Γ

endwhile

3.2 Extracting Features

Once we have located and matched the astronomical

objects in the multi-spectral images, we crop each ob-

ject in the original image set to create our data set.

g u i z r

Vectors

ofsize m nx

mxn

Objectinthefivebands

Newrepresentationofanobject

Figure 4: Process to create a new object representation.

Again, we use the coordinates (x

) of each

object to crop it.

Each object has ﬁve representations, according to

the different bands u, g, r, i and z. Each m× n image at

each wavelength is converted into a vector of length

mn, then, these vectors are concatenated to create a

matrix of size of m × n × 5, i.e. the size of the vec-

tor by the number of multi-spectral representations.

In ﬁgure 4 we show the process of creating this new

representation.

Because we have a large data set, we use princi-

pal component analysis (PCA) to reduce its dimen-

sionality and also to ﬁnd features (principal compo-

nents) that permit us to classify the astronomical ob-

jects. Details about PCA can be found in (Turk and

Pentland, 1991).

3.3 Classifying Objects

Finally, we use locally weighted linear regression

(LWLR) to classify the astronomical objects. This

machine learning algorithm takes as input parameters

the projection of the new representation of the ob-

jects onto a few set of principal components. Next,

we brieﬂy describe this method.

3.3.1 Locally Weighted Linear Regression

Locally-weighted regression belongs to the family of

instance-based learning methods. These kinds of al-

gorithms simply store all training examples, and when

they have to classify new instances, they ﬁnd similar

examples to them. In this work we use a linear model

around the query point to approximate the target func-

tion.

Given a query point x

, to predict its output pa-

rameters y

, we assign to each example in the training

set a weight given by the inverse of the distance from

the training point to the query point:

− x

(2)

Let W, the weight matrix, be a diagonal matrix

with entries w

,. ..,w

. Let X be a matrix whose rows

are the vectors x

,. ..,x

, the input parameters of the

examples in the training set, with the addition of a ”1”

in the last column. Let Y be a matrix whose rows are

the vectors y

,. ..,y

, the output parameters of the ex-

amples in the training set. Then the weighted training

data are given by Z = WX and the weighted target

function is V = WY. Then we use the estimator for

the target function deﬁned as:

= x

∗

V (3)

where Z

∗

is the pseudoinverse of Z.

4 THE METHOD TO DEAL

IMBALANCED DATA SETS

The class imbalance problem occurs when there are

many more examples of some classes than others,

such as in our case, where we have more examples

of stars than galaxies. Two main approaches have

been used in machine learning to deal with class im-

balance. The ﬁrst consists of assigning distinct costs

to training examples, weighting more heavily those in

the minority class (Pazzani et al., 1994; Domingos,

1999). The second approach is to re-sample the origi-

nal dataset, either by over-sampling the minority class

and/or under-sampling the majority class (Japkowicz,

2000; Kubat and Matwin, 1997; Chawla et al., 2002).

Our method follows the second approach, and is

similar to the SMOTE (Chawla et al., 2002) algo-

rithm, i.e. the minority class is over-sampled by tak-

ing each minority class sample and adding synthetic

examples in the original data. The purpose of aug-

manting the minority class is to create a balanced data

set, which helps to improve results in the classiﬁca-

tion task.

Our method generates the synthetic examples as

follows: First we separate positive and negative ex-

amples from the original data set, then we ﬁnd the n

Table 3: The algorithm to create new examples.

D is the original data set

N is the set of negative examples

P = D− N

for each example E in P

- Find the n closest examples to E

using the weighted distance

- Obtain A, the average of n

- δ = E - A

- η = E + δ ∗ σ(0,1)

Add η to D

endfor

closest examples to each positive (minority) example

using the weighted distance and taking into account

only the positive data set. Then we average these n

closest instances, take the difference between each

minority example (under consideration) and the av-

erage instance, multiply this difference by a random

number between 0 and 1, and add it to the original

data set. Table 4 outlines our oversampling algorithm.

5 EXPERIMENTAL RESULTS

We tested our method using images taken in ﬁve

wavelengths: u, g, r, i and z, i.e. one in ultraviolet,

one using the green ﬁlter, one using the red ﬁlter, and

two infrared, respectively. These images are of size of

1489×2048 pixels and were obtained from the Sloan

Digital Sky Survey. We used two data sets, the ﬁrst

one (DB1) contained 62 stars and 7 galaxies, and the

the second one (DB2) had 141 stars and 28 galaxies.

These objects were labeled by hand, i.e. we examine

each image and assign a label to each object.

We used 3 principal components that represent

about 90% of the original information in the data sets.

We implemented locally weighted linear regression

and the over-sampling method in Matlab

In all the experiments reported here we used 10-

fold cross-validation. Also, we vary the amount for

over-sampling from 0% to 1000%. The results we

show later correspond to the average of ﬁve runs.

We evaluated our method using two metrics: pre-

cision and recall. This metrics can be deﬁned as fol-

lows:

Recall = TP/(TP+ FN) (4)

Precision = TP/(TP+FP) (5)

Where TP denotes the number of positive exam-

ples that are classiﬁed correctly, while FN and FP

Table 4: The table below show the results for the ﬁrst data

set (DB1) using different amount for over-sampling.

0% 100% 200% 400% 1000%

Recall .343 .400 .343 .485 .542

Precision .486 .530 .500 .474 .607

Table 5: The table below show the results for the second

data set (DB2) using different amount for over-sampling.

0% 100% 200% 400% 1000%

Recall .692 .757 .750 .742 .742

Precision .740 .789 .774 .771 .667

denote the number of misclassiﬁed positive and nega-

tive examples, respectively.

In Table 4 we show the results for the ﬁrst data

set (DB1). We can observe that the best results are

obtained when we use 1000% for over-sampling, i.e.

.542 and .607 for recall and precision, respectively.

In Table 5 we show the results for the second data set

(DB2). Here we can see that better results are ob-

tained than using DB1. This may be due to the fact

that we have more examples to train our classiﬁer.

Also, the best results were obtained using only 100%

for over-sampling, with .757 for recall and .789 for

precision.

6 CONCLUSION

We have presented a method for classifying astro-

nomical objects in multi-spectral wide-ﬁeld images

in a fully automated manner. Also, we introduced a

method to deal with imbalanced data sets that permits

to improve classiﬁcation results. Our results are com-

parable with the best reported in the literature, but we

are using signiﬁcantly smaller training sets, thus bet-

ter results should be expected when we experiment

with larger data sets. Current and future work in-

cludes: testing the method for classifying more types

of astronomical objects, using larger data sets, and

testing the oversampling method on standardized data

sets.

ACKNOWLEDGEMENTS

The ﬁrst author wants to thank CONACyT for sup-

porting this research under grant 166596.

REFERENCES

Andreon, S. (1999). Neural nets and star/galaxy separation

in wide ﬁeld astronomical images. In Proceedings of

International Joint Conference on Neural Networks.

Andreon, S., Gargiulo, G., Longo, G., Tagliaferri, R., and

Campuano, N. (2000). Wide ﬁeld imaging - i. ap-

plications of neural networks to object detection and

star/galaxy classiﬁcation. Monthly Notices of the

Royal Astronomical Society, 319:700–716.

Bailer-Jones, C., C.A.L., Irwin, M., and von Hippel, T.

(1998). Automated classiﬁcation of stellar spectra. ii:

Two-dimensional classiﬁcation with neural networks

and principal components analysis. Monthly Notices

of the Royal Astronomical Society, 298:361.

Bazell, D. and Aha, D. (2001). Ensembles of classiﬁers for

morphological galaxy classiﬁcation. The Astrophysi-

cal Journal, 548:219–233.

Chawla, N., Bowyer, K., Hall, L., and Kegelmeyer, P.

(2002). Smote: Synthetic minority over-sampling

technique. Journal of Artiﬁcial Intelligence Research,

16:321–357.

DelaCalleja, J. and Fuentes, O. (2004). Machine learning

and image analysis for morphological galaxy classi-

ﬁcation. Monthly Notices of the Royal Astronomical

Society, 349:87–93.

Domingos, P. (1999). Metacost: A general method for mak-

ing classiﬁers cost-sensitive. In Proceedings of the

5th International Conference on Knowledge Discov-

ery and Data Mining, pages 155–164.

Hearn and Baker (1997). Computer Graphics. Prentice

Hall, 2nd. edition.

Japkowicz, N. (2000). The class imbalance problem: Sig-

niﬁcance and strategies. In Proceedings of the 2000

International Conference on Artiﬁcial Intelligence

(IC-AI’2000): Special Track on Inductive Learning,

pages 111–117.

Kubat, M. and Matwin, S. (1997). Addressing the curse of

imbalanced training sets: One sided selection. In Pro-

ceedings of the Fourteenth International Conference

on Machine Learning, pages 179–186.

Lahav, O. (1996). Artiﬁcial neural networks as a tool for

galaxy classiﬁcation. In Proceedings in Data Analysis

in Astronomy.

M¨ahonen, P. and Hakala, P. (1995). Automated source clas-

siﬁcation using a kohonen network. The Astrophysical

Journal, 452:L77–L80.

Naim, A., Lahav, O., Sodr

e, L., Jr, and Storrie-Lombardi,

M. (1995). Automated morphological classiﬁcation of

apm galaxies by supervised artiﬁcial neural networks.

Monthly Notices of the Royal Astronomical Society,

275:567.

Nuzillard, D. and Bijaoui, A. (2000). Blind source sepa-

ration and analysis of multispectral astronomical im-

ages. Astronomy and Astrophysics, 147:129.

Owens, E., Grifﬁths, R., and Ratnatunga, K. (1996). Us-

ing oblique decision trees for the morphological clas-

siﬁcation of galaxies. Monthly Notices of the Royal

Astronomical Society, 281:153.

Pazzani, M., Merz, C., Murphy, P., Ali, K., Hume, T., and

Brunk, C. (1994). Reducing misclassiﬁcation costs. In

Proceedings of the Eleventh International Conference

on Machine Learning, pages 217–225.

Philip, N., Wadadekar, Y., Kembhavi, A., and Joseph, K.

(2002). A difference boosting neural network for au-

tomated star-galaxy classiﬁcation. Astronomy and As-

trophysics, 385:1119–1126.

Storrie-Lombardi, M., Lahav, O., Sodr

e, L., Jr, and Storrie-

Lombardie, L. (1992). Morphological classiﬁcation

of galaxies by artiﬁcial neural networks. Monthly No-

tices of the Royal Astronomical Society, 259:8.

Turk, M. and Pentland, A. (1991). Eigenfaces for recogni-

tion. Journal of Cognitive Neuroscience, 3(1):71–86.

Weaver, B. (2000). Spectral classiﬁcation of unresolved bi-

nary stars with artiﬁcial neural networks. The Astro-

physical Journal, 541:298–305.

Zhang, Y. and Zhao, Y. (2004). Automated clustering algo-

rithms for classiﬁcation of astronomical objects. The

Astrophysical Journal, 422:1113–1121.