The variability of the data in the output space existing at this node, V_i, is taken as a spread around the representative (m_i), where again we consider a partial involvement of the elements in X_i by weighting the distance by the associated membership grade,
    V_i = \sum_{(x(k),\, y(k)) \in X_i \times Y_i} u_i(x(k)) \, (y(k) - m_i)^2 .    (4)
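As an illustration, a minimal Python sketch of Equation (4), assuming the node's memberships u_i(x(k)), output values y(k), and representative m_i are already available as arrays (the function name is ours):

```python
import numpy as np

def node_variability(u_i, y, m_i):
    """V_i of Equation (4): membership-weighted spread of the node's
    outputs y(k) around its representative m_i."""
    u_i = np.asarray(u_i, dtype=float)   # u_i(x(k)) for the data at this node
    y = np.asarray(y, dtype=float)       # corresponding output values y(k)
    return float(np.sum(u_i * (y - m_i) ** 2))
```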
In the next step, we select the node of the tree
(leaf) that has the highest value of V_i, and expand
the node by forming its children by applying the
clustering of the associated data set into c clusters.
The process is then repeated: we examine the leaves
of the tree and expand the one with the highest value
of the diversity criterion.
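A compact sketch of one such growth step; the leaf representation and the fuzzy_cmeans routine (standing in for any standard FCM implementation that returns a partition matrix and prototypes) are assumptions of ours:

```python
def grow_one_step(leaves, c, fuzzy_cmeans):
    """Expand the leaf with the largest variability V_i by clustering its
    data into c clusters (node layout and helper names are ours)."""
    idx = max(range(len(leaves)), key=lambda j: leaves[j]["V"])  # most diverse leaf
    worst = leaves.pop(idx)
    U, prototypes = fuzzy_cmeans(worst["X"], c)      # c x N partition matrix, prototypes
    for i in range(c):
        members = U[i] >= U.max(axis=0)              # points falling mostly into cluster i
        leaves.append({"X": worst["X"][members],     # new child leaf
                       "y": worst["y"][members],
                       "u": U[i][members]})
    return leaves
```

Each new child's representative m_i and variability V_i would then be recomputed with Equations (3) and (4) before the next iteration.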
The growth of the tree is controlled by conditions
under which the clusters can be further expanded
(split). We envision two intuitive conditions that reflect the nature of the data behind each node.
The first one is self-evident: a given node can be
expanded if it contains enough data points. When forming c clusters, we require this number to be greater than c; otherwise, the clusters cannot be formed.
The second stopping condition pertains to the
structure of data that we attempt to discover through
clustering. As we move to smaller subsets of data, the dominant structure (which is strongly visible at the level of the entire, far more numerous data set) may not manifest itself as profoundly in the subset. In general, the smaller the data set, the less pronounced its structure. This is reflected in the entries of the partition matrix, which tend to become equal to each other and equal to 1/c. When no structure is present, this uniform distribution of membership grades occurs across each column of the partition matrix.
The diversity criterion (the sum of the variabilities at the leaves) can also be viewed as another termination criterion.
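The stopping conditions can be checked as in the following sketch; the flatness tolerance and the helper name are ours:

```python
import numpy as np

def can_expand(n_points, c, U=None, flat_tol=0.05):
    """Return True if a leaf may be split further (thresholds are ours).

    Condition 1: the leaf must hold more data points than clusters.
    Condition 2: if a trial partition matrix U (c x N) is given, its entries
    must not all be close to the structureless value 1/c."""
    if n_points <= c:                                     # too few points to form c clusters
        return False
    if U is not None:
        flatness = np.mean(np.abs(U - 1.0 / U.shape[0]))  # mean deviation from 1/c
        if flatness < flat_tol:                           # memberships essentially uniform
            return False
    return True
```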
2.2 Classification (Prediction) Mode
Once the C-tree has been constructed, it can be used to classify a new input (x) or to predict the value of the associated output variable.
In the calculations, we rely on the membership grades computed for each cluster as in the standard fuzzy C-means (FCM). The calculations pertain to the leaves of the C-tree, so for trees of several levels of depth we have to traverse the tree first to reach the specific leaves. This is done by computing u_i(x) and moving down: at each level, we determine the path that maximizes u_i(x), and the process repeats level by level. The predicted value occurring at the final leaf node is equal to m_i defined in Equation (3).
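A minimal sketch of this traversal, assuming each node stores its children and its leaf representative m_i, and that memberships(node, x) returns the FCM membership grades u_i(x) of x with respect to the node's children (all names are ours):

```python
def predict(root, x, memberships):
    """Walk the C-tree from the root to a leaf and return the leaf's
    representative m_i as the prediction for input x."""
    node = root
    while node["children"]:                              # not yet a leaf
        u = memberships(node, x)                         # u_i(x) for each child
        best = max(range(len(u)), key=lambda i: u[i])    # path with the largest membership
        node = node["children"][best]
    return node["m"]                                     # m_i of Equation (3)
```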
3 DEFAULT PREDICTION
The sample data set comes from a state-owned commercial bank. It consists of 243 samples, each representing an SME. Among these enterprises, 123 are able to repay the loan, while the remaining 120 are not.
In order to evaluate the performance of the tree, fivefold cross-validation was used. More specifically, in each pass an 80–20 split of the data into training and testing sets is generated, and the experiments are repeated for five different splits of the training and testing data.
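For illustration, the five random 80-20 splits can be generated as below (pure NumPy; the seed and helper name are ours):

```python
import numpy as np

def five_splits(n_samples, seed=0):
    """Yield five independent 80-20 train/test index splits."""
    rng = np.random.default_rng(seed)
    for _ in range(5):
        perm = rng.permutation(n_samples)    # random ordering of all samples
        cut = int(0.8 * n_samples)           # 80% of the samples for training
        yield perm[:cut], perm[cut:]         # (train indices, test indices)
```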
The binary default variable is Y_i = 1 if firm i defaults, and Y_i = 0 otherwise.
Our model is an accounting-based model. In this kind of model, accounting balance sheets are used, and the input indexes comprise the enterprise's capability of repaying the loan and its willingness to repay the loan. The willingness to repay is measured by the rate of interest repayment, namely
X_0 = Amount of interest that has been repaid /
Amount of interest that should be repaid.
The capability of repaying the loan is measured by several indexes that reflect the financial situation of the enterprise, such as profitability, operating efficiency, repayment capability, and the enterprise's cash-flow situation. The ratios are as follows:
X_1 = Earnings before taxes / Average total assets
X_2 = Total liabilities / Ownership interest
X_3 = Operational cash flow / Total liabilities
X_4 = Working capital / Total assets.
Each index is the average over the three periods preceding the prediction period.
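A small sketch of how the indexes could be assembled from the averaged balance-sheet figures; all field names are ours:

```python
def default_indexes(firm):
    """Compute the input indexes X0..X4 from a firm's balance-sheet figures,
    each already averaged over the three periods before the prediction period."""
    return {
        "X0": firm["interest_repaid"] / firm["interest_due"],
        "X1": firm["earnings_before_taxes"] / firm["average_total_assets"],
        "X2": firm["total_liabilities"] / firm["ownership_interest"],
        "X3": firm["operational_cash_flow"] / firm["total_liabilities"],
        "X4": firm["working_capital"] / firm["total_assets"],
    }
```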
The chosen number of clusters was c=2, since
we were dealing with a binary classification. We
selected the first node of the tree, which is