The improvement is quite significant: if grades 4 and 5 are considered high damage grades and grades 1 and 2 low damage grades, the probability of a building receiving a high damage grade is only 8% after controlling for these four features, compared with 70.2% when the features take the opposite values, or 60.3% for a building chosen at random in the area, according to the overall damage distribution in Figure 2. Before moving to the conclusion, some terms involved in these features should be defined:
RC (Reinforced Concrete): most structures are built of timber, steel, or reinforced (including prestressed) concrete. Lightweight materials such as aluminum and plastics are also coming into more common use. Reinforced concrete is unique because two materials, reinforcing steel and concrete, are used together; thus the principles governing structural design in reinforced concrete differ in many ways from those governing design in a single material (Wang, Salmon 1979).
Non-Engineered: non-engineered buildings are
defined as those that are spontaneously and
informally constructed in various countries in a
traditional manner with little intervention by
qualified architects and engineers (Arya 1994).
Therefore, a conclusion can be drawn: to improve seismic resistance in the area, the local government should give top priority to improving the quality of construction materials, specifically, replacing mud-mortar stone with cement-mortar brick, reinforced concrete, or other stronger materials. Also, interestingly, buildings should preferably be engineered, as expected, yet remain non-engineered in some other respects, meaning that traditional methods, which may incorporate regional characteristics, should also be taken into account during construction. Once these measures are taken, under a similar situation, the risk of severe damage (grades 4 and 5) will plummet to around 1/8 of that of buildings whose features are not intentionally controlled.
3 MACHINE LEARNING:
PREDICTING BUILDING
DAMAGES FROM FEATURES
By now, some conclusions have been drawn from the data analysis, which yields insights that will be helpful in this part. Experiments and comparisons will be carried out on both datasets.
3.1 Choosing Appropriate
Classification Algorithms
Four basic mainstream classification algorithms are
considered here: Support Vector Machines (SVM),
Naïve Bayes Classifier, Decision Tree Classifier
(DT), and Random Forest Classifier (RF).
Support Vector Machines (SVM). Applying SVM requires input features that can be expressed in a coordinate system. In pattern recognition, the training data are given in the form below:
$(x_1, y_1), \ldots, (x_m, y_m) \in \mathbb{R}^n \times \{+1, -1\}$  (1)
These are n-dimensional patterns (vectors) $x_i$ and their labels $y_i$. A label with the value of +1 denotes that the vector is classified to class +1, and a label of −1 denotes that the vector is part of class −1 (Busuttil 2003).
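To make the format in Eq. (1) concrete, the sketch below builds a tiny set of labeled pairs $(x_i, y_i)$ with $y_i \in \{+1, -1\}$ and classifies by the sign of a linear decision function. The weight vector and bias here are chosen by hand purely for illustration; a real SVM would learn a maximum-margin hyperplane from the data.

```python
# Training data in the form of Eq. (1): pairs (x_i, y_i) of
# n-dimensional vectors and labels in {+1, -1}.
training_data = [
    ((1.0, 2.0), +1),
    ((2.0, 3.0), +1),
    ((-1.0, -2.0), -1),
    ((-2.0, -1.0), -1),
]

w = (1.0, 1.0)  # hypothetical weight vector (illustrative, not learned)
b = 0.0         # hypothetical bias

def classify(x):
    """Assign class +1 or -1 by the sign of the decision function w.x + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return +1 if score >= 0 else -1

# Every training pattern lies on the correct side of this hyperplane.
assert all(classify(x) == y for x, y in training_data)
```

This only illustrates the labeled-vector data format; the geometry it relies on is exactly what breaks down when features are categorical, as discussed next.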
But both V1 and V2 contain many categorical variables, such as “ground_floor_type” and “roof_type”, which have no natural representation in $\mathbb{R}^n$ and should not simply be discarded. Thus, SVM fails here.
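The sketch below illustrates why such variables have no natural place in $\mathbb{R}^n$: an arbitrary integer encoding imposes a spurious ordering and distance that a coordinate-based method like SVM would treat as meaningful, whereas one-hot encoding (the usual workaround) equalizes distances at the cost of extra dimensions. The category values below are invented for illustration, not taken from V1 or V2.

```python
# Hypothetical roof-type categories (illustrative values only).
categories = ["thatch", "tile", "rcc"]

# Arbitrary integer codes: "thatch"->0, "tile"->1, "rcc"->2.
int_code = {c: i for i, c in enumerate(categories)}

# Under this encoding |rcc - thatch| = 2 but |tile - thatch| = 1,
# implying "rcc" is twice as far from "thatch" as "tile" is -- a purely
# accidental artifact of the labeling order, not a property of the data.
assert abs(int_code["rcc"] - int_code["thatch"]) == \
       2 * abs(int_code["tile"] - int_code["thatch"])

def one_hot(c):
    """Map a category to a unit vector: each category gets its own axis."""
    return [1.0 if c == k else 0.0 for k in categories]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# One-hot encoding places every category at the same squared distance (2.0)
# from every other, removing the spurious geometry.
assert sq_dist(one_hot("rcc"), one_hot("thatch")) == 2.0
assert sq_dist(one_hot("tile"), one_hot("thatch")) == 2.0
```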
The Naïve Bayes Classifier. Naïve Bayes by nature assumes feature independence and involves calculating a product over all features:
$P(X \mid y)\, P(y) = P(X, y) = P(x_1, x_2, \ldots, x_n, y) = P(x_1 \mid x_2, \ldots, x_n, y)\, P(x_2, \ldots, x_n, y)$  (2)
Because $P(a, b) = P(a \mid b)\, P(b)$,

$P(x_1, x_2, \ldots, x_n, y) = P(x_1 \mid x_2, \ldots, x_n, y)\, P(x_2 \mid x_3, \ldots, x_n, y)\, P(x_3, \ldots, x_n, y) = P(x_1 \mid x_2, \ldots, x_n, y)\, P(x_2 \mid x_3, \ldots, x_n, y) \cdots P(x_n \mid y)\, P(y)$  (3)
Assuming that the individual $x_i$ are independent of each other is a strong assumption, which is clearly violated in most practical applications and is, therefore, naïve; hence the name. This assumption implies that $P(x_1 \mid x_2, \ldots, x_n, y) = P(x_1 \mid y)$, for example.
Thus, the joint probability of $X$ and $y$ is (Berrar 2018):

$P(X \mid y)\, P(y) = P(x_1 \mid y) \cdot P(x_2 \mid y) \cdots P(x_n \mid y)\, P(y) = \left(\prod_{i=1}^{n} P(x_i \mid y)\right) P(y)$  (4)
However, the correlation maps for V1 and V2 in Figure 6 and Figure 7 show several strong correlations between variables. Also, since there are many features, the calculation of the product might introduce a relatively large bias. Naïve Bayes will not be used here.
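For completeness, the product in Eq. (4) can be sketched as a minimal categorical classifier, with conditional probabilities estimated by counting and add-one smoothing to avoid zero factors. The toy records below (mortar type and roof type versus a high/low damage class) are invented for illustration; they are not rows of V1 or V2, and the smoothing choice is an assumption, not the paper's method.

```python
from collections import Counter, defaultdict

# Toy training set: (features, damage class). Invented values.
train = [
    (("mud_mortar", "thatch"), "high"),
    (("mud_mortar", "thatch"), "high"),
    (("mud_mortar", "tile"), "high"),
    (("cement_mortar", "tile"), "low"),
    (("cement_mortar", "rcc"), "low"),
    (("cement_mortar", "rcc"), "low"),
]

class_counts = Counter(y for _, y in train)
# feature_counts[(i, value, y)] = # of class-y examples with x_i == value
feature_counts = defaultdict(int)
for x, y in train:
    for i, v in enumerate(x):
        feature_counts[(i, v, y)] += 1

def score(x, y):
    """Eq. (4): P(y) * prod_i P(x_i | y), with add-one smoothing."""
    s = class_counts[y] / len(train)
    for i, v in enumerate(x):
        s *= (feature_counts[(i, v, y)] + 1) / (class_counts[y] + 2)
    return s

def predict(x):
    """Pick the class maximizing the naive Bayes score."""
    return max(class_counts, key=lambda y: score(x, y))

assert predict(("mud_mortar", "thatch")) == "high"
assert predict(("cement_mortar", "rcc")) == "low"
```

The per-feature factors $P(x_i \mid y)$ are estimated independently, which is exactly where the strong correlations seen in Figures 6 and 7 would bias the product.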
The Decision Tree Classifier (DT) and Random Forest Classifier (RF). Instead, training will be conducted with tree-based algorithms, namely DT and RF, with the definition below:
Random forests are a combination of tree
predictors such that each tree depends on the values