parameter is set to True, ensuring the regression variables are normalized before fitting. Additionally, 'copy_X' is set to True, so the model works on a copy of the input data. Lastly, the 'n_jobs' parameter is set to -1, allowing all available CPUs to be used for efficient computation.
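As an illustration, this configuration can be written roughly as follows (a minimal sketch, assuming a scikit-learn release earlier than 1.2 in which LinearRegression still accepts the 'normalize' argument; in later versions the same effect is obtained by standardizing the features, e.g. with a StandardScaler, before fitting; the training-data names are placeholders):

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression(
    normalize=True,  # normalize regressors before fitting (removed in scikit-learn >= 1.2)
    copy_X=True,     # work on a copy of the input data
    n_jobs=-1,       # use all available CPU cores
)
# lin_reg.fit(X_train, y_train); y_pred = lin_reg.predict(X_test)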
2.2.3 Decision Tree Regression
Decision tree regression is a machine learning
algorithm used to address regression problems. In this
study, decision tree regression is employed to predict
house prices. The tree is grown by recursively partitioning the feature space: at each node, the best feature and split point are chosen to divide the data into subsets, until a stopping condition is reached. Each leaf node
stores a numerical value, representing the continuous
output prediction. This study employed an exhaustive
grid search to optimize decision tree regression. The
purpose was to identify the optimal parameter
configurations. Exhaustive grid search is a technique that evaluates every combination within specified parameter ranges. It is widely used in machine learning for hyperparameter tuning, and its main advantage is that it is guaranteed to find the best configuration within the searched grid. The parameter search ranges for decision tree regression were defined as follows:
The 'max_depth' parameter is searched over the range 10 to 17, 'min_samples_split' over the values [35, 40, 45, 50], and 'min_impurity_decrease' over the values [0, 0.0005, 0.001, 0.002, 0.003, 0.005, 0.006, 0.007]. The 'max_depth' setting limits the maximum depth of the decision tree, balancing model complexity against generalization. 'min_samples_split' is the minimum number of samples a node must contain before it can be split; larger values restrict tree growth and mitigate overfitting. Similarly, 'min_impurity_decrease' sets the minimum impurity reduction a split must achieve to be accepted, which likewise controls tree growth and limits overfitting. The remaining
parameters adhere to the default settings for decision
tree regression in scikit-learn.
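The grid search described above might be implemented roughly as follows (a sketch; the cross-validation split, scoring metric, and training-data names are assumptions, not taken from this study):

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": list(range(10, 18)),   # 10 to 17 inclusive
    "min_samples_split": [35, 40, 45, 50],
    "min_impurity_decrease": [0, 0.0005, 0.001, 0.002, 0.003, 0.005, 0.006, 0.007],
}

search = GridSearchCV(
    DecisionTreeRegressor(),            # remaining parameters at scikit-learn defaults
    param_grid,
    cv=5,                               # assumed cross-validation setting
    scoring="neg_mean_squared_error",   # assumed scoring metric
    n_jobs=-1,
)
# search.fit(X_train, y_train); best_tree = search.best_estimator_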
2.2.4 XGBoost
XGBoost is a powerful gradient boosting algorithm
that employs decision trees as base learners. It
iteratively trains weak learners, with each new tree fitted to the errors of the current ensemble, gradually improving overall model accuracy.
XGBoost uses CART trees, defines an objective
function incorporating regularization terms to
measure model performance, and fits new tree models
through gradient boosting. Regularization is
employed to prevent overfitting. Ultimately, by
summing the predictions of all trees, the final
prediction of the XGBoost model is obtained.
XGBoost is renowned for its efficiency, fast training
speed, robust performance, and handling of missing
values. In this study, the XGBoost model is
configured with the following parameters: 'n_estimators' is set to 300, the number of base learners; 'learning_rate' is set to 0.1, the shrinkage applied to each base learner's contribution; and 'max_depth' is set to 7, the maximum depth of each base learner.
All remaining parameters in the XGBoost regressor
retain their default values.
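This configuration corresponds roughly to the following sketch (the training-data names are placeholders, not from this study):

from xgboost import XGBRegressor

xgb_model = XGBRegressor(
    n_estimators=300,   # number of boosted trees (base learners)
    learning_rate=0.1,  # shrinkage applied to each tree's contribution
    max_depth=7,        # maximum depth of each base learner
)
# xgb_model.fit(X_train, y_train); y_pred = xgb_model.predict(X_test)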
2.2.5 Random Forest
Random Forest is a powerful ensemble learning method that makes predictions using multiple decision trees. Each tree is trained on a randomly drawn sample of the data and a random subset of the features. By combining the predictions of many trees, Random Forest achieves high accuracy and robustness and effectively reduces overfitting. In this study, the
Random Forest model is configured with the
following parameters: 'n_estimators' is set to 300, the number of base learners (decision trees) in the ensemble; 'criterion' is set to 'mse', so splits are evaluated by mean squared error; 'max_depth' is set to 6, the maximum depth of each base learner; 'min_samples_split' is set to 0.1, which, as a fraction, is the minimum proportion of samples an internal node must contain before it can be split; and 'min_impurity_decrease' is set to 0.01, the minimum impurity reduction required for a split. All
remaining parameters adhere to the default values in
the scikit-learn Random Forest implementation.
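A rough sketch of this configuration is given below (it assumes an older scikit-learn release in which criterion='mse' is still accepted; newer versions name the same criterion 'squared_error'; the training-data names are placeholders):

from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(
    n_estimators=300,            # number of decision trees in the ensemble
    criterion="mse",             # mean squared error as the split criterion
    max_depth=6,                 # maximum depth of each base learner
    min_samples_split=0.1,       # as a fraction of the training samples
    min_impurity_decrease=0.01,  # minimum impurity reduction required for a split
)
# rf_model.fit(X_train, y_train); y_pred = rf_model.predict(X_test)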
2.2.6 Optimization
This study utilized Bayesian optimization to fine-tune
hyperparameters for XGBoost and Random Forest
models. Bayesian optimization is an iterative global optimization method that aims to locate the optimum in as few evaluations as possible. It combines prior
knowledge and observed results to estimate the
posterior distribution of the objective function. In
each step, it selects the next sampling point to