traditional methods against advanced ensemble
techniques.
In this research, "New York Housing Market"
dataset from Kaggle is used, which offers a
comprehensive and realistic compilation of data
pertaining to the New York real estate sector. The sale
price of properties designated as the primary target
feature for prediction and a range of independent
variables including house type, location and area,
among others are used to predict the sale price. Such
a detailed and multifaceted dataset allows us to
construct a nuanced model capable of capturing the
complexities of New York's housing market.
The structure of the paper is organized as follows:
Section 2 provides a review of related work, focusing
on methodologies used for predicting housing prices.
Section 3 details the methodologies selected for this
study. In Section 4, the paper analyzes experimental
results, presenting findings and their implications.
Section 5 offers a conclusion, summarizing the
study’s contributions and outlining directions for
future research. References to all cited sources are
included at the end of the document.
2 RELATED WORK
Housing prices are influenced by a complex interplay
of factors, including but not limited to the type of
house, its location, and size. Given the unique
dynamics of New York City's real estate market, a
thorough consideration of these variables is crucial
for enhancing the accuracy and depth of research in
this domain. Historically, the field of housing price
prediction has explored a broad spectrum of
methodologies, ranging from Hedonic Pricing
Models (HPM) to advanced machine learning
methods including LR, SVM, RF, and Gradient
Boosting Machines(GBM). This study employs
machine learning methods to identify the most
effective approaches for modeling the intricacies of
New York City's housing market
Central to the discourse on housing valuation is
the HPM, which systematically accounts for both the
internal characteristics of properties and the external
SVM economic factors influencing their value. This
approach has been notably applied by researchers like
Goodman, and Hallvorsen and Pollakowski,
highlighting its utility in dissecting the multifaceted
nature of real estate valuation (Goodman 1978 &
Halvorsen and Pollakowski, 1981). Despite its
widespread use, the HPM has faced criticism,
particularly concerning its assumptions of linearity
and the challenges posed by multicollinearity among
variables. These critiques underscore the model's
limitations in capturing the nonlinear dynamics and
interdependencies inherent in the housing market,
prompting a shift towards more flexible and robust
machine learning techniques in recent studies.
In response to the limitations identified in HPM,
researchers turned to Machine Learning Methods
(MLMs) for more sophisticated analyses. Ho
employed three distinct MLMs— SVM, RF, and
GBM—to analyze approximately 40,000 housing
transactions over 18 years in Hong Kong (Ho et al.,
2021). Their findings indicated superior performance
of RF and GBM over SVM, as evidenced by lower
scores in mean squared error (MSE), root mean
squared error (RMSE), and mean absolute percentage
error (MAPE).
A prevalent strategy among researchers in this
domain involves the creation of ensemble models,
which combine multiple machine learning algorithms
to improve predictive accuracy. For instance, Quang
Truong developed an ensemble model by integrating
Lasso and XGB (Truong et al., 2020), whereas Ali
Soltani constructed an ensemble from RF and
Gradient-Boosted Trees (Ali et al., 2021). Both
studies reported enhanced predictive performance
with these ensemble models, underscoring the
effectiveness of this approach in housing price
prediction.
In light of the comprehensive review of data
science applications in the realm of housing price
prediction, the study decidedly leans towards the
adoption of machine learning models. This choice is
informed by the inherent limitations of HPM,
particularly their assumption of linearity, which is
found problematic. Simple regression techniques,
while foundational, fall short in capturing the
complexity of the housing market's dynamics in this
context. Consequently, ensemble learning stands out
as a critical methodological approach in the
investigation, notable for its ability to unravel feature
importance. This aspect of ensemble learning not
only enhances the predictive performance of the
models but also aligns with the key objectives of the
paper, providing a deeper understanding of the
variables that significantly impact housing prices.
3 METHOD
The initial step in this study involves conducting an
overview of the dataset to understand its