Furthermore, the models can meet stock market
analysis requirements of differing complexity and
scale. Specifically, for the more complex data
from BSE, NSE and BTC-INR, the SMAPE
values of GLM are 0.06754, 0.06497 and 0.04921,
which indicate more reliable predictions that are
closer to the actual data. Investors can therefore
obtain more valuable information, maximize revenue
and stand out in fierce market competition.
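As a rough illustration of how a SMAPE value is obtained, the metric can be sketched as below. Several SMAPE variants exist and the text does not state which one was used, so this sketch assumes the variant that divides each absolute error by the mean of the absolute actual and predicted values (range 0 to 2). The function name and the sample prices are hypothetical.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error (assumed variant, range 0..2)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, f in zip(actual, predicted):
        denominator = (abs(a) + abs(f)) / 2.0
        if denominator == 0.0:
            # Both values are zero: the prediction is exact, so add no error.
            continue
        total += abs(f - a) / denominator
    return total / len(actual)


# Hypothetical closing prices and model predictions.
actual_prices = [100.0, 102.0, 101.0]
predicted_prices = [101.0, 101.0, 103.0]
print(smape(actual_prices, predicted_prices))
```

A value close to 0 means the predictions track the actual series closely, which is how the GLM figures above are read.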
Based on the study by Seif et al. (Seif, Ramzy
Hamed and Abdel, 2018), the first bar chart compares
the accuracy of two models with sentiment analysis
using the Logistic Regression, SVM, and Random Forest
algorithms. The 'Data Mining tool with Sentiment
Analysis' shows slightly higher accuracy rates
(72.13%, 71.42%, and 77.88%, respectively)
than the 'Proposed Model with Sentiment
Analysis' (71.8%, 70.32%, and 75.8%, respectively).
The second chart contrasts the accuracy of the same
models without sentiment analysis. Here, the 'Data
Mining tool' has higher accuracies for Logistic
Regression and SVM (71.56%, 70.63%) but a lower
one for Random Forest (72.87%) compared to the
'Proposed Model' (70.68%, 69.52%, and 75.95%,
respectively). In five of the six cases, accuracy is
higher with sentiment analysis than without it, which
demonstrates the importance of sentiment analysis in
the domain of stock price prediction.
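The comparison between the two charts can be checked with a few lines of arithmetic. The sketch below only uses the accuracy figures quoted above; the variable and function names are hypothetical.

```python
ALGORITHMS = ["Logistic Regression", "SVM", "Random Forest"]

# Accuracy values (%) as quoted in the text for each model.
with_sentiment = {
    "Data Mining tool": [72.13, 71.42, 77.88],
    "Proposed Model": [71.8, 70.32, 75.8],
}
without_sentiment = {
    "Data Mining tool": [71.56, 70.63, 72.87],
    "Proposed Model": [70.68, 69.52, 75.95],
}


def sentiment_gains(with_acc, without_acc):
    """Return {model: [gain per algorithm]} in percentage points."""
    return {
        model: [
            round(w - wo, 2)
            for w, wo in zip(with_acc[model], without_acc[model])
        ]
        for model in with_acc
    }


gains = sentiment_gains(with_sentiment, without_sentiment)
for model, deltas in gains.items():
    for algorithm, delta in zip(ALGORITHMS, deltas):
        print(f"{model:16s} {algorithm:20s} {delta:+.2f}")
```

A positive value means that the configuration with sentiment analysis was more accurate for that model and algorithm.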
With the continuing development of artificial
intelligence and machine learning, a more automated
and intelligent form of stock market analysis can be
foreseen. By using Apache Spark to analyse
live streaming data and combining the
Nowcasting technique with advanced machine
learning algorithms (Das, 2024; Qiu, 2024),
developers can build stronger and more accurate
prediction models, which provide deeper market
insight and more personalized investment advice.
Moreover, the continuing evolution of distributed
computing frameworks such as Apache Spark will give
more support for processing larger and
more complex datasets.
At the same time, development comes with
limitations and challenges, such as interpretability
and data skew. Interpretability is one of the
important indicators for evaluating a machine learning
model: it makes the model more reliable and
explains how the model works. Adding the Shapley Value
(SHAP) is therefore a good way to improve
interpretability (Jia, 2019; Sundararajan, 2020).
The Shapley Value is an explanatory method based on
game theory, which is used to measure the contribution
of each feature to the model's results. It
evaluates the importance of a feature by calculating
its average marginal contribution across all possible
feature subsets, which gives the results the
properties of uniqueness, local accuracy and consistency.
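The subset-averaging idea described above can be sketched as an exact Shapley computation for a very small model. This is a minimal illustration, not the SHAP library: it assumes that an "absent" feature is replaced by a baseline value, and the toy model, feature names and input values are all hypothetical. Exact enumeration is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of each feature of x relative to a baseline input."""
    n = len(x)

    def value(subset):
        # Features in `subset` keep their real value; the rest use the baseline.
        masked = [x[j] if j in subset else baseline[j] for j in range(n)]
        return model(masked)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Shapley weight for subsets of this size: |S|! (n-|S|-1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        contributions.append(phi)
    return contributions


def toy_model(features):
    # Hypothetical model: two additive terms plus one interaction term.
    price_change, volume, sentiment = features
    return 2.0 * price_change + 0.5 * volume + 3.0 * sentiment * price_change


x = [1.0, 2.0, 0.5]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi, sum(phi), toy_model(x) - toy_model(baseline))
```

The local accuracy property shows up in the last two printed numbers: the contributions sum to the difference between the model output for the input and for the baseline.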
Furthermore, data skew is a common problem in
distributed computing environments. In stock analysis,
if some stocks are traded much more than others, data
skew is likely to occur. Preprocessing the data and
adjusting the parallelism are two possible ways to
solve this problem. Preprocessing can filter
out some of the keys that skew the data. For example,
if there are many useless null values, they can be
filtered out before the shuffle. Adjusting the
parallelism means increasing the parallelism of the
shuffle, which increases the number of reduce tasks
and thereby relieves the data skew. The future
prospects of Apache Spark in stock analysis are
bright. Tools in Apache Spark such as MLlib, Spark
SQL and GraphX enable it to handle more complex
data in stock market analysis, so Apache Spark's
potential should be put to efficient use in stock
market analysis.
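The two data skew remedies described above can be illustrated without a cluster by simulating hash partitioning in plain Python. This is only a sketch under stated assumptions: the ticker names, record counts and partition counts are hypothetical, and CRC32 stands in for the partitioner's hash function. In Spark itself, the corresponding steps would be a filter on the key column and a larger shuffle partition count (for example via the `spark.sql.shuffle.partitions` setting).

```python
from collections import Counter
from zlib import crc32


def partition_sizes(keys, num_partitions):
    """Count records per partition under plain hash partitioning."""
    counts = Counter(
        crc32(str(key).encode("utf-8")) % num_partitions for key in keys
    )
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical shuffle keys: 500 records with a missing ticker (None)
# plus 40 tickers with 10 records each.
keys = [None] * 500 + [f"TICK{i:02d}" for i in range(40) for _ in range(10)]

# Baseline: every None key hashes to the same partition.
skewed = partition_sizes(keys, 4)

# Remedy 1: filter out the useless null keys before the shuffle.
clean_keys = [k for k in keys if k is not None]
filtered = partition_sizes(clean_keys, 4)

# Remedy 2: raise the shuffle parallelism (more reduce tasks).
wider = partition_sizes(clean_keys, 16)

for label, sizes in [("skewed", skewed), ("filtered", filtered), ("wider", wider)]:
    print(f"{label:9s} partitions={len(sizes):2d} largest={max(sizes)}")
```

Raising the partition count spreads distinct keys over more reduce tasks, but all records of a single very frequent key still land in one partition, so it helps most when the skew comes from many keys colliding rather than from one dominant key.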
4 CONCLUSIONS
In this work, in order to improve accuracy in stock
market analysis, two approaches based on Apache Spark
and machine learning are proposed. Nowcasting
financial time series with streaming data analytics
under Apache Spark, and combining sentiment analysis
with machine learning models, are two effective ways
to enhance reliability in stock market analysis.
Nowcasting technology offsets the inadequacy of Apache
Spark in processing real-time data, and by using the
strong streaming data analysis ability, larger volumes
of data can be handled easily and rapidly. Sentiment
analysis helps the machine learning model recognize
emotional tendencies, which provides a greater
advantage in tracking the market by taking into
account real-time events that may cause volatility in
the stock market. In future work, the
performance of Apache Spark needs to be optimized by
researching interpretability and data skew and
addressing these problems, which could otherwise be
barriers to the further development of the model.
REFERENCES
Chicco, D., Ferraro Petrillo, U., & Cattaneo, G. 2023. Ten
quick tips for bioinformatics analyses using an Apache
Spark distributed computing environment. PLOS
Computational Biology, 19(7), e1011272.
Das, N., Sadhukhan, B., Chatterjee, R., & Chakrabarti, S.
2024. Integrating sentiment analysis with graph neural