ciency. The core of the Hadoop architecture consists
of the distributed file system (HDFS), the distributed
computing framework (MapReduce), and the resource
scheduling framework (YARN).
HBase is a distributed, column-oriented database.
HBase itself does not store files directly; its data
is persisted by HDFS under the Hadoop framework.
The core design goal of HBase is to provide random,
real-time read/write access to data stored in HDFS.
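
To make the idea of random read/write access by row key more concrete, the following is a toy in-memory sketch of a column-family table. It is not the HBase client API and stores nothing in HDFS; the class name `ToyColumnTable`, the column families, and the row keys are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class ToyColumnTable:
    """In-memory stand-in for a column-family table (NOT the HBase API)."""

    def __init__(self, column_families):
        self.column_families = set(column_families)
        # row key -> {"family:qualifier": value}
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Write one cell, addressed by row key and 'family:qualifier'."""
        family = column.split(":", 1)[0]
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        """Read one cell, or the whole row when no column is given."""
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)


# Hypothetical row key and columns for a company's yearly report.
table = ToyColumnTable(["report", "audit"])
table.put("company-001#2020", "report:revenue", 1250.0)
table.put("company-001#2020", "audit:risk", "high")
table.put("company-001#2020", "report:revenue", 1300.0)  # update by row key
revenue = table.get("company-001#2020", "report:revenue")
```

Each read or write touches a single row identified by its key, which is the access pattern the paragraph above attributes to HBase.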
2.2 Data Mining Technology
Data mining is a method for processing big data that
aims to extract previously unknown but potentially
useful information and knowledge from large volumes
of incomplete, noisy, fuzzy, and random real-world
data (Wang, 2021). Building the data mining model is
the core of the whole data mining effort, and the
model corresponds to the chosen data analysis method.
Given the functional requirements of the financial
fraud audit system studied in this paper, the model
is built to identify financial fraud and assess its
risk, that is, to classify sample data into fraud
samples and non-fraud samples. This is a standard
binary classification problem, so single classifiers
can be used to solve it.
Previous research has found that many indicators
affect financial fraud. Given the application
environment and sample size requirements of this
system, the Relief method is selected as a
representative filter-type feature selection method.
It scores each feature according to its relevance to
the class label and builds the optimal feature subset
based on these scores, so as to improve the accuracy
of subsequent data mining results.
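
The scoring idea can be sketched in a few lines of plain Python. This is a minimal sketch of one common two-class formulation of Relief (nearest hit and nearest miss with min-max scaled features), not the code shown in Figure 1; the function name `relief_scores`, its parameters, and the tiny data set are assumptions made for illustration. It requires at least two samples per class.

```python
import random


def relief_scores(X, y, n_iterations=None, seed=0):
    """Score features with a basic two-class Relief procedure.

    X: list of rows (lists of numeric features); y: list of 0/1 labels.
    Returns one weight per feature; a larger weight means more relevant.
    """
    n_samples, n_features = len(X), len(X[0])
    m = n_iterations or n_samples

    # Min-max scale each feature so per-feature differences lie in [0, 1].
    lows = [min(row[j] for row in X) for j in range(n_features)]
    highs = [max(row[j] for row in X) for j in range(n_features)]
    spans = [(hi - lo) or 1.0 for lo, hi in zip(lows, highs)]
    Z = [[(row[j] - lows[j]) / spans[j] for j in range(n_features)]
         for row in X]

    def distance(a, b):
        return sum(abs(p - q) for p, q in zip(a, b))

    rng = random.Random(seed)
    weights = [0.0] * n_features
    for _ in range(m):
        i = rng.randrange(n_samples)
        # Nearest neighbour of the same class (hit) and other class (miss).
        hit = min((k for k in range(n_samples) if k != i and y[k] == y[i]),
                  key=lambda k: distance(Z[i], Z[k]))
        miss = min((k for k in range(n_samples) if y[k] != y[i]),
                   key=lambda k: distance(Z[i], Z[k]))
        # Reward features that differ on the miss, penalise those that
        # differ on the hit.
        for j in range(n_features):
            weights[j] += (abs(Z[i][j] - Z[miss][j])
                           - abs(Z[i][j] - Z[hit][j])) / m
    return weights


# Hypothetical data: feature 0 separates the classes, feature 1 is noise.
X = [[0.1, 5.0], [0.2, 1.0], [0.15, 3.0], [0.9, 4.0], [0.8, 2.0], [0.95, 0.5]]
y = [0, 0, 0, 1, 1, 1]
scores = relief_scores(X, y)
ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

On this toy data, feature 0 receives a positive weight and feature 1 a negative one, so `ranked` places feature 0 first. A feature subset can then be chosen by keeping the top-ranked features or those above a score threshold.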
2.3 Development Process
Based on the requirements of the technologies
described above, the development environment of the
financial fraud audit system is configured and
deployed. First, the Hadoop cluster is built on
Linux, using CentOS 6.7 (x86_64) as the operating
system and jdk-8u291-linux-x64 as the JDK. According
to the application requirements of the system, the
Hadoop cluster is set up with seven nodes. Hadoop
2.7.7 is installed on each node, and components such
as YARN, HDFS, ZooKeeper, and HBase are also deployed
on each node.
Second, the Web application server is developed on
Windows 10. The Web server is Nginx, the development
language is Python 3.6.7, the development tool is
PyCharm 2018.3.1 x64, and MySQL 5.7 is used to build
and support the system database. On the server, the
Django framework is adopted, and the modules,
algorithms, and models are developed in the "mysite"
directory according to the functional requirements of
the system. Figure 1 shows the key code for the
implementation of the Relief method. In addition, the
implementation of each classifier depends on the
sklearn module of Python. As shown in Figure 2, the
key code of the logistic regression model is
implemented using Pipeline(). With the key
technologies, the overall development environment,
and the configuration of related software and tools
introduced above, the technical feasibility of the
financial fraud audit system project is established.
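
As an illustration of how a logistic regression classifier can be assembled with sklearn's `Pipeline`, the following is a minimal sketch and not the code of Figure 2. The synthetic data from `make_classification`, the scaling step, the step names, and all parameter values are assumptions made here, since the system's real financial indicators are not reproduced.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the financial indicators (assumption).
# Label 1 = fraud sample, label 0 = non-fraud sample.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4,
    n_redundant=2, weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Chain feature scaling and the classifier so they are fitted together.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

fraud_probability = model.predict_proba(X_test)[:, 1]  # risk score per sample
predicted_label = model.predict(X_test)                # 1 = fraud, 0 = non-fraud
test_accuracy = model.score(X_test, y_test)
```

Wrapping the steps in one pipeline means the scaler is fitted only on the training split, and the predicted probability of the fraud class can serve as a simple risk score alongside the binary label.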
Figure 1: Key code of the Python implementation of the Relief algorithm.