Design and Development of Financial Fraud Audit System Based on Big Data Technology

Binglan Meng

Dalian University of Finance and Economics, Dalian City, Liaoning Province, 116600, China

Keywords: Big Data Technology, Data Mining, Audit of Financial Fraud, Selection of Features, Classifier Model.

Abstract: This paper combines big data technology with financial fraud auditing. The Hadoop framework, the Relief algorithm from data mining, and the Logistic, SVM and Random Forest classifiers are combined to extract features from sample data and build a financial fraud identification model, and the resulting financial fraud audit system is packaged and published in a Python environment. The system is presented as a Web application, through which auditors can query financial and non-financial data, identify financial fraud and assess the risk of financial fraud with simple operations. It offers an application solution to the complexity, concealment, difficulty and risk of financial fraud audit in the data age.

1 INTRODUCTION

In the era of the digital economy, enterprise business models are becoming increasingly complex as a new generation of digital information technology takes hold. As a result, the means and forms of financial fraud have also changed, becoming more diverse, complex and concealed (Jiao, 2021). At the same time, traditional audit procedures and methods have gradually fallen behind, and it is increasingly difficult to audit many types of financial data by relying solely on auditors' personal ability and work experience. In addition, changes in the network and digital environment have sharply increased the quantity and dimensionality of enterprise data, raising the risk of failure in financial fraud audits. For this reason, this paper proposes a financial fraud audit system that takes big data technology as its core and the Hadoop framework as its foundation. HDFS, HBase and other distributed storage frameworks are used to capture, clean and store the enterprise's data; data mining techniques, namely the Relief algorithm and the Logistic, SVM and RandomForest classifiers, are used to select sample data features and build the financial fraud identification model; and the system itself is designed and developed in a Python environment. The system lets internal auditors complete the whole financial fraud audit process through simple and efficient Web operations, which both renews the working methods of financial fraud audit and improves the efficiency of auditors.

2 OVERVIEW OF KEY TECHNOLOGIES

2.1 Big Data Technology

Big data (also called mega data) refers to data collections so large that their acquisition, storage, management and analysis greatly exceed the capability of traditional database software tools (Zhao, 2022). The value of big data is realized through big data processing technology, that is, big data technology.

Hadoop is an open-source framework written in Java that stores massive data on distributed server clusters and runs distributed analysis applications (Shi, 2021). Hadoop has quickly become one of the most popular big data tools thanks to its high reliability, scalability, fault tolerance and efficiency. The core of the Hadoop architecture consists of the distributed file system (HDFS), the distributed computing framework (MapReduce) and the distributed resource scheduling framework (YARN).

Meng, B.
Design and Development of Financial Fraud Audit System Based on Big Data Technology.
DOI: 10.5220/0011737600003607
In Proceedings of the 1st International Conference on Public Management, Digital Economy and Internet Technology (ICPDI 2022), pages 378-381
ISBN: 978-989-758-620-0
© 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

HBase is a distributed, column-oriented database. HBase does not store files itself; its data are persisted by HDFS under the Hadoop framework. The core design goal of HBase is to provide random, real-time read/write access to data held in HDFS.

2.2 Data Mining Technology

Data mining is a method of processing big data that aims to extract previously unknown but potentially useful information and knowledge from large volumes of incomplete, noisy, fuzzy and random real-world data (Wang, 2021). Building the data mining model is the core of the whole data mining task, and the model corresponds to the chosen data analysis method. Given the functional requirements of the financial fraud audit system studied in this paper, the data mining model is built to identify financial fraud and assess its risk, that is, to classify sample data into fraud samples and non-fraud samples. This is a standard binary classification problem, so single classifiers can be used to solve it.

Previous research has found that many indicators affect financial fraud. Given the application environment and sample size of this system, the Relief method, a representative filter-type feature selection method, is adopted. It scores each feature by its relevance to the class label and builds the optimal feature set from those scores, so as to improve the accuracy of the subsequent data mining results.
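The scoring idea behind Relief can be illustrated with a minimal sketch. It is not the system's actual code: the function name `relief_scores`, the toy data, the range-scaled absolute difference and the Manhattan distance are all assumptions made for illustration. For each visited sample, the sketch finds the nearest sample of the same class (the "hit") and the nearest sample of the other class (the "miss"), then rewards features that differ on the miss and penalizes features that differ on the hit.

```python
import random


def relief_scores(X, y, n_iterations=None, seed=0):
    """Score features with a basic Relief procedure for binary labels.

    X is a list of numeric feature rows, y a list of class labels.
    If n_iterations is None every sample is visited once; otherwise
    n_iterations samples are drawn at random (with replacement).
    """
    n_samples, n_features = len(X), len(X[0])

    # Feature ranges, used to scale each per-feature difference into [0, 1].
    spans = []
    for j in range(n_features):
        column = [row[j] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(j, a, b):
        return abs(a[j] - b[j]) / spans[j]

    def distance(a, b):
        return sum(diff(j, a, b) for j in range(n_features))

    if n_iterations is None:
        visited = list(range(n_samples))
    else:
        rng = random.Random(seed)
        visited = [rng.randrange(n_samples) for _ in range(n_iterations)]

    weights = [0.0] * n_features
    for i in visited:
        hits = [k for k in range(n_samples) if k != i and y[k] == y[i]]
        misses = [k for k in range(n_samples) if y[k] != y[i]]
        if not hits or not misses:
            continue  # cannot score without both a hit and a miss
        near_hit = min(hits, key=lambda k: distance(X[i], X[k]))
        near_miss = min(misses, key=lambda k: distance(X[i], X[k]))
        for j in range(n_features):
            # Reward separation from the other class, penalize spread within the class.
            delta = diff(j, X[i], X[near_miss]) - diff(j, X[i], X[near_hit])
            weights[j] += delta / len(visited)
    return weights


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
toy_X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.9, 0.8], [1.0, 0.2], [0.8, 0.5]]
toy_y = [0, 0, 0, 1, 1, 1]
toy_scores = relief_scores(toy_X, toy_y)
```

On this toy data the separating feature receives a positive score and the noise feature a negative one, which is the behaviour a score threshold relies on when it discards weakly relevant indicators.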

2.3 Development Process

Based on the requirements of the technologies above, the development environment of the financial fraud audit system is configured and deployed. First, the Hadoop cluster is built on Linux, using CentOS 6.7 (x86_64) with JDK jdk-8u291-linux-x64. According to the application requirements of the system, the Hadoop cluster is set up with seven nodes. Hadoop 2.7.7 is installed on each node, and components such as YARN, HDFS, ZooKeeper and HBase are also deployed on each node.

Second, the Web application server is developed on Windows 10. The Web server is Nginx, the development language is Python 3.6.7, the development tool is PyCharm 2018.3.1 x64, and MySQL 5.7 provides the system database. On the server, the Django framework is adopted, and the modules, algorithms and models are developed in the "mysite" directory according to the functional requirements of the system. Figure 1 shows the key code for the implementation of the Relief method. The classifiers are implemented with the sklearn module of Python; as shown in Figure 2, the key code of the Logistic regression model is built with the Pipeline() method. With the key technologies introduced above, the overall development environment, the configuration of the related software and tools, and the technical feasibility of the financial fraud audit system project are established.

Figure 1: Python implementation of the Relief algorithm key code.


Figure 2: Key code of the Logistic regression classifier implemented with the sklearn module.
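A Logistic regression model wrapped in a scikit-learn Pipeline might look like the following minimal sketch. It is an illustration under stated assumptions, not the code from the figure: the synthetic data from `make_classification` stand in for the real indicator data, and the standardization step, the step names `"scale"` and `"clf"`, the class imbalance and all hyperparameters are assumed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the selected indicators: 25 features, minority "fraud" class.
X, y = make_classification(
    n_samples=400,
    n_features=25,
    n_informative=8,
    weights=[0.85, 0.15],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Chain preprocessing and the classifier so both are fitted on training data only.
logistic_pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)
logistic_pipeline.fit(X_train, y_train)

# Predicted labels for held-out samples: 1 = fraud, 0 = non-fraud (assumed encoding).
predictions = logistic_pipeline.predict(X_test)
```

Putting the scaler inside the pipeline means that the same fitted transformation is applied automatically whenever the model later scores new samples. Replacing the `"clf"` step with `sklearn.svm.SVC` or `sklearn.ensemble.RandomForestClassifier` would give the other two classifiers the same interface.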

Table 1: Classification results of three classifiers.

Number  Classifier    Accuracy rate  Precision rate  Recall rate
1       Logistic      0.88           0.78            0.44
2       SVM           0.90           0.74            0.66
3       RandomForest  0.91           0.77            0.74
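The three rates in Table 1 are standard confusion-matrix quantities, and a small sketch shows how they relate. The function name `classification_metrics` and the toy labels are hypothetical, and the sketch assumes that fraud is encoded as the positive class 1. It also shows how accuracy can stay high while recall is low when fraud samples are a minority; the paper does not state the class balance of its data, so this is offered only as a possible reading of the Logistic row.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Precision: of the samples flagged as fraud, the share that really are fraud.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the real fraud samples, the share that were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy labels (hypothetical): 4 fraud samples among 10, half of them missed.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
toy_accuracy, toy_precision, toy_recall = classification_metrics(toy_true, toy_pred)
```

For an audit tool, recall is arguably the most important of the three, since a missed fraud sample (false negative) is usually costlier than a false alarm that an auditor can review.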

3 FUNCTION REALIZATION

3.1 Administrator's Side

Under the indicator management module, the administrator performs the preliminary selection of indicators, choosing as many indicators as possible that have some influence on financial fraud. In this paper, 55 indicators were selected and divided into 8 categories according to their meaning: ratio structure, solvency, profitability, operating ability, development potential, cash flow, risk level and governance ability (Lu, 2022). The choice of indicators directly determines the range of sample data used in the subsequent fraud analysis and also affects the final fraud identification.

With the sample management module, the administrator can import all kinds of sample data. Each sample is defined in combination with the indicator information and contains 59 key fields, such as the indicator information, primary key number, fiscal year, industry code and fraud judgment.

3.2 Audit Client

In the model management module, audit users can add, view and delete models. The Logistic, SVM and RandomForest models supported by the system are divided into two groups according to whether training has been completed. For a trained model, "Completed Training" is displayed in the "Details" column on the page, and users can click the model name to view its details. When the user adds a model, the system automatically trains it; once training is complete, the new model automatically enters the model list for the user's subsequent use.

In the financial fraud identification module, the audit user selects unlabeled sample data in the system, that is, the financial and non-financial data of an enterprise for a given fiscal year. The sample data contain 55 indicators, which are screened by the Relief algorithm. With a threshold of 0.001, 25 indicators are retained as the selected features of the sample data. Based on these features, the classification results of the three classifier models are shown in Table 1.
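The threshold step can be sketched as a simple filter over Relief scores. The function name `select_features`, the indicator names and the scores below are hypothetical, and the sketch assumes that an indicator is kept when its score is strictly greater than the threshold; the paper does not say whether the comparison is strict.

```python
def select_features(scores, threshold=0.001):
    """Return the names of features scoring above the threshold, best first."""
    kept = [(name, score) for name, score in scores.items() if score > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in kept]


# Hypothetical Relief scores for a handful of indicators.
example_scores = {
    "current_ratio": 0.0420,
    "roe": 0.0153,
    "inventory_turnover": 0.0004,
    "cash_flow_ratio": 0.0021,
    "board_size": -0.0110,
}
selected = select_features(example_scores)
```

Applied to the system's 55 indicator scores, a filter of this kind with the threshold 0.001 would yield the 25 retained indicators reported above.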

Based on the classification result, the system automatically determines whether the sample is a financial fraud sample: if the prediction is fraud, the sample is labeled as a fraud sample; otherwise, it is labeled as a non-fraud sample.

4 CONCLUDING REMARKS

In response to the challenges of financial fraud audit in the era of the digital economy, this paper proposes an online interactive financial fraud audit system that is based on big data technology, takes data mining technology as its core and uses Web application technology as its framework. Using the Relief algorithm and the Logistic, SVM and RandomForest classifiers, the system selects sample data features and builds the financial fraud identification model. Finally, financial fraud is identified automatically, and the classification is assessed by its accuracy and precision.

REFERENCES

Jiao Haixian. Research on the Problems and Countermeasures of Corporate Financial Fraud Audit under the New Situation [J]. Money China. 2021.08

Lu Xin, Li Huiming, et al. Construction of Financial Fraud Identification Framework: Based on Accounting Information System Theory and Big Data Perspective [J]. Accounting research. 2022.03

Shi Fang Xia, Gao Yi. Application analysis of Hadoop big Data Technology [J]. Modern electronic technology. 2021.09

Wang Lili. Application of data mining technology in the context of big data [J]. Computer and network. 2021.10

Zhao Peng, Zhu Yilan. Overview and development prospect of big data technology [J]. Astronaut systems Engineering Technology. 2022.01
