class of programs according to Internet rules. Python is the most commonly used development language for this technology. Web crawler technology works by defining crawling rules and specifying the entry URLs of the target portal (Ji, 2017).
First, the developer selects a set of seed pages according to the requirements and saves the corresponding URLs. These URLs are placed in a to-crawl queue maintained by the algorithm. The program then downloads the content behind each URL, extracts the key information, and moves the processed URLs into a crawled queue. Meanwhile, the DNS resolution data and webpage download data generated during URL resolution are saved in the downloaded-webpage database.
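As a minimal sketch of this seed/queue workflow (a plain illustration, not the system's actual Scrapy crawler; the seed URL, page limit, and extraction rule are assumptions), a crawl loop in Python might look like this:

```python
# Minimal sketch of the crawl loop described above; the seed URL and
# extraction rules are illustrative assumptions, not the real system's.
from collections import deque
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

seeds = ["https://example.com/products"]  # hypothetical portal URLs
to_crawl = deque(seeds)                   # URL queue to be grabbed
crawled = set()                           # set of processed URLs
MAX_PAGES = 50                            # safety limit for the sketch

while to_crawl and len(crawled) < MAX_PAGES:
    url = to_crawl.popleft()
    if url in crawled:
        continue
    resp = requests.get(url, timeout=10)  # download the page content
    crawled.add(url)                      # move URL to the crawled set
    page = BeautifulSoup(resp.text, "html.parser")
    # "Grab the key information": here we just take the page title.
    title = page.title.string if page.title else ""
    print(url, title)
    # Enqueue newly discovered links that match the crawling rule
    # (here: stay under the seed prefix).
    for a in page.find_all("a", href=True):
        link = urljoin(url, a["href"])
        if link.startswith(seeds[0]) and link not in crawled:
            to_crawl.append(link)
```

Scrapy packages this same loop (scheduler, downloader, and link extraction) behind its Spider API, which is the framework used in the system described later.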
2.2 Hadoop Processing Platform
As a development and application ecosystem, the Hadoop platform supports data-intensive applications, and its set of components keeps growing. The most important components are the distributed file system HDFS and the parallel programming model MapReduce: HDFS is responsible for the distributed storage of massive data, while MapReduce realizes parallel computation over the distributed data, and the two complement each other. Besides HDFS and MapReduce, the Hadoop ecosystem includes many subprojects such as Ambari, Hive, HBase, ZooKeeper, Flume, and Mahout. With multiple components cooperating under a clear division of labor, even inexperienced developers can exploit the advantages of a cluster to handle big data conveniently and quickly (Li, 2017).
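To make this division of labor concrete, here is a hedged word-count example in the Hadoop Streaming style: HDFS holds the input files, and MapReduce runs a script like this as the mapper and then as the reducer. The file name and invocation convention are illustrative assumptions, not part of the described system.

```python
#!/usr/bin/env python3
# Word-count sketch in the Hadoop Streaming style: HDFS stores the
# input, and MapReduce runs this script as mapper and reducer over it.
import sys

def mapper():
    # Map phase: emit (word, 1) for every word on every input line.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Reduce phase: input arrives sorted by key, so counts can be
    # accumulated per word and emitted when the key changes.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    # Run as "wordcount.py map" or "wordcount.py reduce" (illustrative).
    mapper() if sys.argv[1] == "map" else reducer()
```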
2.3 J2EE Framework
J2EE is a simplified Java web development platform designed and developed by Sun Microsystems, on which a range of enterprise application software can be built. To simplify application development for large enterprises, J2EE provides reusable component models that improve development efficiency. It also defines a layered structure whose plumbing is handled automatically, which lowers the skill requirements placed on developers of application software (Ma, 2022).
2.4 Development Environment
This section briefly introduces the technologies used to develop and run the platform. In the big data precision marketing system, Hadoop is used as a big data server cluster to process the data and store it in a MySQL database, and the corresponding application platform is developed with JavaWeb technology.
According to the system's data volume and overall operating requirements, a three-node Hadoop 3.3.1 cluster is built. The distributed coordination service ZooKeeper 3.4.1, the distributed file system HDFS 2.6.5, Flume 1.9.0, Hive 0.13.1, and HBase 2.6.5 are then installed and deployed on the three nodes, completing the initial construction of the Hadoop cluster. The cluster runs on Linux; the CentOS 6.5 Server release is selected. The web crawler is built on Scrapy 2.5, with Python 3.8 as the development language (Lin, 2016).
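As a quick way to verify such a deployment from the Python side, a smoke test like the following could write and read a file over WebHDFS. It assumes the third-party hdfs package and a hypothetical NameNode address (node1:9870, the default WebHDFS port in Hadoop 3); neither is specified by the paper.

```python
# Smoke test for the freshly built cluster, assuming the third-party
# "hdfs" package (pip install hdfs) and a hypothetical NameNode URL.
from hdfs import InsecureClient

client = InsecureClient("http://node1:9870", user="hadoop")  # assumed host/user

# Write a small test file into HDFS, then read it back.
client.write("/tmp/smoke_test.txt", data="hello hadoop\n", overwrite=True)
with client.read("/tmp/smoke_test.txt") as reader:
    print(reader.read().decode("utf-8"))

print(client.list("/tmp"))  # list directory contents to confirm the write
```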
In this system, the front end of the JavaWeb application is built with Bootstrap and jQuery, using JavaScript, HTML, and CSS as the development languages. The back-end Java development tool is IDEA 2021.1.3 (Ultimate Edition), the development environment is JDK 1.8, and the J2EE framework of Tomcat + Spring MVC + Spring + MyBatis is used in the implementation of this system. The development language is Java, and MySQL 8.0.28 is selected for data management.
3 OVERALL DESIGN
According to enterprise needs, the Hadoop-based big data precision marketing system establishes a top-down, one-stop pipeline for data collection, analysis, processing, and visualization. The main functions of data collection, data storage, data cleaning, data query, and data analysis are supported by the Hadoop ecosystem cluster, while visualization is realized with JavaWeb technology.
First, data are collected from three sources: local enterprise server data gathered by Flume, URL data collected from product detail pages by the Python web crawler, and shared data from Taobao, Weibo, and other platforms accessed through an external JDBC interface. These data are first cached in HDFS distributed storage, while the crawler's URL set is stored in Redis. The data computation module is implemented with MapReduce; it analyzes the preliminary data, manages the crawler results, and applies data mining techniques such as association rule mining to build consumer portraits, as sketched below. After processing, the data are saved in HDFS and Hive.