the core technology of big data. Sqoop imports data
from relational databases directly into Hadoop. For
example, data from MySQL and Oracle can be loaded into
HDFS, Hive, and HBase for storage within the Hadoop
architecture. The process is also reversible: Sqoop can
export results from Hadoop back into the relational
database, which greatly facilitates data collection and
exchange. Sqoop also supports the automatic transfer of
large volumes of structured or semi-structured data,
improving the efficiency of big data systems.
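As a minimal sketch, a typical Sqoop import of a MySQL
table into HDFS might look as follows; the connection
string, credentials, table name, and target directory
here are hypothetical:

    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user --password-file /user/etl/.pw \
      --table orders \
      --target-dir /data/raw/orders \
      --num-mappers 4

The corresponding sqoop export command reverses the
direction, writing files from HDFS back into a database
table.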
2.2 Data Storage
The Hadoop Distributed File System (HDFS) serves as
the storage engine of big data technology within the
distributed system infrastructure Hadoop. HBase, a
distributed, real-time, column-oriented database, is
deployed on top of HDFS. HBase is essentially a NoSQL
database used to store data. However, unlike common
relational databases, HBase is better suited to storing
unstructured data. In addition, HBase's key-value data
model provides random read and write access on top of
HDFS, which by itself supports only sequential access.
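As an illustrative sketch of this key-value model, the
HBase Java client API addresses each cell by row key,
column family, and qualifier; the table name
"user_profile", family "info", and row key below are
hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("user_profile"))) {
                // Random write: the cell is addressed by (row key, family, qualifier).
                Put put = new Put(Bytes.toBytes("user#1001"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"),
                              Bytes.toBytes("Beijing"));
                table.put(put);
                // Random read by row key, with no scan over the underlying files.
                Get get = new Get(Bytes.toBytes("user#1001"));
                Result r = table.get(get);
                System.out.println(Bytes.toString(
                        r.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));
            }
        }
    }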
Similar to HDFS, Tachyon is a memory-centric
distributed file system with high performance and
fault tolerance. Tachyon provides fast file sharing
services for offline computing engines such as
MapReduce and Spark cluster frameworks. In the
hierarchy of the big data technology stack, Tachyon is
an independent layer between the existing big data
computing frameworks and the big data storage systems.
This layer addresses a weakness that appears during big
data analysis and mining: disk-bound HDFS I/O slows
performance, and data cached inside a single computing
process is easily lost when that process fails; by
holding the shared working set in memory, Tachyon
mitigates both problems.
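Because Tachyon exposes a Hadoop-compatible file system
interface, an existing framework can address it simply
through a filesystem URI. A minimal sketch in Spark's
Java API, assuming a Tachyon master at
tachyon://master:19998 and a hypothetical input path
(the Tachyon client library must be on the classpath):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class TachyonReadSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("TachyonReadSketch");
            JavaSparkContext sc = new JavaSparkContext(conf);
            // The tachyon:// scheme routes reads through the in-memory
            // layer instead of going to disk-backed HDFS directly.
            JavaRDD<String> lines =
                    sc.textFile("tachyon://master:19998/data/raw/orders");
            System.out.println("lines: " + lines.count());
            sc.stop();
        }
    }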
2.3 Data Cleaning
Under Hadoop, MapReduce serves as the query engine for
parallel computation over large data sets. Data
cleaning consists mainly of writing and executing
MapReduce programs, and the whole process divides into
three basic components: the Mapper, the Reducer, and
the Job that ties them together (Cao, 2015). A
MapReduce program cleans the raw or irregular data
collected in HDFS and transforms it into regular,
well-formed records, that is, it completes the
pre-processing of the data and facilitates subsequent
statistical analysis. MapReduce programs are likewise
used for the statistical analysis itself; after a job
runs, the analysis results are written back to HDFS
for storage.
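As a minimal sketch of such a cleaning job (the
comma-separated record format and five-field schema are
assumptions for illustration, not from the source), a
Mapper can drop malformed lines while a Reducer
de-duplicates records, with a Job wiring the two
together:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CleanJob {
        // Mapper: keep only records with the expected number of fields.
        public static class CleanMapper
                extends Mapper<LongWritable, Text, Text, NullWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split(",");
                if (fields.length == 5 && !fields[0].isEmpty()) { // assumed schema
                    ctx.write(value, NullWritable.get());
                }
            }
        }

        // Reducer: identical records arrive at the same reducer,
        // so emitting each key once de-duplicates the data.
        public static class DedupReducer
                extends Reducer<Text, NullWritable, Text, NullWritable> {
            @Override
            protected void reduce(Text key, Iterable<NullWritable> vals, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(key, NullWritable.get());
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "clean");
            job.setJarByClass(CleanJob.class);
            job.setMapperClass(CleanMapper.class);
            job.setReducerClass(DedupReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }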
Compared with MapReduce, Spark is a general-purpose
cluster computing platform that cleans and processes
data faster. Spark extends the MapReduce computing
model, supports more computation patterns, and provides
users with richer programming interfaces, such as
Python, Scala, Java, and SQL. Through its Spark Core
component, Spark exposes the APIs for creating and
operating on resilient distributed datasets (RDDs), on
which data cleaning and computation are performed.
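A minimal sketch of the same cleaning step expressed
through the Java RDD API (the input and output paths
and the five-field record format are the same
assumptions as in the MapReduce example above):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkCleanSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("SparkCleanSketch");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<String> raw = sc.textFile("hdfs:///data/raw/orders");
            JavaRDD<String> clean = raw
                    .filter(line -> line.split(",").length == 5) // drop malformed rows
                    .distinct();                                 // de-duplicate

            clean.saveAsTextFile("hdfs:///data/clean/orders");   // results back to HDFS
            sc.stop();
        }
    }

Because intermediate RDDs can stay in memory between
transformations instead of being materialized to HDFS,
chains of such steps typically run faster than the
equivalent sequence of MapReduce jobs.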
2.4 Data Query and Analysis
Hive is a data warehouse tool that runs on Hadoop. It
reads HDFS data for offline querying: Hive maps the
data to database tables and supports Hive SQL (HQL)
for querying it. Hive provides three access modes, the
bin/hive command-line client, JDBC, and a WebGUI, all
suited to batch processing of big data. Hive converts
the SQL statements submitted by users into MapReduce
jobs and runs them on Hadoop to query, and store
results in, HDFS. In this way Hive overcomes the
bottleneck that traditional relational databases such
as MySQL and Oracle face when processing big data
(Yang, 2016).
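As an illustration of the JDBC access mode (the host
name, database, and table below are hypothetical), a
client can submit HQL through the standard HiveServer2
JDBC driver; Hive compiles the statement into MapReduce
jobs behind the scenes:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcSketch {
        public static void main(String[] args) throws Exception {
            // hive2 is the standard JDBC scheme; 10000 is the default port.
            String url = "jdbc:hive2://hiveserver:10000/sales";
            try (Connection conn = DriverManager.getConnection(url, "etl_user", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT city, COUNT(*) AS orders FROM orders GROUP BY city")) {
                while (rs.next()) {
                    System.out.println(rs.getString("city") + "\t"
                            + rs.getLong("orders"));
                }
            }
        }
    }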
2.5 Data Visualization
Big data technology produces its results through the
series of steps described above: data collection, data
storage, data cleaning, and data querying. Data
visualization then displays those results intuitively,
helping users deepen their understanding of the data
and discover the laws or trends it contains. Data
visualization is the last and most important step in
the life cycle of big data technology. Hive-based
visualization tools, including DBeaver and TreeSoft,
let users query and view data with SQL statements
after simple database configuration and connection.
Zeppelin is a Spark-based data visualization solution:
any job that runs on Spark can run on this platform,
and it also supports visualization of table data.
The results produced by big data technology can be
applied to a business intelligence (BI) platform to
help enterprise managers make decisions and develop
strategy. By collecting, managing, and analyzing data
on the enterprise's external environment together with
its own internal production, sales, and management
data, BI turns data that was originally scattered, low
in value density, and heterogeneous in type into
useful information. This provides high-quality data
services for enterprises, promotes the integration of
informatization and industrialization, and supports
the upgrading and adjustment of the industrial
structure.