detached and loaded into this schema. After the data had been loaded into this central repository, the next task was to analyze and present the data in information systems.
3.3 Derived Data Layer
The data in the derived data layer has been cleaned, formatted, aggregated or summarized for presentation tools, decision-making applications, ad hoc query tools, etc. In this system the derived data layer was formed by OLAP cubes, so that end-users' requests could be answered almost instantaneously. An OLAP system obtains its information from an underlying data warehouse, so the results have to be calculated and loaded into the OLAP structures beforehand. Hence, in this experiment, the daily PV (page views) and UV (unique visitors) were counted in advance and stored in HBase and in an RDBMS for different purposes.
When computing the PV and the UV, join operations were performed between the fact table and the dimension tables using HiveQL commands; in this process, requests from identical IPs and requests for identical pages were grouped and counted, respectively. For instance, the page views counted on a certain day are shown in Table 1 below.
Table 1: A sample of the page-view results.
RID  PathID  Page          Count  Time
137  1       index.php     1185   18/11/2015
138  2       forum.php     923    18/11/2015
141  5       register.php  13     18/11/2015
The first column was a record ID identifying each record. PathID was the page identification number, and the Page attribute was the name of that page. The Count column was calculated by grouping on PathID, and the last column stored the date on which the operation happened. Data analysis could be carried out directly on this table: for example, index.php was visited 1,185 times on that day, a large share of clients visited the forum.php page, and no more than 13 people registered on that day.
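As an illustration, a HiveQL query of roughly the following shape could produce the figures in Table 1. The table and column names (fact_access, dim_path, path_id, page_name, ip, dt) are assumptions made for this sketch rather than the exact schema of the experiment; the daily UV would be obtained analogously with COUNT(DISTINCT ip).

-- Sketch of the daily PV aggregation (assumed table and column names):
-- join the access fact table with the page dimension table and
-- count the hits per page for one day.
INSERT OVERWRITE TABLE pv_daily PARTITION (dt = '2015-11-18')
SELECT f.path_id    AS path_id,   -- PathID column of Table 1
       d.page_name  AS page,      -- Page column of Table 1
       COUNT(*)     AS cnt        -- Count column of Table 1
FROM   fact_access f
JOIN   dim_path d ON f.path_id = d.path_id
WHERE  f.dt = '2015-11-18'
GROUP BY f.path_id, d.page_name;

-- The daily UV is computed analogously, e.g.:
-- SELECT COUNT(DISTINCT ip) FROM fact_access WHERE dt = '2015-11-18';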
The results could be stored in HBase via the Hive HBaseIntegration feature, which made it convenient to load data into HBase using HiveQL commands alone. Alternatively, the results were transferred to an external RDBMS via Sqoop. These jobs were invoked every day under the control of shell scripts. Therefore, when end-users requested weekly, monthly or even annual PV and UV statistics to evaluate the performance of a website, the results could be obtained in a short period of time; otherwise, the raw data stored in HDFS would have to be recomputed by Hive.
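A daily job of this kind could be driven by a shell script along the following lines. The script, table names, JDBC connection string and paths are assumptions sketched for illustration, not the exact commands used in the experiment; pv_daily and pv_daily_hbase carry over from the earlier sketch.

#!/bin/bash
# Daily job sketch for publishing the PV results (assumed names and paths).
DAY=$(date -d "yesterday" +%Y-%m-%d)

# 1. Copy the day's PV figures into an HBase-backed Hive table.
#    The table pv_daily_hbase would be created once beforehand with
#    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler', so that
#    a plain HiveQL INSERT is enough to load HBase.
hive -e "
INSERT INTO TABLE pv_daily_hbase
SELECT concat(path_id, '_', '${DAY}') AS row_key, page, cnt
FROM   pv_daily
WHERE  dt = '${DAY}';
"

# 2. Export the same figures to the external RDBMS via Sqoop.
sqoop export \
  --connect jdbc:mysql://dbhost/analytics \
  --username analyst \
  --password-file /user/etl/db.password \
  --table pv_daily \
  --export-dir /user/hive/warehouse/pv_daily/dt=${DAY} \
  --input-fields-terminated-by '\001'

Scheduling such a script with cron, or a similar mechanism, yields the daily invocation described above.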
4 RESULTS DESCRIPTION
All frameworks and tools worked together smoothly in the data transfer, data cleaning, data manipulation and related operations. The processing that occurred in each layer can be outlined as follows. The website log files acted as the real-time data recording clients' actions; in the real-time layer, Flume gathered the log files from the staging servers and sent them to HDFS in parallel.
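For illustration only, a minimal Flume agent of this kind could be configured and launched roughly as follows; the agent name, log path and HDFS path are assumptions rather than the configuration actually used in the experiment.

# Minimal Flume agent sketch (assumed names and paths): tail a web server
# access log and deliver the events to HDFS through a memory channel.
cat > weblog-agent.conf <<'EOF'
agent1.sources  = weblog
agent1.channels = mem
agent1.sinks    = hdfs-sink

# Source: follow the web server access log on the staging server
agent1.sources.weblog.type     = exec
agent1.sources.weblog.command  = tail -F /var/log/httpd/access_log
agent1.sources.weblog.channels = mem

# Channel: buffer events in memory
agent1.channels.mem.type     = memory
agent1.channels.mem.capacity = 10000

# Sink: write the events into HDFS, partitioned by day
agent1.sinks.hdfs-sink.type          = hdfs
agent1.sinks.hdfs-sink.channel       = mem
agent1.sinks.hdfs-sink.hdfs.path     = hdfs://namenode:8020/weblogs/%Y-%m-%d
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
EOF

# Start the agent on the staging server
flume-ng agent --conf ./conf --conf-file weblog-agent.conf --name agent1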
The reconciled data layer was the key site for gathering, cleaning and integrating the data and for handling data quality issues via Mapper and Reducer functions in Hadoop, while Hive took responsibility for establishing the data warehouse. The main purpose of the derived data layer was to present the data or to prepare it for analysis; the OLAP cubes were stored in HBase and in an RDBMS for data presentation. Most of the operations, including the construction of the OLAP cubes, were invoked automatically by shell scripts. Most of the data-flow processing tasks were carried out in parallel, meaning that several nodes ran different parts of the same job at the same time. In addition, the system provided a low-cost integrated data warehouse architecture with high throughput and big data analysis capability. Most of the platforms and tools were open source, which helps to reduce the start-up cost of deploying a data warehouse.
5 CONCLUSIONS
Emerging software and platforms such as Hadoop and its subprojects have been exploited to achieve the goals of this experiment. Although technologies have kept evolving over the decades, the three-layered architecture is still very useful for building a data warehouse in the big data context. From the perspective of this experiment, the system only has a simple warehousing schema and a single OLAP system, and the Hadoop platform ran on one computer in independent virtual machines that shared the same hardware. Therefore, several aspects of this system can still be improved. In future work, the Hadoop platform will be deployed on physically independent commodity machines, and a more stable and robust data warehousing schema will be designed in order to fulfill more sophisticated requests from end-users.