SPLITTING FACTS USING WEIGHTS

Liga Grundmane, Laila Niedrite

University of Latvia, Department of Computer Science,

19 Raina blvd., Riga, Latvia

Keywords: Data Warehouse, Weights, Fact Granularity.

Abstract: A typical data warehouse report is the dynamic representation of some objects’ behaviour or changes of

objects’ properties.

If this behaviour is changing, it is difficult to make such reports in an easy way. It is

possible to use the fact splitting to make this task simpler and more comprehensible for users. In the

presented paper two solutions of splitting facts by using weights are described. One of the possible solutions

is to make the proportional weighting accordingly to splitted record set size. It is possible to take into

account the length of the fact validity time period and the validity time for each splitted fact record.

1 INTRODUCTION

Following the classical data warehouse theory

(Kimball, 1996), (Inmon, 1996) there are fact tables

and dimension tables in the data warehouse.

Measures about the objects of interest of business

analysts are stored in the fact table. This information

is usually represented as different numbers. At the

same time the information about properties of these

objects is stored in the dimensions.

A typical data warehouse report is a dynamic

representation of some objects’ behavior. If this

behavior is changing slowly, we can say the fact is

slowly evolving. It is difficult to make such reports

in an easy way. Therefore we have introduced fact

splitting using weights. This method gives a

possibility to solve the problem.

To describe slowly evolving fact we can say that

the fact has a value for a time period and following

this is a situation when the given fact corresponds to

many time units in the time dimension.

In some sources some situations are described

when one object has a set of properties or it is

connected with a set of other objects. This set can be

considered as one entirety. To specify the percentage

for each member of the set, weights are used. Such

situation usually is called “many to many”

relationship between the dimension table and the

fact table. To cope with these situations, the bridge

tables are introduced. Each member from the set

that is defined in the bridge table, gets its weight.

Summarized total of weights for all members from

one set have to be 100 percent. There exists an

opinion that bridge tables are not easily

understandable for users, however such a modeling

technique satisfies business requirements quite good.

If a data mart exists with only aggregated values,

it is not possible to get the exact initial values. It is

possible to get only some approximation. So if we

need to lower the fact table grain, we have to split

the aggregated fact value accordingly to the chosen

granularity.

In section 2 related work is presented. Section 3

gives data warehouse related definitions. Splitting

fact using weights is introduced in section 4. It is

followed by an example in section 5. Section 6

concludes the paper and points out some further

research directions.

2 RELATED WORK

The concept of “slowly evolving fact” was

introduced by Chen, (Cochinwala, and Yueh, 1999).

The authors argue that in some cases the classical

approach to keep measurable facts in the star schema

is not the best solution. If the fact value remains

unchanged during some time period, the redundant

values of measures, being at the same time the

snapshots of the fact in every particular time unit

during the period, are stored in the fact table. Instead

of that a “transaction-oriented” fact table could be

used, where the fact table records represent

transaction by structure with measured fact value

253

Grundmane L. and Niedrite L. (2006).

SPLITTING FACTS USING WEIGHTS.

In Proceedings of the Eighth International Conference on Enterprise Information Systems - DISI, pages 253-258

DOI: 10.5220/0002447402530258

 SciTePress

and transaction start and end time representing the

time period with unchanged fact value and very

likely with long duration.

The real transaction-level fact tables with facts

that represent transactions in data sources are

discussed in (Kimball, 1996). (Chen, Cochinwala,

and Yueh, 1999) use the “transaction-oriented” fact

table for answering the questions concerning the

artificial transaction itself, e.g. the average length of

the transaction duration, but in cases when lower

detail data is necessary, a virtual cube is created to

answer the question.

The dimension tables are usually associated with

the fact table using the relationship “one to many”.

The real data warehouse projects sometime request

the solutions for “many to many” situations keeping

unchanged the star schema structure and also the

simplicity of the model. (Song et al., 2001) describes

different solutions for “many to many” relationship.

Two of them are connected with lowering the grain

of the fact table. When fact values are not accessible

they are precomputed and divided according

weights. Authors consider the solutions for

transaction fact tables.

The most popular solution for “many to many”

relationship between the fact and dimension tables is

known from Kimball’s books (Kimball et.al., 1998),

(Kimball and Ross, 2002), where the solution with

the bridge tables is explained. The attribute groups

are defined in the bridge table from attributes of the

dimension table and the weighting factors are

assigned for each attribute within the group.

(Kimball and Ross, 2002) mention also the possible

change of the granularity of the fact table, where the

detailed fact value could be computed from given

fact value multiplied with the weighting factors for

each attribute within the corresponding attribute

group from the bridge table. The result is the

growing number of the fact table records after this

activity. The solution could turn into a problem as

well in the case when more than one dimension have

“many to many” relationships with the fact table,

because the explicit meaning of the newly computed

fact is not possible to define.

(Eder, Koncilia, and Kogler, 2002) suggest the

solution with temporal data warehouse to depict the

structural changes. The dimension attributes and the

hierarchical relationships between them should be

time stamped, that actually means a definition of a

new version. Some solutions in (Eder, Koncilia and

Kogler, 2002) are provided also concerning the fact

attributes. For example, when the granularity of the

time dimension changes and the fact values for the

new detail level are not accessible, the

transformation function is defined from one version

to another, e.g. fact values for the month level are

multiplied with 1/number_of_days_in_the_month.

In our paper we describe the case of slow

evolving fact as a many to many relationship

between the fact table and time dimension and store

the weighted facts in the same fact table with the

original facts to compute dynamics of fact attributes.

3 DATA WAREHOUSE RELATED

DEFINITIONS

In the data warehouse a multidimensional data

model is used. In many books (Kimball, 1996),

(Inmon, 1996), (Jarke, Lanzerini and Vassiliou,

2002) and research papers (Abelló, Samos and

Saltor, 2001), (Hüsemann, Lechtenbörger and

Vossen, 2000) the key components of the

multidimensional model are defined and their

features analyzed. The main components are: fact,

measures, dimensions and hierarchies.

A fact is a focus of interest for the analytical data

processing. Measures are fact attributes usually

quantitative description of the fact. The other

component of the fact is the qualifying context,

which is determined by the hierarchy levels of the

dimensions. The fact is characterized by the

granularity that also depends from the corresponding

hierarchy levels of the fact.

Dimensions are the classifying data used for

grouping the fact data in different detail level. The

dimension data are organized in hierarchies

consisting of hierarchy levels prescribed for the fact

aggregation at different detail level.

As the special type of the facts we will consider

the slowly evolving fact, the fact whose value

remains unchanged during a time period with the

period start time and the period end time connected

to the fact from time dimension.

ICEIS 2006 - DATABASES AND INFORMATION SYSTEMS INTEGRATION

254

Fact table

Initial from

Initial to

Split from Split to Fact Unity split

15.04.2003 30.12.2004 15.04.2003 31.12.2003 10 0,5

15.04.2003 30.12.2004 01.01.2004 30.12.2004 10 0,5

24.01.2005 18.08.2005 24.01.2005 18.08.2005 20 1

Fact table

Initial from

Initial to

Split from Split to Fact Weighted fact

15.04.2003 30.12.2004 15.04.2003 31.12.2003 10 5

15.04.2003 30.12.2004 01.01.2004 30.12.2004 10 5

24.01.2005 18.08.2005 24.01.2005 18.08.2005 20 20

4 WEIGHTED FACT

Speaking about splitting the facts and adding the

weights, we must remember that there are different

kinds of facts, additive, semi-additive and facts that

can not be aggregated. Some cases and methods that

describe how to split facts are described in the next

sections. It will be pointed out, when the usage of

weighted facts is appropriate solution.

By a term weighting facts a process, when a

single fact record is splitted in several records, will

be denoted. The union of all fact validity time

intervals in this new record group must match with

the validity time for primary fact. The validity time

intervals in the new fact record group must not

overlap. And by using some special function it has

to be possible to restore initial value of fact record,

that were splitted.

4.1 Proportional Splitting

With the term proportional splitting we denote the

case when each measure value in a new splitted fact

group is calculated dividing the initial value by the

size of a new fact record group. The validity period

of this fact is not taken into account. For example,

the validity time for the initial fact is from 20

December till next year’s 26

December. If this fact

is splitted according the end of year, two fact records

are made. Validity time for one fact is 11 days, but

the second fact in this two records group is valid

almost a year. Nevertheless the fact values for each

of them are equivalent and are exactly one half of

initial fact’s value as the new values depend on fact

group size and not on validity time interval length.

It is possible to describe a method how to

calculate new fact values that are already weighted,

and split validity time interval into smaller ones, in

an algorithmic way.

A period of time, when the initial fact value, for

example f, is valid, will be denoted by interval

[ft1::ft2]. As the data warehouse is a data storage

that has a temporal nature, the value ft2 can be

unknown. Usually this is a time moment in a future

and is referenced as ‘now’ (Abelló and Martin,

2003). Such a situation in splitting facts causes

several problems that will be discussed later in this

paper. For the first approximation we can assume,

that ft2 is defined time moment, and so it is true in

most cases.

The time period from which one value will be

taken for making dynamic of changes, will be

denoted with [dt1::dt2]. Quite popular is situation,

when this time interval is from 1

January till 31

December. But in many cases this time period might

be completely different. It is possible to compare

values in the same moment for each day, each month

or quarter.

As [dt1::dt2] can be set in different ways, it is

possible to speak about the length of this interval.

There exist a lot of time units, like seconds, days,

months, years. For indicating the length of this time

span –

dt, we propose to chose the smallest

granularity unit in time dimension. Typically it is a

day, but could be also other possibilities. It depends

on each particular situation and requirements.

To determine the size of the new splitted fact

records set and find out the boundaries of the new

valid time intervals an algorithmic approach can be

used.

Set_size:=0;

if ft1<dt2 then Set_size:=Set_size+1;

if dt1<ft2 then Set_size:=Set_size+1;

Set_size:=Set_size +

((ft1-dt2)-(dt1-

ft2))/Δdt;

(1)

The method for getting splitted fact set size is

described in a code (1).

Formula {1} can be used to get the boundaries of

valid time intervals, where k is natural number from

the set [1..((ft1-dt2)-(dt1-ft2))/

dt].

[(ft1-dt2)+k•Δdt :: (dt1-ft2)+ k•Δdt] {1}

According to the previous definitions and

calculations the measure value fs for each fact

record from new the fact records set can be

calculated dividing initial fact value by the set size.

It is represented in formula {2}.

Figure 1: Additive fact weighting. Figure 2: Semi-additive fact weighting.

SPLITTING FACTS USING WEIGHTS

255

fs=f / Set_size {2}

If all values for f are from set [0,1], we call it a

unity fact. And a process of dividing this record in

weighted records – the unity splitting. Such a

column with values 0 and 1 is usually used for

counting objects in a process.

If the fact that is splitted is additive, the

formula {2} can be used for getting weighted fact

value. The example of such fact table is given in

figure 1, where two initial facts are, one of which is

divided in two records group.

There is also another way, which describes work

with fact weights. This approach can be used also if

a fact is semi additive. The solution is to add to the

fact table a new column – the unity fact column.

And then we can weight only this column, from

which the unity splitting is made. After such

operations the table showed in the figure 2 can come

as a result.

Making reports and building queries is the next

challenging task in this approach. If the dimensions

towards which the fact is additive are included in the

report, both columns have to be multiplied.

Otherwise initial values are taken. This solution is

considered better as the first one, because it is easier

to understand for users. We can say that there are

just two different facts in fact table that in special

cases can be multiplied. And of course it is always

possible to make views for clearing such problems.

4.2 Splitting Depending on Validity

Time Interval Length

In some cases it may be not enough, if only the new

fact record set size is used for getting weighted fact

values. Some times it may be important also to take

the validity time length of the fact for each record

into account. Referring back to the previous example

where the fact record is valid from 20

of December

till next year’s 26

of December it wouldn’t be a

good idea to divide the fact “total income for

period” into two equal values. There are almost no

chances that income for six days is the same as

income for almost a year.

If it is needed to weight the fact accordingly to

the validity time length for the each record in new

splitted fact record groups, the formula {2} must be

modified. We need a new function that can be called

diff. This function finds the length of any time

interval in lowest time dimension granularity units.

So it can be the count of days, minutes, seconds

depending on particular situation. If such a function

exists, and usually it is already built in DBMS, the

weighted values for each record in new fact record

group can be found following formula {3}.

{3}

Here with [ts::tb] is denoted the time interval

when the splitted fact value is valid. The length of

this interval can be from one smallest time

granularity object till the length of interval

[dt1::dt2]. Also in this approach not the real initial

fact value f can be weighted, but added unity fact, as

it was described in the previous section.

As it is seen from all formulas {1}, {2} and {3},

it is important to know the validity time interval for

the initial fact. It is not always possible, but from our

point of view it is not a good situation if ft2 is an

unknown time moment in future. It is possible to

choose one of two solutions to weight such facts.

– During data warehouse load it is possible to

assign to ft2 a fixed time moment in the future. It

is not the best way, as we have to recalculate old

fact when ft2 gets a real value. Also such

splitting is not fully correct.

– We can recalculate fact splitting each night, by

assigning to ft2 a current moment of time. It

works quite well, if lowest granularity unit in the

fact table is a day or something less, for example

month or year.

5 MOTIVATING EXAMPLE

In this section an example showing the way, how

splitted facts can be used to get the change dynamic

of the object status, will be provided. In the figure 3

a very simple star schema is given. To make the

example more perceptible, the time dimension is

taken away and the date is stored directly in the fact

table.

In the fact table company’s employees’ contracts

of employment are stored as measures. The contract

for each employee can be valid from several days till

several years. A Typical situation is when the

contract is made for several years, so the fact record

is valid a lot of days that in this example can be

perceived as a smallest granularity unit in the time

dimension. Accordingly to the definition from the

section 3, this fact is slow evolving. Since we want

to group employees according to their working

place, the dimension ‘Department’ is used. In the

dimension ‘Person’ the personal information for

each employee is stored.

[]

]::[

2::1

tbtsdiff

ftftdiff

fs •=

ICEIS 2006 - DATABASES AND INFORMATION SYSTEMS INTEGRATION

256

As it is seen from figure 3, the company has two

different departments, and four employees are

working there. With these persons contracts of

employment are concluded. The count of employees

in each department during the time is changing.

The time period, where one value will be taken

from to compute the dynamic of changes, is one year

in our example, so the values are 1

of January and

of December. The Splitted facts are made and

the weighted values and the valid time intervals are

assigned according to section 4.1.

In the figure 3 the company has one contract of

employment with the person2 and person3. These

contracts are valid in some parts of all these three

years. For this reason the initial fact is splitted into

three records and weighted values are 1/3 from the

initial value that was 1 as it is the unity splitting.

Fact records for other persons are splitted also

accordingly to previously described ideas.

From the data that are given in the figure 3, it is

quite easy to get two of the most popular reports.

One of them answers on the following question:

“How many employees do we have in our company

on the specified day, grouped by departments.” In

the code (2) an example query is given. This query

returns the number of employees in the company on

of July 2004 grouped by departments.

Select name, round(sum(weighted_fact))

from Department d, Fact_table f

where f.department_id=d.department_id

and ‘01.07.2004’ between (initial_from

and initial_to) group by name

(2)

As the weighted fact values are almost not

possible or at least not easy to store in a format 1/3

or 1/7, the weighted fact values normally are finite

decimal numbers. When the weights are

summarized, it is not possible to get whole numbers.

Some additional solutions can be introduced for not

losing some data because of rounding,

– Weighted fact can be stored using big precision.

– The fact can be weighted in not exact values.

This means that instead of three equal values of

0.33 when splitting unity fact, it is possible to

substitute one of the facts from the fact group

with 0.34. In such a way the total for these

weights will be exactly the beginning value

which in this case was one.

– To use views with partly summarized values.

This approach also makes these solutions more

understandable for data warehouse users.

The results of query in the code (2) that are

referencing the data structures from figure 4 are

given in the table 1.

Table 1: Company’s employees’ count in 1

of July 2004.

Second typical data warehouse report that is

possible to get using splitted facts is the change

dynamics of the employees count across several

years. This means that we would like to get

employees count for each department in each year’s

specified date. Normally this date is given as a

parameter. In the code (3) a query is shown against

figure’s 3 data structures. This query returns the

distribution of the employees in each year

separately.

Select name, sum(fact),

year(splitted_from) from Department d,

Fact_table f where f.department_id=

d.department_id and ‘01.07’ between

(day_month(splitted_from) and

day_month(splitted_to)) group by name,

year(splitted_from)

(3)

If we are building such queries, it is not possible

to manage without different time functions. First of

Figure 3: Star schema filled with data.

SPLITTING FACTS USING WEIGHTS

257

all the year has to be separated from the full date. On

the other hand in real life situation in the fact table

the foreign keys to the time dimension are stored.

But in the time dimension usually is a column called

“Year”. This is a reason why we can avoid the

function year. In any case it is not the difficult one

as function that extracts year from full date usually

is built in DBMS. The mandatory function in this

query is the one that can extract day and month from

the full date. In this example we are calling it

day_month. In a lot of situations it can be avoided

using substrings, concatenations and data type

translation functions. A little bit similar is the

function between. This function in DBMS level is

defined for built in data types like date, integer, real.

This is why it should be overloaded in a way that

between can work with partial date as the year this

time is not important.

The results from query in code (3) are presented

in table 2.

Table 2: Company’s employees’ count change dynamic

calculated on each year 1

of July.

When the fact splitting is used, the record count

in fact table is growing. It is not possible to predict

the percentage of growth for all situations. For each

fact table it can be very different. It depends on the

nature of facts, that is, how slow evolving they are.

The second point is, how long is the period, from

which we want to take the value for making

dynamics. It can be day, hour, month or year. In our

project, where the contracts of employment are

stored, the record count grew about three times.

Other benefit, which can be got by combining two

different facts into one fact table, is less storage

space. This is not only because we do not need to

make almost the same fact tables with almost the

same primary keys. Also the indexes should not be

duplicated.

6 CONCLUSION

We introduced solutions for implementing splitted

facts using weights in this paper. There were also

given an example of how to use such structures in

real world situation. However, fact splitting can be

used in other situations as well. One of them is fact

splitting in bitemporal data warehouses. The facts

can be splitted accordingly to transaction time and

validity time overlapping intervals. In that way we

could analyze data looking at events history from

different perspectives.

Other situation when splitting facts could be

appropriate solution is having inconsistent data from

different data sources that has to be integrated. Such

situations should be researched more closely.

REFERENCES

Abelló, A., Samos, J., and Saltor, F. 2001. Understanding

facts in a multidimensional object-oriented model. In:

Proc. of the 4th ACM international Workshop on Data

Warehousing and OLAP. ACM Press, 32-39.

Abelló, A., Martin C. 2003. A Bitemporal Storage

Structure for a Corporate Data Warehouse. In: Proc.

of the 5th Int. Conf. of Enterprise Information Systems

(ICEIS 2003), 177-183.

Chen, C., Cochinwala, M., and Yueh, E. 1999. Dealing

with slow-evolving fact: a case study on inventory

data warehousing. In: Proc. of the 2nd ACM

international Workshop on Data Warehousing and

OLAP. ACM Press, 22-29.

Eder, J. , Koncilia, C. , and Kogler, H. 2002. Temporal

data warehousing: business cases and solutions. In:

Proc. of the International Conference on Enterprise

Information Systems (ICEIS'02), Spain, 81-88.

Hüsemann, B. , Lechtenbörger, J., and Vossen., G. 2000.

Conceptual Data Warehouse Design. In Proc. of the

International Workshop on Design and Management

of Data Warehouses (DMDW 2000), CEUR-WS

(www.ceur-ws.org).

Inmon, W.H.. 1996. Building the Data Warehouse, John

Wiley.

Jarke, M., Lanzerini, M., and Vassiliou, Y. 2002.

Fundamentals of Data Warehouses, Berlin:Springer.

Kimball, R. 1996. The Data Warehouse Toolkit: Practical

Techniques for building Dimensional Data

Warehouses. Jon Wiley & Sons.

Kimball, R., Reeves, L., Ross, M., and Tthornthwaite, W.,

1998. The Data Warehouse Lifecycle Toolkit: Expert

Methods for Designing, Developing and Deploying

Data Warehouses, New York: Jon Wiley & Sons.

Kimball, R. and Ross M., 2002. The Data Warehouse

Toolkit: The Complete Guide to Dimensional

Modeling, John Wiley.

Song, I.-Y. , Rowen, W. , Medsker, C. , and Ewen,E.

2001. An Analysis of Many-to-Many Relationships

Between Fact and Dimension Tables in Dimensional

Modeling. In Proc. of the Int. Workshop on Design

and Management of Data Warehouses (DMDW'2001).

CEUR-WS (www.ceur-ws.org).

ICEIS 2006 - DATABASES AND INFORMATION SYSTEMS INTEGRATION

258