Doctoral Thesis

A Visual Analysis System for Hierarchical Ensemble Data

Matthias Thurau

Computer and Cognitive Sciences (INKO), University of Duisburg-Essen, Duisburg, Germany

1 STAGE OF THE RESEARCH

My research is done during a project with the steel

making industry in cooperation with the university of

Duisburg-Essen. We cooperated with a big German

steel production facility to digitize samples of steel to

control and enhance steel quality. While the project

included various areas of research for four doctoral

students, my part is the visual representation of re-

sults. I am working on the visual analysis system

since two years and have planned another year to ﬁn-

ish the implementation, evaluation and doctoral thesis

writing.

2 OUTLINE OF OBJECTIVES

Steel making is a very complex process consisting of

various stages. At each stage, the production param-

eters can vary to fulﬁll the wishes of differing cus-

tomers. There are thousands of grades of steel, each

having specialized properties relating to corrosion,

heat resistance, deformability, welding quality, costs

and so forth. To fulﬁll these differing requirements,

variations occur in the production process. There may

be variations in the process ﬂow, whether intention-

ally such as varying the number of production steps

in order to reduce costs, which would normally affect

the purity of the steel or unintentionally through mal-

functions. Also, there are variations in the process

parameters, like different melting temperatures, dif-

ferent material ingredients and the different timings

and durations of each production step. Additionally,

the process is subject to various natural ﬂuctuations

that have an impact on the outcome. While the smelt-

ing furnace should be heated to a certain temperature,

it may ﬂuctuate by several degrees and may thus af-

fect the outcome.

The outcome is measured in the form of a multi-

dimensional data set for a sample of the ﬁnished steel

slab. The steel-making facility is digitizing scientiﬁc

volume data about defects found in the steel. These

defects can be impurities in the form of nonmetallic

inclusions, argon bubbles or cracks. The volume data

is analyzed in a preprocess to create shape descriptors,

which can be analyzed much faster than the original

volume data.

Summarizing said, there are hundreds of input pa-

rameters, whether desired or undesired through un-

certainty, that inﬂuence the outcome. That outcome

again is highly complex and huge. The outcome

is hierarchical, multidimensional, multivariate, mul-

timodal and hierarchical. This is highly compara-

ble to typical ensemble data sets which often come

from simulations, like climate research (Nocke et al.,

2007). However, ensemble data from simulations of-

ten include timevarying results, taken from different

time-steps of the simulation. The dimension of time

is not supported by my system, as the data set is not

computer-simulated and thus only analyzes a single

measured outcome.

Figure 1: The system architecture has a data tree in the cen-

ter, the main data model. Several data interactions are avail-

able for manipulating that tree.

Fig. 1 shows the data model. The hierarchical data

sets consists of three levels, named melting charges,

samples and defects in our use case scenario. There is

a level of traversal t, which is the hierarchical level of

the nodes to compare. As an example, we can com-

pare samples with each other, but not samples with

defects as that would not make much sense.

3 RESEARCH PROBLEM

The goal of the analysis of our ensemble data set is

to identify the signiﬁcance of various input parame-

ters on the output (sensitivity analysis), analyze those

changes in detail (trend analysis), ﬁnd relationships

between different output variables (dimensions), an-

Thurau M..

Doctoral Thesis - A Visual Analysis System for Hierarchical Ensemble Data.

 2014 SCITEPRESS (Science and Technology Publications, Lda.)

alyze the range of outcomes (uncertainty) and ﬁnd

anomalies and outliers within the samples for qual-

ity control purposes. Some of the expected results

are the following. Defects ﬂoating in the liquid steel

may have an ascending force, like that of bubbles in

sparkling water. As a result, since the border of the

steel slab solidiﬁes ﬁrst when it meets a lower ambi-

ent temperature, defects should be trapped in the so

called inclusion band, which is located in the upper

part of the slab. The larger a defect is, the greater

its ascending force and therefore the higher its posi-

tion in the inclusion band. Other properties that may

inﬂuence the position of defects are sphericity (form

descriptor), type of defect, and properties inherited

from the melting charge, such as material ingredients

or the duration of the oxygen blowing process. With

certain types of defects, the defects size may corre-

late with its sphericity, again like bubbles in sparkling

water. The bigger the defect is, the more spherical it

may be. A more complex relation may be the melt-

ing temperature in combination with the defects po-

sition. Because initial temperature determines how

long steel remains molten, the degree of the inﬂuence

of the defects size on its position may vary. Several of

these effects and relationships can be analyzed with

my system.

4 STATE OF THE ART

4.1 Visualizing the Complex Data

Hierarchy

One possibility is to visualize a selected level of

traversal t with some kind of visualization. For in-

stance, a histogram is shown which summarizes a sin-

gle dimension of all the nodes on level t, e.g., the

cleanliness of all samples. This actually means, that

each node on level t is simpliﬁed so that it can be vi-

sualized. This of course can be very beneﬁcial, as

some of the ”unimportant” dimensions aren’t shown

and thus won’t distract the end user. I call this kind of

visualization a level overview visualization as it gives

a brief summary of the whole level t. Another exam-

ple is the visualization of a two dimensional graph to

reveal the inﬂuence of an input parameter to an out-

come dimension (trend analysis), e.g., the steel clean-

liness over the smelting temperature. While my sys-

tems allows the visualization of a level overview, it is

not the focus of my research.

Small multiples (Tufte, 1990), on the other hand,

are well known to visualize multiple nodes using the

data of lower hierarchical levels, e.g., each sample is

represented by a histogram of the volumes of the de-

fects found in it. This is actually a more detailed way

to visualize the complete level t. I applied that idea

and extended the small multiples to ”small multiples

of multiple views”.

Finally, there exist many visualization systems to

analyze such data sets. A single node is a com-

plex data structure, because it has context information

available and also multiple child-nodes consisting of

various dimensions and modalities. Hierarchical visu-

alizations, like treemaps, could reveal the hierarchical

structure but that would not be of interest here as there

is a ﬁxed hierarchy known by the end users. State of

the art are visualization systems, which allow the se-

lection of hierarchical levels and data dimensions in

combination to a visualization type (histogram, graph,

etc.) to get a visualization, or view, on a selected slice

of the data. Multiple of those visualizations and views

(Multiple Views) can be arranged side by side to sup-

port the data analysis in more detail and are today’s

state of the art. Further more, linking and brushing

abroad single views are used to help analyzing the

data further more.

4.2 First Contribution

I retain and support the state of the art of general

purpose multiple view systems to get an insight of

a complex node, like one sample. Users can build

their own MultiView-layouts using various visual-

ization types to visualize different dimensions and

hierarchical levels at once (Wang Baldonado et al.,

2000). My added contribution is, that the user-created

MultiView-layout is reused multiple times to visual-

ize multiple nodes side by side. Therefore, the user is

encouraged and supported to conﬁgure the layout in

such a way, that the speciﬁc data from a node of level

t is visualized, as shown in ﬁg. 2 and 4.

There is some text-based information from par-

ent nodes (melting charges), some statistical visual-

izations summarizing lower nodes (piechart of defect

types, etc.) and scientiﬁc volume visualizations from

especially dangerous defects found through simple

data mining techniques.

The expected outcome is a huge enhancement in

data selection, because the typical node selection of

state of the art systems uses some text-based explorers

or smaller simpliﬁed visualizations. My system visu-

alizes the nodes fully so that the users can search vi-

sually by scrolling through the nodes, comparable to

small multiples. It is able to seamlessly adapt to dif-

ferent roles, whether the analysis of a single ensemble

member (typically a large coordinated multiple view)

or the trend analysis of multiple ensemble members

DoctoralThesis-AVisualAnalysisSystemforHierarchicalEnsembleData

Figure 2: This screenshot shows the level of traversal set to samples. The samples of the current data tree are visualized. Each

sample is presented in multiple views consisting of various visualization types and data from different hierarchical levels

(context+detail). Level overview visualizations are positioned on the left side. By scrolling further to the right, more samples

can be seen.

Figure 3: A layout for small multiples consisting of only

one view per node.

(typically small multiples using a single small visu-

alization, ﬁg. 3). For that reason, my system allows

layouts to be saved for reuse at a later time. Changing

the layouts is done by clicking its name on the top of

the screen. Multiple predeﬁned layouts are common

in systems like Eclipse

or Microsoft Visual Studio

Also, I expect great results for trend analysis by

resorting and ﬁltering the data on different hierar-

chical levels. For that reason, I am thinking about

adding intelligent or augmented scrollbars, that show

additional information about the current nodes visible

(McCrickard and Catrambone, 1999). As the amount

of nodes may exceed a practical amount to study,

users can adjust the system to only show a desired

amount of percentiles of the nodes of level t. Also,

users can aggregate nodes to study groups of nodes as

described next.

http://www.eclipse.org/

http://www.microsoft.com/visualstudio/

4.3 Visualizing Ensemble Data

When dealing with ensemble data or multi-run data

(Wilson and Potter, 2009), one of the main tasks is

to compare. How does one ensemble member differ

from other ensemble members? Which input param-

eters led to the ”best” ensemble member? State of the

art is to visualize reference data into the visualizations

to see a single ensemble member in comparison to the

whole ensemble (Kehrer and Hauser, 2013). As refer-

ence data consists mostly of multiple nodes, the visu-

alizations are enhanced by techniques know from un-

certainty visualization (Pang et al., 1996), e.g., each

bin of a histogram is enhanced by a whisker-boxplot

showing the the reference ensemble members in com-

parison (Mcgill et al., 1978). These visualizations

do not only show a single value, but a range of val-

ues. That range can be shown with varying precisions.

While averaged values may be enough for some anal-

ysis, other analysis may demand for a ﬁve-number

summary (minimum, lower quartile, median, upper

quartile, maximum) or even the full reference data

by visualizing them altogether into a single visualiza-

tion (like done with families of curves (Konyha et al.,

2012)).

On the other hand, ensemble members of inter-

est can be aggregated to form a group. That way, a

group of ensemble members can be compared to an-

other group of ensemble members. This means in par-

ticular, two groups of ensemble members are to be vi-

sualized side by side using uncertainty visualization

techniques.

VISIGRAPP2014-DoctoralConsortium

Figure 4: This screenshot shows the level of traversal set to defects. The user created a layout to see the volume data and

some meta information about the defect.

4.4 Second Contribution

Through the way of dealing with multiple data nodes

by visualizing multiple MultiView-layouts side by

side, there came the idea to aggregate those layouts

to visualize groups of nodes. This way, user see vari-

ous groups side by side on the screen, each group vi-

sualized by a MultiView-layout. This is of particular

interest for groups of samples having the same pro-

cess input parameter. The translation of a single node

layout to a group layout can be done mostly auto-

matic. For Instance, when the layout of a sample used

a traditional histogram of defect-volumes, a layout

for a group of samples will show a histogram based

on whisker-boxplots. Basically, every visualization

type is enhanced to show uncertainties that originate

through the groups or aggregates of nodes. Those en-

hancements will be implemented for various visual-

ization forms, like pie charts, text areas, histograms,

graphs and scatter plots (Pang et al., 1996). For other

forms, like volume data visualization, I will refer to

literature on how to visualize aggregated volume data

(Rhodes et al., 2003).

Figure 5: Several samples presented by histograms. The

red bars visualize the distribution of the defects sphericity

in the sample, and the grey bars visualize the same data di-

mension from reference data (in this case, the aggregation

of all samples of the same steel grade as this particular sam-

ple). While the red bars show original data of a sample, the

grey bars represent averaged data from the reference group.

Figure 6: Several data interaction techniques are available

to change the data model tree.

The complete set of data interaction techniques is

shown in ﬁg. 6. Data aggregation can be done based

on categorical numbers or strings, like the steel type

(like SQL GROUP BY). Visualizing groups based on

the steel types makes much sense, as each steel type

uses a different set of input parameters and therefore

the outcomes per steel type should be more equally

and comparable. Data aggregation on dimensions of

cardinal numbers can be achieved by putting them

into groups of intervals, whether with equally divided

interval ranges (e.g. melting temperatures 1000-1100;

1100-1200; ...) or with equally divided number of

nodes in each interval (e.g. melting temperatures

1000-1059; 1059-1099; ...). Here again, by using the

data interaction technique hierarchical sorting, users

can sort those groups, for instance by the average de-

fect volume, and thus analyze trends.

The interactive reference selection is also a new

contribution in this context. The user can conﬁgure a

list of dimensions that have to be equal. For instance,

the user may ask for the steel-type and the melting-

temperature (+/- 100) to be equal to be matched into

the reference data of the corresponding node. That

kind of input-methodology is used, as each node or

group of nodes has a different steel-type and melting-

temperature and thus asks for a different set of ref-

erence data. By visualizing a single MultipleView-

layout, the user could just select reference data di-

rectly, for instance through selection with the mouse.

DoctoralThesis-AVisualAnalysisSystemforHierarchicalEnsembleData

However, the layout descriptions for my system us-

ing multiples of MultiView-layouts asks for a more

general input methodology so that the layouts can be

reused for different nodes.

5 METHODOLOGY

Most of the described ideas will be implemented so

that evaluations with user and experts can take place.

I plan to publish a systematic evaluation as part of

my doctoral thesis to identify the strengths and weak-

nesses of my approach in detail.

We work with very large data sets. This makes the

system unusable for fast and interactive exploration

and analysis when not dealt with. The large amount

of visualizations is one of the smaller problems. Each

visualization can be rendered to a texture, so that the

translation of the screen (scrolling) only demands a

redraw of that texture. The visualization is only fully

redrawn, when users interact in a more complex way

(like rotating). Interaction on views can have different

kind of complexities:

• not linked: only the interacted visualization will

change

• locally linked: multiple visualizations within the

same node (single MultiView-layout) will change

• globally linked: multiple visualizations from all

nodes will change

However, redraw-operations have to be done

screen-wide only. When a visualization is not visi-

ble, the new interactions can be applied as soon as it

comes into the visible area. As the MultiView-layouts

have a ﬁxed size, the scrollbar-position can be used to

calculate the corresponding node-id very fast.

One of the bigger problems is the visualization

of reference data and aggregated nodes. When data

groups use much of the available data sets at once,

data preparation need much time. Therefore, I plan to

implement the visualization of aggregated nodes and

reference data with a incremental visualization, where

nodes are added to the visualization data one at a time

(Fisher et al., 2012).

6 EXPECTED OUTCOME

As a part of this system is already implemented, I al-

ready collected feedback from the staff of the steel

production facility and also from information visu-

alization experts. Reactions were positive. The vi-

sual search capabilities proved to be very useful. The

staff had originally worked with reports that had to

be opened one at a time. With my system, they were

able to create a node layout that equals their previ-

ous full data report. They can now search much faster

for speciﬁc characteristics and outliers in the reports

through the side-by-side visualizations, which also in-

clude reference data. By sorting, they were able to

analyze trends and by ﬁltering, determine the require-

ments for certain trends and analysis results. Chang-

ing the node layout to visualize only a limited aspect

or dimension of the data set is beneﬁcial, in that it

speeds up checking for repeatability and trend analy-

sis. However, users had difﬁculty locating the exact

layout they had previously saved. They had the lay-

out appearance in mind but not necessarily its name

they used to save it. Regarding the more complex use

cases, we plan a user evaluation in future research.

When a multiple view layout of one node reveals a

relationship between data dimensions, how can users

perceive trends between multiple nodes within that re-

lationship? For instance, how does the inﬂuence of a

defects size on its position change as the temperature

of the melting charge increases? The inﬂuence may

be higher at higher melting temperatures.

REFERENCES

Fisher, D., Popov, I., Drucker, S., and schraefel, m. (2012).

Trust me, i’m partially right: incremental visualiza-

tion lets analysts explore large datasets faster. In Pro-

ceedings of the SIGCHI Conference on Human Fac-

tors in Computing Systems, CHI ’12, pages 1673–

1682, New York, NY, USA. ACM.

Kehrer, J. and Hauser, H. (2013). Visualization and visual

analysis of multifaceted scientiﬁc data: A survey. Vi-

sualization and Computer Graphics, IEEE Transac-

tions on, 19(3):495–513.

Konyha, Z., Le

z, A., Matkovi

c, K., Jelovi

c, M., and Hauser,

H. (2012). Interactive visual analysis of families

of curves using data aggregation and derivation. In

Proceedings of the 12th International Conference on

Knowledge Management and Knowledge Technolo-

gies, i-KNOW ’12, pages 24:1–24:8, New York, NY,

USA. ACM.

McCrickard, D. and Catrambone, R. (1999). Beyond the

scrollbar: an evolution and evaluation of alternative

navigation techniques. In Visual Languages, 1999.

Proceedings. 1999 IEEE Symposium on, pages 270–

277.

Mcgill, R., Tukey, J. W., and Larsen, W. A. (1978).

Variations of box plots. The American Statistician,

32(1):12–16.

Nocke, T., Flechsig, M., and Bohm, U. (2007). Visual ex-

ploration and evaluation of climate-related simulation

data. In Simulation Conference, 2007 Winter, pages

703–711.

VISIGRAPP2014-DoctoralConsortium

Pang, A. T., Wittenbrink, C. M., and Lodh, S. K. (1996).

Approaches to uncertainty visualization. The Visual

Computer, 13:370–390.

Rhodes, P. J., Laramee, R. S., Bergeron, R. D., and Sparr,

T. M. (2003). Uncertainty visualization methods in

isosurface volume rendering. In Eurographics 2003,

Short Papers, pages 83–88.

Tufte, E. (1990). Envisioning information. Graphics Press,

Cheshire, CT, USA.

Wang Baldonado, M. Q., Woodruff, A., and Kuchinsky, A.

(2000). Guidelines for using multiple views in infor-

mation visualization. In Proceedings of the working

conference on Advanced visual interfaces, AVI ’00,

pages 110–119, New York, NY, USA. ACM.

Wilson, A. T. and Potter, K. C. (2009). Toward visual anal-

ysis of ensemble data sets. In Proceedings of the 2009

Workshop on Ultrascale Visualization, UltraVis ’09,

pages 48–53, New York, NY, USA. ACM.

DoctoralThesis-AVisualAnalysisSystemforHierarchicalEnsembleData