Software Forest: A Visualization of Semantic Similarities

in Source Code using a Tree Metaphor

Daniel Atzberger

∗

, Tim Cech

†

, Merlin de La Haye

†

, Maximilian S

ochting

†

, Willy Scheibel

∗

Daniel Limberger

∗

and J

urgen D

ollner

∗

Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany

†

{tim.cech, merlin.haye, maximilian.soechting}@student.hpi.de

Keywords:

Topic Modeling, Software Visualization, Source Code Mining.

Abstract:

Software visualization techniques provide effective means for program comprehension tasks as they allow

developers to interactively explore large code bases. A frequently encountered task during software development

is the detection of source code ﬁles of similar semantic. To assist this task we present Software Forest, a

novel 2.5D software visualization that enables interactive exploration of semantic similarities within a software

system, illustrated as a forest. The underlying layout results from the analysis of the vocabulary of the software

documents using Latent Dirichlet Allocation and Multidimensional Scaling and therefore reﬂects the semantic

similarity between source code ﬁles. By mapping properties of a software entity, e.g., size metrics or trend data,

to visual variables encoded by various, ﬁgurative tree meshes, aspects of a software system can be displayed.

This concept is complemented with implementation details as well as a discussion on applications.

1 INTRODUCTION

Program comprehension is of fundamental importance

during the entire software development process, es-

pecially in the maintenance phase. Getting a quick

overview over a program’s structure and functionali-

ties is necessary for further development. A question

developers frequently encounter is whether there are

already source code units of certain semantics in terms

of features, concerns, and techniques. The identiﬁca-

tion of such code units can support decisions on feature

location, encapsulation, and module architecture. Fur-

thermore, this information helps to identify associated

developers and could be used to, e.g., determine and

assign reviewers. This is a particular challenge for

historically grown legacy systems, systems developed

from distributed teams, or systems depending on a

large technology stack. In addition, the folder struc-

ture of a software project may deviate from its concep-

tual content over time. Various software visualization

techniques support developers by providing expressive

images whose layouts are inspired by maps known

from daily life. They promote shared mental images

that enable users to communicate more precisely and

gain insight into abstract data.

We present Software Forest, a novel 2.5D software

visualization technique for displaying semantic simi-

Figure 1: A software forest depicting 11 291 source code

units (top). Each unit is positioned based on its semantic sim-

ilarity w.r.t. to all other units. The unit’s dominant topic is

represented by the type of the tree (right, e.g., palm tree, de-

ciduous tree, conifer) and additional attributes are be mapped

to, e.g., tree health and tree growth (left).

larity of source code units. The semantic content of

software ﬁles is described as distributions over latent

topics hidden in the source code by applying Latent

Dirichlet Allocation (

LDA

). Multidimensional Scal-

ing (

MDS

) is used to compute a layout, where the

distance of points reﬂect their semantic similarity. Us-

ing a tree metaphor, Software Forest introduces various

112

Atzberger, D., Cech, T., de La Haye, M., Söchting, M., Scheibel, W., Limberger, D. and Döllner, J.

Software Forest: A Visualization of Semantic Similarities in Source Code using a Tree Metaphor.

DOI: 10.5220/0010267601120122

In Proceedings of the 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2021) - Volume 3: IVAPP, pages

112-122

ISBN: 978-989-758-488-6

Figure 2: Visualization of the Microsoft Calculator open-source project in our framework. Tree types are chosen based on the

topics, i.e., source code ﬁles with the same dominant topics are represented through the same tree type. Clusters of identical

tree types, i.e., source code ﬁles with identical dominant topics, can be easily recognized in this example.

visual variables for topic emphasis and additional data

display, e.g., software metrics, and offers possibilities

for displaying static as well as evolutionary data (Fig-

ure 1). The following contributions are made:

We present a layout algorithm based on the ap-

plication of

LDA

and

MDS

on source code ﬁles,

thereby mirroring the semantic relatedness of doc-

uments such as source code units.

We enlist and review visual variables inherent to

the tree metaphor and discuss possibilities and ap-

plications for software visualization tasks.

We provide technical notes on the implementation

of our visualization, discuss its features and lim-

itations, and provide a web-based prototype for

interactive exploration and evaluation.

The remainder of this paper is organized as follows.

Section 2 discusses background and related work. In

Section 3, we detail our layout algorithm. Section 4

presents our visualization concept using visual vari-

ables of trees (ﬁguratively). Section 5 highlights im-

plementation aspects regarding the data processing

pipeline and the visualization framework. Section 6

discusses application scenarios and limitations of the

current approach. Section 7 concludes this paper and

outlines direction for future work.

2 RELATED WORK

Related work is mainly identiﬁed in two research areas:

(1) preprocessing of software system data by means

of topic extraction and dimension reduction and (2)

thematic maps, map-based visualizations, and visual-

ization metaphors in general.

Data Processing.

Since a large part of a developer’s

knowledge of the project is directly encoded in the

source code through comments and identiﬁer names,

a layout origin from the vocabulary could reﬂect the

project’s semantic structure. This approach was pur-

sued for the ﬁrst time by Kuhn et al. (Kuhn et al.,

2008). They examine the natural language found in

source code to generate software visualization. In a

ﬁrst stage source ﬁles are parsed and a term-document-

matrix is created, which stores the frequency of each

word in a document. The high-dimensional document-

word-vectors, i.e., the columns of the term-document-

matrix, are projected to the plane by applying Latent

Semantic Indexing (

LSI

) (Deerwester et al., 1990) fol-

lowed by

MDS

(Cox and Cox, 2008). By creating

a height ﬁeld which represents the LOC, island-like

landscapes are created which allow semantic afﬁlia-

tions to be deduced. Another visualization of natural

language texts, which utilizes a cartographic metaphor,

can be found in (Skupin, 2004). The layout, which

is created by using a self-organizing map (Kohonen,

1997) on the term-document-matrix, is enriched with

a height ﬁeld. In their investigations the authors limit

themselves to abstracts from publications of the geo-

graphic domain. For the visualization of source code

this can be extended by suitable preprocessing.

The main difference between our work and the

previously presented is that our layout is based on

a probabilistic topic model,

LDA

speciﬁcally. It is

known that the pure Bag-of-Words (

BOW

) model is

not able to recognize synonymy and polysemy. Al-

though

LSI

can recognize synonyms, polysemy is still

an issue. Probabilistic models, e.g.,

LDA

can help

to overcome both problems. Besides the layout, we

can also map properties of the topic distribution to

visual variables of the trees (Figure 2). Since

LDA

has been used for numerous software mining tasks, it

is not surprising that the obtained results were repre-

sented by dimension reductions. For instance, Linstead

et al. used LDA and the author-topic model (Rosen-

Zvi et al., 2004) to detect semantic similarity between

Software Forest: A Visualization of Semantic Similarities in Source Code using a Tree Metaphor

113

documents and developers (Linstead et al., 2009). The

resulting document-topic distributions are displayed

by applying

MDS

to reduce them to two dimensions.

Nevertheless, this neglects the fact that the similarity

between topics among each other might differ. We

circumvent this issue by ﬁrst applying a dimension re-

duction to the topic distributions and then aggregating

the document vectors as linear combinations.

Another problem which adresses the question of

positionining elements in a way that reﬂects their se-

mantic similarity was presented in (Barth et al., 2014).

Barth et al. presented three novel layout algorithms for

word clouds, where the distance between the words

should capture their semantic relatedness. Our consid-

erations neglect the similarity of words and address the

representation of semantic similarity of documents.

Visualization Metaphors.

A majority of the under-

lying layouts of software visualizations are based on

the hierarchical folder structure (Holten et al., 2005;

Cornelissen et al., 2008). Treemaps are a widely used

technique in hierarchy visualization (Scheibel et al.,

2020) and are prominently used for software visualiza-

tion as they provide a wide variety of visual variables

(Limberger et al., 2019).

Besides classic visual variables like block height,

texture or color a 2.5D treemap offers creative ways for

visualizing dynamic properties of a software system.

An example is presented by W

urfel et al., who showed

how natural phenomena such as ﬁre and rain can be

used to visualize the evolution of source code artifacts

urfel et al., 2015). In this context, CodeCity (Wettel

and Lanza, 2007) and the Software City (Steinbr

uckner

and Lewerentz, 2013) are well known metaphors. By

mapping source code metrics, e.g., Lines-of-Code

(

LOC

), to visual variables of cuboids, e.g., the height,

visualizations are created, which remind optically of

modern cities. The approach was further developed

in later work and integrated into IDEs (Balogh et al.,

2015). Next to rectangular layouts, map-like shapes

are used in hierarchy visualization as well (Auber et al.,

2013). With focus on software visualization, the map

metaphor is extended to derive layouts from depen-

dencies (Hawes et al., 2015). The map metaphor is

further used in conjunction with oceans and islands

that are derived from Voronoi diagrams (Schreiber and

Misiak, 2018). Another approach for visualizing soft-

ware metrics inspired by nature are Software Feathers,

a visualization approach for object-oriented entities

like classes and interfaces with feathers (Beck, 2014).

The structure of feathers is determined by numerous

parameters, e.g., their size, shape, or texture, which

provides many possible mappings of software met-

rics. Similar to our forest and landscape visualization,

anek proposes a visualization primarily composed

of trees, wood cabins and islands (

anek, 2020).

The layout of source code units is calculated based on

a weighted graph, with ﬁles that reference each other

having a stronger pull toward one another. Different

ﬁle types are represented with different metaphors,

e.g., classes are represented with trees while value

types (structs) are visualized using small cabins.

Kleiberg et al. proposed another visualization

which mainly bases on the tree metaphor (Kleiberg

et al., 2001). The authors generate a botanic tree whose

branches and fruits display hierarchical structures. Our

work completely neglects the underlying hierarchical

structure of a software system provided by its folder

structure. Additionaly a single tree is used to display a

single ﬁle rather than an entire project.

3 LAYOUT COMPUTATION

The underlying layout of a software forest is con-

structed to reﬂect the semantic similarity of the source

code units, in our case ﬁles contained in a source code

repository. Thus, we analyze the use of natural lan-

guage in the source code. We ﬁrst preprocess the

corpus

C = {d

, . . . , d

}

of software ﬁles

, . . . , d

to get rid of the words in the vocabulary that do not

carry a semantic meaning. We then compute a formal

description of the documents as distributions over la-

tent topics using

LDA

. The resulting high-dimensional

vectors describing topics are reduced by

MDS

to two

components, which ﬁnally aggregate to the positions

of document vectors.

3.1 Preprocessing Data

Although source code resembles natural language in its

vocabulary, various preprocessing steps are required

to get rid of words that do not carry relevant seman-

tic meaning. For this, we implement an algorithm,

processing all documents as follows:

1. Removal of Non-text Symbols.

All special char-

acters such as dots and semicolons are replaced

with white spaces to avoid accidental connection

of words not meant to be combined. This includes

the splitting of identiﬁer names, e.g., the word

foo.bar gets split into foo and bar.

2. Split of Words.

Identiﬁers are split according to

delimiters and the camel case convention, e.g., Foo-

Bar is split into foo and bar, and stripped from

redundant white space subsequently.

3. Removal of Stop Words.

Stop words based on

natural language and keywords of programming

IVAPP 2021 - 12th International Conference on Information Visualization Theory and Applications

114

Table 1: Three exemplary topics extracted from Bitcoin Core (github.com/bitcoin/bitcoin) source code with K = 50.

Topic #1 Topic #2 Topic #3

Term Probability Term Probability Term Probability

std 0.070 thread 0.132 address 0.115

transaction 0.031 time 0.070 model 0.108

fee 0.027 queue 0.064 table 0.065

tx 0.026 std 0.054 label 0.051

ban 0.024 callback 0.040 qt 0.033

str 0.023 run 0.037 index 0.030

handler 0.016 call 0.025 dialog 0.024

output 0.016 mutex 0.021 column 0.024

bitcoin 0.015 scheduler 0.020 ui 0.021

reason 0.015 wait 0.018 role 0.019

language are removed as they carry no semantic

content. Additionally, we ﬁlter the input based on

a hand-crafted list comprising domain speciﬁc stop

words, data types, type abstractions etc.

4. Lemmatization.

To avoid grammatical diversions,

all words are reduced to their basic form, e.g., said

and saying are reduced to say.

After preprocessing, we store all documents as

BOW

, i.e., the order in which the words occur is ne-

glected and only their frequency is stored. We are not

introducing new identiﬁers for this notation and, when

using the term documents, always refer to their repre-

sentation as

BOW

. The term-document-matrix stores

these BOW vectors as its columns.

3.2 Latent Dirichlet Allocation

Given a corpus

C = {d

, . . . , d

}

of documents that

share a vocabulary

V = {w

, . . . , w

}

, a probabilistic

topic model extracts semantic clusters in its vocab-

ulary, so-called topics, as multinomial distributions

over the vocabulary

. Additionally each document is

described as a mixture of the extracted topics, which

describe its document’s semantic structure. Given a

number

of topics speciﬁed as a hyperparameter to-

gether with Dirichlet priors

α = (α

, . . . , α

)

, where

> 0

for all

1 ≤ i ≤ K

and

β = (β

, . . . , β

)

, where

> 0

for all

1 ≤ j ≤ N

that capture the document-

topic-distributions and the topic-term-distribution re-

spectively,

LDA

deﬁnes a fully generative model for

the creation of a corpus (Blei et al., 2003). The gener-

ation process for a document d operates as follows:

1. choose a distribution over topics θ ∼ Dirichlet(α)

2. for each word w in d

(a) choose a topic z ∼ Multinomial(θ)

(b)

choose the word

according to the probability

p(w|z, β)

Table 1 exemplarily shows the top 10 words of 3

manually selected topics applied on the corpus derived

from the source code of the Bitcoin Core. These often

suggest the concept underlying the topic, e.g., the third

topic covers user interaction. The number of topics

was chosen such that the extracted topics were easily

interpretable by humans.

3.3 Dimension Reduction

Given the extracted topics

, . . . , φ

and the

document-topic distributions

, . . . , θ

, we ﬁrst

project the topic vectors to two dimensions. We adapt

the approach by Sievert et al. (Sievert and Shirley,

2014), which is given by applying

MDS

to the disimi-

larity matrix derived from the Jensen-Shannon diver-

gence.

MDS

yields projections

, . . . ,

, such that

Figure 3: Ldavis applied on Bitcoin with

K = 50

. The

centers of the circles indicate the position of the topic-word

distributions after being reduced with

MDS

, the radii display

the relative importance of the topic for the entire corpus.

Software Forest: A Visualization of Semantic Similarities in Source Code using a Tree Metaphor

115

the euclidean distance between projected points re-

ﬂects the dissimilarity between the high-dimensional

points. A document

1 ≤ i ≤ m

viewed as

-length

vector θ

= (θ

(1)

, . . . , θ

(K)

), is then represented by the

convex linear combination

d, precisely

d =

∑

j=1

( j)

. (1)

A direct application of

MDS

to the document-topic

vector as done in (Linstead et al., 2009) neglects the

similarity of topics to each other as shown in Figure 3.

4 VISUALIZATION APPROACH

Since software has no intrinsic gestalt, our visualiza-

tion must project the properties of the processed topic

model data onto graphical primitives. This can be done

by utilizing its visual variables and either providing an

explanation, how the mapping of information to the

variables is done, or by making use of metaphors that

are known to the users (or both).

Tree Metaphor.

Our visualization applies a tree

metaphor: each source code unit is represented by

an individual tree. All source code units together make

up the software project and are depicted as a forest.

The trees provide different visual variables that can

be utilized, such as their type and height, the size of

their trunk and crown, or the amount and color of their

leaves. The position and displayed proximity of two

trees allows drawing conclusions about topic similarity

easily. In addition, we can rely on easy-to-understand

concepts such as health, age, or season of trees in or-

der to convey common software metrics such as test

coverage, change frequency, or size. Anomalies in

the source code ﬁles, such as many occurring errors

or rapidly changing topics, can be displayed by tree

stumps, or fallen or burning trees.

In our prototype, we decided to allow users to

choose their own mappings by listing available visual

variables and properties of the used source code enti-

ties. As an example, users can visualize the dominant

topic with the tree type, as can be seen in Figure 2,

and thus recognize which ﬁles contain similar topics.

Users can explore the 3D forest by rotating, pan, and

zoom using the virtual camera by means of a world-in-

hand navigation. To reduce rendering cost and main-

tain visual clarity we opted for simpler, low-poly tree

geometries instead of realistic tree rendering—we in-

tegrated and experimented with a technique for large

scale forest rendering ﬁrst (Bruneton and Neyret, 2012)

but found it difﬁcult to increase the visual expressive-

ness of individual trees. The resulting visualization

Figure 4: Visualization of the notepad-plus-plus

project.

The dominant topic of each source code ﬁle is visualized

with the tree type. The cursor hovers over a light green

tree and identical trees, i.e., similar source code ﬁles, are

highlighted. For the selected tree, information on the cor-

responding source code ﬁle, such as ﬁle path, lines of code

and its dominant topics, is shown above. In this example, the

pre-computed topics have been labeled manually to provide

comprehensible topic descriptions.

can be categorized as

⊕R

, i.e., three-dimensional

graphical primitives on a two-dimensional reference

space (D

ubel et al., 2014).

The visualization is also intended to support soft-

ware systems of more than 10 000 source code units

such as TensorFlow. This raises the issue that the visu-

alization of so many ﬁles and thus so many trees can

get confusing. To alleviate this issue, our visualiza-

tion emphasizes dense tree clusters by making use of

the terrain’s height. Each tree is assigned to a terrain

height depending on how many trees can be found in

its surroundings and trees in denser clusters then can

be found on hills. To be able to differentiate various

clusters, a water surface with adjustable height can

be used to separate different hills from one another.

Furthermore, users can receive information about each

tree via hovering over it. Depending on the zoom

level of the visualization, a tooltip is shown for the

hovered tree, containing the name and details of the

represented software code ﬁle or project, such as soft-

ware metrics or information about its topic distribution

(Figure 4). Trees with the same identical dominant

topic are highlighted to emphasize topic clusters.

Tree Geometry.

We use handcrafted geometry from

Sketchfab, rather than designing our own models or

generating geometry dynamically (Figure 5). While

the latter approach could introduce numerous varia-

tions, handcrafted geometry keeps the visualization

framework easily customizable and reduces the overall

complexity. Generating 3D trees at runtime would

allow data to be displayed in an arbitrary exact fash-

IVAPP 2021 - 12th International Conference on Information Visualization Theory and Applications

116

Figure 5: Examples of two handcrafted 3D tree model sets: top row shows a pine tree, modeled and hand painted in various

stages, sizes, and health conditions. The bottom row shows low-poly meshes of various tree types in color variation. The

models were purchased on Sketchfab: “HandPainted Pine Pack” by ZugZug and “Low Poly Nature Pack” by NONE.

ion. However, rendering highly individual meshes

increases the complexity for and requirements on web-

based implementations. In contrast, handcrafted mod-

els tend to increase the visual quality more and can

easily be instanced and rendered.

The model ﬁles are expected to contain multiple

variations of trees using scene-graph semantics. Fur-

thermore, for every tree, multiple variation of the tree

must be associated to a speciﬁc combination and range

of visual variables and data respectively, e.g., tree

growth requires information on which variations rep-

resent correspondingly large or small grown trees.

Each model ﬁle is stored with a description ﬁle

that captures each tree variant and its attributes used

in the visualization. Although automatic creation of

this description ﬁle is imaginable, e.g., through use

of heuristics or rule-based approaches, we manually

created our description ﬁles. In detail, each descrip-

tion ﬁle contains the following information: Each ﬁle

contains a list of tree types and variants along with an

overview of meta-information about the corresponding

model. The overview stores a reference to the model

ﬁle it describes, a list of the identiﬁers used for all vi-

sual variables labeled and information on how to scale

the whole scene to create a consistent visualization. By

storing an explicit reference to the model ﬁle, multiple

descriptions can be used for a single asset ﬁle, e.g.,

when asset ﬁles contain multiple, different 3D models.

The list of distinct variants follows a two-level

structure: On the top level, models are grouped by

their tree type, as this property separates fundamentally

different trees. Within each tree type, a generic repre-

sentative is named and a list of tree variants is recorded,

e.g., trees of type “birch” with different growth states.

Every tree variant assigns concrete manifestations in

https://github.com/notepad-plus-plus/

notepad-plus-plus

form of numeric values to each of its visual variables.

Models in the description ﬁle are referenced by their

scene identiﬁer from the model ﬁle scene graph. With

the description and the model ﬁle, data points in the

framework can be visualized using its models and all

visual variables of the ﬁle get shown to the user for at-

tribute mapping. Users can freely assign data attributes

to tree variables, e.g., by assigning the age of a soft-

ware project to the growth state of the trees. As each

set of models provides only a ﬁnite set of tree variants,

not every value combination can be expressed directly,

especially if multiple visual variables are used. We

employed a model selection algorithm to choose the

most suited tree model for each data point. To facili-

tate this, the algorithm rates each tree model based on

the difference between encoded values of each of the

used variables and their attributes.

Data Input.

The data processing on source code

repositories can take anything between hours and days,

thus we decoupled data processing and visualization

by storing the results in a structured, intermediate data

format with the following constraints:

• Each data point will be represented by one tree.

•

Each data point has attributes that represent a 2D

position to place the tree in the terrain.

•

Each data point might have additional attributes,

e.g., dominant topics, ﬁle name or metrics.

This separation allows for the visualization, to be used

with data from topic models as well as any input that

complies to these constraints, e.g., data from other

domains such as geo-referenced data.

Software Forest: A Visualization of Semantic Similarities in Source Code using a Tree Metaphor

117

5 IMPLEMENTATION

The implementation is separated into two distinct com-

ponent: data processing and visualization. The in-

terface between both components are structured ﬁles

where one ﬁle represents one data set. Thereby, the

structured data set uses the common CSV ﬁle format.

The data processing pipeline and the visualization ap-

proach are implemented as follows.

5.1 Data Processing

We use the Python package environment and the

GitHub API for Python for gathering our corpus. For

preprocessing, we use the nltk

and spacy

library

for lemmatization and obtaining a stop word list for

the English language. One of the bottle necks of our

data pipeline is the main memory. Large projects like

TensorFlow may exhaust the main memory especially

during lemmatization. Therefore, we chose to pro-

cess projects in batches and process them sequentially,

optimizing the preprocessing steps for this kind of

distributed computing. The results of this step were

cached and saved in text ﬁles that can be loaded during

runtime to avoid redoing this step for a project.

After ﬁnishing preprocessing, we used the gensim

implementation for LDA. It provides a convenient op-

timization pipeline with random search as well as deal-

ing properly with relatively large amounts of data. For

optimization and the dimension reduction algorithms,

we used the scikit-learn library

5.2 Visualization

The visualization component is implemented as a web-

based visualization system that processes the CSV ﬁles.

Thereby, the component is implemented using Type-

Script and WebGL to be used in modern web browsers.

As the browser integration and rendering is based in

the open-source framework webgl-operate

. It pro-

vides some abstraction in accessing WebGL rendering

functions, including some very high-level functionality

such as a ready-to-use label rendering pass or glTF

scene loading functionality. On top of this, the terrain

and tree rendering functionality are implemented as

extension to the framework.

https://www.nltk.org/

https://spacy.io/

https://radimrehurek.com/gensim/

https://scikit-learn.org/stable/index.html

https://webgl-operate.org

https://www.khronos.org/gltf/

Tree Geometry.

The visualization component han-

dles 3D models, i.e., the trees, encoded using the glTF

ﬁle format. To describe the models contained in the

scene graph, a hand-managed description ﬁle in JSON

format is provided by the visualization designer, as-

sociating visual variables with a number 3D models

in the scene graph. Each tree model is represented by

a scene graph node within the while glTF ﬁle. Since

trees may inherit a translation in the scene graph, the

translation is normalized and adjusted to the tree posi-

tion in the terrain for this visualization. Each tree has

a corresponding data structure in memory and on the

GPU as vertex buffers that contains all its attributes,

given through the current attribute mapping. When the

mapping changes, i.e., the user selects a different CSV

column for a tree attribute, the attributes are updated

and the vertex buffers are updated accordingly. After a

change in its attributes, the tree model for a data point

is determined by a best-ﬁt approach among the rule set

of the description ﬁle.

Terrain Generation.

When loading a data set, the

terrain is calculated based on the tree density. For

this, the terrain is subdivided into a grid with a ﬁxed

resolution. For each tree, a Gaussian kernel is applied

at its position, increasing the terrain height there and in

the vicinity. In cases where many trees are next to few

trees, the height is ﬂattened with a non-linear fall-off,

resulting in a ﬂatter, more plain terrain. The goal is

that the densest cluster on the whole terrain will reach

maximum terrain height while a single tree will still

have a clearly visible terrain height of ca. 30-50% of

the maximum terrain height. In addition to the hills

and beaches of the terrain, a static water texture is

rendered to give a sense of cohesion to clustered tree

groups. The water level is conﬁgurable in the user

interface and may be used for height-ﬁltering of data

points (Limberger et al., 2018).

Tree Placement.

Stating no additional assumptions

on the characteristics of the 2D layout of the data pro-

cessing pipeline, data points in the 2D layout may

overlap or may be positioned very close to each other.

Since the tree models have an actual and visible vol-

ume, these positions are not directly suitable for a clear

visualization and may lead to trees standing within

another. In order to prevent this, a basic collision

reduction algorithm is applied. First, the terrain is

subdivided into a grid with a ﬁxed resolution. Then,

each tree is assigned the closest position on this grid.

If the closest position is already taken, alternative posi-

tions in the vicinity are checked until a free position is

found. The alternative positions are looked up based

on a ﬁxed circle kernel in a clockwise direction, incre-

IVAPP 2021 - 12th International Conference on Information Visualization Theory and Applications

118

Figure 6: Four different assets comprised of different tree geometries and styles used for data mapping. The mapping, the

number of visual variables, and the overall expressiveness of the forest is highly dependent on the assets and its description.

mentally increasing the radius until the search kernel

is exhausted. In order to not make the clusters seems

artiﬁcial because of the introduced circle-like struc-

tures, a random offset based on the original position is

introduced. This results in the natural looking clusters

in comparison to sole square or circle kernels.

Rendering.

The tree rendering itself is implemented

using WebGL 2 and its instanced rendering capabili-

ties, i.e., the

drawElementsInstanced()

API. This

makes it possible to efﬁciently render a model multiple

times in the scene and allows to depict a large number

of data points. However, tree types and their visual

variables might change dynamically, thus, it is neces-

sary to adjust the buffers sent to the GPU frequently

since each tree might change its visual representation

on every change in the attribute mapping.

6 RESULTS & DISCUSSION

We built a renderer that runs in modern web browsers

and can thus be used on a variety of desktop and mobile

devices (Figure 7). Moderately sized software projects

(

10 000 entities) can be explored interactively with

an integrated graphics chip. With a dedicated graphics

card, even bigger projects (

100 000 entities) can be

visualized with interactive frame rates. In the visu-

alization, different presets are available that store a

Figure 7: Gui of our visualization running in a browser.

conﬁguration for the major user interface options and

can be used to get a ﬁrst impression of the framework

or to get a similar view on different data sets quickly.

Apart from that, every option can be adjusted to cus-

tomize the visualization: The data sets can be chosen

from a list of uploaded data sets or own data sets can

be uploaded. Additionally, the currently used model

ﬁles can be changed during runtime. The available

visual variables of the model ﬁle are displayed in the

user interface and can be assigned to data attributes.

The set of example use cases shown in this work un-

derline the capabilities of the proposed framework but

are not hard-wired in the structure of the framework.

Instead it is possible to easily adjust the visualization

for other needs if the constraints on the data format are

met, custom data ﬁles can be used and visualized.

6.1 Examples

Apart from the currently provided assets and their de-

scriptions, other glTF scenes can be used to visualize

the provided data (Figure 6). This effectively means

swapping trees with other objects and provide appro-

priate visual variables, e.g., buildings, street furniture,

rocks, ﬂowers, etc. The listing 1 shows an excerpt of

the description ﬁle used for Figure 1. Figure 8 shows a

forest of the notepad-plus-plus project. The resulting

topics of the software ﬁles are mapped to tree types

and 2D position. A cohesive cluster in the center made

up of tall, light green trees can be recognized. When

inspecting the trees of this cluster using the mouse

cursor, tooltips reveal that they share a dominant topic.

Analyzing the words this topic comprises, the ﬁles

can be related to user interface code. The topic model

is correct in this judgment since the ﬁles displayed

in this cluster indeed deal with the user interface of

notepad++.

In Figure 9, 40 different open-source projects are

visualized. The projects have been processed and their

topic weights have been determined. The Software

Forest allows to recognize similar projects, e.g., deep

learning projects. The cluster of red maple trees is

Software Forest: A Visualization of Semantic Similarities in Source Code using a Tree Metaphor

119

Figure 8: Software forest of notepad-plus-plus project.

composed of deep learning projects such as CNTK,

caffe, Pytorch, and others.

6.2 Limitations

For now, we manually labeled the topics in order to

give more context to users beyond the numerical topic

index and word vectors. Since the topics returned

by LDA are latent, they have no inherit label mecha-

nism (Blei et al., 2003). This labeling is an essential

part for making our visualization interpretable. There-

fore, efﬁcient ways to label topics automatically and

consistently should be researched, also drawing from

previous work and literature about this known problem

of latent topic modeling (Magatti et al., 2009).

The proposed work could beneﬁt from the usage

of more data sources. Similar to Linstead et al. bug

1 {

2 "modelFile": "BiomeTrees.glb",

3 "attributes": [ "health", "growth" ],

4 "modelScale": 1.0,

5 "types": [ ...

6 {

7 "name": "Tree_Palm",

8 "baseModel": "Tree_Palm_02",

9 "variants": [

10 {

11 "name": "Tree_Palm_03",

12 "health": 0.16,

13 "growth": 0.16

14 }, {

15 "name": "Tree_Palm_02",

16 "health": 0.5,

17 "growth": 0.83

18 }, ... ]

19 }, ... {

20 "name": "Tree_Birch",

21 "baseModel": "Tree_Birch_03",

22 "variants": [

23 {

24 "name": "Tree_Birch_04",

25 "health": 0,

26 "growth": 0.83

27 }, ... ]

28 } ]

29 }

Listing 1: JSON conﬁguration of a mapping from attribute

values to 3D model.

Figure 9: Software forest of 40 open-source projects.

reports could be considered for additional input as well

(Linstead et al., 2007). We did some ﬁrst experiments

using so called joint embeddings (Yu et al., 2017),

which can be used for the integration of more diverse

data sources, but more work needs to be done to con-

clude whether or not this is a promising approach.

When exploring large projects, dense clusters of

trees occur frequently. To improve the visual clarity of

the system, it would be possible to recognize clusters

and aggregate similar trees in the visualization. Hover-

ing over these trees or zooming into their region, the

trees clusters could be expanded again, or tooltips with

additional details could be shown.

7 CONCLUSIONS

The approach to visualize software data, i.e., a promi-

nent example of data without intrinsic gestalt, using

metaphors is often-tried but only scarcely used outside

of research communities. We try to address application

issues by proposing a metaphor close to nature and a

dedicated visualization approach. As such, we propose

Software Forest, a novel 2.5D software visualization

to interactively explore a software system illustrated

as a forest. The visual metaphor allows for multiple

visual variables, including, but not limited to, tree type,

growth state, height, and color. The layout of the forest

is derived from a source code document analysis us-

ing Latent Dirichlet Allocation and Multidimensional

Scaling. This layout reﬂects similarity in semantics

of source code by proximity, and, thus, allows for an

alternative mapping and view than traditional software

visualization approaches such as treemaps. The layout

is used for deriving a terrain and locations for the trees

for further attribute mapping. The rendering approach

takes different parameterization, different visual vari-

ables, and scalability into account. As such, using

different 3D models is feasible and requires only the

composition of a glTF scene graph and a description

ﬁle with information on individual models and speciﬁc

IVAPP 2021 - 12th International Conference on Information Visualization Theory and Applications

120

attribute mapping. The approach is exempliﬁed using

mid-sized open-source software development projects.

An application to the data sets indicates both an ef-

fective layout base on semantic similarity of source

code and an explorative visualization approach of the

resulting maps.

Future work can be discussed separately for layout

generation and the visualization approach. For lay-

out improvements, software similarity (Al-msie’deen

et al., 2013) and suitable dimensionality reduction

methods (Cha, 2007; Vernier et al., 2020) are target

for further research. The visualization approach needs

to be evaluated regarding readability using user stud-

ies. Regarding variability of the 3D models, tooling

on the glTF models would simplify the integration

of other 3D models, even for applications outside of

forest metaphors. As the input for the prototypical

viewer are CSV ﬁles with explicit layout and addi-

tional attributes as columns, we already imagine easy

visualization prototyping and adaption in other visual-

ization domains as well. Using a broader perspective,

the proposed visualization approach allows for visual-

ization of structured data sets with scatter-plot layout

and dynamic model mapping for a wide-spread use of

thematic mapping – the topic maps.

ACKNOWLEDGEMENTS

We want to thank the anonymous reviewers for their

valuable comments and suggestions to improve this ar-

ticle. Further, this work is part of the “Software-DNA”

project, which is funded by the European Regional De-

velopment Fund (ERDF – or EFRE in German) and the

State of Brandenburg (ILB) as well as the “TASAM”

project, which is funded by the German Federal Min-

istry for Economic Affairs and Energy (BMWi, ZIM).

REFERENCES

Al-msie’deen, R., Seriai, A.-D., Huchard, M., Urtado, C.,

and Vauttier, S. (2013). Mining features from the

object-oriented source code of software variants by

combining lexical and structural similarity. In Proc.

14th International Conference on Information Reuse &

Integration, IRI ’13, pages 586–593. IEEE.

Auber, D., Huet, C., Lambert, A., Renoust, B., Sallaberry,

A., and Saulnier, A. (2013). Gospermap: Using a

gosper curve for laying out hierarchical data. IEEE

Transactions on Visualization and Computer Graphics,

19(11):1820–1832.

Balogh, G., Szabolics, A., and Besz

edes,

A. (2015).

Codemetropolis: Eclipse over the city of source code.

In Proc. 15th International Working Conference on

Source Code Analysis and Manipulation, SCAM ’15,

pages 271–276. IEEE.

Barth, L., Fabrikant, S. I., Kobourov, S. G., Lubiw, A.,

ollenburg, M., Okamoto, Y., Pupyrev, S., Squarcella,

C., Ueckerdt, T., and Wolff, A. (2014). Semantic word

cloud representations: Hardness and approximation

algorithms. In Latin American Symposium on Theoret-

ical Informatics, pages 514–525. Springer.

Beck, F. (2014). Software feathers - ﬁgurative visualiza-

tion of software metrics. In Proc. 5th International

Conference on Information Visualization Theory and

Applications - Volume 1: IVAPP, IVAPP ’14, pages

5–16. INSTICC, SciTePress.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent

dirichlet allocation. Journal of Machine Learning Re-

search, 3:993–1022.

Bruneton, E. and Neyret, F. (2012). Real-time realistic ren-

dering and lighting of forests. Computer Graphics

Forum, 31(2pt1):373–382.

Cha, S.-H. (2007). Comprehensive survey on distance/simi-

larity measures between probability density functions.

International Journal of Mathematical Models and

Methods in Applied Sciences, 1(4):300–307.

Cornelissen, B., Zaidman, A., Holten, D., Moonen, L.,

Deursen, A., and van Wijk, J. (2008). Execution trace

analysis through massive sequence and circular bundle

views. Journal of Systems and Software, 81:2252–

2268.

Cox, M. A. and Cox, T. F. (2008). Multidimensional scaling.

In Handbook of Data Visualization, pages 315–347.

Springer.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer,

T. K., and Harshman, R. (1990). Indexing by latent

semantic analysis. Journal of the American Society for

Information Science, 41(6):391–407.

ubel, S., R

ohlig, M., Schumann, H., and Trapp, M. (2014).

2d and 3d presentation of spatial data: A systematic

review. In Proc. VIS International Workshop on 3DVis,

3DVis ’14, pages 11–18. IEEE.

Hawes, N., Marshall, S., and Anslow, C. (2015). Codesur-

veyor: Mapping large-scale software to aid in code

comprehension. In Proc. 3rd Working Conference on

Software Visualization, VISSOFT ’15, pages 96–105.

IEEE.

Holten, D., Vliegen, R., and van Wijk, J. (2005). Visual real-

ism for the visualization of software metrics. In Proc.

3rd International Workshop on Visualizing Software

for Understanding and Analysis, VISSOFT ’05, pages

1–6. IEEE.

Kleiberg, E., van De Wetering, H., and Van Wijk, J. J. (2001).

Botanical visualization of huge hierarchies. In Informa-

tion visualization, IEEE symposium on, pages 87–87.

Kohonen, T. (1997). Exploration of very large databases

by self-organizing maps. In Proc. International Con-

ference on Neural Networks, ICNN ’97, pages 1–6.

IEEE.

Kuhn, A., Loretan, P., and Nierstrasz, O. (2008). Consistent

layout for thematic software maps. In Proc. 15th Work-

ing Conference on Reverse Engineering, WCRE ’08,

pages 209–218. IEEE.

Software Forest: A Visualization of Semantic Similarities in Source Code using a Tree Metaphor

121

Limberger, D., Scheibel, W., D

ollner, J., and Trapp, M.

(2019). Advanced visual metaphors and techniques for

software maps. In Proc. 12th International Symposium

on Visual Information Communication and Interaction,

VINCI ’19, pages 11:1–11:8. ACM.

Limberger, D., Trapp, M., and D

ollner, J. (2018). Interactive,

height-based ﬁltering in 2.5d treemaps. In Proc. 11th

International Symposium on Visual Information Com-

munication and Interaction, VINCI ’18, pages 49–55.

ACM.

Linstead, E., Bajracharya, S., Ngo, T., Rigor, P., Lopes, C.,

and Baldi, P. (2009). Sourcerer: mining and searching

internet-scale software repositories. Data Mining and

Knowledge Discovery, 18(2):300–336.

Linstead, E., Rigor, P., Bajracharya, S., Lopes, C., and Baldi,

P. (2007). Mining eclipse developer contributions via

author-topic models. In Proc. 4th International Work-

shop on Mining Software Repositories, MSR ’07, pages

30–34. IEEE.

Magatti, D., Calegari, S., Ciucci, D., and Stella, F. (2009).

Automatic labeling of topics. In Proc. 9th Interna-

tional Conference on Intelligent Systems Design and

Applications, ISDA ’09, pages 1227–1232. IEEE.

Rosen-Zvi, M., Grifﬁths, T., Steyvers, M., and Smyth, P.

(2004). The author-topic model for authors and doc-

uments. In Proceedings of the 20th Conference on

Uncertainty in Artiﬁcial Intelligence, UAI ’04, pages

487–494. AUAI Press.

Scheibel, W., Trapp, M., Limberger, D., and D

ollner, J.

(2020). A taxonomy of treemap visualization tech-

niques. In Proc. 15th International Joint Conference

on Computer Vision, Imaging and Computer Graphics

Theory and Applications – Volume 3: IVAPP, IVAPP

’20, pages 273–280. INSTICC, SciTePress.

Schreiber, A. and Misiak, M. (2018). Visualizing software ar-

chitectures in virtual reality with an island metaphor. In

Proc. International Conference on Virtual, Augmented

and Mixed Reality: Virtual, Augmented and Mixed

Reality: Interaction, Navigation, Visualization, Em-

bodiment, and Simulation, VAMR ’18, pages 168–182.

Springer.

Sievert, C. and Shirley, K. (2014). Ldavis: A method for

visualizing and interpreting topics. In Proc. Workshop

on Interactive Language Learning, Visualization, and

Interfaces, pages 63–70. ACL.

Skupin, A. (2004). The world of geography: Visualizing a

knowledge domain with cartographic means. Proceed-

ings of the National Academy of Sciences, 101(suppl

1):5274–5278.

Steinbr

uckner, F. and Lewerentz, C. (2013). Understanding

software evolution with software cities. Information

Visualization, 12(2):200–216.

anek, A. (2020). Procedurally generated landscape as

a visualization of C# code. Technical report, Masaryk

University, Faculty of Informatics. Bachelor’s thesis.

Vernier, E. F., Garcia, R., Silva, I. P. d., Comba, J. L. D.,

and Telea, A. C. (2020). Quantitative evaluation

of time-dependent multidimensional projection tech-

niques. Computer Graphics Forum, 39(3):241–252.

Wettel, R. and Lanza, M. (2007). Visualizing software sys-

tems as cities. In Proc. International Workshop on

Visualizing Software for Understanding and Analysis,

VISSOFT ’07, pages 92–99. IEEE.

urfel, H., Trapp, M., Limberger, D., and D

ollner, J. (2015).

Natural phenomena as metaphors for visualization of

trend data in interactive software maps. In Proc. Con-

ference on Computer Graphics and Visual Computing,

CGVC ’15, pages 69–76. EG.

Yu, J., Jian, X., Xin, H., and Song, Y. (2017). Joint embed-

dings of chinese words, characters, and ﬁne-grained

subcharacter components. In Proc. Conference on

Empirical Methods in Natural Language Processing,

pages 286–291. ACL.

IVAPP 2021 - 12th International Conference on Information Visualization Theory and Applications

122