Constrained Agglomerative Hierarchical Software Clustering with

Hard and Soft Constraints

Chun Yong Chong and Sai Peck Lee

Department of Software Engineering, Faculty of Computer Science and IT, University of Malaya, Kuala Lumpur, Malaysia

Keywords: Agglomerative Hierarchical Clustering, Constrained Clustering, Reverse Engineering.

Abstract: Although agglomerative hierarchical software clustering technique has been widely used in reverse

engineering to recover a high-level abstraction of the software in the case of limited resources, there is a

lack of work in this research context to integrate the concept of pair-wise constraints, such as must-link and

cannot-link constraints, to further improve the quality of clustering. Pair-wise constraints that are derived

from experts or software developers, provide a means to indicate whether a pair of software components

belongs to the same functional group. In this paper, a constrained agglomerative hierarchical clustering

algorithm is proposed to maximize the fulfilment of must-link and cannot-link constraints in a unique

manner. Two experiments using real-world software systems are performed to evaluate the effectiveness of

the proposed algorithm. The result of evaluation shows that the proposed algorithm is capable of handling

constraints to improve the quality of clustering, and ultimately provide a better understanding of the

analyzed software system.

1 INTRODUCTION

Software requires continuous change and

enhancement to satisfy new business rules and

technologies. This is a human intensive task that

requires deep understanding and comprehension of a

software before any decision to modify it. Therefore,

software maintainers must first gain a complete

understanding of the structure and behavior of the

software to be maintained before making any major

changes. However, most software that had undergone

instant changes does not have up-to-date

documentation. Thus, software maintainers may need

to reverse engineer the source code to gain a high-

level abstraction view of the software. Software

clustering is one of the techniques used to recover a

semantic representation of the software design and

documentation. It has received a substantial attention

in recent years because of its capability to help in

improving the modularity of poorly designed

software systems.

However, in certain scenarios, software

maintainers may have access to additional

information or domain knowledge about the software

to be maintained. For instance, the core business

rules and functionalities of a software remain

unchanged after undergoing several major updates,

or stakeholders have additional knowledge about the

software because they were involved in the early

stage of software development. Thus, even if the

software documentation is not up-to-date,

maintainers are able to salvage some useful

information about the structure of the software.

However, such information are worthless unless

there is a proper way to synthesis them.

An improvement to conventional clustering

techniques was proposed in (Basu et al., 2004) by

incorporating side information to further improve the

accuracy of clustering results. The side information is

commonly referred as “constraints” which reveal the

similarity between pairs of clustering entities, or user

preferences about how those entities should be

grouped during clustering.

It has been proven in several fields of research

that constrained clustering can significantly improve

the reliability and accuracy of clustering results

(Davidson and Ravi, 2009). However, there is still a

lack of studies on integrating constrained clustering

to effectively improve the modularity of poorly

designed software. In cases where end-users or

developers have side information regarding the

software to be maintained, the relevant knowledge

177

Chong C. and Lee S..

Constrained Agglomerative Hierarchical Software Clustering with Hard and Soft Constraints.

DOI: 10.5220/0005344001770188

In Proceedings of the 10th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE-2015), pages 177-188

ISBN: 978-989-758-100-7

 2015 SCITEPRESS (Science and Technology Publications, Lda.)

can help improve the results of clustering by the

means of constraints.

In this study, we focus on fulfilling different

types of constraints using agglomerative hierarchical

clustering. Agglomerative hierarchical clustering is a

bottom-up approach that iteratively merge pair of

clusters until all clusters are merged into a big

cluster. Unlike the work by (Davidson and Ravi,

2009) which solely imposed absolute constraints

(must be fulfilled regardless of any situation), the

constrained clustering technique to be presented in

this paper involves several types of constraints which

differ according to their importance, i.e. must be

fulfilled, or good to have. As such, a constrained

clustering algorithm is introduced in this study to

help ease the software maintenance problem when

the system documentation is non-existent or

inconsistent with real implementation.

2 RELATED WORK

Software maintenance and verification is crucial to

discover and validate the relationship between

technology and business models. How well a certain

software fulfills stakeholders’ requirements often

depicts the effectiveness of the software.

A major fraction of software life cycle’s

expenditure is contributed by software maintenance

and support. The authors estimated that over 50% of

software development budget is spent on maintaining

and supporting the software itself. This shows that

maintenance indeed plays a very important role in the

software life cycle. Before performing any relevant

maintenance work, the person in charge must first

gain a complete understanding of the particular

software. This is to ensure that when there are

requests to update a particular functionality,

maintainers can recognise in advance, the interrelated

components. Eventually, the threats of introducing

faults and bugs during maintenance can be

minimized. Therefore, maintainers often need to

spend a significant amount of time in comprehending

the structure and design of the software.

Accomplishment of maintenance is highly dependent

on how much information can be extracted by

maintainers.

However, maintainers are typically not involved

in early stages of software design and development.

Furthermore, the documentation of the software is

usually not up-to-date, especially for projects that

follow agile software development where software

requirements and solutions evolve rapidly. Thus,

maintainers require additional options. One way to

alleviate these problems is through remodularization

of software, which is one of the reverse engineering

techniques used to help recover a semantic

representation of the design and documentation.

Software clustering is one of the techniques used

to perform remodularization of software. The goal of

clustering is to form multiple groups of clusters, such

that components within the same group are similar to

each other, and dissimilar from components in other

groups. The measurement of similarities between

components is based on inter-relationships between

components or common features shared by them

(Maqbool and Babri, 2007). Mutual exclusive groups

of components can be identified to provide more

insight toward the analyzed software. The results of

clustering can also help maintainers to understand the

behavior and dependencies of programs, identify

orphaned source code, and allow adding of new

software modules without interfering with the

general workflow of the software system (Antonellis

et al., 2009).

2.1 Software Clustering

Clustering can be based on either a supervised or

unsupervised approach to pick from a collection of

entities, then form multiple groups of entities such

that entities within the same group are similar to each

other, while dissimilar from entities in other groups.

In the context of software clustering, entities are

normally source code or classes. Similarity measures

are normally common global variables used by an

entity or function calls made by an entity. The

identification of similarity is often depending on

what kind of reliable information is available.

Generally, clustering can be categorized into

partitional and hierarchical clustering. Given a

collection of data, partitional clustering works by

directly decomposing it into a set of disjoint clusters.

On the other hand, hierarchical clustering iteratively

merge smaller clusters into larger ones or divide

large clusters into smaller ones, depending on either

it is a bottom-up or top-down approach. Merging or

dividing operations are usually depend on the

clustering algorithm used in the existing studies. The

result of partitional clustering are usually presented

in several disjoint set of clusters, with each cluster

contains at least one entity and each entity belongs to

only one cluster. Meanwhile, the final result of

hierarchical clustering is a tree diagram, called

dendrogram. A dendrogram shows taxonomic

relationships of clusters produced by hierarchical

clustering. Cutting the dendrogram at a certain height

produces a set of disjoint clusters.

ENASE2015-10thInternationalConferenceonEvaluationofNovelSoftwareApproachestoSoftwareEngineering

178

In the domain of software clustering, partitional

clustering is less viable because it is almost

impossible to know the initial number of clusters

before performing software clustering (Chong et al.,

2013). According to the work by (Wiggerts, 1997),

the working principle of agglomerative clustering

(bottom-up hierarchical clustering) is actually similar

to reverse engineering where the abstractions of

software designs are recovered in a bottom-up

manner. Thus, in this research, agglomerative

clustering will be used to recover high-level

abstraction of a poorly documented or poorly

designed software system.

Agglomerative hierarchical clustering starts by

forming all entities as initial clusters. At each step, a

pair of entities is merged and the algorithm ends with

one big cluster. The following steps show a standard

agglomerative hierarchical clustering algorithm.

Input: Set 



,



,⋯,



 of entities.

Output: Dendrogram

1. Each entity 



forms an initial cluster 



. The

total number of clusters K = n. For each pair of

clusters 



and



, , the distance between 



and



is denoted by 



,



.

2. Find a pair of clusters with minimum distance,





,



 :

Let







,









,





Merge 







∪



and

reduce the number of clusters K= K-1

3. If K = 1, stop the iteration; else update

distance 



,



 , for all other clusters 



(Follow step 2)

Although some clustering algorithms produce a

single clustering result for any given dataset, a

dataset may have more than one natural and optimum

clustering. For instance, source code can only tell

very limited information about the architectural

design of a software system since it is a very low-

level software artifact. The work by Deursen and

Kuipers (1999) adopted a greedy search method by

using mathematical concept analysis to analyze the

structure of cluster entities and identify the features

that are shared by them. The proposed approach finds

all of the possible combination of clusters and

evaluates the quality of each combination.

Agglomerative clustering is used in this work. The

authors discovered that it is hard to analyze all

possible combination and useful information might

be missing if no attention is given to analyze the

results of different dendrogram cutting points.

In contrast to the greedy search method proposed

by Deursen and Kuipers, the work by (Fokaefs et al.,

2009; Fokaefs et al., 2012) proposed an approach that

results in multiple solutions from which software

designers can select their best solution. Their goal is

to decompose large classes by identifying Extract

Class refactoring opportunities. The authors used

agglomerative clustering to generate the dendrogram

that demonstrated how the clusters are formed.

On other hand, our previous work (Chong et al.,

2013) proposed a technique to enhance existing

agglomerative clustering algorithms by minimizing

redundant effort and penalizing for the formation of

singleton clusters during software clustering. By

utilizing a least-squares polynomial regression

analysis, the algorithm finds the optimum result that

produces sets of clusters with high cohesion and low

coupling. The proposed technique is based on a

bottom-up approach, which starts by transforming

source code into a flat sequence of class diagrams,

and finally restructures them into a package diagram

to provide a high-level semantic view of the whole

system design. Figure 1 shows the overall workflow

of the work presented in (Chong et al., 2013). The

clustering process are summarized as follow.

1. Identification of entities and features - UML

classes are represented as cluster entities, while

relationships between classes are used to

calculate the distance between a pair of entities.

2. Calculation of similarity measure - Sorensen-

Dice coefficient (Sørensen, 1948) is used to

calculate the similarity between pairs of cluster

entities because it is more suitable to be applied

to asymmetric binary features, which is similar to

the behavior of functional dependencies and

method calling.

3. Application of clustering algorithm - Un-

weighted Pair-Group Method using Arithmetic

Average (UPGMA) is used to merge pairs of

entities and form the dendrogram.

Figure 1: Architecture recovery process proposed in

(Chong et al., 2013).

Our previous work does not differentiate between

aggregation, association, cardinality and

ConstrainedAgglomerativeHierarchicalSoftwareClusteringwithHardandSoftConstraints

179

generalization through different weightage. The

presence of any type of correlation will be

represented as a direct relationship between two

entities. The technique does not have the ability to

integrate domain knowledge or other sources of

information which can further improve the quality of

clustering result. Such way of incorporating domain

knowledge is also known as semi-supervised

software clustering.

If one has the ability to exploit high-level

information to guide and improve clustering, this will

further improve the quality of the results.

Constrained clustering, for instance, is one of the

semi-supervised clustering techniques that combine

external information in order to improve clustering

results.

2.2 Constrained Software Clustering

Recent works (Basu et al., 2004; Davidson and Ravi,

2009; Wagstaff and Cardie, 2000) have attempted to

discover the benefits of instance-level constraints in

both hierarchical and non-hierarchical clustering. The

must-link (ML) and cannot-link (CL) constraints

specify that two entities must both be part of or not

part of the same cluster respectively. These

constraints are useful when the information of cluster

entities is vague, allowing domain experts to guide

the clustering process. The information is normally

given as a set of pairwise constraints which involve

two entities and impose restriction such as

determining whether the involved entities should be

clustered into the same group or not. Constrained

clustering method is contrary to traditional

unsupervised clustering method where users have no

influence toward the clustering results.

Work by (Wagstaff and Cardie, 2000) has found

that side information such as constraints can improve

the quality of clustering when compared against

those without constraints. Meanwhile, the work by

(Davidson and Ravi, 2009) examined the complexity

of traditional clustering algorithms and investigated

methods to improve the efficiency of constrained

agglomerative hierarchical clustering. The authors

introduced new constraints apart from the traditional

ML and CL constraints to further improve the run-

time of agglomerative hierarchical clustering. They

discovered that small amounts of constraints not only

improve the accuracy of agglomerative hierarchical

clustering but also the overall run-time. However,

clustering under all types of constraints is NP-

complete, which means that creating a feasible

cluster hierarchy under all types of constraints is

intractable.

We found that there is a lack of research that

focuses on applying constrained clustering in the

field of software reverse engineering to remodularize

poorly designed software systems. The NP-complete

problem stated in the work by (Davidson and Ravi,

2009) can be minimized if each constraint is assigned

with a certain degree of importance, i.e. constraints

that must be fulfilled, or optional constraints that are

good to have. Therefore, different from the work by

(Davidson and Ravi, 2009), this study aims to

maximize the fulfillment of software constraints

according to the degree of importance derived from

stakeholders.

Generally, current works involving constrained

clustering methods can be divided into three

categories, namely distance based, constrained based,

or hybrid of both. In distance based constrained

clustering, a distance metric is trained to satisfy the

constraints before the clustering process. The

distance metric represents the dissimilarity strength

between pairs of entities. Merging or splitting of

clusters is based on the distance metric. Thus,

training the distance metric allows one to manipulate

the process of clustering to allow certain pairs of

entities to be clustered into the same group, or

separated if otherwise. Examples of methods to train

distance metrics include shortest path (Klein et al.,

2002), expectation maximization (Bilenko and

Mooney, 2003), and convex optimization (Shental

and Weinshall, 2003).

On the other hand, constrained based methods

work by modifying the cluster assignments, i.e.

manually assign entities to designated clusters. For

instance, must-link constraints can be used to

initialize the baseline of cluster hierarchy so that the

must-link constraints can be satisfied

indefinitely(Kestler et al., 2006). Constrained based

approaches ensure that all the constraints are fulfilled

because the clustering assignments are manipulated

by users based on the given constraints. However,

experiments performed by (Davidson and Ravi,

2009) discovered that manipulating with the

clustering assignments might lead to “dead-end”

situation where no pair of clusters can be merged to

obtain a feasible clustering result. Thus, a proper way

to ensure the fulfillment of constraints must be

formulated before enforcing any kind of clustering

constraints.

Fulfillment of constraints can be classified as

either hard or soft constraints associated with some

cost of violation if those constraints cannot be

fulfilled (Basu et al., 2004). Hard constraints are

constraints that cannot be violated during the

clustering process regardless of any condition. These

ENASE2015-10thInternationalConferenceonEvaluationofNovelSoftwareApproachestoSoftwareEngineering

180

sets of constraints are usually highly reliable

knowledge or information given by domain experts.

In general, the cost of violating hard constraints

supersedes the objective function of constrained

clustering. Constrained based clustering method is

one of the most reliable approaches to make sure that

all hard constraints are fulfilled as much as possible.

Meanwhile, soft constraints are usually associated

with uncertainties and ambiguous information (Basu

etal., 2004). The cost of violating soft constraints

varies depending on the level of confidence provided

by the stakeholders. Clustering results will still be

acceptable if some of the soft constraints are not

fulfilled, with a condition that it falls within an

acceptable threshold (Ares et al., 2012). Soft

constraints are more robust against “noisy” or

incorrect. As a general rule, most of the objective

functions attempt to maximize the fulfillment of hard

and soft constraints. However, it is to be noted that

constrained clustering can fall into a NP-Complete

problem if the must-link and cannot-link constraints

are contradicting with each other, for instance,

(Must-Link ∪ Cannot-Link) > 0. Thus, potential

conflicts among hard constraints must be identified

in advance.

All in all, we found that it is feasible to

incorporate the notion of constrained clustering with

remodularization of software system. For instance,

experienced software developers can provide

opinions with a certain degree of confidence to

suggest if two entities should be clustered into the

same group. Furthermore, relationships among

software entities such as inheritance and dependency

suggest that two entities have strong affiliation and

they must be grouped together. There are plenty of

methods to derive constraints from the software itself

or side information from stakeholders. However,

there are certain cases where the stakeholder is not

assertive enough to judge whether the given

constraints are absolute, especially in the domain of

software engineering. For instance, as mentioned in

Section 1, the stakeholder who was involved in the

early stage of software design might provide some

constraints about the software to be maintained.

However, such constraints might not be valid

anymore after several phases of software updates and

changes. Thus, the constraints given by the

aforementioned stakeholder might be ambiguous or

contains erroneous information. A proper method is

needed to distinguish between absolute constraints

and optional constraints, and subsequently fulfill

those constraints according to their level of

importance. Therefore, in this study, a constrained

software clustering algorithm is proposed to alleviate

the problem mentioned above for aiding in reverse

engineering.

3 PROPOSED APPROACH

Constraints can be derived easily from stakeholders,

who are not necessary experts in a particular domain,

by asking them to make judgment whether two items

are similar or not (Hong and Yiu-ming, 2012). The

stakeholders can evaluate their judgments based on

their level of confidence or based on background

knowledge to support their decisions. In this

proposed approach, if the stakeholder is highly

confident that the provided constraints are reliable,

they will be categorized as hard constraints. These

sets of constraints must be fulfilled under any kind of

conditions. On the other hand, if the stakeholder is

doubtful about the given constraints, it will be

categorized as soft constraints.

3.1 Constraints with High Level of

Confidence

If a stakeholder has a high degree of confidence that

a pair of entities must be grouped together or

separated, these sets of constraints will be

categorized as the Must-Link Hard (MLH) or

Cannot-Link Hard (CLH) constraints. MLH and

CLH constraints are relatively easier to fulfill using

k-mean clustering because clustering assignment can

be manipulated during the process of clustering.

However, it is more difficult to achieve the same

results for agglomerative hierarchical clustering

because all entities in the dataset are linked together

at some level of the cluster hierarchy (Bair, 2013).

MLH and CLH constraints must always be fulfilled

at all levels of the hierarchy. The work by

(Miyamoto, 2012) introduced a distance based

approach to impose MLH constraints by requiring

entities linked by MLH constraints to be clustered

together at the lowest level of cluster hierarchy. This

is done by reducing the dissimilarities between pairs

of MLH constraints to zero.

Given a 



,



,⋯,



 with entities





,



,⋯,



. For ( 



,



∈, the distance

between 



and 



is modified to



,



0.

This will eventually form a baseline model for the

clustering hierarchy. Since the MLH constraints are

unconditionally fulfilled at the lowest level of the

hierarchy, one can ensure that the same can be

achieved all the way through the top of hierarchy.

Thus in this study, we will adopt the same technique

ConstrainedAgglomerativeHierarchicalSoftwareClusteringwithHardandSoftConstraints

181

proposed by (Miyamoto, 2012) to ensure the

fulfillment of MLH constraints.

For CLH constraints, there are typically two ways

to enforce using either constrained based or distance

based method. Constrained based methods modify

the cluster assignments by inspecting the merger of

two entities. If the chosen entities belong to the CLH

pairs, one will need to look for the next pair of

entities with the second highest similarity score.

However, the work by (Davidson and Ravi, 2009)

found that the formation of dendrogram may stop

prematurely in a certain scenario. The authors called

this scenario as the “dead-end” situation where

unless CLH constraints are violated, there will be no

more merging possible to form the final dendrogram.

Thus, constrained based approach to fulfill CLH

constraints is a less viable option in our case.

Distance based approaches, on the other hand,

modify the dissimilarities between pairs of CLH

constraints to be a value high enough to prevent them

from merging.

Given a set   



,



,⋯,



 with entities





,



,⋯,



. For 



,



∈, 



,









,



 where  is a constant large

enough to prevent linkage in between entities 



,



By enforcing this rule, the pairs of CLH

constraints will not be chosen to merge unless there

is no more entities pair with distance more

than 



,



. Entities which belong to

CLH constraints will then be merged at the top of the

hierarchy to form the complete dendrogram. By

looking into another perspective, the CLH constraints

are violated at the top of the hierarchy since without

violating them, “dead-end” situation will occur.

However, we argue that violating CLH constraints at

the top of the hierarchy is negligible because it is

almost impossible to cut the dendrogram at that

location. In a typical scenario, cutting the

dendrogram at the top of hierarchy will yield very

small number of clusters because this decision is at

the trade-off of relaxing the constraint of cohesion in

the cluster membership. Clusters formed under this

cutting point are usually made up of entities with

very fragile cohesion strength.

However, changing the distance measure of MLH

and CLH pairs will most likely result in violating the

triangle inequality of resemblance matrix (Klein et

al., 2002). This means that for some entities





,



∉,



,



∉ which were distance









,





apart before imposing MLH or CLH

constraints may now be ′







,













,





along some path which skip through the MLH or

CLH pairs. As pointed out by (Klein et al., 2002),

this problem can be solved by finding a new distance

value with respect to the modified constraints pairs

using all-pairs-shortest-path algorithm.

For instance, Figure 2a shows a simple example

of 6 entities, Classes A,B,C,D,E, and F. The number

on the edges indicate the distance between two

entities. In Figure 2a, the shortest distance between

Class A and Class C is 0.9 with the following order:

A-D-E-F-C.

Figure 2: Example of problem when imposing MLH and

CLH constraints.

After several discussions, the original developers

discovered that Class A and Class B in fact are very

closely related and impose a MLH constraint onto the

cluster. Thus, the distance between A and B is now

0.0 to reflect the MLH constraint, as illustrated in

Figure 2b. Therefore, the shortest path between Class

A and Class C after the imposition of MLH

constraint is now 0.5, with the following order: A-B-

C. If we do not update the distance matrix

accordingly, the final clustering results might contain

erroneous outcome. The overall algorithm to fulfill

both MLH and CLH constraints is shown below.

Input: A set of entities S = {



,



,⋯,



, a set of

MLH (must-link hard constraints) and a set of CLH

(cannot-link hard constraints)

Output: A modified distance matrix

1. Calculate the distance between each pair of

entities and store it in a distance matrix D where



,



,

2. Initialize: Let D’ = D (create a clone distance

matrix to modify the original one)

ENASE2015-10thInternationalConferenceonEvaluationofNovelSoftwareApproachestoSoftwareEngineering

182

3. while ∀



,



∈∩



∀







,





∈





0

i. for (



,



∈, ′



,



0

run all-pair-shortest-path algorithm to prevent

violation of triangular inequality

ii. for ( 



,



∈, ′



,



′



,





 where  is a constant large enough

to prevent linkage in between entities 



,



run all-pair-shortest-path algorithm to prevent

violation of triangular inequality

3.2 Constraints with Low Level of

Confidence

In cases where users are not confident enough to

judge whether the given constraints are absolute,

these sets of constraints will be categorized as soft

constraints. Since soft constraints are not definite,

clustering results with partial fulfillment of soft

constraints are still acceptable in most cases.

However, soft constraints might be derived with a

different level of importance and ranking, subject to

the information provided by users. Fulfilling a

handful of higher importance soft constraints might

overshadow the fulfillment of several less important

ones. Soft constraints are typically assigned with a

penalty score. The penalty score is used to evaluate

the quality of clustering results where minimization

of the penalty score is preferred. Thus, a

prioritization and ranking mechanism of soft

constraints is introduced in this study.

The nature of prioritizing a given set of

constraints is a multi-criteria decision-making

(MCDM) problem. MCDM is a research of methods

and procedures by which it concerns about

evaluating multiple conflicting criteria and derive a

way to come to a compromise. This set of criteria

often differs in the degree of importance. Examples

of methods to handle MCDM problems are analytic

hierarchical process (AHP), fuzzy AHP, goal

programming, scoring methods, and multi-attribute

value functions.

In this study, ranking and prioritizing the

importance of soft constraints are achieved using the

fuzzy AHP technique. Fuzzy AHP is capable of

handling the fuzziness of users’ opinions with respect

to the importance of soft constraints (Chong et al.,

2014). The results gathered from fuzzy AHP will be

represented in a table which shows a list of candidate

criteria (soft constraints) associated with weightage

(importance toward the analyzed software), where a

higher weightage value represents higher priority.

The result will act as a baseline to evaluate the

penalty score of each soft constraint. MLS and CLS

will be evaluated separately because the notion of

ML and CL is opposing to each other. The objective

function of MLS and CLS constraints is shown

below:

Given a set S = { 



,



,⋯,



 with entities





,



,⋯,



, a set of MLS (must-link soft

constraints) and a set of CLS (cannot-link soft

constraints). The objective function is to maximize

the number of satisfied MLS and CLS constraints:

 

























Subject to 



0,1,⋯

0



  1,  1,⋯

Where 



is the total number of available soft

constraints (including MLS and CLS) and 



 is

the number of satisfied soft constraints involving

pairs of entities with



as one of the entities. The

left side of the equation is the ratio of fulfilled soft

constraints over the total number of soft constraints.

Meanwhile, 



 is the penalty score for violated

constraints involving pairs of entities with



as one

of the entities. The penalty score is based on its

importance toward the overall software system using

fuzzy AHP technique. The cumulative weightage

(penalty score) of either MLS or CLS constraints is

equal to 1. Thus, a scaling constant of 1/2 is used to

normalize the second part of the equation when

adding both the MLS and CLS constraints.

Maximization of function  is the goal of this

objective function. The evaluation of soft constraints

fulfillment is performed after the formation of

dendrogram. The dendrogram needs to be cut at a

certain height to produce a set of disjoint clusters.

Evaluation of soft constraints can then be done by

inspecting the set of disjoint clusters, to check

whether or not the soft constraints are violated. A

few cutting points can be executed to compare and

contrast the quality of each cut with respect to the

minimization of soft constraints penalty.

3.3 Constrained Agglomerative

Hierarchical Clustering Algorithm

All in all, the complete algorithm of the proposed

constrained agglomerative hierarchical software

clustering is shown below.

Given a set of entities S, the distance for each pair

of entities x and y in S is 1



,



0 and a set

of constraints   ,,,.

1. Construct the baseline clusters from MLH

ConstrainedAgglomerativeHierarchicalSoftwareClusteringwithHardandSoftConstraints

183

constraints resulting in n number of initial

clusters 



,



,⋯



2. If there is a pair of entities



,







,



,⋯



and CLH



,



∈, then this is a

NP-Complete problem with no solution.

3. Construct an initial clustering with 



clusters

consisting of the n clusters 



,



,⋯



and a

singleton cluster for each entity. 



is the

maximum number of clusters for the set of

entities S.

4. while 1

a. Find the pair of entities 



,



 with

minimum distance.

b. Merge 







∪



at the level of

dissimilarity.

c. Remove 



,



d. 1.

e. repeat step 4.

5. Generate a dendrogram tree based on the

clustering results.

6. Cut dendrogram at several points.

7. Evaluate the fulfillment of MLS and CLS with

respect to the penalty score.

The overall workflow of the proposed technique

work in the following manner:

Software maintainers provide the UML class

diagrams of the software to be analyzed. If class

diagrams are not available, source codes are

converted into class diagrams using an off-the-shelf

round-trip engineering tool. The formation of

clustering entities, identification of features,

construction of dissimilarity matrix, and formation of

dendrogram are executed based on our previous work

(Chong et al., 2013) as discussed in Section 2.1.

After the dendrogram is formed, software

maintainers and/or the original developers can then

provide domain knowledge to aid in the software

clustering process. Based on the confidence level of

the maintainers and/or developers, each input is

categorized into hard or soft constraints.

Dendrograms are cut based on the available

constraints. Each cutting point is evaluated using the

objective function proposed in Section 3.2. Cutting

points that can fulfill the most constraints are

prioritized.

4 EVALUATION

The work by Anquetil and Lethbridge (Anquetil and

Lethbridge, 1999) discussed that instead of

recovering a software system’s architecture,

clustering techniques actually create a new one based

on the parameters and settings used by the clustering

algorithm. Thus, a way to evaluate the effectiveness

of the produced result is needed. MoJoFM is a well-

established technique used to compare the similarity

between clustering result and gold standard. High

similarity between two partitions is more desirable as

it indicates that the produced result resemble the gold

standard.

However, Mitchell and Mancoridis (2001)

discussed that often time, gold standard does not

Figure 3: Overview of the original package diagram and the constrained software clustering results.

ENASE2015-10thInternationalConferenceonEvaluationofNovelSoftwareApproachestoSoftwareEngineering

184

exist. The author suggested another approach by

clustering the analyzed software using different

clustering algorithms. Then, the similarity between

the results of different algorithms are compared with

each other. This will allow one to identify not only

the quality of the clustering results, but also the

stability of the clustering algorithm.

Thus, in this paper, we perform the evaluation of

the proposed technique in the following manner.

1. Perform normal software clustering (without any

constraints) based on our previous work (Chong

et al., 2013).

2. Perform constrained software clustering using our

proposed technique by incorporating hard and

soft constraints.

3. Retrieve the original package diagram of the

analyzed software. The original package diagram

is by no means the gold standard since we could

not verify the quality of the decomposition.

However, it can be treated as a guideline to

evaluate and compare between the results

produced by the proposed technique and the

documented artifact.

4. Use MoJoFM to calculate the similarity between

all three results (normal clustering, constrained

clustering, and package diagram).

Two evaluations were carried to assess the feasibility

of the proposed method. First, we choose a university

research project, MathArc ("MathArc - Ensuring

Access to Mathematics Over Time," August 2009),

as the input for our experiment. This project is aimed

at creating a system that is capable of the long-term

preservation and dissemination of digital journals in

mathematics and statistics. This system is a joint

project by Cornell University Library and Göttingen

State University Library, which took two years to

develop. The system contains 33 classes with an

average of 8 attributes and 4 methods per class.

The system’s functional modules are presented in

Figure 3. Dotted black boxes represent the original

UML packages. There are a total of six subsystems in

this software. Since the software design of the

MathArc system is documented properly, we can test

the feasibility of our proposed algorithm in the

following manner:

1. Prior to the experiment, we assume that all the

entities are scattered around and not grouped in

their respective packages.

2. Based on the original UML package diagram, we

extract a few MLH, CLH, MLS, and CLS

constraints. For instance, based on Figure 3, we

understand that class “Monitor” and

“Preservation” must be grouped into the same

cluster because they are from the same

subsystem. Thus, a MLH constraint “Monitor-

Preservation” is generated in Table 1.

3. For MLS and CLS constraints, penalty score for

violating the soft constraints are generated

randomly. Besides that, we intentionally generate

an erroneous constraint, but assign a very low

penalty score to see how the proposed algorithm

handles the constraint. For instance, although the

original package diagram indicates that “Media”

and “Standards” classes belong to different

packages, we create a MLS constraint with

penalty score of 0.1. This MLS constraint

simulates the situation where stakeholders are not

very confident about the given constraint.

4. Apply the proposed constrained clustering

algorithm to restructure the class diagram, so that

similar classes are grouped into the same

package, while dissimilar ones are separated from

each other.

5. Use MoJoFM to compare the result of the

proposed constrained clustering technique with

the original packages to identify its effectiveness.

Table 1 shows some of the constraints generated for

this experiment. Note that the bracketed value in

MLS and CLS represents the cost of violating a

constraint. Davies-Bouldin index (Davies & Bouldin,

1979) is used in this experiment to evaluate the

quality of cluster cohesion and separation.

Table 1: Generated constraints for MathArc system.

Constraints

MLH CLH MLS(penalty)

CLS

(penalty)

Submission-

QualityAssu

AccessControl-

Submission

Report-SysD(0.3)

Monitor-

Negotiator

(0.5)

Monitor-

Preservation

Report-

Services

Standards-

AccessControl(0.3)

Submission-

Services(0.5)

ErrorCheck-

Media

Updates-

APGeneration(0.3)

ReplaceMedia-

Media

Media-

Standards(0.1)

Figure 3 shows the clustering results using the

proposed algorithm. The blue and red boxes

represent the experimental results, with each box

representing one subsystem. The blue boxes indicate

the clustering results that match the original package

diagram, while the red boxes indicate the mixture of

results that match and do not match the original

package diagram. The diagram was redrawn to

normalize all of the association, aggregation, and

generalization into the form of normal association

notation.

ConstrainedAgglomerativeHierarchicalSoftwareClusteringwithHardandSoftConstraints

185

Figure 4: Overview of the original package diagram and the clustering results without pairwise constraints.

Table 2: Generated Constraints for JSPWiki system.

Constraints

MLH CLH MLS(penalty) CLS(penalty)

GroupCommand-AbstractCommand Workflow-TemplateDirTag Tast-Outcome (0.3)

Command-

WikiEventUtil(0.2)

AbstractCommand-WikiCommand MailUtil-Entry WatchDog-RSSThread(0.3)

WikiPrinciple-

WikiPage(0.3)

UserCheckTag-WikiServletFilter

Workflow-

CommandResolver

PageManager-

EditorManager(0.2)

Step-ParseException(0.3)

AdminBeanManager-WikiEngineEvent PageRenamer-Entry Feed-RSS20Feed(0.1) UserBean-Editor(0.1)

UserDatabase-WikiSession

Workflow-

WikiRPCHandler

Editor-RSSGenerator(0.1) BlogUtil-FileUtil(0.1)

Entry-AclImpl MessageTag-Denounce

WikiSession-UserProfile MessageTag-Entry

FormClose-FormSelect FileUtil-RPCCallable

FormElement-FormSet Heading-MarkupParser

FormOutput-FormOpen

Heading-

ProviderException

FormInput-FormTestArea

SecurityVerifier-

WikiException

InsertPage-TableofContents FileUtil-ClassUtil

Entry-FileSystemProvider BasicPageFilter-CoreBean

InitializablePlugin-Plugin Util.PageSorter-Outcome

TemplateDirTag-WikiRPCHandler Outcome-Feed

Note that all the MLH and CLH constraints are

fulfilled in the result. However, the MLS constraint

of "Media-Standards was violated. This is because

based on Davies-Bouldin index, fulfilling the MLS

constraint of “Media-Standards” will result in low

cohesion strength among the associated clusters.

Since the cost of violation is relatively smaller,

selecting another cutting point that violates this MLS

constraint is a better option. The objective function of

soft constraints in this experiment is 









5/60.05  0.7833. The left side of the

equation signifies that 5 out of 6 soft constraints are

ENASE2015-10thInternationalConferenceonEvaluationofNovelSoftwareApproachestoSoftwareEngineering

186

fulfilled. The value of 0.05 is calculated based on the

penalty score of violating the constraint “Media-

Standards” and multiplying it with scaling constant

of 1/2.

By using the MoJoFM tool provided by (Zhihua

and Tzerpos, 2004), we manage to achieve MoJoFM

metric of 92.59%. This shows a very high

resemblance between the result of our proposed

constraint clustering technique and the original

package diagram. However, as mentioned earlier, the

original package diagram is by no mean the ‘gold

standard’ because we are unable to verify if it is the

best abstraction to represent the software design of

MathArc system. Thus, we perform another

evaluation by comparing the results without

imposing the pairwise constraints. The result is

shown in Figure 4.

In Figure 4, we can observe that the

‘Administrator’ package (lower left hand side)

contains classes from two other packages. The reason

is that these classes behave similarly to utility

classes, for which the association strengths within the

same package are relatively weak compared to the

other packages. When compared with the original

package diagram, the MoJoFM achieves value of

88.89%. Although there are slight improvement

when using the proposed constrained clustering

technique, it is not significant enough. Thus, we

decided to perform another experiment using a larger

software.

We chose another open-source project, the

JSPWiki which is a Wiki engine written in J2EE

component. Wiki engines are used to host and

manage Wiki web pages. JSPWiki contains 42560

lines of code and 425 classes with an average of 5.5

methods per class.

We extracted 15 MLH and CLH constrains, and 5

MLS and CLS constraints from the original package

diagram of JSPWiki. The constraints are listed in

Table 2. However, due to the number of classes exist

in the project, the size of the class diagram is too

large to be displayed. We decided to report the

MoJoFM metric instead.

MoJoFM Metric: Constrained clustering compared to

original package = 76.25%

MoJoFM Metric: Normal clustering (without

constraints) compared to original package = 62.45%

The improvement by imposing pairwise constraints,

observing from the perspective of MoJoFM metric, is

more significant in larger software systems. The

same observation was also found in the work by

(Davidson and Ravi, 2009), where the author claimed

that when performing on large datasets, a small

number of constraints can significantly improve the

results of agglomerative hierarchical clustering.

5 CONCLUSION AND FUTURE

WORK

This paper presents a technique to integrate the

concept of constrained clustering with agglomerative

hierarchical software clustering to remodularize

poorly designed and documented software systems.

The proposed algorithm is capable of handling four

types of constraints, namely MLH, CLH, MLS, and

CLS constraints. Hard constraints are fulfilled

throughout the whole clustering process while soft

constraints are optional constraints associated with

some validation of penalty if they are violated.

The proposed algorithm has been successfully

implemented on two projects, the MathArc and

JSPWiki system. Several MLH, MLS, CLH, and

CLS constraints were generated to test the proposed

technique. We managed to restructure the software

and present it in the form of package diagram. When

compared against clustering without any constraints,

our proposed approach managed to achieve better

results measured using MoJoFM metric.

Finally, we believe that there is potential research

that can further improve the effectiveness of the

proposed technique. For example, one can attempt to

adapt the technique to be applied on partitional

clustering algorithms such as k-mean clustering.

ACKNOWLEDGEMENTS

This work is carried out within the framework of a

research project supported by eScienceFund with

reference 01-01-03-SF0851, funded by Ministry of

Science, Technology and Innovation (MOSTI),

Malaysia.

REFERENCES

Anquetil, N., & Lethbridge, T. C. (1999). Recovering

software architecture from the names of source files.

Journal of Software Maintenance, 11(3), 201-221. doi:

10.1002/(sici)1096-908x(199905/06)11:3<201::aid-

smr192>3.0.co;2-1

Antonellis, P., Antoniou, D., Kanellopoulos, Y., Makris,

C., Theodoridis, E., Tjortjis, C., & Tsirakis, N. (2009).

Clustering for Monitoring Software Systems

Maintainability Evolution. Electron. Notes Theor.

ConstrainedAgglomerativeHierarchicalSoftwareClusteringwithHardandSoftConstraints

187

Comput. Sci., 233, 43-57. doi:

10.1016/j.entcs.2009.02.060

Ares, M. E., Parapar, J., & Barreiro, Á. (2012). An

experimental study of constrained clustering

effectiveness in presence of erroneous constraints.

Information Processing & Management, 48(3), 537-

551.

Bair, E. (2013). Semi-supervised clustering methods.

Wiley Interdisciplinary Reviews: Computational

Statistics, 5(5), 349-361. doi: 10.1002/wics.1270

Basu, S., Banerjee, A., & Mooney, R. (2004). Active

Semi-Supervision for Pairwise Constrained Clustering

Proceedings of the 2004 SIAM International

Conference on Data Mining (pp. 333-344).

Bilenko, M., & Mooney, R. J. (2003). Adaptive duplicate

detection using learnable string similarity measures.

Paper presented at the Proceedings of the ninth ACM

SIGKDD international conference on Knowledge

discovery and data mining, Washington, D.C.

Chong, C. Y., Lee, S. P., & Ling, T. C. (2013). Efficient

software clustering technique using an adaptive and

preventive dendrogram cutting approach. Information

and Software Technology, 55(11), 1994-2012.

Chong, C. Y., Lee, S. P., & Ling, T. C. (2014). Prioritizing

and Fulfilling Quality Attributes For Virtual Lab

Development Through Application of Fuzzy Analytic

Hierarchy Process and Software Development

Guidelines. Malaysian Journal of Computer Science,

27(1).

Davidson, I., & Ravi, S. S. (2009). Using instance-level

constraints in agglomerative hierarchical clustering:

theoretical and empirical results. Data Mining and

Knowledge Discovery, 18(2), 257-282. doi:

10.1007/s10618-008-0103-4

Davies, D. L., & Bouldin, D. W. (1979). A Cluster

Separation Measure. Pattern Analysis and Machine

Intelligence, IEEE Transactions on, PAMI-1(2), 224-

227. doi: 10.1109/TPAMI.1979.4766909

Deursen, A. v., & Kuipers, T. (1999). Identifying objects

using cluster and concept analysis. Paper presented at

the Proceedings of the 21st international conference on

Software engineering, Los Angeles, California, USA.

Fokaefs, M., Tsantalis, N., Chatzigeorgiou, A., & Sander,

J. (2009). Decomposing Object-Oriented Class

Modules Using an Agglomerative Clustering

Technique. IEEE International Conference on

Software Maintenance, 93-101.

Fokaefs, M., Tsantalis, N., Stroulia, E., & Chatzigeorgiou,

A. (2012). Identification and application of Extract

Class refactorings in object-oriented systems. Journal

of Systems and Software, 85(10), 2241-2260. doi:

10.1016/j.jss.2012.04.013

Hong, Z., & Yiu-ming, C. (2012). Semi-Supervised

Maximum Margin Clustering with Pairwise

Constraints. Knowledge and Data Engineering, IEEE

Transactions on, 24(5), 926-939. doi:

10.1109/TKDE.2011.68

Kestler, H., Kraus, J., Palm, G., & Schwenker, F. (2006).

On the Effects of Constraints in Semi-supervised

Hierarchical Clustering. In F. Schwenker & S. Marinai

(Eds.), Artificial Neural Networks in Pattern

Recognition (Vol. 4087, pp. 57-66): Springer Berlin

Heidelberg.

Klein, D., Kamvar, S. D., & Manning, C. D. (2002). From

Instance-level Constraints to Space-Level Constraints:

Making the Most of Prior Knowledge in Data

Clustering. Paper presented at the Proceedings of the

Nineteenth International Conference on Machine

Learning.

Maqbool, O., & Babri, H. A. (2007). Hierarchical

Clustering for Software Architecture Recovery.

Software Engineering, IEEE Transactions on, 33(11),

759-780. doi: 10.1109/TSE.2007.70732

MathArc - Ensuring Access to Mathematics Over Time.

(August 2009).

Mitchell, B. S., & Mancoridis, S. (2001, 2001).

Comparing the decompositions produced by software

clustering algorithms using similarity measurements.

Paper presented at the Software Maintenance, 2001.

Proceedings. IEEE International Conference on.

Miyamoto, S. (2012). An Overview of Hierarchical and

Non-hierarchical Algorithms of Clustering for Semi-

supervised Classification. In V. Torra, Y. Narukawa,

B. López, & M. Villaret (Eds.), Modeling Decisions

for Artificial Intelligence (Vol. 7647, pp. 1-10):

Springer Berlin Heidelberg.

Shental, N., & Weinshall, D. (2003). Learning Distance

Functions using Equivalence Relations. Paper

presented at the In Proceedings of the Twentieth

International Conference on Machine Learning.

Sørensen, T. (1948). A Method of Establishing Groups of

Equal Amplitude in Plant Sociology Based on

Similarity of Species Content and Its Application to

Analyses of the Vegetation on Danish Commons: I

kommission hos E. Munksgaard.

Wagstaff, K., & Cardie, C. (2000). Clustering with

Instance-level Constraints. Paper presented at the

Proceedings of the Seventeenth International

Conference on Machine Learning.

Wiggerts, T. A. (1997, 6-8 Oct 1997). Using clustering

algorithms in legacy systems remodularization. Paper

presented at the Reverse Engineering, 1997.

Proceedings of the Fourth Working Conference on.

Zhihua, W., & Tzerpos, V. (2004, 24-26 June 2004). An

effectiveness measure for software clustering

algorithms. Paper presented at the Program

Comprehension, 2004. Proceedings. 12th IEEE

International Workshop on.

ENASE2015-10thInternationalConferenceonEvaluationofNovelSoftwareApproachestoSoftwareEngineering

188