Detecting Feature Duplication in Natural Language Speciﬁcations when

Evolving Software Product Lines

Amal Khtira, Anissa Benlarabi and Bouchra El Asri

IMS Team, SIME Laboratory, ENSIAS, Mohammed V University, Rabat, Morocco

Keywords:

Natural Language Requirements, Software Product Line, Feature Duplication, Natural Language Processing.

Abstract:

Software product lines are dynamic systems that need to evolve continuously to meet new customer require-

ments. This evolution impacts both the core platform of the product line and its derived products. For several

reasons, the most common way to express requirements by customers is natural language. However, the ex-

perience has shown that this communication channel does not give the possibility to detect system defects

such as inconsistency and duplication. The objective of this paper is to propose a method to transform textual

requirements into the XML format used by some Feature-oriented software development tools, in order to

facilitate the detection of features duplication.

1 INTRODUCTION

Software Product Line Engineering (SPLE)

(Clements and Northrop, 2002) is an approach

that aims at reusing common assets to generate

customized applications according to different needs

of customers. The main beneﬁts of this approach are

the cost reduction, quality enhancement and time to

market reduction. The SPLE approach involves two

major processes, domain engineering and application

engineering (Pohl et al., 2005). Domain engineering

consists in deﬁning the variability and commonality

of the product line, and building all reusable assets

(i. e. requirements, architecture, components, tests).

Based on these assets, speciﬁc applications are

derived during the application engineering process.

Software product lines are a long-term investment

and have to evolve constantly in response to business

needs, changing markets, and advances in technol-

ogy. This evolution impacts both the common as-

sets of the product line and the speciﬁc assets of in-

dividual applications. Due to this change, several de-

fects can arise in the SPL models, such as the incon-

sistency and incompleteness of features (Reder and

Egyed, 2013)(Zowghi and Gervasi, 2003), and the

non-conformance of constraints (Mazo et al., 2011).

What makes the evolution process more complex

is that, in most current software projects, require-

ments documents are written in natural language be-

cause it is the simplest and the more ﬂexible way

for customers to express their expectations. How-

ever, specifying requirements in natural language has

shown many drawbacks such as imprecision, inaccu-

racy and ambiguity (Meyer, 1985)(Lami et al., 2004).

In order to preserve the consistency and correctness

of the SPL models, many approaches have been pro-

posed to manage the evolution process by verifying

the speciﬁcations of change requests considered as

the main source of model defects. A large number

of these approaches aimed at transforming the natu-

ral language requirements to a formal or semi-formal

representation (Lami et al., 2004)(Kamalrudin et al.,

2010)(Holtmann et al., 2011).

In this paper, we propose a method based on natu-

ral language processing to transform natural language

requirements into a tree-like document (i. e. XML

document). This document will serve as an input to an

algorithm whose purpose is to detect a speciﬁc defect

in the speciﬁcation, the feature duplication (Khtira

et al., 2014). We consider that two features are du-

plicated when they satisfy the same functionality in

the product line.

The remainder of the paper is organized as fol-

lows. Section 2 gives an insight of the knowledge

base of our work. In Section 3, we explain our method

of transforming natural language speciﬁcations into

XML documents, and we present the algorithm of de-

tection of feature duplication. Section 4 provides an

overview of different works of the literature related to

our approach and explains in what our contribution is

different. In section 5, we conclude and outline future

work.

257

Khtira A., Benlarabi A. and El Asri B..

Detecting Feature Duplication in Natural Language Speciﬁcations when Evolving Software Product Lines.

DOI: 10.5220/0005459802570262

In Proceedings of the 10th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE-2015), pages 257-262

ISBN: 978-989-758-100-7

 2015 SCITEPRESS (Science and Technology Publications, Lda.)

2 BACKGROUND

In this section, we introduce the background of our

study. First, we give an overview of the feature dupli-

cation problem. Then, we discuss the transformation

of requirements from an informal to a formal repre-

sentation.

2.1 Feature Duplication

Software product lines are dynamic systems that need

to evolve constantly to address changes such as new

customer requirements, technology changes, or refac-

toring. The possible changes in a product line involve

changes that affect the entire product line or changes

driven by an individual product. These changes can

be the source of many defects in the product line

such as inconsistency (Reder and Egyed, 2013)(Blanc

et al., 2009), incompleteness (Zowghi and Gervasi,

2003) and duplication (Khtira et al., 2014). Duplica-

tion is the fact of adding features that have the same

role in the application, which means that they satisfy

the same functionality.

In (Thomas and Hunt, 1999), Hunt and Thomas

distinguish the four main causes of duplication,

namely imposed duplication, inadvertent duplication,

impatient duplication and inter-developer duplication.

In the case of software product lines, at least the two

last causes might occur. Indeed, due to tight dead-

lines, developers could possibly get impatient and add

duplicated features from the speciﬁcations. In addi-

tion, the large number of customers and developers

working on the system can easily cause feature du-

plication. Many studies have dealt with defects in

SPL such as inconsistency and incompleteness, but

to the best of our knowledge, few attempts have tack-

led the problem of duplication. In our study, we will

focus on this speciﬁc defect. Since manual veriﬁca-

tion has proved to be time-consuming and error prone,

we propose an automatic algorithm to detect duplica-

tion based on the formal presentation generated from

a textual requirements of an evolution.

2.2 Formalizing Natural Language

Requirements

In most development projects, textual requirements

are an important input of the requirements analy-

sis. Although formal methods are used in some

safety-critical systems, for the largest part of soft-

ware projects, the predominant mean of representing

requirements is Natural Language due to many rea-

sons. Indeed, specifying requirements in natural lan-

guage is more ﬂexible and simple for the customer.

Moreover, natural language can be understood by all

the stakeholders of the project, unlike formal methods

which can be very difﬁcult to customers. Hence, nat-

ural language speciﬁcations can be used as an agree-

ment between customers and suppliers and as an out-

put for the project management. However, expressing

requirements in a natural language frequently make

them prone to many defects. For instance, (Meyer,

1985) details seven problems with natural language

speciﬁcations: noise, silence, over-speciﬁcation, con-

tradictions, ambiguity, forward references and wish-

ful thinking. (Lami et al., 2004) dealt with other de-

fects, namely the ambiguity, the inconsistency and

the incompleteness. To overcome these problems,

many studies have proposed methods to transform

natural language speciﬁcations to formal or semi-

formal speciﬁcations (Holtmann et al., 2011)(Fat-

wanto, 2013)(Cabral and Sampaio, 2008)(Ilieva and

Ormandjieva, 2005).

In the context of software product lines, the do-

main model of the entire product line and the applica-

tion models of the derived products can be expressed

using feature models, while the speciﬁcations of a

new evolution concerning a derived product can be

expressed using natural language. To enable a veri-

ﬁcation of the speciﬁcations against the product line

models, we need ﬁrst to transform these requirements

into a more formal presentation and to assure the ab-

sence of all sorts of defects.

3 DUPLICATION DETECTION IN

NATURAL LANGUAGE

SPECIFICATIONS

The purpose of our work is to detect duplication in

textual speciﬁcations related to an evolution of a soft-

ware product line. To achieve this goal, we propose a

two-step process. The ﬁrst step consists of transform-

ing the natural language speciﬁcations into an XML

document. The second step involves the application

of an algorithm that detects duplicated features in the

generated XML. Figure 1 represents the overview of

the proposed process.

3.1 Initiation Phase

The objective of this phase is to initiate the model

that will be used by the machine learning. The model

is thus populated based on the domain model, which

contains all the existing features implemented by the

product line. As a case study of our approach, we con-

sider a CRM (Customer Relationship Management)

ENASE2015-10thInternationalConferenceonEvaluationofNovelSoftwareApproachestoSoftwareEngineering

258

Figure 1: An overview of the process.

Figure 2: The Domain Feature Model of the CRM.

product line with the feature model depicted in Fig-

ure 2.

It has to be noted that the domain model and the

application models of the SPL are created using the

FeatureIDE tool (Kastner et al., 2009), which is an

open source framework for software product line en-

gineering based on Feature-Oriented Software Devel-

opment (FOSD). This framework supports the entire

life-cycle of a product line, especially domain analy-

sis and feature modeling. It provides a graphical pre-

sentation of the feature model tree and generates auto-

matically the corresponding XML source. This struc-

ture of the XML distinguishes the concepts of varia-

tion points and variants using tags. The tags ”or” and

”alt” correspond to variation points, while the tags

”feature” correspond to variants.

Expressing the feature models using an XML for-

mat will allow us to anticipate the comparison be-

tween these models and the speciﬁcations of evolu-

tions in terms of duplication.

3.2 Formalizing Natural Language

Speciﬁcations

In this section, we explain the process of transforma-

tion of natural language requirements into an XML

document. To perform this operation, we will use the

OpenNLP library (OpenNLP, 2011), which is a ma-

chine learning based toolkit for the processing of nat-

ural language text. The remainder of this subsection

details the different steps and artifacts involved in this

process.

3.2.1 Textual Speciﬁcation

In this stage of the process, the main input is the spec-

iﬁcation of a new evolution related to a derived prod-

uct. This speciﬁcation contains the new requirements

that have to be implemented in this speciﬁc product.

The easiest and more ﬂexible way for the customer to

express his requirements is natural language. Hence,

the speciﬁcation can be written in a document with a

doc or txt format.

3.2.2 Sentence Detector

The ﬁrst step of this stage consists of detecting the

punctuation characters that indicate the end of sen-

tences. After the detection of all sentence boundaries,

each sentence is written in its own line. By processing

the input textual speciﬁcation, the output of this oper-

ation will be a text document that contains a sentence

per line.

3.2.3 Tokenizer

This step consists of segmenting the resulted sen-

tences of the previous step into tokens. A token can

be a word, a punctuation, a number, etc. As an out-

put of this action, all the tokens of the speciﬁcation

are separated using whitespace characters, such as a

space or line break, or by punctuation characters.

3.2.4 Parser

The parser is responsible for analyzing each sentence

of the speciﬁcation in order to mark tokens with their

corresponding types and roles in the sentence based

on the rules of the language grammar (e.g. noun, verb,

adjective, adverb). We note that the language used in

our input speciﬁcations is English. A parser marks

all the words of a sentence using a POS tagger (Part-

Of-Speach tagger) and converts the sentence into a

tree that represents the sentences syntactic structure.

Figure 3 depicts an example of parsing for a textual

deﬁnition of a requirement related to the case study

presented in Figure 2.

This operation enables us to have an exact under-

standing of the sentence. For example, it allows us

to conﬁrm whether the action of a verb is negative or

afﬁrmative, and whether a requirement is mandatory

or optional.

DetectingFeatureDuplicationinNaturalLanguageSpecificationswhenEvolvingSoftwareProductLines

259

Figure 3: An example of syntax parsing.

3.2.5 Entity Detector

The aim of this step is to detect semantic entities in

the speciﬁcation. In our study, we are interested espe-

cially in the parts of the sentences considered as vari-

ation points and variants. To carry out this task, we

need the model created in the initiation phase where

all the domain speciﬁcations are tagged.

In the example depicted in Figure 4, the entity de-

tector reads a tokenized sentence and outputs the sen-

tence with markup for the detected variation point and

variant.

Figure 4: An example of detecting entities.

In order to measure the precision of entity recog-

nition, we use the evaluation tool of OpenNLP that

provides information about the accuracy of the used

model.

3.2.6 Repository

The repository contains two main components:

• The model that contains the different features

of the domain model, classiﬁed in different

categories, especially <variation point> and

<variant>.

• The dictionary which contains the set of syn-

onyms and alternatives for all the concepts used

in the system.

The repository is initially populated based on the do-

main model of the product line. So that the repository

keeps up with the evolution of the product line and

its derived products, the new concepts detected in the

speciﬁcation are added to the initial repository, which

makes the latter more accurate.

3.2.7 XML Output

The output of our process is the speciﬁcation in the

form of an XML document. The XML should contain

the tags ”or” and ”alt” for variation points, and the

tags ”feature” for variants. This structure provides a

uniﬁed presentation for both the domain model and

the speciﬁcation. The XML document will serve as

an input of the algorithm for duplication detection.

3.3 The Algorithm to Detect Feature

Duplication

With the large number of features in a speciﬁcation,

detecting duplication manually becomes a tedious,

time-consuming and error-prone task. In order to

identify efﬁciently the duplicated features, we pro-

pose an automatic algorithm that uses as an input the

generated XML of the speciﬁcation. An example of

this XML is depicted in Figure 5.

Figure 5: An example of the algorithm input.

We denote by S the XML of the speciﬁcation. PA

is the set of nodes related to the tags ”or” and ”alt”

of S (i. e. variation points), and V is the set of nodes

related to the tags ”feature” of S (i. e. variants).

P = {p

, p

, . . . , p

}

∀p

∈ P ∃V

where V

= {v

i j

| j ∈ N}

Thus:

V =

[

i=1

The proposed algorithm contains the following

steps:

• Step 1. This step is based on the dictionary that

contains the synonyms of all the concepts of the

SPL, and for each set of synonyms, we deﬁne

a key synonym. For example : The synonyms

for ”on-line sales” could be ”e-sales”, ”Internet

sales”, or ”web sales”. The key synonym for these

alternatives is ”on-line sales”.

The aim of this step is to generate an equivalent

XML of the input, by replacing the name of every

ENASE2015-10thInternationalConferenceonEvaluationofNovelSoftwareApproachestoSoftwareEngineering

260

node (variation point or variant) with its associ-

ated key synonym in the dictionary.

• Step 2. This step consists of putting in alphabet-

ical order the variation points and the variants of

each variation point.

• Step 3. For each variation point, the duplicated

variants are detected and removed from the XML.

The sub-algorithm of this step is as follows:

Algorithm 1: Detecting duplicated variants of a variation

point.

// n

the number of variants of the variation point p

// DP the set of variation points with duplicated variants

// DV

the set of duplicated variants of the variation point

DP =

for each p

∈ P do

i ← 0

while i < n

if v

= v

ki+1

then

DP ← DP ∪ {p

}

← DV

∪ {v

}

← V

\ {v

}

end if

i ← i + 1

end while

end for

• Step 4. During this step, we carry out a com-

parison between the variants of all the variation

points, in order to detect duplication in the whole

XML.

The ﬁnal output of these steps is a log ﬁle that

contains the set of duplicated pairs (variation point,

variant) of the speciﬁcation. These features are sent

to the user in order to reverify his needs and change

them in case of error.

4 RELATED WORK

Several papers have dealt with detecting defects in

natural language speciﬁcations. In this section, we

provide an insight of the studies that use the transfor-

mation of natural language requirements into a formal

or semi-formal presentation.

Lami et al. (Lami et al., 2004) propose a method-

ology and a tool called QuARS (Quality Analyzer

for Requirement Speciﬁcations). This tool performs

an initial parsing of the speciﬁcations in order to de-

tect automatically speciﬁc linguistic defects, namely

inconsistency, incompleteness and ambiguity. The

analysis performed by QuARS is limited to syntax-

related issues of a natural language document, while

semantic-related problems are not directly addressed.

Although this study converges with our paper in the

fact that it is based on parsing of natural language re-

quirements to detect defects. However, our approach

is different because it transforms speciﬁcations into

XML trees and it focuses particularly on a speciﬁc

defect which is the feature duplication.

Kamalrudin et al. (Kamalrudin et al., 2010) pre-

sented the automated tracing tool Marama that en-

ables users to capture their requirements and automat-

ically generate the Essential Use Cases (EUC). This

tool supports the inconsistency checking between the

textual requirements, the abstract interactions and the

EUCs, but unlike our approach, this one is based on

use cases instead of XML documents.

In order to locate inconsistency in the domain fea-

ture model of a SPL, Yu (Yu et al., 2012) provides

a new method to construct traceability between re-

quirements and features. It consists of creating indi-

vidual application Feature Tree Models (AFTMs) and

establishing traceability between each AFTM and its

corresponding requirements. It ﬁnally merges all the

AFTMs to extract the Domain Feature Tree Model

(DFTM), which enables to ﬁgure out the traceabil-

ity between domain requirements and DFTM. Using

this method helps constructing automatically the do-

main feature model from requirements. It also helps

locate affected requirements while features change or

vice versa, which makes it easier to detect inconsis-

tencies. However, this approach is different from our

own one, because we suppose that domain and ap-

plication models exist, our objective is hence to con-

struct a more formal presentation of the requirements

related to a new version of a derived product.

At the aim of validating and correcting automat-

ically textual requirements, Holtmann et al. (Holt-

mann et al., 2011) proposed an approach that uses an

extended CNL (controlled natural language) that is al-

ready used in the automotive industry. The CNL re-

quirements are ﬁrst translated into an ASG (Abstract

Syntax Graph) typed by a requirements metamodel.

Then, structural patterns are speciﬁed based on this

metamodel. The use of patterns allows an automated

correction of some requirements errors and the vali-

dation of requirements due to change requests. While

this approach considers the correction of textual spec-

iﬁcations using a CNL, our approach aims at detect-

ing duplication in these speciﬁcations by transform-

ing them into tree-like documents.

Similarly, Cabral and Sampaio (Cabral and Sam-

paio, 2008) proposed a novel approach that uses tem-

plates to write use cases in CNL (Controlled Natu-

ral Language), then to translate use cases in CNL to

models in CSP process algebra that allows the de-

DetectingFeatureDuplicationinNaturalLanguageSpecificationswhenEvolvingSoftwareProductLines

261

scription of systems in terms of processes that operate

independently, and interact with each other through

message-passing communication. The use of a CNL

and use case templates enables to guarantee require-

ments consistency. The paper focuses on use case

transformation but does not detail the process of in-

consistency detection.

5 CONCLUSION

In this paper, we presented an approach to detect du-

plication in natural language speciﬁcations related to

a derived product of a software product line. This ap-

proach is based on a two-phase process. During the

ﬁrst phase, the speciﬁcations are transformed into a

more formal presentation, XML document, using nat-

ural language processing. In the second phase, we

propose an algorithm that searches the duplicated fea-

tures in the generated XML. Apart from the detection

of duplication within the speciﬁcation, the formaliza-

tion of the latter anticipates the veriﬁcation of dupli-

cation between the new requirements and the existing

domain and application models of the software prod-

uct line.

As a case study for our work, we considered a

CRM SPL. Work in progress consists of creating a

model based on the domain speciﬁcations of the CRM

tool and using the OpenNLP library to validate it.

We also prepare the speciﬁcation of a new evolution

which will serve as an input of the proposed process.

In a future work, we intend to develop a support tool

whose objective is to apply the syntax parsing to the

speciﬁcation, generate the XML and apply the algo-

rithm to detect automatically the duplicated features.

REFERENCES

Blanc, X., Mougenot, A., Mounier, I., and Mens, T. (2009).

Incremental detection of model inconsistencies based

on model operations. In Advanced information sys-

tems engineering, pages 32–46. Springer.

Cabral, G. and Sampaio, A. (2008). Formal speciﬁcation

generation from requirement documents. Electronic

Notes in Theoretical Computer Science, 195:171–188.

Clements, P. and Northrop, L. (2002). Software product

lines: practices and patterns, volume 59. Addison-

Wesley Reading.

Fatwanto, A. (2013). Software requirements speciﬁcation

analysis using natural language processing technique.

In QiR (Quality in Research), 2013 International Con-

ference on, pages 105–110. IEEE.

Holtmann, J., Meyer, J., and von Detten, M. (2011). Auto-

matic validation and correction of formalized, textual

requirements. In Software Testing, Veriﬁcation and

Validation Workshops (ICSTW), 2011 IEEE Fourth In-

ternational Conference on, pages 486–495. IEEE.

Ilieva, M. and Ormandjieva, O. (2005). Automatic transi-

tion of natural language software requirements speci-

ﬁcation into formal presentation. In Natural Language

Processing and Information Systems, pages 392–397.

Springer.

Kamalrudin, M., Grundy, J., and Hosking, J. (2010). Man-

aging consistency between textual requirements, ab-

stract interactions and essential use cases. In Com-

puter Software and Applications Conference (COMP-

SAC), 2010 IEEE 34th Annual, pages 327–336. IEEE.

Kastner, C., Thum, T., Saake, G., Feigenspan, J., Leich, T.,

Wielgorz, F., and Apel, S. (2009). Featureide: A tool

framework for feature-oriented software development.

In Software Engineering, 2009. ICSE 2009. IEEE 31st

International Conference on, pages 611–614. IEEE.

Khtira, A., Benlarabi, A., and El Asri, B. (2014). To-

wards duplication-free feature models when evolving

software product lines. In Software Engineering Ad-

vances (ICSEA), 2014 Ninth International Conference

on, pages 107–113.

Lami, G., Gnesi, S., Fabbrini, F., Fusani, M., and Trentanni,

G. (2004). An automatic tool for the analysis of nat-

ural language requirements. Informe t

ecnico, CNR

Information Science and Technology Institute, Pisa,

Italia, Setiembre.

Mazo, R., Lopez-Herrejon, R. E., Salinesi, C., Diaz, D.,

and Egyed, A. (2011). Conformance checking with

constraint logic programming: The case of feature

models. In Computer Software and Applications Con-

ference (COMPSAC), 2011 IEEE 35th Annual, pages

456–465. IEEE.

Meyer, B. (1985). On formalism in speciﬁcations. IEEE

software, 2(1):6–26.

OpenNLP (2011). Apache software foundation. URL

http://opennlp. apache. org.

Pohl, K., B

ockle, G., and Van Der Linden, F. (2005). Soft-

ware product line engineering. Springer, 10:3–540.

Reder, A. and Egyed, A. (2013). Determining the cause of

a design model inconsistency. Software Engineering,

IEEE Transactions on, 39(11):1531–1548.

Thomas, D. and Hunt, A. (1999). The pragmatic program-

mer: From journeyman to master.

Yu, D., Geng, P., and Wu, W. (2012). Constructing trace-

ability between features and requirements for soft-

ware product line engineering. In Software Engineer-

ing Conference (APSEC), 2012 19th Asia-Paciﬁc, vol-

ume 2, pages 27–34. IEEE.

Zowghi, D. and Gervasi, V. (2003). On the interplay be-

tween consistency, completeness, and correctness in

requirements evolution. Information and Software

Technology, 45(14):993–1009.

ENASE2015-10thInternationalConferenceonEvaluationofNovelSoftwareApproachestoSoftwareEngineering

262