
From these tables, it is clear that using a single algorithm will not produce acceptable accuracy. However, by running all algorithms and considering the confidence scores associated with each, it is possible to pick results from the algorithm that produces the most likely match in any given clause, increasing overall accuracy significantly. At the same time, this result only demonstrates the feasibility of improving accuracy by combining algorithms using confidence measures; the tiny set of scenarios used cannot be taken as truly representative of real-world practice.
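As a rough sketch of this selection strategy (not the prototype's actual code; the PhraseMatcher and MatchCandidate names are introduced here purely for illustration), each available matcher is run against a clause and the candidate with the highest confidence score is kept:

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical result of one matching algorithm: a proposed code element plus its confidence score.
record MatchCandidate(String codeElement, double confidence) { }

// Hypothetical common interface implemented by each matching algorithm (cosine, DISCO-based, etc.).
interface PhraseMatcher {
    Optional<MatchCandidate> match(String clauseText);
}

final class CombinedMatcher {
    private final List<PhraseMatcher> matchers;

    CombinedMatcher(List<PhraseMatcher> matchers) {
        this.matchers = matchers;
    }

    // Run every algorithm on the clause and keep the candidate with the highest confidence.
    Optional<MatchCandidate> bestMatch(String clauseText) {
        return matchers.stream()
                .map(m -> m.match(clauseText))
                .flatMap(Optional::stream)
                .max(Comparator.comparingDouble(MatchCandidate::confidence));
    }
}

Whether the confidence scores produced by the different algorithms are directly comparable, or first need to be normalised to a common scale, is a detail this sketch glosses over.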
In addition to accuracy, speed is also a concern. Kirby's development quickly showed that some algorithms depend critically on a dictionary of known words: the smaller the dictionary, the less capable the algorithm, while larger dictionaries significantly increase processing time. DISCO in particular is susceptible, as is the WordNet extension to the cosine measure. As a result, we collected timing data on the processing of individual clauses and whole scenarios, both with the most comprehensive dictionaries available and with a smaller dictionary intended to reduce processing time. Unfortunately, the smaller dictionary also reduced accuracy, producing a 25% loss for word and phrase matching. Table 3 summarizes the run-time performance.
Table 3: Average running time for phrase analysis/matching.

Clause type          Comprehensive Dictionary    Smaller Dictionary
Given                4.57 s (s.d. 3.98)          3.47 s (s.d. 3.61)
When                 0.59 s (s.d. 0.27)          0.45 s (s.d. 0.28)
Then                 0.84 s (s.d. 0.45)          0.73 s (s.d. 0.46)
Complete scenario    8.34 s (s.d. 3.78)          6.47 s (s.d. 3.73)
From Table 3, it is clear that NLP is time-consuming relative to simpler approaches such as regular expression matching. It is interesting to note that the bulk of the time is associated with class and constructor matching in given clauses, while method-based matching in later clauses is much faster. It is also notable that using a smaller dictionary sacrificed a noticeable amount of accuracy but did not drastically improve speed.
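The timing figures above were gathered by instrumenting the prototype; a minimal harness in the same spirit might look like the sketch below, assuming a ClauseAnalyzer interface that stands in for Kirby's per-clause analysis step (the interface and method names here are assumptions, not Kirby's actual API):

import java.util.List;
import java.util.Map;

// Hypothetical interface over the per-clause analysis step being timed.
interface ClauseAnalyzer {
    void analyze(String clauseText);   // match the clause against the code under test
}

final class TimingHarness {

    // Average wall-clock seconds needed to analyze each clause in the given list.
    static double averageSeconds(ClauseAnalyzer analyzer, List<String> clauses) {
        long total = 0L;
        for (String clause : clauses) {
            long start = System.nanoTime();
            analyzer.analyze(clause);
            total += System.nanoTime() - start;
        }
        return (total / 1e9) / clauses.size();
    }

    // Compare analyzers configured with different dictionaries on the same set of clauses.
    static void compare(Map<String, ClauseAnalyzer> analyzersByDictionary, List<String> clauses) {
        analyzersByDictionary.forEach((dictionaryName, analyzer) ->
                System.out.printf("%s: %.2f s per clause%n",
                        dictionaryName, averageSeconds(analyzer, clauses)));
    }
}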
6 CONCLUSIONS
This paper takes the position that fully automated translation of natural language behavioural descriptions directly into executable test code is practical. By describing the design of a prototype tool for achieving this goal, and presenting results from applying the prototype to a small collection of real-world examples, this paper also shows the feasibility of one technique for accomplishing this task.
At the same time, however, this prototype represents work in progress and has not undergone a significant evaluation in the context of authentic BDD usage by real developers. As future work, it is necessary to collect a much larger library of existing BDD scenario descriptions (preferably from open-source projects, since the corresponding applications would also be needed) to serve as a baseline for truly evaluating effectiveness. Further improvements in performance (and potentially accuracy) are also needed.
REFERENCES
Beck, K. 2002. Test Driven Development: By Example. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
Budinsky, F. J., Finnie, M. A., Vlissides, J. M., and Yu, P. S. 1996. Automatic code generation from design patterns. IBM Systems Journal, 35(2):151–171, May 1996.
CodeModel. 2013. http://codemodel.java.net [Accessed May 15, 2013].
Cucumber. 2013. http://cukes.info/ [Accessed May 15, 2013].
Edwards, S. H., Shams, Z., Cogswell, M., and Senkbeil, R. C. 2012. Running students’ software tests against each others’ code: New life for an old “gimmick”. In Proceedings of the 43rd ACM Technical Symposium on Computer Science Education, SIGCSE ’12, pp. 221–226, ACM, New York, NY, USA.
Fellbaum, C. (ed.). 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, USA.
Gherkin. 2013. https://github.com/cucumber/cucumber/wiki/Gherkin [Accessed May 15, 2013].
JBehave. 2013. http://jbehave.org/ [Accessed May 15, 2013].
JDave. 2013. http://jdave.org/ [Accessed May 15, 2013].
Keogh, E. 2010. BDD: A lean toolkit. In Proceedings of Lean Software Systems Conference, 2010.
Kolb, P. 2008. DISCO: A multilingual database of distributionally similar words. In Proceedings of KONVENS-2008, Berlin.
Koskela, L. 2007. Test Driven: Practical TDD and Acceptance TDD for Java Developers. Manning Publications Co., Greenwich, CT, USA.
Lazăr, I., Motogna, S., and Pârv, B. 2010. Behaviour-driven development of foundational UML components. Electronic Notes in Theoretical Computer Science, 264(1):91–105, Aug. 2010.