
From these tables, it is clear that using a single algorithm will not produce acceptable accuracy. However, by running all algorithms and considering the confidence scores associated with each, it is possible to pick results from the algorithm that produces the most likely match in any given clause, increasing overall accuracy significantly. At the same time, this result only demonstrates the feasibility of improving accuracy by combining algorithms using confidence measures; the tiny set of scenarios used cannot be taken as truly representative of real-world practice.
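As a rough sketch of this selection strategy (not the prototype's actual code; the PhraseMatcher and MatchCandidate names are introduced here purely for illustration), each available matcher is run against a clause and the candidate with the highest confidence score is kept:

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical result of one matching algorithm: a proposed code element plus its confidence score.
record MatchCandidate(String codeElement, double confidence) { }

// Hypothetical common interface implemented by each matching algorithm (cosine, DISCO-based, etc.).
interface PhraseMatcher {
    Optional<MatchCandidate> match(String clauseText);
}

final class CombinedMatcher {
    private final List<PhraseMatcher> matchers;

    CombinedMatcher(List<PhraseMatcher> matchers) {
        this.matchers = matchers;
    }

    // Run every algorithm on the clause and keep the candidate with the highest confidence.
    Optional<MatchCandidate> bestMatch(String clauseText) {
        return matchers.stream()
                .map(m -> m.match(clauseText))
                .flatMap(Optional::stream)
                .max(Comparator.comparingDouble(MatchCandidate::confidence));
    }
}

Whether the confidence scores produced by the different algorithms are directly comparable, or first need to be normalised to a common scale, is a detail this sketch glosses over.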
In addition to accuracy, speed is also a concern. Kirby's development quickly showed that some algorithms depend critically on a dictionary of known words: the smaller the dictionary, the less capable the algorithm, while larger dictionaries significantly increase processing time. DISCO in particular is susceptible, as is the WordNet extension to the cosine measure. As a result, we collected timing data on the processing of individual clauses and whole scenarios, both with the most comprehensive dictionaries available and with a smaller dictionary intended to reduce processing time. Unfortunately, the smaller dictionary also reduced accuracy, producing a 25% loss for word and phrase matching. Table 3 summarizes the run-time performance.
Table 3: Average running time for phrase analysis/matching.

Clause type          Comprehensive Dictionary    Smaller Dictionary
Given                4.57 s (s.d. 3.98)          3.47 s (s.d. 3.61)
When                 0.59 s (s.d. 0.27)          0.45 s (s.d. 0.28)
Then                 0.84 s (s.d. 0.45)          0.73 s (s.d. 0.46)
Complete scenario    8.34 s (s.d. 3.78)          6.47 s (s.d. 3.73)
From Table 3, it is clear that NLP is time-consuming relative to simpler approaches such as regular expression matching. It is interesting to note that the bulk of the time is associated with class and constructor matching in given clauses, while method-based matching in later clauses is much faster. It is also notable that using a smaller dictionary sacrificed a noticeable amount of accuracy but did not drastically improve speed.
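The timing figures above were gathered by instrumenting the prototype; a minimal harness in the same spirit might look like the sketch below, assuming a ClauseAnalyzer interface that stands in for Kirby's per-clause analysis step (the interface and method names here are assumptions, not Kirby's actual API):

import java.util.List;
import java.util.Map;

// Hypothetical interface over the per-clause analysis step being timed.
interface ClauseAnalyzer {
    void analyze(String clauseText);   // match the clause against the code under test
}

final class TimingHarness {

    // Average wall-clock seconds needed to analyze each clause in the given list.
    static double averageSeconds(ClauseAnalyzer analyzer, List<String> clauses) {
        long total = 0L;
        for (String clause : clauses) {
            long start = System.nanoTime();
            analyzer.analyze(clause);
            total += System.nanoTime() - start;
        }
        return (total / 1e9) / clauses.size();
    }

    // Compare analyzers configured with different dictionaries on the same set of clauses.
    static void compare(Map<String, ClauseAnalyzer> analyzersByDictionary, List<String> clauses) {
        analyzersByDictionary.forEach((dictionaryName, analyzer) ->
                System.out.printf("%s: %.2f s per clause%n",
                        dictionaryName, averageSeconds(analyzer, clauses)));
    }
}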
6 CONCLUSIONS
This paper takes the position that fully automated translation of natural language behavioural descriptions directly into executable test code is practical. By describing the design of a prototype tool for achieving this goal, and presenting results from applying the prototype to a small collection of real-world examples, this paper also shows the feasibility of one technique for accomplishing this task.
At the same time, however, this prototype represents work in progress and has not undergone a significant evaluation in the context of authentic BDD usage by real developers. As future work, it is necessary to collect a much larger library of existing BDD scenario descriptions (preferably from open-source projects, since the corresponding applications would also be needed) to serve as a baseline for truly evaluating effectiveness. Further improvements in performance (and potentially accuracy) are also needed.
REFERENCES
Beck, K. 2002. Test Driven Development: By Example. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
Budinsky, F. J., Finnie, M. A., Vlissides, J. M., and Yu, P. S. 1996. Automatic code generation from design patterns. IBM Systems Journal, 35(2):151–171, May 1996.
CodeModel. 2013. http://codemodel.java.net [Accessed May 15, 2013].
Cucumber. 2013. http://cukes.info/ [Accessed May 15, 2013].
Edwards, S. H., Shams, Z., Cogswell, M., and Senkbeil, R. C. 2012. Running students’ software tests against each others’ code: New life for an old “gimmick”. In Proceedings of the 43rd ACM Technical Symposium on Computer Science Education, SIGCSE ’12, pp. 221–226, ACM, New York, NY, USA.
Fellbaum, C. (ed.). 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, USA.
Gherkin. 2013. https://github.com/cucumber/cucumber/wiki/Gherkin [Accessed May 15, 2013].
JBehave. 2013. http://jbehave.org/ [Accessed May 15, 2013].
JDave. 2013. http://jdave.org/ [Accessed May 15, 2013].
Keogh, E. 2010. BDD: A lean toolkit. In Proceedings of Lean Software Systems Conference, 2010.
Kolb, P. 2008. DISCO: A multilingual database of distributionally similar words. In Proceedings of KONVENS-2008, Berlin.
Koskela, L. 2007. Test Driven: Practical TDD and Acceptance TDD for Java Developers. Manning Publications Co., Greenwich, CT, USA.
Lazăr, I., Motogna, S., and Pârv, B. 2010. Behaviour-driven development of foundational UML components. Electronic Notes in Theoretical Computer Science, 264(1):91–105, Aug. 2010.