Authors: Takwa Kochbati; Shuai Li; Sébastien Gérard and Chokri Mraidha
        
Affiliation: Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France
        
Keywords: User Story, Machine Learning, Word Embedding, Clustering, Natural Language Processing, UML Use-case.
            
Abstract: In modern software development, manually deriving architecture models from software requirements expressed in natural language is a tedious and time-consuming task, particularly for complex systems. Moreover, the growing size of the developed systems raises the need to decompose a software system into sub-systems at an early stage, since such a decomposition helps to design the system architecture. In this paper, we propose a machine-learning-based approach to automatically break down the system into sub-systems and generate preliminary architecture models from natural language user stories in the Scrum process. Our approach consists of three pillars. First, we compute the word-level similarity of requirements using word2vec as a prediction model. Second, we extend it to requirement-level similarity computation using a scoring formula. Third, we employ the Hierarchical Agglomerative Clustering algorithm to group semantically similar requirements and provide an early decomposition of the system. Finally, we implement a set of specific Natural Language Processing heuristics to extract the relevant elements needed to build models from the identified clusters. We illustrate our approach by generating sub-systems expressed as UML use-case models and demonstrate its applicability using three case studies.
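The three pillars described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the tiny `TOY_VECTORS` table stands in for a trained word2vec model, the requirement-level score (a symmetric average of best-match word similarities) is an assumed formula because the abstract does not give the one used in the paper, and average linkage with a similarity threshold is one possible configuration of Hierarchical Agglomerative Clustering. All names, values, and example user stories are hypothetical.

```python
import math
from itertools import combinations

# Toy 2-D vectors standing in for a trained word2vec model (hypothetical values).
TOY_VECTORS = {
    "login": (1.0, 0.1), "password": (0.9, 0.2), "account": (0.8, 0.3),
    "cart": (0.1, 1.0), "checkout": (0.2, 0.9), "order": (0.3, 0.8),
}

def word_sim(w1, w2):
    """Pillar 1: word-level similarity as the cosine of two embedding vectors."""
    u, v = TOY_VECTORS[w1], TOY_VECTORS[w2]
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def tokens(story):
    """Keep only the words that have an embedding (a crude stand-in for NLP preprocessing)."""
    return [w for w in story.lower().split() if w in TOY_VECTORS]

def requirement_sim(s1, s2):
    """Pillar 2: requirement-level similarity.

    Assumed scoring formula: each word is matched to its most similar word in
    the other requirement, and the two directed averages are averaged.
    """
    t1, t2 = tokens(s1), tokens(s2)
    if not t1 or not t2:
        return 0.0

    def directed(a, b):
        return sum(max(word_sim(x, y) for y in b) for x in a) / len(a)

    return (directed(t1, t2) + directed(t2, t1)) / 2

def hac(stories, threshold):
    """Pillar 3: average-linkage agglomerative clustering.

    Repeatedly merges the two most similar clusters and stops once the best
    remaining pair falls below `threshold`. Returns lists of story indices.
    """
    n = len(stories)
    sim = {(i, j): requirement_sim(stories[i], stories[j])
           for i, j in combinations(range(n), 2)}

    def linkage(c1, c2):
        pairs = [(min(i, j), max(i, j)) for i in c1 for j in c2]
        return sum(sim[p] for p in pairs) / len(pairs)

    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        if linkage(clusters[a], clusters[b]) < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe: combinations() guarantees a < b
    return [sorted(c) for c in clusters]

STORIES = [
    "As a user I want to reset my password",
    "As a user I want to login to my account",
    "As a shopper I want to add items to my cart",
    "As a shopper I want to checkout my order",
]

print(hac(STORIES, threshold=0.8))  # → [[0, 1], [2, 3]]
```

Each resulting cluster corresponds to a candidate sub-system (here, an account-related group and a shopping-related group). In a realistic setting the toy vectors would be replaced by embeddings from a trained model, and the threshold would need tuning per project.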