Authors:
Peerachai Banyongrakkul
and
Suronapee Phoomvuthisarn
Affiliation:
Department of Statistics, Chulalongkorn University, Bangkok, Thailand
Keyword(s):
Pull-Based Development, Pull Request, GitHub, Deep Learning, Multi-Output Learning, Classification.
Abstract:
GitHub’s pull-based development model is widely used by software development teams to manage software complexity. Contributors create pull requests for merging changes into the main codebase, and integrators review these requests to maintain quality and stability. However, a high volume of pull requests can overwhelm integrators, causing feedback delays. Previous studies have built predictive models using traditional machine learning techniques with tabular data, but these may lose meaningful information. Additionally, relying solely on acceptance and latency predictions may not be sufficient for integrators. Reopened pull requests can add maintenance costs and burden already-busy developers. This paper proposes a novel multi-output deep learning-based approach that early predicts acceptance, latency, and reopening of pull requests, effectively handling various data sources, including tabular and textual data. Our approach also applies SMOTE and VAE techniques to address the highly i
mbalanced nature of the pull request reopening. We evaluate our approach on 143,886 pull requests from 54 open-source projects across four well-known programming languages. The experimental results show that our approach significantly outperforms the randomized baseline. Moreover, our approach improves accuracy by 8.68%, precision by 1.01%, recall by 11.49%, and F1-score by 6.77% in acceptance prediction, and MMAE by 6.07% in latency prediction, while improving balanced accuracy by 9.43%, AUC by 9.37%, and TPR by 30.07% in reopening prediction over the existing approach.
(More)