Detecting Duplicate Effort in GitHub Contributions
James Galbraith, Des Greer
2025
Abstract
The pull-based development model allows collaborators to develop and propose changes to a codebase. However, pull requests often offer duplicate functionality and therefore represent duplicated effort. Users can also request changes via issues, the text of which could provide clues useful in identifying duplicate pull requests. This research investigates combining pull requests with issues with a view to better detecting duplicate pull requests. The paper reviews existing related work and then extends it by investigating the use of natural language processing (NLP) on combined issues and pull requests to detect duplicates. Using data taken from 15 popular GitHub repositories, an NLP model was trained to predict duplicates by comparing the titles and descriptions of issues and pull requests. An evaluation of this model shows that duplicates can be detected with an accuracy of 93.9% and a recall of 90.5%, while an exploratory study shows that the volume of duplicates detected can be increased dramatically by combining issues and pull requests into a single dataset. These results show a significant improvement on previous studies and demonstrate the value of detecting duplicates from issues and pull requests combined.
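To make the idea of comparing titles and descriptions across a combined pool of issues and pull requests concrete, the following is a minimal sketch only. It is not the model described in the paper, whose architecture and features are not given in this abstract. It substitutes a plain TF-IDF cosine similarity, and the record fields (`id`, `title`, `body`), the sample items, the function names, and the `threshold` value are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict of token -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            tok: (c / len(toks)) * (math.log((1 + n) / (1 + doc_freq[tok])) + 1)
            for tok, c in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicates(items, threshold=0.5):
    """Return (id, id, score) for every pair scoring at or above the threshold.

    Issues and pull requests are treated as one dataset: each item is
    represented by its title and body concatenated. The threshold is an
    arbitrary illustrative value, not one taken from the paper.
    """
    texts = [f"{item['title']} {item['body']}" for item in items]
    vectors = tfidf_vectors(texts)
    pairs = []
    for i, j in combinations(range(len(items)), 2):
        score = cosine(vectors[i], vectors[j])
        if score >= threshold:
            pairs.append((items[i]["id"], items[j]["id"], score))
    # Highest-scoring candidate pairs first.
    return sorted(pairs, key=lambda pair: -pair[2])


# Hypothetical sample data: one issue and two pull requests in a single pool.
items = [
    {"id": "issue-1", "title": "Add dark mode to settings page",
     "body": "Users want a dark mode toggle in the settings page"},
    {"id": "pr-7", "title": "Add dark mode toggle",
     "body": "Adds a dark mode toggle to the settings page"},
    {"id": "pr-9", "title": "Fix crash when parsing empty config",
     "body": "Handle empty config files without raising an error"},
]

for first, second, score in find_duplicates(items):
    print(f"{first} <-> {second}: {score:.2f}")
```

With this sample data, only the issue and the first pull request share vocabulary, so they are the only pair flagged. A trained classifier such as the one the abstract reports would replace the fixed similarity threshold with a learned decision.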
Paper Citation
in Harvard Style
Galbraith J. and Greer D. (2025). Detecting Duplicate Effort in GitHub Contributions. In Proceedings of the 20th International Conference on Evaluation of Novel Approaches to Software Engineering - Volume 1: ENASE; ISBN 978-989-758-742-9, SciTePress, pages 520-529. DOI: 10.5220/0013289000003928
in Bibtex Style
@conference{enase25,
author={James Galbraith and Des Greer},
title={Detecting Duplicate Effort in GitHub Contributions},
booktitle={Proceedings of the 20th International Conference on Evaluation of Novel Approaches to Software Engineering - Volume 1: ENASE},
year={2025},
pages={520-529},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013289000003928},
isbn={978-989-758-742-9},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 20th International Conference on Evaluation of Novel Approaches to Software Engineering - Volume 1: ENASE
TI - Detecting Duplicate Effort in GitHub Contributions
SN - 978-989-758-742-9
AU - Galbraith J.
AU - Greer D.
PY - 2025
SP - 520
EP - 529
DO - 10.5220/0013289000003928
PB - SciTePress