Detecting Duplicate Effort in GitHub Contributions
James Galbraith (https://orcid.org/0000-0001-6367-9274) and Des Greer
School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast, BT7 1NN, U.K.
Keywords:
GitHub, Duplicate Detection, Issues, Pull Requests.
Abstract:
The pull-based development model allows collaborators to develop and propose changes to a codebase. How-
ever, pull requests can often offer duplicate functionality and therefore duplicate effort. Users can also request
changes via issues, the text of which could provide clues, useful in determining duplicate pull requests. This re-
search investigates combining pull requests with issues with a view to better detecting duplicate pull requests.
The paper reviews existing related work and then extends this by investigating the use of natural language
processing (NLP) on combined issues and pull requests in order to detect duplicates. Using data taken from
15 popular GitHub repositories, an NLP model was trained to predict duplicates by comparing the title and
description of issues and pull requests. An evaluation of this model shows that duplicates can be detected with
an accuracy of 93.9% and recall rate of 90.5%, while an exploratory study shows that the volume of duplicates
detected can be increased dramatically by combining issues and pull requests into a single dataset. These
results show a significant improvement on previous studies and demonstrate the value in detecting duplicates
from issues and pull requests combined.
1 INTRODUCTION
The introduction of the pull-based development
model (Gousios et al., 2014; Gousios and Zaid-
man, 2014) in GitHub has facilitated the develop-
ment of open-source projects through the contribu-
tions of community developers (Jiang et al., 2017) and
has fundamentally changed the way in which projects
evolve. The open-source model allows anyone to
copy and redistribute the software solution, while the
pull-based development model allows anyone to con-
tribute changes to a project. The community develop-
ers who provide these changes need not have any par-
ticular relationship to, or experience with, the project
in question. These developers contribute to projects
for a number of reasons (Jiang et al., 2017; McClean
et al., 2021). When someone would like to contribute
a new feature or bug fix to a project, they do so by
forking the project and making the changes locally.
These changes are then submitted in the form of a pull
request to be reviewed and merged. This process of
review is carried out by a core team of reviewers who
decide to accept or reject the changes. Each pull re-
quest comes with a title and description detailing what
the code submitted in the pull request aims to achieve.
Users of an open-source project can also report
bugs and contribute ideas and suggestions by raising
an issue. Issues, like pull requests, contain a title
and description detailing the bug or feature in ques-
tion, but issues do not provide any solution in terms
of code. One way of thinking about this is that is-
sues may be raised by users of a project to detail bugs
that they have found, or new features that they would
like to see, but who do not have the technical capability, or will, to carry out such work themselves. Pull
requests, by contrast, may provide the solution to an
issue, but they may also provide a solution to a bug
or a feature request that has not already been raised as
an issue. Simply, pull requests contain code changes,
and issues do not.
Issues are used in open-source software hosting
platforms to track bug reports, and propose new fea-
tures. They also provide a medium through which
users and maintainers of a product can communicate,
provide feedback, and track the progress of work. Pull
requests, by contrast, are used to manage the merging
of new code into a codebase. Pull requests display the
differences between the source code and the new code
being proposed, and they allow reviewers to provide
feedback on code changes.
As the number of users of an open-source project increases, so too does the volume of issues and pull requests being submitted. One
study from 2016 (Gousios et al., 2016) found that
135,000 repositories hosted in GitHub collectively
received more than 600,000 pull requests in a single
month. Many GitHub repositories can have thousands
of open issues and pull requests at any one time. At
the time of writing, Flutter has more than 12,000 open issues (https://github.com/flutter/flutter/issues), while TensorFlow has almost 2,000 open issues (https://github.com/tensorflow/tensorflow/issues). This makes it extremely difficult for
users, contributors, and reviewers of a project to keep
track of all of the changes that are being made to the
project. Thus, due to the open, distributed, and unco-
ordinated nature of pull-based development (Gousios
et al., 2016; Gousios et al., 2015) when used at
scale, a challenge presents itself: duplicate work can be carried out in developing pull requests, in reporting issues, and in reviewing submitted pull requests and issues. This
can drain reviewers’ resources. Since they have no
method of immediately identifying duplicate issues
and pull requests, these duplicates must undergo
the same review process as all other issues and pull
requests. Often, the review process will require some
back and forth communication between the reviewer
and the reporter to gather more information, so the
reviewer is not just wasting time through the code
review, but also through their interactions with the
reporter. Some work has been done previously on
creating a dataset of duplicate pull requests (Yu et al.,
2018) and some on automatically detecting duplicate
pull requests using machine learning (Li et al., 2017),
while Runeson et al. explored the possibility of auto-
matically detecting duplicate defect reports (Runeson
et al., 2007) from a database of defect reports used
by Sony Ericsson. These papers do not, however,
consider the possibility of collating pull requests and
issues into a single dataset. The machine learning
model created by Li et al. (Li et al., 2017) uses natural
language processing to calculate the textual similarity
between the title and description of pull requests in
order to find duplicates. Issues also contain a title
and description serving the same function as those
of pull requests, which raises the question of whether the two can be treated as a single entity when training, implementing, and evaluating a machine learning model. Thus, this paper will seek to answer
the following questions:
RQ1. What effect does combining pull requests and
issues have on the accuracy of a natural language
processing (NLP) model, used to predict duplicates,
when compared to pull requests only?
RQ2. When considering open and unresolved issues
and pull requests, can we find a greater volume of
duplicates than when considering pull requests only?
When considering these research questions, we will
adopt the following null hypotheses, respectively:
H01: The accuracy of a given NLP model will not increase when evaluated on the combined dataset compared to pull requests only.
H02: The volume of duplicates found when considering issues and pull requests combined will not be significantly greater than when considering pull requests only.
To test these hypotheses, we will adopt an ap-
proach similar to that of Li et al. (Li et al., 2017).
First, a console application will be developed to
gather a dataset of known duplicate issues and du-
plicate pull requests from a number of popular open-
source GitHub repositories, making use of GitHub’s
public API to do so. This application will use regular
expressions to search the comments of closed issues
and pull requests for strings identifying one issue or
pull request as a duplicate of another. This data will
then be used to train and evaluate a machine learning
model. The model will consider the title and descrip-
tion, and use natural language processing to calculate
the textual similarity between the items in our dataset,
and thus group these items where they are likely to be
duplicates of each other. This model will also then
be used to calculate the number of predicted dupli-
cates, from a snapshot of unresolved issues and pull
requests, relative to the size of the dataset.
This approach differs from that of Li et al. (Li
et al., 2017) in several ways. Our model will use a dif-
ferent regex pattern for finding comments identifying
duplicates, to allow for greater variety in the format of
such comments, following on from the findings from
Li et al. (Li et al., 2018). Our study will also consider a wider range of projects. Critically, however, we will combine pull requests and issues into a single dataset when training our NLP model. Given the nature of the data collected, it is useful to compare pull requests only against issues and pull requests combined, both when detecting duplicates in public projects and when evaluating our machine learning model. We will also explore
the differences when the model is evaluated on each
project individually.
2 RELATED WORK
All the papers reviewed in the following section have
been taken from Google Scholar, IEEE Xplore, Sci-
ence Direct, Springer, and Scopus. The following
search strings composed of key words were used to
search these sites:
"GitHub" AND "Issues" AND ("Pull requests" OR
"Pull-requests")
"GitHub" AND ("Issues" OR "Pull requests" OR
"Pull-requests") AND "Duplicate"
This will exclude articles centred around other
source code hosting platforms. In addition to applying these search strings, articles that have not been peer reviewed were excluded, along with opinion pieces and articles not published in a recognised journal or conference proceedings. This will
provide a systematic approach to selecting papers for
review, and facilitate a study of the practices asso-
ciated with pull-based development and open-source
software firstly, followed by a review of the work car-
ried out thus far in finding and reducing duplicate con-
tributions.
2.1 Pull-Based Development
Through GitHub’s public REST API, given the name
and owner of a public repository, researchers can col-
lect vast amounts of data related to the issues, pull re-
quests, comments, contributors, and so on, connected
to said repository. This has, in part, enabled stud-
ies to be carried out on the processes and method-
ologies used to manage the continuous development
of software solutions, and how human resources, in
terms of contributors and integrators, can collabo-
rate to facilitate this development. An early study by
Kalliamvakou et al. (Kalliamvakou et al., 2014) in-
vestigating the perils of mining GitHub for data found
that almost 40% of all pull requests do not appear as
merged even though they have been. A more recent
study by Wessel et al. (Wessel et al., 2023) to investi-
gate the effects of GitHub Actions on the pull request
process found that the adoption of GitHub Actions
leads to more comments on accepted pull requests and fewer comments on rejected pull requests. GitHub Ac-
tions is a continuous integration and continuous de-
livery platform that enables the automation of test-
ing and deployment workflows. From their sample
of 5,000 popular GitHub repositories, 1,489 of them
used GitHub Actions. These findings suggest a possible bias towards merged pull requests when analysing the comments on pull requests and issues.
2.2 Duplicate Contributions
Zhang et al. (Zhang et al., 2018) conducted a study
in 2018 on competing pull requests, wherein compet-
ing pull requests were defined as those that aimed to
"change the same piece of code" and were open for
overlapping periods of time. In 45 out of the 60 repos-
itories explored, over 31% of pull requests belonged to groups of competing pull requests. In 20 of the repositories, there were more than 100 groups of competing pull requests, each submitted by more than five developers. However, it should be noted that
competing pull requests are not strictly the same as
duplicate pull requests, although in some cases com-
peting pull requests may also be duplicate pull re-
quests.
Much of the research conducted around duplicate
contributions concerns two or more pull requests that are intended to add the same functionality, rather than those which simply change the same pieces of code.
This often manifests itself as pull requests which fix
the same bug, or add the same new piece of func-
tionality. A study by Li et al. (Li et al., 2020) found
that duplicate pull requests lead to the use of redun-
dant resources. Another study (Zhou et al., 2019)
found that as much as 51% of rejected pull requests
in some repositories were rejected due to redundant
development (i.e. they were duplicates of other pull
requests). Although the mean was only 3.4%, they
also found a high variance between the repositories
considered. This shows the potential value in explor-
ing our research questions on a project-to-project ba-
sis. Another study by Li et al. (Li et al., 2017) ex-
plored the possibility of using textual similarity be-
tween the titles and descriptions of pull requests to
detect duplicates. When this model was evaluated on
three GitHub projects, 55.3% - 71.0% of the dupli-
cates were found using a combination of title simi-
larity and description similarity. However, this evaluation used data from only three repositories, which does not allow for potential discrepancies between repositories in how bugs, feature requests, etc. are reported. Many open source projects provide con-
tribution guidelines (Li et al., 2022), while GitHub
also provides guidelines for contributions. Therefore,
the results of this study (Li et al., 2017) cannot ac-
count for these differences. Wang et al. proposed an
approach (Wang et al., 2019) in which they used nine
features of pull requests to train a machine learning
model to detect duplicates. Their model was trained
on data from 12 GitHub repositories and evaluated
on data from 14 different GitHub repositories and
demonstrated a recall rate of 88.7%.
A study (Yu et al., 2018) by Yu et al. has suggested
a method for collecting a dataset of duplicates con-
sisting of automatic identification followed by man-
ual verification. A program uses regex patterns to search the comments for links between pull requests and extracts the links to said pull requests. Each pair of pull requests was then manually verified to remove possible false positives. This paper provides a good
starting point for collecting a dataset of duplicates;
however, this manual verification step would appear
to be superfluous. The article does not detail the num-
ber of false positives found, but given the contents
of the data collected, combined with GitHub’s template for reporting such data (https://docs.github.com/en/issues/tracking-your-work-with-issues/marking-issues-or-pull-requests-as-a-duplicate), the prediction is that the probability of false positives being detected is very small, provided that the program works as
expected. As shown, the comments left by contribu-
tors follow particular patterns depending on the rela-
tionship they are trying to convey (Li et al., 2018), and
hence manual verification becomes redundant. This
extracted dataset is publicly available and contains
duplicate pull requests from 26 popular open-source
projects in GitHub.
2.3 Combining Issues and Pull Requests
Again, studies (Yu et al., 2018; Li et al., 2017; Zhang
et al., 2018; Li et al., 2020) only consider pull requests
that may be duplicates of other pull requests. They
do not consider the impact of combining datasets of
issues and pull requests, when performing empirical
studies and evaluating their models. Some work has
been done on matching pull requests to commits, but
there are no studies exploring the automatic detection
of duplicate issues, or the potential value in detecting
duplicates from a combined dataset of issues and pull
requests.
When pull requests are submitted to GitHub, they
are sometimes submitted as a solution to an already
existing issue. In this scenario, the pull request will
contain an explicit link to the issue that it is resolving.
If, however, two users identify the same bug in a software product, one user raises it as an issue, and the other forks the project and submits a pull request resolving the bug, unaware that an issue has already been raised concerning the same bug, then there will be no link between the issue and the pull request.
By predicting duplicates from a combined dataset of
issues and pull requests, the program will be able to
identify pull requests and issues which deal with the
same bug but are not known to do so. This can be
used to prevent duplicate pull requests.
These ideas are predicated on the assumption that
the title and description of issues serve the same pur-
pose as the title and description of pull requests. This
seems like a reasonable assumption to make, and one
which we will seek to verify in our empirical study, as
part of RQ2. However, one potential problem might
be how the level of detail presented differs between
issues and pull requests. Any user who submits a
pull request must have some level of technical ability
along with some understanding of the codebase since
they are making changes to the code. A user who
submits an issue, though, need not necessarily have
any level of technical ability or understanding. They
only really need to be able to explain the bug they
have found with a system, possibly along with steps
to recreate the bug, or the feature which they would
like added. This poses the possibility that pull re-
quests may contain much more technical jargon, com-
pared to issues which may use simpler language to
provide a higher-level overview. If this is the case,
where the descriptions of pull requests provide much
lower-level detail compared to issues, then this could
reduce the effectiveness of any natural language pro-
cessing model when comparing the text between pull
requests and issues, and would advance the case for
null hypothesis H01.
3 DATA COLLECTION
3.1 Projects Studied
Fifteen open-source projects hosted in GitHub have been selected for study and as the source of our data. The projects selected are: angular, angu-
lar.js, ansible, bootstrap, elasticsearch, electron, flut-
ter, freeCodeCamp, kubernetes, node, opencv, rails,
react, rust, and tensorflow. These projects have been
selected partly by design. As a starting point, projects
with fewer than 5000 closed issues or fewer than 5000 closed pull requests were not considered. This was to ensure that there was sufficient data available, and to maximise the probability of finding duplicate issues and pull requests. Following this, projects with fewer than 1000 contributors were also not considered. The minimum contributor limit was added on the assumption that a large number of issues and pull requests, coupled with a large number of contributors, makes it difficult for users to keep track of what work has been completed, and hence increases the likelihood that duplicate efforts will be submitted. Projects where the spoken language was anything other than English were also not considered. Some ef-
fort was made to ensure that a range of programming
languages were represented in the projects selected to
remove the possibility that our machine learning algo-
rithm would place either too much or too little value
on language-specific keywords. To this end, 12 pro-
gramming languages have been represented in the 15
open-source projects chosen, as shown in Table 1.
Some of the projects selected have been
Table 1: Project Characteristics.
Project        Pulls?  Issues?  Language
Angular        y       y        TS
Angular.js     y       n        JS
Ansible        y       y        Python
Bootstrap      y       y        JS, CSS
Elasticsearch  y       y        Java
Electron       y       y        JS, Python, C++, C
Flutter        y       y        Dart, C, C++
FreeCodeCamp   y       y        TS, React
Kubernetes     y       y        Go
Node           y       y        JS, Python, C++, C
OpenCV         y       y        Java, Python, C++
Rails          y       y        Ruby, JS
React          y       y        JS
Rust           y       y        Rust
TensorFlow     y       y        Python, C++
taken from the research previously carried out on this
subject (Yu et al., 2018; Li et al., 2017). Beyond
that, the other projects were selected at random from
a list of the top 100 starred open-source projects in GitHub (https://github.com/EvanLi/Github-Ranking/blob/master/Top100/Top-100-stars.md), provided they met the criteria in the preced-
ing paragraph. A breakdown of the numbers of is-
sues, pull requests, and contributors from the projects
selected can be seen in Table 2. Note that while some
effort was expended to try to maximise the numbers of
duplicates found in the selected projects, the data col-
lection method was designed to ensure no false pos-
itives were identified, and so could not guarantee that duplicates would be found in all of the projects. Thus, in the case of Angular.js, no duplicate issues were found.
Table 2: Issue and PR Statistics from studied projects.
Statistic Mean Median Min Max
Open Issues 2571 1391 271 12165
Closed Issues 27156 21894 7901 79838
Open PRs 336 214 61 820
Closed PRs 36083 32138 7926 77660
Contributors 2889 1907 1249 5540
3.2 Method
In GitHub, there is no mechanism to check for duplicate pieces of work, nor any mechanism external to an issue/pull request by which duplicates can be marked. Duplicates must be marked by leaving a comment on the issue or pull request in question, noting the issue or pull request of which it is a duplicate; guidelines have been provided for this (https://docs.github.com/en/issues/tracking-your-work-with-issues/marking-issues-or-pull-requests-as-a-duplicate). To
find duplicates in the projects chosen, we will there-
fore need to dig through the comments for each issue
and pull request and check if the format of the com-
ment matches the format expected for marking a du-
plicate. GitHub has made available a public REST
API which can be used to retrieve data from GitHub
projects (Kalliamvakou et al., 2016; Mombach and
Valente, 2018). Simply by knowing the name and
owner of a repository, we can work through a series of
layers to retrieve all its issues and pull requests, and
subsequently retrieve all comments for each issue/pull
request. When an issue or pull request is marked as a
duplicate, it will also be closed, and so we will only
be looking at closed issues and pull requests when
searching for duplicates.
Since there is no requirement to follow the guide-
lines on marking duplicates, variations arise in the
format of the comments to identify duplicates (Li
et al., 2017; Gousios and Zaidman, 2014). As
shown in previous research, duplicates are sometimes
marked by leaving a comment of the format "Dupli-
cate of #11111". In this context, the number is a
unique identifier which can be used to identify the is-
sue or pull request in question. On GitHub’s website,
this number also contains a hyperlink to said issue or
pull request. When this data is retrieved through the
REST API the number is lost, and only the hyper-
link remains (on the latest version of the API). There-
fore, we need not check for comments of the format
"Duplicate of #11111" as has been done previously.
GitHub limits the number of requests which can be
made to its REST API by a single user within a spec-
ified period of time. As a result of this, we were
only able to make 5000 requests to the API in each
hour (one hour after the first request is made, the limit
would be reset). Given that some of the projects to
be studied have many tens of thousands of closed is-
sues and pull requests, it would be impossible to re-
trieve and process all issues and pull requests for each
project in a reasonable time frame. Therefore only the
5000 most recently closed issues and 5000 most re-
cently closed pull requests were considered for each
project.
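As an illustration of this retrieval step, the following Python sketch (assuming the requests library and a personal access token in the GITHUB_TOKEN environment variable) pages through the most recently closed issues and pull requests of a repository and fetches their comments. The endpoints are GitHub's documented REST API, but the pagination and rate-limit handling shown here are illustrative rather than the exact console application used in this study.

import os
import time
import requests

API = "https://api.github.com"
HEADERS = {
    "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def get(url, params=None):
    # Basic rate-limit handling: if the hourly quota is exhausted, sleep until it resets.
    resp = requests.get(url, headers=HEADERS, params=params)
    if resp.status_code == 403 and resp.headers.get("X-RateLimit-Remaining") == "0":
        reset = int(resp.headers["X-RateLimit-Reset"])
        time.sleep(max(reset - time.time(), 0) + 1)
        resp = requests.get(url, headers=HEADERS, params=params)
    resp.raise_for_status()
    return resp.json()

def closed_items(owner, repo, limit=5000):
    # The issues endpoint returns both issues and pull requests;
    # items carrying a "pull_request" key are pull requests.
    fetched, page = 0, 1
    while fetched < limit:
        batch = get(f"{API}/repos/{owner}/{repo}/issues",
                    params={"state": "closed", "per_page": 100, "page": page,
                            "sort": "updated", "direction": "desc"})
        if not batch:
            break
        for item in batch:
            yield item
            fetched += 1
            if fetched >= limit:
                break
        page += 1

def comments(owner, repo, number):
    # First page of comments only; in practice this is enough to find duplicate markers.
    return get(f"{API}/repos/{owner}/{repo}/issues/{number}/comments",
               params={"per_page": 100})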
To collect the data needed, a console application
was created and hard-coded with the name and owner
of the projects selected for study. When the issues and
pull requests were retrieved for each project, and the
comments for each issue/pull request were retrieved,
regex was used to search the comments for messages
which match one of the patterns known to be used to
mark duplicates. As per previous research (Yu et al.,
2018; Li et al., 2017; Li et al., 2018), comments can
take several forms including "duplicate of...", "closed
by...", and "addressed in...". Taking this into con-
sideration, the following regex pattern has been con-
structed to find such strings in pieces of text:
(dup|addressed|close|duplicate|duplicated|closed) (\w+ )?(by|of|in) (http)
A separate regex pattern, following the same for-
mat, was then used to extract the URLs of the du-
plicates from the text. From each URL, the GitHub
API was used once again to retrieve the full details
of the item, and thus, a pair of duplicates had been
identified. In the event where an issue or pull re-
quest is a duplicate of multiple other issues/pull re-
quests, the second regex pattern allows for this by extracting multiple URLs, provided the URLs are separated by commas. From a string such as "Du-
plicate of https://github.com/..., https://github.com/...,
https://github.com/...", each of the URLs would be ex-
tracted and processed and hence a tuple of duplicates
would be recorded. When the console application had
run its course and all processing was complete, the re-
sultant data was stored in a database.
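A minimal Python sketch of this two-stage matching is shown below. The detection pattern follows the one given above (with case-insensitivity added here), while the URL-extraction pattern and the helper names are assumptions made for illustration rather than the exact console application.

import re

# Detection pattern, following the pattern given above: a marker word, an
# optional extra word, a linking word, then the start of a URL.
DUP_MARKER = re.compile(
    r"(dup|addressed|close|duplicate|duplicated|closed) (\w+ )?(by|of|in) (http)",
    re.IGNORECASE)

# Extraction pattern (an assumption, not reproduced from the paper): every
# issue or pull request URL appearing in the comment.
DUP_URLS = re.compile(r"https://github\.com/[\w.-]+/[\w.-]+/(?:issues|pull)/\d+")

def find_duplicate_links(comment_body):
    # Return the URLs of the items this comment marks as duplicates, if any.
    if DUP_MARKER.search(comment_body) is None:
        return []
    return DUP_URLS.findall(comment_body)

# A comment listing several duplicates separated by commas yields both URLs.
print(find_duplicate_links(
    "Duplicate of https://github.com/flutter/flutter/issues/1234, "
    "https://github.com/flutter/flutter/pull/5678"))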
To answer the research questions, a natural lan-
guage processing model must be trained such that
given two pieces of text, it will predict them to be
either duplicates or non-duplicates. To train such a
model, we would need data containing sets of both
duplicates and non-duplicates, but the data gathered
to this point consisted only of sets of duplicates. To
gather sets of non-duplicates, the existing sets of du-
plicates were simply shuffled to generate sets of non-
duplicates. Since there was a possibility, as allowed
for by our console application, that sets of duplicates
containing more than two items were collected, some
effort was required to ensure that the non-duplicates
were truly non-duplicate, and not simply an inversion
of an existing set of duplicates. The result of all of this
was a dataset containing 393 sets of duplicate issues,
312 sets of non-duplicate issues, 510 sets of duplicate
pull requests, and 509 sets of non-duplicate pull re-
quests. To ensure a more even balance in the datasets,
some pull requests were also taken from the dataset
produced by Yu et al. (Yu et al., 2018). These figures
are not intended to be reflective of the proportion of
duplicates which exist in the projects, either identified
(closed) or unidentified (open and unresolved).
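A sketch of how such negative pairs could be derived from the collected duplicate sets is given below; the function and variable names are illustrative, and the only constraint enforced is the one described above, namely that no generated pair coincides with a known duplicate relation.

import random

def make_non_duplicates(duplicate_sets, seed=42):
    # duplicate_sets: a list of groups, each holding the identifiers of items
    # known to be duplicates of one another (groups may contain more than two items).
    rng = random.Random(seed)

    # Record every known duplicate relation so it is never emitted as a negative.
    known = set()
    for group in duplicate_sets:
        for a in group:
            for b in group:
                if a != b:
                    known.add(frozenset((a, b)))

    items = [item for group in duplicate_sets for item in group]
    shuffled = list(items)
    rng.shuffle(shuffled)

    negatives, seen = [], set()
    for a, b in zip(items, shuffled):
        pair = frozenset((a, b))
        if a != b and pair not in known and pair not in seen:
            negatives.append((a, b))
            seen.add(pair)
    return negatives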
4 EXPERIMENTS
In considering the research questions previously laid out, we might summarise them as asking whether it is possible to predict duplicates from a combined dataset of pull requests and issues and, if so, whether there is any value in doing so. RQ1 is concerned only with the dataset collected,
used as a source of truth, from which we might train
and evaluate a machine learning model. RQ2, by con-
trast, concerns itself with how this machine learning
model might be deployed and used to explore the ex-
tent to which duplicates exist in open-source GitHub
projects, and how its deployment might be used to
identify these automatically.
4.1 Setup
Once sufficient data had been collected, a natural lan-
guage processing model would be trained to identify
duplicates from two pieces of text. To do this, the
NLP model would convert each piece of text to a se-
quence of vectors, so that the two sequences could be
compared. Following on from previous research (Ma
et al., 2019; Muennighoff et al., 2022; Devlin et al.,
2018), it was decided that a MiniLM (Wang et al.,
2020; Reimers and Gurevych, 2019) based model
should be used. A pre-trained model would be se-
lected, and fine-tuned using our own data. The Mas-
sive Text Embedding Benchmark (MTEB) leader-
board (Muennighoff et al., 2022) was, in part, used
to find a suitable model. Many of the models listed in
this leaderboard however would only accept text in-
put up to a few hundred tokens (equivalent to a few
hundred words), which would be a problem since the
title and descriptions of some of the issues and pull re-
quests in our dataset contained (an estimated) tens of
thousands of tokens. Some of the models are able to handle larger pieces of text; however, this often comes at the cost of unreasonably large and complicated models, which require substantially more computing resources to use. Another way around this issue
is to split the text up into chunks, generate a sequence
of embeddings for each chunk, and then take an aver-
age across all the embeddings to get a single sequence
which would be representative of the entire piece of
text. To do this, we split the text into chunks of 250
characters, while also ensuring words do not get split
across chunks. Thus, by applying these techniques, a
single sequence of embeddings could be generated to
represent the semantics of each piece of text.
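A minimal sketch of this chunk-and-average step, using the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint named in the next section, is shown below; the whole-word chunking is one reasonable reading of the procedure described above, not the exact implementation used.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def chunk_text(text, size=250):
    # Split text into chunks of roughly 250 characters without splitting words.
    words, chunks, current = text.split(), [], ""
    for word in words:
        if current and len(current) + 1 + len(word) > size:
            chunks.append(current)
            current = word
        else:
            current = (current + " " + word).strip()
    if current:
        chunks.append(current)
    return chunks or [""]

def embed(text):
    # Encode each chunk and average the chunk embeddings into a single vector.
    vectors = model.encode(chunk_text(text))   # shape: (n_chunks, 384)
    return vectors.mean(axis=0)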
4.2 Predicting Duplicates
To predict duplicates from the data, some measure
would be needed to compare each of the two se-
quences of embeddings and generate a score of simi-
larity between them. As per Li et al. (Li et al., 2017),
we will be using a cosine similarity measure (Man-
ning and Schutze, 1999) to do this. This will give a
single score between 0 and 1 representing how similar
the two pieces of text are, where 1 is identical and 0 is completely dissimilar. To classify duplicates from this, a threshold needs to be applied whereby two pieces of text whose similarity score meets the threshold are classified as duplicates. A range of duplicate threshold values was tested, as reported in the following sections.
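A sketch of this comparison step, reusing the embed helper from the previous sketch, is shown below; the item fields and the 0.38 default threshold (the value the evaluation in Section 5 finds most accurate) are included for illustration only.

import numpy as np

def cosine_similarity(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def is_duplicate(item_a, item_b, threshold=0.38):
    # Each item is assumed to expose a title and a body (description);
    # the pair is classified as a duplicate when the similarity meets the threshold.
    text_a = f"{item_a['title']} {item_a.get('body') or ''}"
    text_b = f"{item_b['title']} {item_b.get('body') or ''}"
    return cosine_similarity(embed(text_a), embed(text_b)) >= threshold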
4.3 Model Evaluation
To explore RQ1, the sentence-transformers (Reimers
and Gurevych, 2019) model all-MiniLM-L6-v2 (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), as listed in the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard), was selected as a starting
point and a base model. This model was fine-tuned on
our data using an 80:20 train/test split. The model was
fine-tuned on both issues and pull requests. The test data then consists of two datasets: one containing only pull requests, and one containing both issues and pull requests. In evaluating the model, we will explore
the accuracy of the model when varying the threshold
value by which duplicates are determined, and com-
pare and contrast the results when doing this for each
of the two datasets. To reject the null hypothesis, we would expect the accuracy to increase for the
combined dataset. This experiment will then be re-
peated on a per-project basis to investigate how the
results vary between projects and whether the tech-
niques used are equally applicable to all projects.
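The fine-tuning and threshold sweep might look roughly as follows; the loss function, hyperparameters, and the load_labelled_pairs loader are assumptions made for illustration, not the exact configuration used in this study.

import numpy as np
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Hypothetical loader returning (text_a, text_b, label) triples built from the
# titles and descriptions in the collected dataset; label 1.0 marks duplicates.
pairs = load_labelled_pairs()
split = int(0.8 * len(pairs))
train, test = pairs[:split], pairs[split:]

train_examples = [InputExample(texts=[a, b], label=float(y)) for a, b, y in train]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)

# CosineSimilarityLoss pushes duplicate pairs towards similarity 1 and
# non-duplicates towards 0, matching the thresholding used at prediction time.
model.fit(train_objectives=[(train_loader, losses.CosineSimilarityLoss(model))],
          epochs=2, warmup_steps=100)

# Sweep the duplicate threshold on the held-out pairs and report accuracy.
emb_a = model.encode([a for a, _, _ in test])
emb_b = model.encode([b for _, b, _ in test])
labels = np.array([bool(y) for _, _, y in test])
sims = np.sum(emb_a * emb_b, axis=1) / (
    np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1))
for threshold in np.arange(0.30, 0.51, 0.02):
    accuracy = np.mean((sims >= threshold) == labels)
    print(f"threshold={threshold:.2f}  accuracy={accuracy:.3f}")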
4.4 Empirical Study
To answer RQ2 we will take a snapshot of all the
issues and pull requests open and unresolved in the
projects studied, and use our fine-tuned model to pre-
dict sets of duplicates from among these, and investi-
gate what proportion of pull requests are (predicted)
duplicates, in comparison with the proportion of (pre-
dicted) duplicates found with the issues and pull re-
quests combined. The alternative hypothesis is that
the duplicates, as a proportion of the whole, should
increase when the pull requests and issues are com-
bined. To make the investigation more manageable,
only 525 issues and 525 pull requests have been con-
sidered as part of this experiment, with an even dis-
tribution from each project. In comparing the results
for a dataset of pull requests only to a dataset of is-
sues and pull requests combined, the latter dataset will
be larger and therefore would be expected to contain
more duplicates; for this reason, the volume of duplicates found will be represented relative to the size of the dataset, as the results would otherwise lose their value.
This experiment will also be repeated on a per-project
basis to explore how the number of duplicates found
varies by project, and whether combining issues and pull requests into a single dataset has an equal effect across projects. Since an equal number
of issues and pull requests have been taken from each
project, it would be expected that the relative increase
in duplicates found would be similar across projects.
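The counting step for this study could be sketched as follows, reusing the embed and cosine_similarity helpers from the earlier sketches; grouping pairs above the threshold by transitive closure is an assumption about how "sets of duplicates" are formed, made for illustration.

from itertools import combinations

def predicted_duplicate_sets(items, threshold=0.38):
    # items: dicts with 'title' and 'body' taken from the open/unresolved snapshot.
    vectors = [embed(f"{it['title']} {it.get('body') or ''}") for it in items]
    parent = list(range(len(items)))    # union-find over item indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair whose similarity meets the threshold, then read off the groups.
    for i, j in combinations(range(len(items)), 2):
        if cosine_similarity(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(items)):
        groups.setdefault(find(i), []).append(items[i])
    return [g for g in groups.values() if len(g) > 1]

def relative_duplicate_volume(items):
    # Number of predicted duplicate sets relative to the number of items considered.
    return len(predicted_duplicate_sets(items)) / len(items)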
5 RESULTS
5.1 Model Evaluation
The accuracy measured across various duplicate
thresholds for each of pull requests only, and issues
and pull requests combined, can be seen in Fig. 1.
The results for the two datasets follow the same pat-
tern, with both peaking at the same point, and the
combined dataset returning better results across most
duplicate thresholds. The combined dataset achieved
a peak accuracy of 93.9%, which occurred when the
threshold was set at 0.38. The pull requests dataset
achieved a peak accuracy of 90.6%, which also oc-
curred when the threshold was set at 0.38. The ac-
curacy of the model was on average 4.25 percentage
points higher when using the combined dataset. A
recall rate of 90.5% was observed for the combined
dataset with the duplicate threshold set at 0.38, while
the dataset of pull requests only displayed a recall rate
of 83.2% for the same duplicate threshold. The re-
sults from both of our datasets are a significant im-
provement on those from Li et al. (Li et al., 2017)
(who considered only pull requests) in that our model
can predict duplicates with a significantly greater ac-
curacy and recall rate.
Fig. 2 shows the accuracy in predicting duplicates
from the combined dataset compared to pull requests
only, when evaluated for each repository individually.
In doing this, the threshold for duplicates was set to
0.38 which was found to produce the greatest overall
accuracy. In 11 out of the 15 projects, the accuracy
was higher when evaluated on the combined dataset.
The results are reasonably consistent between the two
datasets, and also between projects. The highest accuracy is 100%, which occurs for multiple projects and for both datasets. Angular returned an accuracy of only 50% when considering pull requests only, which is significantly less than its accuracy for the combined dataset, while opencv returned 100% accuracy across both datasets.
Figure 1: Accuracies for Varying Thresholds.
Figure 2: Accuracy evaluated for each repository.
5.2 Duplicate Prediction
Fig. 3 shows the volume of (predicted) sets of dupli-
cates found, relative to the size of the input data (num-
ber of issues and pull requests considered), across
several duplicate thresholds. The two datasets largely
follow the same trend, albeit with clear separation be-
tween the results for each. For a duplicate threshold of 0.38 (for which we achieved the greatest ac-
curacy in the previous section), there was a signif-
icant difference between the two datasets. Fig. 4
shows the volume of duplicates found at this thresh-
old for each on a per-project basis. The median of
the differences was 7.84 with a 95% confidence interval ranging from 6.43 to 8.81 (Wilcoxon signed-rank test V=0, p < 0.001). This implies that the dataset comprising issues and pulls produces a consistently higher volume of duplicates relative to the size of the input than the pulls-only dataset. Indeed, in all of the projects studied, the volume of duplicates found was much greater for the combined dataset compared to the dataset of pull requests only. The greatest observed percentage difference between the two datasets was for Angular.js, where using the combined dataset improved the detection by a factor of 2.8.
Figure 3: Sets of Duplicates Predicted Relative to Input Size.
Figure 4: Relative duplicates found for each project.
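For reference, the paired comparison reported above can be reproduced with a few lines of Python; the input arrays are the per-project relative volumes behind Fig. 4 (not reproduced here), and the one-sided alternative is an assumption matching the direction of the hypothesis.

import numpy as np
from scipy.stats import wilcoxon

def compare_per_project(combined, pulls_only):
    # combined, pulls_only: one relative duplicate volume per project (15 values here).
    # A one-sided test of whether the combined dataset yields a higher volume.
    statistic, p_value = wilcoxon(combined, pulls_only, alternative="greater")
    return np.median(np.asarray(combined) - np.asarray(pulls_only)), statistic, p_value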
5.3 Threats to Validity
One threat to the internal validity of this work lies in
the method of identifying duplicates from resolved is-
sues and pull requests. The regex pattern searches the
comments of issues and pull requests for messages
of the format ‘duplicate of https://github.com/...’. If
a comment were to be found which asks, rather than
asserts, whether a particular contribution is a dupli-
cate of another, then this could be incorrectly identi-
fied as a duplicate. Although the console application
searched for duplicates in the same number of issues
and pull requests from each project, the numbers of
duplicates found were not evenly distributed among
projects. This might therefore be a threat to the va-
lidity of the per-project results. Given the relatively
small sample size of 15 projects, a post-hoc power
analysis (Cohen, 2013) was conducted to evaluate sta-
tistical power. The observed effect size between the combined dataset and the pulls-only dataset was 7.84, and the resulting power of the test was 1, indicating that the study had a very high probability of detecting a true effect if one exists. The splitting of the
text into chunks of 250 characters when training and
testing the NLP model may also pose a threat to the
validity of this work since sections of meaning may
be split between chunks and hence lost. This might
reduce the effectiveness of an NLP model trained on
these chunks.
Affecting the external validity, while efforts have
been made to ensure that a range of programming lan-
guages, technologies, and application domains were
represented in the projects studied, there remains
some potential that these results are not entirely re-
flective of the wider GitHub community. Further-
more, only open source GitHub projects were stud-
ied, so we cannot claim validity outside of this. For example, in a smaller, co-located agile team there may
be a better inherent understanding of the relationship
between issues and pulls (Taylor et al., 2006).
6 CONCLUSION
The aim of this paper is to assess the feasibility
of automatically detecting duplicates in open-source
GitHub projects using natural language processing
techniques, and the potential benefits of applying
these techniques on issues and pull requests alike.
We evaluated a Sentence Transformers model on a dataset of known duplicate issues and pull requests and found it to be 93.9% accurate in detecting duplicates. The accuracy was also found to be 4.25 per-
centage points higher when considering issues and
pull requests together, compared to pull requests only,
thus supporting the alternative hypothesis that com-
bining issues and pull requests into a single dataset
increases the accuracy of our NLP model. This supports the assumption that the title and
description of issues serve the same purpose as the
title and description of pull requests, and that the
level of technical detail is similar between both. This
model was then used to predict duplicates from a
snapshot of unresolved issues and pull requests; the volume of duplicates found when considering issues and pull requests together was consistently higher than when considering pull requests only, which supports the alternative hypothesis that combining issues and pull requests into a single dataset significantly increases the volume of duplicates found.
These results demonstrate the practical value of
automatically detecting duplicates using semantic
similarity techniques in that not only can we detect
duplicates with great accuracy, but by considering is-
sues and pull requests together we can detect a sig-
nificantly greater volume of duplicates than we would
if we were only considering pull requests. Our re-
call rate of 90.5% was 20.0 percentage points higher
than that demonstrated by Li et al. (Li et al., 2017),
while our model was evaluated on a larger number of
GitHub repositories. Our recall rate is also 1.8 per-
centage points higher than that displayed by Wang et
al. (Wang et al., 2019), with their model being eval-
uated on 14 GitHub repositories. Lazar et al. (Lazar
et al., 2014) displayed a recall rate of 100% in de-
tecting duplicate bug reports from three open-source
systems, although it is unclear to what degree the choice of open-source systems studied affects the results. Our model was evaluated on GitHub bug reports, whereas theirs was evaluated on bug reports from other systems.
Extending our study to include other sources of
bug reports would strengthen the work and consti-
tutes future work. Further, as opposed to detecting
duplicates from issues and pull requests which have
already been submitted, one avenue of potential fur-
ther study lies in detecting duplicate pieces of work at
their inception. In the case of pull requests in partic-
ular, there may be some value in incorporating a ma-
chine learning model into the GitHub forking process
through which contributors can check if their ideas or
bug fixes have already been added as part of another
pull request, thus detecting potential duplicates before
development work has begun. In thinking about is-
sues, a machine learning model might be incorporated
into the submission process. Such work, along with the findings in this paper, could inform efforts to alleviate the current waste of resources caused by duplicate pull requests and the resulting burden on those reviewing open source code.
REFERENCES
Cohen, J. (2013). Statistical power analysis for the behav-
ioral sciences. Routledge.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2018). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. arXiv preprint
arXiv:1810.04805.
Gousios, G., Pinzger, M., and Deursen, A. v. (2014). An
exploratory study of the pull-based software develop-
ment model. In Proceedings of the 36th international
conference on software engineering, pages 345–355.
Gousios, G., Storey, M.-A., and Bacchelli, A. (2016). Work
practices and challenges in pull-based development:
The contributor’s perspective. In Proceedings of the
38th International Conference on Software Engineer-
ing, pages 285–296.
Gousios, G. and Zaidman, A. (2014). A dataset for pull-
based development research. In Proceedings of the
11th Working Conference on Mining Software Repos-
itories, pages 368–371.
Gousios, G., Zaidman, A., Storey, M.-A., and Van Deursen,
A. (2015). Work practices and challenges in pull-
based development: The integrator’s perspective. In
2015 IEEE/ACM 37th IEEE International Conference
on Software Engineering, volume 1, pages 358–368.
IEEE.
Jiang, J., Lo, D., He, J., Xia, X., Kochhar, P. S., and Zhang,
L. (2017). Why and how developers fork what from
whom in github. Empirical Software Engineering,
22:547–578.
Kalliamvakou, E., Gousios, G., Blincoe, K., Singer, L., Ger-
man, D., and Damian, D. (2016). An in-depth study
of the promises and perils of mining github. Empirical
Software Engineering, 21(4):2035–2071.
Kalliamvakou, E., Gousios, G., Blincoe, K., Singer, L., Ger-
man, D. M., and Damian, D. (2014). The promises and
perils of mining github. In Proceedings of the 11th
working conference on mining software repositories,
pages 92–101.
Lazar, A., Ritchey, S., and Sharif, B. (2014). Improving
the accuracy of duplicate bug report detection using
textual similarity measures. In Proceedings of the
11th Working Conference on Mining Software Repos-
itories, pages 308–311.
Li, L., Ren, Z., Li, X., Zou, W., and Jiang, H. (2018). How
are issue units linked? empirical study on the linking
behavior in github. In 2018 25th Asia-Pacific Soft-
ware Engineering Conference (APSEC), pages 386–
395. IEEE.
Li, Z., Yin, G., Yu, Y., Wang, T., and Wang, H. (2017).
Detecting duplicate pull-requests in github. In Pro-
ceedings of the 9th Asia-Pacific symposium on inter-
netware, pages 1–6.
Li, Z., Yu, Y., Wang, T., Lei, Y., Wang, Y., and Wang, H.
(2022). To follow or not to follow: Understanding
issue/pull-request templates on github. IEEE Trans-
actions on Software Engineering, 49(4):2530–2544.
Li, Z., Yu, Y., Zhou, M., Wang, T., Yin, G., Lan, L., and
Wang, H. (2020). Redundancy, context, and prefer-
ence: An empirical study of duplicate pull requests
in oss projects. IEEE Transactions on Software Engi-
neering, 48(4):1309–1335.
Ma, X., Wang, Z., Ng, P., Nallapati, R., and Xiang, B.
(2019). Universal text representation from bert: An
empirical study. arXiv preprint arXiv:1910.07973.
Manning, C. and Schutze, H. (1999). Foundations of statis-
tical natural language processing. MIT press.
McClean, K., Greer, D., and Jurek-Loughrey, A. (2021).
Social network analysis of open source software: A
review and categorisation. Information and Software
Technology, 130:106442.
Mombach, T. and Valente, M. T. (2018). Github rest api vs
ghtorrent vs github archive: A comparative study.
Muennighoff, N., Tazi, N., Magne, L., and Reimers, N.
(2022). Mteb: Massive text embedding benchmark.
arXiv preprint arXiv:2210.07316.
Reimers, N. and Gurevych, I. (2019). Sentence-bert: Sen-
tence embeddings using siamese bert-networks. arXiv
preprint arXiv:1908.10084.
Runeson, P., Alexandersson, M., and Nyholm, O. (2007).
Detection of duplicate defect reports using natural lan-
guage processing. In 29th International Conference
on Software Engineering (ICSE’07), pages 499–510.
IEEE.
Taylor, P. S., Greer, D., Sage, P., Coleman, G., McDaid,
K., Lawthers, I., and Corr, R. (2006). Applying
an agility/discipline assessment for a small software
organisation. In Product-Focused Software Process
Improvement: 7th International Conference, PRO-
FES 2006, Amsterdam, The Netherlands, June 12-14,
2006. Proceedings 7, pages 290–304. Springer.
Wang, Q., Xu, B., Xia, X., Wang, T., and Li, S. (2019).
Duplicate pull request detection: When time matters.
In Proceedings of the 11th Asia-Pacific symposium on
internetware, pages 1–10.
Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., and Zhou,
M. (2020). Minilm: Deep self-attention distillation for
task-agnostic compression of pre-trained transform-
ers.
Wessel, M., Vargovich, J., Gerosa, M. A., and Treude, C.
(2023). Github actions: the impact on the pull request
process. Empirical Software Engineering, 28(6):1–35.
Yu, Y., Li, Z., Yin, G., Wang, T., and Wang, H. (2018). A
dataset of duplicate pull-requests in github. In Pro-
ceedings of the 15th international conference on min-
ing software repositories, pages 22–25.
Zhang, X., Chen, Y., Gu, Y., Zou, W., Xie, X., Jia, X., and
Xuan, J. (2018). How do multiple pull requests change
the same code: A study of competing pull requests
in github. In 2018 IEEE International Conference on
Software Maintenance and Evolution (ICSME), pages
228–239. IEEE.
Zhou, S., Vasilescu, B., and Kästner, C. (2019). What the
fork: A study of inefficient and efficient forking prac-
tices in social coding. In Proceedings of the 2019 27th
ACM joint meeting on european software engineering
conference and symposium on the foundations of soft-
ware engineering, pages 350–361.