Yet Another Miner Utility Unveiling a Dataset: CodeGrain

Dániel Horváth, László Vidács

2024

Abstract

Automated program repair (APR) has gained increasing attention over the years, from both academic and industrial perspectives. The overall goal of APR is to reduce the cost of development and maintenance by automatically finding and fixing common bugs, typos, or errors in code. A successful and heavily researched approach is to use deep-learning (DL) techniques for this task. DL methods are known to be very data-hungry, yet readily available data is hard to find online, which poses a challenge to the development of such solutions. In this paper, we address this issue by providing a new dataset consisting of 371,483 bug-fixing code examples, while also introducing a method that other researchers could use as a feature in their mining software. We extracted code from 5,273 different repositories and 250,090 different commits. Our work contributes to related research by providing a publicly accessible dataset on which DL models could be trained or fine-tuned, and a method that integrates easily with almost any code mining tool as a language-independent feature, giving more granular choices when extracting code parts from a specific bugfix commit. The dataset also includes the summary and message of each commit. The training data covers multiple programming languages, including C, C++, Java, JavaScript, and Python.

Paper Citation


in Harvard Style

Horváth D. and Vidács L. (2024). Yet Another Miner Utility Unveiling a Dataset: CodeGrain. In Proceedings of the 13th International Conference on Data Science, Technology and Applications - Volume 1: DATA; ISBN 978-989-758-707-8, SciTePress, pages 338-345. DOI: 10.5220/0012760100003756


in Bibtex Style

@conference{data24,
author={Dániel Horváth and László Vidács},
title={Yet Another Miner Utility Unveiling a Dataset: CodeGrain},
booktitle={Proceedings of the 13th International Conference on Data Science, Technology and Applications - Volume 1: DATA},
year={2024},
pages={338-345},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012760100003756},
isbn={978-989-758-707-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 13th International Conference on Data Science, Technology and Applications - Volume 1: DATA
TI - Yet Another Miner Utility Unveiling a Dataset: CodeGrain
SN - 978-989-758-707-8
AU - Horváth D.
AU - Vidács L.
PY - 2024
SP - 338
EP - 345
DO - 10.5220/0012760100003756
PB - SciTePress