Is It Professional or Exploratory? Classifying Repositories Through README Analysis
Maximilian Auch, Maximilian Balluff, Peter Mandl, Christian Wolff
2025
Abstract
This study introduces a new approach for determining whether GitHub repositories are professional or exploratory by analyzing their README.md files. We crawled and manually labeled a dataset of over 200 repositories to evaluate several classification methods. We compared state-of-the-art Large Language Models (LLMs) against traditional Natural Language Processing (NLP) techniques, including term-frequency similarity and nearest-neighbor search over word embeddings from RoBERTa. The results demonstrate the advantages of LLMs on this classification task. In zero-shot classification without multi-step reasoning, GPT-4o achieved the highest overall accuracy. Few-shot learning produced mixed results across models. Llama 3 (70b) achieved 89.5% accuracy with multi-step reasoning, though such improvements were not consistent across all models. Our experiments with word probability threshold filtering likewise showed mixed results. Our findings highlight important considerations regarding the balance between accuracy, processing speed, and operational costs. For time-critical applications, we found that direct prompts without multi-step reasoning provide the most efficient approach, while model size made a smaller contribution. Overall, README.md content proved sufficient for accurate classification in approximately 70% of cases.
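To make the term-frequency baseline mentioned in the abstract more concrete, the following is a minimal sketch of a 1-nearest-neighbor classifier over raw term-frequency vectors compared by cosine similarity. It is not the authors' implementation: the tokenization, the choice of cosine similarity, the function names, and the two example READMEs are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text, split it into alphanumeric tokens, and count them."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def classify(readme, labeled_examples):
    """Return the label of the most similar labeled README (1-nearest-neighbor)."""
    query = term_frequencies(readme)
    best_label, _ = max(
        ((label, cosine_similarity(query, term_frequencies(text)))
         for text, label in labeled_examples),
        key=lambda pair: pair[1],
    )
    return best_label


# Hypothetical labeled READMEs, invented for illustration only.
EXAMPLES = [
    ("Installation, license, contributing guidelines, release notes "
     "and API documentation.", "professional"),
    ("My notes and experiments while learning the tutorial, "
     "just testing stuff.", "exploratory"),
]

print(classify("Testing some tutorial experiments for a course", EXAMPLES))
```

A realistic baseline would presumably use many labeled READMEs per class and some form of term weighting; the sketch only shows the shape of the similarity-based decision that the LLM approaches are compared against.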
Paper Citation
in Harvard Style
Auch M., Balluff M., Mandl P. and Wolff C. (2025). Is It Professional or Exploratory? Classifying Repositories Through README Analysis. In Proceedings of the 20th International Conference on Evaluation of Novel Approaches to Software Engineering - Volume 1: ENASE; ISBN 978-989-758-742-9, SciTePress, pages 457-467. DOI: 10.5220/0013272500003928
in Bibtex Style
@conference{enase25,
author={Maximilian Auch and Maximilian Balluff and Peter Mandl and Christian Wolff},
title={Is It Professional or Exploratory? Classifying Repositories Through README Analysis},
booktitle={Proceedings of the 20th International Conference on Evaluation of Novel Approaches to Software Engineering - Volume 1: ENASE},
year={2025},
pages={457-467},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013272500003928},
isbn={978-989-758-742-9},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 20th International Conference on Evaluation of Novel Approaches to Software Engineering - Volume 1: ENASE
TI - Is It Professional or Exploratory? Classifying Repositories Through README Analysis
SN - 978-989-758-742-9
AU - Auch M.
AU - Balluff M.
AU - Mandl P.
AU - Wolff C.
PY - 2025
SP - 457
EP - 467
DO - 10.5220/0013272500003928
PB - SciTePress