Is It Professional or Exploratory? Classifying Repositories Through README Analysis

Maximilian Auch, Maximilian Balluff, Peter Mandl, Christian Wolff

2025

Abstract

This study introduces a new approach for determining whether GitHub repositories are professional or exploratory by analyzing their README.md files. We crawled and manually labeled a dataset of over 200 repositories to evaluate various classification methods. We compared state-of-the-art Large Language Models (LLMs) against traditional Natural Language Processing (NLP) techniques, including term-frequency similarity and word-embedding-based nearest neighbors using RoBERTa. The results demonstrate the advantages of LLMs on this classification task. With zero-shot classification and no multi-step reasoning, GPT-4o achieved the highest overall accuracy. Few-shot learning produced mixed results across models. Llama 3 (70b) achieved 89.5% accuracy with multi-step reasoning, though such improvements were not consistent across all models. Our experiments with word-probability threshold filtering likewise showed mixed results. Our findings highlight important considerations regarding the balance between accuracy, processing speed, and operational costs. For time-critical applications, we found that direct prompts without multi-step reasoning are the most efficient approach, while model size had a smaller effect. Overall, README.md content proved sufficient for accurate classification in approximately 70% of cases.
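To make the term-frequency baseline mentioned in the abstract more concrete, the following is a minimal sketch of one way such a classifier could look: each README is reduced to raw term counts, and a new README takes the label of its most similar labeled neighbor under cosine similarity. The tokenization, the 1-nearest-neighbor rule, the function names, and the two toy examples are all assumptions made for illustration; they are not taken from the paper and do not reproduce its pipeline or data.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text, split it into alphanumeric tokens, and count them."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty README is similar to nothing
    return dot / (norm_a * norm_b)


def classify_readme(readme, labeled_examples):
    """Return the label of the most similar labeled README (1-nearest-neighbor)."""
    query = term_frequencies(readme)
    best_label, best_score = None, -1.0
    for text, label in labeled_examples:
        score = cosine_similarity(query, term_frequencies(text))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy examples, NOT drawn from the paper's labeled dataset.
EXAMPLES = [
    ("Installation guide, release notes, contributing guidelines, license and CI status.",
     "professional"),
    ("My first attempt at learning Rust. Just playing around with a tutorial.",
     "exploratory"),
]

label = classify_readme("Weekend experiment: playing around with a Rust tutorial.", EXAMPLES)
print(label)
```

A realistic version would need many labeled examples per class and probably some weighting of common words; the sketch only shows the shape of a similarity-based baseline that the LLM approaches are compared against.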


Paper Citation


in Harvard Style

Auch M., Balluff M., Mandl P. and Wolff C. (2025). Is It Professional or Exploratory? Classifying Repositories Through README Analysis. In Proceedings of the 20th International Conference on Evaluation of Novel Approaches to Software Engineering - Volume 1: ENASE; ISBN 978-989-758-742-9, SciTePress, pages 457-467. DOI: 10.5220/0013272500003928


in Bibtex Style

@conference{enase25,
author={Maximilian Auch and Maximilian Balluff and Peter Mandl and Christian Wolff},
title={Is It Professional or Exploratory? Classifying Repositories Through README Analysis},
booktitle={Proceedings of the 20th International Conference on Evaluation of Novel Approaches to Software Engineering - Volume 1: ENASE},
year={2025},
pages={457-467},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013272500003928},
isbn={978-989-758-742-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 20th International Conference on Evaluation of Novel Approaches to Software Engineering - Volume 1: ENASE
TI - Is It Professional or Exploratory? Classifying Repositories Through README Analysis
SN - 978-989-758-742-9
AU - Auch M.
AU - Balluff M.
AU - Mandl P.
AU - Wolff C.
PY - 2025
SP - 457
EP - 467
DO - 10.5220/0013272500003928
PB - SciTePress