Refining English Writing Proficiency Assessment and Placement in
Developmental Education Using NLP Tools and Machine Learning
Miguel Da Corte 1,2 (https://orcid.org/0000-0001-8782-8377) and Jorge Baptista 1,2 (https://orcid.org/0000-0003-4603-4364)
1 University of Algarve, Faro, Portugal
2 INESC-ID Lisboa, Lisbon, Portugal
Keywords:
Developmental Education (DevEd), Automatic Writing Assessment Systems, Natural Language Processing
(NLP), Machine-Learning Models.
Abstract:
This study investigates the enhancement of English writing proficiency assessment and placement for Devel-
opmental Education (DevEd) within U.S. colleges using Natural Language Processing (NLP) and Machine
Learning (ML). Existing automated placement tools, such as ACCUPLACER, often lack transparency and
struggle to identify nuanced linguistic features necessary for accurate skill-level classification. By integrat-
ing human-annotated linguistic features, this study aims to contribute to equitable and transparent placement
systems that better address students’ academic needs, reducing misplacements and their associated costs.
For this study, a 300-essay corpus was compiled and manually annotated with a refined set of 11 DevEd-
specific (DES) features, alongside 328 linguistic features automatically extracted from CTAP and 106 via
COH-METRIX. Supervised ML algorithms were used to compare ACCUPLACER-generated classifications
with human ratings, assessing classification accuracy and identifying predictive features. This analysis re-
vealed gaps in ACCUPLACER's classification capabilities. Experimental results showed that models incorporating DES features improved classification accuracy, with Naïve Bayes (NB) and Support Vector Machine (SVM) achieving scores up to 80%. The refined features and methodology presented here offer actionable insights for faculty and institutions, potentially contributing to more effective DevEd course placements and targeted instructional interventions.
1 INTRODUCTION AND
OBJECTIVES
Developmental Education (DevEd) courses play a
crucial role in equipping students who are not yet
prepared for college-level work by developing their
English writing skills and ensuring they are academi-
cally prepared to enter a college program. Placement
into DevEd or college-level courses is typically de-
termined by standardized test scores, which influence
not only students’ educational trajectories but also
carry economic consequences, as students with lower
scores are required to complete one or two semesters
of remedial coursework (Bickerstaff et al., 2022).
This study aims to improve English writing proficiency assessments and placement within U.S. community colleges, where DevEd support is most needed. Approximately 62% of individuals, ages 16 to 24, enrolled in colleges or universities in 2023, according to the United States Bureau of Labor Statistics (https://www.bls.gov). At Tulsa Community College (https://www.tulsacc.edu), where this research was conducted, around 30% of incoming students require developmental support in multiple areas, including English (reading and writing) and Math. The primary focus of this investigation is on English as an L1, though some observations may apply to the description of other languages.
Current automated placement tools, such as ACCUPLACER, align with academic standards but are not specifically calibrated to detect patterns unique to students requiring DevEd support. Moreover, ACCUPLACER's "black box" approach limits transparency and interpretability, posing challenges in educational contexts. This study seeks to address this issue by refining the linguistic descriptors used in placement assessments through the identification of key features that better capture writing proficiency in DevEd students.
By leveraging both human annotations and NLP-derived features, this research aims to enhance placement accuracy and provide actionable insights for instructional support.
For automated placement systems to become ef-
fective tools for both faculty and students, the use
of high-quality annotated corpora is essential, despite
the time-intensive nature of annotation tasks. These
annotations, crucial for training and testing automated
scoring systems, enhance the reliability and applica-
bility of placement decisions, ultimately supporting
equitable access to education. This study addresses
the need for accuracy in DevEd placements by re-
fining and expanding an annotated dataset, integrat-
ing Natural Language Processing (NLP) tools for au-
tomatic feature extraction, and supplementing these
with novel, manually encoded DevEd features.
The methodology employs a 300-essay corpus annotated with 11 DevEd-specific features, complemented by linguistic features automatically extracted through NLP tools, namely COH-METRIX (McNamara et al., 2006; https://141.225.61.35/CohMetrix2017/) and the Common Text Analysis Platform (CTAP) (Chen and Meurers, 2016; https://sifnos.sfs.uni-tuebingen.de/ctap/). This corpus will be used to train and test various ML classifiers against both ACCUPLACER's automated classifications and human ratings, assessing classification accuracy to identify the most predictive features and the most effective algorithms for DevEd placements. A suite of well-known ML algorithms from the ORANGE data-mining tool (Demšar et al., 2013; https://orangedatamining.com/) is used, as the experiments with the models it makes available can be easily replicated, providing a framework for future comparisons.
Based on these motivations and the existing limi-
tations mentioned, the purpose of this paper is to: (i)
analyze linguistic features in an expanded corpus us-
ing COH-METRIX, CTAP, and a manually annotated
set of DevEd-specific (DES) features to enhance text
classification accuracy; (ii) assess the impact of these
combined features on classification accuracy for stu-
dent placement in DevEd courses through supervised
ML experiments; (iii) compare classification accuracy
rates achieved with the larger corpus and refined fea-
ture set to a previously established baseline; and (iv)
offer actionable insights into students’ linguistic abil-
ities to support accurate placement and targeted in-
structional support in DevEd courses.
2 RELATED WORK
Concerns about the accuracy of automated classifica-
tion systems like ACCUPLACER stem from the limited
detail in the linguistic features considered during the
classification process, as well as uncertainty around
how these features are defined and ranked (Roscoe
et al., 2014; Johnson et al., 2022). This lack of trans-
parency makes it difficult to clearly explain to stu-
dents which linguistic features were evaluated to de-
termine their placement, what defines college-level
writing, and which patterns they should address to
improve their skills. Without such clarity, students
may lack the necessary guidance to effectively en-
hance their writing abilities (Ma et al., 2023).
Studies suggest that up to one-third of placements
may be inaccurate (Ganga and Mazzariello, 2019).
Inaccurate placement in the context of DevEd often
leads to students being assigned to courses that either
underestimate or overestimate their writing abilities,
both of which carry significant implications, not only
economically, but for student outcomes and institu-
tional effectiveness (Hughes and Li, 2019; Link and
Koltovskaia, 2023; Edgecombe and Weiss, 2024).
Assessing students’ linguistic skills solely through
test scores may overlook critical aspects of their lan-
guage abilities at the onset of their academic jour-
ney, highlighting the need for a more nuanced eval-
uation approach. As institutions seek to develop and
support students’ literacy-related skills, especially the
descriptive quality of their written work (Kosiewicz
et al., 2023), it becomes essential to explore and ar-
ticulate the developmental aspects of writing for this
population. Among the features worth exploring, the
literature emphasizes the lexical and syntactic proper-
ties of a text, followed by the text’s length, portraying
them as key indicators of writing development (Sta-
vans and Zadunaisky-Ehrlich, 2024).
Text structure, a key measure of text quality,
constitutes another property worth exploring (Kyle
et al., 2021; Feller et al., 2024). Texts that rely on
single lines or simple, non-multilayered paragraphs
are identified as needing remediation, while those
with well-structured paragraphs—including a clear
introduction, supporting statements, and a conclu-
sion—are better aligned with academic writing stan-
dards (Da Corte and Baptista, 2025). Additionally,
texts that demonstrate a varied grammatical reper-
toire, grammatical complexity, and purposeful gram-
matical choices are more in line with these standards,
showcasing the writer’s ability to use grammar effec-
tively (Nygård and Hundal, 2024).
Studies on DevEd placement for English (L1) speakers have focused on refining text-ranking tasks
and improving classification accuracy (CA) through corpora annotation (Götz and Granger, 2024) and lexical and syntactic analysis supported by text mining (Lee and Lee, 2023). Key NLP tools, such as COH-METRIX (McNamara et al., 2006) and CTAP (Chen and Meurers, 2016), have proven essential for assess-
ing linguistic complexity across languages. While
this study focuses on English, the identification and
development of linguistic features for enhanced ac-
curacy in automatic language proficiency classifica-
tion is applicable beyond this language. For instance,
(Leal et al., 2023) adapted COH-METRIX for Brazil-
ian Portuguese and (Okinina et al., 2020) extended
CTAP to Italian. (Akef et al., 2023) applied CTAP
to Portuguese proficiency assessment, achieving 76%
CA and emphasizing the role of feature selection.
Similarly, (Wilkens et al., 2022) investigated lexical
diversity and syntactic dependency relations for eval-
uating French language development.
Exploring further the above-mentioned contributions, lexical features such as word n-grams, part-of-speech (POS) n-grams, POS-tag ratios, and the type-token ratio (TTR), which reflects the diversity of word types in a text, play a crucial role in these analyses, alongside lexical variation features (e.g., noun, adjective, adverb, and verb variation) and metrics such as lexical density, which capture the breadth of students' linguistic proficiency (Vajjala and Meurers, 2012). This confirms the importance of incorporat-
ing rich linguistic features into automated classifica-
tion systems to enhance the accuracy of writing as-
sessments across languages (Vajjala and Lučić, 2018; Vajjala, 2022).
By focusing on descriptive features that more ac-
curately reflect native English proficiency, automatic
classification systems have shown measurable im-
provements in placement accuracy (Da Corte and
Baptista, 2024c; Da Corte and Baptista, 2024b;
Da Corte and Baptista, 2025), providing valuable in-
sights into students’ readiness for college and their
linguistic progression over time. Studies such as
(Pal and Pal, 2013) have demonstrated this poten-
tial by using the WEKA machine-learning platform
with models like Naïve Bayes, Multilayer Perceptron,
and Decision Trees, achieving a classification accu-
racy (CA) close to 90% in course placement. Sim-
ilarly, (Filighera et al., 2019) utilized Neural Net-
works and embeddings to categorize texts by read-
ing level with approximately 80% accuracy, while
(Hirokawa, 2018) and (Duch et al., 2024) further
demonstrated the effectiveness of classical ML al-
gorithms—including Naïve Bayes, Neural Networks,
and Random Forest—in feature selection and out-
come prediction, with CA rates above 90%.
In alignment with the motivation and objectives of
this study, the findings presented in the state-of-the-
art review confirm the potential of machine learning
models to enhance DevEd placement by emphasizing
relevant linguistic features, helping institutions better
support students both academically and linguistically
(Bickerstaff et al., 2022; Giordano et al., 2024).
3 METHODS
The methodology for this study involves a systematic
and detailed setup that prioritizes ethical participant
selection and focuses on linguistic descriptor relia-
bility for DevEd placement, all done in accordance
with the Institution’s Review Board (IRB) protocols
(ID #22-05; https://www.tulsacc.edu/irb). The methodology was designed to ad-
here to guidelines sensitive to the unique academic
challenges experienced by this group.
3.1 Corpus
This study extends a previous classification task con-
ducted with a small corpus of 100 text samples
(Da Corte and Baptista, 2024a; Da Corte and Bap-
tista, 2025) by adding 200 more essays, resulting in a
systematically selected corpus of 300 essays totaling
97,339 tokens (corpus dataset to be made available after paper publication). The corpus focuses on non-college-
level classifications to enhance the precision of place-
ment for students requiring DevEd support. The texts
were written by students seeking college admission
during the 2021-2022 and 2023-2024 academic years
and were drawn from a larger pool of 1,000 essays
within the institution’s standardized entrance exam
database. The selected essays cover 11 distinct writ-
ing topics, including Is History Valuable, Necessary
to Make Mistakes, Acquisition of Money, Differences
Among People, Happiness Not an Accident, and Inde-
pendent Ideas, to mention a few. These essays were
written in a controlled, proctored environment at the
institution’s testing center, ensuring no Internet access
or editing tools were available.
The selection criteria aimed at compiling a cor-
pus focused on non-college-level classifications, ac-
curately representing students in DevEd and their
placement levels as determined by ACCUPLACER into
Levels 1 or 2. The corpus was balanced by level,
with essays averaging 324 tokens each, making it
well-suited for examining classification by level. This
dataset provides a valuable foundation for assessing
language proficiency among community college stu-
dents. Demographic information—including gender,
first language, years of English studied in high school,
and race—was available for 67% of participants but
was omitted from this stage of analysis. This data will
be analyzed in detail in a future study.
3.2 Corpus Annotation
As an initial step in this study, two trained raters anno-
tated the corpus with a refined set of 11 DES features
devised by (Da Corte and Baptista, 2024a) (https://gitlab.hlt.inesc-id.pt/u000803/deved/). Selected
through an open call for volunteers, the annotators
were equally represented in terms of gender distribu-
tion (1 male and 1 female), were both native English
speakers with advanced English skills, held at least a
Bachelor’s degree, and had experience in higher edu-
cation.
The refined set of features was derived from an initial list of 21 and narrowed down to 11 using ORANGE's feature selection tool and the Information Gain ranking method. These features, identi-
fied as the most discriminative in previous experi-
ments (Da Corte and Baptista, 2022; Da Corte and
Baptista, 2024a), are summarized in Table 1. They
are grouped into 4 (pattern) clusters: Orthographic
(ORT), Grammatical (GRAMM), Lexical and Seman-
tic (LEXSEM), and Discursive (DISC), each reflect-
ing key aspects of foundational language proficiency.
While most features are classified as negative (-), rep-
resenting deviations from proficiency and identifying
errors, others are positive (+), serving as indicators of
proficiency and advanced language use.
ORT patterns capture basic language structure
through grapheme alterations, punctuation, and the
use of contractions (generally avoided in academic
writing unless otherwise specified). GRAMM pat-
terns measure text quality through verb agreement,
referential pronoun usage, and the omission or addi-
tion of a part of speech, which often interferes with
the meaning of a statement. LEXSEM patterns in-
clude the use of multiword expressions (MWE) and
lexical accuracy. Lastly, DISC patterns highlight
writers’ ability to extend discourse with reasoned ar-
guments and examples. These features were inte-
grated with the previously mentioned NLP-derived
features, automatically extracted from COH-METRIX
(106) and CTAP (328) platforms (the definitions of the COH-METRIX and CTAP features are well-documented in the literature and adopted as cited), and used in the supervised ML experiments detailed in Section 4.
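As a rough illustration of how the three feature sources can be merged into a single per-essay table for these experiments, the sketch below uses pandas; the file names, column counts, and essay_id index are hypothetical placeholders rather than the study's actual export format.

```python
import pandas as pd

# Hypothetical per-essay exports; the real CTAP, COH-METRIX, and Labelbox outputs differ in format.
ctap = pd.read_csv("ctap_features.csv", index_col="essay_id")             # 328 feature columns
cohmetrix = pd.read_csv("cohmetrix_features.csv", index_col="essay_id")   # 106 feature columns
des = pd.read_csv("des_tag_counts.csv", index_col="essay_id")             # 11 manually annotated DES counts
levels = pd.read_csv("levels.csv", index_col="essay_id")                  # target: Level 1 / Level 2

# Keep only essays present in every source and attach the target variable.
features = ctap.join(cohmetrix, how="inner").join(des, how="inner")
dataset = features.join(levels, how="inner")
print(dataset.shape)  # expected: (number of essays, 445 feature columns + 1 target column)
```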
For the annotation of these features on the texts, Labelbox (https://docs.labelbox.com/) was selected as the web annotation tool. Labelbox has previously been tested and identified as a tool that simplifies the data-labeling process in a way that can be used to train Artificial Intelligence (AI) models and enables the creation of high-quality annotated datasets (Colucci Cante et al., 2024).
3.3 Annotation Task Assessment
After manually annotating the corpus, all tags were
meticulously tracked and processed through Label-
box. Krippendorff's Alpha (K-alpha) inter-rater reliability coefficient was computed to evaluate anno-
tation reliability. Given the size of the dataset, these
calculations were performed via Excel formulas. For
the interpretation of K-alpha scores, the following
agreement thresholds and interpretations, set forth
by (Fleiss and Cohen, 1973), were followed:
below 0.20 - slight (Sl);
between 0.21 and 0.39 - fair (F);
between 0.40 and 0.59 - moderate (M);
between 0.60 and 0.79 - substantial (Sb);
above 0.80 - almost perfect (P).
The two annotators agreed on the assessment and
tagging of 79,805 tokens and disagreed on 17,534 to-
kens. The observed proportion of agreement (P_o) was 0.819, while the expected agreement by chance (P_e) was 0.5 (or 50%), given two possible outcomes with equal probability. Using the formula

K-alpha = (P_o - P_e) / (1 - P_e) = (0.819 - 0.5) / (1 - 0.5),
the inter-rater reliability score was calculated as k =
0.640, indicating “substantial agreement”. Consider-
ing the complexity of the task, this level was deemed
adequate. The reasons for disagreeing on the 17,534
tokens will be analyzed in a future study.
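A minimal sketch of this calculation from the token-level agreement counts reported above (standard library only; the simplification P_e = 0.5 for two equally likely outcomes follows the formula given):

```python
# Token-level agreement counts reported above.
agreed, disagreed = 79_805, 17_534
total = agreed + disagreed

p_o = agreed / total              # observed proportion of agreement
p_e = 0.5                         # expected chance agreement (two equally likely outcomes)
k_alpha = (p_o - p_e) / (1 - p_e)

print(f"P_o = {p_o:.3f}, K-alpha = {k_alpha:.3f}")  # roughly 0.82 and 0.64, in line with the reported values
```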
A third annotator, with a similar background to
the other two, was engaged in the task assessment to
facilitate consensus and establish a gold standard (or
reference) annotation. This process resulted in a final
set of 17,342 tags applied and used for analysis. Ta-
ble 2 provides a detailed count of tagged features per
level, maintaining the order of features as presented in
Table 1. The table also highlights the uneven distribu-
tion of tags, with 38% found in texts classified by AC-
CUPLACER as Level 1 and 62% as Level 2. Notably,
this feature distribution offers foundational insights
into the linguistic attributes that typically distinguish
Levels 1 and 2 texts, serving as an initial prototype for
defining the unique characteristics of each level.
Table 1: DevEd-specific (DES) features summary.
Orthographic (ORT): Patterns representing the foundational language skills needed to represent words and phrases.
  (-) Grapheme (addition, omission, transposition, and capitalization) (ORT)
  (-) Word split (WORDSPLIT)
  (-) Punctuation used & Contractions (PUNCT)
Grammatical (GRAMM): Patterns evidencing the quality of text production.
  (-) Word omitted (WORDOMIT)
  (-) Word repetition (WORDREPT)
  (-) Verb agreement (VAGREE)
  (-) Pronoun-alternation referential (ALTERN)
Lexical & Semantic (LEXSEM): Patterns contributing to the structuring of a writer's discourse.
  (-) Word precision (PRECISION)
  (+) Multiword expressions (MWE)
Discursive (DISC): Patterns exhibiting the writer's ability to produce extended discourse.
  (+) Argumentation with reason (REASON)
  (+) Argumentation with example (EXAMPLE)
Table 2: Distribution of DevEd-specific (DES) features across the corpus.
Polarity Feature Level 1 % Level 2 % Dif (L1-L2) Result
- ORT 2,528 38.18% 2,504 23.36% -14.83% Improved
- WORDSPLIT 95 1.43% 72 0.67% -0.76% Almost no improvement
- PUNCT 1,390 20.99% 2,082 19.42% -1.57% Slightly improved
- WORDOMIT 435 6.57% 462 4.31% -2.26% Slightly improved
- WORDREPT 71 1.07% 98 0.91% -0.16% Almost no improvement
- VAGREE 178 2.69% 106 0.99% -1.70% Slightly improved
- ALTERN 37 0.56% 46 0.43% -0.13% Almost no improvement
- PRECISION 599 9.05% 745 6.95% -2.10% Slightly improved
+ MWE 1,038 15.68% 4,101 38.25% 22.57% Improved
+ REASON 200 3.02% 356 3.32% 0.30% Almost no improvement
+ EXAMPLE 50 0.76% 149 1.39% 0.63% Almost no improvement
Total - 6,621 - 10,721 - - -
Overall, Level 1 texts exhibited more frequent in-
stances of foundational issues (PUNCT and ORT) and
demonstrated less cohesion and complexity. In con-
trast, Level 2 texts displayed fewer lower-order errors
and greater exemplification of higher-order features,
such as MWE. As writing develops, texts tend to
incorporate more sophisticated features with correc-
tions focusing on fine-tuning elements that enhance
clarity and refine expression (Nygård and Hundal,
2024). As an illustration, two sample texts represent-
ing Levels 1 and 2 are provided in the Appendix.
The primary improvements noted between texts in
Levels 1 and 2 were a sharp reduction in ORT er-
rors (-14.83%) and a notable increase in the use of
multiword expressions (MWE) (+22.57%), corrobo-
rating previous research findings on the connection
between MWE use and higher proficiency levels (La-
porte, 2018; Kochmar et al., 2020; Pasquer et al.,
2020). Additionally, a slight improvement was ob-
served in the reduced incidence of several negative
features, with proficiency levels improving by ap-
proximately 1.6% to 2.3% for these features. The
distribution of the remaining features shows no sig-
nificant differences, with changes below 1%.
Building on this tagging framework, the ranking of DES features was compared across the original subcorpus of 100 text samples, the subcorpus of 200 added texts, and the full corpus of 300 text samples, as shown in Table 3. The rank-
ing was determined using the Information Gain scor-
ing method available in ORANGE, selected due to its
superior classification accuracy (CA) demonstrated in
a prior study (Da Corte and Baptista, 2024c). Conse-
quently, this method was applied to all classification
tasks described in Section 4.
The ranking of DES features for the 100- and the
200-text samples yielded a “very strong” Spearman
correlation score (Schober et al., 2018; computed with https://www.socscistatistics.com/tests/spearman/) of ρ = 0.927
with a two-tailed p-value of 0, indicating consistency
in feature ranking across studies despite observable
differences. This correlation, however, does not pre-
clude some shifts in the relative importance of certain
features, likely influenced by differences in corpus
size. Note that the content of each sample does not
overlap. Observable changes include higher rankings
for MWE, EXAMPLE (argumentation of key con-
cepts; support of one's position with examples), and
VAGREE (lack of agreement between the subject of
the sentence and the conjugated form of the verb) in
the 200 text samples, while the PRECISION feature (imprecise use of words with respect to their meaning in the sentence) dropped significantly from 4th to 11th place in this subcorpus.
The final ranking with all 300 text samples com-
bined portrays VAGREE, EXAMPLE, and ORT (de-
noting orthographic errors) as the top 3 features with
comparable Information Gain scores. The Spearman rank correlation is also "very high", with ρ = 0.948 for the 100-sample vs. 300-sample comparison (smaller overlap) and ρ = 0.977 for the 200 vs. 300 samples (larger overlap). This indicates a consistent ranking of the features across the subcorpora.
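The rank comparison itself can be reproduced with SciPy's spearmanr; the two rank vectors below are illustrative placeholders for the positions of the 11 DES features in two subcorpora, not the study's exact data.

```python
from scipy.stats import spearmanr

# Illustrative placeholder ranks of the same 11 DES features in two subcorpora.
ranks_subcorpus_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
ranks_subcorpus_b = [2, 1, 3, 5, 4, 6, 8, 7, 10, 9, 11]

rho, p_value = spearmanr(ranks_subcorpus_a, ranks_subcorpus_b)
print(f"Spearman rho = {rho:.3f}, two-tailed p = {p_value:.5f}")
```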
Table 3: Comparison of DES Feature Rankings Using Information Gain Scores for 100, 200, and 300 Texts.
100 Texts 200 Texts 300 Texts
Rank Feature Info. Gain Rank Features Info. Gain Rank Features Info. Gain
1 ORT 0.084 1 MWE 0.260 1 VAGREE 0.131
2 EXAMPLE 0.080 2 EXAMPLE 0.171 2 EXAMPLE 0.124
3 WORDOMIT 0.080 3 ORT 0.140 3 ORT 0.112
4 PRECISION 0.072 4 WORDOMIT 0.126 4 MWE 0.088
5 WORDREPT 0.062 5 VAGREE 0.126 5 WORDOMIT 0.075
6 VAGREE 0.056 6 ALTERN 0.081 6 REASON 0.045
7 MWE 0.050 7 WORDREPT 0.078 7 WORDREPT 0.039
8 REASON 0.039 8 REASON 0.069 8 ALTERN 0.034
9 ALTERN 0.038 9 WORDSPLIT 0.060 9 PRECISION 0.014
10 PUNCT 0.037 10 PUNCT 0.034 10 WORDSPLIT 0.012
11 WORDSPLIT 0.036 11 PRECISION 0.004 11 PUNCT 0.011
3.4 Skill-Level Classification and
Assessment
In addition to assessing the DevEd-specific features
marked in the corpus, all 300 full-text samples were
classified by the same two annotators according to
skill level, using DevEd-level definitions adapted
from the Institution’s course descriptions:
DevEd Level 1: if essays demonstrated a need for
improvement in general English usage, including
grammar, spelling, punctuation, and the structure
of sentences and paragraphs;
DevEd Level 2: if essays required targeted support
in specific aspects of English, such as sentence
structure, punctuation, editing, and revising; or
College Level: if essays were written accurately
and displayed appropriate English usage at the col-
lege academic level.
The two evaluators agreed on the skill level as-
signed to 260 text samples, while they differed on
40. Among the agreed-upon samples, 122 were clas-
sified as Level 1, 134 as Level 2, and 4 as College
level. To measure the level of agreement, the Krippendorff's Alpha (K-alpha) inter-rater reliability coefficient was calculated using the ReCal-OIR tool (Freelon, 2013; https://dfreelon.org/utils/recalfront/recal-oir/) for ordinal data.
According to (Fleiss and Cohen, 1973), a "moderate" level of agreement (0.473) was obtained. While
this score indicates some consistency in the ratings
between the annotators, it also highlights the com-
plexity involved in assessing proficiency skills. This
suggests that incorporating additional descriptive lin-
guistic criteria, such as numerical subscales to quan-
tify misspelled words, beyond the outlined DevEd
level descriptions, may enhance automatic classifica-
tion accuracy.
Given the moderate agreement score, further anal-
ysis was necessary to resolve discrepancies in classi-
fications. For the 40 cases where differences between
human classifications occurred, the average of the lev-
els assigned for each text was calculated, and a final
classification was attributed. This process resulted in
14 cases being classified as Level 1, 22 as Level 2,
and 4 as College level.
With the discrepancies finally resolved, the DevEd
corpus now comprised 136 essays classified as Level
1, 156 as Level 2, and 8 as College-level. To maintain
balance between the levels, random resampling was
conducted for Level 2 to reduce the number of essays
to 136, equal to Level 1. The College-level essays
were discarded at this stage, as this study focuses on
a two-level DevEd classification task.
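A small sketch of this balancing step, assuming the human-rated corpus is held in a pandas DataFrame with a level column (the file name and column name are hypothetical):

```python
import pandas as pd

df = pd.read_csv("human_rated_corpus.csv")   # hypothetical file: one row per essay with a 'level' column

level1 = df[df["level"] == "Level 1"]        # 136 essays
level2 = df[df["level"] == "Level 2"]        # 156 essays; to be downsampled
# College-level essays are discarded for the two-level classification task.

level2_balanced = level2.sample(n=len(level1), random_state=42)                   # random resampling to 136
balanced = pd.concat([level1, level2_balanced]).sample(frac=1, random_state=42)   # shuffle rows
print(balanced["level"].value_counts())      # expected: 136 essays per level
```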
This human-rated assessment adds a level of depth
to the analysis, as annotators, unlike automated sys-
tems like ACCUPLACER, can grasp the subtleties of
language use and the complexities involved in student
writing assessment.
3.4.1 Comparing ACCUPLACER vs. Human
Assessment
The classification performance of the ACCUPLACER
system with human annotators is compared in this
section, providing insights into alignment and dis-
crepancies across subcorpora of 100, 200, and all 300
texts, using the Pearson correlation coefficient (Cohen, 1988).
The correlation between ACCUPLACER and hu-
man classifications is summarized in Table 4.
Table 4: Pearson Correlation Comparison: Human Raters
vs. Accuplacer Across Subcorpora.
Subcorpus Pearson
100 0.242
200 0.366
300 0.305
The Pearson coefficients consistently indicated
“weak” correlation scores for 100, 200, and for the
full corpus. Similarly, the K-alpha coefficient, com-
puted for all 300 texts, yielded a “fair” score of
k = 0.312, confirming ACCUPLACER's limitations in
aligning with human raters, particularly in identifying
nuanced linguistic features and transitions between
proficiency levels.
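A sketch of how both agreement statistics can be computed from two parallel label vectors (levels encoded as 1, 2, and 3 for Level 1, Level 2, and College); the short vectors are illustrative, and the third-party krippendorff package is assumed for the ordinal K-alpha.

```python
from scipy.stats import pearsonr
import krippendorff  # third-party package (pip install krippendorff), assumed available

# Illustrative parallel classifications (1 = Level 1, 2 = Level 2, 3 = College).
human = [1, 1, 2, 2, 1, 2, 3, 1, 2, 2]
accuplacer = [1, 2, 2, 1, 1, 2, 2, 1, 2, 1]

r, p = pearsonr(human, accuplacer)
alpha = krippendorff.alpha(reliability_data=[human, accuplacer],
                           level_of_measurement="ordinal")
print(f"Pearson r = {r:.3f} (p = {p:.3f}), ordinal K-alpha = {alpha:.3f}")
```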
To further illustrate the discrepancies between
ACCUPLACER and human classifications, confusion
matrices (Tables 5, 6, and 7) provide a detailed breakdown
of specific instances where ACCUPLACER encounters
challenges in accurately predicting proficiency levels.
Table 5 shows that ACCUPLACER aligned with hu-
man classifications for 79 texts (accuracy, 79%). All
21 errors occurred when Level 1 texts were misclas-
sified as Level 2. The results suggest that this automatic classification system has some limitations in distinguishing Level 1 from Level 2 and point to the need for improved recognition of Level 1 linguistic features.
Table 5: Accuplacer vs. Human Assessment: 100 Texts.
Accuplacer L1 Accuplacer L2 Sum
Human L1 50 21 71
Human L2 0 29 29
Sum 50 50 100
With the subcorpus of 200 text samples, as per Ta-
ble 6, ACCUPLACER aligned with the human clas-
sification for 157 texts (accuracy, 78.5%), a similar ratio to that of the 100-text subcorpus. How-
ever, the misclassification of 35 Level 2 texts as Level
1 significantly impacts overall accuracy. Addition-
ally, 8 College-level texts were not recognized by Ac-
cuplacer as proficient texts, despite humans identi-
fying linguistic features corresponding to advanced,
college-level writing. These 8 texts were excluded
from both this matrix and the subsequent one.
Table 6: Accuplacer vs. Human Assessment: 200 Texts.
Accuplacer L1 Accuplacer L2 Sum
Human L1 65 0 65
Human L2 35 92 127
Sum 100 92 192
For the 300 text samples, ACCUPLACER aligned
with the human classification for 236 texts (accuracy,
79%). In Table 7, a misclassification of 56 texts (35
Level 2 texts misclassified as Level 1; 21 Level 1 texts
misclassified as Level 2) was noted, indicating AC-
CUPLACER's difficulty in distinguishing these levels. Furthermore, ACCUPLACER missed all 8 College-Level texts classified by humans, suggesting that im-
provements are needed to better identify linguistic
features that signal higher proficiency and transitions
between Levels 1 and 2.
Table 7: Accuplacer vs. Human Assessment: 300 Texts.
Accuplacer L1 Accuplacer L2 Sum
Human L1 115 21 136
Human L2 35 121 156
Sum 150 142 292
These misclassifications have broader implica-
tions beyond accuracy scores. Misplacing approxi-
mately 1 in 5 students (approximately 20%) at a level
above or below their true skill carries significant ped-
agogical consequences. These include not only extra
costs in time and money for both students and institu-
tions but also potential impacts on learning outcomes
and student success.
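The headline figures follow directly from Table 7; a quick check of the arithmetic, using the counts from the table plus the 8 excluded College-level texts:

```python
# Confusion counts from Table 7 (the 8 College-level texts are excluded from the matrix itself).
aligned = 115 + 121            # diagonal: ACCUPLACER agrees with the human raters
misclassified = 21 + 35        # Level 1 rated as Level 2, and Level 2 rated as Level 1
college_missed = 8             # College-level texts not recognized by ACCUPLACER

accuracy_over_300 = aligned / 300
misplacement_rate = (misclassified + college_missed) / 300

print(f"accuracy = {accuracy_over_300:.0%}")                          # roughly 79%
print(f"misplaced = {misplacement_rate:.0%}, about 1 in 5 students")  # roughly 21%
```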
4 MACHINE LEARNING FOR
DevEd PLACEMENT
As mentioned before, this study builds on earlier work
(Da Corte and Baptista, 2024c) by integrating 445
features from three sources to classify text samples
by DevEd level: 106 from COH-METRIX, 328 from
CTAP, and the refined set of 11 DES features man-
ually annotated on the corpus. Supervised ML meth-
ods now use this data in a set of experiments aimed at
(i) identifying the most relevant linguistic features for
classification and (ii) determining the ML algorithm
achieving the highest classification accuracy (CA).
While this represents a classical ML approach to
the classification problem at hand, future research will
expand on these findings by incorporating pre-trained
Large Language Models (LLMs), such as Generative
Pre-trained Transformers (GPT), to assess their abil-
ity to align textual features with refined descriptors
and generate interpretative explanations for classifi-
cation outcomes.
These experiments were set in two scenarios: (i)
using the full corpus with classification levels auto-
matically assigned by ACCUPLACER; and (ii) using
the same corpus with classification levels assigned
by human annotators. In each scenario, experiments
were first conducted with all 445 features, followed
by additional experiments where features were added
incrementally in bundles of 10 until classification re-
sults reached asymptotic results.
The data-mining tool ORANGE was selected for
analysis and modeling for its usability and the diver-
sity of ML tools and algorithms it makes available.
A total of 10 well-known ML algorithms were se-
lected from the set available in ORANGE (listed al-
phabetically): (i) Adaptive Boosting (AdaBoost); (ii)
CN2 Rule Induction (CN2); (iii) Decision Tree (DT); (iv) Gradient Boosting (GB); (v) k-Nearest Neighbors (kNN); (vi) Logistic Regression (LR); (vii) Naïve Bayes (NB); (viii) Neural Network (NN); (ix) Ran-
dom Forest (RF); and (x) Support Vector Machine
(SVM). The default configuration of these learners
was selected.
Figure 1 shows the basic workflow adopted for
this study.
To train and evaluate the models, the Orange
TEST&SCORE widget was employed. Classification
Accuracy (CA) was the primary evaluation metric,
aligned with this study’s goals. In cases where CA
values were identical, Precision (Prec) was used as a
secondary criterion to rank the models. Given the cor-
pus size, a 3-fold cross-validation was implemented
using the DATA SAMPLER widget, allocating 2/3 of
the corpus for training and 1/3 for testing. Addi-
tionally, the RANK widget facilitated an evaluation
of each feature’s discriminative power for the task,
while a Confusion Matrix provided a detailed breakdown of the classification results.
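Outside the ORANGE GUI, an approximation of this Test & Score setup can be scripted; the sketch below uses scikit-learn rather than the widget pipeline itself, with a random placeholder feature matrix standing in for the 445 extracted features.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 445))              # placeholder: 300 essays x 445 features
y = np.repeat(["Level 1", "Level 2"], 150)   # placeholder skill-level labels

# Rough stand-in for the Test & Score widget: 3-fold cross-validation,
# CA as the primary metric and Precision as the tie-breaker.
models = {"NB": GaussianNB(), "SVM": SVC(), "NN": MLPClassifier(max_iter=1000)}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=3, scoring=("accuracy", "precision_macro"))
    print(f"{name}: CA = {scores['test_accuracy'].mean():.3f}, "
          f"Precision = {scores['test_precision_macro'].mean():.3f}")
```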
Previously, a baseline had been established, setting a CA of 0.727 using the Random Forest (RF)
algorithm (Da Corte and Baptista, 2024c). This rel-
atively high baseline reflects RF’s recognized effec-
tiveness in machine learning applications to writing
assessment (Huang, 2023). Although the classifica-
tion accuracy is high, the baseline represents a con-
cerning value. A CA of 0.727 translates to approx-
imately 3 out of 10 students being misclassified in
DevEd following the ACCUPLACER writing assess-
ment. This misclassification rate raises concerns
about the methodology of current automatic place-
ment systems, with potentially far-reaching implica-
tions for student success and the allocation of institu-
tional resources. Addressing these challenges aligns
with the objectives of this study, which seeks to en-
hance placement accuracy.
4.1 Scenario 1: Classification Using
ACCUPLACER-Assigned Skill Levels
For this classification task, text sample units were
classified using all 445 features with the skill-
level classifications automatically assigned by ACCU-
PLACER as the target variable (150 texts Level 1; 150
texts Level 2). The results of this experiment are pre-
sented in Table 8.
Table 8: CA scores, F1, Precision, and Recall for all models with 445 features and ACCUPLACER's classifications.
Model CA F1 Prec Recall
CN2 0.655 0.653 0.655 0.655
AdaBoost 0.675 0.675 0.675 0.675
DT 0.730 0.730 0.731 0.730
RF 0.750 0.748 0.753 0.750
kNN 0.755 0.751 0.764 0.755
NN 0.770 0.768 0.773 0.770
LR 0.780 0.780 0.780 0.780
NB 0.785 0.782 0.794 0.785
SVM 0.795 0.794 0.796 0.795
GB 0.800 0.798 0.806 0.800
When all features were combined, 7 out of the 10
models tested achieved higher CA scores, surpassing
the initial baseline of 72.7% - from 75% with RF up
to 80% with GB (highlighted in bold). This indicates
that the combined features hold significant potential
for improving student placement and ensuring they
receive the necessary support as they begin college.
With this enhanced feature set, at least 8 out of 10
students would now be properly placed, compared to
the previous ratio of 7 out of 10.
To address the question of which features con-
tribute most to improved placement accuracy, the
most discriminative features from Coh-Metrix, CTAP,
and DES were identified using the Information Gain
ranking method. Table 9 provides the topmost 30.
The top-ranked features, overall, were mostly
from CTAP (99%), followed by 2 COH-METRIX features, Word count (number of words), ranking 35th (Information Gain: 0.242), and Sentence count (number of sentences), ranking 88th (Information Gain: 0.162), and one DES feature, VAGREE, ranking 117th (Information Gain: 0.131). The next DES feature ranked
was in 121st place, EXAMPLE (Information Gain:
0.124). Notably, VAGREE had previously ranked 7th
when the experiments were conducted with a smaller
text sample unit size of 100 (Da Corte and Baptista,
2024c).
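As a sketch of the ranking step, scikit-learn's mutual information estimate can stand in for ORANGE's Information Gain scoring (which additionally discretizes continuous features); the data and feature names below are placeholders.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 445))                              # placeholder feature matrix
y = np.repeat([1, 2], 150)                                   # placeholder skill levels
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

# Mutual information as a stand-in for Information Gain; higher scores = more discriminative.
scores = mutual_info_classif(X, y, random_state=0)
ranking = sorted(zip(feature_names, scores), key=lambda item: item[1], reverse=True)
for name, score in ranking[:10]:
    print(f"{name}: {score:.3f}")
```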
All features were then grouped into sets of 10,
and bundles were used in several ML experiments to
identify optimal CA performance before results reached an asymptote. The CA scores for the
top 120 selected features for all 10 ML algorithms are
detailed in Table 10. CA scores in bold indicate the
highest score achieved with different feature combi-
nations, while scores exceeding the previously men-
tioned baseline of 0.727 (in Section 4) are italicized
for each model and bundle. Figure 2 depicts the out-
comes of Experiment 4.1 for a more in-depth evalua-
tion of the results obtained.
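A sketch of the bundle-wise experiment for a single learner (Naïve Bayes here), assuming the columns of the feature matrix have already been sorted by their Information Gain rank; the data is again a random placeholder.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 445))   # placeholder; columns assumed pre-sorted by Information Gain rank
y = np.repeat([1, 2], 150)        # placeholder ACCUPLACER-assigned levels

# Add features in bundles of 10 and track CA until the scores level off.
for n_features in range(10, 130, 10):
    ca = cross_val_score(GaussianNB(), X[:, :n_features], y, cv=3, scoring="accuracy").mean()
    print(f"top {n_features:3d} features: CA = {ca:.3f}")
```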
Most models consistently outperformed the previ-
ous baseline of 0.727 CA, except for the AdaBoost
and CN2 algorithms, which fell below this value in
most instances. The NN model quickly peaked in the
first run, achieving a CA score of 0.800 with only 10
combined features. While its performance slightly de-
clined in subsequent experiments, it consistently re-
mained above 0.727. Similarly, the NB model demon-
strated strong performance with consistent CA scores
of 0.770 or higher, reaching asymptote results after a
total of 30 combined features. This consistent behav-
ior is likely due to NB's well-known computational efficiency and adaptability to text classification tasks (Pajila et al., 2023).
Figure 1: ORANGE workflow setup. The NB, NN, and SVM algorithms are displayed merely as representatives of the chosen
learners.
Table 9: Scenario 1: Combined top-ranked 30 features from COH-METRIX, CTAP, and DES by Information Gain scores.
Rank Source Feature Description Info. gain
1 CTAP Lexical Sophistication: Easy word types (NGSL) 0.341
2 CTAP Number of Tokens with More Than 2 Syllables 0.333
3 CTAP Number of Word Types (including Punctuation and Numbers) 0.331
4 CTAP Number of Word Types (excluding Punctuation and numbers) 0.327
5 CTAP Number of Word Types 0.327
6 CTAP Number of POS Feature: Lexical word Types 0.326
7 CTAP Number Of Letters 0.311
8 CTAP Number of Word Types with More Than 2 Syllables 0.303
9 CTAP Number of POS Feature: Verb Types (including Modals) 0.296
10 CTAP Number of POS Feature: Adjective and Adverb Types 0.288
11 CTAP Lexical Sophistication: Easy verb types (NGSL) 0.275
12 CTAP Number of POS Feature: Preposition Types 0.273
13 CTAP Number of syllables 0.272
14 CTAP Lexical Sophistication: Easy lexical types (NGSL) 0.271
15 CTAP Number of POS Feature: Verb Types (without Modals) 0.271
16 CTAP Lexical Sophistication Feature: SUBTLEX Word Frequency per Million (AW Type) 0.270
17 CTAP Lexical Sophistication: Easy lexical tokens (NGSL) 0.269
18 CTAP Number of POS Feature: Lexical word Lemma Types 0.269
19 CTAP Number of POS Feature: Noun Types 0.268
20 CTAP Number of POS Feature: Noun Lemma Types 0.266
21 CTAP Number of POS Feature: Verb Lemma Types 0.265
22 CTAP Number of POS Feature: Noun Tokens 0.263
23 CTAP Number of Syntactic Constituents: Complex Noun Phrase 0.263
24 CTAP Number of POS Feature: Lexical word Tokens 0.262
25 CTAP Number of POS Feature: Adverb in base form Types 0.262
26 CTAP Number of POS Feature: Adverb Types 0.260
27 CTAP Number of POS Feature: Verb Tokens (without Modals) 0.258
28 CTAP Number of Syntactic Constituents: Declarative Clauses 0.253
29 CTAP Lexical Sophistication: Easy verb tokens (NGSL) 0.251
30 CTAP Number of Unique Words 0.250
Table 10: Scenario 1: CA scores for top 120 features in bundles of 10 across 10 ML models.
Classification Accuracy (CA) Scores
Model 10ft 20ft 30ft 40ft 50ft 60ft 70ft 80ft 90ft 100ft 110ft 120ft
DT 0.700 0.745 0.665 0.655 0.660 0.660 0.735 0.735 0.770 0.760 0.755 0.725
AdaBoost 0.715 0.695 0.640 0.665 0.720 0.730 0.715 0.715 0.765 0.695 0.710 0.705
CN2 0.715 0.680 0.660 0.715 0.695 0.715 0.695 0.695 0.700 0.695 0.695 0.695
GB 0.755 0.745 0.735 0.705 0.710 0.745 0.765 0.760 0.765 0.760 0.780 0.775
NB 0.770 0.790 0.790 0.785 0.785 0.780 0.780 0.785 0.785 0.780 0.780 0.780
RF 0.770 0.745 0.720 0.740 0.760 0.750 0.730 0.775 0.765 0.740 0.745 0.755
LR 0.770 0.780 0.780 0.705 0.755 0.700 0.710 0.700 0.710 0.700 0.715 0.755
kNN 0.775 0.740 0.735 0.765 0.770 0.760 0.745 0.735 0.775 0.770 0.770 0.770
SVM 0.790 0.795 0.780 0.780 0.765 0.770 0.770 0.775 0.770 0.770 0.770 0.775
NN 0.800 0.790 0.760 0.790 0.760 0.760 0.755 0.765 0.750 0.760 0.760 0.740
Figure 2: Scenario 1: Machine-learning algorithms performance based on feature aggregation with Accuplacer classification.
Assessing the Impact of DES Features
To better gauge the impact of DES features on
the overall classification task, the 30 CTAP fea-
tures—where NB reached its asymptote—were com-
bined with all 11 DES features. The resulting CA
scores are shown in Table 11, with scores exceed-
ing NB’s CA of 0.790 highlighted in bold. The final
column compares the CA differences with 30 CTAP
features alone to their combination with DES fea-
tures, where positive values indicate improved accu-
racy, and negative values reveal a hindrance to the
models’ performance by this addition.
Table 11: CA scores: top-ranked 30 plus DES features.
CA Model Performance Comparison
Model w/30ft w/30ft+11 DES Difference
DT 0.665 0.750 +8.5%
AdaBoost 0.640 0.705 +6.5%
CN2 0.660 0.640 -2.0%
GB 0.735 0.760 +2.5%
NB 0.790 0.795 +0.5%
RF 0.720 0.755 +3.5%
LR 0.780 0.785 +0.5%
kNN 0.735 0.735 No difference
SVM 0.780 0.810 +3.0%
NN 0.760 0.805 +4.5%
Most models improved their performance with the
addition of DES features. DT and AdaBoost showed
the most significant gains (+8.5% and +6.5%), while
CN2 experienced a slight decline in accuracy (-2%).
The NB model remained stable with a modest 0.5% gain.
SVM and NN achieved CA scores of 0.810 and 0.805,
improving by 3% and 4.5%, respectively. These find-
ings highlight the significance of DES features as key
indicators of students’ writing abilities in DevEd.
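The comparison in Table 11 amounts to re-running the same cross-validation with and without the DES columns; a hedged sketch with placeholder data and SVM as the example learner:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_top30 = rng.normal(size=(300, 30))   # placeholder: top-ranked 30 CTAP features
X_des = rng.normal(size=(300, 11))     # placeholder: the 11 manually annotated DES features
y = np.repeat([1, 2], 150)             # placeholder ACCUPLACER-assigned levels

ca_base = cross_val_score(SVC(), X_top30, y, cv=3, scoring="accuracy").mean()
ca_combined = cross_val_score(SVC(), np.hstack([X_top30, X_des]), y, cv=3, scoring="accuracy").mean()
print(f"CA without DES: {ca_base:.3f} | CA with DES: {ca_combined:.3f} "
      f"| difference: {(ca_combined - ca_base) * 100:+.1f} pp")
```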
4.2 Scenario 2: Classification Using
Human-Assigned Skill Levels
Lastly, the resampled dataset of 272 text units (136
for Level 1 and 136 for Level 2), balanced by level fol-
lowing human assessment as described in Section 3.4,
was classified using all 445 features, with the human-
assigned skill levels as the target variable. The results
from this experiment are presented in Table 12.
Table 12: CA scores, F1, Precision, and Recall for all mod-
els with 445 features with Human classifications.
Model CA F1 Prec Recall
CN2 0.552 0.553 0.554 0.552
LR 0.558 0.555 0.564 0.558
AdaBoost 0.608 0.608 0.609 0.608
kNN 0.613 0.613 0.615 0.613
NB 0.635 0.631 0.649 0.635
GB 0.635 0.635 0.639 0.635
NN 0.657 0.657 0.662 0.657
DT 0.674 0.674 0.674 0.674
RF 0.674 0.674 0.674 0.674
SVM 0.713 0.711 0.715 0.713
Overall, the ML algorithms did not surpass
the previously established 0.727 baseline CA score.
However, the SVM model was the highest-performing
model, with a score of 0.713. To further understand
the algorithms’ performance, the most discrimina-
tive attributes across COH-METRIX, CTAP, and DES
were identified using the Information Gain ranking
method. The top 30 features are included in Table 13.
When compared to the ranking of the features pre-
sented in Section 4.1, some shifts in feature rank-
ings were noted, reflecting differences in how hu-
man raters and automated systems interpret linguis-
Table 13: Scenario 2: Combined top-ranked 30 features from COH-METRIX, CTAP, and DES by Information Gain scores.
Rank Source Feature Description Info. gain
1 COH-Metrix Paragraph length, number of sentences in a paragraph, mean 0.144
2 COH-Metrix Lexical diversity, VOCD, all words 0.142
3 CTAP Lexical Sophistication: Easy noun tokens (NGSL) 0.124
4 COH-Metrix LSA given/new, sentences, mean 0.124
5 CTAP Lexical Richness: Type Token Ratio (STTR NGSLeasy Nouns) 0.118
6 COH-Metrix Lexical diversity, MTLD, all words 0.115
7 CTAP Number of POS Feature: Plural noun Types 0.109
8 COH-Metrix LSA given/new, sentences, standard deviation 0.109
9 DES EXAMPLE 0.103
10 CTAP POS Density Feature: Existential There 0.102
11 COH-Metrix Positive connectives incidence 0.100
12 COH-Metrix LSA overlap, adjacent sentences, mean 0.100
13 CTAP Number of POS Feature: Existential there Types 0.099
14 CTAP Number of POS Feature: Preposition Types 0.099
15 COH-Metrix WordNet verb overlap 0.098
16 CTAP Number of Syntactic Constituents: Postnominal Noun Modifier 0.097
17 CTAP Number of Word Types (including Punctuation and Numbers) 0.097
18 CTAP Lexical Sophistication Feature: SUBTLEX Logarithmic Word Frequency (AW Type) 0.096
19 CTAP Number of POS Feature: Existential there Tokens 0.096
20 CTAP Lexical Sophistication: Easy noun types (NGSL) 0.094
21 CTAP Number of POS Feature: Verbs in past participle form Types 0.092
22 COH-Metrix LSA verb overlap 0.092
23 CTAP Number of Unique Words 0.092
24 CTAP Number of Tokens with More Than 2 Syllables 0.090
25 CTAP Number of Word Types with More Than 2 Syllables 0.086
26 CTAP Lexical Richness: HDD (excluding punctuation and numbers) 0.086
27 CTAP POS Density Feature: Possessive Ending 0.086
28 CTAP Number of Syntactic Constituents: Complex Noun Phrase 0.084
29 CTAP Number of POS Feature: Plural noun Tokens 0.083
30 CTAP Referential Cohesion: Global Lemma Overlap 0.083
tic patterns. Most of the highest-ranked features de-
rived from CTAP (67.5%), with a much smaller pro-
portion from COH-METRIX (11%). Two DES fea-
tures (5%)—EXAMPLE (Information Gain: 0.103)
and ORT (Information Gain: 0.082)—were ranked
9th and 31st, respectively.
The presence of these two DES features (EXAM-
PLE and ORT) within the top 31 may highlight the
importance of human-derived features in the task,
though a good approximation of the other features
might have been obtained by mechanical methods as
well. These features also seem to enable a more de-
tailed assessment of students’ writing abilities before
entering an academic program, providing valuable in-
sights into their communication effectiveness (Kim
et al., 2017).
To prepare for the next step in this classification
task, all features were also tested by grouping them
in bundles of 10, using their Information Gain scores,
and incrementally adding them to observe asymptotic
trends in the results. Table 14 summarizes the impact
of these feature bundles (10 bundles, 120 features to-
tal) on the classification accuracy (CA) of the ML al-
gorithms used. Bolded CA scores represent the high-
est score achieved by the ML algorithms with differ-
ent feature bundles, while scores exceeding the base-
line of 0.727 are italicized. Figure 3 depicts the out-
comes of Experiment 4.2.
Out of all the algorithms tested, four (CN2, DT,
AdaBoost, and kNN) underperformed relative to the
0.727 baseline, with CN2 and kNN yielding the low-
est results and showing no significant CA improve-
ments as features were added. Conversely, SVM,
RF, LR, and NB performed better, often surpassing
the 0.727 baseline. Notably, the NB model achieved
a CA of 0.779 with just 10 features, though scores
declined slightly as more features were added, re-
maining above 0.727 with the first 30 features. LR
peaked at 0.785 with 40 features, an improvement
of nearly 6% from the baseline, before its accuracy
decreased, on average, by about 12% in subsequent
iterations. SVM maintained consistent performance
above 0.727, in most instances, reaching an asymp-
totic line at 70 features (CA = 0.740).
The enhanced CA score of almost 80% achieved
by LR indicates some improvement: instead of roughly 3 out of 10 students being misclassified, only about 2 face incorrect placement, with one more student now receiving the essential support needed in college.
Assessing the Impact of DES Features
After achieving a peak CA of 0.785 with LR with 40
features, a combination of these top features and the
remaining 9 DES features (2 were already included in
the top 40) was tested to evaluate whether the addition
of more DES features improved accuracy. The result-
ing CA scores are shown in Table 15, with scores of at
least 0.785 in bold. The final column contains the dif-
ference, with positive values indicating an improve-
ment and negative values revealing a decrease in the
models’ accuracy.
In comparison to the first scenario (Section 4.1),
four ML algorithms (DT, GB, LR, and NN) exhibited
Table 14: Scenario 2: CA scores for top 120 features in bundles of 10 across 10 ML models.
Classification Accuracy (CA) Scores
Model 10ft 20ft 30ft 40ft 50ft 60ft 70ft 80ft 90ft 100ft 110ft 120ft
CN2 0.602 0.591 0.569 0.575 0.586 0.558 0.575 0.608 0.602 0.591 0.669 0.669
DT 0.630 0.652 0.663 0.652 0.707 0.691 0.663 0.652 0.641 0.619 0.635 0.635
AdaBoost 0.669 0.613 0.641 0.641 0.608 0.624 0.646 0.635 0.657 0.646 0.624 0.652
GB 0.724 0.691 0.685 0.691 0.685 0.663 0.685 0.674 0.702 0.713 0.702 0.735
NN 0.702 0.729 0.740 0.751 0.713 0.718 0.702 0.729 0.729 0.696 0.707 0.718
kNN 0.702 0.691 0.669 0.619 0.624 0.613 0.613 0.602 0.597 0.613 0.608 0.608
SVM 0.702 0.702 0.718 0.740 0.724 0.735 0.740 0.740 0.746 0.740 0.729 0.740
RF 0.735 0.707 0.702 0.713 0.762 0.691 0.718 0.724 0.691 0.724 0.713 0.735
LR 0.713 0.724 0.713 0.785 0.696 0.680 0.680 0.685 0.702 0.641 0.630 0.630
NB 0.779 0.773 0.751 0.718 0.735 0.718 0.674 0.674 0.657 0.669 0.669 0.663
Figure 3: Scenario 2: Machine-learning algorithms performance based on feature aggregation with human rater classification.
Table 15: CA scores: top-ranked 40 plus remaining 9 DES
features.
CA Model Performance Comparison
Model w/40ft w/40ft+9 DES Difference
DT 0.652 0.635 -1.7%
AdaBoost 0.641 0.663 +2.2%
CN2 0.575 0.591 +1.6%
GB 0.691 0.635 -5.6%
NB 0.718 0.785 +6.7%
RF 0.713 0.751 +3.8%
LR 0.785 0.713 -7.2%
kNN 0.619 0.669 +5%
SVM 0.740 0.746 +0.6%
NN 0.751 0.735 -1.6%
declines in CA scores, with LR experiencing the most
significant drop of 7.2%. Conversely, models like Ad-
aBoost, RF, kNN, SVM, and NN showed moderate
improvements, averaging 2.64%, though their scores
remained below the 0.785 threshold.
Notably, the NB model matched the highest CA
score of 0.785 achieved by LR with the combined 40
features, consistent with its strong performance in the
first scenario. While some models faced declines, in-
cluding DES features continues to yield valuable in-
sights into writing performance. This is further vali-
dated by the consistent performance of NB, a model
recognized in the literature for its effectiveness in fea-
ture selection and pattern detection (Hirokawa, 2018).
5 CONCLUSIONS
This study aimed at refining English writing profi-
ciency assessment and placement processes by lever-
aging a 300-essay corpus analyzed using a total of 445
linguistic features. These features were drawn pre-
dominantly from well-established platforms such as
COH-METRIX and CTAP, but also included a set of
11 DevEd-specific (DES) features devised by human experts and previously tested for this task, further emphasizing the
importance of lexical and syntactic complexity anal-
ysis in placement accuracy (Stavans and Zadunaisky-
Ehrlich, 2024). The 300 full-text samples were classi-
fied automatically by ACCUPLACER and also inde-
pendently by two trained annotators into the follow-
ing skill levels: Levels 1, 2, and College-level. This
dual classification approach provided a comprehen-
sive basis for comparing automated and human eval-
uations.
The study contributes to ongoing efforts in refin-
ing feature selection for text classification, support-
ing the argument that high-quality annotated corpora
are essential for reliable automated assessments (Lee
and Lee, 2023; Götz and Granger, 2024). Using
ORANGE as a data mining tool, top-performing fea-
tures were identified through Information Gain rank-
ings and evaluated for their impact on classification
accuracy (CA) across various machine learning (ML)
models. This testing attempted to replicate the auto-
matic classification process of ACCUPLACER, about
which the literature currently provides limited infor-
mation regarding its methodology and feature selec-
tion process.
DES features such as EXAMPLE and ORT ranked
among the top 40 most discriminative features and of-
fered granular insights into students’ writing abilities
in areas like argumentation and orthographic preci-
sion. They also improved classification accuracy in
several ML models, reinforcing prior work on the role of argumentation, lexical variation, and orthographic precision in proficiency assessment (Kosiewicz et al., 2023; Nygård and Hundal, 2024). Additionally, re-
sults align with research demonstrating the efficacy
of feature selection in enhancing classification out-
comes, as seen in studies utilizing COH-METRIX and
CTAP for multilingual language assessment (Okinina
et al., 2020; Leal et al., 2023; Akef et al., 2023).
The inclusion of human classifications in this
study, validated through inter-rater reliability mea-
sures (Pearson and K-alpha inter-rater reliability co-
efficient), provided a critical benchmark for evaluat-
ing the efficacy of both the automated ACCUPLACER
system and the machine learning models, showcasing
an alternative approach to reducing misclassification
rates (Hughes and Li, 2019; Link and Koltovskaia,
2023). Unlike ACCUPLACER, which operates as a
“black-box” system with limited transparency, human
raters have the ability to capture nuanced linguistic
features and transitions between proficiency levels.
By combining these approaches, this study demon-
strated improvements in CA, particularly with models
like NB, which consistently matched or outperformed
the baseline accuracy of 0.727 in both classification
scenarios. These findings align with prior research
on the effectiveness of ML in educational applications
(Filighera et al., 2019).
The refined feature set and methodology presented
in this study are a critical step in the broader process of
advancing equity in educational outcomes by (i) im-
proving student placement, (ii) reducing misclassifi-
cation, and (iii) supporting targeted instructional in-
terventions. Future efforts will continue to refine and
validate these approaches, ensuring they align with
the complex needs of DevEd in higher education.
ACKNOWLEDGMENTS
This work was supported by Portuguese national
funds through FCT (Reference: UIDB/50021/2020,
DOI: 10.54499/UIDB/50021/2020) and by the
European Commission (Project: iRead4Skills,
Grant number: 1010094837, Topic: HORIZON-
CL2-2022-TRANSFORMATIONS-01-07, DOI:
10.3030/101094837).
We also extend our profound gratitude to the ded-
icated annotators who participated in this task and the
IT team whose expertise made the systematic analysis
of the linguistic features presented in this paper possi-
ble. Their meticulous work and innovative approach
have been instrumental in advancing our research.
REFERENCES
Akef, S., Mendes, A., Meurers, D., and Rebuschat, P.
(2023). Linguistic complexity features for auto-
matic Portuguese readability assessment. In XXXIX
Encontro Nacional da Associação Portuguesa de Linguística, Covilhã, Portugal, October 26–28, 2023, Proceedings 14, pages 103–109. Associação Portuguesa de Linguística.
Bickerstaff, S., Beal, K., Raufman, J., Lewy, E. B., and
Slaughter, A. (2022). Five principles for reforming
Developmental Education: A review of the evidence.
Center for the Analysis of Postsecondary Readiness,
pages 1–8.
Chen, X. and Meurers, D. (2016). CTAP: A Web-Based
Tool Supporting Automatic Complexity Analysis. In
Proceedings of the Workshop on Computational Lin-
guistics for Linguistic Complexity (CL4LC), pages
113–119, Osaka, Japan. The COLING 2016 Organiz-
ing Committee.
Cohen, J. (1988). Statistical power analysis. Hillsdale, NJ:
Erlbaum.
Colucci Cante, L., D’Angelo, S., Di Martino, B., and
Graziano, M. (2024). Text annotation tools: A com-
prehensive review and comparative analysis. In In-
ternational Conference on Complex, Intelligent, and
Software Intensive Systems, pages 353–362. Springer.
Da Corte, M. and Baptista, J. (2022). A phraseology ap-
proach in developmental education placement. In Pro-
ceedings of Computational and Corpus-based Phrase-
ology, EUROPHRAS 2022, Malaga, Spain, pages 79–
86.
Da Corte, M. and Baptista, J. (2024a). Charting the linguis-
tic landscape of developing writers: An annotation
scheme for enhancing native language proficiency. In
Calzolari, N., Kan, M.-Y., Hoste, V., Lenci, A., Sakti,
S., and Xue, N., editors, Proceedings of the 2024
Joint International Conference on Computational Lin-
guistics, Language Resources and Evaluation (LREC-
COLING 2024), pages 3046–3056, Torino, Italia.
ELRA and ICCL.
Da Corte, M. and Baptista, J. (2024b). Enhancing writ-
ing proficiency classification in developmental educa-
tion: The quest for accuracy. In Calzolari, N., Kan,
M.-Y., Hoste, V., Lenci, A., Sakti, S., and Xue, N.,
editors, Proceedings of the 2024 Joint International
Conference on Computational Linguistics, Language
Resources and Evaluation (LREC-COLING 2024),
pages 6134–6143, Torino, Italia. ELRA and ICCL.
Da Corte, M. and Baptista, J. (2024c). Leveraging NLP
and machine learning for English (l1) writing assess-
ment in developmental education. In Proceedings of
the 16th International Conference on Computer Sup-
ported Education (CSEDU 2024), 2-4 May, 2024,
Angers, France, volume 2, pages 128–140.
Da Corte, M. and Baptista, J. (2025). Toward consistency
in writing proficiency assessment: Mitigating classifi-
cation variability in developmental education. In Pro-
ceedings of CSEDU 2025, Porto, Portugal. (to ap-
pear).
Demšar, J., Curk, T., Erjavec, A., Črt Gorup, Hočevar, T., Milutinovič, M., Možina, M., Polajnar, M., Toplak, M., Starič, A., Štajdohar, M., Umek, L., Žagar, L., Žbontar, J., Žitnik, M., and Zupan, B. (2013). Orange:
Data Mining Toolbox in Python. Journal of Machine
Learning Research, 14:2349–2353.
Duch, D., May, M., and George, S. (2024). Empower-
ing students: A reflective learning analytics approach
to enhance academic performance. In 16th Inter-
national Conference on Computer Supported Educa-
tion (CSEDU 2024), pages 385–396. SCITEPRESS-
Science and Technology Publications.
Edgecombe, N. and Weiss, M. (2024). Promoting equity
in Developmental Education reform: A conversation
with Nikki Edgecombe and Michael Weiss. Center
for the Analysis of Postsecondary Readiness, page 1.
Feller, D. P., Sabatini, J., and Magliano, J. P. (2024). Dif-
ferentiating less-prepared from more-prepared college
readers. Discourse Processes, pages 1–23.
Filighera, A., Steuer, T., and Rensing, C. (2019). Automatic
text difficulty estimation using embeddings and neu-
ral networks. In Transforming Learning with Mean-
ingful Technologies: 14th European Conference on
Technology Enhanced Learning, EC-TEL 2019, Delft,
The Netherlands, September 16–19, 2019, Proceed-
ings 14, pages 335–348. Springer.
Fleiss, J. L. and Cohen, J. (1973). The equivalence of
weighted Kappa and the intraclass correlation coeffi-
cient as measures of reliability. Educational and psy-
chological measurement, 33(3):613–619.
Freelon, D. (2013). ReCal OIR: ordinal, interval, and ratio
intercoder reliability as a web service. International
Journal of Internet Science, 8(1):10–16.
Ganga, E. and Mazzariello, A. (2019). Modernizing college
course placement by using multiple measures. Educa-
tion Commission of the States, pages 1–9.
Giordano, J. B., Hassel, H., Heinert, J., and Phillips, C.
(2024). Reaching All Writers: A Pedagogical Guide
for Evolving College Writing Classrooms, chapter
Chapter 2, pages 24–62. University Press of Colorado.
Götz, S. and Granger, S. (2024). Learner corpus research
for pedagogical purposes: An overview and some re-
search perspectives. International Journal of Learner
Corpus Research, 10(1):1–38.
Hirokawa, S. (2018). Key attribute for predicting student
academic performance. In Proceedings of the 10th In-
ternational Conference on Education Technology and
Computers, pages 308–313.
Huang, Z. (2023). An intelligent scoring system for En-
glish writing based on artificial intelligence and ma-
chine learning. International Journal of System As-
surance Engineering and Management, pages 1–8.
Hughes, S. and Li, R. (2019). Affordances and limitations
of the ACCUPLACER automated writing placement
tool. Assessing Writing, 41:72–75.
Johnson, M. S., Liu, X., and McCaffrey, D. F. (2022). Psy-
chometric methods to evaluate measurement and al-
gorithmic bias in automated scoring. Journal of Edu-
cational Measurement, 59(3):338–361.
Kim, Y.-S. G., Schatschneider, C., Wanzek, J., Gatlin, B.,
and Al Otaiba, S. (2017). Writing evaluation: Rater
and task effects on the reliability of writing scores
for children in grades 3 and 4. Reading and writing,
30:1287–1310.
Kochmar, E., Gooding, S., and Shardlow, M. (2020). De-
tecting multiword expression type helps lexical com-
plexity assessment. arXiv preprint arXiv:2005.05692.
Kosiewicz, H., Morales, C., and Cortes, K. E. (2023). The
“missing English learner” in higher education: How
identification, assessment, and placement shape the
educational outcomes of English learners in commu-
nity colleges. In Higher Education: Handbook of The-
ory and Research: Volume 39, pages 1–55. Springer.
Kyle, K., Crossley, S. A., and Verspoor, M. (2021). Mea-
suring longitudinal writing development using indices
of syntactic complexity and sophistication. Studies in
Second Language Acquisition, 43(4):781–812.
Laporte, E. (2018). Choosing features for classifying mul-
tiword expressions. In Sailer, M. and Markantonatou,
S., editors, Multiword expressions: Insights from a
multi-lingual perspective, pages 143–186. Language
Science Press, Berlin.
Leal, S. E., Duran, M. S., Scarton, C. E., Hartmann, N. S.,
and Aluísio, S. M. (2023). NILC-Metrix: assess-
ing the complexity of written and spoken language in
Brazilian Portuguese. Language Resources and Eval-
uation, pages 1–38.
Lee, B. W. and Lee, J. H.-J. (2023). Prompt-based learn-
ing for text readability assessment. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pages 1–19, Toronto, Canada. Association for Computational Linguistics.
Link, S. and Koltovskaia, S. (2023). Automated Scoring of
Writing, pages 333–345. Springer International Pub-
lishing, Cham.
Ma, H., Wang, J., and He, L. (2023). Linguistic features
distinguishing students’ writing ability aligned with
CEFR levels. Applied Linguistics, 45(4):637–657.
McNamara, D. S., Ozuru, Y., Graesser, A. C., and Louw-
erse, M. (2006). Validating Coh-Metrix. In Proceed-
ings of the 28th annual Conference of the Cognitive
Science Society, pages 573–578.
Nygård, M. and Hundal, A. K. (2024). Features of gram-
matical writing competence among early writers in a
Norwegian school context. Languages, 9(1):29.
Okinina, N., Frey, J.-C., and Weiss, Z. (2020). CTAP for
Italian: Integrating components for the analysis of
Italian into a multilingual linguistic complexity anal-
ysis tool. In Proceedings of the 12th Conference
on Language Resources and Evaluation (LREC 2020), pages 7123–7131.
Pajila, P., Sheena, B., Gayathri, A., Aswini, J., Nalini, M.,
and R, S. S. (2023). A comprehensive survey on
Naïve Bayes algorithm: Advantages, limitations and
applications. In 2023 4th International Conference
on Smart Electronics and Communication (ICOSEC),
pages 1228–1234.
Pal, A. K. and Pal, S. (2013). Classification model of
prediction for placement of students. International
Journal of Modern Education and Computer Science,
5(11):49.
Pasquer, C., Savary, A., Ramisch, C., and Antoine, J.-Y.
(2020). Verbal multiword expression identification:
Do we need a sledgehammer to crack a nut? In
Proceedings of the 28th International Conference on
Computational Linguistics, pages 3333–3345.
Roscoe, R. D., Crossley, S. A., Snow, E. L., Varner, L. K.,
and McNamara, D. S. (2014). Writing quality, knowl-
edge, and comprehension correlates of human and au-
tomated essay scoring. The Twenty-Seventh Interna-
tional Florida Artificial Intelligence Research Society
Conference, pages 393–398.
Schober, P., Boer, C., and Schwarte, L. A. (2018). Corre-
lation coefficients: appropriate use and interpretation.
Anesthesia & Analgesia, 126(5):1763–1768.
Stavans, A. and Zadunaisky-Ehrlich, S. (2024). Text struc-
ture as an indicator of the writing development of de-
scriptive text quality. Journal of Writing Research,
15(3):463–496.
Vajjala, S. (2022). Trends, limitations and open challenges
in automatic readability assessment research. In Cal-
zolari, N., Béchet, F., Blache, P., Choukri, K., Cieri,
C., Declerck, T., Goggi, S., Isahara, H., Maegaard,
B., Mariani, J., Mazo, H., Odijk, J., and Piperidis,
S., editors, Proceedings of the Thirteenth Language
Resources and Evaluation Conference, pages 5366–
5377, Marseille, France. European Language Re-
sources Association.
Vajjala, S. and Lučić, I. (2018). OneStopEnglish corpus:
A new corpus for automatic readability assessment
and text simplification. In Tetreault, J., Burstein, J.,
Kochmar, E., Leacock, C., and Yannakoudakis, H.,
editors, Proceedings of the Thirteenth Workshop on
Innovative Use of NLP for Building Educational Ap-
plications, pages 297–304, New Orleans, Louisiana.
Association for Computational Linguistics.
Vajjala, S. and Meurers, D. (2012). On improving the ac-
curacy of readability classification using insights from
second language acquisition. In Proceedings of the 7th
Workshop on Building Educational Applications using
NLP, pages 163–173, Montréal, Canada. Association
for Computational Linguistics.
Wilkens, R., Alfter, D., Wang, X., Pintard, A., Tack, A.,
Yancey, K. P., and François, T. (2022). Fabra: French
aggregator-based readability assessment toolkit. In
Proceedings of the Thirteenth Language Resources
and Evaluation Conference, pages 1217–1233.
APPENDIX
Text ID: 66, Classified as Level 1 by both raters.
Text ID: 57, Classified as Level 2 by both raters.