Analysing the Impact of Images and Text for Predicting Human
Creativity Through Encoders

Amaia Pikatza-Huerga 1,a, Pablo Matanzas de Luis 1,b, Miguel Fernandez-De-retana Uribe 1,c,
Javier Peña Lasa 2,d, Unai Zulaika 1,e and Aitor Almeida 1,f

1 Faculty of Engineering, University of Deusto, Unibertsitate Etorb., 24, Bilbao, Spain
2 Faculty of Health Science, University of Deusto, Unibertsitate Etorb., 24, Bilbao, Spain
{a.pikatza, javier.pena, unai.zulaika, aitor.almeida}@deusto.es

a https://orcid.org/0009-0003-9080-6242
b https://orcid.org/0009-0009-8897-5796
c https://orcid.org/0009-0002-0883-1303
d https://orcid.org/0000-0002-0041-7020
e https://orcid.org/0000-0002-7366-9579
f https://orcid.org/0000-0002-1585-4717
Keywords:
Machine Learning, Creativity Assessment, Originality Evaluation, Artistic Expression, Text and Image
Analysis, EEG.
Abstract:
This study explores the application of multimodal machine learning techniques to evaluate the originality and
complexity of drawings. Traditional approaches in creativity assessment have primarily focused on visual
analysis, often neglecting the potential insights derived from accompanying textual descriptions. The research
assesses four target features: drawings’ originality, flexibility and elaboration level, and titles’ creativity, all
labelled by expert psychologists. The research compares different image encoding and text embeddings to
examine the effectiveness and impact of individual and combined modalities. The results indicate that incor-
porating textual information enhances the predictive accuracy for all features, suggesting that text provides
valuable contextual insights that images alone may overlook. This work demonstrates the importance of a
multimodal approach in creativity assessment, paving the way for more comprehensive and nuanced evalua-
tions of artistic expression.
1 INTRODUCTION
The assessment of creativity is a dynamic field where
artificial intelligence (AI) opens possibilities to en-
hance the objectivity, scalability, and depth of eval-
uations across various tasks. Traditionally, creativity
assessments, such as the Alternate Uses Task (AUT)
in verbal creativity and drawing-based tasks in visual
domains, have relied heavily on human judgment, fac-
ing challenges in consistency and efficiency. AI, how-
ever, introduces data-driven methods to quantify cre-
ativity aspects like originality and flexibility with ob-
jective precision. For instance, platforms like SemDis
(Beaty and Johnson, 2020) use natural language pro-
cessing to measure semantic distance, automating the
scoring of verbal creativity and reducing the subjec-
tivity and labor intensity of manual evaluations (Allen
et al., 2015; Shaban-Nejad et al., 2022). Such ad-
vances lay a foundation for reliable, large-scale cre-
ativity assessments, making comprehensive analysis
feasible (Stojnic et al., 2022).
In visual creativity assessment, a specialized
area focuses on drawing completion tests commonly
used in psychology to explore aspects of person-
ality, emotional state, and cognitive style. These
tests, like the Thematic Apperception Test (TAT)
and Draw-A-Person Test (DAP), ask participants to
complete drawings, revealing deeper psychological
traits. While valuable, these assessments have tra-
ditionally depended on subjective evaluations, limit-
ing consistency and scalability. AI enhances these
assessments by analysing graphic features—such as
line quality, shape complexity, and spatial arrange-
ment—objectively, improving reliability and identi-
fying subtle patterns that might be missed by human
evaluators (Liu et al., 2020; Wang et al., 2023; Tan
et al., 2023; Gado et al., 2021).
AI-based approaches in figural creativity have
shown particular promise. Convolutional neural net-
works (CNNs) (O’Shea and Nash, 2015) have been
applied to measure originality in drawing tasks, align-
ing closely with expert ratings and reducing both
time and costs while ensuring consistency (Cropley
and Marrone, 2022; Kvam et al., 2023). Similarly,
platforms like AuDrA (Patterson et al., 2023) utilize
modified ResNet (He et al., 2015) architectures to
score features like elaboration and divergent thinking,
achieving correlations with human evaluations and
highlighting AI’s potential to standardize creativity
assessments (Easton et al., 2019; Davis et al., 2022).
This trend reflects the broader integration of AI in ed-
ucation, where it increasingly supports learning and
assessment practices (Pezzulo et al., 2023).
Building on these advances, recent work has ex-
plored supervised learning techniques, such as Vision
Transformers and Random Forest classifiers, to auto-
mate scoring in tasks like the Torrance Tests of Cre-
ative Thinking-Figural (TTCT-F) (Acar et al., 2024).
The present work extends this approach by integrating textual data,
including titles or descriptions accompanying draw-
ings, to enable a multimodal assessment of creativity.
This combined analysis of visual and textual data al-
lows AI to capture nuanced aspects—particularly in
originality and flexibility—that may be overlooked by
image-only models (Weidinger et al., 2022). As hu-
man creativity often involves both visual and verbal
expression, this multimodal approach is essential for
more comprehensive evaluation (Bahcecik, 2023).
AI’s applications in assessing emotional content
within drawings also show promise. Using senti-
ment analysis, AI can detect emotional cues in hu-
man figure drawings, traditionally used to evaluate
emotional well-being and intelligence. This ability
enables faster, more precise emotional assessments,
aligning with the increasing recognition of emotional
intelligence in psychological evaluation (Imuta et al.,
2013; Røed et al., 2023; Devedzic, 2020).
Beyond assessment, AI holds potential for thera-
peutic applications. By tracking changes in patient
drawings over time, AI helps clinicians monitor emo-
tional and cognitive progress throughout therapy, fa-
cilitating personalized interventions that leverage the
creative process as a therapeutic tool (Zhang et al.,
2024; Lee et al., 2015). AI’s role in these settings not
only enhances therapeutic effectiveness but also un-
derscores the healing potential of creativity (Searle,
2018).
AI’s capacity to compare drawings against norma-
tive data further enhances its diagnostic capabilities,
helping detect psychological conditions early by iden-
tifying deviations from typical profiles (Sheng et al.,
2019; Ferrara and Qunbar, 2022). Additionally, by
assessing cognitive styles in drawing tasks, AI of-
fers insights into thought processes and personality
traits, advancing diagnostic accuracy and deepening
our understanding of individual differences in creativ-
ity (Gigi, 2015; Creely and Blannin, 2023; Cetinic
and She, 2021).
This research advances these developments by in-
troducing a novel tool that combines visual and tex-
tual data for a more thorough creativity assessment
in drawing tasks. The influence of visual and textual
data has been studied separately, and by integrating
titles or descriptions often accompanying drawings,
this approach captures added layers of creativity, es-
pecially in originality and flexibility, where text can
enrich insights beyond what image-based models of-
fer. The model, trained on expert evaluations, aligns
closely with human expertise while minimizing sub-
jective inconsistencies, allowing for scalable, precise
assessments across diverse psychological tests (Harré
and El-Tarifi, 2023).
2 METHODS
2.1 Data
The drawings in this dataset were collected as part of
a study aimed at investigating the effects of intracra-
nial stimulation on creativity. In this original study,
53 participants were asked to create drawings in two
phases: one before receiving intracranial stimulation
(pre-stimulation phase) and another after the stimula-
tion (post-stimulation phase). For each drawing, par-
ticipants were also asked to provide a title or a brief
description that reflected their interpretation or con-
cept of the image.
The primary goal of that study was to explore
whether intracranial stimulation could influence the
originality of the participants’ drawings. The dataset
used in this research consists of these drawings along
with their corresponding titles, both from the pre-
stimulation and post-stimulation phases. These im-
ages were collected and made available for further
analysis, and they serve as the foundation for the cur-
rent study.
The dataset used consists of 486 samples, includ-
ing numerical and textual data related to scanned im-
ages of drawings. Each image is assigned labels
corresponding to various features, including the ti-
tle given by the participant to their drawing, which
is used for classification with a deep learning model.
The study targets four numerical variables: O, FLE,
E, and T, which represent different aspects of creativ-
ity that the model seeks to predict. Each entry in the
dataset contains the following fields:
IMAGE: A scanned image of one of the drawings
completed by the participant.
TEXT: Title assigned to the drawing by the par-
ticipant (in Spanish).
O: Label given by an expert psychologist indicat-
ing whether the drawing demonstrates creativity,
expressed as 0 (not original) or 1 (original).
FLE: Label provided by an expert psychologist
that assesses the participant’s flexibility in draw-
ing. Each drawing is assigned a numerical cat-
egory related to its theme (people, landscapes,
etc.), and flexibility is calculated based on the
number of different categories represented.
E: Label assigned by an expert psychologist that
measures the level of elaboration of the drawing.
T: Label provided by an expert psychologist indi-
cating whether the title given to the drawing is
creative, expressed, like O, as 0 (not creative) or
1 (creative).
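As a minimal illustration of this structure, one entry could be represented as follows (the field names mirror the list above; the concrete types, image size and example values are assumptions for illustration, not the exact format used in the study):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DrawingSample:
    """One dataset entry; fields mirror Section 2.1, types and shapes are illustrative."""
    image: np.ndarray  # scanned drawing, e.g. an RGB array resized for the CNN encoder
    text: str          # title assigned by the participant (in Spanish)
    o: int             # originality of the drawing: 0 (not original) or 1 (original)
    fle: int           # flexibility label derived from the thematic categories represented
    e: float           # elaboration level assigned by the expert psychologist
    t: int             # title originality: 0 (not creative) or 1 (creative)

# Hypothetical entry with invented values, for illustration only.
sample = DrawingSample(image=np.zeros((224, 224, 3), dtype=np.float32),
                       text="Un bosque que sueña", o=1, fle=3, e=4.0, t=0)
```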
2.2 Participants
In total, 53 participants contributed to the creation of
the dataset. Demographic and personal characteristics
of the participants include age, gender, educational
level, mother tongue and certain habits, such as stim-
ulant and tobacco use, as well as number of hours of
sleep.
The main features of the participant dataset are:
Gender: Gender of the participant, recorded as
‘M’ (male) or ‘F’ (female). The distribution was
balanced, with 49.1% men and 50.9% women.
Age: Age in years of participants, ranging from
10 to 60 years.
Mother tongue: The majority of participants have
Spanish as their mother tongue.
Education: Educational level ranges from com-
pulsory secondary education to postgraduate stud-
ies.
Sleeping hours: Sleeping hours were recorded the
night before the experimental session.
Stimulants and Tobacco: Participants reported on
the consumption of stimulants (e.g. coffee) and
tobacco before the sessions.
Observations: Additional notes on participants,
such as medical or behavioural observations dur-
ing the study.
2.3 Model Building
In this study, we aim to predict four creativity-related
variables, O (originality), E (elaboration), FLE
(flexibility), and T (title originality), using mul-
timodal data that combines visual information (im-
ages of the drawings) with textual information (ti-
tles assigned to the drawings by the participants). To
achieve this, deep learning models were employed to
analyze both the visual features of the drawings and
the semantic features of the titles. The objective is to
evaluate the performance of these models in predict-
ing the mentioned variables and to explore to what
extent each data modality (image or text) contributes
to the model’s accuracy.
To obtain a more detailed understanding, all possi-
ble combinations of text and image models, which are
described in the following subsections, were tested.
Additionally, experiments were conducted using only
images and only text to predict each of the four cre-
ativity variables separately, allowing us to assess how
much information each modality contributes indepen-
dently.
The visual features of the drawings were pro-
cessed using convolutional neural networks (CNN) as
encoders to create an image embedding, while the ti-
tles assigned to the drawings by the participants were
processed using different text embedding models. For
models that utilized both text and image data, the final
layer before the output of each model was concate-
nated with the other modality, allowing both sources
of information to be combined effectively.
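A minimal Keras sketch of this late-fusion scheme is shown below. It is not the authors' exact architecture: the input sizes, title-token length, vocabulary size, hidden width and the choice of ResNet50 with a plain Embedding layer are assumptions used only to illustrate how the image and text branches are concatenated before the output layer (here for the binary O target).

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import ResNet50

# Image branch: a pretrained CNN used as an encoder (ResNet50 is one of the four options).
image_in = layers.Input(shape=(224, 224, 3), name="image")
cnn = ResNet50(include_top=False, weights="imagenet", pooling="avg")
image_feat = cnn(image_in)                                    # fixed-length image embedding

# Text branch: a plain Keras Embedding layer stands in for BETO/FastText in this sketch.
text_in = layers.Input(shape=(20,), name="title_tokens")      # hypothetical max title length
text_feat = layers.GlobalAveragePooling1D()(
    layers.Embedding(input_dim=5000, output_dim=64)(text_in))

# Fusion: concatenate the last layer of each branch before the output layer.
fused = layers.Concatenate()([image_feat, text_feat])
hidden = layers.Dense(128, activation="relu")(fused)
out = layers.Dense(1, activation="sigmoid", name="originality")(hidden)  # binary O target

model = Model(inputs=[image_in, text_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC", "accuracy"])
```

For the continuous E target the output layer would instead be a single linear unit trained with MSE, and for the categorical FLE target a softmax layer over the thematic categories.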
2.3.1 Image-Based Models
ResNet50: A deep network with 50 layers widely
used in image classification tasks due to its ability
to handle degradation problems in deep networks.
(He et al., 2015)
InceptionV3: A modular network design model
that efficiently uses computational resources, im-
proving image analysis accuracy. (Szegedy et al.,
2015)
EfficientNetB0: This model optimizes both net-
work size and accuracy, offering a balance be-
tween computational performance and feature ex-
traction capacity. (Tan and Le, 2020)
Xception: Based on depthwise separable convo-
lutions, this model excels in image classification
tasks, improving accuracy without significantly
increasing computational cost. (Chollet, 2016)
The ResNet50, InceptionV3, EfficientNetB0 and
Xception architectures were chosen due to their rele-
vance and diversity in CNN design strategies. These
architectures have consistently demonstrated high
performance in image classification tasks, such as
those in ImageNet benchmarks, and represent key ap-
proaches in the evolution of CNNs. ResNet50 in-
corporates residual connections that enable the train-
ing of deep networks; InceptionV3 optimizes com-
putational efficiency with convolutions of varying
sizes; EfficientNetB0 introduces compound scaling to
balance accuracy and efficiency; and Xception uti-
lizes depthwise separable convolutions, achieving im-
proved accuracy with reduced computational cost.
This selection ensures a representative and diverse
analysis of visual encoding capabilities in the context
of the problem studied.
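For reference, the four encoders can be instantiated as fixed-length feature extractors along the following lines; the input resolutions, ImageNet weights and global-average pooling are assumptions about a typical setup rather than confirmed details of this study.

```python
from tensorflow.keras.applications import (ResNet50, InceptionV3,
                                           EfficientNetB0, Xception)

# The four image encoders compared in this work, each producing one embedding per drawing.
# Input sizes are the common defaults for each architecture (an assumption).
ENCODERS = {
    "resnet50":       (ResNet50,       (224, 224, 3)),
    "inceptionv3":    (InceptionV3,    (299, 299, 3)),
    "efficientnetb0": (EfficientNetB0, (224, 224, 3)),
    "xception":       (Xception,       (299, 299, 3)),
}

def build_image_encoder(name: str):
    """Return a headless CNN whose global-average-pooled output is the image embedding."""
    cls, input_shape = ENCODERS[name]
    return cls(include_top=False, weights="imagenet",
               pooling="avg", input_shape=input_shape)
```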
2.3.2 Text Embedding Models
BETO (Cañete et al., 2020): A model based on
the Transformer architecture that provides contex-
tualized representation of words in the titles, cap-
turing both local and global meaning of the text.
FastText (Joulin et al., 2016): This model gener-
ates word embeddings that include morphological
information, which is particularly useful for short
titles or unknown words.
Keras Embedding layer: A simpler model that
enables efficient text representation using dense
layers, suitable for fast and efficient classification
tasks.
The BETO, FastText, and Keras Embedding layer
models were selected to encompass diverse strate-
gies in semantic text representation. BETO is a
Transformer-based model pre-trained specifically in
Spanish, making it ideal for capturing linguistic nu-
ances in the analyzed titles. FastText generates em-
beddings based on subword information, allowing it
to handle out-of-vocabulary words and morphological
features, which are particularly useful for short titles.
Finally, the Keras Embedding layer offers an efficient
and flexible approach for dense text representation in
classification tasks. Combining these approaches pro-
vides a comprehensive and complementary analysis
of the semantic characteristics of the titles within the
context of creativity.
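As an illustration of the text side, the sketch below obtains a single title embedding from BETO with the Hugging Face transformers library; the dccuchile checkpoint name and the mean pooling over the last hidden states are assumptions about a typical setup, not necessarily the exact procedure used here.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# BETO (Spanish BERT) used as a title encoder.
BETO_CKPT = "dccuchile/bert-base-spanish-wwm-cased"  # public BETO checkpoint
tokenizer = AutoTokenizer.from_pretrained(BETO_CKPT)
beto = AutoModel.from_pretrained(BETO_CKPT)

def title_embedding(title: str) -> torch.Tensor:
    """Mean-pool BETO's last hidden states into one 768-dimensional title vector."""
    inputs = tokenizer(title, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = beto(**inputs).last_hidden_state   # shape (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)            # shape (768,)

vec = title_embedding("Un paisaje imposible")       # invented Spanish title, for illustration
```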
2.4 Model Evaluation
To assess the models’ performance in predicting
creativity-related variables, we employ distinct met-
rics tailored to classification (binary and multiclass)
and regression tasks.
2.4.1 Classification Metrics
For both binary and multiclass tasks, the following
metrics provide a comprehensive view of classifica-
tion performance:
ROC AUC: Evaluates the model’s ability to dis-
tinguish between classes, using:
Binary AUC: Direct comparison between two
classes.
One-vs-Rest AUC (multiclass): Calculates
AUC for each class, revealing overall discrimi-
nation ability.
Accuracy: Measures the proportion of correctly
predicted labels across all classes.
Recall (Sensitivity): Proportion of actual posi-
tive instances correctly identified, highlighting the
model’s capability in capturing positive cases.
Precision: Accuracy of predicted instances per
class, showing how well each class is identified.
Specificity: Proportion of true negatives correctly
identified, useful for understanding false positive
avoidance.
F1 Score: Harmonic mean of Precision and Re-
call, balancing performance in cases of class im-
balance.
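For concreteness, a sketch of how these classification metrics can be computed with scikit-learn is given below; it assumes NumPy arrays of true labels and predicted probabilities for a binary target such as O or T, and a hypothetical 0.5 decision threshold.

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, accuracy_score, recall_score,
                             precision_score, f1_score, confusion_matrix)

def binary_classification_report(y_true: np.ndarray, y_prob: np.ndarray,
                                 threshold: float = 0.5) -> dict:
    """Metrics from Section 2.4.1 for a binary target such as O or T."""
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "roc_auc":     roc_auc_score(y_true, y_prob),
        "accuracy":    accuracy_score(y_true, y_pred),
        "recall":      recall_score(y_true, y_pred),    # sensitivity
        "precision":   precision_score(y_true, y_pred),
        "specificity": tn / (tn + fp),                  # true-negative rate
        "f1":          f1_score(y_true, y_pred),
    }

# For the multiclass FLE target, a one-vs-rest AUC can be obtained with
# roc_auc_score(y_true, y_prob_matrix, multi_class="ovr").
```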
2.4.2 Regression Metrics
For predicting continuous variables (e.g., elaboration
scores), we apply:
Loss (Mean Squared Error (MSE)): Measures av-
erage squared error, penalizing larger deviations
between predicted and actual values.
Mean Absolute Error (MAE): Represents the av-
erage absolute difference between predicted and
true values, offering intuitive error measurement.
Root Mean Squared Error (RMSE): Square root
of MSE, emphasizing larger errors and enhancing
interpretability.
R² Score: Proportion of variance explained by the
model, indicating overall predictive strength for
continuous outcomes.
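A corresponding sketch for the regression metrics, again with scikit-learn and assuming NumPy arrays of true and predicted elaboration scores:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def regression_report(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Metrics from Section 2.4.2 for the continuous elaboration (E) target."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "loss_mse": mse,                      # mean squared error (training loss)
        "mae":      mean_absolute_error(y_true, y_pred),
        "rmse":     float(np.sqrt(mse)),
        "r2":       r2_score(y_true, y_pred),
    }
```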
3 RESULTS
The following section presents the results of the
model evaluation for predicting each of the four target
features. Metrics are presented for all combinations of
models, including those using image data only, text
data only and both combined. The performance of
each model is evaluated using a comprehensive set
of metrics that reflect the quality of predictions in
both classification and regression tasks. These results
provide an in-depth view of the performance of each
combination on different prediction targets and allow
Table 1: Results Predicting Originality (O).
Embedding CNN ROC AUC Accuracy Recall Precision Specificity F1 score
- ResNet50 0,76 0,77 0,71 0,40 0,84 0,51
- InceptionV3 0,81 0,78 0,66 0,42 0,88 0,51
- EfficientNetB0 0,72 0,68 0,55 0,30 0,81 0,39
- Xception 0,82 0,81 0,76 0,44 0,88 0,56
BETO - 0,78 0,73 0,71 0,37 0,81 0,49
BETO ResNet50 0,78 0,69 0,92 0,36 0,67 0,52
BETO InceptionV3 0,84 0,83 0,76 0,46 0,90 0,57
BETO EfficientNetB0 0,68 0,67 0,87 0,30 0,53 0,45
BETO Xception 0,80 0,79 0,82 0,43 0,82 0,56
FastText - 0,80 0,78 0,66 0,40 0,89 0,50
FastText ResNet50 0,74 0,66 0,97 0,34 0,59 0,50
FastText InceptionV3 0,85 0,80 0,71 0,42 0,89 0,53
FastText EfficientNetB0 0,77 0,74 0,55 0,33 0,86 0,54
FastText Xception 0,80 0,73 0,82 0,38 0,75 0,52
Keras - 0,81 0,76 0,87 0,38 0,76 0,53
Keras ResNet50 0,74 0,76 0,66 0,38 0,86 0,48
Keras InceptionV3 0,80 0,71 0,87 0,37 0,72 0,52
Keras EfficientNetB0 0,74 0,61 0,74 0,37 0,78 0,49
Keras Xception 0,75 0,71 0,82 0,37 0,63 0,51
a better understanding of the contribution of text, im-
age and their joint use in the prediction process.
3.1 Predicting Originality (O)
Table 1 presents the performance metrics for var-
ious model combinations used to predict the binary
variable O. The models are evaluated across six met-
rics: ROC AUC, accuracy, recall (sensitivity), preci-
sion, specificity, and F1 score.
The accuracy values range from 0.61 to 0.83, with
the highest accuracy observed in the model using the
combination of BETO and InceptionV3. The ROC
AUC values vary between 0.68 and 0.85, with the best
performance in this regard achieved by the combina-
tion of FastText and InceptionV3.
Recall (sensitivity) scores span from 0.55 to 0.97.
The model with the highest recall is the one that
uses FastText and ResNet50, whereas EfficientNetB0
without using text results in the lowest recall. Speci-
ficity values range between 0.53 and 0.90, where
the combination of BETO and InceptionV3 achieves
the highest specificity, while the BETO and Efficient-
NetB0 combination shows the lowest.
Finally, F1 scores in the table range from 0.39 to
0.57, with the highest score obtained by the BETO
and InceptionV3 combination. The performance of
each model varies across different metrics, indicating
that no single combination of models consistently out-
performs the others in all areas.
3.2 Predicting Elaboration (E)
Table 2 presents the performance metrics for predict-
ing the continuous variable E. The evaluation metrics
provided include loss, Mean Absolute Error (MAE),
Root Mean Square Error (RMSE), and R2 score.
The lowest loss values are observed for the mod-
els without embeddings that use either InceptionV3
(2.82) or Xception (2.81), with these models also
showing the best overall performance in other met-
rics. Specifically, the InceptionV3 model achieves the
lowest MAE (1.07) and RMSE (1.30), along with the
highest R2 score (0.48). The Xception model follows
closely, with an MAE of 1.14 and an RMSE of 1.37,
and an R2 score of 0.44.
In contrast, models incorporating embeddings
show substantially higher loss values. For instance,
the BETO embedding combined with InceptionV3 re-
sults in a loss of 9.02, an MAE of 2.22, and an RMSE
of 2.82, with a negative R2 score of -1.91. Sim-
ilar trends are observed for combinations involving
other embeddings, where the R2 scores are consis-
tently negative, indicating poor model performance
for predicting E.
Overall, the results suggest that models using only
image data outperform those incorporating embed-
dings when predicting E, as evidenced by lower error
metrics and higher R2 values.
Table 2: Results Predicting Elaboration (E).
Embedding CNN loss MAE RMSE R2 score
- ResNet50 3,95 1,36 1,74 0,05
- InceptionV3 2,82 1,07 1,30 0,48
- EfficientNetB0 7,98 1,98 2,76 -2,76
- Xception 2,81 1,14 1,37 0,44
BETO - 8,44 2,14 2,70 -1,61
BETO ResNet50 8,77 2,18 2,77 -1,81
BETO InceptionV3 9,02 2,22 2,82 -1,91
BETO EfficientNetB0 7,65 2,02 2,56 -1,32
BETO Xception 8,58 2,13 2,77 -1,86
FastText - 8,75 2,21 2,77 -1,82
FastText ResNet50 8,88 2,18 2,79 -1,84
FastText InceptionV3 8,99 2,24 2,81 -1,90
FastText EfficientNetB0 8,60 2,20 2,71 -1,58
FastText Xception 9,17 2,22 2,84 -1,93
Keras - 8,49 2,14 2,74 -1,78
Keras ResNet50 8,97 2,18 2,81 -1,89
Keras InceptionV3 8,80 2,15 2,79 -1,87
Keras EfficientNetB0 7,72 2,04 2,64 -1,66
Keras Xception 8,67 2,17 2,77 -1,84
3.3 Predicting Flexibility (FLE)
Table 3 provides the results for predicting the cate-
gorical variable FLE. Metrics such as ROC AUC, ac-
curacy, recall, precision, specificity, and F1 score are
reported for each model configuration.
Based on ROC AUC, the best performing model
is FastText without a convolutional neural network
(CNN), with an AUC value of 0.91. This model
also achieves the best accuracy (0.56) and F1 score
(0.66), maintaining a reasonable balance between re-
call (0.37) and precision (0.80). In contrast, the high-
est recall (0.53) is observed in the Keras-InceptionV3
model, which achieves an F1 score of 0.65 but shows
lower performance in terms of ROC AUC (0.79).
Regarding precision, BETO without a CNN yields
the highest score (1.00), though it has low recall
(0.10), suggesting that its positive predictions are al-
most always correct but that it misses many positive
cases overall. On the other hand, combinations
involving ResNet50 and EfficientNetB0 show lower
recall and precision, indicating underperformance in
comparison to other model combinations.
Models involving embeddings, particularly BETO
and FastText, tend to exhibit more consistent perfor-
mance across several metrics, although with varying
degrees of success depending on the metric of focus.
3.4 Predicting Title Originality (T)
Table 4 presents the evaluation results for predicting
the binary variable T. The metrics shown include ROC
AUC, accuracy, recall, precision, specificity, and F1 score.
The BETO embedding without a CNN stands out
as the best-performing model across most metrics. It
achieves the highest ROC AUC (0.91), the highest ac-
curacy (0.90), and the highest F1 score (0.90). The
recall of this model is also relatively high (0.80), with
a balanced specificity of 0.74.
All of the embeddings combined with CNNs per-
form worse, in terms of ROC AUC and accuracy,
than the same embeddings used without images. De-
spite this, the recall for many of
these models remains relatively high, with some mod-
els, such as Keras with ResNet50 and EfficientNetB0,
achieving perfect recall (1.00). However, these same
models exhibit very low specificity, indicating a ten-
dency to over-predict positive instances.
Results generally indicate that the combination of
text-based embeddings such as BETO and FastText,
especially without an image model, performs well in
T prediction, showing a balanced trade-off between
sensitivity and specificity.
Table 3: Results Predicting Flexibility (FLE).
Embedding CNN ROC AUC Accuracy Recall Precision Specificity F1 score
- ResNet50 0,86 0,54 0,43 0,70 0,00 0,65
- InceptionV3 0,89 0,54 0,50 0,64 0,00 0,65
- EfficientNetB0 0,81 0,26 0,09 0,60 0,00 0,21
- Xception 0,83 0,54 0,50 0,65 0,00 0,65
BETO - 0,86 0,40 0,10 1,00 0,00 0,42
BETO ResNet50 0,84 0,50 0,39 0,79 0,00 0,62
BETO InceptionV3 0,83 0,52 0,50 0,63 0,00 0,63
BETO EfficientNetB0 0,78 0,35 0,28 0,51 0,00 0,50
BETO Xception 0,89 0,52 0,48 0,81 0,00 0,63
FastText - 0,91 0,56 0,37 0,80 0,00 0,66
FastText ResNet50 0,82 0,48 0,42 0,62 0,00 0,60
FastText InceptionV3 0,80 0,51 0,46 0,61 0,00 0,62
FastText EfficientNetB0 0,80 0,36 0,11 0,92 0,00 0,44
FastText Xception 0,82 0,49 0,46 0,71 0,00 0,61
Keras - 0,80 0,53 0,38 0,86 0,00 0,64
Keras ResNet50 0,82 0,50 0,48 0,60 0,00 0,62
Keras InceptionV3 0,79 0,54 0,53 0,64 0,00 0,65
Keras EfficientNetB0 0,86 0,30 0,07 1,00 0,00 0,20
Keras Xception 0,81 0,51 0,50 0,61 0,00 0,62
4 DISCUSSION
The primary goal of this research was to explore
whether using textual descriptions of drawings, in
addition to images, can improve the prediction of
creativity-related characteristics such as originality
(O), thematic flexibility (FLE), elaboration (E), and
title creativity (T). Unlike previous studies that have
focused exclusively on image analysis to assess these
aspects, our research introduces text as an additional
(or even primary) source of information. The results
obtained allow us to reflect on whether text alone is
sufficient and whether combining it with images pro-
vides significant added value.
A key finding is that models based solely on text
were surprisingly competitive in predicting the origi-
nality of the drawing (O). For instance, the FastText
model without a CNN reached a ROC AUC of 0.80
and the Keras embedding alone a recall of 0.87, indi-
cating that textual descriptions have great potential in
capturing whether a drawing is original or not. This
is a significant result, given that previous studies have
relied solely on images, which may have limited the
detection of more abstract aspects of originality.
However, when observing other metrics such as
precision and specificity, models that combine both
text and images (such as FastText + InceptionV3)
showed improvements by reducing false positives.
This suggests that while text alone provides valuable
information, combining both data types allows for a
more balanced and accurate prediction.
The prediction of thematic flexibility (FLE)
showed that text alone is not only sufficient but, in
many cases, the most effective data source. Text-only
models, such as FastText without CNN, achieved the
highest scores in ROC AUC and F1 score, outper-
forming models based solely on images. This indi-
cates that textual descriptions of drawings capture the
variety of themes represented well, an aspect that ap-
pears to be more abstract and conceptual and may es-
cape purely visual evaluation.
The fact that images do not significantly improve
FLE prediction suggests that thematic categories are
more easily expressed and understood through lan-
guage than by observing the visual details of the
drawing.
In the prediction of elaboration (E), images proved
to be clearly superior to text. Models relying exclu-
sively on visual data (such as InceptionV3) achieved
better results in terms of MAE, RMSE, and R2 score,
indicating that the visual details of the drawing are
essential for assessing its level of elaboration. Text-
based models, or combinations of text and images,
were unable to effectively capture the visual nuances
related to the complexity of the drawing.
This finding suggests that for characteristics like
elaboration, which depend on the direct perception of
visual details, the text does not provide sufficient in-
formation and may introduce noise into the analysis.
Consequently, image analysis becomes crucial to ac-
Table 4: Results Predicting Title Originality (T).
Embedding CNN ROC AUC Accuracy Recall Precision Specificity F1 score
- ResNet50 0,57 0,45 1,00 0,54 0,00 0,70
- InceptionV3 0,50 0,41 0,86 0,54 0,02 0,66
- EfficientNetB0 0,54 0,50 0,95 0,57 0,10 0,71
- Xception 0,62 0,61 0,59 0,78 0,47 0,67
BETO - 0,91 0,90 0,80 1,00 0,74 0,90
BETO ResNet50 0,83 0,83 0,73 0,94 0,68 0,82
BETO InceptionV3 0,78 0,83 0,93 0,63 0,26 0,75
BETO EfficientNetB0 0,79 0,64 0,93 0,64 0,31 0,76
BETO Xception 0,78 0,76 0,61 0,82 0,64 0,70
FastText - 0,87 0,80 0,73 0,73 0,65 0,73
FastText ResNet50 0,55 0,50 0,98 0,56 0,08 0,71
FastText InceptionV3 0,69 0,65 0,57 0,68 0,53 0,62
FastText EfficientNetB0 0,51 0,49 0,98 0,56 0,08 0,71
FastText Xception 0,73 0,74 0,66 0,92 0,61 0,77
Keras - 0,87 0,78 0,82 0,80 0,57 0,81
Keras ResNet50 0,48 0,45 1,00 0,54 0,00 0,70
Keras InceptionV3 0,65 0,60 0,66 0,70 0,42 0,68
Keras EfficientNetB0 0,52 0,47 1,00 0,55 0,02 0,71
Keras Xception 0,59 0,57 0,66 0,66 0,37 0,66
curately assess these characteristics.
The analysis of title creativity (T) showed that
models based on text are the most effective tool
for this task. Since title creativity is expressed ex-
clusively through language, text-based models like
BETO without CNN performed exceptionally well,
achieving a ROC AUC of 0.91 and an F1 score of
0.90. In contrast, models based solely on images were
ineffective, highlighting the irrelevance of visual data
for this prediction.
5 CONCLUSION
A key contribution of this research is the demonstra-
tion that text not only provides relevant information
but can, in some cases, be more informative than im-
ages in predicting certain aspects of creativity. In pre-
vious research, the focus has mainly been on images,
overlooking the informative potential of textual de-
scriptions. Our results reveal that:
For features such as originality (O) and thematic
flexibility (FLE), text alone is a very valuable
source, and it may be even more suitable than pic-
tures for capturing abstract concepts.
For title creativity (T), text is the only relevant
data source, as title creativity cannot be evaluated
through images.
For elaboration (E), images remain the best op-
tion, as this aspect depends more on the direct per-
ception of visual details.
The combination of text and images only proved
advantageous in some cases, particularly for reduc-
ing false positives in the prediction of O. However,
in most cases, text alone was sufficient or even more
effective than images.
This study demonstrates that incorporating text as
a data source in the evaluation of creativity in draw-
ings provides significant value, especially in predict-
ing abstract characteristics such as originality and the-
matic flexibility. While images remain crucial for vi-
sual traits like elaboration, researchers should seri-
ously consider using text in future studies, as it offers
a complementary, and in some cases, more powerful
perspective for capturing creativity.
ACKNOWLEDGEMENTS
We would like to thank the Deustek5 group of the
University of Deusto who have made this research
possible.
REFERENCES
Acar, S., Organisciak, P., and Dumas, D. (2024). Automated
scoring of figural tests of creativity with computer vi-
sion. Journal of Creative Behavior.
Allen, T. E., Chen, M., Goldsmith, J., Mattei, N., Popova,
A., Regenwetter, M., Rossi, F., and Zwilling, C.
(2015). Beyond Theory and Data in Preference Mod-
eling: Bringing Humans into the Loop, page 3–18.
Springer International Publishing.
Bahcecik, S. O. (2023). I trends security politics and arti-
ficial intelligence: Key trends and debates. Interna-
tional Political Science Abstracts, 73(3):329–338.
Beaty, R. E. and Johnson, D. R. (2020). Automating cre-
ativity assessment with semdis: An open platform
for computing semantic distance. Behavior Research
Methods, 53(2):757–780.
Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H.,
and Pérez, J. (2020). Spanish pre-trained bert model
and evaluation data. In PML4DC at ICLR 2020.
Cetinic, E. and She, J. (2021). Understanding and creating
art with ai: Review and outlook.
Chollet, F. (2016). Xception: Deep learning with depth-
wise separable convolutions. 2017 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 1800–1807.
Creely, E. and Blannin, J. (2023). The implications of gen-
erative ai for creative composition in higher education
and initial teacher education. ASCILITE Publications,
page 357–361.
Cropley, D. H. and Marrone, R. L. (2022). Automated scor-
ing of figural creativity using a convolutional neural
network. Psychology of Aesthetics, Creativity, and the
Arts.
Davis, J. L., Shank, D. B., Love, T. P., Stefanik, C., and
Wilson, A. (2022). Gender Dynamics in Human-AI
Role-Taking, page 1–22. Emerald Publishing Limited.
Devedzic, V. (2020). Is this artificial intelligence? Facta
universitatis - series: Electronics and Energetics,
33(4):499–529.
Easton, K., Potter, S., Bec, R., Bennion, M., Christensen,
H., Grindell, C., Mirheidari, B., Weich, S., de Witte,
L., Wolstenholme, D., and Hawley, M. S. (2019). A
virtual agent to support individuals living with physi-
cal and mental comorbidities: Co-design and accept-
ability testing. Journal of Medical Internet Research,
21(5):e12996.
Ferrara, S. and Qunbar, S. (2022). Validity arguments
for ai-based automated scores: Essay scoring as an
illustration. Journal of Educational Measurement,
59(3):288–313.
Gado, S., Kempen, R., Lingelbach, K., and Bipp, T. (2021).
Artificial intelligence in psychology: How can we
enable psychology students to accept and use artifi-
cial intelligence? Psychology Learning & Teaching,
21(1):37–56.
Gigi, A. (2015). Human figure drawing (hfd) test is affected
by cognitive style. Clinical and Experimental Psy-
chology, 02.
Harré, M. S. and El-Tarifi, H. (2023). Testing game the-
ory of mind models for artificial intelligence. Games,
15(1):1.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep resid-
ual learning for image recognition.
Imuta, K., Scarf, D., Pharo, H., and Hayne, H. (2013).
Drawing a close to the use of human figure drawings
as a projective measure of intelligence. PLoS ONE, 8.
Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou,
H., and Mikolov, T. (2016). Fasttext.zip: Com-
pressing text classification models. arXiv preprint
arXiv:1612.03651.
Kvam, P. D., Sokratous, K., Fitch, A., and Hintze, A.
(2023). Using artificial intelligence to fit, compare,
evaluate, and discover computational models of deci-
sion behavior.
Lee, S. W., Kwak, D. S., Jung, I. S., Kwak, J. H., Park,
J. H., Hong, S. M., Lee, C. B., Park, Y. S., Kim, D. S.,
Choi, W. H., and Ahn, Y. H. (2015). Partial androgen
insensitivity syndrome presenting with gynecomastia.
Endocrinology and Metabolism, 30(2):226.
Liu, J., Xue, Z., Vann, K. R., Shi, X., and Kutateladze, T. G.
(2020). Protocol for biochemical analysis and struc-
ture determination of the zz domain of the e3 ubiquitin
ligase herc2. STAR Protocols, 1.
O’Shea, K. and Nash, R. (2015). An introduction to convo-
lutional neural networks.
Patterson, J. D., Barbot, B., Lloyd-Cox, J., and Beaty, R. E.
(2023). Audra: An automated drawing assessment
platform for evaluating creativity. Behavior Research
Methods, 56(4):3619–3636.
Pezzulo, G., Parr, T., Cisek, P., Clark, A., and Friston, K.
(2023). Generating meaning: Active inference and
the scope and limits of passive ai.
Røed, R. K., Baugerud, G. A., Hassan, S. Z., Sabet, S. S.,
Salehi, P., Powell, M. B., Riegler, M. A., Halvorsen,
P., and Johnson, M. S. (2023). Enhancing questioning
skills through child avatar chatbot training with feed-
back. Frontiers in Psychology, 14.
Searle, J. R. (2018). Minds, Brains and Programs, page
18–40. Routledge.
Shaban-Nejad, A., Michalowski, M., Bianco, S., Brown-
stein, J. S., Buckeridge, D. L., and Davis, R. L.
(2022). Applied artificial intelligence in health-
care: Listening to the winds of change in a post-
covid-19 world. Experimental Biology and Medicine,
247(22):1969–1971.
Sheng, L., Yang, G., Pan, Q., Xia, C., and Zhao, L.
(2019). Synthetic house-tree-person drawing test: A
new method for screening anxiety in cancer patients.
Journal of Oncology, 2019.
Stojnic, G., Gandhi, K., Yasuda, S., Lake, B. M., and Dil-
lon, M. R. (2022). Commonsense psychology in hu-
man infants and machines.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna,
Z. (2015). Rethinking the inception architecture for
computer vision.
Tan, M. and Le, Q. V. (2020). Efficientnet: Rethinking
model scaling for convolutional neural networks.
Tan, T., Rodriguez-Ruiz, A., Zhang, T., Xu, L., Beets-Tan,
R. G. H., Shen, Y., Karssemeijer, N., Xu, J., Mann,
R. M., and Bao, L. (2023). Multi-modal artificial intel-
ligence for the combination of automated 3d breast ul-
trasound and mammograms in a population of women
with predominantly dense breasts. Insights into Imag-
ing, 14(1).
Wang, W., Kofler, L., Lindgren, C., Lobel, M., Murphy, A.,
Tong, Q., and Pickering, K. (2023). Ai for psycho-
metrics: Validating machine learning models in mea-
suring emotional intelligence with eye-tracking tech-
niques. Journal of Intelligence, 11(9):170.
Weidinger, L., Reinecke, M. G., and Haas, J. (2022). Arti-
ficial moral cognition: Learning from developmental
psychology.
Zhang, R., Zeng, B., Yi, W., and Fan, Z. (2024). Artificial
Intelligence Painting: A New Efficient Tool and Skill
for Art Therapy.