A Neural Network for Automatic Handwriting Extraction and Recognition in Psychodiagnostic Questionnaires

Giulia Rosemary Avis 1,a, Fabio D’Adda 1,2,b, David Chieregato 1,c, Elia Guarnieri 1,2,d, Maria Meliante 3, Andrea Primo Pierotti 1,3,e and Marco Cremaschi 1,f

1 Aton Informatica S.r.l., Centro Direzionale Baricentro, Casamassima, Bari, Italy
2 University of Milano-Bicocca, Department of Informatics, Systems and Communication, Viale Sarca 336, Milan, Italy
3 ASST Grande Ospedale Metropolitano Niguarda, Piazza Ospedale Maggiore 3, Milan, Italy

a https://orcid.org/0000-0001-7840-6228
b https://orcid.org/0009-0006-0995-0661
c https://orcid.org/0000-0002-6507-3708
d https://orcid.org/0000-0003-3461-5372
e https://orcid.org/0000-0002-7124-5668
f https://orcid.org/0000-0001-7840-6228
Keywords: Image Classification, Model Alignment, Information Extraction, Questionnaire Digitalisation, Computer Vision.
Abstract:
This paper presents PANTHER, a neural network model for automatic handwriting extraction and recognition
in psychodiagnostic questionnaires. Psychodiagnostic tools are essential for assessing and monitoring mental
health conditions, but they often rely on pen-and-paper administration, which poses several challenges for data
collection and analysis. PANTHER aims to address this problem by using a convolutional neural network to
classify scanned questionnaires into their respective types and extract the patient’s responses from the hand-
written annotations. The model is trained and evaluated on a dataset of five questionnaires commonly used in
psychological and psychiatric settings, achieving high accuracy and similarity scores. The paper also describes
the creation of an open-source library based on PANTHER, which can be integrated into a digital platform for
delivering psychological services. This paper contributes to the field of computer vision and psychological
assessment by providing a novel and effective solution for digitising pen-and-paper questionnaires.
1 INTRODUCTION
Psychodiagnostic tools are the key to unlocking the
mysteries of the human mind, as they are crucial for
professionals in the field of psychology and psychi-
atry to measure and observe a patient’s behaviour to
arrive at a diagnosis and guide treatment. They serve
various purposes, including assessing symptoms re-
lated to mental health conditions, evaluating symp-
tom severity, and measuring the impact of psychiatric
disorders on daily functioning. Additionally, they ef-
fectively assess intellectual functioning, personality,
memory, and interests (Watkins et al., 1995). Their
benefits range from reducing in-patient time and as-
sociated psychiatric care costs to providing frequent
feedback on symptoms (Holländare et al., 2010).
The origins of psychodiagnostic tools date back
to the 20th century, with one of the earliest assess-
ments being the Binet-Simon Intelligence Scale cre-
ated by Alfred Binet and Théodore Simon in 1905.
Initially designed to assess cognitive abilities in chil-
dren, this scale has since transformed into the stan-
dardised Stanford-Binet test. Alongside this devel-
opment, other psychodiagnostic tools emerged in the
early 20th century (Lanyon, 2011).
Traditionally, questionnaires were administered
using pen-and-paper, a method with robust psycho-
metric properties (Seward et al., 2018). This process
involves distributing printed questionnaires to indi-
viduals who manually complete them using a pen or
pencil. While this approach remains a common way
to administer questionnaires, it may not necessarily
be the most optimal or effective, given the critical
issues it presents. Notably, the logistical burden of
collecting and manually scoring completed question-
naires by therapists can evolve into a time-consuming
process, encompassing costs and time associated with
printing, distributing, collecting, and manually enter-
ing data for analysis (Touvier et al., 2010). A sur-
vey conducted among UK psychiatrists revealed that a common deterrent to the use of self-report questionnaires is the lack of infrastructure to support the administration, scoring, and storage of data (Holländare
et al., 2010). Manual scoring of questionnaires intro-
duces the potential for human error, thereby impact-
ing result accuracy (Engan et al., 2016). Although
continuously relying on this method can lead to inef-
ficient data collection, storage and analysis, the pen-
and-paper method has been the standard for many
years and the basis for developing and validating var-
ious psychodiagnostic tools (Tolley et al., 2015).
In this regard, over the past three decades, there
has been a notable increase in the adoption of
computer-assisted assessment. By 1999, 40% of psy-
chologists reported using some form of computer-
assisted testing (McMinn et al., 1999). This method
has several advantages, including saving valuable
professional time, enhancing test-retest reliability,
mitigating potential tester bias, and decreasing con-
sumer costs by improving efficiency (Butcher et al.,
2000; Groth-Marnat, 1999). Moreover, the advan-
tages of this transition extend to the administration
phase, where it enables skipping irrelevant items
based on previous answers (van Ballegooijen et al.,
2016) and the subsequent treatment and usage of data.
Digitisation will indeed facilitate the integration of
questionnaire results into electronic health records
and comprehensive digital diagnostic technologies.
As a result, it will streamline the real-time monitoring
of a patient’s mental health status (Martin-Key et al.,
2022).
The emergence of electronic administration meth-
ods presents an opportunity to improve the process of
questionnaire execution while maintaining its valid-
ity. It is important to note that the mode of submis-
sion, whether pen-and-paper or electronic, has been
found to have a relatively small effect on the mean
responses given to the questionnaires (Coons et al.,
2009).
However many benefits the transition to digital administration may bring, it will have to be gradual in contexts where pen-and-paper tools have been the
standard for years. There are also specific cases,
such as when working with an elderly population,
where traditional tools may be more suitable (Paulsen
et al., 2012). Moreover, the creation of digital health
records will require, for the patients already in care,
an integration of data from paper-based question-
naires completed in the past.
Hence, apart from creating instruments for digital
dispensing, it is essential to address the need for a so-
lution to automate the data acquisition process from
paper-based questionnaires. In this regard, using Ar-
tificial Intelligence (AI) can prove highly beneficial.
An AI tool can ensure seamless integration of paper
questionnaires with electronic resources, allowing for
better coordination of care and improved monitoring
of patients’ mental health over time.
This paper presents PANTHER (Psychiatric Administration Neural neTwork for Handwritten Extraction and Recognition), a model designed for the automated recognition of digitised questionnaires (e.g., acquired through a scanner). By employ-
ing a Convolutional Neural Network (CNN), PAN-
THER extracts a vectorial representation of question-
naire images. Upon obtaining this representation, the
model classifies the images into various types of com-
piled questionnaires. Subsequently, PANTHER iden-
tifies the specific subsections, items, and page seg-
ments containing patient responses within the classi-
fied questionnaires.
We make the following contributions:
- an analysis of the literature related to the digitisation of questionnaires, with a focus on those pertaining to the psychological/psychiatric domain, and an assessment of the level of digitisation in the psychological field;
- a reproducible pipeline that can be generalised for the digitisation of any type of questionnaire;
- PANTHER (bitbucket.org/disco unimib/panther), an open-source library for questionnaire classification and result extraction.
This paper is organised as follows: Section 2 contains an overview of the State of the Art (SOTA); Section 3 describes the dataset used to validate the proposed approach; Section 4 presents the approach, with a detailed description of the pre-processing phase, feature extraction, classification techniques, and the alignment and extraction of filled questions; Section 5 validates the presented approach using accuracy and similarity measures; Section 6 presents the contributions of this paper, consisting of an open-source library that will be used inside a larger digital platform for the digitalisation of services for psychological personnel.
2 RELATED WORKS
In the field of psychology and psychiatry, psychodiag-
nostic tools, such as questionnaires, play a crucial role
in assessing well-being, coping behaviour, personal-
ity traits, and psychological flexibility. The standard-
ised use of questionnaires in psychological research
and practice, emphasising the extraction of subscales,
has become commonplace (Franke, 1997).
Completing psychodiagnostic questionnaires re-
quires both the patient’s and the professional’s ef-
fort. The administration of these questionnaires to
patients involves providing them with the necessary
instructions and a suitable environment to complete
the forms. Patients may be given the questionnaires
during their clinical visit or may be provided with
electronic versions to complete at home. The average
completion time for these questionnaires varies de-
pending on the specific instrument and the individual
patient. For example, the PQ-16 Italian version (iPQ-
16 (Lorenzo et al., 2018)) is a 16-item questionnaire,
and the average completion time may range from 5 to
15 minutes, depending on the patient’s reading and
comprehension abilities. The CBA-VE (Michielin
et al., 2009) is an 80-item questionnaire; completion can take up to 30 minutes.
Once the questionnaires are completed, psychol-
ogists process the responses to create metrics. This
involves scoring the questionnaires based on prede-
termined criteria and guidelines, which may signif-
icantly vary for each questionnaire. The question-
naires are printed and manually completed using a
pen or pencil. The psychologist must then manually
read and annotate each response to obtain evaluation
metrics. In the best situations, some organisations
have automated the calculation of metrics through
spreadsheets (e.g., Excel), which still require manual
data entry. The poor automation of this activity often
leads to the introduction of human errors, which can
enormously influence the patient’s assessment and the
treatment planning that the patient will have to fol-
low (Simons et al., 2002).
Although there have been some initiatives aimed
at creating digital platforms for the administration of
psychodiagnostic tools, as indicated in the previous
section, their maturity and diffusion remain limited. There is, therefore, a need for systems capa-
ble of digitising pen-and-paper questionnaires.
In recent scientific literature, there are only a few
works on automating the processing and scoring of
already compiled questionnaires. Notably, Optical
Mark Recognition (OMR) and Intelligent Character
Recognition (ICR) technologies have emerged for this
purpose. OMR enables the detection of marks within
checkbox responses, whereas ICR recognises hand-
written characters. However, traditional OMR sys-
tems (similar to the example in Figure 1) lack flexi-
bility as they require the use of specific paper types
for printing the questionnaire and a dedicated device.
Despite requiring extensive manual preparation work,
these approaches are valid alternatives to manual pro-
cessing (Paulsen et al., 2012).
Figure 1: An example of a traditional OMR questionnaire.
In this regard, some proposals have developed an
OMR system without the previously mentioned con-
straints (Sanguansat, 2015). An example is shown
in Figure 2. Nonetheless, annotating the questionnaire sheets with position-detection patterns remains necessary during processing. To this end, the LightQuest approach has been designed, which allows for rapid questionnaire model creation but requires document annotation with alignment targets and is unsuitable for retrospectively completed questionnaires (Chabert et al., 2021). The alignment mainly aims to match the acquired documents with a model. The information in the questionnaires is still extracted by measuring the amount of black within a predefined rectangle.
Figure 2: The questionnaire proposed in (Sanguansat,
2015).
In the proposed setting, this kind of technique is not applicable, as the questionnaires are not prepared for digital acquisition. The most recent tech-
niques for image alignment in the SOTA use computer vision methods (Patel et al., 2015) such as Random Sample Consensus (RANSAC) to align documents without requiring specific alignment targets, with precision close to 100% (Maniar et al., 2021), leveraging the capabilities of OpenCV (opencv.org). Another attempt was made by (Zaryab and Ng, 2023), which highlights the use of Artificial Neural Networks (ANN) for Region of Interest (ROI) detection and OCR (text recognition) in handwritten medical forms, with a focus on German handwritten text recognition. (Norbert et al., 2023) also proposes an RPA-based software robot to assist healthcare professionals in digitising handwritten medical forms, while (Cao and Govindaraju, 2007) presents an algorithm that utilises a vector model based on multiple word-recognition choices, incorporating segmentation probabilities and a Gaussian function for the posterior probability of word recognition, with the purpose of automating data collection from Prehospital Care Report (PCR) forms. Other authors have contributed a system utilising Machine Learning (ML) for text recognition, automatically extracting answers with a precision ranging from 85% to 90% (Yasmin et al., 2017). Unfortunately, the unavailability of the code for these solutions precludes the exact reproduction of their approach.
3 DATASET
This project originates from a collaboration with the CPS “Giovani di Niguarda”, the Psychosocial Center of the ASST Grande Ospedale Metropolitano Niguarda (www.ospedaleniguarda.it), one of the most significant public hospi-
tal facilities in Milan, assisting approximately 500
patients each year with varying degrees of sever-
ity. Given the ongoing collaboration with the CPS,
the dataset creation process has involved the se-
lection of the five most extensively utilised ques-
tionnaires by institutions in their Italian version:
Cognitive Behavioural Assessment Outcome Evalu-
ation (CBA-VE), 16-item Prodromal Questionnaire
(PQ-16), Early Recognition Inventory for the retro-
spective assessment of the Onset of schizophrenia
Checklist (ERIraos-CL), Global Assessment Func-
tioning (GAF), and Social and Occupational Func-
tioning Assessment Scale (SOFAS). Below is a brief
description of these questionnaires:
- CBA-VE is an Italian tool designed to assess the
effectiveness of psychological treatments. It con-
sists of 80 items with a 5-point scale focusing on
the psychological state of the past 15 days, cov-
ering areas such as anxiety, well-being, positive
change perception, depression, and psychologi-
cal distress. The dataset analysis aims to evalu-
ate treatment outcomes based on individual and
treatment-related factors (Michielin et al., 2009).
- PQ-16 is a screening tool designed to assess the
presence of prodromal symptoms associated with
psychosis or schizophrenia. It consists of 16 self-
report items that individuals respond to based on
their experiences. The questionnaire aims to iden-
tify subtle or early signs of psychosis that may
precede the onset of a full-blown psychotic dis-
order. Questions typically cover various aspects,
such as perceptual abnormalities, cognitive dis-
turbances, and social functioning. The PQ-16 is
commonly used in research and clinical settings
as part of a broader assessment to identify indi-
viduals at risk of developing psychosis (Lorenzo
et al., 2018).
- ERIraos-CL is a structured interview designed to
detect the presence of symptoms associated with
psychosis, as well as perceptual and dissociative
phenomena. It serves as a concise support for
the initial assessment of the diverse population of
young help-seekers experiencing symptoms con-
sistent with a prodromal state of psychosis. This
screening tool is the primary choice within early
intervention services to ascertain whether there
are grounds for proceeding with a more compre-
hensive assessment (Meneghelli et al., 2013).
- GAF/SOFAS are two scales used to assess the
severity of a person’s psychological and psychi-
atric symptoms. GAF assesses an individual’s
general functioning on a scale of 1 to 100, where
100 represents the highest functioning and 1 the
lowest. SOFAS is similar to GAF but focuses
on a person’s social and occupational function-
ing (American Psychiatric Association, 2000).
For each questionnaire, 10 anonymous response
sheets were collected. Each administration was then
digitised using a scanner. Considering the number
of pages for each questionnaire, the resulting dataset
consists of 80 images (1,653×2,338 pixels at 72 DPI),
distributed as shown in Table 1.
Table 1: Images per questionnaire.
Questionnaire # images
CBA-VE 30
PQ-16 10
ERIraos-CL 20
GAF 10
SOFAS 10
4 APPROACH
In this Section, we present PANTHER, a tech-
nique designed for classifying psychodiagnostic ques-
tionnaires and extracting corresponding patient re-
sponses. The tool is specifically developed to cate-
gorise a given set of PDF-format questionnaires (q ∈ Q) across a range of predefined categories (t ∈ T), each representing a distinct questionnaire type. The primary objective is establishing an association between each questionnaire (q ∈ Q) and its corresponding type (t ∈ T). Once the type is identified, the re-
sponses are extracted and digitally archived for fur-
ther analysis.
The pipeline of the process is represented in Fig-
ure 3, and it is composed of four sequential phases: (i) Image Pre-processing, (ii) Feature Extraction, (iii) Classification, and (iv) Image Alignment and Extraction.
4.1 Image Pre-Processing
The initial phase involves applying preliminary tech-
niques to raw data before it is fed into an ML algo-
rithm for training or testing. Pre-processing aims to
convert the raw data into a format suited for analysis,
thereby enhancing the performance and reliability of
ML models. Specifically, in this scenario where the
dataset consists of PDFs, the following steps are un-
dertaken to carry out the pre-processing:
1. Image Loading. The image dataset is loaded,
and each image is transformed using the “RGB”
colour model. This implies that each image is en-
coded with three channels: red, green, and blue,
so each pixel occupies 3 bytes of storage, one for
each colour channel.
2. JPG Conversion. This step entails converting each document, initially in PDF format, into a format compatible with the Image module of the Pillow library in Python (pillow.readthedocs.io/en/stable/reference/Image.html). To accomplish this, the pdf2image library (pdf2image.readthedocs.io) is employed in the computational process.
3. Resizing. This step consists of adjusting images
to a consistent size to ensure uniformity in the
dataset. This is important because most ML mod-
els expect input data with consistent dimensions.
Once the image has been loaded, it is resized by using the “resize” method of the PIL library. The
method is used to resize an image to a specified
size, maintaining the original image’s aspect ra-
tio. ML models employed in the next sections are
designed to accept input images of size 224×224
pixels, which is a common input size for many im-
age classification models. The PIL library tends to
preserve the quality and aspect ratio of the image.
Figure 4 shows an example of a resized question-
naire.
4. Normalisation. The final stage involves normalising the image vector using a mean and standard deviation, applied per RGB channel:

image = (image − mean) / std

Specifically, the images are normalised using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]. This normalisation approach is employed because the models discussed and elaborated upon in the subsequent subsection expect input images to be normalised in this manner (pytorch.org/vision/0.8/models.html). An example of a normalised CBA-VE questionnaire is shown in Figure 5.
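For illustration, the four steps above can be composed as follows. This is a minimal sketch assuming the pdf2image, Pillow, and torchvision packages; the file name is chosen for the example.

```python
# Minimal sketch of pre-processing steps 1-4 (assumes pdf2image, Pillow,
# and torchvision; the PDF path is illustrative).
from pdf2image import convert_from_path
from torchvision import transforms

# Mean/std expected by the pre-trained torchvision models (step 4).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                      # step 3: resizing
    transforms.ToTensor(),                              # PIL image -> tensor in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),    # step 4: normalisation
])

def load_questionnaire(pdf_path):
    """Convert every PDF page to an RGB image (steps 1-2) and normalise it."""
    pages = convert_from_path(pdf_path)                 # step 2: PDF -> PIL images
    return [preprocess(page.convert("RGB")) for page in pages]

tensors = load_questionnaire("cba_ve_scan.pdf")         # one tensor per page
```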
4.2 Feature Extraction
Given the remarkable ability of ML models in image
recognition tasks, we adopted an ML-based solution to
extract relevant features from questionnaire images.
Due to the limited size of the dataset, the Trans-
fer Learning (TL) technique has been chosen to ex-
tract feature vectors from images within our dataset.
This technique is particularly useful when the target
task has limited labelled data, so instead of training a
model from scratch, TL allows leveraging knowledge
acquired from a more extensive dataset related to a
different task.
ML models are utilised for image classification
due to their ability to discern patterns and features
in large datasets, making them highly effective in
Figure 3: PANTHER pipeline.
Figure 4: Example of resized questionnaire image.
tasks that involve visual recognition. Traditional rule-
based systems often struggle to capture the complex-
ity and variability in images, especially when com-
pared with diverse and dynamic visual information.
ML models, particularly deep learning models, ex-
cel at learning hierarchical representations of features
from raw data. In the context of image classifica-
tion, these models can automatically extract relevant
features and patterns from images, allowing them to
recognise and categorise objects, scenes, or patterns
with remarkable accuracy. The process involves train-
ing the model on a labelled dataset, where it learns to
associate specific features with corresponding labels.
Once trained, the model can generalise its knowl-
edge to accurately classify new, unseen images. Their
capacity to handle complex visual information and
adapt to diverse scenarios has positioned ML mod-
els as a cornerstone in advancing image classification
technologies. The type of model used to address image tasks is the CNN, a class of Deep Neural Networks designed for processing structured grid
data. They are particularly effective in tasks related
to computer vision, image recognition, and other vi-
sual data analysis applications. CNNs have proven
to be highly successful in these domains due to their
ability to learn hierarchical features from input data
automatically (O’Shea and Nash, 2015). An example of a CNN architecture is represented in Figure 6.
Figure 5: Example of normalised questionnaire.
Figure 6: The CNN model architecture.
A vast number of models is available for extracting feature vectors; among these CNN models, some networks have become classics due to their excellent generality and accuracy (Du et al., 2023):
1. VGG16 is a specific CNN architecture that was
introduced by the Visual Graphics Group (VGG)
at the University of Oxford. The “16” refers to
the total number of weight layers in the network,
including convolutional layers, Fully Connected
(FC) layers, and softmax layers for classification.
VGG16 gained popularity for its simplicity and ef-
fectiveness in image classification tasks. The to-
tal number of parameters in VGG16 is 138M (Si-
monyan and Zisserman, 2014);
2. GoogleNet is a CNN architecture developed by
researchers at Google, which has been designed
to address challenges such as computational ef-
ficiency, as well as the vanishing/exploding gra-
dient problems associated with very deep neu-
ral networks. The total number of parameters in
GoogleNet is 5M (Szegedy et al., 2015);
3. Resnet18 is a specific CNN architecture that is
part of the ResNet (Residual Networks) family.
Resnet18 was introduced to address the challenge
of training very deep neural networks. The main
innovation of Resnet18 is the use of residual or
skip connections, which help mitigate the van-
ishing gradient problem during training. Mul-
tiple ResNet architectures are available, such as
Resnet34, ResNet50, ResNet101 and ResNet152,
but the choice of using Resnet18 is related to its
fewer number of parameters, which makes it com-
putationally lighter and faster. It also requires less
memory and processing power compared to other
ResNet architectures. The total number of param-
eters in Resnet18 is 11M (He et al., 2016).
Leveraging each of the three previously mentioned
models, a feature vector was derived for each image
in the dataset. This process can be executed through
various methods involving the removal of one or more
layers from the model. The decision to perform this
adjustment aims to acquire vectors that depict images
at a distinct level of granularity. Typically, initial lay-
ers capture low-level features and patterns from the
input data, while the final layers map high-level fea-
tures extracted starting from convolutional and pool-
ing layers to the output classes specific to the final
task (Alzubaidi et al., 2021). Therefore, if the task
closely aligns with the model’s training objective, it
is feasible to truncate the model towards the end of
the CNN, particularly within the FC layers. VGG16,
GoogleNet, and Resnet18 underwent training using
supervised learning techniques on extensive labelled
datasets. All three models were trained on the Ima-
geNet dataset (Deng et al., 2009), a substantial collec-
tion that includes millions of labelled images across
numerous categories, serving as a benchmark for im-
age classification tasks. Given the similarity between
our task and the ImageNet challenge, which focuses
on image recognition, we explored two distinct cuts
for each network.
The feature vectors obtained in this phase serve as
the input for the subsequent classification step.
4.3 Classification
This step of the pipeline is the classification, wherein
predefined labels or categories are assigned to input
data based on its distinctive characteristics or features.
Classification, a form of supervised learning, involves
training the algorithm on a labelled dataset, where in-
put data is paired with corresponding output labels.
In this phase, the input comprises feature vectors ex-
tracted from pre-trained models, serving as the basis
for classifying the represented images. The execu-
tion of this task involves the utilisation of two distinct
techniques:
1. Supervised Classification. It adopts the classi-
cal ML paradigm that focuses on training mod-
els to make accurate predictions, even if very lim-
ited examples for each class are available. In tra-
ditional ML approaches, models often require a
large amount of labelled training data to gener-
alise well to new, unseen examples. However, in
many real-world scenarios, obtaining a large la-
belled dataset can be challenging and expensive.
2. Similarity Classification. This method refers to a
type of classification where the similarity between
instances or data points is a key factor in determin-
ing their class or category. In traditional classifi-
cation, a model is trained on a specific set of la-
belled classes, and it can only predict labels from
that predefined set. However, similarity classifi-
cation is capable of classifying images never seen
before into categories never seen before, making
this technique zero-shot.
4.4 Supervised Classification
Supervised classification uses pre-trained vectors and
trains a simple classifier to recognise the question-
naire category (Figure 7). In our case, three types of ML classifiers have been tested:
1. Support Vector Machine (SVM). SVMs are a
class of supervised ML algorithms used for classi-
fication and regression tasks. This ML technique
is particularly well-suited for scenarios where
the data can be represented as points in a high-
dimensional space, like feature vectors in PAN-
THER (Noble, 2006).
2. Random Forest (RF). RF is an ensemble learn-
ing method that can be used for classification
tasks. It is robust, handles non-linear relation-
ships well, and can work with dense feature vec-
tors (Biau and Scornet, 2016).
3. FC Layer. A FC layer, also known as a dense layer, is a layer in a neural network where each neuron is connected to every neuron in the previous layer and every neuron in the next layer. In the context of a classifier, a FC layer is often used as the final layer of a neural network to make predictions based on the features extracted by the preceding layers. In this case, the final FC layer of each base model (VGG16, GoogleNet and Resnet18) is replaced to perform our task-specific prediction, as sketched below. The FC layer is also often used as a classifier for dense feature vectors because it allows complex relationships to be learned between the input features (Basha et al., 2020).
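The FC-head replacement mentioned in item 3 can be sketched as follows, assuming torchvision; the frozen backbone and the five-class output (one per questionnaire type) are illustrative choices, not necessarily the paper's exact training setup.

```python
# Sketch of the FC-head replacement (assumes torchvision; freezing the
# backbone and using 5 output classes are illustrative choices).
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False              # freeze the pre-trained weights

# Only this new task-specific layer is trained.
model.fc = nn.Linear(model.fc.in_features, 5)
```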
SVM, RF and FC models have been trained in a supervised way on questionnaires by using the provided dataset. These models take as input the vectors of high-level features learned by the pre-trained models and then classify the image vectors into the correct category. Results are shown in Table 2.
Figure 7: Supervised Classification algorithm.
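A minimal sketch of this supervised step, assuming scikit-learn on top of the feature vectors extracted earlier; `labels` is a hypothetical array of ground-truth questionnaire types, and the hyperparameters are library defaults rather than the paper's.

```python
# Sketch of the supervised classifiers on top of the CNN feature vectors
# (assumes scikit-learn; `labels` is a hypothetical ground-truth array).
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

X = vectors.numpy()                              # (N, 512) CNN feature vectors
y = np.array(labels)                             # e.g. 0=CBA-VE, ..., 4=SOFAS

for clf in (SVC(), RandomForestClassifier()):
    clf.fit(X, y)                                # supervised training
    print(type(clf).__name__, clf.predict(X[:1]))  # predicted type of page 0
```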
4.5 Similarity Classification
A similarity classification technique has been applied
for classifying questionnaires using a similarity met-
ric. In this case, the algorithm does not apply any
training phase to any model for building the recog-
nition system. This means that only feature vectors
obtained in the previous step are used to perform the
task. The algorithm’s core lies in comparing images
of patient questionnaires filled out with ground truth
vectors obtained from high-quality images of original
PDF questionnaires (Figure 8). The underlying as-
sumption of this algorithm is that the vectors of the
pages filled out by patients will have high similarity to
the original yet-to-be-filled questionnaires. This tech-
nique allows for comparing the individual image em-
bedding with all vectors of the original images. If ad-
ditional questionnaires are added in the future, there
would be no need for any training; instead, one would
simply compute the new vectors by using the previ-
ously mentioned models.
The metric used to compute the similarity between
vectors is the Cosine Similarity (pytorch.org/docs/stable/generated/torch.nn.CosineSimilarity.html), which provides a
measure of similarity that is invariant to the magni-
tude of the vectors. When dealing with embeddings
or feature vectors, the magnitude of the vector can be
influenced by factors such as scaling, and cosine sim-
ilarity helps address this issue.
The cosine similarity is defined as

cos(Θ) = (q · t) / (‖q‖ ‖t‖) = (Σ_{i=1}^{n} q_i t_i) / (√(Σ_{i=1}^{n} q_i²) √(Σ_{i=1}^{n} t_i²))

where q is the vector representation [q_1, q_2, q_3, ..., q_n] of the questionnaire compiled by the patient and t is the vector representation [t_1, t_2, t_3, ..., t_n] of the yet-to-be-filled questionnaire (the clean one).
Figure 8: Similarity Prediction algorithm.
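The similarity prediction can be sketched as follows, assuming PyTorch; `template_vectors` and `template_names` are hypothetical containers holding one pre-computed vector and name per clean questionnaire page.

```python
# Sketch of the zero-shot similarity prediction (assumes PyTorch;
# template containers are hypothetical and pre-computed once per clean page).
import torch
import torch.nn.functional as F

def classify_by_similarity(page_vector, template_vectors, template_names):
    """Return the questionnaire type whose clean template is most similar."""
    sims = F.cosine_similarity(page_vector.unsqueeze(0), template_vectors)
    return template_names[int(sims.argmax())]

# e.g. best = classify_by_similarity(vectors[0], template_vectors, template_names)
```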
4.6 Image Alignment and Extraction
The last phase of the pipeline is to align and extract
information from the image questionnaires. Once
these are correctly classified, the information related
to the type of questionnaire submitted is obtained. We
achieved this result by using OMRChecker (github.com/Udayraj123/OMRChecker), an open-
source library available in Python. This library is a
versatile tool designed for the precise reading and as-
sessment of OMR sheets, using either a scanner or an
image taken with a mobile phone. It excels in process-
ing customised OMRs, delivering nearly flawless ac-
curacy, especially when applied to high-quality doc-
ument scans. To extract information from the grid filled out by the user, the library requires alignment with a file template that captures the provided answers.
To specify this template, a configuration file named
“template.json” is created, which allows us to define
all the parameters for creating the alignment. Once
alignment is achieved, the distribution of the grey
colour scale in the image can be used as a response
classification criterion. This, coupled with a speci-
fied cutting threshold, facilitates the identification of
marked questionnaire items. The OMR extracts the
response given by the patient (or the therapist) for
each question. The output of this final phase is re-
turned in JSON format, representing the digitisation
of the submitted and completed questionnaire. An
example of how responses are extracted from ques-
tionnaires is represented in Figure 9. Specifying the
template for OMRChecker to extract responses ne-
cessitates the creation of a distinct template for each
type of questionnaire. For this reason, the recognition step (Section 4.3) is essential before the responses can be effectively extracted.
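As an illustration, a minimal template definition might look like the following sketch. The key names follow OMRChecker's template schema as documented in its repository, but every coordinate, dimension, and field label below is hypothetical and would have to be measured on the actual questionnaire layout.

```python
# Hypothetical "template.json" for one questionnaire page (key names follow
# OMRChecker's template schema; all values below are illustrative).
import json

template = {
    "pageDimensions": [1653, 2338],        # scanned page size in pixels
    "bubbleDimensions": [32, 32],          # size of a single checkbox
    "fieldBlocks": {
        "items_1_to_16": {
            "fieldType": "QTYPE_MCQ5",     # five response options per item
            "origin": [300, 450],          # top-left corner of the answer grid
            "fieldLabels": ["q1..16"],
            "bubblesGap": 60,              # horizontal spacing between options
            "labelsGap": 55,               # vertical spacing between items
        }
    },
}

with open("template.json", "w") as f:
    json.dump(template, f, indent=2)
```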
5 EVALUATION
This Section presents the assessment of the experi-
ments that were conducted, which is divided into two
parts. The first part shows the outcomes of the exper-
imentation for the supervised model in questionnaire
classification, while the second part validates the sim-
ilarity classification algorithm.
The first experiment, Supervised Classification, uses the previously described dataset of images and trains each model: SVM, RF and the FC layer.
Since our dataset is limited in size, k-fold cross-validation (Browne, 2000) has been used to train SVM, RF and FC with k = 10 (8 instances per fold). This value represents a common choice and strikes a balance between bias and variance. Cross-validation is a statistical technique used in ML and statistical modelling to assess the performance and generalisability of a predictive model. Its primary purpose is to provide a more accurate and reliable estimate of a model’s performance by using different subsets of the data for training and testing. The basic idea behind cross-validation is to divide the dataset into multiple subsets or “folds”. In this case, the model is trained on k − 1 folds and then evaluated on the remaining fold. This process is repeated multiple times, each time using a different set of folds for training and testing.
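The described procedure corresponds to the following sketch, assuming scikit-learn and the X, y arrays from the earlier classification sketch.

```python
# Explicit 10-fold loop matching the procedure described above
# (assumes scikit-learn; X and y come from the classification sketch).
from sklearn.model_selection import KFold
from sklearn.svm import SVC

accuracies = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(X):
    clf = SVC().fit(X[train_idx], y[train_idx])             # train on k-1 folds
    accuracies.append(clf.score(X[test_idx], y[test_idx]))  # held-out fold
print(sum(accuracies) / len(accuracies))                    # mean accuracy
```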
The second series of experiments conducted pertains to Similarity Classification. In this context, image vectors derived from pre-trained models using compiled questionnaires are assessed against vectors generated from cleaned questionnaires, which serve as the benchmark for evaluating the similarity between the compiled vector images and their cleaned counterparts.

Table 2: Training results of SVM, RF and FC layer by using feature vectors obtained from the penultimate layer.
Pretrained Model Vector Size Classification Model Accuracy
VGG16 4096 SVM 0.987
VGG16 4096 RF 0.825
VGG16 4096 FC Layer 0.625
GoogleNet 1024 SVM 0.375
GoogleNet 1024 RF 0.412
GoogleNet 1024 FC Layer 0.375
ResNet18 512 SVM 0.900
ResNet18 512 RF 0.700
ResNet18 512 FC Layer 0.375
Numerous experiments are feasible due to the
flexibility in cutting CNN models at various points.
In our study, we made two cuts for each model, aim-
ing to extract vectors from the penultimate (Table 3)
and ante-penultimate (Table 5) layers, excluding the
final layer, which serves a classification role.
Table 3: Similarities using vectors obtained by cutting the
penultimate model layer.
Model Vector Size Accuracy
VGG16 4096 0.813
GoogleNet 1024 0.113
ResNet 512 0.376
The VGG16 network attains the highest perfor-
mance, boasting an accuracy of approximately 0.81.
An illustrative comparison is presented in Table 4, highlighting an instance where the similarity between ground-truth vectors and an image from the dataset results in misclassification.
The table illustrates how the similarity approach
erroneously categorises the questionnaire as SOFAS
instead of the correct category, which is PQ 16. The
second conducted experiment using Similarity Classi-
fication, utilises vectors extracted from the models by
cutting the CNN at the ante-penultimate layer. This
approach results in vectors with larger dimensions to
retain more information within the embedding. The
expectation is that computational time will increase
Figure 9: Response extraction from questionnaires.
Table 4: Similarity example using compiled image of PQ
16 questionnaire (VGG vector of size 4096).
Questionnaire Similarity
CBA VE p1 0.845
CBA VE p2 0.775
CBA VE p3 0.819
ERIRAOS p1 0.811
ERIRAOS p2 0.773
PQ 16 0.846
GAF 0.841
SOFAS 0.856
Table 5: Similarities using vectors obtained by cutting the
last 2 layers.
Model Vector Size Accuracy
VGG16 25088 1.0
GoogleNet 50176 0.90
ResNet 25088 1.0
but with higher accuracy. Results are shown in Ta-
ble 5.
Generally, VGG16 and Resnet18 achieve better
performances than GoogleNet. The reason could be
that VGG16 and Resnet18 have relatively simpler ar-
chitectures compared to GoogleNet. When limited
data are available, simpler networks tend to generalise
better. They are less prone to overfitting, which oc-
curs when a model learns the training data too well
but fails to generalise to new data. GoogleNet, with
its more complex architecture, might require a larger
dataset or more sophisticated optimisation to achieve
the best performance. Both classification techniques
achieved peak accuracy levels, although, in the Sim-
ilarity Classification, the algorithm requires a larger
embedding representation of the image as input. Ad-
ditionally, Similarity Classification has the advantage
of being able to add additional types of question-
naires to classify without having to perform any kind
of training. In any case, to perform the classifi-
cation task, the two algorithms presented (Similarity
Classification and Supervised Classification) can be
seen as independent modules that can be invoked for
the questionnaire classification step.
Following the successful classification of ques-
tionnaire pages into their respective models, an es-
sential step in the extraction of compiled informa-
tion involves achieving alignment between scanned
images and the designated models. As delineated in
the SOTA section, employing the RANSAC algorithm
with OpenCV stands out as the current optimal ap-
proach for achieving an exceptionally high-precision
alignment of images with models. After achieving
alignment, the distribution of the grey colour scale
within the image serves as a criterion for response
classification. This, combined with a defined cutting
threshold, streamlines the identification of marked
questionnaire items.
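This alignment step can be sketched with OpenCV as follows, assuming ORB features for matching; the detector and RANSAC parameters are illustrative.

```python
# Sketch of template alignment with ORB features + RANSAC homography
# (assumes opencv-python; parameter values are illustrative).
import cv2
import numpy as np

def align_to_template(scan_gray, template_gray):
    """Warp a scanned page onto the clean template before OMR extraction."""
    orb = cv2.ORB_create(5000)
    kp1, des1 = orb.detectAndCompute(scan_gray, None)
    kp2, des2 = orb.detectAndCompute(template_gray, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:500]

    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # RANSAC discards outlier matches while estimating the homography.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = template_gray.shape
    return cv2.warpPerspective(scan_gray, H, (w, h))
```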
6 CONCLUSIONS
In conclusion, while many tools exist that, when
used collectively, can yield outcomes comparable to
the digitisation of hand-filled questionnaires, this pa-
per introduces an innovative neural network model
specifically developed for the automatic extraction
and recognition of psychodiagnostic questionnaires.
Additionally, the presented tool facilitates the digiti-
sation of a substantial volume of questionnaires with
minimal human intervention.
The dataset for this project comprised anonymous
responses to five extensively used questionnaires in
their Italian versions: CBA-VE, PQ-16, ERIraos-CL,
GAF, and SOFAS. These cover various psychological
and psychiatric aspects and serve as essential tools in
assessing and treating patients.
One issue encountered is certainly the lack of a larger quantity of data with which to perform a consistent fine-tuning of the convolutional networks used. For this reason, a Transfer Learning technique was adopted without fine-tuning the CNNs (GoogleNet, VGG16 and Resnet18). Additionally, due to the highly sensi-
tive nature of the information contained in this type of
questionnaire, a labelled dataset serving as a ground
truth for accurate assessment was not found. There-
fore, the validation of the extraction phase lacked a
measurable metric.
Currently, the tool operates only on structured
questionnaires, exclusively recognising manually
filled check boxes. Nonetheless, we intend to include
a handwriting recognition component in the future,
allowing for the digitisation of less structured ques-
tionnaires containing handwritten sections. Thanks to
the collaboration with CPS “Giovani di Niguarda” we
will test the effectiveness of this solution on a wider
number of administrations, further expanding its ap-
plicability and assessing its performance.
This solution will be integrated into a proprietary
digital platform designed to oversee the comprehen-
sive management of psychiatric and psychological pa-
tient treatment courses. Although the digital platform
will not be open-source, the PANTHER tool is publicly
available for further research and to be freely inte-
grated into other solutions.
REFERENCES
Alzubaidi, L., Zhang, J., Humaidi, A. J., Al-Dujaili, A.,
Duan, Y., Al-Shamma, O., Santamaría, J., Fadhel,
M. A., Al-Amidie, M., and Farhan, L. (2021). Re-
view of deep learning: Concepts, cnn architectures,
challenges, applications, future directions. Journal of
big Data, 8:1–74.
American Psychiatric Association (2000). Manuale diag-
nostico e statistico dei disturbi mentali, chapter Valu-
tazione Multiassiale. Masson, 4th edition. text rev.
Basha, S. S., Dubey, S. R., Pulabaigari, V., and Mukherjee,
S. (2020). Impact of fully connected layers on per-
formance of convolutional neural networks for image
classification. Neurocomputing, 378:112–119.
Biau, G. and Scornet, E. (2016). A random forest guided
tour. Test, 25:197–227.
Browne, M. W. (2000). Cross-validation methods. Journal
of mathematical psychology, 44(1):108–132.
Butcher, J. N., Perry, J. N., and Atlis, M. M. (2000). Valid-
ity and utility of computer-based test interpretation.
Psychological assessment, 12(1):6.
Cao, H. and Govindaraju, V. (2007). Vector model based
indexing and retrieval of handwritten medical forms.
In Ninth International Conference on Document Anal-
ysis and Recognition (ICDAR 2007), volume 1, pages
88–92. IEEE.
Chabert, C., Collado, A., Cheval, B., and Hue, O. (2021). Paper-and-pencil questionnaires analysis: a new automated technique to reduce analysis time and errors. bioRxiv, pages 2021–03.
Coons, S. J., Gwaltney, C. J., Hays, R. D., Lundy, J. J.,
Sloan, J. A., Revicki, D. A., Lenderking, W. R.,
Cella, D., and Basch, E. (2009). Recommendations
on evidence needed to support measurement equiv-
alence between electronic and paper-based patient-
reported outcome (pro) measures: Ispor epro good re-
search practices task force report. Value in Health,
12(4):419–429.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. (2009). Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on com-
puter vision and pattern recognition, pages 248–255.
Ieee.
Du, X., Sun, Y., Song, Y., Sun, H., and Yang, L. (2023). A
comparative study of different cnn models and transfer
learning effect for underwater object classification in
side-scan sonar images. Remote Sensing, 15(3):593.
Engan, H., Hilmarsen, C., Sittlinger, S., Sandmæl, J. A.,
Skanke, F., and Oldervoll, L. M. (2016). Are web-
based questionnaires accepted in patients attending re-
habilitation? Disability and Rehabilitation, 38:2406–
2412.
Franke, G. H. (1997). “The whole is more than the sum of its
parts”: the effects of grouping and randomizing items
on the reliability and validity of questionnaires. Euro-
pean Journal of Psychological Assessment, 13:67–74.
Groth-Marnat, G. (1999). Current status and future direc-
tions of psychological assessment: Introduction. Jour-
nal of Clinical Psychology, 55(7):781–785.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Holländare, F., Andersson, G., Engström, I., et al. (2010). A
comparison of psychometric properties between inter-
net and paper versions of two depression instruments
(bdi-ii and madrs-s) administered to clinic patients.
Journal of medical Internet research, 12(5):e1392.
Lanyon, R. I. (2011). Handbook of psychological assess-
ment, Gary Groth-Marnat, Wiley, New York (2009), 752 pp.
Lorenzo, P., Silvia, A., Federica, P., Sara, G., Ilaria, S.,
Pupo, S., and Raballo, A. (2018). The italian ver-
sion of the 16-item prodromal questionnaire (ipq-16):
Field-test and psychometric features. Schizophrenia
Research, 199:353–360.
Maniar, S., Parmani, J., Bodke, M., and Saxena, K. (2021).
Generation and grading of arduous mcqs using nlp
and omr detection using opencv. In 2021 12th Inter-
national Conference on Computing Communication
and Networking Technologies (ICCCNT), pages 1–7.
IEEE.
Martin-Key, N. A., Spadaro, B., Funnell, E., Barker, E. J.,
Schei, T. S., Tomasik, J., and Bahn, S. (2022). The
current state and validity of digital assessment tools
for psychiatry: systematic review. JMIR Mental
Health, 9(3):e32824.
McMinn, M. R., Buchanan, T., Ellens, B. M., and Ryan,
M. K. (1999). Technology, professional practice, and
ethics: Survey findings and implications. Professional
Psychology: Research and Practice, 30(2):165.
Meneghelli, A., Alpi, A., Cascio, M., Häfner, H., Maurer,
K., Preti, A., Raballo, A., and Cocchi, A. (2013). Ver-
sione italiana dell’early recognition inventory for the
retrospective assessment of the onset of schizophrenia
checklist: affidabilità, validità e istruzioni per l’uso.
Journal of Psychopathology, 19:1–2.
Michielin, P., Bertolotti, G., Sanavio, E., Vidotto, G., and
Zotti, A. M. (2009). Manuale d’uso del cba-ve test per
valutare l’efficacia nella pratica dei trattamenti psico-
logici e psicoterapeutici. Professione Psicologi, 3:8–
11.
Noble, W. S. (2006). What is a support vector machine?
Nature biotechnology, 24(12):1565–1567.
Norbert, G., Stoicu-Tivadar, V., and Emanuela, G. (2023).
Robotic process automation based data extraction
from handwritten medical forms. In Telehealth
Ecosystems in Practice: Proceedings of the EFMI
Special Topic Conference 2023, volume 309, page 68.
IOS Press.
O’Shea, K. and Nash, R. (2015). An introduction
to convolutional neural networks. arXiv preprint
arXiv:1511.08458.
Patel, N. V., Prajapati, G. I., and SVMIT, B. (2015). Var-
ious techniques for assessment of omr sheets through
ordinary 2d scanner: a survey. International Journal
of Engineering Research & Technology (IJERT), Vol. 4.
Paulsen, A., Overgaard, S., and Lauritsen, J. M. (2012).
Quality of data entry using single entry, double entry
and automated forms processing–an example based
on a study of patient-reported outcomes. PloS one,
7(4):e35087.
Sanguansat, P. (2015). Robust and low-cost optical
mark recognition for automated data entry. In 2015
12th International Conference on Electrical Engi-
neering/Electronics, Computer, Telecommunications
and Information Technology (ECTI-CON), pages 1–5.
IEEE.
Seward, R. J., Bayliss, D. M., Stallman, H. M., and Ohan,
J. L. (2018). Psychometric properties and norms for
the strengths and difficulties questionnaire adminis-
tered online in an australian sample. Australian Psy-
chologist, 53:116–124.
Simons, R., Goddard, R., and Patton, W. (2002). Hand-
scoring error rates in psychological testing. Assess-
ment, 9(3):292–300.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
novich, A. (2015). Going deeper with convolutions.
In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 1–9.
Tolley, C., Rofail, D., Gater, A., and Lalonde, J. (2015). The
feasibility of using electronic clinical outcome assess-
ments in people with schizophrenia and their infor-
mal caregivers. Patient Related Outcome Measures,
page 91.
Touvier, M., Méjean, C., Kesse-Guyot, E., Pollet, C., Malon, A., Castetbon, K., and Hercberg, S. (2010).
Comparison between web-based and paper versions
of a self-administered anthropometric questionnaire.
European Journal of Epidemiology, 25:287–296.
van Ballegooijen, W., Riper, H., Cuijpers, P., van Oppen, P.,
and Smit, J. H. (2016). Validation of online psycho-
metric instruments for common mental health disor-
ders: a systematic review. BMC psychiatry, 16:1–12.
Watkins, C. E., Campbell, V. L., Nieberding, R., and Hall-
mark, R. (1995). Contemporary practice of psycho-
logical assessment by clinical psychologists. Profes-
sional psychology: Research and practice, 26(1):54.
Yasmin, F., Hossain, S. M. M., and Arefin, M. S. (2017).
Apsd: a framework for automated processing of sur-
vey documents. In 2017 International Conference on
Electrical, Computer and Communication Engineer-
ing (ECCE), pages 411–416. IEEE.
Zaryab, M. A. and Ng, C. R. (2023). Optical character
recognition for medical records digitization with deep
learning. In 2023 IEEE International Conference on
Image Processing (ICIP), pages 3260–3263. IEEE.