RLHR: A Framework for Driving Dynamically Adaptable Questionnaires and Profiling People Using Reinforcement Learning

Ciprian Paduraru¹, Catalina Camelia Patilea¹ and Alin Stefanescu¹,²
¹ Department of Computer Science, University of Bucharest, Romania
² Institute for Logic and Data Science, Romania
Keywords: Reinforcement Learning, Bias Removal, Time Series, Classification, Behaviors, Profiling.
Abstract: In today's corporate landscape, the creation of questionnaires, surveys or evaluation forms for employees is a widespread practice. These tools are regularly used to check various aspects such as motivation, opportunities for improvement, satisfaction levels and even potential cybersecurity risks. A common limitation lies in their generic nature: they often lack personalization and rely on predetermined questions. Our research focuses on improving this process by introducing AI agents based on reinforcement learning. These agents dynamically adapt the content of surveys to each person based on their unique personality traits. Our framework is open source and can be seamlessly integrated into various use cases in different industries or academic research. To evaluate the effectiveness of the approach, we tackle a real-life scenario: the detection of potentially inappropriate behavior in the workplace. In this context, the reinforcement learning-based AI agents function like human recruiters and create personalized surveys. The results are encouraging, showing that our content-selection decisions closely match those of human recruiters. The open-source framework also includes tools for detailed post-survey analysis, supporting further decision making and explanation of the results.
1 INTRODUCTION
The goal of this work is to create a framework that
contains the tools needed to conduct large-scale adap-
tive surveys and to thoroughly analyze the results after
the survey. The main component is a software-based
virtual HR agent that can behave like a real human
during a survey and adapt the sequence of questions
asked to the individuals being assessed and their pre-
vious responses. We refer to this adaptive way of asking questions as a dynamic survey. With the proposed
software agents, each person in an organization could
be individually assessed at minimal cost. A limited
and targeted number of questions must be asked to
maintain respondent engagement. In this context, the
agent strategically determines the sequence of ques-
tions. By optimizing this sequence, the goal is to bet-
ter assess individuals based on the survey objectives,
with the number of questions comparable to that of a
typical fixed survey.
We summarize our contributions below:
1. The first deep reinforcement learning (DRL) method that mimics HR professionals to drive questionnaires in real time.
2. Improved methods for detecting and subtracting bias in responses (due to users over- or under-rating questions over time) using time series techniques (Xia et al., 2015).
3. A method of augmenting existing datasets (which
are usually small) to create large synthetic
datasets that mimic the original datasets.
4. We make our work available to industry
and academia as an open-source framework
called RLHR (Reinforcement Learning Hu-
man Resources) at https://github.com/unibuc-cs/
AIForProfilingHumans.
2 RELATED WORK
The work that comes closest to ours is (Paduraru et al., 2024), which has similar goals but different
methods. We compare their methods with ours using
a new, larger anonymized dataset. The methods pro-
posed in our current work have several technical fea-
tures to improve the state of the art. First, we found
that the PathfindingAI method in (Paduraru et al., 2024) suffers from biased selection (Wang and Singh,
2021) as it always selects the closest possible clus-
ter (with limited random explorations). To improve
the results, in this work we use a deep reinforcement
learning method (Mnih et al., 2016) by adding bet-
ter explorations, latent encapsulation of individuals
by deep neural networks, and temporal information
understanding with gated recurrent unit (GRU) neural
networks (Cho et al., 2014). We also improve the con-
stant type of bias removal from the original method
with time series (Benvenuto et al., 2020), which leads
to better results. The available tools for post-survey
analysis have also been improved. We are also ad-
dressing the issue of creating synthetic data that ap-
proximates real human profiles so that clients can
evaluate and customize their survey definitions before
sending them to people.
Profiling people for content recommendations,
such as news recommendations, is a long-standing
practice (Mannens et al., 2013). Automatic detec-
tion of fraudulent profiles on social media platforms
such as Instagram and Twitter is another common
application for the creation of people profiles using
data mining and clustering techniques (Khaled et al.,
2018). In (Ni et al., 2017), social media data ex-
tracted from WeChat (WeChat.com) is used to create individual
profiles and group them based on their occupational
field, using similar NLP techniques to those previ-
ously mentioned. The research in (Schermer, 2011)
discusses the use of data mining in automated pro-
filing processes, with a focus on ethics and potential
discrimination. Use cases include security services
or internal organizations that create profiles to assess
various characteristics of their employees. Profiling
and grouping individuals using data mining and NLP
techniques to extract information from text data is a
common topic in the literature. In (Wibawa et al.,
2022), the authors use AI methods such as traditional
NLP to process application documents for job open-
ings, which enables automatic filtering, evaluation
and prioritization of candidates.
3 SURVEY SETUP
3.1 Survey Formalization
Our aim is to present a survey in as generalized a form
as possible. In doing so, we rely on our experience
with the clients of vorteXplore and on the experience
we have gained with later versions of the framework.
The proposed high-level presentation method consists
of a limited number of questions (configurable on the
client side, on average between 15-25) that are either
general in nature or related to an asset shown in the
form of an image, video or extracted text (e.g. arti-
cles, SMS messages, emails, etc.). The formal specifications and components of a survey are explained below.
Groups. Every asset and question that is asked is part
of a group. Examples of groups from the use case:
Awareness, Prevalence, Sanction, Inspiration, Fac-
tual, Sensitivity. In our experience, this has proven to
be very useful for characterizing people from multi-
ple perspectives and organizing assets and questions.
It also has implications for reusability and makes it
easier to maintain the dataset.
Assets. A collection of assets representing video files, media posts, SMS, etc. Asset indices also have an optional dependency specification, i.e., the client can specify that an asset should depend on a previously displayed set of other assets: Deps(A_i) = {A_j}, j ∈ 1..|Assets|. For example, a video or image asset may only make sense after a sequence of previous assets.
Question. The set of textual questions is denoted by Q. Each element Q_i ∈ Q has two categories of properties:

a) Structural properties.
The set of assets that are compatible with this question: Compat(Q_i) = {A_j ∈ Assets}. The idea of compatibility is that some questions make sense for every type of asset shown, while others do not, e.g., questions that only apply to video-based assets with a concrete action demonstration.
Dependencies on previous questions. Internally, the dependencies between questions take the form of a directed acyclic graph, where each node Q_i has a set of dependencies Deps(Q_i) = {Q_j}. This set represents a restriction that Q_i can only be asked as a follow-up to a previously asked question Q_k ∈ Deps(Q_i).
b) Scoring properties.
Attributes. For the use case of IB (inappropriate behavior) recognition, some examples are: Team interaction, Offensive language, Rumors, Personal boundaries, Leadership style (the full list can be found in a table in our repository). These are customizable in the framework, are usually set by the organizations prior to the surveys, and are not visible to the respondents. Generally, the client organization strategically uses these inherent characteristics to gain the insights they are looking for in the post-survey analysis. The Attr set represents the collection of attributes used by an application. For each question Q_i, a vector of all attributes ordered by indices is given, representing the relative importance of each attribute to the question.

Figure 1: Example of the RLHR agent first selecting an asset and then asking a series of related questions, taking into account constraint dependencies until the end of the interview. For each asset, the typical number of questions asked by clients ranged from 1-5, and the total number of questions was 15-25.

The value range is [0, 1], where
0 means that the respective attribute is insignificant for the question, while a value of 1 represents a strong correlation between the attribute and the question. Formally, each question is given a specification vector (by the client): At(Q_i) = {At_1, At_2, ..., At_NAt}, where NAt = |Attr|. A getter function, Imp(Q_i, At_k), is used later in the paper to determine the degree of importance of each attribute in relation to the question.
Importance of the questions. The function Imp(Q_i) is used to determine the relevance of the question for the survey from the client's point of view. The values are floating point numbers in the range [0, 1], where 0 means no relevance, while 1 represents a high interest in the user's responses to the question.
Baseline and tolerance values. Base(Q_i) represents the response to this question that the organization expects from respondents by default. Tol(Q_i) denotes the accepted deviation in the responses.
Amb(Q_i): each question has an ambiguity factor. This is not set from the beginning (no one would intentionally create ambiguous questions); it is regressed from the post-survey results or participant feedback. Rather than changing the entire survey structure, the client can increase this factor to remove potential ambiguity and mitigate the weight of divergent responses. The value range is [0, 1], where 0 means no ambiguity, while 1 is the maximum.
Internally, the answers to the questions are mapped to floating point numbers in the range [1, 7], whether they come in binary format, as point values within a range, etc. The same range of values is used for baselines and tolerances.
Profiles Specification. The autonomous survey aims
to categorize a person into a specific profile that cor-
responds as closely as possible to an HR professional
who would interview the person face-to-face. Intu-
itively, people’s responses to a survey’s assets and
questions according to the factors mentioned above
(i.e. baselines and tolerance from the client’s perspec-
tive, importance of questions and ambiguities) con-
tribute to the aggregate score for each attribute. These scores are then used to calculate the match with each of the defined profiles.
A profile is specified as a multivariate Gaussian distribution (Gutiérrez et al., 2023) with a total of NAt (number of attributes) dimensions, i.e., one for each attribute. The set of all profiles is denoted by Profiles. The reasons for using this type of distribution are explained below, while the technical details can be found in Section 4:
It enables the natural modeling of an individual by
the properties hidden in the question. Intuitively,
the HR defines the mean, the µ vector, as the ex-
pected values of the deviations for the observed
inherent attributes for each of the profiles.
The covariance matrix, Σ, can be used to indicate both the tolerance (variance) of these attributes for each of the profiles and the correlations between the attributes. Of course, some attributes are correlated with each other and cannot be treated separately. At the beginning of a project with new attributes and no previous data set, the client has no information about correlations, so a diagonal matrix Σ is used. However, after data has been collected, as in our use case, the RLHR framework has tools to calculate the correlation between the attributes based on the Pearson correlation (Benesty et al., 2008), as sketched below.
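To make this workflow concrete, the following Python sketch shows how such a profile definition could be represented and how the covariance could later be re-estimated from collected data; the helper names and the use of NumPy are illustrative assumptions, not the exact API of the released framework.

import numpy as np

def initial_profile(mu, tolerances):
    """New project, no data yet: diagonal covariance built from per-attribute tolerances."""
    mu = np.asarray(mu, dtype=float)                       # expected attribute deviations
    sigma = np.diag(np.asarray(tolerances, dtype=float) ** 2)
    return mu, sigma                                       # parameters of N(mu, sigma)

def covariance_from_data(attribute_samples):
    """After responses are collected: estimate the full covariance and the Pearson
    correlations between attributes from a (num_respondents, NAt) score matrix."""
    samples = np.asarray(attribute_samples, dtype=float)
    return np.cov(samples, rowvar=False), np.corrcoef(samples, rowvar=False)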
4 METHODS FOR EVALUATING
SURVEYS
This section introduces the common evaluation func-
tions and the internal accounting of the statistics to
prepare the inputs for the RLHR agent discussed in
Section 5.
4.1 Deviations
The root of the scoring process is the calculation of the deviations for each question. After each question Q_i in the survey, the user responds with a numerical sliding value in the range [1, 7] (Section 3.1), denoted by R(Q_i). As shown in Eq. (1), this value is compared with the base value and the tolerance value. Then, the importance (or severity) of the question, in the range [0, 1], is factored in to mitigate deviations from questions that are irrelevant to the final classification from the client's perspective. Finally, the reported and agreed ambiguities, in the range [0, 1], are added to the equation to mitigate questions that have been found to be ambiguous: depending on the degree of ambiguity, the deviations become inversely proportionally less important.
D(Q_i) = ( |R(Q_i) − Base(Q_i)| / Tol(Q_i) )^2 × (1 + Amb(Q_i))^{−1} × Imp(Q_i)    (1)
4.2 Removing Anchors and Over- or Under-Scoring of Questions
Numerous biases can manifest themselves in a survey (Yan et al., 2018). The most common are anchors (influences or connections to a previously asked question) and the consistent over- or under-rating of answers. Identifying these is needed to obtain accurate statistics at the team and organizational level; otherwise, the RLHR agent might misinterpret the situation. Figure 2 illustrates this behavior. The method is to find patterns in the deviations, either over the entire survey or in short, consecutive sequences (Dee, 2006).

While the RLHR agent is conducting a survey, at each step K it has access to the answers to the questions asked in steps [1...K−1]. To detect a possible bias or anchor, our method looks for an initial position S in the range of steps such that a model can be fitted as a predictor of biases for the range [S..K−1]. The model is an auto-regressive integrated moving average (ARIMA) (Benvenuto et al., 2020).
Figure 2: An example of user responses during a survey
and deviation values calculated with Eq. (1). At the begin-
ning, for the first three questions, there is no trend in the
deviations. From step t onwards, however, it can be seen
that the deviations gradually equalize, which means that the
user could be over- or under-rating responses provided for
a number of steps.
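The following sketch illustrates one way this bias-detection step could be realized with the ARIMA implementation from statsmodels; the window search, the (1, 1, 0) order and the residual threshold are assumptions, since the paper does not fix these details.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def detect_bias_window(deviations, min_len=5):
    """Search for the earliest start index S such that an ARIMA model fits the
    deviation series in [S..K-1] well, which suggests a systematic over- or
    under-scoring trend. Returns (S, fitted_model) or (None, None)."""
    k = len(deviations)
    for s in range(0, k - min_len + 1):
        window = np.asarray(deviations[s:k], dtype=float)
        try:
            fit = ARIMA(window, order=(1, 1, 0)).fit()   # model order is an assumption
        except Exception:
            continue
        # Accept the window when the in-sample residuals are small relative to the spread
        if np.std(fit.resid) < 0.25 * (np.std(window) + 1e-8):
            return s, fit
    return None, None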
4.3 Scores Feature Vector
An important feature that the RLHR agent uses to categorize a user U into one of the defined profiles is the aggregated score of each inherent attribute in the set Attr in relation to the questions asked and their answers. Assume that a survey is in progress and there are already t pairs of questions and answers, for which both types of deviations can be computed using Eq. (1).

The set of inherent attributes and their scores represent the features of U used by the RLHR agent for profile classification at each step. The calculation of these scores is shown in Eq. (2), where the final result Sc_t(U, At_k) represents the score for the feature (attribute) At_k ∈ Attr of the user U during the survey, after t questions have been asked. We further denote by Sc^nb_t(U, At_k) the same score function without bias, i.e., using D^nb(Q_i) instead of D(Q_i). The idea behind the calculation is that, for each attribute At_k, it iterates over all questions Q_i asked so far and aggregates their contributions to At_k (as an average), using the deviations and the importance of the questions, Imp(Q_i, At_k) (Section 3.1). For simplicity, we denote the vectorized scores of U at time t by Sc_t(U) ∈ R^NAt.
Sc_t(U, At_k) = ( Σ_{i=1}^{t} Imp(Q_i, At_k) × D(Q_i) ) / ( Σ_{i=1}^{t} 1[Imp(Q_i, At_k) > 0] )    (2)
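Eq. (2) reduces to a masked weighted average; a NumPy sketch follows, where the array shapes are assumptions.

import numpy as np

def attribute_scores(imp_matrix: np.ndarray, deviations: np.ndarray) -> np.ndarray:
    """Eq. (2): aggregate the deviations of the t questions asked so far into one
    score per attribute.

    imp_matrix : (t, NAt) importance Imp(Q_i, At_k) of each attribute for each question
    deviations : (t,)     D(Q_i) for each asked question
    returns    : (NAt,)   the feature vector Sc_t(U)
    """
    weighted = imp_matrix * deviations[:, None]           # Imp(Q_i, At_k) * D(Q_i)
    counts = (imp_matrix > 0).sum(axis=0)                 # questions that touch At_k
    return weighted.sum(axis=0) / np.maximum(counts, 1)   # avoid division by zero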
As mentioned in Section 3.1, the profiles are defined using multivariate Gaussian distributions (Gutiérrez et al., 2023) around the set of inherent attributes, by the expected mean and covariance (the tolerance of each attribute and the predicted correlations between them), Eq. (3). We denote the number of profiles by NumPrf = |Profiles|.
PrfDef_k = N(µ_k, Σ_k),  µ_k ∈ R^NAt,  Σ_k ∈ R^{NAt×NAt},  ∀ PrfDef_k ∈ Profiles    (3)
To determine the probability that U is part of each profile at time t, given the current score vector Sc_t(U), the deviation scores calculated above are passed to the standard probability density function, as shown in Eq. (4).
P^U_t(k) = P^U_t(PrfDef_k | Sc_t(U)) = p(Sc_t(U); µ_k, Σ_k) = 1 / ( (2π)^{NAt/2} |Σ_k|^{1/2} ) × exp( −(1/2) (x − µ_k)^T Σ_k^{−1} (x − µ_k) ),  with x = Sc_t(U),  ∀ PrfDef_k ∈ Profiles    (4)
The predicted profile index at time step t for the user U results from selecting the maximum over these values, Eq. (5).

Prf^pred_t(U) = argmax_{k ∈ [1...NumPrf]} P^U_t(PrfDef_k | Sc_t(U))    (5)
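Both Eq. (4) and Eq. (5) map directly onto SciPy's multivariate normal density; a sketch, assuming profiles are stored as (mean, covariance) pairs.

import numpy as np
from scipy.stats import multivariate_normal

def profile_probabilities(scores: np.ndarray, profiles) -> np.ndarray:
    """Eq. (4): evaluate each profile's Gaussian density at the score vector Sc_t(U).
    profiles: list of (mu, sigma) pairs with mu in R^NAt and sigma in R^{NAt x NAt}."""
    return np.array([multivariate_normal.pdf(scores, mean=mu, cov=sigma)
                     for mu, sigma in profiles])

def predicted_profile(scores: np.ndarray, profiles) -> int:
    """Eq. (5): index of the most likely profile given the current scores."""
    return int(np.argmax(profile_probabilities(scores, profiles)))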
5 THE RLHR AGENT
The goal of the RLHR agent is to autonomously con-
trol the survey process, adapt to the content requested
by the respondent, and provide a distribution of scores
across profiles that matches the ground truth profile as
closely as possible. The general ideas for applying the RL methodology and its components to our objectives are detailed in this section and outlined in Figure 3.
5.1 Synthetic Environments and Dataset
The environment represents the world in which the
RLHR agent performs actions and receives feedback
through partial observations and rewards. We have
used the OpenAI Gym (Towers et al., 2023) interfaces
and principles (more specifically, the updated Gym-
nasium library) so that our framework can be further
used for experiments in the community.
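Because the framework follows the Gymnasium interface, the environment boils down to the usual reset/step pair; the skeleton below is only a placeholder (the real observation and action spaces of RLHR encode the survey history and constraints described in Section 5.2).

import gymnasium as gym
import numpy as np
from gymnasium import spaces

class SurveyEnv(gym.Env):
    """Skeleton survey environment: an action picks the next asset or question,
    the observation exposes the running attribute scores of the virtual user."""

    def __init__(self, num_actions: int, num_attributes: int):
        super().__init__()
        self.action_space = spaces.Discrete(num_actions)
        self.observation_space = spaces.Box(low=0.0, high=1.0,
                                            shape=(num_attributes,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        obs = np.zeros(self.observation_space.shape, dtype=np.float32)
        return obs, {}                                   # observation, info

    def step(self, action):
        # A real implementation would simulate the user's response here (Eqs. (8)-(9)),
        # update the scores (Eq. (2)) and compute the reward (Eq. (19)).
        obs = np.zeros(self.observation_space.shape, dtype=np.float32)
        return obs, 0.0, False, False, {}                # obs, reward, terminated, truncated, info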
Setting up Virtual Users. With defined profiles, synthetic data can be created based on sampling methods even before any real data is collected. Specifically, N examples of virtual users, VUsers, can be created, with each U ∈ VUsers following a two-step process (a code sketch follows after this list):

1. Select a ground truth profile for U by drawing a uniform sample from the available set of profiles. Note that this is hidden from the observation of the RLHR agent and is only used by the background evaluation mechanisms when interacting with the environment.

Prf_gt(U) = Uniform[1, NumPrf]    (6)
2. Sample a vector of inherent (ground truth) attributes, knowing the ground truth profile and its base distribution parameters from Eq. (3).

At_gt(U) ∼ N(µ_gt, Σ_gt)    (7)
If accurate data is available from HR experts, anno-
tated data for points 1. and 2. can be added to the
database.
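The two sampling steps in Eqs. (6)-(7) amount to a uniform draw followed by a multivariate normal draw; a sketch with NumPy, where clipping the sampled attributes to [0, 1] is an assumption.

import numpy as np

def sample_virtual_user(profiles, rng=None):
    """Eq. (6): pick a hidden ground-truth profile uniformly at random.
    Eq. (7): sample the user's inherent attributes from that profile's Gaussian.
    profiles: list of (mu, sigma) pairs defining each profile."""
    rng = rng or np.random.default_rng()
    gt_index = int(rng.integers(len(profiles)))          # Prf_gt(U), hidden from the agent
    mu, sigma = profiles[gt_index]
    attributes = rng.multivariate_normal(mu, sigma)      # At_gt(U)
    return gt_index, np.clip(attributes, 0.0, 1.0)       # clipping is an assumption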
Simulation of Responses from Virtual Users. When the RLHR agent asks the environment for an answer from the surveyed user U to a question Q_i, the value of the answer must be correlated with: (a) the inherent personality attributes, At_gt(U), and (b) the importance of the attributes in the question, At(Q_i). This correlation can be obtained with a dot product between the two, Eq. (8), which gives the normalized deviation value for Q_i in the range [0, 1]. It must then be converted to the client range (in our use case, for example, the range [1, 7] is used, Section 3.1).

D(Q_i) = remap( At_gt(U) · At(Q_i) )    (8)
Finally, we substitute D(Q_i) into Eq. (1) to determine the response value R(Q_i). This results in the form shown in Eq. (9).

R(Q_i) = Base(Q_i) + Tol(Q_i) × ( D(Q_i) × Imp^{−1}(Q_i) × (1 + Amb(Q_i)) )^{1/2}    (9)
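A sketch of the response simulation in Eqs. (8)-(9); the attribute-weight field and the normalization inside remap are assumptions, while the inversion of Eq. (1) and the client range [1, 7] follow the text.

import numpy as np
from dataclasses import dataclass

@dataclass
class SimQuestion:
    base: float               # Base(Q_i) in [1, 7]
    tol: float                # Tol(Q_i)
    imp: float                # Imp(Q_i) in (0, 1]
    amb: float                # Amb(Q_i) in [0, 1]
    attr_weights: np.ndarray  # At(Q_i), one weight per attribute in [0, 1]

def remap(x: float, scale: float) -> float:
    """Normalize the dot product into [0, 1]; the exact normalization is an assumption."""
    return float(np.clip(x / scale, 0.0, 1.0))

def simulate_response(q: SimQuestion, user_attributes: np.ndarray) -> float:
    d = remap(float(np.dot(user_attributes, q.attr_weights)), len(q.attr_weights))  # Eq. (8)
    response = q.base + q.tol * np.sqrt(d * (1.0 + q.amb) / q.imp)                  # Eq. (9)
    return float(np.clip(response, 1.0, 7.0))                                       # client range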
5.2 Episodes, Actions and Observations
An in-progress survey of a user U is represented as a trajectory, τ, following the policy-based reinforcement learning algorithms (Sutton and Barto, 2018). In our case, an episode is the same as a trajectory from the beginning of a survey to its end.
At any time t in a survey, the state includes all assets displayed and the t questions asked. As shown in Figure 1, at each step (or action in RL terminology), the agent must either select a new asset to show or a follow-up question based on the currently presented asset. Suppose that t questions have been asked across NG − 1 completed groups of assets and associated questions, and the RLHR agent is deciding which asset or question to show for group NG. We denote the asset shown in group k by A_k, the i-th follow-up question by Q^k_i, and the total number of questions asked in group k by N(G_k). Eq. (10) shows closed groups (indexed by k). Similarly, Eq. (11) defines an ongoing group that must select the next question i + 1, while an empty group means that the next action of the RLHR agent should be to select an asset first, Eq. (12). Finally, Eq. (13) shows the formalized relationship between the steps (actions) and the parameter t.

Figure 3: Relationship between the RLHR agent (left), the environment (center) and the virtual user being interviewed. The agent sends actions to the environment, asking to display a new asset or ask a new question given the current state. In return, the environment simulates a response that correlates with the user's ground truth profile. Each response updates the inherent attributes. The environment sends feedback as a reward for the last action performed and the new partial observation of the user, which models the agent's belief about the user's inherent attributes. The dashed lines represent the updates made internally.
G_k = ⟨ A_k, Q^k_1, ..., Q^k_{N(G_k)} ⟩    (10)

G^{i+1}_k = ⟨ A_k, Q^k_1, ..., Q^k_i, Q^k_{i+1} = ? ⟩    (11)

G^∅_k = ⟨ A_k = ? ⟩    (12)

t = Σ_{k=1}^{NG−1} N(G_k)    (13)
The trajectory for a running survey is shown in Eq. (14). It is parameterized by three values: (a) t, the total number of questions asked so far; (b) NG, the index of the current group; and (c) k, the number of questions asked so far in group NG, which can be ∅ if no question has been asked yet, i.e., if an asset is expected. In order not to overcomplicate the equations, we omit the typical (state, action, reward) tuple at each step and keep only the state and the action to be performed next (marked with a question mark). The actions are formally discussed in Section 5.2, while the rewards are computed after each action and defined in Section 5.3.

τ_{(t,NG,k)} = ⟨ G_1, ..., G_{NG−1}, G_NG = G^{k+1}_{NG} or G^∅_{NG} ⟩    (14)
At each step during the survey of a user U at time t, the observation of the RLHR agent returned by the environment, O^U_t, is composed of two components:

(a) the trajectory τ, which consists of the history of groups, i.e., the assets shown and the questions asked;

(b) the score of the user's attributes after each action, Sc_t(U), which is calculated as in Eq. (2).

The state of the agent is given by Eq. (15). It includes the observation, the set of valid questions VQ_t, and the set of valid assets VA_t at time t, given the course of the survey and the contextual dependencies.

S^U_t = ⟨ τ_{(t,..)}, Sc_t(U), {VQ_t, VA_t} ⟩    (15)
Actions and Environment Constraints. There are also hard constraints that must be fulfilled along the trajectory (or episode) in relation to the actions (a sketch of the resulting valid-action computation follows Eq. (16)):

1. In the first step, an asset must be shown.

2. If at any time t the RLHR agent decides to ask a new question Q_new, it must comply with two main rules. First, it must satisfy the dependencies on the previously asked question (or have no dependencies at all), i.e., Deps(Q_new) = ∅ or Q_t ∈ Deps(Q_new). In addition, the new question must be compatible with the current asset, i.e., A(t) ∈ Compat(Q_new).

3. A maximum number of follow-up questions can be asked about the currently presented asset, represented by the parameter MaxQPerAsset (in our example, MaxQPerAsset = 5). Once this threshold is reached, a hard constraint to show a new asset is added to the RLHR agent's observation. Note that the agent can switch to a new asset even if this threshold is not reached.

4. When a new asset is shown, it must satisfy the dependencies on previous assets, similar to questions.

5. The episode ends when: (a) the number of steps reaches a threshold MaxSteps (in our example, MaxSteps = 30, intuitively set for a maximum of 25 questions and five or more assets), or (b) there is no remaining question or asset that can be shown while satisfying the dependencies and structural requirements (e.g., an asset must be shown, but none is left that satisfies the dependencies). Note in this context that the number of questions may vary between surveys depending on the user's choices and answers. We consider this natural human behavior.

6. To handle the case of general questions where no asset needs to be shown, we consider a special NULL asset that displays nothing; the questions that follow it are general questions.

7. The minimum number of questions in a group is 1.
Eq. (16) formalizes the action that the RLHR agent can take if a group NG is in progress in the current trajectory and k (possibly ∅) questions have already been asked (Eq. (14)). The possible actions are: (a) displaying a new asset when the agent decides, or is forced, to start the next group NG + 1, and (b) asking a new question k + 1 in the current group.

Act^k_NG ∈ { A^new_{NG or NG+1}, Q^new_NG }    (16)
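The sketch below shows how the constraints above can be turned into the action set of Eq. (16); the survey-state fields (shown assets, asked questions, last question) are illustrative names, not the framework's actual data model.

def valid_actions(state, questions, assets, max_q_per_asset=5):
    """Return the assets and questions the agent may legally pick next.
    If the returned list is empty, the episode ends (constraint 5b)."""
    new_assets = [a for a in assets
                  if a not in state.shown_assets                         # not shown yet
                  and all(dep in state.shown_assets for dep in a.deps)]  # constraint 4
    if state.current_asset is None:
        return new_assets                                                # constraint 1
    if state.questions_in_group >= max_q_per_asset:
        return new_assets                                                # constraint 3
    follow_ups = [q for q in questions
                  if q not in state.asked_questions
                  and state.current_asset in q.compat                    # constraint 2 (compatibility)
                  and (not q.deps or state.last_question in q.deps)]     # constraint 2 (dependencies)
    return follow_ups + new_assets   # Eq. (16): ask Q_new or open a new group with A_new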
5.3 Rewards
The aim of the RLHR agent is to drive the survey using the actions defined in Eq. (16) so that, at the end, the user U is classified as close as possible to their known ground truth profile Prf_gt(U) (Eq. (6)).

As in Eq. (4), at any time t during a survey, the probability that a user U belongs to a profile k is given by the values of the inherent attributes, Sc_t(U). In this representation, the main idea is to display assets and ask corresponding questions that drive the attribute scores towards the correct classification.
With this in mind, the system models the reward function at time t with two main components:

(a) OverallScore. The agent is penalized for having attribute scores that do not yet approach the ones defined by the ground truth, Eq. (17). Intuitively, the maximum of this component is 0 when the inherent attributes have scores close to the predefined mean of the ground truth, µ_gt, taking into account the associated covariances Σ_gt.

OverallSc = P^U_t(gt) − 1.0    (17)

(b) VelocityScore. The agent is penalized for not performing an action that moves the classification in the right direction. As shown in Eq. (18), the idea is to measure the velocity of the last action in relation to the classification probability of the ground truth.

VelSc = P^U_t(gt) − P^U_{t−1}(gt) if t > 1, and 0 otherwise    (18)
Eq. (19) shows the final reward function after t questions have been asked, with the same relation to the current group NG and the number of questions k asked in NG as in Eq. (13). The two components defined above are averaged with configurable weights. In our use case, we set the weight of the overall component to W_ov = 0.8 and the weight of the velocity component to W_vel = 0.2.

Reward_t = OverallSc × W_ov + VelSc × W_vel    (19)
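A direct transcription of Eqs. (17)-(19) in Python; the probabilities come from Eq. (4), and the weights are the ones used in our use case.

def reward(prob_gt_now: float, prob_gt_prev: float, step: int,
           w_ov: float = 0.8, w_vel: float = 0.2) -> float:
    overall = prob_gt_now - 1.0                                   # Eq. (17): distance to ground truth
    velocity = (prob_gt_now - prob_gt_prev) if step > 1 else 0.0  # Eq. (18): progress of the last action
    return overall * w_ov + velocity * w_vel                      # Eq. (19)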
After some evaluation, we decided to use an advantage actor-critic method, more precisely A2C (Mnih et al., 2016), from the class of policy-based methods.
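For completeness, this is how such an agent could be trained with the A2C implementation from Stable-Baselines3 on a Gymnasium environment like the skeleton from Section 5.1; the paper does not state which implementation or policy architecture is used (it mentions GRU-based temporal encoding), so the MLP policy below is only an illustrative stand-in.

from stable_baselines3 import A2C
# SurveyEnv is the illustrative Gymnasium skeleton from Section 5.1

env = SurveyEnv(num_actions=64, num_attributes=12)
model = A2C("MlpPolicy", env, learning_rate=7e-4, verbose=1)
model.learn(total_timesteps=200_000)

obs, _ = env.reset()
action, _ = model.predict(obs, deterministic=True)   # the next asset/question to show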
Table 1: Comparative results between HR professionals, PathfindingAI (Paduraru et al., 2024), and our proposed method RLHR. Accuracy 1st indicates how many predictions of the person's profiles match the HR assessment, which is considered the ground truth. Accuracy 2nd for the two methods indicates how many of the incorrect predictions were placed at the 2nd position in the output probability distributions. The last column shows the average error between the probability assigned to the ground truth profile and the probability of the predicted profile.

Evaluation method   Accuracy 1st   Accuracy 2nd   Avg. Error 1st to 2nd
HR                  100% (69)      0              0
PathfindingAI       62.3% (43)     6.9% (10)      0.221
RLHR                74% (51)       21.7% (15)     0.127
6 EVALUATION
The framework is evaluated from several perspec-
tives. First, quantitative and qualitative assessments
are presented to understand the ease of use from the
user’s perspective and the credibility of the methods.
Then, the computational effort required to conduct large-scale surveys and retrain the RLHR agent is presented
to understand the practical usability. Finally, this sec-
tion presents post-survey analysis tools and lessons
learned from prototype development and previous ef-
forts.
Setup and Quantitative Evaluation. First, we evaluate the correctness of the methods proposed in this work by comparing them with an evaluation performed in parallel by HR experts and by the PathfindingAI algorithm of (Paduraru et al., 2024).
A sample of 69 people was selected by HR pro-
fessionals and interviewed in a similar way to that
described in the study, but face-to-face. After six
months, with no major post-survey interventions or
actions, we assessed the same individuals using the
proposed RLHR agent. Note that the dataset of as-
sets and questions used by the HR and RLHR agents
matched, but the questions and assets that were orig-
inally asked were replaced to avoid any bias. There
were a total of 1498 responses to the questions. The
results of the observed comparison follow:
Table 1 shows the results obtained by comparing
the supposed ground truth assessment of HR profes-
sionals in the client organization with the Pathfindin-
gAI and RLHR agents. The key observation is that the
RLHR agent implemented in our proposed framework
performs better than the state-of-the-art Pathfindin-
gAI method. Moreover, in many cases, the RLHR
agent successfully placed the ground truth profile at the 2nd position in the output probability distribution for the cases it missed at the 1st position. It left only three out of 69 classified individuals at the 3rd and 4th positions, compared to PathfindingAI, which left 16 individuals. Furthermore, the error of RLHR is significantly lower for the misclassified examples, i.e., the probability assigned to the ground truth profile is close to that of the predicted profile.
The method of removing bias based on time series improved the final results, as shown in Table 1. More specifically, compared to the previous method for identifying constant bias in (Paduraru et al., 2024), the new method improved Accuracy 1st from 48 to 51 correctly predicted individuals, while Accuracy 2nd increased from 12 to 15.
7 CONCLUSIONS
The purpose of the RLHR framework is not to re-
place the experts in the HR departments of compa-
nies. Its main purpose is to create another layer be-
tween individuals and HR departments. The inter-
mediate layer we propose would improve the HR department's survey processes and interventions and focus the available resources where they are most needed.
ACKNOWLEDGEMENTS
This research was supported by the European Union's Horizon Europe research and innovation programme
under grant agreement no. 101070455, project DYN-
ABIC.
REFERENCES
Benesty, J., Chen, J., and Huang, Y. (2008). On the importance of the Pearson correlation coefficient in noise reduction. IEEE Transactions on Audio, Speech, and Language Processing, 16(4):757–765.
Benvenuto, D. et al. (2020). Application of the ARIMA model on the COVID-2019 epidemic dataset. Data in Brief, 29:105340.
Cho, K. et al. (2014). Learning phrase representations using
RNN encoder–decoder for statistical machine trans-
lation. In Proceedings of the 2014 Conference on
Empirical Methods in Natural Language Processing
(EMNLP), pages 1724–1734, Doha, Qatar. Associa-
tion for Computational Linguistics.
Dee, D. (2006). Bias and data assimilation. In Proceedings of the ECMWF Workshop on Bias Estimation and Correction in Data Assimilation, pages 1–20.
Gutiérrez, F. et al. (2023). Differentiating abnormal, normal, and ideal personality profiles in multidimensional spaces. Journal of Individual Differences.
Khaled, S. et al. (2018). Detecting fake accounts on social
media. In IEEE International Conference on Big Data
(Big Data), pages 3672–3681.
Mannens, E. et al. (2013). Automatic news recommenda-
tions via aggregated profiling. Multimedia Tools and
Applications - MTA, 63.
Mnih, V. et al. (2016). Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML).
Ni, X. et al. (2017). Behavioral profiling for employees using social media: A case study based on WeChat. In Chinese Automation Congress (CAC), pages 7725–7730.
Paduraru, C., Cristea, R., and Stefanescu, A. (2024). Adaptive questionnaire design using AI agents for people profiling. In Proceedings of the 16th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART, pages 633–640.
Schermer, B. W. (2011). The limits of privacy in automated
profiling and data mining. Computer Law and Secu-
rity Review, 27(1):45–52.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learn-
ing: An Introduction. A Bradford Book, Cambridge,
MA, USA.
Towers, M. et al. (2023). Gymnasium: An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym). https://gymnasium.farama.org/.
Wang, Y. and Singh, L. (2021). Analyzing the impact of
missing values and selection bias on fairness. In-
ternational Journal of Data Science and Analytics,
12(2):101–119.
Wibawa, A. D. et al. (2022). Text mining for employee can-
didates automatic profiling based on application docu-
ments. EMITTER International Journal of Engineer-
ing Technology, 10:47–62.
Xia, P., Zhang, L., and Li, F. (2015). Learning similarity with cosine similarity ensemble. Information Sciences, 307:39–52.
Yan, T., Keusch, F., and He, L. (2018). The impact of
question and scale characteristics on scale direction
effects. Survey Practice.