According to (Bernstein et al., 2010), testers expect certain elements of real-life communication to be represented in the test. Among oral proficiency tests, the interview test meets this expectation to the greatest degree. The interview test is conducted dynamically, in that the tester does not always pose questions of the same difficulty.
In this paper, we propose a method for deciding the proper difficulty of questions in an interview test according to the second-language speaking skill of each individual test taker. The method is built upon the large-scale pre-trained language model BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019). BERT was proposed for natural language processing and has proven effective in sentiment analysis, question answering, document summarization, and other tasks. This motivates us to apply such a model to the difficulty decision task.
Our method uses the conversational context together with additional information on the appropriateness of the test takers' responses. We validated the method in experiments on a simulated foreign-language oral interview test dataset, where it outperformed the baseline models, confirming its validity.
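To make the idea concrete, the following is a minimal sketch of how such a difficulty classifier could be set up with the HuggingFace transformers library. The input format (dialogue turns joined with [SEP] plus an appropriateness tag), the three-level label set, and the function name are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Minimal sketch: BERT as a next-question difficulty classifier.
# The input encoding and label set below are illustrative assumptions.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

DIFFICULTY_LABELS = ["low", "mid", "high"]  # hypothetical label set

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(DIFFICULTY_LABELS)
)

def decide_next_difficulty(context_turns, appropriateness):
    """Predict the difficulty of the next question from the dialogue
    context and an appropriateness rating of the latest response."""
    # Join the conversational context and append the appropriateness
    # information as plain text (one simple way to expose it to BERT).
    text = " [SEP] ".join(context_turns) + f" [SEP] appropriateness: {appropriateness}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return DIFFICULTY_LABELS[int(logits.argmax(dim=-1))]

# Example call (the model is untrained here, so the output is arbitrary):
print(decide_next_difficulty(
    ["Tester: What did you do last weekend?",
     "Taker: I go to park with my friend."],
    appropriateness="partially appropriate",
))
```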
The remainder of this paper is organized as follows: Section 2 reviews recent research on the automation of oral proficiency assessment. Section 3 describes the difficulty decision task and provides insight into the structural design of the proposed model. Section 4 presents a full account of the experimental setting. Finally, we provide a brief summary of our work and discuss our plans for future work.
2 RELATED WORK
Many studies have addressed the automation of oral proficiency assessment with static question difficulty. In these studies, the samples used to judge a test taker's speaking skill were collected as monologues. In (Yoon and Lee, 2019), the authors collected around one minute of spontaneous speech from each test taker, including readings and/or answers given after listening to a passage. They used a Siamese convolutional neural network (Mueller and Thyagarajan, 2016) to model the semantic relationship between key points generated by experts and the test responses, and the network was also used to score each test taker's speaking skill. In (Zechner et al., 2014), the authors collected restricted and semi-restricted speech from test takers. The restricted speech involved reading and repeating a passage. For the semi-restricted speech, the test taker was required to provide the remaining content needed to formulate a complete response corresponding to an image or chart. The authors proposed a method that combines diverse features of speaking proficiency using a linear regression model to predict response scores.
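As a rough illustration of this kind of scoring pipeline, the snippet below fits a linear regression over a few hand-picked proficiency features with scikit-learn; the feature names and toy data are our own assumptions and not the feature set of (Zechner et al., 2014).

```python
# Illustrative only: combine speaking-proficiency features with a
# linear regression to predict a response score. Features and data
# are toy assumptions, not those of (Zechner et al., 2014).
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: [fluency, pronunciation, vocabulary_range] for one response.
X_train = np.array([
    [0.8, 0.7, 0.6],
    [0.4, 0.5, 0.3],
    [0.9, 0.9, 0.8],
    [0.2, 0.3, 0.2],
])
y_train = np.array([3.5, 2.0, 4.0, 1.0])  # human-assigned scores

model = LinearRegression().fit(X_train, y_train)
print(model.predict(np.array([[0.6, 0.6, 0.5]])))  # predicted score
```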
The studies mentioned above follow a strategy of collecting samples manually and then analyzing them with algorithms. Other studies also use machines to deliver the tests and collect the samples. In the Pearson Test of English (Longman, 2012), test takers are asked to repeat sentences, answer short questions, perform sentence builds, and retell passages to a machine, and their responses are analyzed by algorithms (Bernstein et al., 2010). In (de Wet et al., 2009), the authors designed a spoken dialogue system that guides test takers and captures their answers. The system involves reading tasks and repetition tasks. The authors used automatic speech recognition to evaluate test takers' speaking skills, focusing on fluency, pronunciation, and repeat accuracy.
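One plausible way to quantify repeat accuracy from ASR output is a word-level edit distance between the prompt and the recognized repetition, as sketched below; this is a simple illustration of the idea rather than the metric used in (de Wet et al., 2009).

```python
# Illustrative repeat-accuracy metric: 1 - word error rate between the
# prompt and the ASR transcript of the test taker's repetition.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

prompt = "the weather was very nice yesterday"
asr_output = "the weather is very nice"
print(1.0 - word_error_rate(prompt, asr_output))  # repeat accuracy
```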
Monologue-based oral proficiency tests have been used to evaluate test takers' speaking skills and have shown a high degree of correlation with the interview test (Bernstein et al., 2010). However, the automation of interview-based assessment with dynamic question difficulty has not been developed to the extent that automation of the monologue-based assessment has. This paper seeks to fill this gap.
3 METHOD
3.1 Problem Setting
In an interview test, the tester follows the strategy below (sketched in code after the list):
• First, the appropriateness of the responses of the
test taker is estimated.
• Second, based on this appropriateness measure,
the difficulty of the question to be given next is
decided.
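The following sketch shows how this two-step strategy could be wired into a simple interview loop. The functions estimate_appropriateness and select_question, the threshold values, and the three difficulty levels are hypothetical placeholders standing in for the components described in the rest of this section.

```python
# Sketch of the two-step interview strategy: estimate response
# appropriateness, then decide the difficulty of the next question.
# Both components below are hypothetical placeholders.
DIFFICULTY_LEVELS = ["novice", "intermediate", "advanced"]

def estimate_appropriateness(question, response):
    """Placeholder for an appropriateness model; returns a score in [0, 1]."""
    return 0.5  # e.g., output of a trained classifier

def select_question(level):
    """Placeholder for retrieving a question at the requested level."""
    return f"<question at {level} level>"

def next_question(question, response, current_level):
    # Step 1: estimate how appropriate the latest response was.
    score = estimate_appropriateness(question, response)
    # Step 2: move the difficulty up or down based on that estimate.
    if score > 0.7:
        current_level = min(current_level + 1, len(DIFFICULTY_LEVELS) - 1)
    elif score < 0.3:
        current_level = max(current_level - 1, 0)
    return select_question(DIFFICULTY_LEVELS[current_level]), current_level
```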
As noted in (Kasper, 2006), if the test taker finds it difficult to respond to the given question, the tester would change the question as the next action. The questions target a specific oral proficiency level and the functions at that level (ACTFL, 2012). This means that an automated interview test should be able to adjust the difficulty level of its questions during the course of the interview. In this paper, we propose a method to decide the difficulty of the next question posed by the tester. Our method should first estimate the appropriateness of the response of the