ON PRACTICAL ISSUES OF MEASUREMENT DECISION THEORY
An Experimental Study
Jiri Dvorak
Scio, s.r.o., Pobrezni 34, Prague, Czech Republic
Keywords:
Educational Measurement, Test Evaluation, Decision Theory, Simulation, Calibration.
Abstract:
In the field of Educational Measurement, Item Response Theory (IRT) is the dominant test evaluation method. A few years
ago Lawrence Rudner introduced an alternative method, Measurement Decision Theory (MDT), which provides better results
than IRT in some measurement settings; however, it has attracted little interest in the community. In this article we would
like to give MDT some of the focus we believe it deserves. In particular, we concentrate on the practical issues that must be
resolved to implement MDT in the daily practice of Educational Measurement. We first summarize the classification abilities
of MDT. Then, in the main part of the paper, we explain in depth the calibration process, which is a crucial part of any MDT
implementation: a basic calibration process is described together with its characteristics, and, as the main result, an
improvement of this basic process is introduced.
1 INTRODUCTION
The most widely used test evaluation method at this time is Item Response Theory (IRT). IRT provides very good estimates of the ability level of a tested person. Unfortunately, such outcomes are not always what is needed. Many testing problems are pass/fail problems: HR screening, professional certifications, high school or university entrance exams, etc. Other tests have to compare a person's skills to a given standard defining a set of groups (categories/grades) an examinee could belong to (e.g. CEFRL certification, school grades, or state assessments in some countries). Tests of this kind are intended to classify examinees into groups (categories) defined in advance. The purpose of many of today's tests is therefore classification rather than ability estimation. This approach is not new: Cronbach and Gleser in their book (Cronbach and Gleser, 1957) already argued that the ultimate purpose of testing is to arrive at classification decisions.
Rudner (Rudner, 2002; Rudner, 2009; Rudner, 2010) discusses the main features of using IRT to solve classification problems. He argues that since classification is a different (and in many ways simpler) task than ability estimation, and since IRT is fairly complex and relies on several restrictive assumptions, we should look for a more suitable evaluation method designed directly for classification. Rudner then presents a classification-based approach to educational testing named Measurement Decision Theory (MDT). We recall the main principles of MDT in section 2.
Although MDT has been known for about ten years, and its background was discussed as early as the 1970s (Hambleton and Novick, 1973; van der Linden and Mellenbergh, 1978), it remains outside the main focus of the measurement community. In this paper we would like to give MDT some of the attention we think it deserves. We present a brief overview of the efficiency of MDT (section 4) and, above all, a guideline (an application-ready process) for item parameter estimation in section 5. We see the lack of research on both topics as one of the main reasons why MDT is used so rarely.
2 BACKGROUND
(MEASUREMENT DECISION
THEORY)
Measurement Decision Theory (MDT) is a test eval-
uation method intended to classify examinees. MDT
was introduced by Rudner in (Rudner, 2002) and re-
vised in (Rudner, 2009). In his papers Rudner has
shown that MDT is simpler and more efficient in classifying examinees than cut-point based IRT classification.
MDT is essentially the Naive Bayes Classifier (NBC), a well-known classifier from Artificial Intelligence. Classifiers are algorithms intended to classify objects (in our case of Educational Measurement, examinees) according to their attributes (responses to test items) into a pre-defined set of classes/groups/categories. NBC avoids most of the prerequisites of IRT (especially the uni-dimensionality of the tested domain), assuming only local independence of items, an assumption shared by both IRT and CTT.
2.1 Method
2.1.1 Basic Definitions
Def.: Let $M$ be a set of categories and $m_k \in M$ the $k$-th category. Let $U$ be a set of items and $u_i \in U$ the $i$-th item. Let $Z$ be a set of examinees and $z_j \in Z$ the $j$-th examinee.
Def.: Let $P(m_k)$ be the probability that a randomly selected examinee belongs to category $m_k \in M$, and let $\vec{p} = (P(m_1), P(m_2), \ldots, P(m_{|M|}))$.
Def.: Let $P(u_i \mid m_k)$ be the probability of a correct response of an examinee of category $m_k \in M$ to the item $u_i \in U$, and let $\vec{p}_i = (P(u_i \mid m_1), P(u_i \mid m_2), \ldots, P(u_i \mid m_{|M|}))$ be the vector of parameters of item $u_i \in U$ (i.e. the calibration of item $u_i \in U$).
2.1.2 Method Description
Priors. MDT requires us to know in advance the sets $M$ and $U$ and two other priors. The first one is the vector $\vec{p}$ of the distribution of categories in the population and the second one is the set of item parameters $P = \{\vec{p}_i \mid u_i \in U\}$.
Observations. The observations obtained from a test for a single examinee are represented by the vector of his/her responses $\vec{z}_j = (z_{j1}, z_{j2}, \ldots, z_{j|U|})$ where $z_{ji} \in \{0, 1\}$ denotes an incorrect/correct response. Let $R = \{\vec{z}_j \mid z_j \in Z\}$.
Classification. Let us describe the classification process by a function $F : (\vec{p}, P, R) \to \vec{c}$ where $\vec{c} = (c_1, c_2, \ldots, c_{|Z|})$ is the vector of categories $c_j \in M$ such that $c_j = m_k$ if and only if examinee $z_j$ belongs to category $m_k$. The function $F$ can be rewritten as a vector of simpler functions $F(\vec{p}, P, R) = (f(\vec{p}, P, \vec{z}_1), f(\vec{p}, P, \vec{z}_2), \ldots, f(\vec{p}, P, \vec{z}_{|Z|}))$ defined in equation 1 and the subsequent equations 2, 3 and 4.

$f(\vec{p}, P, \vec{z}) = m \in M$ such that $P(m \mid \vec{z}) = \max_{m_k \in M} P(m_k \mid \vec{z})$   (1)

$P(m_k \mid \vec{z}) = n_c \, P(\vec{z} \mid m_k) \, P(m_k)$   (2)

$n_c = \dfrac{1}{\sum_{m_k \in M} P(\vec{z} \mid m_k) \, P(m_k)}$   (3)

$P(\vec{z} \mid m_k) = \prod_{\{i \mid z_i = 1\}} P(u_i \mid m_k) \cdot \prod_{\{i \mid z_i = 0\}} \bigl(1 - P(u_i \mid m_k)\bigr)$   (4)
Note that the function $F$ is an application of Bayes' Theorem. Equation 4 embodies the "naive" assumption of local independence of responses to items. The normalizing constant $n_c$, used in equation 2 and defined in equation 3, ensures that $\sum_{m_k \in M} P(m_k \mid \vec{z}) = 1$.
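To make the classification step concrete, the following is a minimal Python sketch of equations 1-4; the data structures, the category labels and the example numbers are our own illustrative assumptions, not part of the method itself.

def classify(priors, items, responses):
    """Return the category maximizing P(m_k | z) and the full posterior.

    priors    : dict mapping category -> P(m_k)
    items     : list of dicts, items[i][m_k] = P(u_i | m_k)
    responses : list of 0/1 responses, one per item
    """
    posterior = {}
    for category, prior in priors.items():
        likelihood = 1.0                                  # P(z | m_k), equation 4
        for item, z_i in zip(items, responses):
            p = item[category]                            # P(u_i | m_k)
            likelihood *= p if z_i == 1 else (1.0 - p)
        posterior[category] = likelihood * prior          # un-normalized, equation 2
    total = sum(posterior.values())                       # 1 / n_c, equation 3
    posterior = {m: v / total for m, v in posterior.items()}
    return max(posterior, key=posterior.get), posterior   # equation 1


# Hypothetical usage with two categories and two items:
priors = {"fail": 0.5, "pass": 0.5}
items = [{"fail": 0.3, "pass": 0.8}, {"fail": 0.4, "pass": 0.9}]
print(classify(priors, items, [1, 1]))                    # classified as "pass"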
3 METHODOLOGY
Our study is based on the results of experimental applications of MDT performed in a simulated environment. In this section we summarize the essential parts of the simulation.
Def.: Since the simulation often uses randomness, we define a random function $RAN : (S, \vec{v}) \to s \in S$ such that $s$ is selected randomly from $S$ with respect to the probability distribution vector $\vec{v} = (v_1, v_2, \ldots, v_{|S|})$, $\sum_i v_i = 1$.
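As an illustration, $RAN$ amounts to weighted sampling from a finite set; the short Python sketch below is one possible (assumed) realization using the standard library.

import random

def ran(S, v):
    """Pick one element of S according to the distribution vector v (v sums to 1)."""
    return random.choices(list(S), weights=v, k=1)[0]

# e.g. ran([1, 2, 3], [0.2, 0.3, 0.5]) returns 3 roughly half of the time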
3.1 Test and Item Generator
A test is a set of randomly generated items. Therefore, an item generator is an essential part of a simulation engine. Our model, represented by a function $GI : \emptyset \to \vec{p}_i$, is based on two main assumptions. First, it assumes that categories represent sequential grades: for each item $u_i$ and each $m_k$ it holds that $P(u_i \mid m_{k-1}) < P(u_i \mid m_k) < P(u_i \mid m_{k+1})$. Second, it assumes that the items are reasonably good: $\max_k \bigl(P(u_i \mid m_{k+1}) - P(u_i \mid m_k)\bigr) \in \langle 0.2, 0.6 \rangle$. The function $GI()$ generates $\vec{p}_i$ with random elements $P(u_i \mid m_k)$ satisfying these assumptions.
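The paper does not prescribe a concrete construction of $GI$, so the following Python sketch is only one possible realization of the two assumptions; the rejection-sampling approach is our own choice.

import random

def generate_item(m):
    """Generate item parameters P(u_i | m_1) < ... < P(u_i | m_m) whose largest
    step between neighbouring categories lies in [0.2, 0.6] (assumes m >= 2)."""
    assert m >= 2
    while True:
        params = sorted(random.random() for _ in range(m))
        gaps = [b - a for a, b in zip(params, params[1:])]
        if 0.2 <= max(gaps) <= 0.6:
            return params      # params[k] is P(u_i | m_{k+1}) for k = 0 .. m-1

# e.g. generate_item(5) might return [0.08, 0.21, 0.55, 0.61, 0.74]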
4 ACCURACY OF
CLASSIFICATION
Instead of recalling Rudner's experiments comparing IRT and MDT, we focus here on the practical issues of
ONPRACTICALISSUESOFMEASUREMENTDECISIONTHEORY-AnExperimentalStudy
95
MDT. First, we would like to show the relationship between classification accuracy and the number of items or categories, respectively. In both cases we are interested in the theoretical limits of accuracy (given the quality of the items). The classification is performed on the actual item parameters, not on estimated parameters.
The following experiments share a common framework. A single experiment is repeated a few hundred times and then the statistical characteristics of the results of the whole set of experiments are evaluated. Overall results are presented as so-called box-graphs, where around a horizontal line representing the median there is a box showing the first and third quartiles, with whiskers marking the extreme values. Box-graphs thus show both the most likely results (medians) and the stability of the results (quartiles and extremes).
4.1 Experiments
In this section we present a framework common to the experiments reported in section 4.2.
Def.: Let a given number of categories $m$, a number of items $u$ and a number of examinees $z = 200$ define the sets $M$, $U$ and $Z$.
Step 1. Let $P^U = \bigcup_{u_i \in U} GI()$ be the set of actual parameters of the items in $U$; let $\vec{c}^Z$ be the vector of categories the examinees $z_j \in Z$ belong to, with $c^Z_j = RAN(\{1, 2, \ldots, m\}, \vec{p})$ where $\vec{p} = (\frac{1}{m}, \frac{1}{m}, \ldots, \frac{1}{m})$ (note that we assume an equal distribution of $\vec{p}$); and let $R$ be the set of responses with $z_{ji} = RAN\bigl(\{0, 1\}, (1 - P^U(u_i \mid c^Z_j), P^U(u_i \mid c^Z_j))\bigr)$.

Step 2. Let $\vec{c} = F(\vec{p}, P^U, R)$.

Step 3. Let the classification error rate be $e = E(\vec{c}, \vec{c}^Z)$ where the function $E(\vec{v}, \vec{w})$ is defined by equation 5.

$E(\vec{v}, \vec{w}) = \dfrac{|\{j \mid v_j \neq w_j\}|}{|\vec{w}|}$   (5)
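Putting the pieces together, a single run of this experiment might look like the Python sketch below; it reuses the generate_item, ran and classify helpers sketched earlier, and its structure is our own reading of Steps 1-3, not code taken from the paper.

def run_experiment(m, u, z=200):
    """Simulate one test: generate items, examinees and responses, classify, return e."""
    categories = list(range(m))
    items = [dict(enumerate(generate_item(m))) for _ in range(u)]   # actual parameters P^U
    priors = {k: 1.0 / m for k in categories}                       # equal category distribution
    true_cats = [ran(categories, [1.0 / m] * m) for _ in range(z)]  # Step 1: categories c^Z
    errors = 0
    for c_true in true_cats:
        responses = [ran([0, 1], [1 - it[c_true], it[c_true]]) for it in items]
        c_hat, _ = classify(priors, items, responses)               # Step 2
        errors += (c_hat != c_true)
    return errors / z                                               # Step 3, equation 5

# e.g. run_experiment(m=5, u=50) returns the error rate of one simulated test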
4.2 Results
Here we show the results of two sets of experiments. The first, with the setting m = 5 and u = (10, 20, 30, 40, 50, 60, 70, 80, 90, 100), is shown in Figure 1 (left). We can see that the error rate falls with the number of items, not only in the sense of the most likely result but also in the sense of stability. The same effect can be seen in Figure 1 (right), where a similar set of experiments with m = 8 is presented.
Figure 1: Accuracy of classification vs. number of items (5 (left) / 8 (right) categories, ideal parameters = theoretical limits). [Box-graphs of error rate vs. number of items, 10-100.]
Figure 2: Accuracy of classification vs. number of categories (ideal parameters = theoretical limits). [Box-graph of error rate vs. number of categories, 2-10.]
Figure 2 shows the results of an experiment with the setting u = 50 and m = (2, 3, 4, 5, 6, 7, 8, 9, 10). The accuracy of classification decreases significantly with the increasing number of categories.
In Figure 3 we present an overview of the theoretical limits of classification. It shows how many items we need for a given number of categories to obtain e < 0.1. More precisely, the figure shows the minimal number of items for a given number of categories at which the median of the error rates was below 0.1.
Figure 3: Number of items needed to get the median of classification errors below 0.1 for a given number of categories. [Plot of minimal number of items (10-60) vs. number of categories (2-10).]
CSEDU2012-4thInternationalConferenceonComputerSupportedEducation
96
5 CALIBRATION
In (Rudner, 2002; Rudner, 2009; Rudner, 2010) Rudner devotes only a few words to MDT calibration (the estimation of the priors: the vector $\vec{p}$ and the set $P$). For practical purposes, however, the calibration process is essential. In this section we present calibration methods and the results of our experiments showing characteristics of the calibration process that are important for implementing MDT in real-world testing. We focus entirely on the estimation of the set $P$ because, in the worst case, if we were unable to estimate $\vec{p}$, we could set $\vec{p} = (\frac{1}{|M|}, \frac{1}{|M|}, \ldots, \frac{1}{|M|})$ (equally distributed categories in the population) without fatal consequences to the precision of the method (see (Rudner, 2009)).
5.1 Basic Calibration
As we have already mentioned, MDT is an instance of the well-known Naive Bayes Classifier (NBC). NBC is widely used in Artificial Intelligence, where the calibration process ("classifier training") is well developed. In AI there is a "training set" of objects with known attributes as well as known classifications. The equivalent of a training set in Educational Measurement is pilot testing performed on a set of examinees ("pre-testees") of known classification (typically obtained from external sources, e.g. existing certifications). Once we have a set of objects (pre-testees), their attributes (responses to items) and their classifications, we are able to compute the parameters of the attributes (items).
More precisely: let us have a set of categories $M$, a set of items $U$, a set of examinees $Z$, the corresponding set of responses $R$ and the corresponding vector of classifications $\vec{c}$. Our task is to obtain the corresponding set $P$. Once again we can describe the process as a function $B : (R, \vec{c}) \to P$. Since $P$ is a set of elements $P(u_i \mid m_k)$, we can reduce the computation of the function $B$ to the computation of each element. Equations 6, 7 and 8 describe the evaluation of $P(u_i \mid m_k)$ in three steps.

$R_{m_k} = \{\vec{z}_j \mid \vec{z}_j \in R \wedge c_j = m_k\}$   (6)

$T^{m_k}_i = \{z_{ji} \mid \vec{z}_j \in R_{m_k} \wedge z_{ji} = 1\}$   (7)

$P(u_i \mid m_k) = \dfrac{|T^{m_k}_i|}{|R_{m_k}|}$   (8)
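A minimal Python sketch of the basic calibration $B$ following equations 6-8 is given below; the data layout and the 0.5 fallback for a category with no pre-testees are our own assumptions, not part of the paper.

def calibrate(responses, classifications, categories, n_items):
    """Estimate P(u_i | m_k) as the fraction of pre-testees of category m_k
    who answered item u_i correctly (equations 6-8).

    responses[j][i] is the 0/1 response of pre-testee j to item i,
    classifications[j] is the known category of pre-testee j.
    """
    params = []
    for i in range(n_items):
        item = {}
        for m_k in categories:
            group = [r[i] for r, c in zip(responses, classifications) if c == m_k]  # R_{m_k}
            # equation 8; the 0.5 default for an empty group is an assumption of ours
            item[m_k] = sum(group) / len(group) if group else 0.5
        params.append(item)
    return params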
A crucial difference between the usage of NBC in Artificial Intelligence and in Educational Measurement is the size of the training set. In AI we typically operate with large training sets, often larger than the set of objects we want to classify (see examples in (Caruana and Niculescu-Mizil, 2006)). In contrast, in Educational Measurement we are very limited in the number of pre-testees. Recruiting persons of known classification is an expensive process, especially if we are developing a brand-new test. Therefore there is a strong motivation to keep the number of pre-testees as small as possible. The next two sections are dedicated to the analysis of the required number of pre-testees (section 5.2) and to the description of a particular calibration improvement technique (section 5.3).
5.2 Items or Categories
Two approaches are possible when specifying a sufficient number of pre-testees: per item (used by Rudner in (Rudner, 2010)) or per group. In this section we show which approach is more appropriate. To answer this question we have constructed two sets of experiments. The experiments are repeated a few times and share a common framework; their results are again presented as box-graphs.
5.2.1 Framework
Let a given number of categories $m$, a number of items $u$, a number of pre-testees $z_p$, a number of examinees $z = 200$ and a number of selected items $u_s \le u$ define the sets $M$, $U$, $Z_p$ and $Z$.

Step 1. Let us again have a set of actual item parameters $P^U = \bigcup_{u_i \in U} GI()$; a vector $\vec{c}^{Z_p}$ of categories the pre-testees $z^p_j \in Z_p$ belong to, with $c^{Z_p}_j = RAN(\{1, 2, \ldots, m\}, \vec{p})$ where $\vec{p} = (\frac{1}{m}, \frac{1}{m}, \ldots, \frac{1}{m})$ (again an equal distribution of $\vec{p}$); a set $R^p$ of pre-testees' responses with $z^p_{ji} = RAN\bigl(\{0, 1\}, (1 - P^U(u_i \mid c^{Z_p}_j), P^U(u_i \mid c^{Z_p}_j))\bigr)$; a subset $U^s \subseteq U$ of $u_s$ randomly (uniformly) selected items $u_i \in U$; a vector $\vec{c}^Z$ of categories of the examinees $z_j \in Z$ defined analogously; and a set $R^s$ of examinees' responses to the items of $U^s$ with $z^s_{ji} = RAN\bigl(\{0, 1\}, (1 - P^U(u^s_i \mid c^Z_j), P^U(u^s_i \mid c^Z_j))\bigr)$.

Step 2. Let $P = B(R^p, \vec{c}^{Z_p})$ and then let $\vec{c} = F(\vec{p}, P, R^s)$.

Step 3. Let again the classification error rate be $e = E(\vec{c}, \vec{c}^Z)$, and let the difference of the calibration from the real parameters be $d = D(P, P^U)$ where the function $D$ is defined by equation 9.
ONPRACTICALISSUESOFMEASUREMENTDECISIONTHEORY-AnExperimentalStudy
97
Figure 4: Accuracy of classification (left) / calibration (right) vs. number of items. [Box-graphs of error rate (left) and difference (right) vs. number of items, 10-100.]
$D(P_1, P_2) = \dfrac{\sum_{u_i \in U,\, m_k \in M} \bigl(P_1(u_i \mid m_k) - P_2(u_i \mid m_k)\bigr)^2}{m \cdot u}$   (9)
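For completeness, equation 9 translates directly into a short Python helper; it assumes the same list-of-dicts layout for item parameters as the earlier sketches.

def calibration_distance(P1, P2, categories):
    """Mean squared difference between two sets of item parameters (equation 9)."""
    n_items = len(P1)
    total = sum((P1[i][m] - P2[i][m]) ** 2
                for i in range(n_items) for m in categories)
    return total / (len(categories) * n_items)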
5.2.2 Experiment 1
The first experiment focuses on the per-item approach. This approach suggests that, given a fixed number of categories and pre-testees, the accuracy of calibration and classification should decrease as the number of items to calibrate increases. We have constructed a set of experiments with the setting m = 2, z_p = 20, u_s = 20 and u = (10, 20, 30, 40, 50, 60, 70, 80, 90, 100). Note that we select a subset of items from the whole pool to ensure a relevant comparison of classification results between experiments with different numbers of calibrated items, given the dependency of classification accuracy on the number of items discussed in section 4. Figure 4 (left) shows how the value of e changes with the number of calibrated items; Figure 4 (right) shows the results for d instead of e. As we can see, both the accuracy of calibration and the accuracy of classification remain constant.
5.2.3 Experiment 2
The second experiment examines the influence of the number of categories on the accuracy of calibration and classification. The setting of the experiment is now u = 50, z_p = 30, u_s = 50 and m = (2, 3, 4, 5, 6, 7, 8). Figures 5 (left) and 5 (right) show the results. Now we can see a very different picture from the one seen in the previous figures: the accuracy of calibration as well as the accuracy of classification decreases with the increasing number of groups.
5.2.4 Conclusion
Our experiments have shown that describing the number of pre-testees on a per-group basis is more appropriate.
Figure 5: Accuracy of calibration (left) / classification (right) vs. number of categories. [Box-graphs of difference (left) and error rate (right) vs. number of categories, 2-8.]
5.3 Unknown Objects
Although we have shown that a good calibration can be obtained from a relatively small number of pre-testees, the calibration process can still be very expensive, and further improvements of the original calibration process are needed. The method we are going to explain was developed experimentally by us, independently of the references mentioned below.
NBC is also used in document classification, which is in many ways similar to testing: in both settings there is a huge number of objects (documents, examinees) to be classified, but it is difficult to obtain a training set, so the training set is typically very small. Nigam et al. (Nigam et al., 2000) took inspiration from (Dempster et al., 1977) and used unclassified objects to improve the calibration of NBC. We can use the same approach, summarized in the following algorithm (a code sketch follows the list):
1. Let us have the sets $M$, $U$, $Z_p$, $Z$, $R^p$ and $R$ and the vectors $\vec{p}$ and $\vec{c}^{Z_p}$ (in the notation of the previous sections).
2. Let $P_0 = B(R^p, \vec{c}^{Z_p})$.
3. Let $P_{t+1} = B(R, F(\vec{p}, P_t, R))$.
4. Repeat the iterative step 3 until the terminal condition $P_{t+1} = P_t$ (i.e. $P_{t+1}(u_i \mid m_k) = P_t(u_i \mid m_k)$ for all $u_i \in U$, $m_k \in M$) is reached.
5. As a side effect of this calibration we obtain the classification of the examinees: $\vec{c} = F(\vec{p}, P_t, R)$.
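A hedged Python sketch of this iterative calibration, reusing the calibrate and classify helpers sketched earlier, is shown below. Since exact equality $P_{t+1} = P_t$ may never be reached with floating-point arithmetic, we relax the terminal condition to a small tolerance; this relaxation and the max_iter safeguard are our own choices.

def iterative_calibration(pretest_responses, pretest_categories,
                          responses, priors, categories, n_items,
                          tol=1e-6, max_iter=100):
    """Iteratively re-calibrate item parameters using unclassified examinees."""
    # P_0 = B(R^p, c^{Z_p}): basic calibration on the pre-test data
    P = calibrate(pretest_responses, pretest_categories, categories, n_items)
    for _ in range(max_iter):
        # F(p, P_t, R): classify the unclassified examinees with the current parameters
        labels = [classify(priors, P, resp)[0] for resp in responses]
        # P_{t+1} = B(R, F(p, P_t, R)): re-calibrate on the examinees' responses
        P_next = calibrate(responses, labels, categories, n_items)
        change = max(abs(P_next[i][m] - P[i][m])
                     for i in range(n_items) for m in categories)
        P = P_next
        if change < tol:                      # relaxed terminal condition P_{t+1} = P_t
            break
    # side effect of the calibration: classification of the examinees
    classification = [classify(priors, P, resp)[0] for resp in responses]
    return P, classification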
The improvement of both calibration and classification accuracy achieved by this iterative calibration algorithm can be seen in Figures 6 and 7 for m = 5, u = 50, z_p = 20 and z = (100, 200, 300, 400, 500).
6 CONCLUSIONS
Measurement decision theory is a powerful test eval-
uation method in cases where we want to classify
examinees into a set of pre-defined categories. In this paper we have presented the results of a couple of experiments showing some interesting characteristics of MDT.
CSEDU2012-4thInternationalConferenceonComputerSupportedEducation
98
Figure 6: Improvement of calibration accuracy. [Box-graph of difference vs. number of examinees, 100-500.]
Figure 7: Improvement of classification. [Box-graph of error rate vs. number of examinees, 100-500.]
These experiments have followed and expanded on the work of Rudner (Rudner, 2002; Rudner, 2009; Rudner, 2010). We have focused on practical issues of MDT in order to provide a solid base for future applications of MDT.
We have given an overview of the theoretical limits of classification via MDT in section 4 and discussed the dependency of classification accuracy on the number of items and the number of categories. As the main result of that section, we have condensed our findings into a direct suggestion of how many items should be chosen to obtain good classification results (error rate below 0.1) for different numbers of categories.
In section 5 we have focused on the most important obstacle on the path to real-life usage of MDT: the process of item calibration. We have explained in depth the whole process of simple, straightforward calibration of items. Finally, we have introduced an improvement to the calibration process which significantly reduces the number of required pre-testees; experimental results showing the reduction were presented as well.
Based on the results presented in this paper, MDT becomes a ready-to-use method.
ACKNOWLEDGEMENTS
This paper was written in collaboration with my colleagues from Scio (www.scio.cz).
REFERENCES
Caruana, R. and Niculescu-Mizil, A. (2006). An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning (ICML '06), pages 161-168.
Cronbach, L. J. and Gleser, G. C. (1957). Psychological Tests and Personnel Decisions. University of Illinois Press, Urbana.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38.
Hambleton, R. K. and Novick, M. R. (1973). Toward an integration of theory and method for criterion-referenced tests. Journal of Educational Measurement, 10:159-170.
Nigam, K., McCallum, A. K., Thrun, S., and Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103-134.
Rudner, L. M. (2002). An examination of decision-theory adaptive testing procedures. Paper presented at the annual meeting of the American Educational Research Association, April 2002.
Rudner, L. M. (2009). Scoring and classifying examinees using measurement decision theory. Practical Assessment, Research & Evaluation, 14(8).
Rudner, L. M. (2010). Measurement decision theory (a measurement decision theory tutorial). http://echo.edres.org:8080/mdt/.
van der Linden, W. J. and Mellenbergh, G. J. (1978). Coefficients for tests from a decision-theoretic point of view. Applied Psychological Measurement, 2:119-134.
ONPRACTICALISSUESOFMEASUREMENTDECISIONTHEORY-AnExperimentalStudy
99