STATISTICAL DECISIONS IN PRESENCE OF IMPRECISELY

REPORTED ATTRIBUTE DATA

Olgierd Hryniewicz

Systems Research Institute, Polish Academy of Sciences, Newelska 6, Warsaw, Poland

Keywords: Statistical decisions, Attribute data, Imprecise data, Fuzzy approach, Possibility distribution.

Abstract: The paper presents a new methodology for making statistical decisions when data is reported in an

imprecise way. Such situations happen very frequently when quality features are evaluated by humans. We

have demonstrated that traditional models based either on the multinomial distribution or on predefined

linguistic variables may be insufficient for making correct decisions. Our model, which uses the concept of

the possibility distribution, allows to separate stochastic randomness from fuzzy imprecision, and provides a

decision – maker with more information about the phenomenon of interest.

1 INTRODUCTION

When we make decisions on the basis of statistical

analysis of data, we call such decisions – statistical

decisions. An important class of statistical decisions

is based on attribute data. In the simplest case,

attribute data are presented in a form of a random

sample consisting of elements having only two

values: zero and one. All cases described by zeroes

are usually called “failures”, and the remaining

statistical observations are called “successes”.

Statistical decisions based on attribute data are

well known in many fields of application. They were

introduced more than eighty years ago in statistical

quality control, and since that time they have been

widely used in industry, business and administration.

However, in information technology these methods

are, as for know, not very popular. Take, for

example, typical decision problems of Artificial

Intelligence or Pattern Recognition. Quality of

proposed algorithms is evaluated on widely accepted

benchmarks without taking into account the

randomness of their outputs which results from the

randomness of input data. In this paper we present

an attempt to deal with this problem in cases which

seem to be typical in such applications like e.g.

linguistic summarizations of text data or automatic

classification of documents.

The theory of statistical decisions for the

attribute data (i.e. 0 – 1) is well known for more than

eighty years. It has been developed mainly for

applications in statistical quality control or other

industrial applications. In all such cases each

element of the analysed sample is precisely

evaluated as either “success” (1) or “failure” (0).

However, in many areas of application such precise

evaluations are hardly possible. Consider, for

example, an automatic selection of text documents,

where users evaluate the appropriateness of the

selection. The proportion of documents which have

been wrongly classified may serve as a measure of

the effectiveness of this algorithm. In many cases

however, it is difficult to present unequivocal

evaluations. The users may prefer to have a

possibility to give also answers like “May be Yes”,

“I am Undecided” or “May be Not”, and not only

either “Yes” or “No”. To give another example from

the area of information technology, let us consider

the evaluation of a new algorithm for the

compression of graphics. The perceived quality of

this new method can be evaluated by a group of

experts who are asked about the acceptability of

compressed pictures.

The practical necessity to work with such

imprecisely reported data prompted some authors to

develop appropriate statistical tools that could be

useful in decision making. The simplest approach is

based on the multinomial model for imprecisely

reported attribute data. We present this model in the

second section of the paper. Another possibility, the

application of fuzzy linguistic variables is analysed

in the third section. In the fourth section we present

a new approach based on the possibility theory

introduced by Zadeh. We present a possibilistic

309

Hryniewicz O. (2009).

STATISTICAL DECISIONS IN PRESENCE OF IMPRECISELY REPORTED ATTRIBUTE DATA.

In Proceedings of the 11th International Conference on Enterprise Information Systems - Artiﬁcial Intelligence and Decision Support Systems, pages

309-312

DOI: 10.5220/0001863003090312

 SciTePress

generalization of the multinomial model. In Monte

Carlo simulation experiments which are not

described in this short paper we have shown that this

approach provides more information for decision

makers in comparison to the aforementioned

methods.

2 MULTINOMIAL MODEL FOR

IMPRECISE ATTRIBUTE DATA

Suppose that a random variable, representing

statistical data of interest, may have k distinct

values. These values can be represented by natural

numbers, but can be also represented by either

ordered or unordered labels. The probabilities of

observing those values are denoted by

()

p,,p …

where

∑

1. If we observe a random sample

of n realization of this random variable, the

probability distribution that describes the numbers of

occurrences of all possible values

()

X,,X …

this random variable is called the multinomial

distribution, and is defined by the following function

()

.xpp

xX,,xXP

∑

===



…

(1)

This distribution may be used for the

construction of decision-making procedures when

observed values may be assigned to k different

categories. For example, in statistical quality control

we may observe different types of failures. If for

each considered type of failure we fix a critical

number of nonconforming items in the sample, we

can use (1) for the calculation of the probability of

the acceptance of the sampled population for all

possible values of the probabilities

()

p,,p …

Let us now consider the situation when observed

attribute data are imprecisely described by linguistic

labels. Without loosing generality, we may assume

that these data are described by the following set of

labels: “Yes” (Y) “May be Yes” (MY), “I am

Undecided” (U), “May be Not” (MN), and “Not”

(N). Let denote by

()

NMYY

p,p,p … the vector of

the corresponding probabilities of observations.

Then, we can use (1) for the calculation of all

interesting probabilities. We should note, however,

that in the considered case the decision – making

procedure should be different than in the

aforementioned case of statistical quality control.

We are usually interested in the unknown proportion

of actual (A) successes. Let us assume that we may

make only two decisions: “Accept” (if the actual

proportion of “successes” is small) or “Reject” (if

otherwise). The decision is based on the number of

“successes” in the sample. If this number is not

greater than a certain critical number c our decision

is to “Accept”. Otherwise, the decision is to

“Reject”. The decision criterion c is determined

from the analysis of the probability of “Acceptance”

calculated from the appropriate binomial

distribution.

In the considered case of imprecisely reported

observations actual “successes” may be hidden

under four possible labels, i.e Y, MY, U, and even

MN. Therefore, we may think about four possible

critical numbers: c

, for observations, which occur

with probability p

(Y)

, c

, which occur with

probability p

(MY)

, c

, which occur with

probability p

(U)

, and c

, which occur

with probability p

(MN)

. In order to

set all these critical values we have to know all

acceptable values for all these probabilities.

However, in practice we know these values only for

the actual probability of a “success”. Therefore, it is

natural to use only one critical value c for all these

possible outcomes of the test. If we do so, it is easy

to show that the probabilities of “Acceptance” will

be quite different, depending on the values of the

probabilities

(

)

NMYY

p,p,p … . However, usually

we do not know these probabilities, so we don’t

know the actual characteristics of our decision

procedure. Therefore, the multinomial model, if it

has to be used for the modelling of imprecise

attribute data, requires additional knowledge about

the probabilities of different answers.

3 FUZZY LINGUISTIC

VARIABLES AS MODELS OF

IMPRECISE ATTRIBUTE DATA

Imprecise values of attribute data can bee looked

upon as linguistic data described by fuzzy linguistic

variables as it was proposed by Zadeh. For

modelling quality data his approach was adopted in

(Wang and Raz, 1990) who proposed to describe

imprecise answers by predefined fuzzy subsets of

the interval [0,1]. In their original paper they

proposed to use fuzzy triangular number defined on

overlapping subsets of [0,1]. For making decisions

Raz and Wang proposed to use some real-valued

representations of fuzzy numbers, such as: modal

ICEIS 2009 - International Conference on Enterprise Information Systems

310

value, midpoint of the 50%

-cut, average, and

centroid. It is easy to show that for the first three of

the abovementioned representations it does not

matter if we calculate representative values for

individual observation and then sum them up or if

we calculate a fuzzy sum, and then its representative

value. In the fourth case this important property

holds only either for triangular fuzzy numbers or for

rectangular fuzzy numbers (i.e. for intervals).

In order to make decision about “Acceptance”

(or “Rejection”) we have to compare the

representative value of the sum of observed

linguistic variables with a certain critical value.

Unfortunately, this critical value cannot be easily

calculated for a simple reason that the representative

values of the fuzzy sum of fuzzy observations may

be quite different from the expected number of

evaluated “successes”. Especially when the fraction

of imprecise observations is significant the observed

representative values may be quite different than the

expected numbers of “successes” in the sample.

Another problem with determination of a correct

critical value for representative values of fuzzy

observations is related to their strong dependence on

the assumed representations of imprecise linguistic

concepts. All these problems and difficulties make

decision – making which is based on this fuzzy

approach rather questionable.

It is also worth noticing that in all cases when

calculation of representative values can be

performed on individual fuzzy observations the

whole procedure boils down to ordinary weighting

of observations. This concept is also known as the

calculation of “demerits”, and has been successfully

implemented in statistical process control (SPC).

However, in SPC it is assumed that available

information let us compute probabilistic

characteristics of the considered statistic.

Unfortunately, this is usually not the case for the

problem considered in this paper. Recently, in

(Gülbay and Kahraman, 2007) another fuzzy

approach has been proposed for the analysis of

linguistic quality data. However, this approach in the

context of decision-making has exactly the same

limitations as that of Wang and Raz.

4 POSSIBILITIC MODEL OF

IMPRECISE ATTRIBUTE DATA

In the previous two sections we have demonstrated

that in case of imprecisely reported attribute data the

information provided in terms of simple linguistic

labels may be not sufficient for correct decision –

making if this correctness should depend upon the

fraction of “successes” in a considered population.

In (Hryniewicz, 2008) an extension of the

considered model has been proposed by allowing

additional information about imprecise observations.

Our extension is based on a fact that each

observation may be treated as a “success”, but to a

certain degree, and vice versa, as a “failure”, but

also to a certain degree. Thus, the result of each

observation can be described by a fuzzy set

{}

11010

101010

≤

,max,,,||

(2)

defined on the set {0,1}. This fuzzy representation

may be also interpreted as a possibility distribution

over the set of two crisp outcomes of an observation:

“success” (one) and “failure” (zero). When the result

of an observation is described linguistically in such a

way that it can be regarded as a “failure”, the result

of observation is expressed as a fuzzy set with the

membership function

101

+ . Full (i.e.

undoubted) “failures”, which in our setting are

represented by labels “No”, are now described by

crisp sets. In this case the membership function is

given by

1001 ||

. When 10

the

corresponding label is “May be No”, and

in this

case describes the degree to which this label is

incompatible with an unequivocal label “No”. On

the other hand, if the result of an observation is

described linguistically in such a way that it can be

regarded as a “success”, the result of observation is

expressed as a fuzzy set with the membership

function

110

. Full (i.e. undoubted)

“successes”, which in our setting are represented by

a labels “Yes”, are described by crisp sets with the

membership function

1100 || + . When 10

the corresponding label is “May be Yes”, and

this case describes the degree to which this label is

incompatible with an unequivocal label “Yes”.

When

, we have the situation which we

describe by a label “Undecided”, as in this case there

is the same possibility either of “successes” and

“failures”.

Assume now, that in the sample of n items n

cases are characterized by fuzzy sets described by

the membership function

1110 n,,i,||

…

(3)

and in the remaining

nnn −

cases by fuzzy sets

described by the membership function

1101 n,,i,||

…

(4)

STATISTICAL DECISIONS IN PRESENCE OF IMPRECISELY REPORTED ATTRIBUTE DATA

311

Without loss of generality we can assume that

010

≤≤≤≤

n,,

μμ

…

(5)

and

111

≥≥≥≥

n,,

μμ

…

(6)

Hence, the fuzzy total number of “successes” in this

sample, calculated using Zadeh’s

extension

principle, is given by:

() ( )

nn|n|

n|||x

n,,

211111

12010

110

+++++

++++=

μμ

…

(7)

This number has to be compared with a critical

number of “successes” c in order to make a decision

of “Acceptance” or of “Rejection”.

Comparison of fuzzy numbers cannot be done

unequivocally, as they are not completely ordered.

One of the widely accepted methods of comparison

is based on the concepts of possibility and necessity

of dominance introduced in (Dubois and Prade,

1983). Let

()

be the membership function of the

fuzzy set

, and

()

be the membership function

of the fuzzy set

. When the evidence that

strictly greater than

is rather strong we can

express this feature using the Necessity of Strict

Dominance (NSD) index, defined as follows:

(

)

()

(

)()

[]

y,xminsupY

NSD

yx:y,x

νμ

≤

−= 1

(8)

When this evidence is weak, we can use the

Possibility of Strict Dominance (PSD) index which

is related to the NSD using the following formula.

(

)

(

)

PSDY

NSD  −= 1 .

(9)

The interpretation of these indices reflects

common understanding of the words “possible” and

“necessary”. Relations which are only partially

possible (PSD < 1) are not necessary (NSD = 0). On

the other hand, relations which are even partially

necessary (NSD > 0) are always fully possible (PSD

=1).

It is easily seen that when the number of

“successes” with the maximal value of the

membership function (equal to one) is situated to the

left of c we can say about a certain necessity that the

relation x<c has been fulfilled. This necessity is

equal to one only in case when the whole support of

(i.e. values of x with positive membership) is

located to the left of

c. When the number of

“successes” with the maximal value of the

membership function (equal to one) is situated to the

right of

c we can only say about a certain possibility

that the relation

x<c has been fulfilled. This

possibility is equal to zero only in case when the

whole support of

is located to the right of c.

This interpretation of possibility and necessity

indices let us formulate simple rules for decision –

making. We have to fix the critical value

c and the

required value of the necessity/ possibility of

≺ .

Thus, we are able to define a non-fuzzy decision

rule.

5 CONCLUSIONS

In the paper we have proposed a new approach to

the analysis and decision – making when

information is presented in a form of imprecisely

reported attribute data. We have demonstrated on

examples that traditional and popular approaches

provide only restricted information which might be

insufficient for correct decision making. The new

approach is definitely more flexible. Moreover, it

can be straightforwardly extended to the case when

the definitions of “success” and “failure” may be

imprecise. This imprecision may lead to imprecise

(fuzzy) decision criteria.

REFERENCES

Dubois, D., Prade, H., 1983. Ranking Fuzzy Numbers in

the Setting of Possibility Theory. Information Science,

vol.30, 183-224.

Gülbay, M., Kahraman, C., 2007. An alternative approach

to fuzzy control charts: Direct fuzzy approach.

Information Sciences, vol.177, 1463-1480.

Hryniewicz, O., 2008. Statistics with fuzzy data in

statistical quality control. Soft Computing, vol.12, 229-

234.

Wang, J.H., Raz, T., 1990. On the Construction of Control

Charts Using Linguistic Variables. International

Journal of Production Research, vol.28, 477-487.

ICEIS 2009 - International Conference on Enterprise Information Systems

312