A Fuzzy Poisson Naive Bayes Classifier for Epidemiological Purposes

Ronei M. Moraes

and Liliane S. Machado

Department of Statistics, Federal University of Paraiba, Paraiba, Brazil

Department of Informatics, Federal University of Paraiba, Paraiba, Brazil

Keywords:

Classification, Fuzzy Poisson Naive Bayes, Epidemiology.

Abstract:

Statistical methods have been used to classify data in different areas. In epidemiological studies, some measures follow specific statistical distributions, and compatible classifiers can be designed for those cases. Classifiers based on measures that follow Poisson distributions can be found in the scientific literature. Because of the uncertainty in epidemiological measures, a fuzzy approach may be interesting, and the present work proposes a new classifier named Fuzzy Poisson Naive Bayes (FPNB). The theoretical development is presented, as well as results of its application to simulated multidimensional data. A brief comparison with a classical Poisson Naive Bayes classifier and with a Naive Bayes classifier is also performed.

1 INTRODUCTION

Several kinds of classifiers can be found in the scientific literature and are applied in different areas, such as pattern recognition (Kim et al., 2003), image processing (Richards, 2013) and psychomotor skills assessment in training based on virtual reality (Moraes and Machado, 2014). There are classifiers designed for Multinomial (Duda et al., 2000), Beta (Moraes et al., 2012) (Moraes et al., 2014), Binomial (Bielza and Larranaga, 2014), Gaussian (Johnson and Wichern, 2007) and Fuzzy Gaussian (Moraes and Machado, 2012) distributions, and for mixtures of distributions (Melo et al., 2003) (Ogura et al., 2014). Some classifiers can be applied without taking into account the statistical distribution followed by the data, such as neural networks (Bishop, 2007), genetic algorithms and decision trees (Congdon, 2000), K-NN (Vadrevu and Murty, 2010) and Fuzzy K-NN (Keller et al., 1985). In this last case, classifiers are used in a generalized way, sometimes with acceptable results. However, it is also possible to find cases in which the classifier is not suitable for the statistical distribution of the data, resulting in performance that is lower than expected or even poor.

Some measures follow a specific statistical distribution, and classifiers compatible with each case can be designed. For example, the number of registered cases of a particular disease in a period of time follows a Poisson distribution (Feller, 1971). This distribution can also be used for other epidemiological measures and has been applied in other areas. For instance, when the probability of a disease is small and the population is large, the Poisson distribution provides a good approximation to the Binomial distribution, with an important advantage: it is easier to compute. Classifiers based on the Poisson distribution are interesting for applications in other areas too. In fact, the Poisson Naive Bayes (PNB) classifier has been applied to text classification (Altheneyan and Menai, 2014) (Kim et al., 2003) and neurosciences (Ma et al., 2006), among others.
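The quality of the Poisson approximation to the Binomial distribution can be checked numerically. The sketch below is only an illustration: the population size `n` and the per-person probability `p` are hypothetical values chosen for the example, not figures from this study.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Exact Binomial probability of k cases among n individuals."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of k cases with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# Hypothetical setting: rare disease (small p) in a large population (large n).
n, p = 100_000, 0.0002
lam = n * p  # Poisson mean chosen to match the Binomial mean

# Largest pointwise gap between the two distributions over k = 0..60.
max_abs_diff = max(
    abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(61)
)
print(f"lambda = {lam:.1f}, max |Binomial - Poisson| = {max_abs_diff:.2e}")
```

The Poisson form needs only the single parameter `lam`, whereas the Binomial form requires a large binomial coefficient for every value of `k`.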

However, the uncertainty in epidemiological measures, which may be underestimated due to failures in data collection or overestimated due to presumed but unconfirmed diagnoses (Rothman et al., 2012), suggests that a fuzzy approach may be more appropriate. Thus, a new approach based on the Poisson distribution and fuzzy data can be interesting for generating classifications from epidemiological measures.

This paper is organized as follows: Section 2 presents some theoretical aspects of the probability of fuzzy events and introduces a new classifier based on the Poisson distribution and fuzzy data. Section 3 presents results from the application of the new method to simulated Poisson-distributed data. Comparisons with two classifiers, the classical Poisson Naive Bayes and the Naive Bayes, are performed in Section 4. Finally, conclusions are provided in the last section.

Moraes, R. and Machado, L..

A Fuzzy Poisson Naive Bayes Classiﬁer for Epidemiological Purposes.

In Proceedings of the 7th International Joint Conference on Computational Intelligence (IJCCI 2015) - Volume 2: FCTA, pages 193-198

ISBN: 978-989-758-157-1


2 METHODOLOGY

For a better understanding of the proposed classifier, some theoretical considerations are needed. First, the Naive Bayes classifier is defined, followed by the Poisson Naive Bayes classifier and by the proposed Fuzzy Poisson Naive Bayes classifier. After that, details about the epidemiological simulation are provided. Finally, the Kappa Coefficient, which is used to perform the statistical analysis of the results, is introduced.

2.1 Naive Bayes Classifier

Formally, let the classes of performance in the decision space be $\Omega = \{1, \ldots, M\}$, where M is the total number of classes. Let X be a vector of training data, according to sample data D, where X is a vector with n distinct features, i.e. $X = \{X_1, \ldots, X_n\}$, and let $w_i$, $i \in \Omega$, be the class in the decision space for the vector X. Then the probability of the class $w_i$, given the vector X, can be estimated using Bayes' theorem:

$$P(w_i \mid X) = \frac{P(X \mid w_i)\,P(w_i)}{P(X)} = \frac{P(X_1, \ldots, X_n \mid w_i)\,P(w_i)}{P(X)} \tag{1}$$

The cost of computing equation (1) grows with the number n of variables. An alternative is to assume the naive hypothesis (Duda et al., 2000), in which each feature $X_k$ is conditionally independent of every other feature $X_l$, for all $k \neq l \leq n$. This hypothesis, although not always realistic, enables an easier calculation of equation (1). Advantages of that assumption are the robustness of the Naive Bayes (NB) classifier and the fact that it can classify data for which it was not trained (Ramoni and Sebastiani, 2001). Thus, introducing a scale factor S, which depends on $X_1, \ldots, X_n$, equation (1) can be expressed as:

$$P(w_i \mid X_1, \ldots, X_n) = \frac{1}{S}\,P(w_i) \prod_{k=1}^{n} P(X_k \mid w_i) \tag{2}$$

The classification rule for NB is:

$$X \in w_i \quad \text{if} \quad P(w_i \mid X_1, \ldots, X_n) > P(w_j \mid X_1, \ldots, X_n) \tag{3}$$

for all $i \neq j$, where the probability P is given by (2).

2.2 Poisson Naive Bayes Classifier

A possible approach for the Naive Bayes classifier is to assume a Poisson distribution for each $X_k$:

$$P(X_k = v \mid w_i) = \frac{e^{-\lambda_{ki}}\,\lambda_{ki}^{v}}{v!} \tag{4}$$

where $v = 0, 1, 2, \ldots$, $v!$ is the factorial of $v$, and the parameter $\lambda_{ki}$, the mean for variable $X_k$ and class $i$, is computed from D (Feller, 1971). Applying the logarithm to equation (2) simplifies the exponential in the Poisson formula (equation 4) and, consequently, reduces the computational cost by replacing multiplications with additions. The Poisson Naive Bayes (PNB) classifier is then given by:

$$g(w_i, X_1, \ldots, X_n) = \log[P(w_i \mid X_1, \ldots, X_n)] = \log(1/S) + \log P(w_i) + \sum_{k=1}^{n} \log[P(X_k \mid w_i)] \tag{5}$$

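where g is the classification function and $P(X_k \mid w_i)$ is given by (4). The term $\log[P(X_k \mid w_i)]$ in equation (5) can be rewritten as:

$$\log[P(X_k = v \mid w_i)] = \log\left(\frac{e^{-\lambda_{ki}}\,\lambda_{ki}^{v}}{v!}\right) = v \times \log(\lambda_{ki}) - \lambda_{ki} - \log(v!). \tag{6}$$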

The classification rule for PNB is:

$$X \in w_i \quad \text{if} \quad g(w_i, X_1, \ldots, X_n) > g(w_j, X_1, \ldots, X_n) \tag{7}$$

for all $i \neq j$, where the function g is given by (5).
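The PNB rule can be sketched in a few lines of code. The example below is a minimal illustration, not the implementation used in this work: each class score is the log prior plus the sum over features of v·log(λ) − λ − log(v!), and the class with the highest score wins. The function names, the priors and the λ values are hypothetical; the scale factor S is omitted because it is the same for every class.

```python
import math
from typing import Sequence


def poisson_log_pmf(v: int, lam: float) -> float:
    """log P(X_k = v | w_i) = v*log(lam) - lam - log(v!)."""
    return v * math.log(lam) - lam - math.lgamma(v + 1)


def pnb_score(x: Sequence[int], prior: float, lams: Sequence[float]) -> float:
    """Log prior plus the sum of per-feature Poisson log-probabilities."""
    return math.log(prior) + sum(
        poisson_log_pmf(v, lam) for v, lam in zip(x, lams)
    )


def pnb_classify(
    x: Sequence[int],
    priors: Sequence[float],
    lams_by_class: Sequence[Sequence[float]],
) -> int:
    """Index of the class with the highest score."""
    scores = [pnb_score(x, p, lams) for p, lams in zip(priors, lams_by_class)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical two-class, two-feature example.
priors = [0.5, 0.5]
lams_by_class = [[5.0, 2.0], [20.0, 10.0]]
print(pnb_classify([18, 9], priors, lams_by_class))  # counts near class 1 means
print(pnb_classify([4, 1], priors, lams_by_class))   # counts near class 0 means
```

Using `math.lgamma(v + 1)` for log(v!) avoids computing large factorials explicitly.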

2.3 Fuzzy Poisson Naive Bayes Classifier

Zadeh introduced a probability measure for fuzzy events (Zadeh, 1968). Let B be a σ-field of Borel subsets in $\mathbb{R}^n$ and P be a probability measure over $\mathbb{R}^n$. Let A be a fuzzy event in B, with membership function $\mu_A$. Then the probability of A can be expressed as a Lebesgue-Stieltjes integral:

$$P(A) = \int_{A \subseteq \mathbb{R}^n} dP = \int_{A \subseteq \mathbb{R}^n} \mu_A(x)\,dP = E(\mu_A) \tag{8}$$

Thus, the probability of a fuzzy event A is the mathematical expectation of its membership function, which can be written as:

$$P(A) = \int_{A \subseteq \mathbb{R}^n} \mu_A(x)\,P(x)\,dx \tag{9}$$

At this point, it is assumed that $X_1, \ldots, X_n$ are also fuzzy variables (Klir and Yuan, 1995) and that, for each one, a membership function $\mu_{w_i}(X_k)$ is available for all $k \leq n$. Then, based on the probability of a fuzzy event (Zadeh, 1968) given by equation (9), the Fuzzy Poisson Naive Bayes (FPNB) classifier is given by:

$$g_f(w_i, X_1, \ldots, X_n) = \log[P(w_i \mid X_1, \ldots, X_n)] = \log(1/S_f) + \log P(w_i) + \sum_{k=1}^{n} \left\{ \log[\mu_{w_i}(X_k)] + \log[P(X_k \mid w_i)] \right\} \tag{10}$$

where $g_f$ is the new classification function, $S_f$ is a new scale factor and $\log[P(X_k \mid w_i)]$ is given by (6).

The parameters needed to compute $P(X_k \mid w_i)$ and $\mu_{w_i}(X_k)$ should be learned from sample data D. The best estimate of the class of the vector X is the one with the highest value of the classification function $g_f$. However, as $S_f$ is a scale factor, it is not necessary to compute it in this maximization. Then, from equations (10) and (6), with v denoting the observed value of $X_k$:

$$g_f(w_i, X_1, \ldots, X_n) = \log P(w_i) + \sum_{k=1}^{n} \left\{ \log[\mu_{w_i}(X_k)] + v \times \log(\lambda_{ki}) - \lambda_{ki} - \log(v!) \right\} \tag{11}$$

Finally, the classification rule for FPNB is:

$$X \in w_i \quad \text{if} \quad g_f(w_i, X_1, \ldots, X_n) > g_f(w_j, X_1, \ldots, X_n) \tag{12}$$

for all $i \neq j$, where the functions $g_f$ are given by (11).
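The FPNB score differs from the PNB score only by one extra term per feature, the logarithm of the membership degree. The sketch below illustrates this under stated assumptions: the triangular membership functions, the priors and the λ values are hypothetical, and the small `floor` that avoids log(0) for values with zero membership is a choice of this sketch, not part of the formulation above.

```python
import math
from typing import Callable, Sequence

Membership = Callable[[int], float]


def poisson_log_pmf(v: int, lam: float) -> float:
    """log P(X_k = v | w_i) = v*log(lam) - lam - log(v!)."""
    return v * math.log(lam) - lam - math.lgamma(v + 1)


def fpnb_score(
    x: Sequence[int],
    prior: float,
    lams: Sequence[float],
    memberships: Sequence[Membership],
    floor: float = 1e-12,
) -> float:
    """Log prior plus, per feature, log membership and Poisson log-probability."""
    score = math.log(prior)
    for v, lam, mu in zip(x, lams, memberships):
        # floor is an assumption of this sketch: it keeps log() finite
        # when a value has zero membership in the class.
        score += math.log(max(mu(v), floor)) + poisson_log_pmf(v, lam)
    return score


def fpnb_classify(x, priors, lams_by_class, memberships_by_class) -> int:
    """Index of the class with the highest FPNB score."""
    scores = [
        fpnb_score(x, p, lams, mus)
        for p, lams, mus in zip(priors, lams_by_class, memberships_by_class)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def triangular(center: float, width: float) -> Membership:
    """Hypothetical triangular membership function, used only for this demo."""
    return lambda v: max(0.0, 1.0 - abs(v - center) / width)


priors = [0.5, 0.5]
lams_by_class = [[5.0, 2.0], [20.0, 10.0]]
memberships_by_class = [
    [triangular(5.0, 10.0), triangular(2.0, 6.0)],
    [triangular(20.0, 15.0), triangular(10.0, 8.0)],
]
predicted = fpnb_classify([18, 9], priors, lams_by_class, memberships_by_class)
print(predicted)
```

When every membership degree equals 1, the extra term vanishes and the score reduces to the PNB score.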

2.3.1 Parameters Estimation

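In this paper, two estimators for λ using sample data D are presented. The first one is the maximum likelihood estimator, which is given by (Feller, 1971):

$$\hat{\lambda}_{ki} = \frac{1}{\dim(D)} \sum_{j=1}^{\dim(D)} x_{kj} \tag{13}$$

where $\dim(D)$ is the number of observations in sample data D whose class is $w_i$, $x_{kj}$ is the value of $X_k$ in the j-th of those observations, and $\sum_{j=1}^{\dim(D)} x_{kj}$ is the count of events in D in which the value of $X_k$ is associated with the class $w_i$.

The second estimator is given by (Ogura et al., 2014):

$$\hat{\lambda}_{ki} = \frac{c_1 + \sum_{j=1}^{\dim(D)} x_{kj}}{c_2 + \dim(D)} \tag{14}$$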

where $c_1$ and $c_2$ are smoothing parameters (constants) used to prevent zero-valued estimates of $\lambda_{ki}$.

Thus, using the estimators provided by equation (13) or (14), it is possible to compute $g_f$ from equation (11) for each class $w_i$. In this paper, the estimator provided by equation (14) is used, with the parameters $c_1 = 0.1$ and $c_2 = 1$.
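The two estimators can be compared on a small example. The sketch below assumes that the smoothed estimator adds one constant to the sum of counts and the other to the number of observations, i.e. (c1 + sum of counts) / (c2 + number of observations); the function names and the sample counts are hypothetical.

```python
from typing import Sequence


def lambda_mle(counts: Sequence[int]) -> float:
    """Maximum likelihood estimate: the sample mean of the counts."""
    return sum(counts) / len(counts)


def lambda_smoothed(
    counts: Sequence[int], c1: float = 0.1, c2: float = 1.0
) -> float:
    """Smoothed estimate, assumed form: (c1 + sum) / (c2 + n)."""
    return (c1 + sum(counts)) / (c2 + len(counts))


# Hypothetical class in which the disease was never registered.
zero_counts = [0, 0, 0]
print(lambda_mle(zero_counts))       # 0.0, so log(lambda) would be undefined
print(lambda_smoothed(zero_counts))  # small positive value instead of zero

# Hypothetical class with registered cases.
some_counts = [2, 4, 6]
print(lambda_mle(some_counts), lambda_smoothed(some_counts))
```

A strictly positive estimate matters because the classification function takes the logarithm of λ.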

The membership functions $\mu_{w_i}(X_k)$ should be learned from sample data D. A possible approach is to obtain them from normalized relative frequency histograms of the $X_k$ variables (Dubois and Prade, 1983) (Kaufmann et al., 2015).
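One simple way to build such a membership function is sketched below. It is an assumption of this example, not necessarily the procedure used in this work: the relative frequency of each observed count is divided by the largest relative frequency, so that the most frequent value receives membership 1 and unseen values receive membership 0. The function name and the sample values are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


def histogram_membership(values: Sequence[int]) -> Callable[[int], float]:
    """Membership function from a frequency histogram scaled to a peak of 1."""
    counts = Counter(values)
    peak = max(counts.values())
    # Relative frequency divided by the largest relative frequency;
    # the sample size cancels, leaving count / peak.
    table = {value: count / peak for value, count in counts.items()}

    def mu(v: int) -> float:
        return table.get(v, 0.0)  # values never observed get membership 0

    return mu


# Hypothetical counts of one variable within one class.
mu = histogram_membership([3, 3, 3, 4, 4, 5])
print(mu(3), mu(4), mu(5), mu(10))
```

In practice, neighbouring counts could be grouped into bins before normalization so that sparse histograms do not assign zero membership to plausible values.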

2.4 Simulations

In order to assess the new classifier, a Monte Carlo simulation was used to generate counts of newly registered cases of three diseases. In a practical situation, they could be three communicable diseases. The first one is a vector-borne disease: dengue fever, whose vector in Brazil is the Aedes aegypti mosquito. The second disease is HIV-AIDS and the third one is tuberculosis, both of which are spread person-to-person.

In that scenario, the goal is to predict the class of epidemiological priority of municipalities, in order to support actions against those diseases. Thus, databases with 200 observations (municipalities) for each disease were generated, covering the three diseases with three Poisson distributions using different parameters. Each line of a database simulates the number of cases registered for each disease in a municipality. Three levels of priority were defined for all cases, according to the terciles calculated in the training database for each disease. After that, a logical combination of those terciles defines the priority level of a municipality as low, medium or high.

In total, 40 pairs of databases were created, where the first database of each pair is for training and the second one is for testing. The same Poisson parameters were used to create both databases of a pair. However, those parameters were changed from pair to pair in order to assess the variability of the classification results.
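The simulation procedure can be sketched as follows. This is a minimal illustration under several assumptions: the Poisson sampler is Knuth's multiplication method, the three rates are the ones listed in Section 3, the seed is arbitrary, and the rule that combines the per-disease terciles (here, the median of the three tercile ranks) is hypothetical, since the exact logical combination is not specified in the text.

```python
import math
import random
from typing import List, Sequence, Tuple


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) count with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_database(
    lams: Sequence[float], n_obs: int, rng: random.Random
) -> List[List[int]]:
    """One row per municipality, one Poisson count per disease."""
    return [[sample_poisson(lam, rng) for lam in lams] for _ in range(n_obs)]


def tercile_cuts(column: Sequence[int]) -> Tuple[int, int]:
    """Empirical cut points that split a column into three parts."""
    ordered = sorted(column)
    return ordered[len(ordered) // 3], ordered[2 * len(ordered) // 3]


def tercile_rank(value: int, cuts: Tuple[int, int]) -> int:
    """0, 1 or 2 for the lower, middle or upper tercile."""
    return int(value >= cuts[0]) + int(value >= cuts[1])


def priority_level(row: Sequence[int], cuts_per_disease) -> int:
    """HYPOTHETICAL combination rule: median of the per-disease ranks."""
    ranks = sorted(tercile_rank(v, c) for v, c in zip(row, cuts_per_disease))
    return ranks[len(ranks) // 2]


rng = random.Random(2015)            # arbitrary seed, for reproducibility
lams = (282.2, 20.2, 33.5)           # dengue, HIV-AIDS, tuberculosis rates
train = simulate_database(lams, 200, rng)
test = simulate_database(lams, 200, rng)

# Terciles come from the training database only and are reused for testing.
cuts = [tercile_cuts([row[k] for row in train]) for k in range(len(lams))]
train_levels = [priority_level(row, cuts) for row in train]
test_levels = [priority_level(row, cuts) for row in test]
print(train_levels[:10], test_levels[:10])
```

Repeating the last block with different `lams` for each of the 40 pairs would mimic the variation of parameters described above.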

2.5 Coefficient of Agreement Assessment

A statistical comparison between two different classifiers can be made using several statistical coefficients (Duda et al., 2000). In the Pattern Recognition literature, a robust weighted measure that takes into account agreements and disagreements between two sources of information (Viera and Garrett, 2005) is the Kappa Coefficient, proposed by Cohen (Cohen, 1960) and given by:

$$K = \frac{P_0 - P_c}{1 - P_c}, \tag{15}$$

where:

$$P_0 = \frac{\sum_{i=1}^{M} n_{ii}}{N} \quad \text{and} \quad P_c = \frac{\sum_{i=1}^{M} n_{i+}\,n_{+i}}{N^2} \tag{16}$$

with $n_{ii}$ as the elements of the main diagonal of the classification matrix, $n_{i+}$ as the total of line i of the classification matrix, $n_{+i}$ as the total of column i of the same matrix, M as the number of possible classes and N as the total number of decisions present in the matrix.

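The variance of the Kappa Coefficient, denoted by $\sigma_K^2$, is given by:

$$\sigma_K^2 = \frac{P_0(1 - P_0)}{N(1 - P_c)^2} + \frac{2(1 - P_0)(2P_0P_c - \theta_1)}{N(1 - P_c)^3} + \frac{(1 - P_0)^2(\theta_2 - 4P_c^2)}{N(1 - P_c)^4}, \tag{17}$$

where $\theta_1$ is given by:

$$\theta_1 = \frac{\sum_{i=1}^{M} n_{ii}(n_{i+} + n_{+i})}{N^2}, \tag{18}$$

and $\theta_2$ is given by:

$$\theta_2 = \frac{\sum_{i=1}^{M} \sum_{j=1}^{M} n_{ij}(n_{j+} + n_{+i})^2}{N^3}, \tag{19}$$

where $n_{ij}$ is the element in line i and column j of the classification matrix.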
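The Kappa Coefficient and its variance can be computed directly from a classification matrix. The sketch below is an illustration under one stated assumption: the variance uses the usual large-sample (delta-method) expression, with a single sum over the diagonal for the first auxiliary term and a double sum over all cells for the second. The function name is hypothetical; the matrix is the one shown in Table 1.

```python
from typing import Sequence, Tuple


def kappa_and_variance(matrix: Sequence[Sequence[int]]) -> Tuple[float, float]:
    """Cohen's Kappa and its large-sample variance for a square matrix."""
    m = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_tot = [sum(row) for row in matrix]
    col_tot = [sum(row[j] for row in matrix) for j in range(m)]

    p0 = sum(matrix[i][i] for i in range(m)) / n          # observed agreement
    pc = sum(row_tot[i] * col_tot[i] for i in range(m)) / n**2  # chance agreement
    kappa = (p0 - pc) / (1.0 - pc)

    theta1 = sum(
        matrix[i][i] * (row_tot[i] + col_tot[i]) for i in range(m)
    ) / n**2
    # Assumed form: cell (i, j) weighted by row total j plus column total i.
    theta2 = sum(
        matrix[i][j] * (row_tot[j] + col_tot[i]) ** 2
        for i in range(m)
        for j in range(m)
    ) / n**3

    variance = (
        p0 * (1.0 - p0) / (n * (1.0 - pc) ** 2)
        + 2.0 * (1.0 - p0) * (2.0 * p0 * pc - theta1) / (n * (1.0 - pc) ** 3)
        + (1.0 - p0) ** 2 * (theta2 - 4.0 * pc**2) / (n * (1.0 - pc) ** 4)
    )
    return kappa, variance


# Classification matrix of the FPNB classifier (Table 1).
table1 = [[148, 50, 2], [36, 128, 36], [1, 27, 172]]
kappa, variance = kappa_and_variance(table1)
print(f"K = {kappa:.4f}, variance = {variance:.4e}")
```

With the counts of Table 1 this gives K = 0.62 and a variance of roughly 7.1 × 10^-4, in line with the values reported in Section 3.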

3 RESULTS

Using the 40 pairs of databases created from the simulations described in Section 2.4, the FPNB classifier was used to assign one of three levels of epidemiological priority to each simulated municipality. First, the file with training samples was used to estimate the parameters of the FPNB classifier. After that, the second file, with testing samples, was used to evaluate its performance.

In order to make the simulations closer to reality, the λ parameters were obtained from Epidemiological Bulletins of the Brazilian Ministry of Health and are reproduced below:

• Dengue fever: 282.2 cases per 100,000 inhabitants (Surveillance, 2014);

• HIV-AIDS: 20.2 cases per 100,000 inhabitants (Surveillance, 2013);

• Tuberculosis: 33.5 cases per 100,000 inhabitants (Surveillance, 2015).

The best result obtained, according to the Kappa Coefficient, can be observed in the classification matrix presented in Table 1. In that table, the main diagonal of the matrix contains the correct classifications, and all classification errors lie outside the main diagonal. The Kappa Coefficient was used to compare classification agreement. From the classification matrix obtained, the Kappa coefficient for all samples was K = 62.0% with variance $7.091 \times 10^{-4}$. The FPNB made mistakes in 152 cases. That performance is acceptable and shows that the FPNB adapts well to this kind of problem.

Table 1: Classification matrix for the FPNB classifier.

| Database | FPNB: 1 | FPNB: 2 | FPNB: 3 |
|---|---|---|---|
| 1 | 148 | 50 | 2 |
| 2 | 36 | 128 | 36 |
| 3 | 1 | 27 | 172 |

Another important result is the computational performance of the FPNB classifier: on a PC-compatible Core 2 Duo with 2 GB of RAM, the average CPU time consumed by the assessment was 0.3590 seconds. It is therefore possible to state that the FPNB has a low computational cost.

4 COMPARISON WITH OTHER CLASSIFIERS

The FPNB was compared with the two other classifiers described in this paper: the PNB and the NB classifiers. All of them were configured using the same methodology mentioned before. Thus, the same training samples were used to obtain the parameters of all classifiers, and the same testing samples were used for a controlled and impartial comparison among them. The CPU time used by each classifier in the classification task was measured.

The classification matrix obtained for the PNB classifier is presented in Table 2. The Kappa coefficient was K = 58.25% with variance $7.4987 \times 10^{-4}$ and there were 167 misclassifications. The classification task demanded 0.1400 seconds of CPU time.

The NB classifier provided the classification matrix presented in Table 3. For this classifier, the Kappa coefficient was K = 41.50% with variance $9.0710 \times 10^{-4}$, demanding 0.5140 seconds of CPU time. In this case, there were 234 misclassifications.


Table 2: Classification matrix for the PNB classifier.

| Database | PNB: 1 | PNB: 2 | PNB: 3 |
|---|---|---|---|
| 1 | 156 | 43 | 1 |
| 2 | 50 | 113 | 37 |
| 3 | 2 | 34 | 164 |

Table 3: Classification matrix for the NB classifier.

| Database | NB: 1 | NB: 2 | NB: 3 |
|---|---|---|---|
| 1 | 173 | 27 | 0 |
| 2 | 87 | 106 | 7 |
| 3 | 25 | 88 | 87 |

Tables 1, 2 and 3 and the Kappa coefficients show that the FPNB classifier performs better than both of the other classifiers. In statistical terms, the difference in performance between those assessment methods can be considered significant. Regarding computational performance, the FPNB was faster than the NB, but the PNB was the fastest.

5 CONCLUSIONS

This paper presented a new classifier based on Fuzzy Poisson Naive Bayes. Classifiers based on this approach can be applied to epidemiological studies as well as to other areas of human knowledge, such as text classification and neurosciences.

The performance of the Fuzzy Poisson Naive Bayes was compared with that of classifiers based on Poisson Naive Bayes and Naive Bayes. The results showed that the first one provides significantly better classifications than the others. The Poisson Naive Bayes classifier provided competitive results and the Naive Bayes classifier provided the worst results.

In terms of CPU time, the Fuzzy Poisson Naive Bayes was faster than the Naive Bayes, but the Poisson Naive Bayes was the fastest. The new classifier proved to be a competitive approach for solving problems in Epidemiology.

ACKNOWLEDGEMENTS

This project is partially supported by grants 310561/2012-4 and 310470/2012-9 of the National Council for Scientific and Technological Development (CNPq) and is related to the National Institute of Science and Technology “Medicine Assisted by Scientific Computing” (181813/2010-6), also supported by CNPq.

REFERENCES

Altheneyan, A. S. and Menai, M. E. B. (2014). Naive bayes classifiers for authorship attribution of arabic texts. Journal of King Saud University - Computer and Information Sciences, 26(4):473–484.

Bielza, C. and Larranaga, P. (2014). Discrete bayesian network classifiers: A survey. ACM Computing Surveys, 47(1):Article 5.

Bishop, C. (2007). Pattern Recognition and Machine Learning. Springer, Berlin, 1st edition.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educat. Psyc. Measurement, 20(1):37–46.

Congdon, C. B. (2000). Classification of epidemiological data: a comparison of genetic algorithm and decision tree approaches. In Proceedings of the 2000 Congress on Evolutionary Computation, pages 442–449.

Dubois, D. and Prade, H. (1983). Unfair coins and necessity measures: Towards a possibilistic interpretation of histograms. Fuzzy Sets and Systems, 10(1-3):15–20.

Duda, R. O., Hart, P. E., and Stork, D. G. (2000). Pattern Classification. Wiley Interscience, New York, 2nd edition.

Feller, W. (1971). An Introduction to Probability Theory and its Applications. Wiley, 2nd edition.

Johnson, R. A. and Wichern, D. W. (2007). Applied Multivariate Statistical Analysis. Pearson, 6th edition.

Kaufmann, M., Meier, A., and Stoffel, K. (2015). Ifc-filter: Membership function generation for inductive fuzzy classification. Expert Systems with Applications, 42:8369–8379.

Keller, J. M., Gray, M. R., and Givens, J. A. (1985). A fuzzy k-nearest neighbor algorithm. IEEE Trans. Syst. Man and Cybernetics, 15(4):580–585.

Kim, S.-B., Seo, H.-C., and Rim, H.-C. (2003). Poisson naive bayes for text classification with feature weighting. In Proceedings of the Sixth International Workshop on Information Retrieval with Asian Languages, pages 33–40.

Klir, G. J. and Yuan, B. (1995). Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall, 1st edition.

Ma, W. J., Beck, J. M., Latham, P. E., and Pouget, A. (2006). Bayesian inference with probabilistic population codes. Nature Neuroscience, 9(11):1432–1438.

Melo, A. C. O., Moraes, R. M., and Machado, L. S. (2003). Gaussian mixture models for supervised classification of remote sensing multispectral images. Lecture Notes in Computer Science, 2905:440–447.

Moraes, R. M. and Machado, L. S. (2012). Online assessment in medical simulators based on virtual reality using fuzzy gaussian naive bayes. Journal of Multiple-Valued Logic and Soft Computing, 18(5-6):479–492.

Moraes, R. M. and Machado, L. S. (2014). Psychomotor skills assessment in medical training based on virtual reality using a weighted possibilistic approach. Knowledge-Based Systems, 70:97–102.

Moraes, R. M., Rocha, A. V., and Machado, L. S. (2012). Intelligent assessment based on beta regression for realistic training on medical simulators. Knowledge-Based Systems, 32:3–8.

Moraes, R. M., Simas, A. B., Rocha, A. V., and Machado, L. S. (2014). New parameters estimators using em-like algorithm for naive bayes classifier based on beta distributions. In 11th International FLINS Conference on Decision Making and Soft Computing (FLINS2014), pages 155–160, Brazil. World Scientific.

Ogura, H., Amano, H., and Kondo, M. (2014). Classifying documents with poisson mixtures. Transactions on Machine Learning and Artificial Intelligence, 2(4):48–76.

Ramoni, M. and Sebastiani, P. (2001). Robust bayes classifiers. Artificial Intelligence, 125(1-2):209–226.

Richards, J. A. (2013). Remote Sensing Digital Image Analysis: An Introduction. Springer, 5th edition.

Rothman, K. J., Lash, T. L., and Greenland, S. (2012). Modern Epidemiology. Wolters Kluwer, 3rd edition.

Surveillance, H. (2013). Aids e dst. Epidemiological Bulletin: HIV-AIDS - Secretariat of Health Surveillance - Brazilian Ministry of Health, 2(1):1–16.

Surveillance, H. (2014). Monitoramento dos casos de dengue e febre de chikungunya até a semana epidemiológica 47 de 2014. Epidemiological Bulletin - Secretariat of Health Surveillance - Brazilian Ministry of Health, 45(31):1–7.

Surveillance, H. (2015). Detectar, tratar e curar: desafios e estratégias brasileiras frente à tuberculose. Epidemiological Bulletin - Secretariat of Health Surveillance - Brazilian Ministry of Health, 46(9):1–19.

Vadrevu, S. H. R. and Murty, S. U. (2010). A novel tool for classification of epidemiological data of vector-borne diseases. J. Glob Infect Dis., 2(1):35–38.

Viera, A. J. and Garrett, J. M. (2005). Understanding interobserver agreement: The kappa statistic. Family Medicine, 37(5):360–363.

Zadeh, L. A. (1968). Probability measures of fuzzy events. J. Math. Anal. Applic., 10:421–427.

FCTA 2015 - 7th International Conference on Fuzzy Computation Theory and Applications
