into iterations; in each iteration the system outputs the
extracted information and asks the user to provide feedback,
i.e. answers to questions like “Is it a positive example of X?”
and/or “Is this a positive/negative example of X because of Y?”.
To analyze the expert’s feedback and generate the questions
efficiently, we index all the CASes using a graph database. It is
preferable, but not necessary, to use an incremental learning
algorithm. SVMs and inductive rule builders support this
(Cauwenberghs and Poggio, 2000).
The problem of information extraction is treated as a
combination of binary classification problems: first, a
classifier must determine whether a token represents a
particular characteristic; second, it must find mentions of
patient groups (similarly to the first subproblem); last, it must
link characteristics to these groups. Most existing
classification algorithms work with a vector-space representation
of objects, so it is important to choose an appropriate feature
selection procedure. Currently, we are experimenting with
information gain, Euclidean distance between positive and
negative examples, and various graph traversal strategies for
feature extraction and selection. The classifiers being tested
are SVM, HMM, CRF, ProbLog, decision trees, and neural networks
(Collobert et al., 2011).
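As an illustration of the first selection criterion mentioned above, information gain for a discrete feature can be sketched as follows. This is a minimal sketch and not the authors' implementation; the function names `entropy` and `information_gain` and the toy data in the comments are assumptions introduced here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the expected entropy after
    splitting the examples by the value of one discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: one binary token feature and binary labels
# (1 = the token is a mention of the target characteristic).
feature = [1, 1, 0, 0]
labels = [1, 1, 0, 0]
gain = information_gain(feature, labels)
```

A feature that separates positive and negative examples perfectly receives the full label entropy as its gain, while a feature that is independent of the labels receives a gain of zero, so ranking features by this score is one plausible way to select them.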
4.2 Feature Extraction
The information extraction step results in a summary table in
which rows correspond to groups of patients and columns to their
features. Table 1 is an example of such a table. In most cases,
it contains columns that identify patient groups (paper
identifier/number of the group), describe diseases (via ICD-10
codes or verbally), present results of various analyses (e.g.
tumor antigens), give demographic information (gender, sex), and
specify the type of vaccine used, the loading technique and
measures of outcomes (average survival time with and without
treatment). The summary table usually contains a few hundred
columns. Additionally, to increase the separability of classes,
binarization may be applied to this table: each feature with K
possible values (K > 2) is replaced with K new features that take
only two possible values, indicating the presence or absence of
the corresponding value of the source feature.
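The binarization step can be sketched as below. This is a minimal sketch under assumptions: the table is taken to be a list of dictionaries, and the helper name `binarize`, the column name `loading` and its values are hypothetical and not taken from the paper's data.

```python
def binarize(rows, feature):
    """Replace a feature with K > 2 distinct values by K indicator
    features, each marking presence (1) or absence (0) of one value."""
    values = sorted({row[feature] for row in rows})
    if len(values) <= 2:
        # Features with at most two values are left untouched.
        return [dict(row) for row in rows]
    result = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != feature}
        for value in values:
            new_row[f"{feature}={value}"] = int(row[feature] == value)
        result.append(new_row)
    return result


# Hypothetical summary table with one three-valued feature.
table = [
    {"group": 1, "loading": "lysate"},
    {"group": 2, "loading": "peptide"},
    {"group": 3, "loading": "rna"},
]
binary_table = binarize(table, "loading")
```

Each row of `binary_table` then carries three 0/1 columns (`loading=lysate`, `loading=peptide`, `loading=rna`) in place of the original column, exactly one of which is 1.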
To identify the set of significant features, the summary table is
analyzed using inductive machine learning methods. To determine
the effectiveness of specific DC vaccine types, we choose the JSM
method (Anshakov et al., 1991), an algorithm for discovering
cause-and-effect relations. Although it has been used extensively
in various areas, its applications are limited to small-scale
problems (about a few tens of features). To overcome this
limitation, we propose to preselect features using a more
lightweight method, e.g. AQ (Wojtusiak et al., 2006). The idea of
the proposed feature pre-filtering is that only those features
that AQ (or another preprocessing method) uses to build its
decision rules are forwarded to the JSM method.
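The pre-filtering idea can be sketched as follows. The sketch implements neither AQ nor JSM; it assumes that a rule learner has already produced rules, represented here (hypothetically) as dictionaries that map a feature name to its set of admitted values. The names `features_used` and `prefilter` and the example columns are illustrative only.

```python
def features_used(rules):
    """Collect every feature that appears in at least one rule."""
    used = set()
    for rule in rules:
        used.update(rule.keys())
    return used


def prefilter(table, rules):
    """Keep only the columns that the rules refer to, so that the
    reduced table can be passed on to a more expensive method."""
    keep = features_used(rules)
    return [{f: v for f, v in row.items() if f in keep} for row in table]


# Hypothetical rules, as a lightweight rule learner might return them.
rules = [
    {"vaccine": {"A"}, "age": {"low", "medium"}},
    {"vaccine": {"B"}},
]
table = [
    {"vaccine": "A", "age": "low", "country": "X"},
    {"vaccine": "B", "age": "high", "country": "Y"},
]
reduced = prefilter(table, rules)  # the "country" column is dropped
```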
For this purpose, the table is represented as a matrix of feature
values $A = \{a_{ij}\}$. In this matrix, columns correspond to
features and rows to their values:

$$p_j \to (a_{1j}, a_{2j}, \ldots, a_{nj}), \tag{1}$$

and each group of patients (object) corresponds to its
description

$$o_i \to (p_1 = a_{i1}, p_2 = a_{i2}, \ldots, p_m = a_{im}), \tag{2}$$

where $p_j = a_{ij}$ is a characteristic of an object.
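The two views of the matrix in (1) and (2) can be illustrated with a small sketch. The matrix contents and the helper names `column` and `description` are assumptions made for this example; the code uses zero-based indices, whereas the formulas count from 1.

```python
# Hypothetical matrix A with n = 3 objects (rows) and m = 2 features (columns).
A = [
    ["low", "yes"],
    ["high", "no"],
    ["medium", "yes"],
]


def column(A, j):
    """Values of feature p_j over all objects, as in (1)."""
    return tuple(row[j] for row in A)


def description(A, i):
    """Description of object o_i as (feature, value) pairs, as in (2).
    Feature names are generated as p1, p2, ... for illustration."""
    return tuple((f"p{j + 1}", value) for j, value in enumerate(A[i]))


first_feature = column(A, 0)
second_object = description(A, 1)
```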
Continuous features are discretized as follows. The whole
interval of feature values is divided into three subintervals:
$w_1$ (low values), $w_2$ (medium values), and $w_3$ (high
values). The system divides all patients into classes according
to the success of the vaccination. As a result of learning, all
classes are described using a set of characteristics (AQ-rules
(Michalski, 1973; Wojtusiak et al., 2006)). Characteristic $h_j$
is a disjunction of feature value intervals:
$p_j = \bigcup_q w_q$.
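A minimal sketch of the three-interval discretization and of checking a characteristic follows. The text does not state how the interval boundaries are chosen; equal-width intervals are assumed here purely for illustration, and the names `discretize` and `satisfies` are hypothetical.

```python
def discretize(values):
    """Map each numeric value to "w1" (low), "w2" (medium) or "w3" (high).
    ASSUMPTION: the range [min, max] is cut into three equal-width parts."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / 3
    if width == 0:
        # All values are equal; assigning "w2" is an arbitrary choice.
        return ["w2"] * len(values)
    labels = []
    for v in values:
        if v < lo + width:
            labels.append("w1")
        elif v < lo + 2 * width:
            labels.append("w2")
        else:
            labels.append("w3")
    return labels


def satisfies(label, admitted):
    """A characteristic is a union of intervals, modelled here as a set
    of interval labels; an object satisfies it if its label is in the set."""
    return label in admitted


# Hypothetical feature values.
labels = discretize([0, 1, 2, 3, 4, 5, 6, 9])
low_or_medium = {"w1", "w2"}
matches = [satisfies(label, low_or_medium) for label in labels]
```

Representing a characteristic as a set of interval labels keeps the membership test trivial; other boundary choices (for example quantiles or expert-defined thresholds) would only change `discretize`.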
We propose to treat the process of rule generation as an
optimization problem that consists in finding a set of rules that
is as close to optimal as possible. However, classical
optimization procedures cannot be applied because of the large
number of features and their values, so it is reasonable to use a
genetic algorithm (GA). GAs have been extensively applied to
complex optimization problems with objective functions that are
specified algorithmically, a complex configuration of the
admissible region, multi-extremal functions, a large number of
variables, etc. (Goldberg, 1989).
We use a recently developed modification of the well-known GA,
the co-evolutionary asymptotic genetic algorithm (CAGA)
(Sergienko and Semenkin, 2013). It has fewer parameters than the
standard GA. It consists of several asymptotic probabilistic GAs
that work in parallel, compete for a common resource (the number
of individuals in the population), and share the best solutions
found with each other. The base algorithms have an adaptive
mutation operator and differ from each other in their selection
criteria. Such a combination of algorithms makes it unnecessary
to choose the selection, recombination and mutation operators,
which are specific to each discrete task.
Thus, we suggest running an iterative process that uses CAGA to
find the best rule, i.e. the one that covers the maximum number
of positive examples and uses the minimum number of
characteristics. Examples that satisfy
Assessment of Dendritic Cell Therapy Effectiveness Based on the Feature Extraction from Scientific Publications