mance. A new model can be constructed based on
the specific needs of the consumer. The same applies
to refreshing a model to reflect new trends in
the data. This is related to the concept of incremental
data mining, which was introduced in (Harries
et al., 1998). A related approach can be found in
(Kuncheva, 2004), where strategies for building en-
sembles of classifiers in non-stationary environments
in which even the classification task may change are
presented.
The current work aims to provide a general con-
text in which composition of data mining models can
be achieved within the boundaries of an online scor-
ing system called DeVisa. While in (Gorea, 2008) and
(DeVisa, 2007) a general description of the system is
given, the current paper focuses on the model compo-
sition aspect. DeVisa only stores the prediction mod-
els expressed in PMML (PMML, 2007), not the orig-
inal training data. Therefore we consider the training
data as not being available. Consequently, model
composition in DeVisa is limited to certain techniques,
which are described in Section 2. However, the consumer
application can build new models based on the existing
DeVisa models and its own validation set.
2 DEVISA COMPOSITION OF
PREDICTIVE MODELS
DeVisa supports the two types of composition methods
described in the PMML specification: model sequencing
and model selection. In its current version
(3.2), PMML supports the combination of decision
trees or rules and simple regression models.
Model sequencing is the case in which two or
more models are combined into a sequence where
the results of one model are used as input in another
model. Model sequencing is very often an intrin-
sic part of a model, namely a transformation func-
tion. For instance, a supervised discretization algo-
rithm is applied to a certain attribute, which is de-
scribed within a transformation dictionary, or missing
values for an attribute are filled using a transforma-
tion function made of decision rules. Model selection
is when one of many models can be selected based on
decision rules. A common model selection method
for optimizing prediction models is the combination
of segmentation and regression.
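
The two composition methods can be illustrated with a minimal Python sketch. It is not DeVisa or PMML code: the helpers `sequence` and `select`, the attribute names (`age`, `income`, `region`), the cut points and the regression coefficients are all hypothetical and chosen only to make the two patterns concrete.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
Predicate = Callable[[Record], bool]
Model = Callable[[Record], Any]


def sequence(*steps: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Model sequencing: the output of each step is the input of the next."""
    def run(value: Any) -> Any:
        for step in steps:
            value = step(value)
        return value
    return run


def select(rules: List[Tuple[Predicate, Model]]) -> Model:
    """Model selection: the first rule whose predicate holds picks the model."""
    def run(record: Record) -> Any:
        for predicate, model in rules:
            if predicate(record):
                return model(record)
        raise ValueError("no model selected for record")
    return run


# Hypothetical transformation: discretize a numeric attribute.
def discretize_age(record: Record) -> Record:
    age = record["age"]
    label = "young" if age < 30 else ("adult" if age < 60 else "senior")
    return {**record, "age_bin": label}


# Hypothetical rule model that consumes the discretized attribute.
def risk_rule(record: Record) -> str:
    return "high" if record["age_bin"] == "young" else "low"


# Hypothetical per-segment regressions (segmentation plus regression).
def urban_regression(record: Record) -> float:
    return 2.0 * record["income"] + 1.0


def rural_regression(record: Record) -> float:
    return 1.5 * record["income"] + 0.5


sequenced = sequence(discretize_age, risk_rule)
selected = select([
    (lambda r: r["region"] == "urban", urban_regression),
    (lambda r: True, rural_regression),  # default segment
])
```

Here `sequenced` feeds the discretized record into the rule model, while `selected` uses decision rules on `region` to choose which regression scores the record.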
Although producer applications can upload
composite models to DeVisa, in this section we focus
instead on the situations in which DeVisa is responsible
for composing the models. Depending on
the moment when the composition process occurs, we
can further classify composition in DeVisa as
implicit or explicit.
Implicit composition is the situation in which the models
are composed within the orchestration of the scoring
or search service (see Section 3).
A scoring goal (query) is a tuple (MSpec, R),
where
1. MSpec is the model specification, defined as
MSpec ::= {MRef} | ({Filter}, SRef | S[, DRef])
The model specification has several instances:
Exact model case, in which exact references to
one or more DeVisa models that the consumer wishes
to score on are given via MRef.
Exact schema case, in which the consumer gives
an exact reference to a mining schema SRef and
wishes to score on the models complying with that
schema. In addition, a set of filters corresponding
to the properties that the models need to
conform to can be specified via the Filter element.
Match schema case, in which S describes a mining
schema that needs to be matched against one or
more schemas in the DeVisa Catalog. To restrict the search,
an existing DeVisa data dictionary can optionally be
referenced via DRef. A reference to an ontology that
explains the terminology can also be included, as can an
optional Filter element.
2. R is the dataset to score.
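
As an illustration only, the scoring goal and its three specification cases could be modelled with the following Python data classes. The class and field names are hypothetical and are not part of the DeVisa interface; the schema S is assumed to be encoded as a mapping from field names to data types, and the optional ontology reference is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union


@dataclass
class ExactModel:
    """{MRef}: explicit references to one or more stored models."""
    model_refs: List[str]

    def __post_init__(self) -> None:
        if not self.model_refs:
            raise ValueError("at least one model reference is required")


@dataclass
class ExactSchema:
    """({Filter}, SRef): a reference to a stored mining schema."""
    schema_ref: str
    filters: List[str] = field(default_factory=list)


@dataclass
class MatchSchema:
    """({Filter}, S[, DRef]): an inline schema to match against the catalog."""
    schema: Dict[str, str]                # field name -> data type (assumed encoding)
    dictionary_ref: Optional[str] = None  # optional DRef
    filters: List[str] = field(default_factory=list)


MSpec = Union[ExactModel, ExactSchema, MatchSchema]


@dataclass
class ScoringGoal:
    """The tuple (MSpec, R)."""
    mspec: MSpec
    records: List[Dict[str, Any]]         # R, the dataset to score


def case_name(goal: ScoringGoal) -> str:
    """Return which instance of the model specification the goal uses."""
    if isinstance(goal.mspec, ExactModel):
        return "exact model"
    if isinstance(goal.mspec, ExactSchema):
        return "exact schema"
    return "match schema"
```

For example, `ScoringGoal(MatchSchema({"age": "integer"}), [{"age": 41}])` would represent a match schema query with no data dictionary reference and no filters.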
Implicit model composition is applicable in
two situations, given that the consumer allows scoring
on composite models:
1. More than one model complying with MSpec can be found;
2. No model complying with MSpec can be found.
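
The two conditions above amount to a simple check on the number of matching models. The sketch below is only an illustration under that reading; the names `Trigger` and `implicit_composition_trigger` are hypothetical and do not come from DeVisa.

```python
from enum import Enum
from typing import Optional, Sequence


class Trigger(Enum):
    SEVERAL_MODELS = "several models comply with MSpec"
    NO_MODEL = "no model complies with MSpec"


def implicit_composition_trigger(
    matching_models: Sequence[str], allow_composite: bool
) -> Optional[Trigger]:
    """Return the situation that makes implicit composition applicable, if any."""
    if not allow_composite:
        return None  # the consumer did not allow scoring on composite models
    if len(matching_models) > 1:
        return Trigger.SEVERAL_MODELS
    if len(matching_models) == 0:
        return Trigger.NO_MODEL
    return None  # exactly one model: it can be scored directly
```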
The first situation can occur in all the model
specification instances (exact model, exact schema or
match schema).
In the first two instances, given the model or
schema references, an existing data dictionary is
implicit. In the match schema case, a mining schema
S and a reference D to a data dictionary in DeVisa
should be provided.
In the exact model case, all the referenced models are
retrieved. In the exact schema case, the engine finds
all the models complying with the specified schema. In
the match schema case, the engine tries to find one
or more models that match S. The composition described
below applies to the situation in which more than one
model satisfies the requirements. These models are combined
to give the best prediction as follows. The composer
component of the engine (see Section 3.1) scores on all the
models, then applies a voting procedure (similar to
the bagging approach) and returns either the outcome
that has the highest vote (in the case of categorical
predictions), or the average (in the case of numeric
TOWARDS ONLINE COMPOSITION OF PMML PREDICTION MODELS