APPENDIX
Solving the Intermediate Optimization Problem
The optimization problem (21) for fixed $w$ satisfying the subnormalization constraints and given training set $\mathcal{T}$ of $N$ samples can be rewritten as
$$
\begin{aligned}
\underset{\gamma,\,\varepsilon^1,\ldots,\varepsilon^N}{\text{minimize}}\quad & \frac{1}{2\gamma^2} + B \sum_{n=1}^{N} \varepsilon^n \\
\text{s.t.}\quad & \tilde{x}^{n,c} \geq \gamma - \varepsilon^n, \quad \forall n \text{ and } c \neq c^{(n)}, \\
& \gamma \geq 0, \quad \varepsilon^n \geq 0, \quad \forall n,
\end{aligned}
\tag{23}
$$
where we set $\tilde{x}^{n,c} = \Delta^{n,c} w$. For $n = 1, \ldots, N$ let
$$
x^n = \min_{c \in \mathcal{C},\, c \neq c^{(n)}} \tilde{x}^{n,c}.
$$
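For illustration, the quantities $x^n$ can be computed directly from the vectors $\Delta^{n,c}$ and the fixed parameters $w$. The following NumPy sketch assumes, purely as an illustrative convention not taken from the paper, that the $\Delta^{n,c}$ are stacked into an array of shape $(N, |\mathcal{C}|, D)$ and that the true classes $c^{(n)}$ are given as integer indices:

```python
import numpy as np

def min_margins(Delta, w, labels):
    """Sketch: x^n = min_{c != c^(n)} tilde{x}^{n,c}, with tilde{x}^{n,c} = Delta^{n,c} w.

    Assumed (illustrative) data layout:
      Delta  : array of shape (N, C, D); Delta[n, c] holds the row vector Delta^{n,c}
      w      : fixed parameter vector of length D (subnormalization assumed to hold)
      labels : integer array of length N with the true class c^(n) of each sample
    """
    x_tilde = Delta @ w                      # shape (N, C): all tilde{x}^{n,c}
    n_idx = np.arange(len(labels))
    x_tilde[n_idx, labels] = np.inf          # exclude the true class c = c^(n)
    return x_tilde.min(axis=1)               # x^n for n = 1, ..., N
```

These per-sample minima are the only sample statistics needed in the reduced problems (24) and (25) below.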
Then, the above problem is equivalent to
$$
\begin{aligned}
\underset{\gamma,\,\varepsilon^1,\ldots,\varepsilon^N}{\text{minimize}}\quad & \frac{1}{2\gamma^2} + B \sum_{n=1}^{N} \varepsilon^n \\
\text{s.t.}\quad & x^n \geq \gamma - \varepsilon^n, \quad \forall n, \\
& \gamma \geq 0, \quad \varepsilon^n \geq 0, \quad \forall n,
\end{aligned}
\tag{24}
$$
because the removed constraints are satisfied automatically: $x^n$ is the minimum of $\tilde{x}^{n,c}$ over all $c \neq c^{(n)}$, so $x^n \geq \gamma - \varepsilon^n$ implies $\tilde{x}^{n,c} \geq \gamma - \varepsilon^n$ for every such $c$. In an optimal solution with margin $\gamma'$, the term $\sum_{n=1}^{N} \varepsilon^n$ must be as small as possible. Therefore, every $\varepsilon^n$ takes the smallest value that is still feasible, namely $\varepsilon^n = \gamma' - x^n$ if this quantity is positive, and $\varepsilon^n = 0$ otherwise. In this way, the optimization problem becomes
$$
\begin{aligned}
\underset{\gamma}{\text{minimize}}\quad & \frac{1}{2\gamma^2} + B \sum_{n=1}^{N} \max\{\gamma - x^n, 0\} \\
\text{s.t.}\quad & \gamma \geq 0
\end{aligned}
\tag{25}
$$
and can be easily solved. If required, the slacks $\varepsilon^n$ can subsequently be calculated as $\varepsilon^n = \max\{\gamma - x^n, 0\}$.
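Problem (25) is a one-dimensional strictly convex problem on $\gamma > 0$: the objective tends to infinity as $\gamma \to 0^+$, and away from the kinks its derivative $-1/\gamma^3 + B\,|\{n : x^n < \gamma\}|$ becomes positive for large $\gamma$, so the minimizer can be bracketed and found by any scalar search. The following sketch, an illustrative ternary search rather than the procedure used in the paper, solves (25) and recovers the slacks under the assumption $B > 0$:

```python
import numpy as np

def solve_gamma_subproblem(x, B, iters=200):
    """Sketch: minimize 1/(2*gamma^2) + B * sum_n max(gamma - x^n, 0) over gamma > 0, cf. (25).

    x : array of per-sample minimal margins x^n
    B : trade-off parameter (assumed B > 0)
    Returns the optimal gamma and the slacks eps^n = max(gamma - x^n, 0).
    """
    x = np.asarray(x, dtype=float)
    N = len(x)

    def objective(gamma):
        return 0.5 / gamma**2 + B * np.maximum(gamma - x, 0.0).sum()

    # Bracket the minimizer: for gamma above both max(x^n) and (1/(B*N))^(1/3),
    # the derivative -1/gamma^3 + B*N is positive, so the optimum lies below 'hi'.
    lo = 1e-12
    hi = max(x.max(), 0.0) + (1.0 / (B * N)) ** (1.0 / 3.0) + 1.0

    # Ternary search; the objective is strictly convex (hence unimodal) on (0, inf).
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) <= objective(m2):
            hi = m2
        else:
            lo = m1

    gamma = 0.5 * (lo + hi)
    return gamma, np.maximum(gamma - x, 0.0)
```

A ternary search is used only to keep the sketch dependency-free; any bounded scalar minimizer applied to the same objective would serve equally well.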