Deng, L. and Platt, J. (2014). Ensemble deep learning for speech recognition.
Eigen, D., Ranzato, M., and Sutskever, I. (2013). Learning factored representations in a deep mixture of experts. arXiv preprint arXiv:1312.4314.
Fei-Fei, L., Fergus, R., and Perona, P. (2007). Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1):59–70.
Ge, W. and Yu, Y. (2017). Borrowing treasures from the wealthy: Deep transfer learning through selective joint fine-tuning. arXiv preprint arXiv:1702.08690.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587.
Griffin, G., Holub, A., and Perona, P. (2007). Caltech-256 object category dataset.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
Ju, C., Bibaut, A., and van der Laan, M. J. (2017). The relative performance of ensemble methods with deep convolutional neural networks for image classification. arXiv preprint arXiv:1704.01664.
Kim, Y.-D., Jang, T., Han, B., and Choi, S. (2016). Learning to select pre-trained deep representations with Bayesian evidence framework. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5318–5326.
van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.
Masoudnia, S. and Ebrahimpour, R. (2014). Mixture of experts: A literature survey. Artificial Intelligence Review, pages 1–19.
Nilsback, M.-E. and Zisserman, A. (2006). A visual vocabulary for flower classification. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 1447–1454. IEEE.
Nilsback, M.-E. and Zisserman, A. (2008). Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing.
Oquab, M., Bottou, L., Laptev, I., and Sivic, J. (2014). Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1717–1724.
Parkhi, O. M., Vedaldi, A., Zisserman, A., and Jawahar, C. V. (2012). Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition.
Pennington, J., Socher, R., and Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Qiu, X., Zhang, L., Ren, Y., Suganthan, P. N., and Amaratunga, G. (2014). Ensemble deep learning for regression and time series forecasting. In 2014 IEEE Symposium on Computational Intelligence in Ensemble Learning (CIEL), pages 1–6. IEEE.
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61:85–117.
Sharif Razavian, A., Azizpour, H., Sullivan, J., and Carlsson, S. (2014). CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 806–813.
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826.
Theagarajan, R., Pala, F., and Bhanu, B. (2017). EDEN: Ensemble of deep networks for vehicle classification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 906–913. IEEE.
Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 Dataset. Technical report.
Yao, B., Jiang, X., Khosla, A., Lin, A. L., Guibas, L., and Fei-Fei, L. (2011). Human action recognition by learning bases of action attributes and parts. In 2011 IEEE International Conference on Computer Vision (ICCV), pages 1331–1338. IEEE.
Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328.
Zhao, T., Yu, J., Kuang, Z., Zhang, W., and Fan, J. (2017). Deep mixture of diverse experts for large-scale visual recognition. arXiv preprint arXiv:1706.07901.
ICPRAM 2019 - 8th International Conference on Pattern Recognition Applications and Methods