Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France. PMLR.
Kingma, D. P. and Welling, M. (2014). Auto-encoding
variational bayes. In 2nd International Conference
on Learning Representations, ICLR 2014, Banff, AB,
Canada, April 14-16, 2014, Conference Track Pro-
ceedings, volume abs/1312.6114.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012).
Imagenet classification with deep convolutional neu-
ral networks. In Pereira, F., Burges, C. J. C., Bottou,
L., and Weinberger, K. Q., editors, Advances in Neu-
ral Information Processing Systems, volume 25, pages
1097–1105. Curran Associates, Inc.
Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J.,
and Han, J. (2020). On the variance of the adap-
tive learning rate and beyond. In International Con-
ference on Learning Representations, ICLR 2020,
https://openreview.net/forum?id=rkgz2aEKDr.
Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). "Why
should I trust you?": Explaining the predictions of any
classifier. In Proceedings of the 22nd ACM SIGKDD
International Conference on Knowledge Discovery
and Data Mining, San Francisco, CA, USA, August
13-17, 2016, pages 1135–1144.
Schapire, R. E. (1999). A brief introduction to boosting.
In Proceedings of the 16th International Joint Confer-
ence on Artificial Intelligence - Volume 2, IJCAI’99,
pages 1401–1406, San Francisco, CA, USA. Morgan
Kaufmann Publishers Inc.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R.,
Parikh, D., and Batra, D. (2017). Grad-cam: Visual
explanations from deep networks via gradient-based
localization. In Proceedings of the IEEE International
Conference on Computer Vision (ICCV), pages 618–
626.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
CoRR, abs/1409.1556.
Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. (2017). SmoothGrad: removing noise by adding noise. ArXiv, abs/1706.03825.
Taga, S., Tomofumi, M., Takimoto, M., and Kambayashi, Y.
(2019). Multi-agent base evacuation support system
using MANET. Vietnam Journal of Computer Science,
06(02):177–191.
Tan, M. and Le, Q. (2019). EfficientNet: Rethinking model
scaling for convolutional neural networks. volume 97
of Proceedings of Machine Learning Research, pages
6105–6114, Long Beach, California, USA. PMLR.
Van Amersfoort, J., Smith, L., Teh, Y. W., and Gal, Y.
(2020). Uncertainty estimation using a single deep de-
terministic neural network. In III, H. D. and Singh, A.,
editors, Proceedings of the 37th International Con-
ference on Machine Learning, volume 119 of Pro-
ceedings of Machine Learning Research, pages 9690–
9700, Virtual. PMLR.
van den Oord, A., Vinyals, O., and kavukcuoglu, k. (2017).
Neural discrete representation learning. In Guyon,
I., Luxburg, U. V., Bengio, S., Wallach, H., Fer-
gus, R., Vishwanathan, S., and Garnett, R., editors,
Advances in Neural Information Processing Systems,
volume 30, pages 6306–6315. Curran Associates, Inc.
Asano, Y. M., Rupprecht, C., and Vedaldi, A. (2020). Self-labelling via simultaneous clustering and representation learning. In International Conference on Learning Representations, ICLR 2020, https://openreview.net/forum?id=Hyx-jyBFPr.
Zagoruyko, S. and Komodakis, N. (2016). Wide residual
networks. In Richard C. Wilson, E. R. H. and Smith,
W. A. P., editors, Proceedings of the British Ma-
chine Vision Conference (BMVC), pages 87.1–87.12.
BMVA Press.
Zeiler, M. D. and Fergus, R. (2014). Visualizing and under-
standing convolutional networks. In Fleet, D., Pajdla,
T., Schiele, B., and Tuytelaars, T., editors, Computer
Vision – ECCV 2014, pages 818–833, Cham. Springer
International Publishing.
Zhang, M., Lucas, J., Ba, J., and Hinton, G. E. (2019).
Lookahead optimizer: k steps forward, 1 step back.
In Wallach, H., Larochelle, H., Beygelzimer, A.,
d'Alché-Buc, F., Fox, E., and Garnett, R., editors,
Advances in Neural Information Processing Systems,
volume 32, pages 9597–9608. Curran Associates, Inc.
Zhao, H., Jia, J., and Koltun, V. (2020). Exploring self-
attention for image recognition. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 10076–10085.
APPENDIX
Architecture. Our classification model is built by stacking residual blocks, as in Wide ResNet (Zagoruyko and Komodakis, 2016), followed by vector quantization (VQ) and batch normalization (Ioffe and Szegedy, 2015), as shown in Table 3. We set N = 28 and k = 2, and the embedding space contains 256 embeddings.
Table 3: Classifier Architecture.

Group name   Output size   Block type
conv1        32 × 32       [3 × 3, 16]
conv2        32 × 32       [3 × 3, 16 × k; 3 × 3, 16 × k] × N
conv3        16 × 16       [3 × 3, 32 × k; 3 × 3, 32 × k] × N
conv4        8 × 8         [3 × 3, 64 × k; 3 × 3, 64 × k] × N
bn           8 × 8         [1 × 1]
vq           8 × 8         [1 × 1]
avg-pool     1 × 1         [8 × 8]
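The vq step above replaces each spatial vector of the 8 × 8 feature map with its nearest entry from the 256-embedding codebook. A minimal NumPy sketch of this lookup follows; the feature dimension (64) and the function names are our assumptions for illustration, since the paper specifies only the 8 × 8 grid and the 256-entry embedding space.

```python
import numpy as np

def vector_quantize(features, codebook):
    """Map each spatial feature vector to its nearest codebook embedding.

    features: (H, W, D) array, e.g. the 8x8 output of the last conv group.
    codebook: (K, D) array of K learned embeddings (K = 256 here).
    Returns the quantized map (H, W, D) and the chosen indices (H, W).
    """
    H, W, D = features.shape
    flat = features.reshape(-1, D)                                # (H*W, D)
    # Squared Euclidean distance from every vector to every embedding.
    dist = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (H*W, K)
    idx = dist.argmin(axis=1)                                     # nearest ids
    return codebook[idx].reshape(H, W, D), idx.reshape(H, W)

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 8, 64))   # hypothetical 64-dim conv features
codes = rng.normal(size=(256, 64))    # 256-entry embedding space
quantized, indices = vector_quantize(feats, codes)
```

In training, the gradient is typically passed straight through this non-differentiable lookup; the sketch shows only the forward quantization.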
In addition, we designed a decoder, symmetric to the classifier, that reconstructs the original image from the classifier's embedding (Table 4).
SDMIS 2021 - Special Session on Super Distributed and Multi-agent Intelligent Systems