evaluation on part A are shown in Table 5. Our MUD-ikNN network slightly outperforms the state-of-the-art approaches on this part. The results of our evaluation
on part B are shown in Table 6. Here our network per-
forms on par or slightly worse than the best-performing
methods. Notably, our method appears to perform better
on denser crowd images, and ShanghaiTech Part B is
by far the least dense dataset we tested.
The third dataset we evaluated our approach on
is the UCF-CC-50 dataset (Idrees et al., 2013). We
followed the standard evaluation protocol for this dataset, five-fold cross-validation. The results of our evaluation on this dataset can be seen in Table 7.
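The five-fold protocol can be sketched as follows. This is a minimal illustration of the evaluation procedure only; the toy mean-count predictor stands in for a trained network and is not the paper's actual pipeline:

```python
# Sketch of five-fold cross-evaluation as used for UCF-CC-50:
# split the images into 5 folds, evaluate on each held-out fold
# in turn, and average the mean absolute error across folds.
import numpy as np

def five_fold_mae(true_counts, n_folds=5):
    """Average the held-out MAE over n_folds splits."""
    indices = np.arange(len(true_counts))
    folds = np.array_split(indices, n_folds)
    fold_maes = []
    for test_idx in folds:
        train_idx = np.setdiff1d(indices, test_idx)
        # A real evaluation trains a network on train_idx here; this
        # toy predictor just outputs the mean training-set count.
        mean_count = np.mean(true_counts[train_idx])
        preds = np.full(len(test_idx), mean_count)
        fold_maes.append(np.mean(np.abs(preds - true_counts[test_idx])))
    return float(np.mean(fold_maes))

counts = np.array([100.0, 200.0, 300.0, 400.0, 500.0])
print(five_fold_mae(counts))  # → 150.0
```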
Overall, our network performed favorably com-
pared with existing approaches. An advantage of our approach is that our modifications can be applied to the architectures we compare against. The most relevant comparison is between the ikNN version of the MUD network and the density map version. Here, the ikNN approach always outperformed the density map version. We speculate that the state-of-the-art methods we have compared with, along with other general-purpose CNNs, could be improved through the use of ikNN labels and upsampling map modules.
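As a rough illustration of such a labeling, the following sketch builds an ikNN-style map from head point annotations. It assumes each pixel stores the inverse of one plus its mean distance to the k nearest annotated heads; the exact formula and normalization used in the paper may differ, and the function name and shapes are illustrative:

```python
# Sketch of an inverse k-nearest-neighbor (ikNN) label map built
# from point annotations. Each pixel's value is 1 / (1 + d_k),
# where d_k is the mean distance from that pixel to the k nearest
# annotated head positions, so values peak at 1.0 on the heads.
import numpy as np

def iknn_map(head_points, height, width, k=3):
    """Build an ikNN label map from (row, col) point annotations."""
    pts = np.asarray(head_points, dtype=np.float64)        # (N, 2)
    rows, cols = np.mgrid[0:height, 0:width]
    pixels = np.stack([rows, cols], axis=-1).reshape(-1, 1, 2)
    # Distance from every pixel to every head point: (H*W, N).
    dists = np.linalg.norm(pixels - pts[None, :, :], axis=-1)
    # Mean of the k smallest distances per pixel.
    k = min(k, dists.shape[1])
    d_k = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    return (1.0 / (1.0 + d_k)).reshape(height, width)

label = iknn_map([(2, 2), (5, 5)], height=8, width=8, k=1)
print(label[2, 2])  # → 1.0, a peak at an annotated head position
```

Unlike a Gaussian-smoothed density map, such a map is nonzero far from annotations, which gives the network a gradient signal over the whole image.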
6 CONCLUSIONS
We have presented a new form of labeling for crowd counting data, the ikNN map. We have compared this labeling scheme to the commonly accepted labeling approach for crowd counting, the density map. We show that using the ikNN map with an existing state-of-the-art network improves the accuracy of the network compared to density map labelings. We have
demonstrated the improvements gained by using in-
creased label resolutions, and provide an upsampling
map module which can be generally used by other
crowd counting architectures. These approaches can be used as drop-in replacements in other crowd counting architectures, as we have done for DenseNet, which
resulted in a network which performs favorably com-
pared with the state-of-the-art.
ACKNOWLEDGMENTS
This research is supported by the National Science Foun-
dation through Awards PFI #1827505 and SCC-
Planning #1737533, and Bentley Systems, Incorpo-
rated, through a CUNY-Bentley Collaborative Re-
search Agreement (CRA). Additional support is pro-
vided by the Intelligence Community Center of Aca-
demic Excellence (IC CAE) at Rutgers University.
REFERENCES
Arteta, C., Lempitsky, V., and Zisserman, A. (2016). Count-
ing in the wild. In European conference on computer
vision, pages 483–498. Springer.
Badrinarayanan, V., Kendall, A., and Cipolla, R. (2017). SegNet: A deep convolutional encoder-decoder architec-
ture for image segmentation. IEEE transactions on pat-
tern analysis and machine intelligence, 39(12):2481–
2495.
Cao, X., Wang, Z., Zhao, Y., and Su, F. (2018). Scale
aggregation network for accurate and efficient crowd
counting. In Proceedings of the European Conference
on Computer Vision (ECCV), pages 734–750.
Chan, A. B., Liang, Z.-S. J., and Vasconcelos, N. (2008). Pri-
vacy preserving crowd monitoring: Counting people
without people models or tracking. In Computer Vi-
sion and Pattern Recognition, 2008. CVPR 2008. IEEE
Conference on, pages 1–7. IEEE.
Chen, K., Gong, S., Xiang, T., and Change Loy, C. (2013).
Cumulative attribute space for age and crowd density
estimation. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 2467–
2474.
Chen, K., Loy, C. C., Gong, S., and Xiang, T. (2012). Fea-
ture mining for localised crowd counting. In BMVC,
volume 1, page 3.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger,
K. Q. (2017). Densely connected convolutional net-
works. In CVPR, volume 1, page 3.
Idrees, H., Saleemi, I., Seibert, C., and Shah, M. (2013).
Multi-source multi-scale counting in extremely dense
crowd images. In Proceedings of the IEEE confer-
ence on computer vision and pattern recognition, pages
2547–2554.
Idrees, H., Tayyab, M., Athrey, K., Zhang, D., Al-Maadeed,
S., Rajpoot, N., and Shah, M. (2018). Composition loss
for counting, density map estimation and localization
in dense crowds. arXiv preprint arXiv:1808.01050.
Laradji, I. H., Rostamzadeh, N., Pinheiro, P. O., Vazquez,
D., and Schmidt, M. (2018). Where are the blobs:
Counting by localization with point supervision. In
Proceedings of the European Conference on Computer
Vision (ECCV), pages 547–562.
Lempitsky, V. and Zisserman, A. (2010). Learning to count
objects in images. In Advances in neural information
processing systems, pages 1324–1332.
Li, Y., Zhang, X., and Chen, D. (2018). CSRNet: Dilated
convolutional neural networks for understanding the
highly congested scenes. In Proceedings of the IEEE
conference on computer vision and pattern recognition,
pages 1091–1100.
Lin, Z. and Davis, L. S. (2010). Shape-based human detec-
tion and segmentation via hierarchical part-template
matching. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 32(4):604–618.
VISAPP 2020 - 15th International Conference on Computer Vision Theory and Applications