Fully Convolutional Crowd Counting on Highly Congested Scenes
Mark Marsden, Kevin McGuinness, Suzanne Little and Noel E. O’Connor
Insight Centre for Data Analytics, Dublin City University, Dublin, Ireland
Keywords:
Computer Vision, Crowd Counting, Deep Learning.
Abstract:
In this paper we advance the state-of-the-art for crowd counting in high density scenes by further exploring the
idea of a fully convolutional crowd counting model introduced by (Zhang et al., 2016). Producing an accurate
and robust crowd count estimator using computer vision techniques has attracted significant research interest in
recent years. Applications for crowd counting systems exist in many diverse areas including city planning, retail,
and of course general public safety. Developing a highly generalised counting model that can be deployed in
any surveillance scenario with any camera perspective is the key objective for research in this area. Techniques
developed in the past have generally performed poorly in highly congested scenes with several thousand people in frame (Rodriguez et al., 2011). Our approach, influenced by the work of (Zhang et al., 2016),
consists of the following contributions: (1) A training set augmentation scheme that minimises redundancy
among training samples to improve model generalisation and overall counting performance; (2) a deep, single
column, fully convolutional network (FCN) architecture; (3) a multi-scale averaging step during inference. The
developed technique can analyse images of any resolution or aspect ratio and achieves state-of-the-art counting
performance on the Shanghaitech Part B and UCF CC 50 datasets as well as competitive performance on
Shanghaitech Part A.
1 INTRODUCTION
Vision based crowd size estimation, often referred to
as crowd counting, has become an important topic for
the computer vision community. Crowd counting al-
gorithms attempt to produce an accurate estimation of
the true number of people present in a crowded scene.
A crowd count is inherently more objective than other
crowd size representations (e.g. crowd density level)
but is also more challenging to produce. Accurate
knowledge of the crowd size in a public space can
provide valuable insight for tasks such as city planning, analysing consumer shopping patterns, and maintaining general crowd safety. Several key challenges, such as visual occlusions and high levels of variation in scene content, have limited progress in this
area. Techniques developed for crowd counting can
also be applied to tasks from other domains such as
counting bacteria or cells in microscopic images (Xie
et al., 2016).
Related work.
Existing approaches to crowd counting
largely fall into two categories: counting by detection
and counting by regression.
Counting by detection approaches involve training
a visual object detector to find and count each person in
the scene. Each human is assumed to be an individual
entity that must be found. These algorithms (Wu and
Nevatia, 2005; Lin et al., 2001; Ge and Collins, 2009)
are computationally demanding, requiring the image
to be exhaustively analysed at multiple scales due to
perspective issues, which alter the size of people in
different parts of the scene. The robustness of these
object detectors also suffers significantly due to visual
occlusions, resulting in rapid performance degradation
as a crowd becomes highly congested (i.e. several
hundred people in frame).
Counting by regression techniques (Change Loy
et al., 2013; Chen et al., 2012; Lempitsky and Zisser-
man, 2010; Liu and Tao, 2014; Chan and Vasconce-
los, 2012) on the other hand attempt to learn a direct
mapping between low-level features and the overall
number of people in frame or within a frame region. In-
dividual people are not explicitly detected or tracked in
these approaches, meaning visual occlusions have less
impact on counting accuracy. While generally more
computationally efficient than counting by detection
methods, regression-based techniques have suffered
greatly from overfitting in the past due to a lack of
varied training data. To remedy this, a number of high
density, high variation crowd counting datasets such
as UCF CC 50 (Idrees et al., 2013) and Shanghaitech
(Zhang et al., 2016) have emerged. Recent advance-
ments in graphical processing unit (GPU) hardware
and the availability of very large, labelled datasets
such as ImageNet (Deng et al., 2009) have resulted in
deep learning approaches such as convolutional neu-
ral networks (CNN) achieving state-of-the-art perfor-
mance in many computer vision tasks (image classifi-
cation, face detection, object detection). Deep learning
techniques have recently been applied to the task of
regression-based crowd counting (Zhang et al., 2015;
Hu et al., 2016; Zhang et al., 2016), resulting in a no-
table improvement in counting accuracy, especially for
high density scenes (i.e. where there are 1000+ people
in frame).
Fully convolutional networks (FCN) are a unique
variation on the CNN technique where a proportionally
sized feature map output is produced for a given input
image rather than a classification label or regression
score. FCNs have been used for a variety of tasks
including semantic segmentation (Long et al., 2015)
and saliency prediction (Pan et al., 2016). Zhang et
al. (Zhang et al., 2016) trained an FCN to transform
an image of a crowded scene into a crowd density
heatmap, which when integrated produces a highly ac-
curate crowd count estimate, even for very challenging
scenes. One of the key aspects of fully convolutional
nets that makes the method particularly suited to crowd
counting is the use of a variable size input, allowing the
model to avoid the loss of detail and visual distortions
typically encountered during image downsampling and
reshaping.
Contributions of this Paper.
The core objective of
this paper is to achieve highly accurate crowd counting
on densely congested scenes. This study will further
explore the idea of a fully convolutional crowd count-
ing model originally introduced by (Zhang et al., 2016).
The core contributions can be summarized as follows:

1. A training set augmentation scheme is proposed which minimises redundancy among training samples in order to improve model generalisation and overall counting performance.

2. A deep, single column, fully convolutional network is used to generate crowd density heatmaps. The greater model capacity improves the FCN's ability to learn the highly abstract, nonlinear relationships present in crowd counting datasets.

3. To overcome the scale and perspective issues that often limit the accuracy of crowd counting algorithms, a given test image is fed into the network at multiple scales (e.g. original size + 80% original size). The crowd count is estimated for each scale and the mean is taken as the overall estimate. This simple step taken during inference results in significant performance gains.

Table 1: Shanghaitech Part B validation performance using different training set augmentation schemes. Horizontal flips are used in all cases.

Augmentation Scheme                              MAE   MSE
None                                             30.5  47.5
4 quadrant crops + 1 overlapping centre crop     25.5  36.5
4 quadrant crops                                 24.1  33.5
2 A FULLY CONVOLUTIONAL NETWORK FOR CROWD COUNTING
A fully convolutional network (FCN) allows for the in-
put images used during training and inference to be of
any resolution and aspect ratio, thanks to the absence
of any fully connected layers. Rather than produce
a fixed size classification or regression output, FCNs
generate a feature map or set of feature maps propor-
tionally sized to the input image. This type of network
can then be used for a range of tasks including image
transformation and pixel wise regression/classification
(Pan et al., 2016; Long et al., 2015).
Zhang et al. (Zhang et al., 2016) trained an FCN to
transform an image of a crowded scene into a crowd
density heatmap, which when integrated produces a
highly accurate crowd count estimate. In order to train
a network to produce this function a set of ground
truth heatmap images must be generated for which the
integral is equal to the pedestrian count. The head
annotations found in most crowd counting datasets can
be used to this end. For each of the N head annotations associated with a given training image, a unit impulse is added to the heatmap ground truth at the given location, as described in Equation 1, where x_i is the position of a given head:

H(x) = \sum_{i=1}^{N} \delta(x - x_i)    (1)
To convert this discrete density heatmap to a continuous function, convolution with an adaptive Gaussian kernel G_{σ_i} is applied for each head annotation (Zhang et al., 2016). The spread parameter σ_i used for a given head annotation x_i is chosen based on the mean distance to the 5 nearest heads, d̄_i, using Equation 2.
Figure 1: Fully convolutional network architecture used to perform crowd counting. Each convolutional layer is followed by a ReLU activation layer apart from "Convolution 6". A 2D integration (simply an element-wise sum in this case) is applied to the network output in order to produce the crowd count estimate value.
Distance to the surrounding heads roughly correlates with proximity to the camera, so more smoothing is applied the closer a pedestrian is to the camera, helping to account for perspective distortion. The 0.3 weighting was found empirically by (Zhang et al., 2016) to produce optimal results and is maintained here. This fully convolutional approach to crowd counting forms the basis of our technique.
σ_i = 0.3 \bar{d}_i    (2)
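For illustration, the ground-truth generation described by Equations 1 and 2 can be sketched in Python. This is a minimal sketch, not the authors' released code; it assumes head annotations are given as (row, col) pixel coordinates, and the function name and the single-head fallback spread are our own.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.spatial import cKDTree

def density_map(heads, shape, beta=0.3, k=5):
    """Sum of unit impulses (Eq. 1), each smoothed by a Gaussian whose
    spread is beta times the mean distance to the k nearest heads (Eq. 2)."""
    density = np.zeros(shape, dtype=np.float32)
    if len(heads) == 0:
        return density
    # k + 1 neighbours are queried because each head's nearest neighbour
    # is itself (distance 0), which is discarded below.
    dists, _ = cKDTree(heads).query(heads, k=min(k + 1, len(heads)))
    for (r, c), d in zip(heads, dists):
        sigma = beta * d[1:].mean() if len(heads) > 1 else 15.0  # hypothetical fallback
        impulse = np.zeros(shape, dtype=np.float32)
        impulse[int(r), int(c)] = 1.0
        # Each smoothed impulse still integrates to ~1, so the final
        # density map sums to the true head count N.
        density += gaussian_filter(impulse, sigma)
    return density
```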
2.1 Training Set Augmentation Scheme
The training set generation scheme, and particularly the chosen augmentation techniques, play an important role in the strong counting accuracy achieved by our method. Most crowd counting datasets consist
of only a few hundred images, making augmentation
an essential step. Taking several image crops to in-
crease training set size and variation is a common aug-
mentation technique used in computer vision. While it
is perfectly acceptable to allow these crops to overlap
for image recognition tasks, a network trained on a pixel-wise task can potentially overfit when it is continually exposed to the same set of pixels during training. Therefore our
augmentation scheme is developed to ensure there is
no such redundancy. For each training set image the
four image quadrants as well as their horizontal flips
are taken as training samples, ensuring no overlap. In
order to validate this augmentation scheme the Shang-
haitech Part B training set is further split into training
and validation subsets using a 9:1 ratio. Table 1 high-
lights the difference in validation performance when
our model is trained on a dataset with and without over-
lapping crops. Both runs are trained from scratch using
the same network architecture. This simple change re-
sults in a notable improvement in counting accuracy,
despite the reduction in overall training set size.
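A minimal sketch of this scheme, assuming the image and its ground-truth density map are numpy arrays (the function name is ours):

```python
import numpy as np

def quadrant_augment(img, gt):
    """Return the four non-overlapping image quadrants plus their
    horizontal flips: eight training samples per source image."""
    h, w = img.shape[0] // 2, img.shape[1] // 2
    samples = []
    for r in (0, h):
        for c in (0, w):
            crop, crop_gt = img[r:r + h, c:c + w], gt[r:r + h, c:c + w]
            samples.append((crop, crop_gt))
            # Flip image and density map together so counts stay aligned.
            samples.append((crop[:, ::-1], crop_gt[:, ::-1]))
    return samples
```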
2.2 FCN Architecture
Processing high resolution images (e.g. 1000 × 1000 pixels) using a fully convolutional network presents certain challenges and constraints, particularly in terms of memory usage on GPU hardware. We are limited
in the number of convolutional kernels and layers (i.e.
model capacity) our FCN can have. Therefore we must
attempt to design the best possible FCN architecture
capable of processing high resolution images such as
those in the UCF CC 50 dataset. An Nvidia GTX 970
card with 4GB of VRAM was used for our experi-
ments. With these constraints in mind we designed a
6 layer, single column FCN as illustrated in figure 1.
Table 2: Shanghaitech Part B validation performance using different network architectures.

Network Architecture                     MAE   MSE
Proposed                                 24.1  33.5
Multi-Column FCN (Zhang et al., 2016)    25.5  36.5

This network contains just 315,000 parameters, thanks largely to the absence of any fully connected layers. Rectified linear unit (ReLU) activations are applied after each convolutional layer apart from the last.
1 × 1
convolutions are used in the final layer to produce the
single channel crowd density heatmap. This density
heatmap is then fed into a 2D integrator (simply an
element-wise sum in this case) to produce the crowd
count estimate. The network is optimised in a sin-
gle training run using stochastic gradient descent and
backpropagation. We chose to minimise the Euclidean
distance between the produced density heatmap and
the ground truth heatmap. This loss function is fully
defined as follows:
L(Θ) = \frac{1}{2N} \sum_{i=1}^{N} \| F(X_i; Θ) - F_i \|_2^2 ,    (3)

where Θ is the set of network parameters to optimise, N is the batch size, X_i is the i-th batch image and F_i is the corresponding ground truth density heatmap. F(X_i; Θ) is the estimated heatmap for a given batch image X_i.
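In numpy terms, the loss of Equation 3 and the 2D integration step reduce to the following sketch, under the assumption that density maps are 2D arrays:

```python
import numpy as np

def euclidean_loss(pred_maps, gt_maps):
    """Equation 3: mean over the batch of half the squared L2 distance
    between predicted and ground-truth density heatmaps."""
    n = len(pred_maps)
    return sum(np.sum((p - g) ** 2) for p, g in zip(pred_maps, gt_maps)) / (2.0 * n)

def count_estimate(density_map):
    """The '2D integrator': an element-wise sum of the output heatmap."""
    return float(density_map.sum())
```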
Table 2 presents the difference in Shanghaitech
Part B validation performance when our high capac-
ity architecture is used over a shallower multi-column
FCN architecture (Zhang et al., 2016). All hyperparameters, including the training set augmentation scheme, are kept identical for both runs. We can see a clear
improvement in performance when our deeper single
column architecture is used.
2.3 Multi-Scale Averaging During Inference
Scale and perspective issues often limit the perfor-
mance of crowd counting algorithms. A top down
camera perspective is ideal for this task but cannot
be guaranteed in real world settings. In most CCTV
scenarios foreground pedestrians are much larger than
those in the background, who may only occupy a few
pixels. As FCNs allow for a variable size input im-
age, we can easily resize a given test image before
feeding it into the network and estimating the crowd
size. A scaled down version may result in more accu-
rate crowd counting in certain scene regions than the
original. Therefore in order to overcome these issues
a given test image is fed into the network at multiple
scales (e.g. original size + 80% original size). The
crowd count is estimated for each scale and the mean
is taken as the overall estimate. Table 3 shows the val-
idation performance of several multi-scale averaging
schemes. The same training and validation subsets are
used as before. Scheme 2 performs best and is thus
used for all further experiments.
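Scheme 2 can be sketched as follows; the `estimate_count` callable is a placeholder of ours, standing in for a forward pass through the trained FCN followed by the element-wise sum:

```python
import numpy as np
from PIL import Image

def multi_scale_count(img, estimate_count, scales=(1.0, 0.8)):
    """Average the count estimates obtained at the original size and
    at 80% of the original size (scheme 2)."""
    counts = []
    for s in scales:
        resized = img.resize((int(img.width * s), int(img.height * s)),
                             Image.BILINEAR)
        # The FCN accepts any resolution, so no padding is needed.
        counts.append(estimate_count(resized))
    return float(np.mean(counts))
```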
3 EXPERIMENTS
The performance of our fully convolutional crowd
counting technique is evaluated on three crowd count-
ing benchmarks from two datasets. These benchmarks
vary greatly in terms of congestion level and scene con-
tent. Our model achieves very strong crowd counting
performance, particularly on images of high density
scenes with several thousand people in frame. The
Caffe framework (Jia et al., 2014) and its Python wrapper are used to train and deploy our model. Both mean absolute error (MAE) and mean squared error (MSE) are used to compare crowd counting performance on all datasets. These two metrics are defined as follows:
MAE = \frac{1}{N} \sum_{i=1}^{N} | z_i - ž_i | ,    (4)

MSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} ( z_i - ž_i )^2 } ,    (5)

where N is the number of test images, z_i is the actual number of people in the i-th image and ž_i is the estimated number of people in the i-th image. MAE indicates the accuracy of the estimates while MSE corresponds to the robustness of the estimates.
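These two metrics transcribe directly into numpy:

```python
import numpy as np

def mae(true_counts, est_counts):
    """Equation 4: mean absolute error over N test images."""
    return float(np.mean(np.abs(np.asarray(true_counts, dtype=float)
                                - np.asarray(est_counts, dtype=float))))

def mse(true_counts, est_counts):
    """Equation 5: root of the mean squared error over N test images."""
    diff = np.asarray(true_counts, dtype=float) - np.asarray(est_counts, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```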
3.1 Shanghaitech Dataset
The Shanghaitech dataset (Zhang et al., 2016) con-
tains 1198 images of crowded scenes with a total of
330,165 head annotations included. This dataset is
split into two parts; Part A contains images of high
density scenes (up to 3000 people) taken from the in-
ternet while Part B consists of medium density crowd
images (up to 600 people) taken in the streets of Shang-
hai. Each part consists of a respective training and test
set. The performance of the proposed approach is
evaluated on each part separately.
The redundancy minimising augmentation ap-
proach discussed in section 2.1 is used for both parts.
The network is trained from scratch in a single run for 2 × 10^6 iterations using a base learning rate of 1 × 10^{-6}, with the learning rate decreased by a factor of 10 after 1 × 10^6
Table 3: Shanghaitech Part B validation performance using different multi-scale averaging inference schemes.

Multi-scale Averaging Scheme                                                      MAE   MSE
1) None                                                                           24.1  33.5
2) MeanCount(Original Size, 80% Original Size)                                    22.1  31.5
3) MeanCount(Original, 80% Original Size, 70% Original Size)                      24.6  34.1
4) MeanCount(Original, 80% Original Size, 70% Original Size, 60% Original Size)   25.2  34.8
Figure 2: Top: Shanghaitech Part A test image, Ground Truth Heatmap, Estimated Heatmap. True Count = 1326, Estimated Count = 1138. Bottom: Shanghaitech Part B test image, Ground Truth Heatmap, Estimated Heatmap. True Count = 240, Estimated Count = 242.
iterations. Gaussian weight initialisation with a stan-
dard deviation of 0.01 is used as well as a weight decay
of 0.0005 and a momentum of 0.9. Due to memory
limitations and the high image resolution of the Shang-
haitech dataset a batch size of 1 is used during training.
During testing the proposed multi-scale averaging is
applied, using scheme 2 from table 3.
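For reference, these hyperparameters can be collected as below. The keys follow Caffe's solver protocol, but this is a sketch of ours rather than the authors' actual solver file:

```python
# Training hyperparameters from the text, in Caffe solver terms.
solver_params = {
    "base_lr": 1e-6,         # base learning rate
    "lr_policy": "step",     # drop the rate by `gamma` every `stepsize` iterations
    "gamma": 0.1,            # factor-of-10 decrease
    "stepsize": 1000000,     # after 1e6 iterations
    "max_iter": 2000000,     # 2e6 iterations in total
    "momentum": 0.9,
    "weight_decay": 0.0005,
    # Gaussian weight initialisation (std 0.01) and batch size 1 are set
    # in the network definition rather than the solver.
}
```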
Our method is compared to the existing approaches
in table 4 and achieves state-of-the-art performance
on Shanghaitech Part B, improving MAE by 10%
and MSE by 19%. Competitive performance is also
achieved on Shanghaitech Part A, with an MSE near
identical to the state-of-the-art produced. Figure 2
shows our technique in action on images from this
dataset.
3.2 UCF CC 50 Dataset
The UCF CC 50 dataset (Idrees et al., 2013) contains
50 highly challenging crowd images taken from the
Internet. The number of pedestrians present in a frame
ranges between 94 and 4500. Following convention
(Idrees et al., 2013) a 5-fold cross validation is per-
formed on this dataset. The same augmentation process, training hyperparameters, and multi-scale averaging scheme are used as for the Shanghaitech dataset.
Again due to memory limitations a batch size of 1
is used. Table 5 compares our technique with the
existing approaches, with our method improving the
state-of-the-art for MAE and MSE by 11% and 13%
respectively. Figure 3 shows our technique in action
on an image from this dataset.
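The 5-fold protocol amounts to splitting the 50 images into five folds of ten and testing on each fold in turn; a minimal sketch (the function name and seed are ours):

```python
import numpy as np

def five_fold_indices(n_images=50, n_folds=5, seed=0):
    """Yield (train, test) index arrays: each fold of 10 images serves
    once as the test set, with the remaining 40 used for training."""
    idx = np.random.RandomState(seed).permutation(n_images)
    folds = np.array_split(idx, n_folds)
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train, test
```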
3.3 Cross Dataset Performance
In order to investigate the generalisation potential of
our technique, we performed a number of cross dataset
experiments. In each experiment we take a model
trained on a specific dataset as the "source" domain and then evaluate MAE and MSE performance on another, unseen dataset, the "target" domain. The results of these experiments are shown in Table 6. Superior performance is achieved when the source and target domains contain images of a similar density level (Shanghaitech Part B → Shanghaitech Part A, Shanghaitech Part A → UCF CC 50). On the other hand, very poor performance is achieved when the source domain contains significantly higher density images than the target (UCF CC 50 → Shanghaitech Part A). A model intended for real world deployment must therefore be trained on an appropriately large and varied training set.
Table 4: Comparing the performance of different crowd counting approaches on the Shanghaitech dataset.

                           Part A          Part B
Method                     MAE     MSE     MAE     MSE
(Zhang et al., 2016)       110.2   173.2   26.4    41.3
(Zhang et al., 2015)       181.8   277.7   32.0    49.8
Proposed Approach          126.5   173.5   23.76   33.12
Figure 3: UCF CC 50 test image, Ground Truth Heatmap, Estimated Heatmap. True Count=1544, Estimated Count=1566.
Table 5: Comparing performance of different crowd counting approaches on the UCF CC 50 dataset.

Method                              MAE     MSE
(Rodriguez et al., 2011)            655.7   697.8
(Lempitsky and Zisserman, 2010)     493.4   487.1
(Idrees et al., 2013)               419.5   541.6
(Zhang et al., 2016)                377.6   509.1
(Hu et al., 2016)                   431.5   438.5
(Zhang et al., 2015)                467.0   498.6
Our Approach                        338.6   424.5
Figure 4: Comparing count error and processing speed as
we scale down the image resolution on Shanghaitech Part B.
3.4 Trade-off between Computation Speed and Counting Accuracy
The ability of a fully convolutional network to process
images of any resolution is one of the key reasons
behind the strong counting performance achieved by
our method. However, analysing such high resolu-
tion images results in high memory consumption and
slower processing speed during inference. Therefore
we want to investigate to what degree image resolution
can be reduced during test time before we see signifi-
cant performance degradation. In doing so, we can find the best possible trade-off between computation speed and accuracy. The Shanghaitech Part B dataset is used
for this experiment. Test images are scaled down to
a given percentage of their original size with aspect
ratio maintained. The results of these experiments are
presented in figure 4. Surprisingly, we do not see the error increase significantly until we reduce the image size to 50%. However, with this 50% resizing applied, the processing speed is increased by a factor of 4. In deployment scenarios this type of downsampling can be applied in order to analyse real-time video without a major loss of accuracy.
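The sweep behind Figure 4 can be sketched as follows (our own illustration; `estimate_count` again stands in for the trained FCN plus integration):

```python
import time
from PIL import Image

def resolution_sweep(test_set, estimate_count,
                     fractions=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5)):
    """Report mean absolute count error and mean per-image inference
    time as test images are scaled down (aspect ratio maintained)."""
    for f in fractions:
        errors, elapsed = [], 0.0
        for img, true_count in test_set:
            resized = img.resize((int(img.width * f), int(img.height * f)),
                                 Image.BILINEAR)
            start = time.time()
            est = estimate_count(resized)
            elapsed += time.time() - start
            errors.append(abs(true_count - est))
        print(f, sum(errors) / len(errors), elapsed / len(test_set))
```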
4 CONCLUSION
In this paper we have proposed a deep, fully convolu-
tional crowd counting model that can perform highly
accurate single image crowd counting in almost any
surveillance scenario. Our model achieves state-of-
the-art performance on both the Shanghaitech Part B and UCF CC 50 datasets, as well as competitive performance on the Shanghaitech Part A dataset. Images
of any resolution and aspect ratio can be analysed.
The developed approach also performs well even with
significant image downsampling applied at test time.
Future work in this area will look to extend our net-
work to also perform other pixel-wise tasks such as
crowd segmentation in order to exploit the inter-task
correlations present.
Table 6: Cross dataset performance of our method. The percentage increases in MAE and MSE are highlighted.

Source Domain     Target Domain     MAE           MSE
Shanghaitech B    Shanghaitech A    191 (+52%)    337.5 (+94%)
UCF CC 50         Shanghaitech A    269 (+116%)   359.5 (+107%)
Shanghaitech A    Shanghaitech B    68 (+189%)    100.5 (+200%)
UCF CC 50         Shanghaitech B    165 (+614%)   215 (+540%)
Shanghaitech A    UCF CC 50         473 (+40%)    680 (+50%)
Shanghaitech B    UCF CC 50         699 (+100%)   866 (+105%)
ACKNOWLEDGMENTS
This publication has emanated from research con-
ducted with the financial support of Science
Foundation Ireland (SFI) under grant number
SFI/12/RC/2289.
REFERENCES
Chan, A. B. and Vasconcelos, N. (2012). Counting people
with low-level features and bayesian regression. IEEE
Transactions on Image Processing, 21(4):2160–2177.
Change Loy, C., Gong, S., and Xiang, T. (2013). From
semi-supervised to transfer counting of crowds. In
Proceedings of the IEEE International Conference on
Computer Vision, pages 2256–2263.
Chen, K., Loy, C. C., Gong, S., and Xiang, T. (2012). Fea-
ture mining for localised crowd counting. In BMVC,
volume 1, page 3.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. (2009). ImageNet: A Large-Scale Hierarchical
Image Database. In CVPR09.
Ge, W. and Collins, R. T. (2009). Marked point processes
for crowd counting. In Computer Vision and Pattern
Recognition, 2009. CVPR 2009. IEEE Conference on,
pages 2913–2920. IEEE.
Hu, Y., Chang, H., Nian, F., Wang, Y., and Li, T. (2016).
Dense crowd counting from still images with convolu-
tional neural networks. Journal of Visual Communica-
tion and Image Representation, 38:530–539.
Idrees, H., Saleemi, I., Seibert, C., and Shah, M. (2013).
Multi-source multi-scale counting in extremely dense
crowd images. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
2547–2554.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J.,
Girshick, R., Guadarrama, S., and Darrell, T. (2014).
Caffe: Convolutional architecture for fast feature em-
bedding. In Proceedings of the 22nd ACM inter-
national conference on Multimedia, pages 675–678.
ACM.
Lempitsky, V. and Zisserman, A. (2010). Learning to count
objects in images. In Advances in Neural Information
Processing Systems, pages 1324–1332.
Lin, S.-F., Chen, J.-Y., and Chao, H.-X. (2001). Estimation
of number of people in crowded scenes using perspec-
tive transformation. IEEE Transactions on Systems,
Man, and Cybernetics-Part A: Systems and Humans,
31(6):645–654.
Liu, T. and Tao, D. (2014). On the robustness and gen-
eralization of cauchy regression. In 2014 4th IEEE
International Conference on Information Science and
Technology, pages 100–105. IEEE.
Long, J., Shelhamer, E., and Darrell, T. (2015). Fully convo-
lutional networks for semantic segmentation. In Pro-
ceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 3431–3440.
Pan, J., McGuinness, K., Sayrol, E., O’Connor, N., and Giro-
i Nieto, X. (2016). Shallow and deep convolutional
networks for saliency prediction.
Rodriguez, M., Laptev, I., Sivic, J., and Audibert, J.-Y.
(2011). Density-aware person detection and tracking
in crowds. In 2011 International Conference on Com-
puter Vision, pages 2423–2430. IEEE.
Wu, B. and Nevatia, R. (2005). Detection of multiple, par-
tially occluded humans in a single image by bayesian
combination of edgelet part detectors. In Tenth IEEE In-
ternational Conference on Computer Vision (ICCV’05)
Volume 1, volume 1, pages 90–97. IEEE.
Xie, W., Noble, J. A., and Zisserman, A. (2016). Microscopy
cell counting and detection with fully convolutional
regression networks. Computer Methods in Biome-
chanics and Biomedical Engineering: Imaging & Visu-
alization, pages 1–10.
Zhang, C., Li, H., Wang, X., and Yang, X. (2015). Cross-
scene crowd counting via deep convolutional neural
networks. In Proceedings of the IEEE Computer Soci-
ety Conference on Computer Vision and Pattern Recog-
nition, volume 07-12-June, pages 833–841.
Zhang, Y., Zhou, D., Chen, S., Gao, S., and Ma, Y. (2016).
Single-image crowd counting via multi-column convo-
lutional neural network. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recogni-
tion, pages 589–597.