ensures privacy of the individuals being recorded. We evaluated different downscaled resolutions to determine the optimal trade-off between resolution and detection accuracy.
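The downscaling step underlying this trade-off study can be sketched in pure Python. This is a minimal nearest-neighbour stand-in; an actual pipeline would use a proper image library, and the frame layout and function name below are our own assumptions:

```python
def downscale(frame, out_w, out_h):
    """Nearest-neighbour downscale of a grayscale frame (list of pixel rows).

    A minimal stand-in for the resize step that turns the full-resolution
    camera image into a privacy-preserving low-resolution input.
    """
    in_h, in_w = len(frame), len(frame[0])
    return [
        [frame[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]

# Candidate target resolutions for the resolution/accuracy trade-off study.
candidates = [(192, 192), (128, 128), (96, 96)]
hi_res = [[(x + y) % 256 for x in range(480)] for y in range(480)]
low_res = {wh: downscale(hi_res, *wh) for wh in candidates}
```

At the lowest candidate resolution, individual identities are no longer distinguishable in the input, which is what makes the low-resolution stream privacy-preserving while a detector can still locate person-shaped regions.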
To avoid the need for time-consuming and expensive manual annotations, we proposed an approach that automatically generates new training data for the low-resolution networks, based on the high-resolution input images. The validity of this approach was confirmed by comparison with true manual annotations.
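The core of that annotation step can be sketched as follows, assuming the standard YOLO/darknet label convention (one `class cx cy w h` line per object, coordinates normalised to [0, 1]); the function name and detection tuple layout are our own, not the paper's code:

```python
def detections_to_yolo_labels(detections, img_w, img_h):
    """Turn pixel-space boxes from the high-resolution detector into
    YOLO-format label lines: 'class cx cy w h', normalised to [0, 1].
    """
    lines = []
    for cls, (x, y, w, h) in detections:  # (x, y) = top-left corner, pixels
        cx = (x + w / 2) / img_w
        cy = (y + h / 2) / img_h
        lines.append(f"{cls} {cx:.6f} {cy:.6f} {w / img_w:.6f} {h / img_h:.6f}")
    return lines

# High-resolution detections (class 0 = person) become training labels
# for the low-resolution networks without any manual annotation.
dets = [(0, (320, 120, 64, 160))]
labels = detections_to_yolo_labels(dets, img_w=640, img_h=480)
```

Since the coordinates are normalised, labels derived from high-resolution detections remain valid for any downscaled copy of the same frame, which is what lets the high-resolution detector supervise the low-resolution networks.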
Extensive accuracy experiments were performed. On test sets based on known scenes, the models showed acceptable performance at all resolutions. When tested on a similar scene with unseen data, however, performance clearly declines as the resolution decreases. The proposed automatic annotation pipeline nevertheless makes it easy to add additional training data for each scene. Indeed, a sensor that is newly installed in a certain room can acquire some high-resolution footage during its first hours of operation, with which a room-specific low-resolution detector can quickly be (transfer) learned.
Based upon our results we conclude that, despite the extremely low input resolution of our lowest-resolution model (96×96 px), our YOLOv2-based detection pipeline is still able to efficiently detect persons, even though they are not recognisable by human observers. Our framework is thus able to serve as an efficient occupancy detection system.
Furthermore, the low input resolution allows for a lightweight network that is easily implementable on embedded systems while still maintaining high processing speeds.
Although the current approach is suitable for industrial use as is, we believe that we have not yet reached the extreme lower limit and deem it possible to reduce the input resolution even further.
ACKNOWLEDGEMENT
This work is partially supported by the VLAIO via the
Start to Deep Learn project.
How Low Can You Go? Privacy-preserving People Detection with an Omni-directional Camera