ADAPTIVE IMAGE SENSOR SAMPLING FOR LIMITED MEMORY MOTION DETECTION

David Gibson, Henk Muller and Neill Campbell

Computer Science, University of Bristol, Bristol, U.K.

XMOS Ltd, Bristol, U.K.

Keywords:

Motion Detection, Programmable Sensors, Low Memory.

Abstract:

In this paper we propose that combining a state-of-the-art high frequency, low energy microprocessor architecture with a highly programmable image sensor can offer a substantial reduction in cost and energy requirements when carrying out low-level visual event detection and object tracking. The XMOS microprocessor consists of a single or multi-core concurrent architecture that runs at between 400 and 1600 MIPS with 64KB of on-chip RAM per core. Modern highly programmable image sensors such as the Kodak KAC-401 can capture regions-of-interest (ROI) at rates in excess of 1500fps. Comparing two 320 by 240 pixel images would usually require 150KB of RAM; by combining the above components as a computational camera this constraint can be overcome. In the proposed system the microprocessor programs the sensor to capture images as a sequence of high frame rate regions-of-interest. These regions can be processed to determine the presence of motion as differences of ROIs over time. By providing additional cores, extensive image processing can be carried out and ROI pixels can be composited onto an LCD to give output images of 320 by 240 pixels at near standard frame rates.

1 INTRODUCTION

There is increasing interest in low cost computer vision systems with a wide range of applications including gesture based user interfaces, surveillance, automotive systems and robotics. As the complexity of consumer, sensing and military systems increases, the demands on energy resources become critical for high-level computing performance. Vision systems are proving to be extremely valuable across a range of applications, and the ability to efficiently process visual information offers a huge advantage in the functionality of such systems.

Traditional computer vision systems typically consist of a camera continually capturing and transmitting images at a fixed frame rate and resolution, with a host computer sequentially processing them to obtain a result such as the trajectory of a moving object. A major drawback of this pipeline is that large amounts of memory are required to store the image data before it is processed, especially as frame rate and image resolution increase. Additionally, large amounts of image data are transmitted to the host for processing regardless of the amount of information contained in this data. In the case of object tracking, computer vision algorithms work towards creating a concise description, such as that a group of pixels at a certain location is moving in a particular way. Often the object is relatively small compared to the whole image and the background may be static. In cases like this the traditional computer vision processing pipeline could be considered highly inefficient, as large amounts of image data are captured, transmitted to the host, stored in memory and processed on a per-pixel basis while most of the visual information comes from a small number of changing pixels. In such cases most of the image data is discarded as it contains no useful information.

In the case of a scene with an object moving across a static background, most of the image data changes very little while some pixel areas might change rapidly or move at different speeds. The fixed temporal sampling rate of standard camera systems cannot take this into account, and artifacts such as motion blur and temporal incoherence are introduced. These artifacts consequently confound downstream processing, necessitating ever more complex computer vision algorithms to overcome these imaging effects.

In (Shraml, S. and Belbachir, A. N., 2010) a so-called Dynamic Vision System (DVS) (Lichtsteiner, P. and Posch, C. and Delbruck, T., 2007) is used to overcome the issues discussed above. The DVS has very low latency and only delivers pixel difference information when pixel luminance values change. While this is an exciting new technology it has two major drawbacks: it is low resolution (only 128 by 128 pixels), and only image differences are captured so a traditional image cannot be created.

Gibson D., Muller H. and Campbell N.
ADAPTIVE IMAGE SENSOR SAMPLING FOR LIMITED MEMORY MOTION DETECTION.
DOI: 10.5220/0003824603990402
In Proceedings of the 2nd International Conference on Pervasive Embedded Computing and Communication Systems (PECCS-2012), pages 399-402
ISBN: 978-989-8565-00-6
© 2012 SCITEPRESS (Science and Technology Publications, Lda.)

In this paper we propose a system that conceptually fits between the DVS described above and a traditional computer vision capture system. By scanning regions of interest on the surface of the sensor at high sampling rates, piece-wise image processing can be carried out in very small amounts of memory. The first level of processing involves temporal ROI differencing to determine whether any activity has occurred in the current ROI. This temporal ROI differencing provides a powerful technique for controlling the amount and rate at which data is captured by the sensor, the amount and extent of further image processing that is performed, and whether pixel data should be transmitted, stored or displayed. The potential of the low-level ROI differencing is demonstrated by two prototype hardware systems built to investigate the combination of XMOS's high frequency microprocessor and a highly programmable image sensor.

2 SYSTEM ARCHITECTURE

The XMOS XCore (www.xmos.com) is a multi-threaded processing component with instruction set support for communication, I/O and timing. Thread execution is deterministic and the time taken to execute a sequence of instructions can be accurately predicted. This makes it possible for software executing on an XCore to perform many functions normally performed by hardware, especially DSP and I/O. An initial prototype was constructed to test the feasibility of the system described above (Figure 1). As a minimal system it consisted of a single core XMOS XS1-L1 processor running at 400 MIPS with four fast threads, directly connected to a Kodak KAC-401 WVGA image sensor (technical datasheet obtained in 2010, “MTD-PS-1170 KAC-00401 Revision 1.0 MTDPS-1070.pdf”, no longer available from Kodak.com). The XMOS architecture allows for four fast threads per core; more threads are available, but they share the system resources between them. In this application the high clock speed of the image sensor dictated that only fast threads could be used. The Kodak KAC-401 image sensor is highly programmable and provides features such as pixel binning (image sub-sampling), region-of-interest positioning, frame and row delays, digital and analogue gain adjustment, variable bit-depth, etc. The default clock speed is 25MHz and the sensor registers are updated using the I2C protocol. To program the sensor, intermediate registers are written to, and an update then takes a single frame cycle to write from the intermediate registers to the main register set.

Figure 1: The initial prototype (left) consisting of a single core XMOS XS1-L1 and a Kodak KAC-401 image sensor. The sensor is directly connected to the processor; an M12 lens, which would normally cover the sensor, is shown to give a reference to the scale of the components. A four-threaded architecture (right) is shown with its thread connectivity. Data is received by the XCore via bi-directional hardware ports and communication between threads is carried out via bi-directional channels.

Throughout this paper the pixel depth was set to 8 bits and the ROIs were set to 64 by 40 pixels, programmed as a 5 by 6 ROI grid to cover the exposed sensor surface. The sensor resolution was set to 640 by 480 pixels and 2 by 2 pixel binning was used to give an effective image resolution of 320 by 240 pixels. The XMOS processors were programmed via a JTAG connection to a host PC; the JTAG connection includes a 10Kbs UART which was used in the initial prototype.
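The grid geometry above can be checked with a short sketch. This is an illustration only, written in Python rather than the XC used on the hardware; the function names and the row-by-row ROI numbering are assumptions and are not taken from the prototype software.

```python
# Sketch of the ROI grid geometry described above. Names are illustrative.
ROI_W, ROI_H = 64, 40          # ROI size in binned pixels (from the text)
GRID_COLS, GRID_ROWS = 5, 6    # 5 by 6 ROI grid (from the text)
BINNING = 2                    # 2 by 2 pixel binning (from the text)


def roi_origin(index):
    """Return the (x, y) top-left corner of ROI `index` in the binned image.

    Assumption: ROIs are numbered row by row, left to right.
    """
    if not 0 <= index < GRID_COLS * GRID_ROWS:
        raise ValueError("ROI index out of range")
    row, col = divmod(index, GRID_COLS)
    return col * ROI_W, row * ROI_H


def to_sensor_coords(x, y):
    """Map binned-image coordinates to unbinned 640 by 480 sensor coordinates."""
    return x * BINNING, y * BINNING


image_w = GRID_COLS * ROI_W    # effective image width after binning
image_h = GRID_ROWS * ROI_H    # effective image height after binning
roi_bytes = ROI_W * ROI_H      # 8-bit pixels, one byte per pixel
```

Under these assumptions the 30 ROIs tile the 320 by 240 binned image exactly, and each ROI occupies 2560 bytes at 8 bits per pixel.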

The initial prototype used one thread to read two consecutive ROIs from the sensor at a given grid location, each 64 by 40 pixels and consisting of 2560 bytes. This data is passed to a second, image processing, thread which compares the buffered pixel values of the ROI pair. As there were not enough resources for displaying results on an LCD, the results of the ROI comparison are transmitted as a bit pattern via a third thread running a UART to the host side console via the JTAG connection. In a synchronized and concurrent manner the fourth thread programs and updates the sensor to capture the next ROI of the grid. As will be discussed in the results section, the output of the initial prototype responded as expected to basic stimuli such as passing a hand over and above the sensor. This led to the development of a phase two prototype consisting of a multi-core architecture and LCD.
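A minimal sketch of the ROI pair comparison and the resulting bit pattern is given below. It is not the prototype's XC code: the text does not specify the comparison, so the two thresholds and the "count of changed pixels" test are hypothetical choices made for illustration.

```python
# Sketch of temporal ROI differencing. Thresholds are hypothetical.
ROI_PIXELS = 64 * 40   # 2560 bytes per 8-bit ROI (from the text)
NUM_ROIS = 5 * 6       # one bit per ROI in the output pattern


def roi_has_motion(prev_roi, curr_roi, pixel_threshold=16, count_threshold=32):
    """Return True if enough pixels differ between two captures of one ROI.

    Assumption: a pixel counts as changed when its absolute difference
    exceeds pixel_threshold; motion is flagged when at least
    count_threshold pixels have changed.
    """
    if len(prev_roi) != len(curr_roi):
        raise ValueError("ROI buffers must have the same length")
    changed = sum(1 for a, b in zip(prev_roi, curr_roi)
                  if abs(a - b) > pixel_threshold)
    return changed >= count_threshold


def motion_bit_pattern(roi_pairs):
    """Pack one motion bit per ROI into an integer; bit i is set if ROI i moved."""
    pattern = 0
    for i, (prev_roi, curr_roi) in enumerate(roi_pairs):
        if roi_has_motion(prev_roi, curr_roi):
            pattern |= 1 << i
    return pattern


# Memory needed to hold a pair of buffers for differencing.
full_frame_pair_bytes = 2 * 320 * 240   # two whole 320 by 240 images
roi_pair_bytes = 2 * ROI_PIXELS         # two consecutive ROIs
```

Only one ROI pair is held at a time, so the differencing buffers need 5120 bytes rather than the 153600 bytes (150KB) required for two whole 320 by 240 images.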

PECCS 2012 - International Conference on Pervasive and Embedded Computing and Communication Systems


In order to further explore the concepts described above a second prototype was built. This system comprised the same image sensor but with a number of quad-core processors, ethernet and an LCD. A schematic of this more extensive system is shown in Figure 2.

[Figure 2 diagram. Block labels: Image sensor, Sensor Grabber, Control, Image Processing (Multi-core), Buffer Manager, LCD Buffer1, LCD Buffer2, LCD Driver, 320x240 LCD, Ethernet, Host PC UDP/IP server debug interface. Link annotations: 25MHz, 5.9MHz, >30KBs.]

Figure 2: The system architecture of the second phase prototype, roughly from left to right. Pixels are read in from the sensor as a series of sub-images and passed on to the image processing sub-system. The results of the image processing sub-system are used to re-program the frame delay, gain and sub-image position and size registers of the image sensor. Image processing state spaces are transmitted to an ethernet delegate and the raw pixel data is passed through, via the frame buffering sub-system, to the LCD. The internal states of the image processing sub-system are visualized in their various forms using a windows based interface.

A detailed representation of the multi-core image processing sub-system is shown in Figure 3. The system proceeds as follows: the sensor is programmed to grab a ROI and data is read from the grabbing thread into the first image handling thread while, concurrently, image data is passed on to a histogram analysing thread. In this thread a histogram is created for every other whole image and its mean is compared to a pre-defined ideal mean value. A proportional-integral-derivative (PID) controller is used to adjust the frame delay and analogue gain so as to move the current mean towards the ideal; the updated values are sent back to the grabber thread where they are used to update the sensor registers. Meanwhile, the image handler passes the ROI pixel data on to a set of parallel threads which compute the integral image of each ROI. The first image handler also requests the next ROI from the sensor. In the current implementation the integral ROI images are only used to create an 8 by 8 mean filtered representation of each 64 by 40 pixel ROI. Each 8 by 8 mean ROI representation is added to a 40 by 48 array to give a low resolution representation of the high resolution 5 by 6 sensor grid sampling. The 40 by 48 array is considered as a set of observations which is compared to the previous frame of observations, the difference of which gives a motion detection map. The motion detection map is smoothed by a 3 by 3 Gaussian filter each frame to generate a motion-history-image style representation (Davis and Bobick, 1997). The image difference and Gaussian filter parts of the architecture in Figure 3 allow the results of statistical analysis to be fed back to the first image handler and sensor via the second image handler. The second image handler also channels the original pixel data on to the LCD output system. The results of parts of the image processing architecture are concurrently passed on to an ethernet delegate which transmits data to a server running on the host PC system. The latter has proved invaluable for debugging, algorithm prototyping and visualisation of the internal states of the image processing system.
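The integral image, mean reduction, differencing and smoothing steps described above can be sketched as follows. This is an illustrative Python version, not the XC implementation. It assumes that each of the 8 by 8 cells covers 8 by 5 pixels (inferred from 64/8 and 40/8), that the 3 by 3 Gaussian is the binomial kernel 1-2-1, and that image borders are handled by edge replication; none of these details are stated in the text.

```python
ROI_W, ROI_H = 64, 40
CELLS = 8                                        # 8 by 8 means per ROI
CELL_W, CELL_H = ROI_W // CELLS, ROI_H // CELLS  # 8 by 5 pixels (assumption)


def integral_image(roi):
    """Return the (H+1) by (W+1) summed-area table of a list-of-rows ROI."""
    h, w = len(roi), len(roi[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += roi[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def block_means(ii):
    """Reduce one ROI to CELLS by CELLS means using four lookups per cell."""
    means = []
    for cy in range(CELLS):
        y0, y1 = cy * CELL_H, (cy + 1) * CELL_H
        row = []
        for cx in range(CELLS):
            x0, x1 = cx * CELL_W, (cx + 1) * CELL_W
            total = ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]
            row.append(total // (CELL_W * CELL_H))
        means.append(row)
    return means


def motion_map(prev_obs, curr_obs):
    """Absolute difference between two frames of low resolution observations."""
    return [[abs(c - p) for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev_obs, curr_obs)]


KERNEL = ((1, 2, 1), (2, 4, 2), (1, 2, 1))  # binomial kernel, weights sum to 16


def gaussian_3x3(img):
    """Smooth with the 3 by 3 kernel, replicating edge pixels at the border."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += KERNEL[dy + 1][dx + 1] * img[yy][xx]
            out[y][x] = acc // 16
    return out
```

With the 5 by 6 grid, placing each 8 by 8 result at its grid position gives the 40 by 48 observation array (5 × 8 columns, 6 × 8 rows), to which `motion_map` and `gaussian_3x3` would then be applied once per frame.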

[Figure 3 diagram. Block labels: Image Input, Image Handler, Integral Image (three parallel blocks), Image Differences, Gaussian Filter, Histogram Calculation, Image Handler, Image Output, Ethernet Delegate.]

Figure 3: The image processing sub-system with parallel and concurrent thread usage which is distributed across a single quad-core XMOS processor.
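The histogram based exposure control performed by the Histogram Calculation thread might look like the sketch below. It is a generic textbook PID loop, not the controller used in the prototype: the gains, the target mean of 128 and the clamp limits are hypothetical values, and only the frame delay is adjusted here (the analogue gain adjustment mentioned in the text is omitted).

```python
class PIDController:
    """Discrete PID controller with a unit time step (illustrative only)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error):
        self.integral += error
        derivative = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def histogram_mean(hist):
    """Mean intensity of a histogram indexed by grey level."""
    total = sum(hist)
    if total == 0:
        return 0.0
    return sum(level * count for level, count in enumerate(hist)) / total


def adjust_frame_delay(delay_us, hist, pid, target_mean=128.0,
                       min_delay_us=113.0, max_delay_us=285.0):
    """Move the frame delay so the image mean approaches target_mean.

    Assumptions: a darker image (mean below target) calls for a longer
    frame delay, i.e. more exposure; the clamp limits are placeholder
    values, not the fixed range used by the system.
    """
    error = target_mean - histogram_mean(hist)
    new_delay = delay_us + pid.update(error)
    return min(max(new_delay, min_delay_us), max_delay_us)
```

In use, the returned delay would be sent back to the grabber thread, which writes it to the sensor's frame delay register before the next capture.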

3 RESULTS

In Figures 2 and 3 each small block effectively represents a single thread of the system. Consequently all of the image capture, sensor programming, image processing and LCD management can fit on a single 16 thread quad-core processor running at 1600 MIPS. The ethernet connection requires two more threads. The initial prototype generated a bit pattern where each bit represented a ROI and was set when motion was detected and cleared when it was not. The system output behaved as expected when passing a hand over the sensor. However, using the ethernet and LCD output of the second prototype gave a much greater insight into the performance and functionality of the system. Table 1 shows timings of the system. The histogram processing is used to attempt to obtain a reasonable intensity balance for image pixels in varying light conditions. The system works in two modes, good light and bad light, and the frame delay is allowed to change within a fixed range to best adapt to these conditions, as reflected in the F. delay values of Table 1.

Table 1: Timing information of the multi-core system. F. delay is the frame delay used to increase exposure time. Sen. ROI is the rate at which ROI frames are captured by the sensor per second. Sys. ROI is the rate at which ROI images actually pass through the entire system per second. LCD and PC are the rates, in frames per second, at which 320 by 240 pixel images and 40 by 48 motion representations reach the LCD and PC (via ethernet) respectively.

F. delay (µs)   Sen. ROI   Sys. ROI   LCD    PC
285             436        305        10.2   8.5
162             1442       489        16.3   15.3
113             1700       643        21.3   18.5
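As a consistency check on Table 1, the LCD frame rate should be roughly the system ROI rate divided by the 30 ROIs of the 5 by 6 grid. The sketch below performs that arithmetic on the table values; the variable and function names are illustrative.

```python
ROIS_PER_FRAME = 5 * 6   # one 320 by 240 frame is composited from 30 ROIs

# (frame delay in microseconds, Sys. ROI per second, reported LCD frames per second)
TABLE_1 = [
    (285, 305, 10.2),
    (162, 489, 16.3),
    (113, 643, 21.3),
]


def full_frame_rate(sys_roi_per_s):
    """Whole-image rate implied by a given system ROI throughput."""
    return sys_roi_per_s / ROIS_PER_FRAME


# Implied frame rate for each row, for comparison with the LCD column.
implied = [(delay, full_frame_rate(sys_roi), lcd)
           for delay, sys_roi, lcd in TABLE_1]
```

The implied rates agree with the reported LCD column to within about 0.2 frames per second, which suggests that LCD output is bounded by ROI throughput through the system rather than by the LCD path itself.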

The Sen. ROI and Sys. ROI values of Table 1 show the potential rate of ROI sensor capture and the actual rate of ROI processing for the system. Clearly the system cannot process image data at the sensor ROI grabbing rate, leaving the sensor to free run until the system is ready to grab the next ROI. ROI transfer times from the sensor to the system are 46µs and sensor ROI programming times are 26µs. The LCD values show the rate, in frames per second, of 320 by 240 pixel output to the LCD and the PC values show the rate at which 40 by 48 motion representations are received by the host PC via the ethernet connection. The Gaussian filtering is computed in a naive manner; by using the algorithms of (Wells, W. M., 1986) system performance and functionality could be increased. It should be noted that no explicit optimisation has been applied to the system software, which is written in the XMOS XC language, an extension of C. The above timings are given as an initial report of results and much more analysis is required to fully understand the true performance of the system.
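The idea behind the cascaded uniform filters mentioned above is that repeatedly applying a running-sum box filter approximates a Gaussian at a per-sample cost that does not depend on the filter width. The one-dimensional sketch below illustrates this general idea only; it is not a transcription of the algorithms in (Wells, W. M., 1986), and the edge replication and default parameters are assumptions.

```python
def box_filter_1d(signal, radius):
    """Uniform filter of width 2*radius+1 computed with a running sum.

    Borders are handled by replicating the first and last samples.
    """
    if not signal:
        return []
    n = len(signal)
    width = 2 * radius + 1
    padded = [signal[0]] * radius + list(signal) + [signal[-1]] * radius
    window = sum(padded[:width])
    out = [window / width]
    for i in range(1, n):
        # Slide the window: add the sample entering, drop the one leaving.
        window += padded[i + width - 1] - padded[i - 1]
        out.append(window / width)
    return out


def cascaded_box_1d(signal, radius=1, passes=3):
    """Approximate Gaussian smoothing by cascading uniform filters."""
    for _ in range(passes):
        signal = box_filter_1d(signal, radius)
    return signal
```

For a two-dimensional map such as the 40 by 48 motion representation, the same cascade would be applied along rows and then along columns, since the filter is separable.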

4 CONCLUSIONS

In this paper we have shown that by leveraging the programmability of an image sensor, motion detection can be carried out at near standard frame rates at an effective resolution of 320 by 240 pixels using a single-core four thread processor with just 64KB of RAM. Further, we have shown that by using a multi-core architecture, motion detection and various additional image processing can be carried out at near real time rates at an effective resolution of 320 by 240 pixels using a distributed system with no more than four unshared blocks of 64KB of RAM. It is expected that with further development the proposed system will be able to compute higher-level computer vision algorithms such as optical flow (Barron, J. L. and Fleet, D. J. and Beauchemin, S., 1994), point tracking (Shi, J. and Tomasi, C., 1994), gesture recognition (Shotton, J. and Fitzgibbon, A. and Cook, M. and Sharp, T. and Finocchio, M. and More, R. and Kipman, A. and Blake, A., 2011) and face detection (Viola, P. and Jones, M. J. and Snow, D., 2005). Key contributions of this paper include leveraging the programmability of modern image sensors and the use of high frequency low power XMOS processors.

ACKNOWLEDGEMENTS

This work was sponsored by an EPSRC Knowledge Transfer Secondment held by the Research, Enterprise and Development department of the University of Bristol.

REFERENCES

Barron, J. L. and Fleet, D. J. and Beauchemin, S. (1994). Performance of optical flow techniques. In International Journal of Computer Vision, volume 12, pages 43–77.

Davis, J. and Bobick, A. (1997). The representation and recognition of action using temporal templates. In International Conference on Computer Vision and Pattern Recognition.

Lichtsteiner, P. and Posch, C. and Delbruck, T. (2007). An 128x128 120db 15us-latency temporal contrast vision sensor. In IEEE Journal Solid State Circuits.

Shi, J. and Tomasi, C. (1994). Good features to track. In International Conference on Computer Vision and Pattern Recognition.

Shotton, J. and Fitzgibbon, A. and Cook, M. and Sharp, T. and Finocchio, M. and More, R. and Kipman, A. and Blake, A. (2011). Real-time human pose recognition in parts from single depth images. In International Conference on Computer Vision and Pattern Recognition.

Shraml, S. and Belbachir, A. N. (2010). A spatio-temporal clustering method using real-time motion analysis on event-based 3d vision. In International Conference on Computer Vision and Pattern Recognition.

Viola, P. and Jones, M. J. and Snow, D. (2005). Detecting pedestrians using patterns of motion and appearance. In International Journal of Computer Vision, volume 63, pages 153–161.

Wells, W. M. (1986). Efficient synthesis of gaussian filters by cascaded uniform filters. In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 8, pages 234–239.
