FAST PROTOTYPING OF EMBEDDED IMAGE PROCESSING

APPLICATION ON HOMOGENOUS SYSTEM

A Parallel Particle Filter Tracking Method on Homogeneous Network of

Communicating Processors (HNCP)

Hanen Chenini, Jean Pierre Derutin and Thierry Chateau

LASMEA-UMR 6602 CNRS, Blaise Pascal University, 24 Av. des Landais, Clermont-Ferrand, France

Keywords:

Face Tracking, K-Nearest Neighbor (KNN), Parallel Implementations, Homogeneous Network of Communicating Processors.

Abstract:

This article discusses the design of an application-specific MP-SoC (Multi-Processor System on Chip) architecture dedicated to a face tracking algorithm. The proposed algorithm tracks a Region-Of-Interest (ROI) by computing a similarity measure between the reference and the target frames. In our approach, this measure is an estimate of the Kullback-Leibler divergence obtained in the K-nearest neighbor (KNN) framework. The metric between pixels is a Euclidean norm in a joint geometric and radiometric space. The adopted measure allows us to check whether the regions have similar colors and also whether these colors appear at the same location. Given the amount of computation required, we propose a parallel hardware implementation of the developed algorithm on an MP-SoC architecture. Building multiple processors into one system is hard for software developers using traditional hardware design approaches, because of the complexity of designing software models suitable for such FPGA implementations. To deal with this problem, we introduce the CubeGen tool, which spares the designer tedious manual editing operations. This new methodology enables us to instantiate a generic Homogeneous Network of Communicating Processors (HNCP) tailored to our targeted application. Our implementations are demonstrated on the Xilinx XC6VLX240T FPGA chip.

1 INTRODUCTION

We first give a brief introduction to particle filters (PFs), as their application and implementation are the focus of this article.

In most applications, prior knowledge about the phenomenon being modeled is available. This knowledge allows us to formulate Bayesian models. Within this setting, we use Monte Carlo simulation methods (particle filtering) to implement the Bayesian framework (Chen, 2003). PFs are Bayesian in nature, and their goal is to find an approximation to the posterior density of the states of interest (e.g. the position of a moving object in tracking) based on observations corrupted by additive Gaussian white noise, which are the inputs to the filter. This is done using the principle of Importance Sampling (IS), whereby multiple copies (particles) of the variable of interest are drawn from a known density (the Importance Function (IF)), each one associated with a weight that signifies the quality of that specific particle given the received observations. An estimate of the variable of interest is obtained as the weighted sum of all the particles.
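The importance sampling principle described above can be illustrated with a minimal sketch. The Gaussian target and proposal densities, the sample count and the function names are assumptions chosen for illustration only; they are not taken from the paper.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_sampling_mean(target_pdf, proposal_sample, proposal_pdf, n, rng):
    """Estimate the mean of target_pdf from particles drawn from a proposal density."""
    particles = [proposal_sample(rng) for _ in range(n)]
    # Importance weights: target density divided by proposal density.
    weights = [target_pdf(x) / proposal_pdf(x) for x in particles]
    total = sum(weights)
    # Self-normalized estimate: weighted sum of the particles.
    return sum(w * x for w, x in zip(weights, particles)) / total


rng = random.Random(0)
estimate = importance_sampling_mean(
    target_pdf=lambda x: gaussian_pdf(x, 2.0, 1.0),      # hypothetical target
    proposal_sample=lambda r: r.gauss(0.0, 3.0),         # hypothetical importance function
    proposal_pdf=lambda x: gaussian_pdf(x, 0.0, 3.0),
    n=20000,
    rng=rng,
)
print(round(estimate, 2))
```

The estimate approaches the target mean (2.0 here) as the number of particles grows, even though no particle is drawn from the target density itself.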

The main objective of particle filtering (Maskell and Gordon, 2001) is to track a variable of interest as it evolves over time t, typically with nonlinear models and non-Gaussian noise. Consequently, particle filters have been used in a wide variety of real-time applications, including tracking. Tracking is the process of sequentially estimating the hidden states $s_t$ from all the available measurements $z_{0:t}$ (observations). Efficient and robust tracking of objects in complex environments is important for a variety of applications, including surveillance and monitoring (Xinyu and Baoxin, 2005), autonomous driving (Petrovskaya and Thrun, 2009), medical imaging (De Bruijne and Nielsen, 2004) and robotics. Depending on the state-space model, the complexity is related to the prediction of the M sampled particles and to the weight evaluation stage.

The algorithm considered in our approach uses a Gaussian approximation to the full posterior as the importance function. We also propose to evaluate the likelihood by computing the Kullback-Leibler distance in the KNN framework, which further increases the complexity of the algorithm. To sum up, the proposed particle filter algorithm consists of an initialization of the template model and a sequential Monte Carlo implementation of Bayesian filtering for the stochastic tracking system.

Chenini H., Pierre Derutin J. and Chateau T.
FAST PROTOTYPING OF EMBEDDED IMAGE PROCESSING APPLICATION ON HOMOGENOUS SYSTEM - A Parallel Particle Filter Tracking Method on Homogeneous Network of Communicating Processors (HNCP).
DOI: 10.5220/0003828401220133
In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP-2012), pages 122-133
ISBN: 978-989-8565-04-4
© 2012 SCITEPRESS (Science and Technology Publications, Lda.)

Translating an algorithm for real-time implemen-

tation requires making speciﬁc choices so that the de-

sign meets the constraints. Some of the main con-

straints are speed of execution, power dissipation, ac-

curacy of the results, cost and time involved in the

implementation. Dedicated hardware implementation

may be useful for high speed processing but it does

not offer the ﬂexibility and programmability required

for system evolution. Applications with stringent

resource-consumption and runtime constraints are in-

creasingly resorting to MP-SoC architectures. Gen-

erally, the MP-SoC architecture has become a solu-

tion for designing embedded systems dedicated to ap-

plications that require intensive computations. In the

MP-SoC architecture, the mapping of software tasks

to hardware resources is important since it affects the

degree of parallelism among multiple processors and

the utilization of hardware resources.

Except for the state estimation and initialization stages, the processing of individual particles can be done in parallel. To exploit this, we applied our design technique to implement the proposed face tracking algorithm on the homogeneous MP-SoC architecture (Siéler et al., 2010). This parallel architecture contains multiple homogeneous processors, memory blocks, DMA (Direct Memory Access) and several I/O resources on the same chip. In addition, it can be beneficial to take advantage of the parallelism, low cost, and low power consumption offered by FPGAs. The design and implementation of a real-time object tracking algorithm on an FPGA focuses on minimizing resource utilization, so that the functionality of the application that uses the tracking information can be added. We therefore propose an FPGA implementation to evaluate the real-time performance of the developed algorithm. The details of this implementation are provided in Section 4.

The rest of the paper is organized as follows. Section 2 describes the theory behind PFs and surveys existing related work. In Section 3, we briefly describe the face tracking algorithm within the particle filter framework, including the face model, the dynamic model and the observation model. Section 4 outlines our proposed design flow for implementing the developed image processing application. The evaluation of the resource utilization and latency of the two parallel schemes used on the FPGA platform to speed up the tracking process is presented in Section 5. Section 6 concludes the paper.

2 STATE OF THE ART

Sequential Monte Carlo (SMC) methods (Diaconis, 2003) are a set of simulation-based methods which provide a convenient and attractive approach to computing the posterior distribution. SMC methods are very flexible, easy to implement, parallelizable and applicable in very general settings. Several closely related algorithms, under the names of bootstrap filters, the Sampling Importance Resampling Filter (SIRF), condensation, particle filters, Monte Carlo filters, Sequential Importance Sampling (SIS) and interacting particle approximations, have appeared in several research fields. The problem encountered with SIS is that, as time t increases, the distribution of the importance weights becomes more and more skewed. In practice, after a few time steps, most particles have negligible weights. To avoid this degeneracy, the key idea of SIR (the bootstrap filter) is to eliminate or replicate particles depending on their importance weights (particles with high weights are selected more and more often, while the others slowly die out).

PFs (Maskell and Gordon, 2001) are used to perform filtering for dynamic state-space problems that can be described by a sequential representation (a system transition equation and an observation equation). The state-space model depends on the physics of the problem.

In most practical scenarios, these models are nonlinear and the densities involved are non-Gaussian. PFs are used to estimate the states of a nonlinear, non-Gaussian state-space model. Traditional filters such as the Kalman filter find an optimal solution to the recursive problem only when the models are linear and the noises are Gaussian. The Extended Kalman Filter (Greg and Gary, 1995), an extension of the Kalman filter, handles nonlinear models by linearization and is no longer optimal when the state and/or measurement models are nonlinear and the state and measurement noises are non-Gaussian.

In our implementation, we model the problem as a Markovian, nonlinear and non-Gaussian state space. In practice, a large number of particles often needs to be used to compute estimates of the desired state. Hence, to meet the speed requirements of real-time applications, it is necessary to have high-throughput designs able to process a larger number of particles in a given time. Parallelizability is the key to high-throughput designs for PFs, as it enables simultaneous processing of particles.

In the digital implementation of PFs algorithms,

the choices available are to either use a digital sig-

nal processor (DSP), a ﬁeld-programmable gate array

(FPGA) or an application speciﬁc IC (ASIC).

Recently, several approaches to the hardware implementation of particle filters have drawn attention (Bolić et al., 2004), resulting in an FPGA prototype of a particle filter algorithm. As part of that effort, the authors developed an architecture for a digital hardware implementation of particle filters, along with efficient resampling algorithms. Their initial attempt was evaluated on a TI TMS320C54x DSP for a bearings-only tracking algorithm. With M = 2000 particles, a single state-of-the-art DSP yielded speeds of up to 500 Hz for the SIRF. Using a Xilinx Virtex-II Pro FPGA, they achieved a maximum sampling frequency of 50 kHz for a similar application. This later led to an application-specific integrated circuit (ASIC) implementation of certain stages of the particle filter algorithm.

The successful implementation of this kind of image processing algorithm illustrates that the digital signal processing required for high-rate sensing applications can be efficiently implemented on FPGA hardware (Xilinx Spartan-II XC2S200) (Bolić et al., 2004).

In parallel implementations, the propagation and weight calculation for different particles are independent, and each requires M iterations for one PF recursion. The resampling step, which is inherently sequential, has been modified to allow for parallel implementation. Because resampling requires all particle weights to be available, complete parallelization of the particle filter algorithm is difficult. Efficient methods to address this in hardware can be found in (Bolić, 2004). More recently, a Single Instruction Multiple Data (SIMD) architecture that uses N processors to process M particles has been presented in (Medeiros et al., 2008). Another interesting implementation strategy uses General Purpose Graphics Processing Units (GP-GPUs). In (Liu et al., 2010), the authors explored the implementation of a multi-cue face tracking algorithm on dedicated processors and demonstrated the efficiency of two parallel computing techniques: a multicore parallel algorithm with a MapReduce thread model, and a GPU-based speedup approach.

For human tracking, (Anastasios and Nikolaos, 2012) apply the Service Oriented Architecture (SOA) framework to the scheduling of complex industrial workflows, in particular a real-time, self-initialized human tracker based on a particle filter under complex observation conditions. During the exploration of the unknown environment, the tracker maintains a set of hypotheses about the last position of the tracked object and about the different objects around it that have similar colors and shapes. The input for updating the tracker position comes from the various camera sensors placed in critical parts of the plant (Nissan Iberica automobile construction company). Moreover, the estimate of the position of the tracked object can be updated based on the data collected up to that point in time.

3 FACE TRACKING WITH

PARTICLE FILTER

3.1 Particle Filter

Our goal is to track the face through a video sequence. Tracking objects in video involves modeling non-linear and non-Gaussian systems. The particle filter can be employed within a probabilistic framework which formulates the tracking problem as a Hidden Markov Model (HMM).

The state of the face at time t is denoted $s_t$, which represents the unobserved state (hidden position) of the object, and its history is $S = \{s_0, \ldots, s_t\}$. Similarly, the image features of the temporal data sequence at time t are denoted $z_t$, with history $Z = \{z_0, \ldots, z_t\}$.

From a Bayesian perspective, the tracking problem is to recursively compute the posterior state density $p(s_t \mid z_{0:t})$ of the state $s_t$ at time t, taking different values, given the data $z_{0:t}$ up to time t (using the process density $p(s_t \mid s_{t-1})$ and the observation density $p(z_t \mid s_t)$). To sum up, the model is described by:

• $p(s_0)$ for t = 0

• $p(s_t \mid s_{t-1})$ for t > 0

• $p(z_t \mid s_t)$ for t > 0

Given the face model, the tracking algorithm consists of four main steps:

1) Sampling step (generation of new particles), in which M particles $\hat{s}_t^{m}$, for $m \in \{1, \ldots, M\}$, are generated from the old sample set $s_{t-1}^{m}$ using an importance function:

$$\hat{s}_t^{m} \sim \pi(s_t \mid s_{0:t-1}^{m}, z_{0:t}) \qquad (1)$$

2) Weight measurement, which assigns an importance weight $\omega_t^{m}$ to each newly generated sample based on the received observation. This step is the most computationally intensive and generally involves the computation of transcendental (trigonometric and exponential) functions:

$$\omega_t^{m} \propto \omega_{t-1}^{m} \, \frac{p(z_t \mid \hat{s}_t^{m}) \, p(\hat{s}_t^{m} \mid s_{t-1}^{m})}{\pi(\hat{s}_t^{m} \mid s_{0:t-1}^{m}, z_{0:t})} \qquad (2)$$

3) State estimation, which obtains the final state vector of the face from the newly generated samples and their weights.

4) Resampling step, whose input is an array of the particle weights and whose output is an array of the indices of the particles that will be propagated forward.

The first three steps form the Sequential Importance Sampling (SIS) filter. The filter that performs all four operations is the Sampling Importance Resampling Filter (SIRF).
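The four steps can be sketched on a toy problem. The example below assumes a scalar random-walk state observed in Gaussian noise, samples from the process density, and uses plain multinomial resampling; these choices, the noise levels and the function name `sir_step` are illustrative assumptions, not the face tracker described in this paper.

```python
import math
import random


def sir_step(particles, weights, observation, rng, process_std=1.0, obs_std=1.0):
    """One SIR recursion for a scalar random-walk state observed in Gaussian noise."""
    m = len(particles)
    # 1) Sampling: propagate each particle through the process model.
    particles = [s + rng.gauss(0.0, process_std) for s in particles]
    # 2) Weight measurement: previous weight times the observation likelihood.
    weights = [
        w * math.exp(-0.5 * ((observation - s) / obs_std) ** 2)
        for w, s in zip(weights, particles)
    ]
    total = sum(weights)
    weights = [w / total for w in weights]
    # 3) State estimation: weighted sum of the particles.
    estimate = sum(w * s for w, s in zip(weights, particles))
    # 4) Resampling: draw indices in proportion to the weights, then reset the weights.
    indices = rng.choices(range(m), weights=weights, k=m)
    particles = [particles[i] for i in indices]
    weights = [1.0 / m] * m
    return particles, weights, estimate


rng = random.Random(1)
particles = [rng.uniform(-10.0, 10.0) for _ in range(500)]
weights = [1.0 / len(particles)] * len(particles)
for _ in range(10):
    # The same hypothetical measurement (5.0) is fed at every time step.
    particles, weights, estimate = sir_step(particles, weights, 5.0, rng)
print(round(estimate, 2))
```

Dropping step 4 from `sir_step` turns the recursion into SIS, in which the weights accumulate from one time step to the next.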

3.2 Tracking Application

The application is based on skin color detection, which is performed on the input image. A particle filter is then applied to track the image position $s_t$ of the region of interest (the face).

3.2.1 Skin Detection using Color

In the following, we present a conceptually simple approach to face detection. Skin color (Vezhnevets et al., 2003) is a distinguishing feature of human faces. In a controlled environment, skin detection can be an efficient way to locate faces in images. The first step of the proposed algorithm is to detect the human face in the first frame. Since we are interested in tracking, we assume that the target object has been detected using an intuitive method that finds a rectangular region around the face. The skin detection technique is summarized in Figure 1. This method uses thresholding in the RGB color space for skin detection, followed by a segmentation technique and region labeling to separate the face region. The detected object becomes the reference template for the first frame, and the template is updated in every frame.
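A minimal sketch of this detection chain (thresholding, region labeling, extraction of a rectangular region) is given here. The paper does not list its thresholds, so the RGB rule in `is_skin` is an assumed, commonly quoted heuristic; the 4-connected flood fill and the choice of the largest region are likewise assumptions made for illustration.

```python
from collections import deque


def is_skin(r, g, b):
    """Illustrative RGB rule; the thresholds are assumptions, not the paper's values."""
    return (r > 95 and g > 40 and b > 20
            and max(r, g, b) - min(r, g, b) > 15
            and abs(r - g) > 15 and r > g and r > b)


def largest_skin_region(image):
    """Return the bounding box (x, y, width, height) of the largest skin region, or None."""
    height, width = len(image), len(image[0])
    # Thresholding: binary mask of skin-colored pixels.
    mask = [[is_skin(*image[y][x]) for x in range(width)] for y in range(height)]
    seen = [[False] * width for _ in range(height)]
    best_size, best_box = 0, None
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Labeling: collect one 4-connected region with a breadth-first flood fill.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            size, xmin, xmax, ymin, ymax = 0, x0, x0, y0, y0
            while queue:
                x, y = queue.popleft()
                size += 1
                xmin, xmax = min(xmin, x), max(xmax, x)
                ymin, ymax = min(ymin, y), max(ymax, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if size > best_size:
                best_size = size
                best_box = (xmin, ymin, xmax - xmin + 1, ymax - ymin + 1)
    return best_box


# Tiny synthetic image: blue background, one 3x2 skin-colored block, one isolated pixel.
background, skin = (0, 0, 255), (200, 120, 90)
image = [[background] * 6 for _ in range(6)]
for y in (2, 3):
    for x in (1, 2, 3):
        image[y][x] = skin
image[0][5] = skin
print(largest_skin_region(image))  # (1, 2, 3, 2)
```

The returned rectangle plays the role of the reference template for the first frame.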

[Figure 1 (block diagram). Block labels: Original image, RGB colour image, Threshold, Binary image, Image segmentation, Labelling image regions, Separate regions, Spatial extraction, Final labelling, Face region.]

Figure 1: Face detection using skin color.

3.2.2 Color based Tracking Approach

We want to apply a particle ﬁlter in a color model

based framework. This system depends on the deter-

ministic search of a window (particle), whose color

content matches a reference color model. To model

[Figure 2 (block diagram). Block labels: Image sequence (Frame 1, Frame 2, Frame 3, Frame 4, ..., Frame N); Generation of particles; Evaluation (weight of each particle based on the Kullback-Leibler divergence, computed with K-nearest neighbors: D_KL1, D_KL2, D_KL3, ..., D_KLN); Estimation (select the location of the target as the particle with maximum weight); Resample; Measurement process.]

Figure 2: Particle filter scheme.

the target using color information, we pick M rectangular regions $\{s_t^{1}, \ldots, s_t^{M}\}$ within the object to be tracked. Each rectangle $s_t^{m}$ is represented by the mean (h, s, v) color of the pixels within the region (other color spaces can be considered similarly). Since we use the SIRF for our implementation, we obtain the optimal importance distribution, which is given by:

$$\pi(s_t \mid s_{0:t-1}^{m}, z_{0:t}) = p(s_t \mid s_{t-1}^{m}, z_t) \qquad (3)$$

Consequently, the recurrence relations (1) and (2) are simplified and form the basis for the optimal Bayesian solution:

• For the m-th sample:

$$\hat{s}_t^{m} \sim p(s_t \mid s_{t-1}^{m}) \qquad (4)$$

• For the m-th sample, the weight assignment equation is:

$$\omega_t^{m} = \omega_{t-1}^{m} \, p(z_t \mid \hat{s}_t^{m}) \qquad (5)$$

Figure 2 outlines one iteration of the particle filter algorithm. The performance of the filter has been tested on a rolling ball sequence. The current frame (frame 2 in this example) of the sequence is loaded, and M samples are taken using the (x, y) position of the upper-left corner stored in each particle.

In the first iteration, the samples are randomly generated from a uniform posterior probability density. In the measurement process, M windows are captured from the image using the coordinates given by the previous sampling stage. The computation of the particle weights is based on distinctive features that can be followed from frame to frame around the region of interest, instead of performing an exhaustive search over the whole image. To improve the weight computation, we have chosen to estimate the Kullback-Leibler divergence using the K-th Nearest Neighbor (KNN) framework. The particle with the maximum weight, $s^{\max}$, is selected as the best candidate for the state of the system in the iteration. In order to track moving objects efficiently, we perform a resampling stage (see Figure 2) evaluating previous particle weights

and concentrating particles around the most proba-

ble states, discarding those with lower weights. A

detailed description of the particle ﬁlter algorithm is

presented in the following sections.

3.2.3 Face Model

We use a rectangular region to describe the face features for tracking. In our method, the face model is defined by:

= α

+ α

(6)

where α

, α

and α

are the conﬁdence values

of the 2D position (x, y) and color space (H, S, V) respectively. These values can be determined empirically by the user. The combination of the color information and the evolution of the 2D position achieves excellent performance in terms of speed and accuracy.

To define the tracking problem, we consider the state sequence $s_t$ of a target, given by $s_t = \{X_t, Y_t, W, H\}$, where $s_t$ is a rectangle which represents the region of interest (ROI), $(X_t, Y_t)$ is the position of the upper-left corner of the rectangle, and W and H are the width and the height of the rectangle respectively. For subsequent frames, the tracking algorithm confines its search space to an area centered on the location found in the previous frame. The implementation details are described in the following section.

3.2.4 Evolution and Observation Model

We consider a dynamic system represented by the stochastic process $s_t$, whose temporal evolution is given by the state equation (7); face dynamics are modeled as a first-order process:

$$s_t = s_{t-1} + \Omega_{t-1} \qquad (7)$$

where $\Omega_{t-1}$ denotes Gaussian noise.
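As a small illustration of the first-order model (7), the sketch below perturbs the position of each rectangle with zero-mean Gaussian noise. Keeping the width and height fixed, and the standard deviations `sigma_x` and `sigma_y`, are assumptions made for this example; the paper does not state the noise parameters.

```python
import random


def propagate(particles, rng, sigma_x=4.0, sigma_y=4.0):
    """Apply s_t = s_{t-1} + noise to each (x, y, w, h) particle; the size stays fixed."""
    return [
        (x + rng.gauss(0.0, sigma_x), y + rng.gauss(0.0, sigma_y), w, h)
        for (x, y, w, h) in particles
    ]


rng = random.Random(0)
previous = [(100.0, 50.0, 32, 32)] * 1000   # hypothetical particle set at time t-1
moved = propagate(previous, rng)
mean_x = sum(p[0] for p in moved) / len(moved)
print(round(mean_x, 1))
```

Because the noise has zero mean, the particle cloud stays centered on the previous location while spreading out to cover possible motion.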

Since we use the SIRF for our implementation, the observation model is used to measure the weight of every newly generated sample (it combines two pieces of information, based on color features and on movement features). We want to estimate the state vector $s_t$ at discrete times with the help of system observations, which are realizations of the stochastic process $z_t$ governed by the measurement equation. The observations are available at discrete times according to:

$$p(z_t \mid s_t) \approx \exp(-\mu D_{KL}) \qquad (8)$$

where $D_{KL}$ is the Kullback-Leibler divergence between the reference and the target frames. The aim of the likelihood distribution $p(z_t \mid s_t)$ is to determine, in successive frames, the region which best matches the reference in terms of a similarity measure. This similarity between the template and the current image can be measured by the distance between them. As we will see later, this distance is expressed from the samples using the K-th Nearest Neighbor (KNN) framework.
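The mapping from divergences to particle weights in equation (8) can be sketched as follows. The value of µ and the normalization step are assumptions for illustration; subtracting the smallest divergence before exponentiating is a common numerical precaution, not something stated in the paper.

```python
import math


def weights_from_divergences(divergences, mu=1.0):
    """Map divergences D to normalized weights proportional to exp(-mu * D)."""
    # Subtracting the smallest divergence avoids underflow without changing the ratios.
    d_min = min(divergences)
    raw = [math.exp(-mu * (d - d_min)) for d in divergences]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical divergences of three candidate windows from the reference template.
w = weights_from_divergences([0.2, 1.2, 3.0], mu=2.0)
print([round(x, 3) for x in w])
```

The window with the smallest divergence receives the largest weight, and a larger µ makes the weighting more selective.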

3.2.5 The K-th Nearest Neighbor (KNN)

Classically, the distance can be a distance between color histograms or, similarly, between probability density functions (PDFs). For example, the Bhattacharyya distance has been used for tracking (Johncy Rani and Suja Priyadharsini, 2010). Another widely used similarity measure is the Kullback-Leibler distance (Boltz et al., 2009):

$$D_{KL}(T, R) = \int f_T(s) \log \frac{f_T(s)}{f_R(s)} \, ds \qquad (9)$$

where $f_T$ and $f_R$ denote the PDFs of the target region T and the reference region R. Distance (9) can be decomposed as follows:

$$D_{KL}(T, R) = \int f_T(s) \log f_T(s) \, ds - \int f_T(s) \log f_R(s) \, ds = -H(T) + H_{\times}(T, R) \qquad (10)$$

where H is the differential entropy and $H_{\times}$ is the cross entropy, also called relative entropy or likelihood.
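The decomposition (10) can be checked numerically on a discrete analogue, where the integrals become sums. The two three-bin distributions below are arbitrary values assumed for the check.

```python
import math


def entropy(p):
    """Discrete entropy H(p) = -sum p log p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Discrete cross entropy H_x(p, q) = -sum p log q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """Discrete Kullback-Leibler divergence sum p log(p / q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


f_t = [0.5, 0.3, 0.2]   # hypothetical target distribution
f_r = [0.4, 0.4, 0.2]   # hypothetical reference distribution
lhs = kl_divergence(f_t, f_r)
rhs = -entropy(f_t) + cross_entropy(f_t, f_r)
print(abs(lhs - rhs) < 1e-12)
```

Both sides agree up to floating-point rounding, which is the identity the KNN estimator of the next paragraphs relies on.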

In our approach, the reference frame and the target frame are first selected from the video frames. Then the particular region of interest (ROI) is selected from the reference frame and its (Y, U, V) colour features are extracted. Next, the colour and geometric features are extracted from the target frame. Finally, the Kullback-Leibler divergence is used as the similarity measure between the reference and the target frames.

In this paper, we propose to compute the Kullback-Leibler distance between high-dimensional PDFs using the K-th Nearest Neighbor (KNN) framework. Since the Kullback-Leibler distance decomposes as in Eq. (10), the KNN-based expression of the Kullback-Leibler distance proposed for ROI tracking is the difference between a cross entropy and a differential entropy (Bolić, 2004) (see Eq. (11)); the KNN estimate of this distance is given by:

$$D_{KL}(T, R) = H_{\times}^{KNN}(T, R) - H^{KNN}(T) = \log\frac{|R|}{|T| - 1} + \frac{d}{|T|} \sum_{s \in T} \log\frac{\rho_k(R, s)}{\rho_k(T, s)} \qquad (11)$$

where $\rho_k(R, s)$ and $\rho_k(T, s)$ are the distances to the K-th nearest neighbor of s in R and T respectively, excluding the sample located at s if any, |R| and |T| are the numbers of samples in R and T, and d is the dimension of the feature space.
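A brute-force sketch of the estimate in equation (11) is given here for feature vectors stored as tuples. It assumes that no two samples coincide (a zero distance would make the logarithm undefined), excludes the query sample only when searching inside T, and uses an O(|T|(|T| + |R|)) neighbor search; the function names and the Gaussian test data are illustrative assumptions.

```python
import math
import random


def kth_neighbor_distance(point, samples, k, skip_index=None):
    """Euclidean distance from `point` to its k-th nearest neighbor in `samples`."""
    distances = sorted(
        math.dist(point, other)
        for index, other in enumerate(samples)
        if index != skip_index
    )
    return distances[k - 1]


def knn_kl_divergence(target, reference, k=1):
    """KNN estimate of D_KL(T, R) for two lists of d-dimensional feature vectors."""
    d = len(target[0])
    n_t, n_r = len(target), len(reference)
    log_ratio_sum = 0.0
    for index, s in enumerate(target):
        rho_r = kth_neighbor_distance(s, reference, k)
        # The sample itself is excluded when searching inside T.
        rho_t = kth_neighbor_distance(s, target, k, skip_index=index)
        log_ratio_sum += math.log(rho_r / rho_t)
    return math.log(n_r / (n_t - 1)) + d * log_ratio_sum / n_t


rng = random.Random(0)
target = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
near = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]   # same distribution
far = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(200)]    # shifted distribution
d_near = knn_kl_divergence(target, near)
d_far = knn_kl_divergence(target, far)
print(d_near < d_far)
```

In a tracker, each feature vector would combine geometric and radiometric components, so that the estimate is small only when similar colors appear at similar locations.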

[Figure 3 (plot). Axis label: CSW.]

Figure 3: Systematic resampling with non-normalized weights (M=5).

3.2.6 Systematic Resampling in SIR

The particles are resampled according to their importance weights to generate an unweighted particle set and avoid degeneracy. This is done by replicating particles with high importance weights and discarding particles with low importance weights, so as to obtain a predefined number of particles.

Standard resampling algorithms include residual resampling (RR), branching corrections (Crisan et al., 1999) and systematic resampling (SR) (Bolić, 2004). The algorithm proposed in this paper uses systematic resampling, the most commonly used resampling algorithm for PFs (Maskell and Gordon, 2001). The resampled particles are drawn in proportion to the weight distribution and replace the original set. The SR concept for a PF that uses 5 particles is shown in Figure 3. First, the cumulative sum of weights (CSW) of the sampled particles is computed. Then, as shown on the y axis of the graph, a function u(m), called the resampling function, is systematically updated and compared with the CSW of the particles. The corresponding particles are replicated to form the resampled set, which in this case is {x(0), x(0), x(3), x(3), x(3)}. This method eliminates particles with low weights and keeps more particles in the more probable regions.
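Systematic resampling with non-normalized weights can be sketched as follows. The weights in the usage example are assumptions chosen so that the output matches the resampled set quoted above; they are not read from Figure 3.

```python
import random


def systematic_resample(weights, rng):
    """Return the indices of the particles kept by systematic resampling."""
    m = len(weights)
    total = sum(weights)            # the weights need not be normalized
    step = total / m
    u = rng.uniform(0.0, step)      # single random offset, then a regular comb
    indices = []
    cumulative = weights[0]         # running cumulative sum of weights (CSW)
    i = 0
    for _ in range(m):
        # Advance to the first particle whose CSW reaches the current threshold u.
        while u > cumulative and i < m - 1:
            i += 1
            cumulative += weights[i]
        indices.append(i)
        u += step                   # systematic update of the resampling function
    return indices


print(systematic_resample([0.4, 0.0, 0.0, 0.6, 0.0], random.Random(0)))  # [0, 0, 3, 3, 3]
```

Only one random number is drawn per resampling pass, which keeps the step cheap and makes the number of copies of each particle differ by at most one from its expected value.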

3.3 The Proposed Particle Filter

Algorithm

Initialization. In most object tracking systems, the initialization algorithm is performed only at the beginning of tracking and is a process independent of the tracking process. In this tracking system, we have developed an algorithm that achieves automatic initialization, and the detected face is used as the reference face model. We assume in our approach that the images are recorded with a stationary camera.

Algorithm Description. We can now specify the algorithm in detail as follows. The face tracking algorithm is separated into two main stages: automatic initialization and particle filter tracking. The principle of the proposed algorithm is described in detail below.

1. Initialization (automatic):

Reference face template updating.

2. Particle filter tracking:

Using the cumulative measurements up to time t, $Z = (z_0, \ldots, z_t)$, our aim is to estimate $[s_t^{m}, \omega_t^{m}]_{m=0}^{M} = \mathrm{Function}([s_{t-1}^{m}, \omega_{t-1}^{m}]_{m=0}^{M}, z_t)$.

• Simulate M independent and identically distributed random samples $[s_t^{m}, \omega_t^{m}]_{m=0}^{M}$ (randomly chosen particles that represent the posterior distribution, obtained by generating uniformly distributed random numbers $\delta_x \in \left]-0.5, 0.5\right[$ and $\delta_y \in \left]-0.5, 0.5\right[$), where $\omega_t^{m}$ are the associated weights and M is the number of particles. Given the observed data $z_t$ at time t, for each particle m = 0...M do:

• Calculate face model.

• Calculate Euclidean distance between the sam-

ple feature vector P

and the reference feature

vector P

• Calculate the Kullback-Leibler distance.

• Weight measurement: for the m-th particle, we obtain its weight $\omega_t^{m}$ from a Kullback-Leibler similarity function, as shown in Equation (10). The final weight of the m-th particle is therefore:

$\omega_t^{m} \approx \exp(-\mu D_{KL})$

3. Estimating state parameters at time step: calcu-

late mean position of each target using a weighted

average of the particles.

4. Resampling step: resample $[s_t^{m}, \omega_t^{m}]_{m=0}^{M}$ to obtain a new set of particles $[s_t^{m}]_{m=0}^{M}$ (using the above-mentioned resampling algorithm).
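The state estimation of step 3 (a weighted average of the particles) can be sketched as follows. Representing each particle as an (x, y, w, h) tuple is an assumption made for this example.

```python
def estimate_state(particles, weights):
    """Weighted average of (x, y, w, h) particles; the weights need not be normalized."""
    total = sum(weights)
    dims = len(particles[0])
    return tuple(
        sum(w * p[d] for p, w in zip(particles, weights)) / total
        for d in range(dims)
    )


# Two hypothetical particles; the second one is three times more likely.
print(estimate_state([(0, 0, 10, 10), (10, 20, 10, 10)], [1.0, 3.0]))  # (7.5, 15.0, 10.0, 10.0)
```

This step needs the weights of all particles, which is why it is one of the stages that cannot be distributed across processors without a final gathering of partial sums.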

4 PARALLEL PARTICLE FILTER

ALGORITHM

4.1 Homogeneous Network of

Communicating Processors

Due to the increasing complexity of MPSoC architec-

tures, software and hardware designers as well as sys-

tem architects are facing completely new challenges.

Optimization of interconnects among processors and

memories becomes important as multiple processors

and memories can be integrated on a MPSoC since

it may target multiple objectives: application performance, power consumption/energy, temperature, small chip area, etc. Consequently, high-performance embedded design must aim for a good balance between hardware constraints and performance constraints. In addition, developing processor network systems tailored to a particular application domain is critical and time-consuming if high-performance customized solutions are to be achieved. The effectiveness of such approaches largely depends on the availability of an ad hoc design methodology. Our goal was to limit the architecture exploration phase in order to reduce design time and to allow fast prototyping.

Our research interests are based mainly on fast prototyping tools which enable the parallelization of real-time signal and image processing algorithms on a homogeneous network of communicating processors. In this paper, we propose a new design methodology optimized under performance constraints. Based on the multi-processor concept, our approach proposes an original design flow for the fast prototyping of image processing applications on an MP-SoC architecture. Our HNCP methodology is based on two essential concepts. The first is the derivation of a generic architecture based on an HNCP (Homogeneous Network of Communicating Processors). The second is the parallelization of the sequential code on the different soft cores, performed using specific communication functions based on the parallel skeleton concept, such as data, task and flow parallel skeletons. We believe that our methodology provides several benefits, such as improved parallelism of tasks (or data) and a fast and valid mapping of software models to the hardware architecture.
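The idea of a data-parallel skeleton (split the data, process the parts concurrently, merge the results) can be illustrated on a workstation with a thread pool. This is only an analogy: the HNCP nodes exchange messages through communication functions written in C, and the function `data_parallel_map` below is a hypothetical name, not part of the CubeGen library.

```python
from concurrent.futures import ThreadPoolExecutor


def data_parallel_map(worker, items, num_workers):
    """Split `items` into contiguous chunks, process them concurrently, merge in order."""
    if not items:
        return []
    chunk = -(-len(items) // num_workers)  # ceiling division
    chunks = [items[i:i + chunk] for i in range(0, len(items), chunk)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Each task plays the role of one processing node working on its own chunk.
        partial_results = list(pool.map(lambda part: [worker(x) for x in part], chunks))
    # Gather: concatenate the partial results, preserving the original order.
    return [y for part in partial_results for y in part]


print(data_parallel_map(lambda x: x * x, list(range(10)), 4))
```

For the particle filter, `items` would be the particles and `worker` the weight computation of one particle, while state estimation and resampling would run after the gather.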

4.1.1 The Proposed Multi-processors System on

Chip Design Methodology

The proposed methodology presents a generic MP-

SoC design ﬂow dedicated to the fast prototyping of

complex image processing. This approach is based

on concepts of HNCP and parallel programming with

the use of speciﬁc communication functions. In the

following, we present the generic HNCP architecture

and how to conﬁgure it.

The HNCP (Homogeneous Network of Communicating Processors). Different models of parallel architecture exist; we chose the MIMD-DM model (Multiple Instruction Multiple Data with Distributed Memory). Communications between processors use the message passing communication model. We chose this architecture because of its ability to execute a large range of parallel schemes efficiently (data and task parallelism, pipeline). The interconnection network is a static network with a regular hypercube topology. This hypercube topology eases scalability (when the number of processors is doubled, the maximum distance D between two processors and the number of links per node each increase by only 1) and routing, which can be calculated using a simple combinatorial function.
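The paper does not spell out its routing function. One common choice for hypercubes, shown here as an assumption, is dimension-ordered routing: node labels are binary numbers, neighbors differ in exactly one bit, and the XOR of the source and destination labels gives the dimensions that remain to be crossed.

```python
def hypercube_route(source, destination):
    """Return the node sequence of a dimension-ordered route in a hypercube."""
    path = [source]
    current = source
    diff = source ^ destination     # set bits mark the dimensions still to cross
    bit = 0
    while diff:
        if diff & 1:
            current ^= 1 << bit     # cross one link along this dimension
            path.append(current)
        diff >>= 1
        bit += 1
    return path


def hypercube_distance(a, b):
    """Number of hops between two nodes: the Hamming distance of their labels."""
    return bin(a ^ b).count("1")


print(hypercube_route(0b000, 0b101))  # [0, 1, 5]
print(hypercube_distance(0b000, 0b111))  # 3
```

In a hypercube of dimension n (2^n nodes), the largest possible distance is n, which is why doubling the number of processors adds only one hop and one link per node.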

This architecture is homogeneous. As shown in Figure 4, every node comprises identical components (a soft processor with local memory for application software and data storage, and a communication device). All components can be chosen from a library of available custom components or commercial IP. Depending on the application, an architecture derived from the generic architecture may differ from another in the options and parameters relating to the processor (optional arithmetic units, implementation options), to the memory (amount of local memory, size of potential buffers) and to the interconnection network (type of link between processors, number of nodes).

Architecture Generation and HNCP Configuration via the CubeGen Tool. In order to reduce the time and effort needed to design a network of communicating processors, we have developed a tool called CubeGen (see Figure 4). It automates the creation of the configuration file for the Xilinx Embedded Development Kit (EDK). This .mhs file contains the description of the system with the connections between the different IPs. The designer specifies, via a graphical interface, the parameters chosen for the network: network dimension, MicroBlaze parametrization, memory size allocated to each processor, type of communication link, and whether or not to use the special IP for I/O (a VHDL block designed to control the I/O directly from the video flow).

CubeGen generates the high-level description of the HNCP in a few seconds. The designer then launches the synthesis of the system for a specific target to check whether this configuration of the HNCP can be implemented. This methodology fits the concept of fast prototyping on SoPC: the designer quickly obtains an architecture tailored to the target application. Moreover, CubeGen provides a library of specific lightweight communication functions, matched to the chosen architecture configuration, that facilitate the conversion of a sequential algorithm into parallel C code.

Proposed Design Flow. The proposed design flow (shown in figure 5) takes as input a sequential algorithm written in C. Using the CubeGen tool, a Homogeneous Network of Communicating Processors (HNCP) is automatically generated from a set of available IPs.

VISAPP 2012 - International Conference on Computer Vision Theory and Applications

Figure 4: GUI of the CubeGen tool. [Screenshot of the CubeGen interface; labelled elements: communication layers, soft processors, memory, details of each node, .mhs files.]

The parallelization process is based on the idea that most parallel applications are built upon a limited number of recurring parallelization schemes (called parallel skeletons). As already mentioned, the CubeGen tool generates specific lightweight communication functions that are tuned to the network configuration (number of processors, communication links, parallelization scheme...). With these communication functions, the designer can convert the sequential algorithm into parallel C code. The whole system is then instantiated on the Xilinx SoPC platform (ISE/EDK tools). If the HW architecture does not meet the area constraints, a first loop allows the HNCP to be reconfigured with CubeGen. Otherwise, the whole system (SW and HW) can be tested on board or simulated (using abstraction models for co-simulation). If this final system does not meet the timing constraints, the parallel code and/or the architecture configuration can be refined in order to converge to a satisfactory solution.

4.1.2 Parallel Programming

Today's challenge is to provide high-level programming concepts without sacrificing efficiency. However, the programming level is still rather low, and programmers have to manage low-level communication problems such as deadlocks. Moreover, the program is split into a set of processes which are assigned to the different processors. As described in the previous section, the proposed approach tries to raise the abstraction level of parallel programming and to overcome these issues. The designer can easily and quickly define an HNCP suited to the application, but the different soft cores still have to be programmed. To handle the parallel programming and communication aspects, we have developed a set of specific skeletons and their associated communication functions. They match the network topology and ease the designer's work (abstraction level). From a specific network configuration (size of the hypercube and communication links), a library of functions is automatically generated by CubeGen (see figure 5) for all skeletons. The mapping and the scheduling of the skeletons are static and are performed at compile time. In this paper, we investigate the Farm parallelization scheme.

Figure 5: The proposed design flow. [Diagram blocks: Sequential Algorithm; Configuration choice; CUBEGEN; HNCP description (.mhs); Available library of communication functions (.c); Parallel Skeletons; Architecture Generation; Parallel Programming; Synthesis (ISE/EDK); Compilation (GCC); Simulation, Validation (Modelsim).]

Parallel Skeleton Farm. In practice, we generate M particles in the region of high interest (i.e. around the Region of Interest, RoI) at each iteration. This search window varies from one frame to another. Therefore, we implement our tracking algorithm on dynamic data with the enhanced Farm skeleton provided by our skeleton library. A Farm skeleton consists of a master entity and multiple workers, and may use either a static or a dynamic load distribution (Poldner and Kuchen, 2008). Since our approach is based on a regular homogeneous architecture, we are interested in the case where the master sends new data to a slave whenever its previous work is completed. In a dynamic data-parallel implementation, the master decomposes the input data into smaller independent data sets and sends them to the workers. The workers process the data and send their results to the master, which merges them to obtain the final result, as depicted in figure 6.

Figure 6: Farm skeleton. [Diagram: a master sends data sets to the workers, which return partial results.]

Our skeleton library offers pre-implemented functions dedicated to the farm implementation: initialisation (init_farm), synchronisation (synchro) and work distribution (farm), as shown below. The designer can use these functions directly to shorten parallel programming.

init_farm();    // Initialization.
synchro();      // The master node sends synchronization pulses
                // to wake up the slave nodes.

// Nb_max_of_Data is the number of data items to be sent.
while (proc_retour < Nb_max_of_Data) {
    microblaze_disable_interrupts();
    if (image_comp > 0)
        image_maitre = image_comp;
    if (image_maitre == image_comp)
        image_comp--;
    microblaze_enable_interrupts();
    // Block the send or the receive until the slave executes the task
    // and delivers the result back to the master.

    compute_function_master();
    // The master itself processes the next data item.

    microblaze_disable_interrupts();
    // Wait for the request to send or to receive.

    farm();
    // Receive incoming data from, or send outgoing data to, a slave.

    microblaze_disable_interrupts();
}

4.2 Parallel Particle Filter Algorithm on HNCP

In our implementation of the face tracking algorithm, several independent parallel data items (particles) are generated. A straightforward way to apply the farm skeleton to this application is to associate one work element with each generated particle, since all the particles are independent.

In this way, the part of the algorithm that remains sequential (the sample generation and resampling steps) represents a minor part of the processing time of the sequential implementation (see Section 5). This allows an efficient parallel implementation according to Amdahl's law, which models the expected speedup of a parallelized implementation of an algorithm relative to its sequential version.

The application is distributed over the different processors as follows. The master process is mapped onto one processor and the worker processes are mapped onto the remaining processors. Each worker process executes the same processing code on different data. In other words, the master process generates a sequence of inputs (particles) and assigns each of them to one of several slaves. Each worker handles one particle at a given time t, in parallel with all the other workers, and together they produce a sequence of outputs (weights).

The proposed algorithm is executed on N processors. Since M particles are generated and the processing time is regular, each worker eventually processes M/N samples. For each sample, a processor computes the weight measurement, which involves three computations: the face model, the Kullback-Leibler distance and the importance weight. Figure 7 shows the parallel tracking algorithm scheme using the Farm skeleton. Finally, the sample weights from all processors are collected, and the sample parameter with the maximum weight is selected as the final estimate for the target face. Resampling requires knowledge of the sum of all particle weights; hence it cannot begin before the weights of all the particles have been calculated.

Figure 8 shows all the steps of the face detection and tracking methods (including the automatic skin detection step) in the FPGA implementation. The skin detection is performed sequentially by the master processor. The advantage of implementing this type of detector in hardware is that it requires very little memory (mainly 128 bytes of BRAM) and gives sensible results at a much lower computational cost. As seen earlier, we compute the importance weights of M image regions in parallel in order to identify the region corresponding to the face in the current frame. As already mentioned, the resampled set of particles is determined sequentially because of the data dependencies among the particles during this step of the algorithm.

Figure 7: Parallel face tracking using FARM (case of HNCP with 4 processors). [Diagram blocks: Input Image; Master; Slave1, Slave2, Slave3; Send_Slave, Send and Receive; Model Calculation, Dkl Calculation and Weight Calculation (three instances each); Output Image.]

Figure 8: FPGA face detection and tracking implementation. [Diagram blocks: Video Frame; Skin Detection; Face Tracking; Resampling; Tracking Window.]

5 EXPERIMENT IMPLEMENTATION AND COMPARISON

The experimental section is divided into two parts. First, we perform experiments demonstrating the properties of our tracking approach; second, we present results on publicly available sequences for comparison with other tracking approaches.

Given a video sequence centered on a human figure, each tracking task is initialized by manually marking the target object in the first frame. The performance (speed) depends on the size of the search region, which we define by enlarging the target region by one third in each direction (the model representations of all the particles are generated inside this region). In our experiments, a color and motion model has been used.

5.1 Results

In this section, we present the results of the implementation and a comparison of the proposed architectures instantiated using the CubeGen tool.

5.1.1 Sequential Implementation

We aim to demonstrate and evaluate the implementation of the proposed algorithm by tracking through a video sequence of a human-centered environment. The timing performance of the tracking algorithm was measured by processing input images of 384*288 pixels.

Figure 9: Human face tracking.

To implement the particle filter, M randomly generated samples consisting of rectangular regions are taken from the input image (Figure 9). Each particle m (m = 0...M) carries the {x, y} coordinates of its subwindow and a weight computed between the target color window in the current frame and the reference color window. The color window with the maximum weight s_max is chosen as the best candidate for the face. In our experiments in sequential mode, the number of search windows (particles) is set to 100 so that they are better concentrated around the true state. The size of the search window (face) varies among the different video sequences. We carried out several experiments on the sequential implementation. All these experiments were run on a standard 3 GHz PC with 3 GB of RAM. Table 1 shows the execution time required for each processing step of the algorithm and the number of search windows used. The execution time is the time needed to process a single frame. For the sequential implementation on the FPGA platform, Table 2 shows the timing statistics of the proposed face detection and tracking algorithm on a single MicroBlaze soft processor of our network.

Hence, the parallel implementation of the devel-

oped resampling algorithm is vital to the development

of faster and more efﬁcient real-time particle ﬁlters.

5.1.2 Parallel Implementation

We compare the implementations in terms of the number of particles and the speedup. In order to investigate the performance with different network sizes, we set the number of particles to 45 and vary the number of computing cores from 1 to 16 (dimension of HNCP D = 1...4, i.e. 2^D nodes). One can often expect, and frequently achieve, an improvement in performance by using far more particles. The results are shown in figure 10. The computation is for one person (one face) in the investigated frame. In each experiment, we vary the number of particle evaluations that can be processed in real time with the different HNCP sizes. We took the following measurements to show the variation caused by using different numbers of particles.

Table 1: Timing results / OS Windows XP 32 bits (384*288 pixels).

Samples                                100       50
Time of detection                      3.6 ms    3.6 ms
Time (each sample): Generate Sample    10 µs     10 µs
Time (each sample): Model              140 µs    140 µs
Time (each sample): Weight             445 µs    445 µs
Time of tracking                       65 ms     35 ms
Time of resampling                     180 µs    180 µs

Table 2: Timing results / HNCP with 1 processor (256*192 pixels).

Samples               15
Time of detection     8.7 ms
Time of tracking      150 ms
Time of resampling    4.4 ms

The performance of the proposed resampling algorithms is rarely close to the theoretical speedup, as can be seen from figure 11. The latency of the developed algorithm depends on the number of particles M.

Figure 12 summarizes the total resource utilization of the proposed architectures. The entire SIR filter, along with the sampling and importance computational units, was implemented on a Xilinx Virtex-6 XC6VLX240T board.

5.2 Results Analysis

Our results show that it is possible to achieve real-time tracking even when operating at relatively low clock frequencies (using the XC6VLX240T-1FFG1156 FPGA platform running at 200 MHz). From figure 11, we can observe that the number of particles M effectively increases with the size of the HNCP architecture. There is a significant gain in increasing the level of parallelism when M/N becomes small. As can be seen from the timing figure, the resampling step is a bottleneck in the SIRF execution, as it cannot be parallelized with other operations. Regarding the computation times, it should be mentioned that, according to table 2, the proposed tracking algorithm requires more than 93% of the total execution time, without the resampling and the detection steps.

Figure 10: FPGA particle number. [Chart: number of particles for HNCP(1), HNCP(4), HNCP(8) and HNCP(16).]

Figure 11: FPGA processing time results. [Chart: theoretical processing time (ms) and processing time obtained (ms) for HNCP(1), HNCP(4), HNCP(8) and HNCP(16).]

Figure 12: FPGA resource utilization of the target device (Xilinx xc6vlx240tff1156-1). [Chart: estimated utilization (%) of LUT and block memory for HNCP(1), HNCP(4), HNCP(8) and HNCP(16).]

The resampling time (including the time needed to select the target location as the particle with the maximum weight) accounts for less than 7% of the execution time, so, with respect to table 2, it does not significantly affect the total execution time. Consequently, it is possible not only to implement the major bottleneck of the algorithm (the computation of the color histograms and the particle weights) in a parallel manner suited to our HNCP architecture, but also to implement the non-parallelizable steps efficiently. The experimental results demonstrate that the parallel Farm scheme achieves a good speedup compared to the corresponding sequential algorithms. Furthermore, as resampling is applied at every iteration of the proposed algorithm, the previous particle weights can be evaluated. The final goal of this parallel implementation is to develop algorithms and architectures that reach the minimum execution time.

6 CONCLUSIONS

In this work, we have designed and implemented a preliminary real-time particle filter algorithm that uses our MP-SoC architecture (HNCP) to execute the algorithm's main performance bottleneck. Our strategy uses specific communication functions based on our parallel skeleton library to reduce the computational effort of the sequential evaluation (particle weights computation). Our experiments on different video sequences showed that the search for the Region of Interest (ROI) is accelerated, achieving real-time tracking with a relatively small number of particles. The experimental results show that the method tracks effectively and achieves a good speedup. This allows us to propose a particle filter framework for fast face tracking that achieves real-time performance using our HNCP architecture. As future work, we plan to implement the algorithm in a manner suitable for an FPGA-based intelligent camera. In particular, we aim to develop a robust self-localization approach for a mobile robot equipped with our smart camera architecture based on a Xilinx FPGA (a camera with sensors for high-resolution image acquisition, equipped with the HNCP architecture). It therefore seems important to find a proper way to report the results of applying particle filters to provide a mobile robot with autonomous capabilities.

REFERENCES

Anastasios, D. and Nikolaos, M. (2012). Visual understanding industrial workflows under uncertainty on distributed service oriented architectures. Future Generation Computer Systems, 28:605–617.

Bolić, M. (2004). Architectures for efficient implementation of particle filters. PhD thesis, Stony Brook University, New York.

Bolić, M., Athalye, A., Djuric, P., and Hong, S. (2004). Algorithmic modification of particle filter for hardware implementation. In Proceedings of the European Signal Processing Conference (EUSIPCO), Vienna, Austria.

Boltz, S., Debreuve, É., and Barlaud, M. (2009). High-dimensional statistical measure for region-of-interest tracking. IEEE Transactions on Image Processing, 18:1266–1283.

Chen, Z. (2003). Bayesian ﬁltering: From kalman ﬁlters to

particle ﬁlters, and beyond. Statistics, pages 1–69.

Crisan, D., Del Moral, P., and Lyons, T. (1999). Discrete filtering using branching and interacting particle systems. Markov Processes and Related Fields, 5(3):293–318.

De Bruijne, M. and Nielsen, M. (2004). Image segmentation by shape particle filtering. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04), Volume 3, pages 722–725, Washington, DC, USA. IEEE Computer Society.

Diaconis, P. (2003). Sequential monte carlo methods in

practice. Journal of the American Statistical Associa-

tion, 98:496–497.

Greg, W. and Gary, B. (1995). An introduction to the

kalman ﬁlter. Technical report, Chapel Hill, NC, USA.

Johncy Rani, T. and Suja Priyadharsini, S. (2010). Region

of interest tracking in video sequences. International

Journal of Computer Applications, 3(7):32–36.

Liu, K., Zhang, T., and Wang, L. (2010). A new par-

allel video understanding and retrieval system. In

ICME’10, pages 679–684.

Maskell, S. and Gordon, N. (2001). A tutorial on particle filters for on-line nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50:174–188.

Medeiros, H., Park, J., and Kak, A. (2008). A parallel im-

plementation of the color-based particle ﬁlter for ob-

ject tracking. In IEEE Computer Society Conference

on Computer Vision and Pattern Recognition Work-

shops, pages 1–8, Anchorage, AK.

Petrovskaya, A. and Thrun, S. (2009). Model based vehicle detection and tracking for autonomous urban driving. Autonomous Robots, 26(2–3):123–139.

Poldner, M. and Kuchen, H. (2008). On implement-

ing the farm skeleton. Parallel Processing Letters,

18(1):117–131.

Siéler, L., Dérutin, J., Damez, L., and Landrault, A. (2010). A generic MP-SoC design methodology for the fast prototyping of embedded image processing. In International Conference in Microelectronics (ICM), pages 104–107, Cairo. IEEE Computer Society.

Vezhnevets, V., Sazonov, V., and Andreeva, A. (2003). A

survey on pixel-based skin color detection techniques.

In PROC. GRAPHICON-2003, pages 85–92.

Xinyu, X. and Baoxin, L. (2005). Rao-blackwellised parti-

cle ﬁlter for tracking with application in visual surveil-

lance. In Proceedings of the 14th International Con-

ference on Computer Communications and Networks,

pages 17–24, Los Alamitos, CA, USA. IEEE Com-

puter Society.
