IMAGE PROCESSING FRAMEWORK FOR FPGAS
Introducing a Plug-and-play Computer Vision Framework for Fast Integration
of Algorithms in Reconfigurable Hardware
Bennet Fischer and Raul Rojas
Intelligent Systems and Robotics, Free University of Berlin, Arnimallee 7, Berlin, Germany
Keywords:
Image Processing, FPGA, Framework, Signal Processing, Embedded.
Abstract:
This paper presents a framework for computer vision tasks on Field Programmable Gate Arrays (FPGAs) which
allows rapid integration of vision algorithms by separating the framework from the vision algorithms. A vision
system can be created using a plug-and-play methodology. On an abstract level, several input and output
channels of the system can be defined. Commonly used image transformations are also modularized and can
be added to the inputs or outputs of an algorithm. Special input and output modules allow algorithms to be
integrated without any knowledge of the surrounding framework.
1 INTRODUCTION
Two-dimensional signal processing has long been a demanding problem in computer science.
Computer vision, as an application of it, is still a very active research area. With the steady improvement
of general computer hardware, many computer vision algorithms are able to run in real time on commodity
hardware. The importance of real-time implementations is growing as computer vision is starting
to be employed in everyday products like game consoles, mobile phones and driver assistance systems.
Most of these systems are by their nature embedded systems. However, many of the fundamental algorithms,
e.g. dense 3D reconstruction or optical flow estimation, still cannot be implemented in real time.
This applies to commodity PC hardware and in particular to embedded processors.
Recently, programmable graphics hardware has gained popularity in research for realizing real-time
implementations of demanding vision algorithms. This type of hardware can be programmed in a familiar
way using the high-level “C” language and thus requires only a short training period for new users. The
level of hardware abstraction is comparable to CPU programming. Memory access and input/output operations
are all mapped to easy-to-use operations. This makes graphics hardware attractive for quickly evaluating
algorithms. However, due to its high power consumption it is not well suited to embedded systems.
An established method for computationally intensive real-time implementations is the use of field
programmable gate array (FPGA) devices. FPGAs are generally better suited to the needs of embedded systems
due to their lower power consumption (Jin et al., 2009). This makes them attractive for applications outside
of research laboratories, e.g. in autonomous systems. A detailed comparison of general purpose
processors (GPP), graphics processors (GPU) and FPGAs is given in (Cope et al., 2009). The main
disadvantage of FPGA systems is, however, the high development effort. New users are faced with a steep
learning curve. The time to the first productive use of the device is usually long for two reasons:
The user has to learn a new programming language which describes hardware, not software.
Also, the tool chain for implementing the written programs is completely different from software tool
chains. High-level synthesis tools trying to bring the FPGA closer to the programming model of
traditional processors can help new users to develop algorithms faster than before (BDTI, 2010).
However, these tools are in most cases not affordable for researchers or small companies.
The hardware abstraction on FPGAs is poor, if not absent. In the case of no abstraction, the peripheral
hardware is simply wired to the input/output (IO) banks of the FPGA. All higher levels of
abstraction have to be provided by the user. One common concept of abstraction in the design of hardware
is the use of intellectual property (IP) cores. These cores hide the underlying complexity of
hardware peripherals and offer a more abstract interface to their functionality. Most FPGA vendors
offer IP cores for commonly used peripherals like Random Access Memory (RAM) or networking hardware.
However, the interface to these cores is still complex as they usually offer bus interfaces (PLB,
AXI, Wishbone, etc.). To access these kinds of cores, the user has to implement a bus participant
and thus needs to know the bus specification in detail. This is a non-trivial task and highly time-consuming.
In this paper we present a framework aimed at accelerating the development of FPGA vision algorithms.
The problems of FPGA development stated above, especially the latter one, are overcome by a high level
of peripheral hardware abstraction. This abstraction is specifically tailored to the needs of computer
vision algorithms.
The typical use case of this framework is the creation of a real-time capable prototype system. In contrast
to GPU-based real-time prototypes, the implementation of the algorithms is much closer to a production-ready
state. This allows the user to predict the overall cost and energy consumption of the system very precisely.
It is assumed that the algorithms have already been evaluated using a non-real-time implementation. This
reference implementation can then be ported to the FPGA and deployed with little effort. The user can
concentrate on developing the vision algorithms and is not forced to invest time in retrieving and passing
on the data.
The focus here is not to provide an overall high level of abstraction covering also the vision algorithm
itself. Instead, the framework allows production-ready HDL implementations of algorithms to be tested and
used without any infrastructural development overhead.
2 RELATED WORK
An FPGA co-processor framework is presented in (Kalomiros and Lygouras, 2008). Several vision algorithms
are evaluated using a commercial Simulink-to-HDL translator. Communication with the host PC is done via USB
and the data flow is organized by a soft processor. Being a co-processing system with no direct access to
the image data, it has a higher latency than a pre-processing system, but the possible range of applications
is broader.
A framework for verification of vision algorithms is presented in (van der Wal et al., 2006). Conceptually
similar to our approach, they use image pipelines to process the data. However, due to a crosspoint switch,
their processing entities can be connected at run-time, allowing high flexibility. This flexibility is
useful for hardwired Application Specific Integrated Circuits (ASICs), which cannot be reconfigured.
3 FUNDAMENTAL CONCEPTS
The framework itself consists of modules connected
by streams and a supervisor organizing the system
configuration and data flow. It is assumed that the
platform on which the framework is running consists of at least an FPGA, an external RAM and a
communication module to a workstation PC (Gigabit Ethernet, PCI Express, etc.), all interconnected by the
system bus.
A module encapsulates an arbitrary function. In the simplest form, a module operates on data from its input
streams and delivers the result on one or several output streams. More complex modules also interact,
besides the streams, with lower-level components like bus interfaces or hardware peripherals. However,
this complexity is hidden from the user, as only the stream interfaces are visible. Modules can be
instantiated and connected as hardware description language (HDL) entities in source files or, more
conveniently, via a graphical user interface (GUI) by dragging them into the system to be built.
A stream is a unidirectional data flow interface. Its most common use is to transfer pixel values. Note
that this interface is kept as simple as possible: synchronization happens only word-wise. All other
synchronization information has to be implicit, which means that the data format of a stream has to be
known a priori. This is in general true for image processing algorithms. The implicit synchronization
offers several advantages. First, due to the fixed input format, the module can process the data in a
static way. This eases the development and normally speeds up the implementation. Second, the module can
trust the format of the input data, so no error checking has to be done on it. This leads to a better
encapsulation of functionality, as an exceptional state is handled inside a module instead of being passed
between two modules. One prerequisite for this to work is that every module delivers correctly
synchronized data on its output streams. In most cases this is easier to achieve than format error
checking on the input streams in the case of explicit synchronization.
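To make the implicit-synchronization idea concrete, the following Python sketch models a stream in a software reference simulation: the frame format is fixed a priori when the stream is created, and the consuming "module" relies on that contract instead of checking the format of each word. The class and function names are illustrative only and are not part of the framework's HDL interfaces.

```python
import numpy as np

class PixelStream:
    """Behavioral model of a unidirectional, word-wise pixel stream.

    The frame format (width, height, bit depth) is agreed a priori;
    no format information travels with the data words themselves.
    """

    def __init__(self, width, height, bits=8):
        self.width, self.height, self.bits = width, height, bits

    def words(self, frame):
        """Yield the frame pixel by pixel in raster order."""
        assert frame.shape == (self.height, self.width)
        for value in frame.reshape(-1):
            yield int(value)


def invert_module(stream, words):
    """A 'module' that trusts the agreed format: no per-word format checks."""
    pixels = np.fromiter(words, dtype=np.int64,
                         count=stream.width * stream.height)
    return (2 ** stream.bits - 1) - pixels.reshape(stream.height, stream.width)


# Example: an 8-bit 4x3 test frame passed through the module.
src = PixelStream(width=4, height=3, bits=8)
frame = np.arange(12, dtype=np.uint8).reshape(3, 4)
result = invert_module(src, src.words(frame))
```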
The supervisor is a general-purpose soft processor with low speed requirements. It organizes the data flow
from the FPGA system to the host application on a per-packet basis. As the data packets are already
prepared by hardware components, the load on this processor is low. Another task of the supervisor is
to initialize and configure the hardware modules. For normal applications the user does not need to change
the software running on the processor. Drivers are bound to specific modules and, if necessary, inserted
into the software automatically.
4 CORE MODULES
As stated above, the basic building block of the frame-
work is a module. The modules are grouped into three
categories:
IO modules.
Processing modules.
Simulation/Debug modules.
4.1 IO Modules
4.1.1 Live Data Source
The live data source (LDSO) module offers a single
output stream from a hardware device. Typically it
delivers a pixel stream from a camera sensor.
4.1.2 Memory Data Source
The memory data source (MDSO) module takes data
from an external RAM device and transforms the data
to one or several output streams. It is connected to the
RAM controller via the system bus and to the super-
visor via a control interface. The bus interactions are
fully encapsulated by the module. By using this mod-
ule instead of the live data source as the algorithm
input, the system turns from a pre-processor into
a co-processor. Image data can be delivered by the
workstation PC, processed by the system and sent
back.
4.1.3 Synchronized Data Source
The synchronized data source (SDSO) takes up to
three equally formatted input streams and synchro-
nizes them to pixel accuracy. The synchronized data
is then sent on three output streams. If synchronization is not possible (due to buffer size restrictions),
the output channels are fed with pixels marked as invalid. This module is useful for transforming loosely
synchronized camera sensor input into pixel-synchronous streams.
4.1.4 Memory Data Sink
The memory data sink (MDSI) is the counterpart to
the memory data source. To the user it provides a
single stream input, to the system a bus connection
and to the supervisor a control interface. The user
can send data to the module via a stream, which is then written in order into the external RAM. The data
is packetized, addressed and sent via the system bus into the frame buffers in RAM. The supervisor is also
informed when new data has arrived in memory. This information can be used by the supervisor to transfer
the data to a workstation PC via Gigabit Ethernet or PCI Express. Already available scatter-gather direct
memory access (DMA) components allow the supervisor to transfer the data with zero-copy overhead.
4.2 Processing Modules
In the following, two examples of processing modules are given. These modules are part of the framework
and can be used to compose more complex functionality. Two examples of their usage will be shown later on.
4.2.1 Separable Convolution
The separable convolution (SC) module realizes a 2D finite impulse response (FIR) filter. Using this
module, many standard operations can be performed by adapting the filter coefficients. The coefficients
can be configured at build time.
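As a software illustration of what the SC module computes, the sketch below applies a 2D FIR filter as two 1D passes with fixed coefficients, here a 5-tap binomial (Gaussian-like) kernel chosen as an example; it is not the HDL implementation of the module.

```python
import numpy as np
from scipy.ndimage import convolve1d

# Fixed filter coefficients, analogous to coefficients set at build time.
KERNEL_1D = np.array([1, 4, 6, 4, 1], dtype=np.float32) / 16.0

def separable_filter(image):
    """2D FIR filter realized as a horizontal pass followed by a vertical pass."""
    rows = convolve1d(image.astype(np.float32), KERNEL_1D, axis=1, mode='nearest')
    return convolve1d(rows, KERNEL_1D, axis=0, mode='nearest')
```

With different coefficients, the same two-pass structure yields the smoothing and derivative filters used in the examples below.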
4.2.2 Geometric Image Transformation
The geometric image transformation (GIT) module
takes a pixel stream and performs an arbitrary geo-
metric transformation on the image. The transforma-
tion function can be changed in the running system. This module is typically used to remove lens
distortion or to rectify sets of image streams.
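The operation performed by the GIT module is, conceptually, a remapping by an inverse transformation: for every output pixel a source coordinate is computed and sampled. The Python sketch below illustrates this with nearest-neighbor sampling and a hypothetical mapping function standing in for a real undistortion or rectification map.

```python
import numpy as np

def remap(image, inverse_map):
    """Geometric transformation by inverse mapping with nearest-neighbor sampling.

    inverse_map(xd, yd) returns the source coordinate (xs, ys) for each
    destination pixel; samples outside the image are set to zero.
    """
    h, w = image.shape
    out = np.zeros_like(image)
    for yd in range(h):
        for xd in range(w):
            xs, ys = inverse_map(xd, yd)
            xi, yi = int(round(xs)), int(round(ys))
            if 0 <= xi < w and 0 <= yi < h:
                out[yd, xd] = image[yi, xi]
    return out

# Example: a pure translation as a stand-in for a real rectification map.
shifted = remap(np.full((8, 8), 255, dtype=np.uint8),
                lambda x, y: (x - 2.0, y + 1.0))
```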
4.3 Simulation/Debug Modules
4.3.1 Memory Data Source
The memory data source module can be re-purposed
as a debug module. By replacing a live data source
with it, the system can process artificial or static im-
ages.
4.3.2 File Data Source
The file data source (FDSO) is a non-synthesizable simulation module. It can be used as a test bench
during simulation to verify the functionality of a module. The FDSO reads data from image files or
comma-separated files and transforms the data into a stream. This stream can be connected to a module
under test as the input stimulus.
4.3.3 File Data Sink
Analogous to the memory modules, the file data sink (FDSI) is the counterpart of the FDSO. In a simulation,
the output stream of a module under test can be connected to the FDSI. The received data is written in
order into a file, making it available for post-simulation verification, e.g. a check against a reference
implementation. The module verification setup is visualized in figure 1.
5 EXAMPLE APPLICATIONS
The presented framework has been in use for one year at the Free University of Berlin inside the
autonomous car “Made in Germany”.
Automotive image processing requires low latency as well as low power consumption. These requirements
make the use of FPGA hardware attractive.
The image data is delivered by two CMOS cameras mounted behind the windshield, providing 768x500 high
dynamic range (HDR) images at 30 frames per second. The processed data is sent to a laptop via Gigabit
Ethernet for higher-level processing. This setup frees the laptop from the highly time-consuming low-level
vision algorithms.
The data flow graphs illustrating the examples
consist of modules visualized as boxes and streams
visualized as arrows.
5.1 Module Verification
Before being uploaded to the hardware, modules need to be verified. The framework supports verification
through the FDSO and FDSI modules. A basic verification setup is illustrated in figure 1. Both the
reference implementation and the hardware implementation receive the same stimuli and write their results
to comma-separated files. If the hardware implementation is behaviorally correct, the two result files
should be identical.
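The comparison itself can be as simple as checking that the two result files are identical; a minimal Python sketch, with hypothetical file names, is shown below.

```python
def results_match(reference_csv, hardware_csv):
    """Return True if the two comma-separated result files are identical."""
    with open(reference_csv) as ref, open(hardware_csv) as hw:
        return ref.read() == hw.read()

# Usage with hypothetical output files of the reference implementation and the FDSI:
#   results_match("reference_result.csv", "hardware_result.csv")
```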
5.2 Optical Flow
The first application example for the framework is
the estimation of the optical flow (Lucas and Kanade,
1981).
Figure 1: Module verification.
What follows is a brief overview of the Lucas-
Kanade algorithm:
As a first step, the spatial derivatives are calculated.
$$I_x(x, y) = \frac{I(x+1, y) - I(x-1, y)}{2}, \qquad I_y(x, y) = \frac{I(x, y+1) - I(x, y-1)}{2}$$
Next, the spatial gradient matrix G is determined, where w describes the size of an integration window.
This matrix is also called the structure tensor.
$$G = \sum_{x=p_x-w_x}^{p_x+w_x} \; \sum_{y=p_y-w_y}^{p_y+w_y} \begin{bmatrix} I_x^2(x, y) & I_x(x, y)\, I_y(x, y) \\ I_x(x, y)\, I_y(x, y) & I_y^2(x, y) \end{bmatrix}$$
By using the temporal derivative $\delta I$ and the spatial derivatives $I_x$, $I_y$, the image mismatch
vector $b$ is calculated.
$$\delta I(x, y) = I(x, y) - J(x, y)$$
$$b = \sum_{x=p_x-w_x}^{p_x+w_x} \; \sum_{y=p_y-w_y}^{p_y+w_y} \begin{bmatrix} \delta I(x, y)\, I_x(x, y) \\ \delta I(x, y)\, I_y(x, y) \end{bmatrix}$$
The following equation then gives the estimate of the optical flow $\eta$:
$$\eta = G^{-1} b$$
The matrix $G^{-1}$ is also known as the covariance matrix. The algorithm is explained in detail in
(Bouguet, 1999).
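For reference, the equations above translate directly into a non-real-time software implementation of the kind the framework assumes as a starting point. The Python sketch below estimates the flow for a single integration window; the pyramid and the iterative refinement of (Bouguet, 1999) are omitted, and all names are ours, not part of the framework.

```python
import numpy as np

def lk_flow_at(I, J, px, py, wx=7, wy=7):
    """Lucas-Kanade flow estimate eta = G^-1 b for one window centered at (px, py).

    I is the current frame, J the preceding frame (float arrays, same shape).
    Assumes the integration window lies entirely inside the image.
    """
    # Spatial derivatives by central differences, as defined above.
    Ix = (np.roll(I, -1, axis=1) - np.roll(I, 1, axis=1)) / 2.0
    Iy = (np.roll(I, -1, axis=0) - np.roll(I, 1, axis=0)) / 2.0
    dI = I - J  # temporal derivative delta I

    ys = slice(py - wy, py + wy + 1)
    xs = slice(px - wx, px + wx + 1)
    ix, iy, di = Ix[ys, xs], Iy[ys, xs], dI[ys, xs]

    # Structure tensor G and image mismatch vector b, summed over the window.
    G = np.array([[np.sum(ix * ix), np.sum(ix * iy)],
                  [np.sum(ix * iy), np.sum(iy * iy)]])
    b = np.array([np.sum(di * ix), np.sum(di * iy)])
    return np.linalg.solve(G, b)  # eta
```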
As a prefiltering step the image pyramid is created
and written into the external RAM. The module struc-
ture of the prefilter is shown in figure 2.
Figure 2: Preprocessing modules forming a Gaussian pyramid: a live image is low-pass filtered four times
by a Gaussian kernel (SC). Each filter result is written to RAM (MDSI).
After the arrival of a frame from the prefilter, the supervisor triggers the MDSO module seen in figure 3.
The current frame I and the preceding frame J are simultaneously input into the optical flow estimator.
Note that the estimator has no dependency on the framework other than its stream interfaces.
Figure 3: Lucas-Kanade optical flow integration.
The visualized result of the optical flow estimator can be seen in figure 4. The colors denote the
direction of the flow and the intensity its speed. Reference colors are shown on the border of the image.
Figure 4: Top: Source image. Bottom: Optical flow result.
5.3 Stereo Vision
The second example for the framework is a module
for estimating the distance of objects by finding
corresponding image points in a stereo image pair.
As shown in figure 5, the geometric image transformation (GIT) module is used as a preprocessing step
to remove the lens distortion and rectify the images.
The transformed streams are then synchronized to
pixel accuracy by the SDSO module.
What follows is a brief description of the block
matching algorithm:
The cost C for the comparison of two blocks of
size w at the disparity d is defined by the sum of
squared differences.
$$C(x, y, d) = \sum_{x=p_x-w_x}^{p_x+w_x} \; \sum_{y=p_y-w_y}^{p_y+w_y} \left( L(x, y) - R(x-d, y) \right)^2$$
The disparity $D$ with the lowest cost over the search range $D_{\max}$ is the estimate of the disparity:
$$C_{\min}(x, y) = \min_{0 \le d \le D_{\max}} C(x, y, d) = C(x, y, D)$$
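A plain software reference for this block matching step, of the kind used to verify the hardware module, might look like the Python sketch below for a single pixel of the left image; the names and window sizes are illustrative, not taken from the framework.

```python
import numpy as np

def disparity_at(L, R, px, py, wx=4, wy=4, d_max=64):
    """Return the disparity D with the lowest SSD cost C(x, y, d) for (px, py).

    L and R are the rectified left and right images. Assumes the block and
    the whole search range lie inside both images.
    """
    ys = slice(py - wy, py + wy + 1)
    block_l = L[ys, px - wx : px + wx + 1].astype(np.int64)
    costs = []
    for d in range(d_max + 1):
        block_r = R[ys, px - wx - d : px + wx + 1 - d].astype(np.int64)
        costs.append(np.sum((block_l - block_r) ** 2))  # C(x, y, d)
    return int(np.argmin(costs))  # D = argmin over 0 <= d <= D_max
```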
Figure 5: Stereo vision integration: two live images (LDSO) are rectified (GIT) and synchronized (SDSO).
The streams are block matched and the disparity is written to RAM (MDSI).
The result of the depth estimation on a street scenario is visualized in figure 6. The color denotes the
disparity of the pixel. Occluded and low-textured areas are filtered out with a left-to-right check and a
confidence check regarding the uniqueness of the minimal cost. A sub-pixel disparity interpolation is also
performed, resulting in a ready-to-use disparity image.
Figure 6: Stereo vision result of a street scenario.
6 SUMMARY
In this paper we have presented a framework helping researchers to quickly evaluate production-ready
vision algorithms on FPGAs. The strict modularization of functionality and the minimal dependencies between
modules allow the user to quickly change the functionality of the whole system. Functionality can be
added or removed depending on the requirements of the application and the available chip area.
The system proved to fulfill the goals of fast algorithm integration and reliable operation in the
autonomous car “Made in Germany”. Future developments will target porting the higher-level algorithms
that interpret the preprocessed data to an embedded system. Examples of these algorithms are automatic
camera calibration and obstacle avoidance, currently running on a commodity laptop.
ACKNOWLEDGEMENTS
We would like to thank Robert Richter for his work
on implementing the Lucas-Kanade algorithm.
REFERENCES
BDTI (2010). The AutoESL AutoPilot High-Level Synthesis Tool.
http://www.bdti.com/MyBDTI/pubs/AutoPilot.pdf.
Bouguet, J. (1999). Pyramidal implementation of the Lucas Kanade feature tracker: Description of the
algorithm. Intel Corporation, Microprocessor Research Labs, OpenCV Documents, 3(2):1–9.
Cope, B., Cheung, P., Luk, W., and Howes, L. (2009). Per-
formance comparison of graphics processors to recon-
figurable logic: A case study. IEEE Transactions on
Computers, 59(4):433–448.
Jin, Q., Thomas, D., and Luk, W. (2009). Exploring
reconfigurable architectures for explicit finite differ-
ence option pricing models. In Field Programmable
Logic and Applications, 2009. FPL 2009. Interna-
tional Conference on, volume 54, pages 73–78. IEEE.
Kalomiros, J. and Lygouras, J. (2008). Design and eval-
uation of a hardware/software FPGA-based system
for fast image processing. Microprocessors and Mi-
crosystems, 32(2):95–106.
Lucas, B. and Kanade, T. (1981). An iterative image reg-
istration technique with an application to stereo vi-
sion. In Proceedings of the 7th International Joint
Conference on Artificial Intelligence (IJCAI), pages
674–679.
van der Wal, G., Brehm, F., Piacentino, M., Marakowitz,
J., Gudis, E., Sufi, A., and Montante, J. (2006). An
FPGA-based verification framework for real-time vi-
sion systems. Pattern Recognition, 2.