– The vertex processor for the manipulation of geometry data by means of vertex-
shaders.
– The fragment processor for pixel operations with pixel-shaders.
For calling functions on the GPU from the CPU, we use OpenGL [7] by SGI, a free,
standardized and vendor-independent graphics library.
3.1 Fragment Processor Programming
The GL_ARB_fragment_program extension allows the fragment processor to be
programmed in assembly language. This way of programming is very time-consuming
and does not scale well with improvements of the GPU. Therefore, for most of the
programs in our image processing library we used GLSL (OpenGL Shading Language),
a high-level language similar to C. GLSL was developed by 3Dlabs as part of the
GL_ARB_fragment_shader extension. Later the ARB (Architecture Review Board)
approved it as a standard. An extensive specification is available online [8].
A special feature of this language is that programs are compiled at run-time by
the driver of the GPU. This allows existing programs to benefit from new hardware
features or better compilers without being changed.
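As a rough illustration of run-time compilation, the following sketch keeps a GLSL fragment program as a plain string inside the C++ host code, which is the form in which it would be handed to the driver. The shader itself, the uniform name image, the identifiers kNormalizeRgbShader and NormalizeRgb, and the choice of normalization (each channel divided by the channel sum) are assumptions made for illustration and are not taken from our library. The CPU function mirrors the shader so that GPU results could be checked against it.

```cpp
#include <array>
#include <cassert>
#include <string>

// Hypothetical GLSL fragment program for RGB colour normalization.
// The host program would pass this string to the GPU driver at run-time,
// and the driver would compile it for the installed hardware.
const std::string kNormalizeRgbShader = R"(
uniform sampler2D image;  // assumed name of the input texture
void main() {
    vec3 rgb = texture2D(image, gl_TexCoord[0].st).rgb;
    float sum = rgb.r + rgb.g + rgb.b;
    // Black pixels are mapped to black to avoid a division by zero.
    gl_FragColor = vec4(sum > 0.0 ? rgb / sum : vec3(0.0), 1.0);
}
)";

// CPU reference of the same per-pixel operation (assumed definition:
// every channel is divided by the sum of the three channels).
std::array<float, 3> NormalizeRgb(const std::array<float, 3>& rgb) {
    assert(rgb[0] >= 0.0f && rgb[1] >= 0.0f && rgb[2] >= 0.0f);
    const float sum = rgb[0] + rgb[1] + rgb[2];
    if (sum <= 0.0f) return {0.0f, 0.0f, 0.0f};
    return {rgb[0] / sum, rgb[1] / sum, rgb[2] / sum};
}
```

Because only the source string is shipped with the application, a newer driver can recompile the same program with a better compiler without any change to the host code.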
4 Architecture Overview
Our OrontesNG low-level vision module (ONG vision module for short) was designed for
low-level image processing in real-time. Tasks such as Canny Edge Detection, Gaussian
Blur, RGB Colour Normalization or even simple conversion from RGB to HSV typi-
cally require little logic, but enormous computing power because they are processed
completely on the pixel level. To optimize performance, these operations have been
implemented on the GPU. Note that the GPU is normally used to process the
enormous amounts of graphics data of modern video games, where data flows
from the CPU to the GPU only. In our image processing application, the reverse
direction from the GPU to the CPU is important as well. The ONG vision module is a
C++ library with an API for image processing tasks like the ones mentioned above (see
Figure 3).
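To make the phrase "little logic, but enormous computing power" concrete, the following sketch shows a per-pixel RGB to HSV conversion on the CPU. It is a textbook formulation and not the code of the ONG vision module; the names Hsv and RgbToHsv and the value ranges are assumptions. The function contains only a few comparisons and divisions, but it must be evaluated once for every pixel of every frame, which is exactly the kind of workload the fragment processor handles in parallel.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical result type: h in degrees [0, 360), s and v in [0, 1].
struct Hsv {
    float h;
    float s;
    float v;
};

// Converts a single RGB pixel with components in [0, 1] to HSV.
Hsv RgbToHsv(float r, float g, float b) {
    assert(r >= 0.0f && r <= 1.0f);
    assert(g >= 0.0f && g <= 1.0f);
    assert(b >= 0.0f && b <= 1.0f);

    const float maxc = std::max({r, g, b});
    const float minc = std::min({r, g, b});
    const float delta = maxc - minc;

    Hsv out{0.0f, 0.0f, maxc};            // value is the largest channel
    if (maxc > 0.0f) out.s = delta / maxc;

    if (delta > 0.0f) {                   // hue is undefined for grey pixels
        if (maxc == r) {
            out.h = 60.0f * std::fmod((g - b) / delta, 6.0f);
        } else if (maxc == g) {
            out.h = 60.0f * ((b - r) / delta + 2.0f);
        } else {
            out.h = 60.0f * ((r - g) / delta + 4.0f);
        }
        if (out.h < 0.0f) out.h += 360.0f;  // wrap negative hues
    }
    return out;
}
```

Since no pixel depends on its neighbours, the conversion can run for all pixels independently.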
Though the ONG vision library is implemented as a singleton, it can process tasks
from an arbitrary number of threads in parallel. During initialisation, the maximum
image size must be specified so that internal buffers can be allocated.
After initialisation of the module, jobs can be generated and functions can be
called via Invoke(). These jobs are then executed by the GPU in FIFO order. Once
started, a job cannot be cancelled before it terminates.
There are two different methods to wait for the termination of a job. First, there is
a polling method, where other tasks can be executed while waiting. Between two tasks
the function HasFinished() can be used to query the status in a non-blocking way.
Second, a blocking wait can be realised with the function WaitUntilFinished().
Compared to the polling method, this function has the advantage of smaller CPU load
and shorter delays. Smart use of these functions enables the programmer to optimally
exploit parallelism of the GPU.
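The job model described above can be sketched in a few lines of C++. Only the names Invoke(), HasFinished() and WaitUntilFinished() come from the text; the classes Job and VisionModule, all signatures, and the use of a single worker thread as a stand-in for the GPU are assumptions, so this should be read as a minimal model of the behaviour rather than as the actual interface of the library.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical job handle returned by Invoke().
class Job {
public:
    explicit Job(std::shared_future<void> done) : done_(std::move(done)) {}
    // Polling: returns immediately, so other work can be done in between.
    bool HasFinished() const {
        return done_.wait_for(std::chrono::seconds(0)) ==
               std::future_status::ready;
    }
    // Blocking: the calling thread sleeps until the job has terminated.
    void WaitUntilFinished() const { done_.wait(); }
private:
    std::shared_future<void> done_;
};

// Stand-in for the vision module: a singleton whose single worker thread
// plays the role of the GPU and executes jobs in FIFO order.
class VisionModule {
public:
    static VisionModule& Instance() {
        static VisionModule instance;
        return instance;
    }
    // Thread-safe, so jobs may be submitted from any number of threads.
    Job Invoke(std::function<void()> work) {
        assert(work && "Invoke() needs a callable");
        std::packaged_task<void()> task(std::move(work));
        Job job(task.get_future().share());
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(task));
        }
        cv_.notify_one();
        return job;
    }
private:
    VisionModule() : worker_([this] { Run(); }) {}
    ~VisionModule() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }
    void Run() {
        for (;;) {
            std::packaged_task<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
                if (queue_.empty()) return;  // stop requested, nothing left
                task = std::move(queue_.front());
                queue_.pop();
            }
            task();  // once started, a job always runs to completion
        }
    }
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::packaged_task<void()>> queue_;
    bool stop_ = false;
    std::thread worker_;  // declared last: starts after the other members
};
```

In this model a caller would submit a job with VisionModule::Instance().Invoke(...), continue with CPU-side work while HasFinished() returns false, and call WaitUntilFinished() once nothing else remains to be done. The blocking call lets the thread sleep instead of repeatedly querying the status, which corresponds to the lower CPU load mentioned above.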