Figure 3: Spatial parallelism exploitation by block
partitioning (BP): row, column and rectangular BP.
However, as algorithm complexity increases, the
speedup contributed by the operation pipeline
becomes less significant, mainly due to operation
hierarchization and feedback. In this case, the major
speedup may be obtained within a specific operation,
at the level of its loops.
The outermost loop within each operation
usually iterates over the pixels within the image,
because many operations (e.g. spatial filtering)
perform the same function independently on many
pixels.
This is spatial parallelism, which may be
exploited by partitioning the image and using a
separate processor to perform the operation on each
partition (Bailey, 2011).
2.3 Block Partition Schemes
An important consideration when partitioning an
image is minimizing the expected communication
between PUs, i.e. between the different partitions
considered.
Typical partitioning schemes split the image into
blocks of rows, blocks of columns or rectangular
blocks, as illustrated in Fig. 3.
For low-level DIP operations such as spatial
filtering, the performance improvement approaches
the number of processors, as the communication
may be reduced to zero if a non-overlapping
partition scheme is utilized.
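The three partitioning schemes of Fig. 3 can be sketched as follows (a hypothetical illustration assuming NumPy; `np.array_split` tolerates image dimensions that do not divide evenly among blocks):

```python
import numpy as np


def row_blocks(image, n):
    # Row BP: split into n horizontal strips.
    return np.array_split(image, n, axis=0)


def column_blocks(image, n):
    # Column BP: split into n vertical strips.
    return np.array_split(image, n, axis=1)


def rectangular_blocks(image, rows, cols):
    # Rectangular BP: a rows x cols grid of tiles.
    return [tile
            for strip in np.array_split(image, rows, axis=0)
            for tile in np.array_split(strip, cols, axis=1)]
```

For non-overlapping low-level operations, each of these block lists can be processed without any exchange of pixel data between PUs.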
However, in higher level processes this
performance will be degraded as a result of
communication overheads or contention when
accessing shared resources. In high-level operations,
these shared resources may not be just pixel values.
Partitioning is therefore most beneficial when the
operations only require data from within a local
region, which is defined by the partition boundaries.
For that reason, each processor must have some
local memory to reduce any delays associated with
contention for global memory (Bailey, 2011).
If the operations performed within each region
are identical, this leads to a SIMD (single
instruction, multiple data) parallel processing
architecture.
On the other hand, a MIMD (multiple instruction,
multiple data) architecture is better suited for higher
level image operations, where latency may vary for
each block of data.
In these cases, better performance may be
achieved by having more partitions than processors,
utilizing a Pipeline Processor Farm (PPF) approach
(Fleury and Downton, 2001). In a PPF, each
partition is dynamically allocated to the next
available PU, thus reducing idle process latencies
related to block data dependencies.
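A rough sketch of this scheduling policy (assuming Python's `concurrent.futures`; the per-block operation shown is a hypothetical stand-in, since real high-level operations have content-dependent latency):

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def process_partition(block):
    # Stand-in for a variable-latency high-level operation; in a real
    # PPF the processing time would depend on the block's content.
    return float(block.mean())


def ppf_process(image, n_partitions=16, n_workers=4):
    # More partitions than workers: chunksize=1 hands one partition at
    # a time to the next idle worker, approximating the PPF policy of
    # dynamically allocating each partition to the next available PU.
    blocks = np.array_split(image, n_partitions, axis=0)
    with ProcessPoolExecutor(max_workers=n_workers) as ex:
        return list(ex.map(process_partition, blocks, chunksize=1))
```

With partitions outnumbering workers, a fast block does not leave its PU idle while slower blocks finish elsewhere, which is the load-balancing benefit the PPF approach targets.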
In the next section, a procedure for block
partitioning will be devised to be optimally suited to
a PPF approach.
3 AN ADAPTIVE BLOCK
PARTITIONING PROCEDURE
3.1 Overlapping Neighbourhoods
Rectangular block partitioning is by far the most
common shape choice in low-level image operations
(Davies, 2012). Two types of rectangular neighbourhoods are
commonly considered: overlapping and non-
overlapping.
In spatial linear filtering, a series of sums and
products is computed for each pixel within the input
image, as a result of applying a small kernel of
weight elements over the pixel's surrounding
neighbours. In this case, each pixel and its
neighbourhood can be processed by a single PU
and no inter-process communication is needed.
Except for the border pixels of the image, where
additional data (usually zero values) is needed to
complete the neighbourhood, each PU can read the
corresponding pixel information directly from the
image and return its own data as output.
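The filtering just described can be sketched as follows (assuming NumPy; a direct, unoptimized loop over pixels, with the zero-valued border completion mentioned above):

```python
import numpy as np


def spatial_filter(image, kernel):
    # Apply a small kernel of weight elements over each pixel's
    # surrounding neighbours; each output pixel depends only on its
    # local neighbourhood, so any PU could compute it independently.
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    # Complete border neighbourhoods with zero values.
    padded = np.pad(image.astype(float), ((ph, ph), (pw, pw)))
    h, w = image.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            region = padded[i:i + kh, j:j + kw]
            out[i, j] = np.sum(region * kernel)
    return out
```

Each output pixel is a sum of products over its neighbourhood only, which is why a non-overlapping partition of the output incurs no inter-process communication apart from the small halo of input pixels at block edges.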
This same concept applies to focal statistics (de
Smith et al., 2013), where a neighbourhood raster is
applied. The algorithm visits each cell in the raster
and calculates a specific statistic over the identified
neighbourhood. As neighbourhoods can overlap,
cells in one neighbourhood will also be included in
any neighbouring cell’s neighbourhood.
This situation may enable data reutilization in the
raster (Huang et al., 1979), thus reducing shared
memory accesses in a PPF environment.
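One way this reuse can be realised, in the spirit of the running-window technique of Huang et al. (1979), is to update each row's window sum incrementally instead of re-reading the full neighbourhood for every cell (a hypothetical sketch assuming NumPy, shown for the focal mean):

```python
import numpy as np


def focal_mean(raster, radius=1):
    # Focal mean over a (2*radius+1)^2 neighbourhood; borders are
    # completed with zeros. Along each row the window sum is updated
    # incrementally, reusing the overlap between the neighbourhoods of
    # adjacent cells rather than re-reading all their values.
    h, w = raster.shape
    k = 2 * radius + 1
    padded = np.pad(raster.astype(float), radius)
    out = np.empty((h, w))
    for i in range(h):
        rows = padded[i:i + k]        # the k rows covering output row i
        s = rows[:, 0:k].sum()        # full sum only for the first window
        out[i, 0] = s / (k * k)
        for j in range(1, w):
            # Slide right: subtract the leaving column, add the entering one.
            s += rows[:, j + k - 1].sum() - rows[:, j - 1].sum()
            out[i, j] = s / (k * k)
    return out
```

Only the entering and leaving columns are read at each step, so the number of shared-memory accesses per cell drops from k*k to roughly 2k.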
On the other hand, non-overlapping partitions
may be used for parallelization speedup in
hierarchical parallel schemes. This paradigm is best
suited for distributed memory MIMD architectures,
in which each PU handles its own local memory
VISAPP 2014 - International Conference on Computer Vision Theory and Applications