some benchmarks of the algorithm in a multi-processor setting, a port for Nvidia G80 graphics cards featuring CUDA (Compute Unified Device Architecture) was developed.
3.1 CUDA Specifications
Today’s GPUs offer substantial resources for both graphics and non-graphics processing, and they are particularly well suited to problems that can be expressed in a data-parallel form. The CUDA architecture was introduced by Nvidia with the G80 graphics card series as a data-parallel computing platform for managing computations on the GPU (Nvidia 2008). The GPU is regarded as a device capable of executing a very large number of threads in parallel. It operates as a coprocessor to the main CPU (or host), but it has its own DRAM (referred to as device memory), separate from the CPU RAM (referred to as host memory). Data is transferred between the two memories through high-performance DMA access.
The device is implemented as a set of SIMD (Single Instruction, Multiple Data) multiprocessors. Each multiprocessor provides a set of local 32-bit registers, a parallel data cache (or shared memory), a read-only constant cache and a texture cache that are shared among all of its processors.
A function that is executed on the device by many different threads is called a “kernel”. In the parallel version of RDA, each kernel contains all the steps required to compress or decompress a single block of pixels.
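As an illustration (the kernel name and the per-block steps listed in the comments are assumptions made here, not the actual implementation), such a kernel could take the following general shape, with the whole per-block procedure running inside one thread:

/* Minimal sketch: a kernel is a __global__ function launched over a
   grid of threads; each thread compresses one pixel block on its own. */
__global__ void rd_compress_kernel(const unsigned char *image,
                                   unsigned char *compressed)
{
    /* within a single thread:
       1. locate this thread's block of pixels in the image
       2. find the block minimum and the relative distances d_p
       3. choose the header type (DR64 or DR16) and pack the bits */
}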
3.2 Optimizing RDA for CUDA
To exploit CUDA’s full computing capabilities, a general “kernel” must be written that operates on different sets of input data. Each kernel instance is loaded into a thread, and a grid of threads is then sent to the GPU multiprocessors for execution. Input data can be referenced through thread IDs: a multiprocessor running the thread with indices (ThreadID_x, ThreadID_y) works on the block having coordinates (X, Y). Since CUDA schedules threads in an essentially unpredictable order, there is no guarantee that the blocks will be processed in a predetermined sequence. As a consequence, the file containing the compressed data must be organized in an appropriate way.
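Expanding the sketch from Section 3.1 (the names and the 8x8 block size are assumptions made for illustration), the block coordinates can be derived from the built-in thread and block indices:

__global__ void rd_compress_kernel(const unsigned char *image, int pitch,
                                   int blocksPerRow, unsigned char *out)
{
    int X = blockIdx.x * blockDim.x + threadIdx.x;  /* image-block column */
    int Y = blockIdx.y * blockDim.y + threadIdx.y;  /* image-block row    */
    int blockIndex = Y * blocksPerRow + X;          /* slot for this block's output */

    const unsigned char *src = image + (Y * 8) * pitch + X * 8;
    /* ... compress the 8x8 block starting at src and write its results
           at offsets derived from blockIndex ... */
}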
In the standard sequential form of the algorithm (running on a single processor), the file is organized as a list of blocks (each with its own header), and the program reads, processes and writes the blocks one at a time, from the first to the last.
Under random parallel execution we need to read from memory the correct set of input data for each block of coordinates (X, Y). Since the blocks have different sizes in bits, owing to different compression rates and header lengths, the layout of the file must be arranged accordingly in both the compression and the decompression phase. Let’s consider the compression stage first. The steps of the algorithm are:
1. Load uncompressed data into device memory.
2. Run the RD kernel on each block.
3. Transfer the results to host memory.
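A host-side sketch of these three steps follows (buffer sizes, the 16x16 launch configuration and the kernel name are illustrative assumptions; error checking is omitted):

#include <cuda_runtime.h>

__global__ void rd_compress_kernel(const unsigned char *image, int pitch,
                                   int blocksPerRow, unsigned char *out);

void compress_on_gpu(const unsigned char *h_image, size_t imageBytes,
                     unsigned char *h_out, size_t outBytes,
                     int pitch, int blocksX, int blocksY)
{
    unsigned char *d_image, *d_out;
    cudaMalloc(&d_image, imageBytes);
    cudaMalloc(&d_out, outBytes);

    /* 1. load uncompressed data into device memory */
    cudaMemcpy(d_image, h_image, imageBytes, cudaMemcpyHostToDevice);

    /* 2. run the RD kernel, one thread per image block */
    dim3 threads(16, 16);
    dim3 grid((blocksX + 15) / 16, (blocksY + 15) / 16);
    rd_compress_kernel<<<grid, threads>>>(d_image, pitch, blocksX, d_out);

    /* 3. transfer the results back to host memory in a single copy */
    cudaMemcpy(h_out, d_out, outBytes, cudaMemcpyDeviceToHost);

    cudaFree(d_image);
    cudaFree(d_out);
}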
During step 2 each processor has to access both input and output data. Reading the input data does not create problems, because each thread reads a different and independent part of the original image; the output data, however, pose a serious problem: the final size of a compressed block cannot be known in advance, so each thread does not know where to write its results (since thread execution order is random). A way to solve this problem is to separate the information of each block into different but consecutive memory locations:
1. Header type bit list (0 = DR64, 1 = DR16).
2. Block minimum values.
3. Block d_p values.
4. Pixels of every block expressed with d_p bits.
In this way each processor knows exactly where its results have to be written, because the length of each of these vectors is known a priori. All the vectors are created as unsigned char vectors and then cleaned of useless data before the memory transfer from device to host starts. Vectors #2 and #3 are created with a size of 4 * (number of image blocks), so that four different values can be written when a DR16 header occurs.
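A sketch of how the four output vectors could be sized and allocated on the device (the structure, the field names and the worst-case size of the pixel vector are assumptions; the 4 * numBlocks sizes follow the rule stated above):

#include <cuda_runtime.h>

struct CompressedLayout {
    unsigned char *headerType;  /* #1: one flag per block (0 = DR64, 1 = DR16)     */
    unsigned char *minimums;    /* #2: block minimum values, 4 * numBlocks entries */
    unsigned char *dp;          /* #3: block d_p values, 4 * numBlocks entries     */
    unsigned char *pixels;      /* #4: pixels re-expressed with d_p bits           */
};

void alloc_layout(CompressedLayout &layout, int numBlocks,
                  size_t worstCasePixelBytes)
{
    cudaMalloc(&layout.headerType, numBlocks);
    cudaMalloc(&layout.minimums,   4 * numBlocks);
    cudaMalloc(&layout.dp,         4 * numBlocks);
    cudaMalloc(&layout.pixels,     worstCasePixelBytes);
}

Since each thread knows its block index, the write offsets into vectors #1 to #3 are fixed (blockIndex and 4 * blockIndex respectively), which is what makes the a priori layout possible.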
As the benchmarks highlight (Section 4), memory transfer is a major bottleneck of the CUDA approach and takes more than 90% of the total execution time in the compression phase. Since one large transfer is better than many small transfers, the image data has to be moved to and from device memory all at once. That is why cleaning the memory of unnecessary data before starting the transfer between host and device is so important. When the compressed data is available in host memory, a file is written that preserves the division into block information vectors. This helps in the decompression phase, where the first step is to reconstruct the four vectors. The decompression stage, in fact, operates in a very similar way:
1. Load compressed data into device memory.
2. Divide the data into the four vectors.
3. Run the RD kernel on each block.