tion and tuning of SpMxV kernels for the specific
matrix structures that these shortcomings can be overcome (Williams et al., 2007).
Modern FPGAs provide abundant resources for
floating point computations. Aside from large logic capacity, these FPGAs also contain enough single-cycle-access blocks of RAM (BRAMs) to provide the required on-chip memory bandwidth. At the same time, a large number of I/O pins is available to provide high memory bandwidth when external off-chip memories are used. However, off-chip
memories like DRAMs have large access latencies
and can considerably slow down the system if used
naively.
We present the design of a prototype embedded
system geared to accelerate SpMxV for scientific
computing. Since such an embedded system relies on DRAMs with high random-access latencies, data is stored in a fashion amenable to burst accesses, thus hiding the DRAM access latencies. The Xilinx MicroBlaze platform was chosen as the platform for the embedded system and implemented on the Xilinx XUPV5-LX110T development board.
2 PROBLEM DESCRIPTION
SpMxV requires that two elements - a non-zero ele-
ment from the matrix and an element from the vec-
tor - be fetched and multiplied. The result is accu-
mulated into the appropriate result vector element.
Thus two operations - a multiply and an accumulate
- are performed for every pair of the two elements.
These elements are not required for further process-
ing and are thus discarded. Only the result of the
multiply-accumulate operation is stored. Since two
input words are useful for only two computation oper-
ations, the ratio of computation to bandwidth require-
ment is low compared to other applications (namely
general matrix-matrix multiplication). This ratio be-
comes worse due to overhead of bandwidth require-
ment for fetching pointers - two per matrix element.
Assuming 32-bit pointers and double precision float-
ing point matrix and vector data, 24 bytes are fetched
in order to perform 2 floating point operations. Hence,
the performanceof SpMxV is usually less than a tenth
of the bandwidth available to the system.
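As a point of reference, the following minimal sketch shows a conventional CRS-based SpMxV loop in C (the array names are illustrative and not taken from any of the cited implementations). It makes the low computation-to-traffic ratio explicit: each non-zero contributes one matrix value, one indirectly addressed vector value and index data to the memory traffic, yet only one multiply and one add to the computation.

/* Minimal CRS SpMxV sketch: y = A * x (illustrative only). */
void spmxv_crs(int n_rows,
               const int    *row_ptr,  /* n_rows + 1 row pointers       */
               const int    *col_idx,  /* one 32-bit index per non-zero */
               const double *val,      /* non-zero matrix entries       */
               const double *x,        /* dense input vector            */
               double       *y)        /* dense result vector           */
{
    for (int i = 0; i < n_rows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            /* 8 B matrix value + 8 B vector value + index data fetched,
               but only 2 floating point operations performed.           */
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}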
Though modern FPGAs have large amounts of fast-access memory, they still fall short of the amount of storage required if the matrix and/or vector data is to be stored in on-chip memories. The largest Virtex-5 device has less than 24 Mb of on-chip storage, and even devices in the latest Virtex-6 family have less than 48 Mb of on-chip memory. Assuming 64-bit data, this translates to 0.4M words and 0.8M words for Virtex-5 and Virtex-6 devices respectively. Moreover, as discussed above, if vector elements need to be replicated, then the size of the matrices that can be handled falls far short of rank one million. Hence, an implementation geared to handle matrices having multi-million rows has to use external DRAMs for storage.
2.1 Related Work
We shall be referring to the work done by Prasanna
(Zhuo and Prasanna, 2005), Gregg (Gregg et al.,
2007), deLorimier (deLorimier and DeHon, 2005),
Sun (Sun et al., 2007) and Kuzmanov (Kuzmanov and
Taouil, 2009). The first three implementations aim
to accelerate iterative solvers via SpMxV on FPGAs.
With the exception of the architecture developed by
Gregg, DRAMs have not been used as the main stor-
age for matrix and vector data.
The SpMxV kernel implemented in a multi-FPGA
architecture by Zhuo and Prasanna was among the
earliest in the field. They use the Compressed Row Storage (CRS) format for their input, which trims the zeros from the sparse matrix rows. In their architecture,
each trimmed row is divided into sub-rows of fixed
length equal to the number of processing elements.
The products within a sub-row are assigned to different processing elements, and a reduction circuit is then used to obtain the final result vector element after all sub-rows have been processed. This updated value is stored in a second FPGA, reducing communication costs across iterations of the conjugate-gradient (CG) routine.
Optimizations to their design include load balancing
by merging appropriate sub-rows and padding them
with zeros if necessary, which significantly improves
performance. However, the proposed architecture relies on SRAMs for the storage of matrix entries, which severely limits the matrix size. The large number of parallel accesses to the SRAMs creates a bottleneck in the design. Moreover, the entire vector is replicated in the local storage of every processing element. The sequential nature of the inputs to the already huge reduction circuit results in very high latencies. The largest matrix evaluated had 21200 rows
and 1.5 million non-zeros. They reported an average
performance of 350 MFLOPS on a Virtex-II Pro de-
vice.
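To make the sub-row scheme concrete, the sketch below shows one way a trimmed CRS row could be split into fixed-length sub-rows and zero-padded for load balancing; it is our own illustration of the idea, not code from Zhuo and Prasanna, and the constant PE_COUNT and all identifiers are assumptions.

/* Illustrative sketch of splitting one trimmed CRS row into
 * fixed-length sub-rows of PE_COUNT entries, padding the last
 * sub-row with explicit zeros so that every processing element
 * receives the same number of products to compute.
 * (Not taken from the cited work; names are hypothetical.)      */

#define PE_COUNT 8   /* assumed number of processing elements */

typedef struct {
    double val[PE_COUNT];  /* matrix values for this sub-row */
    int    col[PE_COUNT];  /* corresponding column indices   */
} subrow_t;

/* Splits the nnz non-zeros of one row into ceil(nnz / PE_COUNT)
 * sub-rows and returns the number of sub-rows written to 'out'. */
int split_row(const double *val, const int *col, int nnz, subrow_t *out)
{
    int n_sub = (nnz + PE_COUNT - 1) / PE_COUNT;
    for (int s = 0; s < n_sub; s++) {
        for (int p = 0; p < PE_COUNT; p++) {
            int k = s * PE_COUNT + p;
            out[s].val[p] = (k < nnz) ? val[k] : 0.0;  /* zero padding */
            out[s].col[p] = (k < nnz) ? col[k] : 0;    /* dummy index  */
        }
    }
    return n_sub;
}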
Special care has been taken by Gregg et al. to create a DRAM-based solution. They use the pre-existing SPAR architecture, originally developed for an ASIC implementation, and hence port a deeply pipelined design to an FPGA implementation. They use local BRAMs to create a cache for the DRAM data since