Listing 1: cudaCrossCorr_32fc source.
extern "C" void
cudaCrossCorr_32fc(Complex *h_filter_kernel, int filterSize,
                   Complex *h_signal, int signalSize,
                   Complex *h_corr_signal, int corrSize,
                   int lowLag, bool signFFTed)
{
    // Pad signal and filter kernel
    Complex *h_padded_signal, *h_padded_filter_kernel;
    [...]
    // Transform signal and kernel; the signal FFT is skipped
    // when it has already been computed in a previous call
    if (!signFFTed) {
        cufftExecC2C(plan, (cufftComplex *)d_signal,
                     (cufftComplex *)d_signal, CUFFT_FORWARD);
    }
    cufftExecC2C(plan, (cufftComplex *)d_filter_kernel,
                 (cufftComplex *)d_filter_kernel, CUFFT_FORWARD);
    // Multiply coefficients together as in corr = a .* conj(b)
    ComplexConjPointwiseMulAndScale<<<32, 256>>>
        (d_signal, d_filter_kernel, new_size, 1.0f / new_size);
    // Check if kernel execution generated an error
    CUT_CHECK_ERROR("Kernel failed [ComplexPointwiseMulAndScale]");
    // Transform signal back
    cufftExecC2C(plan, (cufftComplex *)d_signal,
                 (cufftComplex *)d_signal, CUFFT_INVERSE);
    [...]
}
been the re-computation of the reference-signal FFT: at every cycle a new chunk of reference signal and a new scrambling code would have been pushed to the cudaCrossCorr function, which would have computed the FFT twice per iteration, applied the convolution theorem by multiplying the resulting vectors, and inverse-transformed the result. Instead, cudaCrossCorr is designed to serve multiple cross-correlations per session that share the reference signal: static variables keep track of how many FFTs have been computed, on which part of the incoming signal, and, most importantly, where the results are stored in memory. This saves up to 30% of the FFT workload (1 out of 3 FFT-based operations) per cycle, with a clear advantage in execution time. Listing 1 shows how this simple mechanism works.
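As a hedged illustration (this session loop does not appear in the paper, and the names h_refSignal, h_scramblingCode, h_corr and numScramblingCodes are hypothetical), a caller could exploit the signFFTed flag so that the shared reference-signal FFT is paid for only once:

    // Illustrative session loop: the reference-signal FFT is computed
    // on the first call only and reused afterwards, saving one of the
    // three FFT-based operations per cycle.
    bool refTransformed = false;
    for (int i = 0; i < numScramblingCodes; ++i) {
        cudaCrossCorr_32fc(h_scramblingCode[i], filterSize,
                           h_refSignal, signalSize,
                           h_corr[i], corrSize,
                           lowLag, refTransformed);
        refTransformed = true;  // reference spectrum now cached on the GPU
    }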
The only non-masked operation in the cudaCrossCorr body is the recombination of the spectra after FFT execution, the $(f^{\ast} \ast g)$ term in Eq. 1.
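The kernel body is not reproduced in Listing 1; a minimal sketch of a conjugate pointwise multiply-and-scale kernel consistent with that launch (an assumed implementation, not the paper's actual code) could read:

    // Sketch: a[i] = a[i] * conj(b[i]) * scale over the padded spectra,
    // matching the "corr = a .* conj(b)" comment in Listing 1.
    __global__ void ComplexConjPointwiseMulAndScale(cufftComplex *a,
                                                    const cufftComplex *b,
                                                    int size, float scale)
    {
        const int stride = gridDim.x * blockDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < size; i += stride) {
            cufftComplex c;
            c.x = (a[i].x * b[i].x + a[i].y * b[i].y) * scale;
            c.y = (a[i].y * b[i].x - a[i].x * b[i].y) * scale;
            a[i] = c;
        }
    }

The grid-stride loop lets the fixed <<<32, 256>>> launch configuration cover any padded length.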
3.2 About Precision in Cross-correlation
Several numeric tests have been executed, using MATLAB as the gold standard: its xcorr function implements circular cross-correlation, computed in IEEE-754 double precision on complex arguments. The CUDA version showed overall good performance when compared against this reference (see Figure 4). CUDA FFT crossCorr, matched against MATLAB FFT xcorr, shows a residual term on the order of $\pm 5.0 \cdot 10^{-6}$, perfectly in line with expectations for the single-precision 32-bit arithmetic of the G92 cores (machine epsilon is about $1.2 \cdot 10^{-7}$, and FFT rounding error grows with the transform length).
ippsCrossCorr_32fc, on the other hand, shows almost the same (single-precision) accuracy over the whole signal length, a difference not visible at the plot's Y scale, with a notable exception on the master signal's tail. When the overlap between the reference and the template signal exceeds the former's length, the cross-correlation index starts dropping towards zero, as an effect of the zero padding inside the matching function. Even if this behavior has no practical relevance when looking for the "best" match (the cross-correlation index there is guaranteed to be lower than in other parts of the data), it could have dangerous effects if the resulting vector is normalized against its L1 or L2 norm. Such manipulations are quite frequent in digital signal processing, and in this case these artifacts could imply a loss of precision in the successive steps of the computation.
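For instance, a plain L2 normalization of the correlation vector (sketched below for the host side; the routine is illustrative, not from the paper) divides every sample by a norm that the spurious tail values have biased:

    #include <math.h>

    // Illustrative host-side L2 normalization: any artifact in the
    // input, such as the zero-padding tail, shifts the norm and thus
    // every normalized sample.
    void normalizeL2(float *corr, int n)
    {
        double sumSq = 0.0;                 // accumulate in double
        for (int i = 0; i < n; ++i)
            sumSq += (double)corr[i] * corr[i];
        const double norm = sqrt(sumSq);
        if (norm > 0.0)
            for (int i = 0; i < n; ++i)
                corr[i] = (float)(corr[i] / norm);
    }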
4 COMPUTATIONAL CONSIDERATIONS AND BENCHMARKS
As previously shown in the CrossCorr code analysis, all FFT computations have been offloaded from the CPU to the GPU; in the first instance this process was straightforward thanks to the CUFFT library. CUFFT provides a simple interface for computing parallel FFTs on an NVIDIA GPU, allowing users to leverage the floating-point power and parallelism of the GPU without having to develop a custom GPU-based FFT implementation. FFT libraries typically vary in terms of supported transform sizes and data types. The CUFFT API is modeled after FFTW (see http://www.fftw.org), one of the most popular and efficient CPU-based FFT libraries.
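As a minimal illustration of that interface (the CUFFT 2.x-era C API; error checking omitted for brevity), a 1-D complex-to-complex transform takes three calls:

    #include <cufft.h>

    // Minimal CUFFT usage sketch: plan once, execute (possibly many
    // times), destroy. d_data is a device buffer holding n
    // cufftComplex elements.
    void fftForwardInPlace(cufftComplex *d_data, int n)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_C2C, 1);   // one batch of length n
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        cufftDestroy(plan);
    }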
Sadly, the CUFFT implementation, at least until v2.1 of the CUDA Toolkit, while showing a general speedup over the underlying CPU (> 40 GFlop/s, for a grand total of > 20 GB/s on a GeForce G92 8800GTS 512), did not exploit the full potential of NVIDIA's hardware computing power. The library is in fact a collection of 5 different implementations of parts of the Cooley-Tukey structure, often referred to as the Radix-2 FFT scheme as in (Tian et al., 2004), extended to cardinalities of 3, 4 and 5 elements per block, thus adding the capability to use Radix-3, Radix-4 and Radix-5 schemes as well.
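To make the supported lengths concrete, the illustrative check below (not part of CUFFT) tells whether a transform size factors entirely into the radices above; radix-4 passes are covered by pairs of radix-2 factors:

    // A length is served by pure radix-2/3/5 passes iff it has no
    // other prime factor: e.g. 3840 = 2^8 * 3 * 5 qualifies,
    // while 3584 = 2^9 * 7 does not.
    int isMixedRadix235(int n)
    {
        const int radices[] = { 2, 3, 5 };
        for (int k = 0; k < 3; ++k)
            while (n % radices[k] == 0)
                n /= radices[k];
        return n == 1;
    }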
The well-accepted theory of mixed-radix FFT schemes is described in detail in (Stasinski and
Potrymajo, 2004). Radix-3 and Radix-5 schemes use