Efficient Soft-output Detectors

Multi-core and GPU implementations in MIMOPack Library

Carla Ramiro Sánchez, Antonio M. Vidal Maciá and Alberto Gonzalez Salvador

Department of Information Systems and Computation, Universitat Politècnica de València, Camino de Vera s/n, Valencia, Spain

Institute of Telecommunications and Multimedia Applications, Universitat Politècnica de València, Camino de Vera s/n, Valencia, Spain

Keywords:

Soft-output Detector, BICM, HPC Library, GPU, Multi-core, CUDA, MIMO.

Abstract:

Error control coding ensures the desired quality of service for a given data rate and is necessary to improve the reliability of Multiple-Input Multiple-Output (MIMO) communication systems. Therefore, finding a good combination of MIMO detection schemes and coding schemes has drawn attention in recent years. The most promising coding schemes are based on Bit-Interleaved Coded Modulation (BICM). At the transmitter, the information bits are encoded using an error-correction code. At the receiver, the soft demodulator provides reliability information in the form of real-valued log-likelihood ratios (LLRs). These values are used by the channel decoder to make final decisions on the transmitted coded bits. Nevertheless, these sophisticated techniques significantly increase the computational cost and require large computational power. This paper presents a set of Soft-Output detectors implemented in CUDA and OpenMP, which considerably decrease the computational time required for the data detection stage in MIMO systems. These detectors will be included in the future MIMOPack library, a High Performance Computing (HPC) library for MIMO Communication Systems. Experimental results confirm that these implementations accelerate the data detection stage for different constellation sizes and numbers of antennas.

1 INTRODUCTION

Multiple-input multiple-output (MIMO) systems can provide high spectral efficiency by spatially multiplexing multiple data streams (Paulraj et al., 2004), which makes them promising for current wireless standards. However, the use of MIMO technologies increases the complexity of the detection process. The detector is located at the receiver side and is responsible for recovering the transmitted signals from the received ones (which are affected by channel fluctuations) with the maximum reliability. In many cases this step becomes the most complex stage of the communication. Another important factor that affects the performance of a MIMO system is the number of transmit and receive antennas, because as the system grows the communication process becomes more complicated. Although the number of antennas currently allowed in the standards is not large, it is expected that in the near future more than 100 transmit antennas will be used (Rusek et al., 2013). Thus, the search for high-throughput practical implementations that also scale with the system size is imperative.

Graphic processing units (GPUs) have recently been used to develop reconfigurable software-defined radio platforms (Kim et al., 2010), high-throughput MIMO detectors (Wu et al., 2010), and fast low-density parity-check decoders (Falcao et al., 2009). Although multicore central processing unit (CPU) implementations could also replace the traditional use of digital signal processors and field-programmable gate arrays (FPGAs), this option would interfere with the execution of the tasks assigned to the CPU of the computer, possibly reducing its speed. Since the GPU is used less often than the CPU in conventional applications, its use as a coprocessor in signal-processing systems is very promising. Therefore, systems formed by a multicore computer with one or more GPUs are interesting in this context.

In this paper, we consider the use of MIMO with bit-interleaved coded modulation (BICM) (Caire et al., 1998). This kind of system improves the reliability of MIMO communications. However, the receiver stage becomes more complicated, since the demodulator must generate soft information (log-likelihood ratios) that can be used by the decoder. In order to accelerate the computation of the log-likelihood ratios (LLRs), we present a set of efficient soft-output detectors with different complexities and performances in terms of Bit Error Rate (BER). These soft-output detectors will be included in the MIMOPack library and have been implemented to run on multicore and GPU systems. Speedup results show how the execution time of the detection stage can be substantially decreased using these implementations.

Ramiro Sánchez C., M. Vidal Maciá A. and Gonzalez Salvador A.
Efficient Soft-output Detectors - Multi-core and GPU implementations in MIMOPack Library.
DOI: 10.5220/0005369503350344
In Proceedings of the 5th International Conference on Pervasive and Embedded Computing and Communication Systems (AMC-2015), pages 335-344
ISBN: 978-989-758-084-0
© 2015 SCITEPRESS (Science and Technology Publications, Lda.)

2 SOFT-OUTPUT MIMO DETECTION

Let us consider a MIMO-BICM system with $n_T$ transmit antennas and $n_R \ge n_T$ receive antennas (see Fig. 1). We assume a spatial multiplexing system where, for any time instant $n$, the $i$th data stream $s_i[n]$ is transmitted on the $i$th transmit antenna. The baseband equivalent model for the received vector $y[n] = (y_1[n], \ldots, y_{n_R}[n])^T$ is given by

$$y[n] = H[n]\,s[n] + v[n], \qquad n = 1, \ldots, N - 1, \tag{1}$$

where $N$ is the number of time instants in the entire transmission. $H[n]$ is an $n_R \times n_T$ matrix modeling the Rayleigh fading MIMO channel, and the noise components of the vector $v[n] = (v_1[n], \ldots, v_{n_R}[n])^T$ are assumed independent and circularly symmetric complex Gaussian with variance $\sigma^2$. We assume that the channel $H[n]$ is known at the receiver and that the symbols are taken from a constellation $\Omega$ of size $|\Omega| = 2^m = M$. For the sake of simplicity, we will omit the time index $n$ and thus write (1) as

$$y = Hs + v. \tag{2}$$

In this system, the sequence of information bits $b$ is encoded using an error-correcting code before being demultiplexed into $n_T$ layers. The coded bits are then passed through an interleaver $\Pi$ and mapped to constellation symbols via Gray labeling.

At the receiver side, the demodulator uses the model (1) to calculate soft information about the code bits in terms of log-likelihood ratios (LLRs), denoted $\Lambda$. Thus, to calculate an LLR $\Lambda_{i,k}$ for each coded bit $b_{i,k}$, with $k = 1, \ldots, m$, of the sent symbol vector $s$, the detector uses the received vector $y$ and the channel matrix $H$. Finally, the reliability information is de-interleaved ($\Pi^{-1}$) and multiplexed into a single stream, which is used by the channel decoder.

A strategy to slightly decrease the complexity of the detection is to reduce the channel matrix to a canonical form by orthogonal transformations before the detection. If the QR decomposition is employed in a preprocessing stage, the channel matrix is decomposed into $H = QR$, where $R$ is an upper triangular matrix. Left-multiplying (2) by $Q^H$ and calling $\tilde{y} = Q^H y$, the problem can be rewritten as:

$$\hat{s} = \arg\min_{s \in \Omega^{n_T}} \|\tilde{y} - Rs\|^2 \tag{3}$$

where the most probable transmitted symbol vector is found by searching for the smallest distance between the transformed received vector $\tilde{y}$ and $Rs$ over each possible vector $s$. To clarify how the triangular structure of $R$ can be exploited, Eq. 3 can be expressed in a more explicit way:

$$\hat{s} = \arg\min_{s \in \Omega^{n_T}} \sum_{i=1}^{n_T} \left| \tilde{y}_i - \sum_{j=i}^{n_T} R_{ij} s_j \right|^2. \tag{4}$$

Problem (4) can be represented as a decoding tree with $n_T + 1$ levels. Each possible message $s$ is represented by a branch and each symbol value by a node. There are then $M^{n_T}$ leaf nodes, which represent all the possible values of $s$. In order to solve Eq. 4 via tree search, the following recursion is performed for $i = n_T, n_T - 1, \ldots, 1$:

$$D_i(S^{(i)}) = D_{i+1}(S^{(i+1)}) + |e_i(S^{(i)})|^2, \tag{5}$$

where $i$ denotes each tree level, $S^{(i)} = [s_i, s_{i+1}, \ldots, s_{n_T}]$, and $D_i(S^{(i)})$ is the accumulated Partial Euclidean Distance (PED) up to level $i$, with $D_{n_T+1}(S^{(n_T+1)})$ set to 0. The Distance Increment (DI), also called branch weight, is computed as:

$$e_i(S^{(i)}) = \tilde{y}_i - \sum_{j=i}^{n_T} R_{ij} s_j. \tag{6}$$
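The recursion in Eqs. (5) and (6) can be illustrated with a minimal Python sketch. It is not the MIMOPack implementation: the function names, the small QPSK alphabet and the 2 × 2 matrix R are assumptions chosen only for illustration, and the search simply visits every leaf of the tree.

```python
from itertools import product

def accumulated_ped(R, y_tilde, s):
    """Accumulated PED of a full path s, following Eqs. (5)-(6) with 0-based levels."""
    n_t = len(s)
    dist = 0.0                                   # the root starts with distance 0
    for i in range(n_t - 1, -1, -1):             # levels n_T, ..., 1
        interference = sum(R[i][j] * s[j] for j in range(i, n_t))
        e_i = y_tilde[i] - interference          # distance increment, Eq. (6)
        dist += abs(e_i) ** 2                    # recursion, Eq. (5)
    return dist

def exhaustive_search(R, y_tilde, constellation):
    """Visit all M^{n_T} leaves and return (best path, its distance)."""
    n_t = len(y_tilde)
    return min(
        ((list(s), accumulated_ped(R, y_tilde, s))
         for s in product(constellation, repeat=n_t)),
        key=lambda item: item[1],
    )

# Hypothetical example data: QPSK symbols and an upper triangular R
qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
R = [[2.0, 0.5 - 0.5j],
     [0.0, 1.5]]
sent = [-1 + 1j, 1 - 1j]
# Noise-free observation: y_tilde = R s
y_tilde = [sum(R[i][j] * sent[j] for j in range(2)) for i in range(2)]
best, dist = exhaustive_search(R, y_tilde, qpsk)
```

Without noise, the path with the smallest accumulated distance is the transmitted vector itself, which gives a quick sanity check of the recursion.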

3 TOOLS AND OPTIMIZATION TECHNIQUES

In this section we present some tools and additional optimizations used to accelerate the detection and to reduce the number of floating-point operations (FLOPs) it requires. These strategies are used in many of the algorithms and are therefore needed to understand the implementations in the next section.

PECCS2015-5thInternationalConferenceonPervasiveandEmbeddedComputingandCommunicationSystems

336

Figure 1: Block diagram of a MIMO-BICM system.

3.1 Graphic Processing Units and CUDA

Compute Unified Device Architecture (CUDA) (NVIDIA, 2014) is a software programming model that exploits the massive computation potential offered by GPUs. A GPU can have multiple streaming multiprocessors (SMs), each with a certain number of pipelined cores. A CUDA device has a large amount of off-chip device memory (global memory) and a fast on-chip memory called shared memory. Following Flynn's taxonomy (Flynn, 1972), from a conceptual point of view a GPU can be considered a single instruction, multiple data (SIMD) machine; that is, a device in which a single set of instructions is executed on different data sets. In the CUDA model, the programmer defines a kernel function, which contains a set of common operations. At runtime, the kernel is called from the main central processing unit (CPU) and spawns a large number of thread blocks, which together are called a grid. Each thread block contains multiple threads, and all the blocks within a grid must have the same size. Each thread can select a set of data using its own unique ID and execute the kernel function on the selected set of data. Threads within a block can synchronize their execution through a barrier to coordinate memory accesses. In contrast, thread blocks are completely independent and can only share data through the global memory once the kernel ends.

3.2 Multicore Processors and OpenMP

OpenMP is an Application Programming Interface (API) (OpenMP, 2013) for programming shared-memory parallel computers. It consists of a set of compiler directives, callable library routines and environment variables that may be embedded within code written in a programming language such as Fortran or C/C++. It is available on several processor architectures and operating systems, such as GNU/Linux, Mac OS X and Windows.

Basically, a master thread launches a number of slave threads and divides the workload among them. The runtime attempts to allocate the threads to different processors so that the threads run concurrently.

The multicore parallelization performed in the Soft-Output detectors is common to all of them and is based on assigning the estimation of a particular transmitted signal $s[n]$ to each thread.
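The per-signal decomposition can be pictured with the following Python analogy. It is only a sketch of the work distribution, not OpenMP code: the detector is replaced by a hypothetical nearest-symbol decision, all names are illustrative, and Python threads do not provide the true CPU parallelism that OpenMP threads do.

```python
from concurrent.futures import ThreadPoolExecutor

def detect_one(task):
    """Hard-decide one received sample against a small constellation."""
    y, constellation = task
    return min(constellation, key=lambda c: abs(y - c) ** 2)

def detect_all(received, constellation, workers=4):
    """Detect every time slot independently, one task per signal."""
    tasks = [(y, constellation) for y in received]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the results in input order
        return list(pool.map(detect_one, tasks))

# Hypothetical example: four BPSK observations
bpsk = [-1.0, 1.0]
received = [0.9, -1.2, 0.1, -0.3]
decisions = detect_all(received, bpsk)
```

Because each time slot is processed independently, no synchronization between workers is needed, which is what makes this decomposition attractive for a multicore system.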

3.3 Efficient Calculation of Partial Euclidean Distances

We propose a strategy consisting of precomputing all possible values of the terms of the inner summation $\sum_{j=i}^{n_T} R_{ij} s_j$ in equation (4). This allows us to avoid repeating common computations for different possible solutions ($s \in \Omega^{n_T}$), decreasing the computational cost of the detection process. These values are stored in an $M \times n_{nz}$ matrix $T$. The element $T_{i,j}$ contains the result of multiplying the constellation symbol $\Omega_i$ by the $j$-th non-zero value of matrix $R$:

$$T = \begin{pmatrix} \Omega_1 R_{1,1} & \Omega_1 R_{1,2} & \cdots & \Omega_1 R_{n_T,n_T} \\ \Omega_2 R_{1,1} & \Omega_2 R_{1,2} & \cdots & \Omega_2 R_{n_T,n_T} \\ \vdots & \vdots & & \vdots \\ \Omega_M R_{1,1} & \Omega_M R_{1,2} & \cdots & \Omega_M R_{n_T,n_T} \end{pmatrix}, \tag{7}$$

where $n_{nz}$ is the number of non-zero values of $R$. Thus, each row $i$ contains all the non-zero elements of matrix $R$ multiplied by the complex-valued constellation symbol $\Omega_i$.

Algorithm 1 shows the steps needed to calculate the accumulated PED of a path $s$ from the root up to level $l_{ini}$ by using the matrix $T$. As an example, let us consider a $2 \times 2$ MIMO system using a 16-QAM constellation, which has the following associated triangular matrix:

$$R = \begin{pmatrix} R^{(1)}_{1,1} & R^{(2)}_{1,2} \\ 0 & R^{(3)}_{2,2} \end{pmatrix}.$$

The $n_{nz} = 3$ non-zero values are represented as $R^{(l)}_{i,j}$, where $l$ is the index of the column that the value occupies in the matrix $T$. We want to calculate the Partial Euclidean Distances (PED) of the following tree path $S$:

$$S^{(1)} = [s_1, s_2] = [-3 - 1i,\; 1 - 3i] = [\Omega_3, \Omega_{15}].$$

The Distance Increments (DI) are computed as follows:

$$e_2(S^{(2)}) = \tilde{y}_2 - \sum_{j=2}^{2} R_{2,j} s_j = \tilde{y}_2 - R_{2,2}\Omega_{15} = \tilde{y}_2 - T_{15,3} \tag{8}$$

$$e_1(S^{(1)}) = \tilde{y}_1 - \sum_{j=1}^{2} R_{1,j} s_j = \tilde{y}_1 - (R_{1,1}\Omega_3 + R_{1,2}\Omega_{15}) = \tilde{y}_1 - (T_{3,1} + T_{15,2}) \tag{9}$$

Assuming that the root node starts with an accumulated PED equal to zero, $D_3(S^{(3)}) = 0$, the final Euclidean Distance at the leaf node is:

$$D_1(S^{(1)}) = |\tilde{y}_2 - T_{15,3}|^2 + |\tilde{y}_1 - (T_{3,1} + T_{15,2})|^2. \tag{10}$$

Algorithm 1: Efficient Calculation of Accumulated Partial Euclidean Distances from level $l_{ini}$ to $l_{end}$.
Input: $T$, $\tilde{y}$, $s$
Output: $D_{l_{ini}}$
1: $D_{n_T+1} = 0$
2: for $i = n_T, \ldots, l_{ini}$ do
3:   aux = 0
4:   for $j = i, \ldots, n_T$ do
5:     Get index $l$ using $i$ and $j$,
6:     aux = aux + $T_{s_j,l}$
7:   end for
8:   $D_i = D_{i+1} + |\tilde{y}_i - \text{aux}|^2$
9: end for
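The table lookup used by Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the algorithm, not the library code: it assumes 0-based indices, that the non-zero entries of R are stored in T in row-major order (as in the 2 × 2 example above), and the helper names and QPSK data are hypothetical.

```python
def build_T(R, constellation):
    """T[c][l] = constellation[c] times the l-th non-zero entry of R (row-major order)."""
    n_t = len(R)
    nonzeros = [R[i][j] for i in range(n_t) for j in range(i, n_t)]
    return [[omega * r for r in nonzeros] for omega in constellation]

def column_index(i, j, n_t):
    """0-based column of R[i][j] (with j >= i) inside T."""
    return i * n_t - i * (i - 1) // 2 + (j - i)

def accumulated_ped(T, y_tilde, path, l_ini=0):
    """Algorithm 1 with 0-based levels; path holds constellation indices."""
    n_t = len(y_tilde)
    dist = 0.0
    for i in range(n_t - 1, l_ini - 1, -1):
        aux = 0.0                                  # reset the inner sum for each level
        for j in range(i, n_t):
            aux += T[path[j]][column_index(i, j, n_t)]
        dist += abs(y_tilde[i] - aux) ** 2
    return dist

# Hypothetical example data
qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
R = [[2.0, 0.5 - 0.5j],
     [0.0, 1.5]]
y_tilde = [0.3 - 1.0j, -1.4 + 1.6j]
path = [2, 1]                                      # s = [qpsk[2], qpsk[1]]
T = build_T(R, qpsk)
d_table = accumulated_ped(T, y_tilde, path)

# Reference value computed directly from R and the symbols
s = [qpsk[c] for c in path]
d_direct = (abs(y_tilde[1] - R[1][1] * s[1]) ** 2
            + abs(y_tilde[0] - (R[0][0] * s[0] + R[0][1] * s[1])) ** 2)
```

The lookup replaces every complex multiplication inside the distance computation by an addition of precomputed products, at the cost of building T once per channel matrix.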

4 SOFT-OUTPUT DETECTORS IMPLEMENTATION

4.1 Optimum and Max-log Demodulators

Assuming that all transmit vectors are equally likely, the optimal soft MAP (OMAP) demodulator calculates the exact LLR for $b_{i,k}$ as

$$\Lambda_{i,k} = \log \frac{P(b_{i,k} = 1 \mid y, H)}{P(b_{i,k} = 0 \mid y, H)} = \log \frac{\sum_{s : s_i \in \Omega_k^1} \exp\left(-\|\tilde{y} - Rs\|^2 / \sigma^2\right)}{\sum_{s : s_i \in \Omega_k^0} \exp\left(-\|\tilde{y} - Rs\|^2 / \sigma^2\right)} \tag{11}$$

where $\Omega_k^u$ denotes the set of all symbols $s \in \Omega$ whose label has the value $u \in \{0, 1\}$ in bit position $k$. The complexity of this method is $O(|\Omega|^{n_T})$: since the LLRs are calculated for all $n_T$ layers, the computation of $|\Omega|^{n_T}$ distances is mandatory.

If the receiver uses max-log approximation (MLA) demodulation, the LLR for each code bit is calculated according to (Muller-Weinfurtner, 2002)

$$\Lambda_{i,k} \approx \frac{1}{\sigma^2} \left[ \min_{s : s_i \in \Omega_k^0} \|\tilde{y} - Rs\|^2 - \min_{s : s_i \in \Omega_k^1} \|\tilde{y} - Rs\|^2 \right] \tag{12}$$

There are numerous suboptimal soft MIMO detectors that avoid an exhaustive search over the entire range of $|\Omega|^{n_T}$ possibilities. In this paper two tree-search-based soft demodulators have been considered; they are described in the following sections.

4.1.1 CUDA Implementation

The proposed OMAP and MLA GPU implementations share a similar algorithmic scheme. Both CUDA codes are composed of two kernels that work together to estimate the signals received in $N$ time slots.

Algorithm 2: CUDA Parallelization for the OMAP and MLA detection of $N$ time instants.
1: Allocate Memory in GPU-GM for: $T$, $\tilde{y}$, $s$ and $D$,
2: Copy from CPU to GPU-GM: $T$ and $\tilde{y}$,
3: Copy from CPU to GPU Constant Memory: gray,
4: Select kernel configuration with $N \cdot M^{n_T}$ threads,
5: Obtain $D$ and $s$ using Kernel 1,
6: Select kernel configuration with $N \cdot m \cdot n_T$ threads,
7: if OMAP method is selected then
8:   Obtain $\Lambda$ using Kernel 2
9: else
10:  Obtain $\Lambda$ using Kernel 3
11: end if
12: Copy from GPU-GM to CPU: $\Lambda$

Algorithm 2 shows the steps needed to launch the kernels. First, it is necessary to allocate memory in the GPU Global Memory (GPU-GM) for the input ($T$, $\tilde{y}$), output ($D$) and auxiliary ($s$) matrices related to the $N$ signals. Then, the input matrices are copied into the GPU-GM. The next step is to launch Kernel 1 with the appropriate grid dimension.

In Kernel 1, each thread is in charge of computing the accumulated PED for a given signal $n$ and the $j$th possible combination in the range $\Omega^{n_T}$. During the detection process, the detector must maintain a list of the $P = M^{n_T}$ symbol vectors that are being estimated for each signal. In order to reduce memory requirements and the cost of data transfers, our algorithms keep the list of symbols in integer (not complex) format, represented as $s_{:,j}$. This is especially important for the GPU implementations, since GPUs have limited memory capacity and data transfers to and from the CPU are very expensive. Once the indices of the constellation symbols for a given tree path $j$ have been obtained (see step 3 of Kernel 1), its Euclidean Distance is calculated by adding different elements of the pre-built matrix $T$, as in Alg. 1, and stored in $D_j[n]$.

Therefore $N \cdot M^{n_T}$ threads are needed to

launch the kernel. A bidimensional grid conﬁgura-

tion with N

= N

, N

= N

has been considered

for all Soft Detectors. The number of blocks N

de-

pends on the number of threads per dimension, which

are denoted by N

and N

, respectively. The block

size will be chosen to be a multiple of 32 in order to

avoid incomplete warps. Then the value of N

is ob-

tained as:



·N

)



. (13)

Kernel 1: Calculation of one of the branches of the OMAP and MLA detectors by the z-th thread for $N$ time slots.
Input: $T$, $\tilde{y}$, $N$
Output: $D$, $s$
1: Calculate using the thread global index z:
   • Time slot identifier $n$
   • Path identifier $j$
2: if $n < N$ then
3:   Get selected path $s_{:,j}[n]$,
4:   Compute Euclidean Distance $D_j[n]$ from level 1 to $n_T$ using $T[n]$ and $\tilde{y}[n]$
5: end if
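Steps 1 to 3 of Kernel 1 can be sketched as plain index arithmetic. The paper does not state how $n$ and $j$ are derived from $z$, nor how a path index is turned into symbol indices, so the layout below (consecutive threads handle consecutive paths of the same time slot, and a path index is read as base-$M$ digits) is only one plausible assumption, with hypothetical function names.

```python
def thread_to_work(z, n_slots, M, n_t):
    """Map a flat thread index z to (time slot n, path j), or None if out of range."""
    paths_per_slot = M ** n_t
    n, j = divmod(z, paths_per_slot)
    return (n, j) if n < n_slots else None       # guard of step 2

def path_symbols(j, M, n_t):
    """Decode path index j into n_t constellation indices (base-M digits, level 1 first)."""
    symbols = []
    for _ in range(n_t):
        j, digit = divmod(j, M)
        symbols.append(digit)
    return symbols

# Hypothetical example: 10 time slots, QPSK (M = 4), 2 transmit antennas
work = thread_to_work(38, n_slots=10, M=4, n_t=2)
symbols = path_symbols(work[1], M=4, n_t=2)
```

Storing only the integer path index, and decoding it on the fly, is consistent with the integer symbol lists described above and keeps the per-thread state small.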

Once the Euclidean Distances have been calculated, the detector must obtain the soft information. Depending on the kind of detector selected (OMAP or MLA), Kernel 2 or Kernel 3 is launched. In this case the number of blocks is computed as in Eq. 13 for $N \cdot m \cdot n_T$ threads. Each thread is in charge of computing one LLR $\Lambda_{i,k}$ for a given time index $n$ using Eq. 11 or Eq. 12, respectively. A matrix called gray, of size $M \times m$, is used to find the sets of symbols $\Omega_k^u$ with $k = 1, \ldots, m$. This matrix contains the representation of all constellation symbols in binary format. Since this data is read-only and does not change during the execution, it is well suited to being stored in constant memory.

Kernel 2: Computation of the LLR for the OMAP by the z-th thread for $N$ time slots.
Input: $D$, $s$, $N$, $\sigma^2$
Output: $\Lambda$
1: Calculate using the thread global index z:
   • Time slot identifier $n$
   • Layer position $i$
   • Bit position $k$
2: Set d0 = 0 and d1 = 0
3: if $n < N$ then
4:   for $j = 1, \ldots, P$ do
5:     Get $i$th symbol $s = s_{i,j}[n]$
6:     if gray$_{s,k}$ = 1 then
7:       d1 = d1 + $e^{-D_j[n]/\sigma^2}$
8:     else
9:       d0 = d0 + $e^{-D_j[n]/\sigma^2}$
10:    end if
11:  end for
12:  $\Lambda_{i,k}[n] = \log(\text{d1}/\text{d0})$
13: end if

Kernel 3: Computation of the LLR for the MLA by the z-th thread for $N$ time slots.
Input: $D$, $s$, $N$, $\sigma^2$
Output: $\Lambda$
1: Calculate using the thread global index z:
   • Time slot identifier $n$
   • Layer position $i$
   • Bit position $k$
2: Set d0 = 1e6 and d1 = 1e6
3: if $n < N$ then
4:   for $j = 1, \ldots, P$ do
5:     Get the $i$th symbol $s = s_{i,j}[n]$
6:     if gray$_{s,k}$ = 1 and $D_j[n]$ < d1 then
7:       d1 = $D_j[n]$
8:     end if
9:     if gray$_{s,k}$ = 0 and $D_j[n]$ < d0 then
10:      d0 = $D_j[n]$
11:    end if
12:  end for
13:  $\Lambda_{i,k}[n] = (\text{d0} - \text{d1})/\sigma^2$
14: end if
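The work done by one thread of Kernel 2 and Kernel 3 can be sketched in Python as below. This is a hedged illustration, assuming that the exponent in Eq. (11) is scaled by $\sigma^2$ and that the gray table is indexed from zero; the function names and the single-layer QPSK data are hypothetical.

```python
import math

def llr_exact(D, symbols, gray, k, sigma2):
    """Eq. (11) / Kernel 2 for one layer: symbols[j] is that layer's symbol index in path j."""
    d1 = sum(math.exp(-d / sigma2) for d, s in zip(D, symbols) if gray[s][k] == 1)
    d0 = sum(math.exp(-d / sigma2) for d, s in zip(D, symbols) if gray[s][k] == 0)
    return math.log(d1 / d0)

def llr_maxlog(D, symbols, gray, k, sigma2):
    """Eq. (12) / Kernel 3: keep only the smallest distance for each bit value."""
    d1 = min(d for d, s in zip(D, symbols) if gray[s][k] == 1)
    d0 = min(d for d, s in zip(D, symbols) if gray[s][k] == 0)
    return (d0 - d1) / sigma2

# Hypothetical example: one layer, four candidate paths, Gray-labelled QPSK
gray = [[0, 0], [0, 1], [1, 1], [1, 0]]
D = [0.2, 1.0, 3.0, 2.5]            # distances of the four paths
symbols = [0, 1, 2, 3]              # symbol index carried by each path
sigma2 = 0.5
exact = llr_exact(D, symbols, gray, 0, sigma2)
approx = llr_maxlog(D, symbols, gray, 0, sigma2)
```

Both values are negative here because the closest candidate carries a 0 in that bit position; the max-log value differs slightly from the exact one because it discards all but the smallest distance of each set.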

4.2 Soft Fixed Sphere Decoder

The Soft Fixed Sphere Decoder (SFSD) performs a predetermined tree search composed of three different stages. The first two stages (FE and SE) are known as the hard-output stage or HFSD detection (Barbero and Thompson, 2008):

• A full expansion of the tree (FE) in the first (highest) $L$ levels. At the FE stage, for each survivor path, all the possible values of the constellation are assigned to the symbol at the current level.

EfficientSoft-outputDetectors-Multi-coreandGPUimplementationsinMIMOPackLibrary

339

• A single-path expansion (SE) in the remaining $n_T - L$ tree levels. The SE stage starts from each retained path and proceeds down the tree, calculating the solution of the remaining successive interference cancellation (SIC) problem (Berenguer and Wang, 2003) as:

$$\hat{s}_i = \mathcal{Q}\left( \frac{\tilde{y}_i - \sum_{j=i+1}^{n_T} R_{ij}\hat{s}_j}{R_{ii}} \right), \quad i = n_T, \ldots, 1. \tag{14}$$

The function $\mathcal{Q}(\cdot)$ assigns the closest constellation value. Note that the efficient PED calculation using matrix $T$ can also be used to accelerate the computation of the summation $\sum_{j=i+1}^{n_T} R_{ij}\hat{s}_j$ of the SIC problem. The symbols are detected following a specific ordering, also proposed by the authors in (Barbero and Thompson, 2008). As shown in (Jalden et al., 2009), the maximum detection diversity can be achieved with the FSD if the following value of $L$ is chosen:

$$L \ge \sqrt{n_T} - 1 \tag{15}$$

• A Soft-Output extension (SOE) to provide soft information by obtaining an improved list of candidates (Barbero et al., 2008). Figure 2(b) shows the search tree of the SFSD for the case with $n_T = 4$ and QPSK symbols. The method starts from the list of candidates obtained by the hard-output FSD (composed of the FE and SE stages, Fig. 2(a)) and adds new candidates to provide more information about the counter bits. Note that, since the first level of the HFSD tree is already totally expanded, all the values necessary to compute the LLRs of the symbol bits in the first levels are available. Therefore, the list extension must start from the second level of such a path. To begin the list extension, the best $N_{iter}$ paths are selected from the initial hard-output FSD list (in this example, $N_{iter} = 2$). This is based on the heuristic that the lowest-distance paths may be candidates differing from the best paths in only some bits. The symbols belonging to these $N_{iter}$ paths are picked up from the root up to a certain level $l$ and, at level $l - 1$, $\log_2 M$ additional branches are explored, each of them having one of the bits of the initial path symbol negated. Afterwards, these new partial paths are completed following the SIC path, as done in the hard-output FSD scheme. The same operation is repeated until the lowest level of the tree is reached.
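The hard-output part of the search (FE followed by SE) can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the MIMOPack kernels: the channel ordering is omitted, the SIC step divides by the diagonal entry of R before quantizing, indices are 0-based, and all names and data are hypothetical.

```python
import math
from itertools import product

def quantize(value, constellation):
    """Q(.): index of the constellation point closest to value."""
    return min(range(len(constellation)), key=lambda c: abs(value - constellation[c]))

def hard_fsd(R, y_tilde, constellation, L):
    """Return the M^L candidate paths (symbol indices) and their distances."""
    n_t = len(y_tilde)
    candidates = []
    for top in product(range(len(constellation)), repeat=L):
        path = [None] * n_t
        path[n_t - L:] = top                       # FE: every value on the top L levels
        for i in range(n_t - L - 1, -1, -1):       # SE: SIC on the remaining levels
            interference = sum(R[i][j] * constellation[path[j]] for j in range(i + 1, n_t))
            path[i] = quantize((y_tilde[i] - interference) / R[i][i], constellation)
        dist = sum(
            abs(y_tilde[i] - sum(R[i][j] * constellation[path[j]] for j in range(i, n_t))) ** 2
            for i in range(n_t)
        )
        candidates.append((path, dist))
    return candidates

# Hypothetical example: 2 x 2 system with QPSK and a noise-free observation
qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
R = [[2.0, 0.5 - 0.5j],
     [0.0, 1.5]]
sent = [2, 1]
y_tilde = [sum(R[i][j] * qpsk[sent[j]] for j in range(2)) for i in range(2)]
L = math.ceil(math.sqrt(len(sent))) - 1            # smallest integer satisfying Eq. (15)
cands = hard_fsd(R, y_tilde, qpsk, L)
best = min(cands, key=lambda c: c[1])
```

The list always contains exactly $M^L$ candidates, which is what makes the workload of every thread predictable and the algorithm well suited to a GPU.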

Figure 2: Decoding trees of the SFSD algorithm for a $4 \times 4$ MIMO system with QPSK symbols, $N_{iter} = 2$ and $L = 1$: (a) Hard-Output stage and (b) Soft-Output Extension.

4.2.1 SFSD CUDA Implementation

Algorithm 3 shows the steps needed to perform the SFSD detection. First, memory for the input and output variables is allocated and the input data are copied into the GPU-GM. In this case, the matrices gray and neg and the constellation symbols $\Omega$ are copied into constant memory. The $\Omega$ variable is needed to perform the quantization $\mathcal{Q}(\cdot)$ in the SIC problem. Matrices $D$ and $s$ contain the information of the $P = M^L + N_{iter} \cdot m \cdot (n_T - L)$ paths computed: the $M^L$ branches of the Hard-Output stage and the $N_{iter} \cdot m \cdot (n_T - L)$ new branches of the Soft-Output extension (SOE) stage.

Algorithm 3: CUDA Parallelization for the SFSD detection of $N$ time instants.
1: Allocate Memory in GPU-GM for: $T$, $\tilde{y}$, $s$ and $D$,
2: Copy from CPU to GPU-GM: $T$ and $\tilde{y}$,
3: Copy from CPU to GPU Constant Memory: gray, neg and $\Omega$,
4: Select kernel configuration with $N \cdot M^L$ threads,
5: HFSD Stage: Obtain $D$ and $s$ using Kernel 4,
6: Copy from GPU-GM to CPU: $D$,
7: Obtain min calculating the path indices of the $N_{iter}$ minimum distances,
8: Copy from CPU to GPU: min,
9: Select kernel configuration with $N \cdot M \cdot m \cdot (n_T - L)$ threads,
10: SOE Stage: Update $D$ and $s$ using Kernel 5,
11: Select the kernel configuration with $m \cdot n_T$ threads,
12: Obtain $\Lambda$ using Kernel 6,
13: Copy from GPU-GM to CPU: $\Lambda$

In Kernel 4, each thread calculates one of the $M^L$ branches of the HFSD stage. After the hard-output part is finished, the CPU is in charge of finding the $N_{iter}$ minimum distances and storing them in the matrix min in ascending order. This matrix is copied to the GPU global memory. Then, the $N_{iter} \cdot m \cdot (n_T - L)$ new candidates to be obtained per time index $n$ are equally distributed among all the threads of the grid using Kernel 5. As mentioned, in the SOE stage $m$ additional branches are explored in each of the remaining $(n_T - L)$ levels. Each of them has one of the bits of the initial path symbol negated. In order to accelerate this expansion, a matrix (neg) is built before the detection. This matrix contains, for each constellation symbol $\Omega_i$, a list of the $m$ constellation symbols that result from negating its $k$th bit. For example, using a QPSK constellation, the binary representation of the symbol $\Omega_3$ is 11. Negating the 1st bit gives $1\bar{1} = 10$, so neg$_{3,1} = 2$; negating the 2nd bit gives $\bar{1}1 = 01$, so neg$_{3,2} = 1$. As with the matrix gray, this matrix is constant for the entire simulation, so it is also copied into constant memory.
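The neg lookup table described above can be sketched with a single bit operation. The sketch assumes, as the QPSK example suggests, that a symbol's index equals the integer value of its binary label and that bit 1 is the least significant bit; the function name is hypothetical.

```python
def build_neg(m):
    """neg[c][k-1] = index of the symbol whose label differs from c only in bit k (bit 1 = LSB)."""
    return [[c ^ (1 << k) for k in range(m)] for c in range(2 ** m)]

neg = build_neg(2)          # QPSK, m = 2 bits per symbol
```

With this table, `neg[3]` is `[2, 1]`, matching the values neg$_{3,1} = 2$ and neg$_{3,2} = 1$ given in the text.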

Kernel 4: Calculation of one of the branches of the HFSD detector by the z-th thread for $N$ time slots.
Input: $T$, $\tilde{y}$, $n_T$, $N$, $L$
Output: $D$, $s$
1: Calculate using the thread global index z:
   • Time slot identifier $n$
   • Path identifier $j$
2: if $n < N$ then
3:   Get Path from level $n_T - L + 1$ to $n_T$ as $s_{n_T-L+1:n_T,j}[n]$,
4:   Compute Partial Distance $D_j[n]$ from level $n_T - L + 1$ to $n_T$ using $T[n]$ and $\tilde{y}[n]$
5:   for $r = n_T - L, \ldots, 1$ do
6:     Compute $s_{r,j}[n]$ using SIC Eq. 14,
7:     Update path distance $D_j[n]$ using $T[n]$ and $\tilde{y}[n]$
8:   end for
9: end if

After this, the final step finds within this list the minimum distances of the paths having the counter bits and computes the $\log_2 M \cdot n_T$ LLRs. These operations are executed by Kernel 6.

Kernel 5: Calculation of new candidates for the SFSD detector by the z-th thread for $N$ time slots.
Input: $T$, $\tilde{y}$, $s$, min, $n_T$, $N$
Output: $D$, $s$
1: Calculate using the thread global index z:
   • Time slot identifier $n$
   • Path identifier $j$
   • Selected path (one of the $N_{iter}$ best paths)
   • Layer position $l$
   • Bit position $k$
2: if $n < N$ then
3:   Get index $j'$ of the selected path from min$[n]$,
4:   and its symbol on the $l$th layer as $s' = s_{l,j'}[n]$,
5:   Copy symbols from level $l+1$ to $n_T$ from global memory: $s_{l+1:n_T,j}[n] = s_{l+1:n_T,j'}[n]$,
6:   Get the symbol with the $k$th bit negated from constant memory: $s_{l,j}[n] = \text{neg}_{s',k}$,
7:   Compute partial distance $D_j[n]$ from level $l$ to $n_T$
8:   for $r = l - 1, \ldots, 1$ do
9:     Compute $s_{r,j}[n]$ using SIC Eq. (14),
10:    Update path distance $D_j[n]$ using $T[n]$ and $\tilde{y}[n]$
11:  end for
12: end if

Kernel 6: Computation of the LLR for the SFSD by the z-th thread for $N$ time slots.
Input: $D$, $s$, min, $N$, $\sigma^2$
Output: $\Lambda$
1: Calculate using the thread global index z:
   • Time slot identifier $n$
   • Layer position $i$
   • Bit position $k$
2: Set $d_{min}$ = 1e6
3: if $n < N$ then
4:   Get index $j'$ of the HFSD solution (first entry of min$[n]$),
5:   and its symbol on the $i$th layer as $s' = s_{i,j'}[n]$,
6:   for $j = 1, \ldots, P$ do
7:     $s = s_{i,j}[n]$
8:     if $D_j[n] < d_{min}$ and gray$_{s,k} \neq$ gray$_{s',k}$ then
9:       $d_{min} = D_j[n]$
10:    end if
11:  end for
12:  $\Lambda_{i,k}[n] = (D_{j'}[n] - d_{min})(1 - 2\,\text{gray}_{s',k})/\sigma^2$
13: end if

4.3 Fully Parallel Fixed Sphere Decoder

In the SFSD detector, a smart list extension based on the lowest-distance paths within the initial FSD list is proposed. However, this extension is performed in an almost totally sequential way, which limits the degree of parallelism of the algorithm. For this reason, a soft-output demodulator that performs a fully parallel list extension was proposed in (Roger et al., 2012): the fully parallel FSD (FPFSD). The proposed approach is purely based on the hard-output FSD scheme.

The list of candidates and distances necessary to obtain soft information is calculated through $n_T$ hard-output FSD searches, each with a different channel matrix ordering. The $n_T$ different channel orderings ensure that a different layer (level) of the system is placed at the top of the tree each time. This way, candidate paths containing all the bit labelling possibilities in every level are guaranteed and, thus, soft information about all the bit positions is always available. Recall that, for $n_T = 4$, 4 independent hard-output FSD searches such as the one in Fig. 2(a) should be carried out, each with a different channel matrix ordering. These tree searches can be carried out totally in parallel.

Table 1: Symbol detection position and corresponding tree-level for the involved FPFSD orderings in an example with $n_T = 4$.

Detection   Norm-based   Order   Order   Order   Order
position    Ordering     1       2       3       4
1st         2            1       2       3       4
2nd         4            2       4       2       2
3rd         3            4       3       4       3
4th         1            3       1       1       1

Note that, as when using the FSD ordering, the reliability of the symbol placed in the FE stage is irrelevant. Then, the remaining levels are ordered following the initial column-norm-based ordering but skipping the level that was already set at the top. The example in Table 1 shows how the ordering is set up for a particular column-norm-based ordering of a $4 \times 4$ channel, which in this case is $\{2, 4, 3, 1\}$. As the first row of Table 1 shows, the $i$th proposed ordering starts the data detection at the $i$th tree level, with $i \in \{1, 2, 3, 4\}$. Then, the remaining levels are explored following the column-norm-based ordering in column 2.
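The construction of the orderings can be written as a short Python sketch. The function name is hypothetical; the rule it applies is the one described above: the $i$th ordering places level $i$ first and keeps the remaining levels in their norm-based order.

```python
def fpfsd_orderings(norm_order):
    """For each level i, put i first and keep the rest in norm-based order."""
    n_t = len(norm_order)
    return [[top] + [lvl for lvl in norm_order if lvl != top]
            for top in range(1, n_t + 1)]

orders = fpfsd_orderings([2, 4, 3, 1])
```

For the norm-based ordering {2, 4, 3, 1}, the four lists produced are the columns Order 1 to Order 4 of Table 1.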

4.3.1 CUDA Implementation

Algorithm 4 shows the steps needed to perform the FPFSD detection. Once the relevant data have been allocated and copied to the GPU, Kernel 7 calculates the PEDs of the $P = n_T \cdot M$ branches, one set of $M$ branches for each $p$th ordered matrix. Once the Euclidean Distances have been calculated, the detector must obtain the soft information. The LLRs are obtained using Kernel 3 with the appropriate list size $P$.

Algorithm 4: CUDA Parallelization for the FPFSD detection of $N$ time instants.
1: Allocate Memory in GPU-GM for: $T$, $\tilde{y}$, $s$ and $D$,
2: Copy from CPU to GPU-GM: $T$ and $\tilde{y}$,
3: Copy from CPU to GPU Constant Memory: gray and $\Omega$,
4: Select kernel configuration with $N \cdot n_T \cdot M$ threads,
5: Obtain $D$ and $s$ using Kernel 7,
6: Select the kernel configuration with $m \cdot n_T$ threads,
7: Obtain $\Lambda$ using Kernel 3 with $P = n_T \cdot M$,
8: Copy from GPU-GM to CPU: $\Lambda$

Kernel 7: Calculation of one of the branches of the FPFSD detector by the z-th thread for $N$ time slots.
Input: $T$, $\tilde{y}$, $M$, $n_T$, $N$, $\sigma^2$
Output: $D$, $s$
1: Calculate using the thread global index z:
   • Time slot identifier $n$
   • Path identifier $j$
   • FPFSD ordering index $p$
2: if $n < N$ then
3:   Get Path from level $n_T$ to $n_T$ as $s_{:,j}[n]$,
4:   Compute Partial Distance $D_j[n]$ from level $n_T$ to $n_T$ using $T^{(p)}[n]$ and $\tilde{y}^{(p)}[n]$
5:   for $r = n_T - 1, \ldots, 1$ do
6:     Compute $s_{r,j}[n]$ using SIC Eq. 14,
7:     Update path distance $D_j[n]$ using $T^{(p)}[n]$ and $\tilde{y}^{(p)}[n]$
8:   end for
9: end if

5 RESULTS

In order to assess the performance of our library, we have evaluated the execution times of the Soft-Output detectors described in the previous sections.

For the implementations we employed a heterogeneous system with two Nvidia Tesla K20Xm GPUs with 14 SMs, each SM including 192 cores. The core frequency is 0.73 GHz. The GPU has 5 GB of GDDR5 global memory and 48 KB of shared memory per block. The installed CUDA toolkit is version 5.5. The Nvidia cards are mounted on a PC with two Intel Xeon E5-2697 CPUs at 2.70 GHz with 12 cores and hyperthreading activated.

Table 2 shows the execution time and speedup obtained by the OMAP and MLA demodulators for the multicore and GPU implementations. The sequential version refers to the algorithm without the optimization obtained with the use of matrix $T$ described in Section 3.3. The speedups are defined as the ratio between the execution time of the sequential version and that of the parallel implementations. These results have been obtained by simulating a MIMO system with 4 transmit and receive antennas, a 16-QAM symbol alphabet and $N = 10000$. The CUDA block configuration uses 16 threads in each of the two block dimensions. As can be seen, parallel execution dramatically reduces the response time for optimal soft demodulation, running up to 82 times faster than the sequential version.

Table 2: Execution time in seconds and speedup of the optimal (OMAP) and max-log approximation (MLA) MIMOPack soft-output detectors with different library configurations.

Version           Execution time (speedup)
                  OMAP             MLA
Sequential        304.00           88.27
48 OMP threads    15.19 (≈20x)     4.70 (≈19x)
GPU               3.72 (≈82x)      3.05 (≈29x)
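As a quick check of the speedup definition (sequential time divided by parallel time), the figures in Table 2 can be recomputed from the reported execution times. The dictionary layout and the function name `speedup` below are illustrative choices, not part of the library.

```python
# Execution times in seconds, copied from Table 2.
times = {
    "OMAP": {"sequential": 304.00, "48 OMP threads": 15.19, "GPU": 3.72},
    "MLA": {"sequential": 88.27, "48 OMP threads": 4.70, "GPU": 3.05},
}


def speedup(t_sequential, t_parallel):
    """Speedup as the ratio of sequential to parallel execution time."""
    return t_sequential / t_parallel


# Speedup of every parallel version with respect to the sequential one.
speedups = {
    detector: {
        version: speedup(t["sequential"], t_version)
        for version, t_version in t.items()
        if version != "sequential"
    }
    for detector, t in times.items()
}

for detector, per_version in speedups.items():
    for version, value in per_version.items():
        print(f"{detector:5s} {version:15s} {value:6.2f}x")
```

Applying the same ratio between the two parallel versions shows that the GPU is about 4.1 times faster than 48 OpenMP threads for OMAP, but only about 1.5 times faster for MLA.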

Due to the lower complexity of the suboptimal SFSD and FPFSD methods, we can simulate transmissions with higher complexity. Figures 3 and 4 have been obtained by simulating N = 10000 signals and varying the number of transmitter antennas (n) and the constellation size.

The speedup results obtained with the parallel SFSD implementations can be seen in Figure 3. The value of N_iter is {2, 4, 6} for QPSK, 16-QAM and 64-QAM, respectively, and L = ⌈√n⌉ − 1. As we can see, the multicore version performs better than the CUDA version when the computational burden is insufficient to exploit the capabilities of the GPU. When the number of transmitter antennas n and the constellation size increase, the CUDA implementation outperforms the multicore version. This is more noticeable from n = 10, since the number of levels in the FE stage is fixed to L = 3. This behavior also occurs for the FPFSD detector (see Figure 4).
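The number of levels in the FE stage can be tabulated for the antenna counts on the horizontal axis of the figures. The sketch below assumes the expression reads L = ⌈√n⌉ − 1 with n the number of transmitter antennas; the function name `full_expansion_levels` is hypothetical.

```python
import math


def full_expansion_levels(n_tx):
    """Return L = ceil(sqrt(n_tx)) - 1 using exact integer arithmetic."""
    root = math.isqrt(n_tx)
    # isqrt rounds down, so bump the root unless n_tx is a perfect square.
    ceil_root = root if root * root == n_tx else root + 1
    return ceil_root - 1


# Antenna counts matching the axis ticks of Figures 3 and 4.
levels = {n_tx: full_expansion_levels(n_tx) for n_tx in range(4, 21, 2)}

for n_tx, level in levels.items():
    print(f"n = {n_tx:2d} -> L = {level}")
```

Under this reading, 10 to 16 antennas all give L = 3, which matches the value quoted in the text, while 18 and 20 antennas would give L = 4.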

[Plot: speedup for 4 to 20 transmitter antennas; curves for QPSK, 16QAM and 64QAM, each in omp and gpu versions.]

Figure 3: Speedup for the SFSD detector with different constellations and number of transmitter antennas.

[Plot: speedup for 4 to 20 transmitter antennas; curves for QPSK, 16QAM and 64QAM, each in omp and gpu versions.]

Figure 4: Speedup for the FPFSD detector with different constellations and number of transmitter antennas.

6 CONCLUSIONS

This paper presents a set of Soft-Output detectors to be included in the future library for MIMO communication systems, called MIMOPack, which aims to provide a set of routines needed to perform the most complex stages in current wireless communications. The efficiency of these detectors has been evaluated by comparing the execution time with different platform configurations. The variety of detectors with mixed complexities and performances makes it possible to cover multiple use cases with different channel conditions and scenarios such as massive MIMO. Moreover, parallel implementations allow the execution of large simulations over different architectures, thus exploiting the capacity of modern machines. The results obtained with the efficient soft-output detectors presented in this paper demonstrate that the MIMOPack library may become a very useful tool for companies involved in the development of new wireless and broadband standards, which need to obtain results and statistics for their proposals quickly, and for other researchers, for whom it eases the implementation of scientific codes.

ACKNOWLEDGEMENTS

This work has been supported by the SP20120646 project of Universitat Politècnica de València, by the ISIC/2012/006 and PROMETEO FASE II 2014/003 projects of Generalitat Valenciana, and by the European Union ERDF and the Spanish Government through TEC2012-38142-C04-01.


REFERENCES

Barbero, L. G., Ratnarajah, T., and Cowan, C. (2008). A low-complexity soft-MIMO detector based on the fixed-complexity sphere decoder. In IEEE International Conference on Acoustics, Speech and Signal Processing.

Barbero, L. G. and Thompson, J. S. (2008). Fixing the complexity of the sphere decoder for MIMO detection. IEEE Transactions on Wireless Communications, 7(6):2131–2142.

Berenguer, I. and Wang, X. (2003). Space-time coding and signal processing for MIMO communications. Journal of Computer Science and Technology, 18(6):689–702.

Caire, G., Taricco, G., and Biglieri, E. (1998). Bit-interleaved coded modulation. IEEE Transactions on Information Theory, 44(3):927–946.

Falcao, G., Silva, V., and Sousa, L. (2009). How GPUs can outperform ASICs for fast LDPC decoding. In International Conference on Supercomputing, Yorktown Heights, New York (USA).

Flynn, M. (1972). Some computer organizations and their effectiveness. IEEE Transactions on Computers, 21:948–960.

Jalden, J., Barbero, L., Ottersten, B., and Thompson, J. (2009). The error probability of the fixed-complexity sphere decoder. IEEE Transactions on Signal Processing, 57:2711–2720.

Kim, J., Hyeon, S., and Choi, S. (2010). Implementation of an SDR system using graphics processing unit. IEEE Communications Magazine, 48(3):156–162.

Muller-Weinfurtner, S. (2002). Coding approaches for multiple antenna transmission in fast fading and OFDM. IEEE Transactions on Signal Processing, 50:2442–2450.

NVIDIA (2014). Nvidia programming guide, version 6.0. http://docs.nvidia.com/.

OpenMP (2013). Application program interface v4.0. http://www.openmp.org/.

Paulraj, A. J., Gore, D. A., Nabar, R., and Bolcskei, H. (2004). An overview of MIMO communications - a key to Gigabit wireless. In Proceedings of the IEEE, volume 92, pages 198–218.

Roger, S., Ramiro, C., Gonzalez, A., Almenar, V., and Vidal, A. M. (2012). Fully parallel GPU implementation of a fixed-complexity soft-output MIMO detector. IEEE Transactions on Vehicular Technology, 61(8):3796–3800.

Rusek, F., Persson, D., Lau, B., Larsson, E., Marzetta, T., Edfors, O., and Tufvesson, F. (2013). Scaling up MIMO: Opportunities and challenges with very large arrays. IEEE Signal Processing Magazine, 30(1):40–60.

Wu, M., Sun, Y., Gupta, S., and Cavallaro, J. (2010). Implementation of a high throughput soft MIMO detector on GPU. Journal of Signal Processing Systems, 64:123–136.

PECCS2015-5thInternationalConferenceonPervasiveandEmbeddedComputingandCommunicationSystems

344