3.3 Optimization of Kernel Resources
For GPUs with the AMD Graphics Core Next (GCN)
architecture, the local memory of each compute unit
is divided into 32 banks (AMD Inc, 2015a), which are
accessed at the granularity of a half-wavefront (32
work-items). If a bank is accessed by multiple
work-items of the same half-wavefront, a bank conflict
occurs unless all of them access the same address, in
which case the value is broadcast. Under a bank
conflict the conflicting accesses are serialized, and
execution stalls until every work-item has obtained
its data. Severe bank conflicts therefore leave a
large number of compute cycles idle and lower the
running efficiency.
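As an illustration, the following OpenCL C fragment (a minimal sketch; the tile array and padding scheme are hypothetical, not taken from the implementation) contrasts a conflict-prone local-memory access pattern with a padded, conflict-free one:

// Minimal OpenCL C sketch of an LDS bank conflict on GCN: 32 banks,
// each 4 bytes wide, so bank = (byte_address / 4) % 32.
__kernel void bank_demo(__global float *out)
{
    __local float tile[64 * 33];   // row stride padded from 32 to 33 floats
    int lid = get_local_id(0);

    tile[lid * 33] = (float)lid;   // write one "column" of the tile
    barrier(CLK_LOCAL_MEM_FENCE);

    // With a stride of 32 floats, tile[lid * 32] would map every work-item
    // of a half-wavefront to bank 0 and serialize the 32 accesses; the
    // padded stride of 33 spreads the same column access over all 32 banks.
    out[get_global_id(0)] = tile[lid * 33];
}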
To reduce bank conflicts, each work-item makes a
private-memory copy of the data it operates on,
storing the coordinates and the normals of the two
points of a pair. In addition, vector-format data
also reduces bank conflicts because it is aligned to
the banks (Wilt, 2013). Considering the data
structure of railway track point clouds, the 3D
coordinates and the normals are therefore stored as
float3 vectors. Since the number of registers is
limited, heavy use of private memory may cause
registers to spill to DRAM, with a negative effect on
performance (Gaster et al., 2012); this should be
avoided.
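A minimal sketch of this private-copy idea is given below; the function name, the neighbor-array layout, and the use of the standard PFH angular features (after Rusu et al.) are assumptions for illustration, not the implemented kernel:

// Sketch: the coordinates and normals of the two points of a pair are
// copied from local memory into private float3 registers once, so the
// pair-feature arithmetic reads registers instead of hitting LDS banks.
void pair_feature_private(__local const float3 *pts,  // neighbor coordinates
                          __local const float3 *nrm,  // neighbor normals
                          int i, int j,
                          float4 *feat)               // private output
{
    // Private copies: coordinates and normal of each point, stored as
    // float3 vectors so that each load is aligned to the banks.
    float3 ps = pts[i], ns = nrm[i];
    float3 pt = pts[j], nt = nrm[j];

    float3 d    = pt - ps;                  // assumes distinct points
    float  dist = length(d);
    float3 u = ns;                          // Darboux frame of the pair
    float3 v = normalize(cross(u, d));
    float3 w = cross(u, v);

    // Three angular features plus the distance, computed entirely from
    // the private copies.
    *feat = (float4)(dot(v, nt),
                     dot(u, d) / dist,
                     atan2(dot(w, nt), dot(u, nt)),
                     dist);
}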
The kernel is compiled with CodeXL in order to count
the amounts of register (private memory) and local
memory resources it uses, as shown in Table 1.
Table 1: Kernel resource usage.
Table 1 shows that the kernel's resource usage lies
within the recommended range and that the kernel
builds successfully, so performance is not degraded
by register spilling. For different numbers of
neighbor points, the memory-unit load rates were
analyzed with CodeXL; the comparison before and after
the memory optimization is shown in Figure 5.
As shown, the load rate after optimization is clearly
higher than before, but it declines as the number of
neighbor points increases. This is because the amount
of data in local memory grows linearly with the
number of neighbors k, while the amount of
computation grows quadratically, since pair features
are evaluated for on the order of k² point pairs; for
example, k = 50 neighbors means 50 stored
coordinate/normal pairs but 2500 pair evaluations.
With enough neighbor points, the bottleneck therefore
shifts from the load on the memory unit to the
compute units.
Figure 5: Comparison of memory unit load rates (MemUnitBusy, %) for different numbers of neighbors, before and after optimization.
3.4 The Division of Work-groups
In the abstract OpenCL model, a work-group is
composed of work-items (Scarpino, 2011) that access
the same resources; in other words, all the
work-items in a work-group share the same local
memory. In the computation of PFH, the neighbor
points of one query point are independent of those of
the other query points. It is therefore natural to
map the neighbor points of a single query point to
one OpenCL work-group.
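On the host side this mapping can be sketched as follows (variable and kernel names are hypothetical; a one-dimensional NDRange is assumed):

/* Host-side sketch: one work-group of size N per query point, so
 * work-group g processes the neighborhood of query point g.
 * Assumes queue (cl_command_queue), pfh_kernel (cl_kernel) and
 * num_query_points have already been set up. */
size_t N      = 64;                     // work-group size, see below
size_t global = (size_t)num_query_points * N;
size_t local  = N;

cl_int err = clEnqueueNDRangeKernel(queue, pfh_kernel,
                                    1,  // one-dimensional NDRange
                                    NULL, &global, &local,
                                    0, NULL, NULL);
// In the kernel, get_group_id(0) selects the query point and
// get_local_id(0) indexes the work within its neighborhood.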
The efficiency of GPU computing is related to the
work-group size. For GPUs with the AMD GCN
architecture, a wavefront contains 64 work-items and
is the lowest level at which flow control takes
effect (AMD Inc, 2015b). It is best for the
work-group size to be an integer multiple of the
wavefront size (Scarpino, 2011). If the number of
neighbor points is k and the work-group size is N,
the number of loop iterations needed to cover all
point pairs is k². All the computation is covered if
each work-item loops M times with M×N ≥ k². The value
of k is set by the user. With a small k the total
number of iterations k² is low, and setting N too
large causes excessive idling of work-items. Using
the analysis function of CodeXL, the load percentage
of the vector arithmetic and logic unit (VALU) can be
obtained for different work-group sizes.
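The loop structure implied by M×N ≥ k² can be sketched in the kernel as follows (index names are illustrative):

// Kernel-side sketch: the N work-items of a group cooperatively cover the
// k*k point pairs of one neighborhood; each work-item runs M = ceil(k*k/N)
// iterations. k is the number of neighbor points (a kernel argument here).
int lid   = get_local_id(0);            // 0 .. N-1
int N     = (int)get_local_size(0);
int pairs = k * k;

for (int p = lid; p < pairs; p += N) {  // M iterations per work-item
    int i = p / k;                      // first point of the pair
    int j = p % k;                      // second point of the pair
    // ... accumulate the PFH histogram bin for pair (i, j) ...
}
// If k is small and N is large, work-items with lid >= pairs run zero
// iterations, which is exactly the idling discussed above.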
The results demonstrate that with a work-group size
of 64 the VALU initialization proportion is
comparable to that with a size of 256, but the load
proportion is much higher than the latter; a
work-group size of 64 is therefore more effective, in
agreement with the theoretical analysis. When the
number of neighbor points is large, the work-group
size has very little influence on the VALU
initialization proportion, because the GPU computing
resources are already fully utilized. In summary, the
work-group size is set to one wavefront (that is, 64).