Efficient Projective Transformation and Lanczos Interpolation on ARM
Platform using SIMD Instructions
Konstantinos Papadopoulos and Kyriakos Vlachos
Computer Engineering and Informatics Department, University of Patras, Patras, Greece
Keywords:
NEON, SIMD, Image Processing, Projective Transformation.
Abstract:
This paper proposes a novel way of exploiting NEON SIMD instructions for accelerating projective transformation on ARM platforms. Instead of applying data parallelism to linear algorithms, we study the effectiveness of SIMD intrinsics on this non-linear algorithm. For image resampling, Lanczos interpolation is used, since it is adequately accurate despite its rather large complexity. Multithreading is also employed for optimal use of system resources. Moreover, qualitative and quantitative results of NEON's performance are presented and analyzed.
1 INTRODUCTION
Projective transformation is used in a wide range of computer vision applications. It provides a linear mapping between arbitrary quadrilaterals, which is very useful for deforming images controlled by mesh partitioning. Some of the most well-known applications are the removal of perspective distortion, image stabilization, panoramic mosaic creation and object tracking. Moreover, Lanczos resampling is one of the most accurate algorithms for image upscaling, according to (Burger and Burge, 2009). However, it is computationally intensive, which can result in poor performance. The demand for faster multimedia applications is high; therefore, improving the processing time of projective transformation is crucial.
The contribution of SIMD units to multimedia application development has been significant over the past years. A SIMD unit allows parallel execution of both data-type operations (arithmetic, logical, etc.) and load/store operations. Theoretically, such a unit is able to accelerate operations by up to 16 times, but this applies only to certain data types. Optimal use of SIMD is possible at the assembly level. However, developers have the option to use SIMD intrinsics in high-level programming (C/C++), taking advantage of interoperability and improved control over data.
Work presented in (Welch et al., 2012) concerns the implementation of a 2D bilinear interpolation algorithm using NEON SIMD instructions. This algorithm is exclusively used for image scaling. The speedup achieved compared to the baseline algorithm was 1.97-2.06 times. Moreover, in (Mitra et al., 2013), the authors proposed SIMD vector operations to accelerate code performance on both low-powered ARM and Intel platforms. They implemented Float-to-Short data type conversion, binary image thresholding, Gaussian blur filter, Sobel filter and edge detection algorithms on various ARM devices and managed to achieve speed gains from 1.05 to 13.88 compared to compiler auto-vectorization. In addition, (Mazza et al., 2014) achieved a speed gain of 3.76-3.86 in bilinear interpolation using multithreading (2 Cortex-A9 cores) and SIMD instructions. Additional work in the field of SIMD multimedia processing includes linear image processing using OpenCL's SIMD capabilities in (Antao and Sousa, 2010) and acceleration of an alpha blending algorithm in a Flash application using the Intel x86-64 platform's SIMD (SSE) instructions in (Perera et al., 2011).
This paper proposes a way of accelerating projective transformation using NEON SIMD instructions. The chosen resampling method is Lanczos interpolation, which is demanding and computationally heavy but produces notable results in terms of accuracy. Multithreading is utilized as well, offering efficient use of CPU resources. The overall performance evaluation of the proposed implementation is based on the speed gains. A qualitative evaluation of the output frames is also provided.
2 SIMD IMPLEMENTATION
Our implementation is divided into two separate parts: the projective mapping function and resampling. Each of these parts affects the data parallelization approach differently.
2.1 SIMD Projective Mapping
An inverse mapping approach is used in this context. This means that, for each discrete pixel position (u, v) in the output image, the corresponding continuous point (x, y) is computed using the inverse geometric transformation T^{-1}. In this method, each pixel of the target image I' is calculated and filled exactly once, so that there are no empty spots or multiple fillings. The calculation of coordinates x and y is shown in (1) and (2), respectively:
x = \frac{1}{h'} (d_{11}x' + d_{12}y' + d_{13}) = \frac{d_{11}x' + d_{12}y' + d_{13}}{d_{31}x' + d_{32}y' + 1}    (1)

y = \frac{1}{h'} (d_{21}x' + d_{22}y' + d_{23}) = \frac{d_{21}x' + d_{22}y' + d_{23}}{d_{31}x' + d_{32}y' + 1}    (2)

where d_{ij} is the corresponding element of A_{adj}.
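For reference, a minimal scalar sketch of the per-pixel mapping defined by (1) and (2) is given below; the function name, the row-major layout of the adjugate matrix D, and the assumption that D is normalized so that D[2][2] = 1 are ours, not taken from the paper's source code.

/* Scalar reference for equations (1) and (2): maps one discrete target
 * position (xp, yp) back to the continuous source point (x, y) using the
 * adjugate matrix D of A (assumed normalized so that D[2][2] == 1). */
static inline void inverse_map_point(const float D[3][3],
                                     float xp, float yp,
                                     float *x, float *y)
{
    float h = D[2][0] * xp + D[2][1] * yp + D[2][2];  /* h' = d31*x' + d32*y' + 1 */
    *x = (D[0][0] * xp + D[0][1] * yp + D[0][2]) / h; /* equation (1) */
    *y = (D[1][0] * xp + D[1][1] * yp + D[1][2]) / h; /* equation (2) */
}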
The parallelization of data processing is performed across loop iterations. Thus, the inner loop is changed so that four consecutive coordinate pairs are loaded into a 128-bit Q NEON register.
Clearly, the floating-point data type must be preserved for easier homogenization. On one hand, in the case of integers, homogenization could not be effective, as it would require independent processing of each point (there is no division instruction for integers). On the other hand, there is a floating-point NEON SIMD intrinsic (vrecpsq_f32()) that refines the reciprocal of a NEON register's lanes using the Newton-Raphson iteration. In our case, this refinement step is performed twice for more accurate results. Algorithm 1 demonstrates the projective mapping function using SIMD instructions.
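As a rough illustration (a sketch assuming the standard arm_neon.h intrinsics, not our actual source), the reciprocal of four homogeneous coordinates h can be computed as follows, with the two Newton-Raphson refinement steps mentioned above; this corresponds to the vreciprocal_4 step in Algorithm 1.

#include <arm_neon.h>

/* Approximate 1/h for four float lanes; vrecpeq_f32 gives the initial
 * estimate and each vrecpsq_f32/vmulq_f32 pair is one Newton-Raphson step. */
static inline float32x4_t vreciprocal_4(float32x4_t h)
{
    float32x4_t r = vrecpeq_f32(h);       /* initial estimate               */
    r = vmulq_f32(r, vrecpsq_f32(h, r));  /* 1st refinement: r *= (2 - h*r) */
    r = vmulq_f32(r, vrecpsq_f32(h, r));  /* 2nd refinement                 */
    return r;
}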
2.2 SIMD Lanczos Interpolation
The interpolation method used in this context is 2nd-order Lanczos, because it maintains a balance between computational cost (in contrast to higher-order Lanczos interpolation) and accuracy. Its 1D kernel is defined in (3):
Algorithm 1: Projective mapping function using SIMD instructions.
1:  D ← adjoint(A);                      (A: 3x3 transformation matrix)
2:  for x in image height do
3:      y'_temp ← vload_4(x · D_12 + D_13);
4:      x'_temp ← vload_4(x · D_22 + D_23);
5:      h_temp  ← vload_4(x · D_32 + D_33);
6:      for y in image width, step 4 do
7:          vec_y     ← vload_4([y : y+3]);
8:          temp_y    ← vmac_4(y'_temp, vec_y, D_11);
9:          temp_x    ← vmac_4(x'_temp, vec_y, D_21);
10:         temp_h    ← vmac_4(h_temp, vec_y, D_31);
11:         vec_recip ← vreciprocal_4(temp_h);
12:         temp_y    ← vmultiply_4(temp_y, vec_recip);
13:         temp_x    ← vmultiply_4(temp_x, vec_recip);
14:         [y' : y'+3] ← vstore_4(temp_y);
15:         [x' : x'+3] ← vstore_4(temp_x);
16:     end for
17: end for
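A possible mapping of the inner loop of Algorithm 1 onto arm_neon.h intrinsics is sketched below; the function and variable names, the row-major layout of the adjugate matrix d[9], and the assumption that the image width is a multiple of four are illustrative and not taken from our actual code.

#include <arm_neon.h>

/* Inverse mapping of one output row (row index x), four positions per
 * iteration, following Algorithm 1. xmap/ymap receive the continuous
 * source coordinates for the 'width' positions of this row. */
static void map_row(const float d[9], int x, int width, float *xmap, float *ymap)
{
    /* Row-constant terms (lines 3-5 of Algorithm 1). */
    float32x4_t y0 = vdupq_n_f32(x * d[1] + d[2]);   /* x*D12 + D13 */
    float32x4_t x0 = vdupq_n_f32(x * d[4] + d[5]);   /* x*D22 + D23 */
    float32x4_t h0 = vdupq_n_f32(x * d[7] + d[8]);   /* x*D32 + D33 */

    for (int y = 0; y < width; y += 4) {
        float yf[4] = { (float)y, (float)(y + 1), (float)(y + 2), (float)(y + 3) };
        float32x4_t vy = vld1q_f32(yf);

        float32x4_t ty = vmlaq_n_f32(y0, vy, d[0]);  /* + vec_y * D11 */
        float32x4_t tx = vmlaq_n_f32(x0, vy, d[3]);  /* + vec_y * D21 */
        float32x4_t th = vmlaq_n_f32(h0, vy, d[6]);  /* + vec_y * D31 */

        /* Reciprocal of the homogeneous term, two Newton-Raphson steps. */
        float32x4_t r = vrecpeq_f32(th);
        r = vmulq_f32(r, vrecpsq_f32(th, r));
        r = vmulq_f32(r, vrecpsq_f32(th, r));

        vst1q_f32(&ymap[y], vmulq_f32(ty, r));
        vst1q_f32(&xmap[y], vmulq_f32(tx, r));
    }
}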
w_{L_2}(x) = \begin{cases} 1, & |x| = 0 \\ \dfrac{2 \sin(\pi x / 2) \sin(\pi x)}{\pi^2 x^2}, & 0 < |x| < 2 \\ 0, & |x| \ge 2 \end{cases}    (3)
Due to its high complexity, a lookup table (LUT) of 10000 fixed-point kernel values is used. In addition, this kernel is x/y-separable; therefore, the 2D Lanczos interpolation can be expressed as in equation (4):
I' = \frac{1}{w} \sum_{v=\lfloor y \rfloor - 1}^{\lfloor y \rfloor + 2} w_{L_2}(y - v) \sum_{u=\lfloor x \rfloor - 1}^{\lfloor x \rfloor + 2} I(u, v) \, w_{L_2}(x - u)    (4)

where w = \sum_{v=\lfloor y \rfloor - 1}^{\lfloor y \rfloor + 2} w_{L_2}(y - v) \sum_{u=\lfloor x \rfloor - 1}^{\lfloor x \rfloor + 2} w_{L_2}(x - u).
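To make the LUT idea concrete, a minimal sketch of the kernel of equation (3) and of the table construction follows; the fixed-point scale factor (2^6, chosen so that an 8-bit subpixel times a weight still fits in a signed 16-bit lane) and the sampling of |x| over [0, 2) are our assumptions, not details from the paper.

#include <math.h>
#include <stdint.h>

#define LUT_SIZE 10000      /* number of precomputed kernel samples, as in the text */
#define FP_SHIFT 6          /* fixed-point scale 2^6 (our assumption)               */

static int16_t lanczos2_lut[LUT_SIZE];

/* Lanczos-2 kernel, equation (3). */
static double lanczos2(double x)
{
    x = fabs(x);
    if (x == 0.0) return 1.0;
    if (x >= 2.0) return 0.0;
    return 2.0 * sin(M_PI * x / 2.0) * sin(M_PI * x) / (M_PI * M_PI * x * x);
}

/* Precompute fixed-point kernel values for |x| sampled over [0, 2). */
static void init_lanczos2_lut(void)
{
    for (int i = 0; i < LUT_SIZE; i++) {
        double x = 2.0 * i / LUT_SIZE;
        lanczos2_lut[i] = (int16_t)lround(lanczos2(x) * (1 << FP_SHIFT));
    }
}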
In ARGB colorspace, four output subpixel values will be processed in each iteration. Moreover, every output pixel will have to be normalized in order to avoid image artifacts. The proposed procedure for the SIMD implementation of Lanczos-2 interpolation follows.

Initially, out-of-frame image pixels are removed using logical operations instead of if-clauses. The results of these operations are used to mask both the y-axis and x-axis Lanczos kernel values (w_y(i) = w_{L_2}(y' - v_i), w_x(i) = w_{L_2}(x' - u_i), i = 0 ... 3) loaded from the LUT. Then, the y-axis kernel values are loaded into two NEON registers of 8 lanes each, where every w_y(i) is placed into four consecutive lanes.
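A minimal sketch of this masking step, assuming 32-bit tap coordinates and 16-bit fixed-point weights (names and types are ours), could be:

#include <arm_neon.h>
#include <stdint.h>

/* Zero the kernel weights of taps that fall outside the image, using
 * comparison masks instead of per-tap if-clauses. 'u' holds the four tap
 * coordinates u_0..u_3 and 'wx' the corresponding fixed-point weights. */
static inline int16x4_t mask_weights(int16x4_t wx, int32x4_t u, int32_t width)
{
    uint32x4_t in_lo = vcgeq_s32(u, vdupq_n_s32(0));       /* u >= 0        */
    uint32x4_t in_hi = vcltq_s32(u, vdupq_n_s32(width));   /* u <  width    */
    uint16x4_t mask  = vmovn_u32(vandq_u32(in_lo, in_hi)); /* 0xFFFF inside */
    return vand_s16(wx, vreinterpret_s16_u16(mask));       /* zero outside  */
}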
Next, four pixels I(u_i, v_j), I(u_i, v_{j+1}), I(u_i, v_{j+2}), I(u_i, v_{j+3}) are loaded into one NEON register as 16 subpixels of 8 bits each. These values are converted into 16-bit signed integers and stored in two NEON registers of 8 x 16-bit lanes. Thereafter, each subpixel is multiplied by its corresponding Lanczos kernel value and the two vectors are added together, as shown in Figure 1. After that, the upper and lower halves of the register are added pairwise and multiplied by the scalar w_x(i), i = 0 ... 3.
Figure 1: Vectorized convolution in y axis.
The above procedure is repeated four times, once for each w_x(i), during the SIMD convolution. The implementation concludes with output pixel normalization. The reciprocal of w is computed and then multiplied with the output subpixels, followed by clamping of their values to the [0, 255] range. Finally, each produced pixel is stored as a 32-bit unsigned integer in the output image array. The dataflow diagram of the overall SIMD Lanczos interpolation is presented in Figure 2.
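The inner y-axis step of Figure 1 could be sketched with intrinsics as follows; the fixed-point layout (16-bit weights, widening multiply by w_x(i) into 32-bit accumulators) and all names are our assumptions rather than the actual implementation.

#include <arm_neon.h>
#include <stdint.h>

/* One column of the Lanczos-2 convolution: the four ARGB pixels
 * I(u_i, v_j)..I(u_i, v_j+3) (16 consecutive bytes) are weighted by the
 * y-axis kernel values wy[0..3] and the result is scaled by the x-axis
 * weight wx_i, yielding four 32-bit channel accumulators (A, R, G, B). */
static inline int32x4_t lanczos2_column(const uint8_t *src4px,
                                        const int16_t wy[4], int16_t wx_i)
{
    uint8x16_t px = vld1q_u8(src4px);

    /* Widen the 16 subpixels to signed 16-bit, two registers of 8 lanes. */
    int16x8_t lo = vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(px)));   /* pixels 0,1 */
    int16x8_t hi = vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(px)));  /* pixels 2,3 */

    /* Each w_y(i) replicated over the four subpixel lanes of its pixel. */
    int16x8_t wlo = vcombine_s16(vdup_n_s16(wy[0]), vdup_n_s16(wy[1]));
    int16x8_t whi = vcombine_s16(vdup_n_s16(wy[2]), vdup_n_s16(wy[3]));

    /* Multiply subpixels by their kernel values and add the two vectors,
     * then add the upper and lower halves pairwise (Figure 1). */
    int16x8_t acc = vaddq_s16(vmulq_s16(lo, wlo), vmulq_s16(hi, whi));
    int16x4_t sum = vadd_s16(vget_low_s16(acc), vget_high_s16(acc));

    /* Widening multiply by the scalar x-axis weight w_x(i). */
    return vmull_n_s16(sum, wx_i);
}

The caller would accumulate the four column results for i = 0 ... 3, multiply by the reciprocal of w, clamp to [0, 255] (e.g., with saturating narrowing intrinsics such as vqmovn_s32 followed by vqmovun_s16) and store the pixel as a 32-bit value, following the normalization step described above.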
3 RESULTS
In this section, qualitative and quantitative results are presented. The performance of both the baseline and the SIMD projective transformation is compared in order to examine the effectiveness of the NEON unit. For this purpose we use two different CPUs: a quad-core Cortex-A9 (Exynos 4412) clocked at 1.6 GHz, whose NEON unit uses 64-bit registers (which can be paired into 128-bit Q registers), and a Qualcomm MSM8992 Snapdragon 808, which consists of two CPU clusters (a quad-core 1.44 GHz Cortex-A53 and a dual-core 1.82 GHz Cortex-A57).

Figure 2: SIMD Lanczos interpolation dataflow diagram.
The input and output images are considered to have the same size, since we keep only the input image mapping coordinates which correspond to the central part of the output image. Consequently, if the output image is larger than the source image, only its central part (of the same dimensions) will appear on the display. Otherwise, the output image appears in the center of the screen.
Initially, the qualitative results of our implementation are examined. For this purpose, MATLAB is used to produce the algorithm's reference output, and our results are compared against it using MATLAB's built-in imshowpair() function. Figures 3 and 4 display two of these comparisons. In particular, Figures 3(a) and 4(a) display the initial image, Figures 3(b) and 4(b) the results for the transformation matrices [2, 0.3, 0; 0, 2, 0; 0, 0, 1] and [2, 0.3, 0; 0, 2, 0; 0, 0, 1] respectively, and Figures 3(c) and 4(c) the results for the transformation matrices [1.7, 0.1, 0.0006; 0.3, 1.7, 0.0001; 0, 0, 1] and [1.2, 0.2, 0.0002; 0.3, 1.3, 0.0001; 0, 0, 1] respectively (the gray area is the part common to our proposed system's and MATLAB's results).

Figure 3: Input image in subfigure (a) and two of our proposed system's results compared to MATLAB's corresponding results in subfigures (b) and (c), where the gray area is common to both compared systems' results.
Figure 4: Input image in subfigure (a) and our proposed system's results compared to MATLAB's corresponding results in subfigures (b) and (c) (source: http://car-from-uk.com/carphotos/full/1359406159545035.jpg).

In order to produce the performance results, three different image sizes are used for each platform. Furthermore, three different cases are examined, based on the parallelization technique. The first case is the baseline algorithm without any parallelization at all. The second case is the NEON algorithm, and the third case is the NEON+multithreading (NEON+mt) algorithm, in which all CPU cores are utilized. For each case, three different examples of transformation matrices are examined: a simple scale matrix, an affine matrix and a projective matrix. The measured execution times for each case are presented in Tables 1 and 2.
Using the projective matrices, the processing times for the Small, Medium and Large images are 247.188 ms, 495.519 ms and 1878.818 ms respectively for the baseline implementation, and 111.603 ms, 223.969 ms and 801.489 ms respectively for the NEON implementation. There is only a small deviation in the processing times for the affine and scale matrices.
Table 1: Execution times for baseline, NEON and NEON+multithreading cases for each image size and transformation matrix (Cortex-A9).
Projective
image size baseline neon neon+mt
427x640 247.19 ms 111.6 ms 30.03 ms
640x853 495.52 ms 223.97 ms 61.44 ms
1067x1867 1878.82 ms 801.49 ms 217.73 ms
Affine
image size baseline neon neon+mt
427x640 246.88 ms 108.79 ms 29.97 ms
640x853 491.52 ms 222.15 ms 59.54 ms
1067x1867 1801.91 ms 818.83 ms 211.48 ms
Scale
image size baseline neon neon+mt
427x640 243.37 ms 107.99 ms 28.32 ms
640x853 486.24 ms 215.49 ms 57.33 ms
1067x1867 1779.05 ms 781.94 ms 207.17 ms
Table 2: Execution times for baseline, NEON and NEON+multithreading cases for each image size and transformation matrix (Cortex-A53 & Cortex-A57).
Projective
image size baseline neon neon+mt
533x533 168.95 ms 43.83 ms 17.09 ms
1032x1653 993.12 ms 277.55 ms 102.59 ms
1920x3413 3959.02 ms 1035.4 ms 455.32 ms
Affine
image size baseline neon neon+mt
533x533 161.17 ms 42.12 ms 16.71 ms
1032x1653 965.71 ms 242.37 ms 95.49 ms
1920x3413 3798.38 ms 888.65 ms 420.18 ms
Scale
image size baseline neon neon+mt
533x533 156.22 ms 36.86 ms 15.53 ms
1032x1653 952.38 ms 219.38 ms 88.57 ms
1920x3413 3639.23 ms 825.89 ms 361.18 ms
Figures 5 and 6 display the speed gain of the NEON implementation. The speedup factor achieved on the Cortex-A9 CPU ranges from 2.2 to 2.34 and, if all four NEON units (one per CPU core) are utilized, the speedup factor increases to 8.06-8.63. Meanwhile, the speed gain achieved on the Cortex-A53 + Cortex-A57 CPUs is significantly higher. In particular, it ranges from 3.58 to 4.41 for the plain NEON implementation and increases to 8.7-10.75 for the NEON+mt case.
The contribution of SIMD Lanczos interpolation to the above results is significant. On the Cortex-A9 CPU, it offers an average speedup factor of 2.46, while SIMD projective mapping is only 1.38 times faster compared to baseline projective mapping. Moreover, on the Cortex-A53 + Cortex-A57 CPU, the respective speedups are much larger: SIMD Lanczos interpolation is 4.72 times faster than baseline Lanczos interpolation, while SIMD projective mapping reaches a speedup factor of 1.97.

Figure 5: Speed gain diagram of baseline, NEON and NEON+multithreading cases for each image size and transformation matrix (Cortex-A9).

Figure 6: Speed gain diagram of baseline, NEON and NEON+multithreading cases for each image size and transformation matrix (LG G4).
Figure 7 shows the profiling of the NEON+mt case on the Cortex-A9 CPU using ARM Streamline Community Edition. According to the results, combining SIMD and multithreaded programming leads to optimal utilization of system resources during the main processing.
4 CONCLUSIONS
In this paper, a SIMD implementation of projective transformation is presented. Despite its non-linearity, a significant speedup is achieved as a result of the effective use of SIMD instructions. Combining SIMD with multithreaded programming offered the best possible speed gain, which in many cases exceeded 800%. Future work includes testing on modern ARM architectures and further acceleration in order to reach real-time performance for high-definition video applications.

Figure 7: NEON+mt implementation profiling.
ACKNOWLEDGEMENTS
This work was partially supported by a graduate grant from IRIDA labs (http://www.iridalabs.gr/). The authors would like to thank Dr. Christos Theocharatos and Dr. Nikos Fragoulis for their support and fruitful discussions.
REFERENCES
Antao, S. and Sousa, L. (2010). Exploiting SIMD extensions for linear image processing with OpenCL. In International Conference on Computer Design (ICCD). IEEE.
Burger, W. and Burge, M. J. (2009). Principles of Digital Image Processing: Core Algorithms. Springer.
Mazza, J., Patru, D., Saber, E., Roylance, G., and Larson, B. (2014). A comparison of hardware/software techniques in the speedup of color image processing algorithms. In Image and Signal Processing Workshop (WNYISPW). IEEE.
Mitra, G., Johnston, B., Rendell, A. P., McCreath, E., and Zhou, J. (2013). Use of SIMD vector operations to accelerate application code performance on low-powered ARM and Intel platforms. In 27th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW). IEEE.
Perera, C., Shapiro, D., Parri, J., Bolic, M., and Groza, V. (2011). Accelerating image processing in Flash using SIMD standard operations. In The Third International Conferences on Advances in Multimedia (MMEDIA).
Welch, E., Patru, D., Saber, E., and Bengtson, K. (2012). A study of the use of SIMD instructions for two image processing algorithms. In Image Processing Workshop (WNYIPW).