Analysis of GPGPU Programs for Data-race and Barrier Divergence

Santonu Sarkar

, Prateek Kandelwal

, Soumyadip Bandyopadhyay

and Holger Giese

ABB Corporate Research, India

MathWorks, India

Hasso Plattner Institute f

ur Digital Engineering gGmbH, Germany

Keywords:

Veriﬁcation, SMT Solver, CUDA, GPGPU, Data Races, Barrier Divergence.

Abstract:

Todays business and scientiﬁc applications have a high computing demand due to the increasing data size

and the demand for responsiveness. Many such applications have a high degree of parallelism and GPGPUs

emerge as a ﬁt candidate for the demand. GPGPUs can offer an extremely high degree of data parallelism

owing to its architecture that has many computing cores. However, unless the programs written to exploit the

architecture are correct, the potential gain in performance cannot be achieved. In this paper, we focus on the

two important properties of the programs written for GPGPUs, namely i) the data-race conditions and ii) the

barrier divergence. We present a technique to identify the existence of these properties in a CUDA program

using a static property veriﬁcation method. The proposed approach can be utilized in tandem with normal

application development process to help the programmer to remove the bugs that can have an impact on the

performance and improve the safety of a CUDA program.

1 INTRODUCTION

With the order of magnitude increase in computing

demand in the business and scientiﬁc applications, de-

velopers are more and more inclined towards massi-

vely data-parallel computing platform like a general-

purpose graphics processing unit (GPGPU) (Nickolls

and Dally, 2010). GPGPUs are programmable co-

processors where designers can develop applications

in CUDA if they are using NVIDIA GPGPU. CUDA

(Compute Uniﬁed Device Architecture) is an exten-

sion to standard ANSI C through additional key-

words. A CUDA program follows a Single-Program

Multiple- Data (SPMD) model, where a piece of code,

called kernel, is executed by hundreds of threads de-

pending on the device’s computing capability, in a

lock-step fashion (where all the participating thre-

ads execute the same instruction). The developer can

bundle these threads into a set of thread-blocks, where

a set of threads in a block can share their data and

synchronize their actions through a built-in barrier

function called __syncthreads().

While sharing of data among threads is essential

for any reasonably complex parallel application, it can

lead the program to a unsafe state where the behavior

of the program is unpredictable and incorrect. The

safety property of a program, which informally me-

ans that the program will never end-up in an errone-

ous state, or will never stop functioning in an arbitrary

manner, is a well-known and critical property that an

operational system should exhibit (Lamport, 1977).

A parallel program can reach an unsafe state when

multiple threads enter into a race condition, conﬂict

with each other for data access enters into a dead-

lock situation and so on. Such a situation never arises

in a sequential program. Thus, unlike a sequential

software, detecting a violation of the safety property

and identifying its cause in a parallel program is far

more challenging. Petri net based veriﬁcation of pa-

rallel programs has been reported in (Bandyopadhyay

et al., 2017; Bandyopadhyay et al., 2018), however, it

checks only the computational equiavlence.

One of the well-known techniques for detecting

such an unsafe operation in a program is the pro-

perty veriﬁcation method. This approach models po-

tential causes for a program safety violation such as

data-races, or deadlock as properties of the program.

The veriﬁcation program analyzes these properties,

extracted from the code, for their satisﬁability. While

this technique has been proposed for a long time and

used in the context of sequential programs, recently

it has gained renewed interest with the resurgence of

GPGPUs and its programming model. Analysis of

a parallel program differs from the analysis of a se-

460

Sarkar, S., Kandelwal, P., Bandyopadhyay, S. and Giese, H.

Analysis of GPGPU Programs for Data-race and Barrier Divergence.

DOI: 10.5220/0006834904600471

In Proceedings of the 13th International Conference on Software Technologies (ICSOFT 2018), pages 460-471

ISBN: 978-989-758-320-9

quential program because of the possible ordering of

instructions, can be many (unless restricted otherwise

with synchronization calls or atomic access) when

these instructions are executed in parallel by multiple

threads. The analysis of such a program requires the

methodology to consider all possible inter-leavings of

the instructions where there is no speciﬁc ordering

imposed by the program. This has a signiﬁcant im-

pact on the number of permutations that need to be

analyzed. This problem can be severe in the case of

GPU programs, as the number of threads can be in

the order of thousands, with absolutely no synchro-

nization available between two threads belonging to

separate blocks.

We propose a static veriﬁcation tool that can ana-

lyze a CUDA program and verify all possible data-

races spanning all the barrier intervals, which may

occur when the program is executed on a GPU. The

major contributions of this paper are as follows:

1. Existing literature such as (Li and Gopala-

krishnan, 2010; Lv et al., 2011) detects a sin-

gle data race or a set of races between two bar-

rier functions. This, in turn, requires the de-

velopers to repeatedly run and ﬁx the data race

conditions, which can be quite cumbersome and

time-consuming when the program is large, invol-

ving many barriers. Thus, our approach generates

more comprehensive results in one run which can

substantially improve the program repairing time.

2. Our method can detect whether the barriers are

divergent.

3. We have provided a formal analysis of our data

race and barrier divergence detection algorithms

and proved that our approach is sound, i.e., it does

not have any false positive results.

The paper has been organized as follows. In

Section 3, we provide an overview of the approach.

Next, we elaborate the program translation mecha-

nism in Section 4. In Section 5 we elaborate the veriﬁ-

cation approach. We report the experimental study of

our tool in Section 6. Finally, we conclude our paper.

2 RELATED WORK

Unlike a sequential program, a parallel program exe-

cution can lead to several new safety and correctness

issues (Kirk and Hwu, 2016; Lin et al., 2015). A more

recent article (Ernstsson et al., 2017) further substan-

tiates several safety issues in GPU programming.

An early approach by Boyer et al. (Boyer et al.,

2008) proposed a dynamic analysis of a CUDA pro-

gram. The CUDA kernel is instrumented to gene-

rate runtime traces of memory accesses by different

threads, which is analyzed to discover various bugs

like data races and bank conﬂicts.In comparison, our

approach analyzes the symbolic constraints associa-

ted with the potentially conﬂicting access, along with

checking for divergent barriers, which are not handled

in the former case.

PUG tool (Li, 2010) is a symbolic veriﬁer based

approach to detect data race conﬂicts, barrier diver-

gence, and bank conﬂicts. PUGpara (Li and Gopala-

krishnan, 2012) is an extension of the PUG tool which

can also check for functional correctness and equiva-

lence of CUDA kernels. PUGpara models a single

parameterized thread and uses symbolic analysis for

veriﬁcation of data race and equivalence checking. A

limitation of the tool is that the tool does not detect all

the races in a program, and exits on the ﬁrst encounter

of a race. Also, the barrier divergence detection me-

chanism in presence of branches, checks if the num-

ber of barriers executed along the path to be the same

in the absence of branch divergence, which is not cor-

rect as per the CUDA speciﬁcation, where for each

barrier synchronization call, it is expected that either

all the threads within the block should execute it, or

none at all.

Performance degradation analysis of GPUs con-

cerning bank conﬂicts and coalesced global memory

has been modeled as an unsatisﬁability problem in

(Lv et al., 2011). The approach produces a counter-

example where threads with consecutive thread-ids

may not access the consecutive memory locations.

Here factors like data-type and the nature of access

play an essential role (CUD, 2017). While the ap-

proach reported in the paper is novel, it is indeed

not robust in considering all possible scenarios of un-

coalesced access or complex bank conﬂict scenarios.

A combination of static and dynamic analysis has

been proposed by (Zheng et al., 2011; Zheng et al.,

2014) where, initially a static analysis is used to re-

duce the number of statements to be instrumented for

checking data races. Next, the access patterns of the

instrumented statements are collected and analyzed

using dynamic analysis.

An approach presented in (Said et al., 2011) ge-

nerates concrete thread schedules that can determi-

nistically trigger a particular data race. This appro-

ach also leverages the satisﬁability modulo theory

(SMT) to generate the schedule, modeled as con-

straints. The approach, however, has been develo-

ped for task-parallelism in Java involving fork/join se-

mantics.

Recently a synchronous delayed visibility se-

mantics based program veriﬁcation has been propo-

sed(Betts et al., 2015) to detect data races and barrier

Analysis of GPGPU Programs for Data-race and Barrier Divergence

461

divergence. Unlike this approach, we are using the

notion of the predicated form of a thread in our ap-

proach while translating the program to assist in the

generation of symbolic constraints.

Symbolic execution-based approach for detecting

data races in OpenCL code has been explored in (Col-

lingbourne et al., 2012) which records each memory

access and then use this information to detect the data-

race. Naturally, approach is computationally more ex-

pensive than an efﬁcient SMT based solver like ours.

Formal Semantics for heterogeneous CUDA C is

presented in (Hathhorn et al., 2012) with an interpre-

ter built upon the approach. With the help of the in-

terpreter based execution, the developers can identify

speciﬁc instances of race conditions and deadlocks

due to barrier divergence.

The test ampliﬁcation based approach is presented

in (Leung et al., 2012), where it is shown that in cer-

tain conditions, the results obtained by a single test

execution of a kernel can be ampliﬁed to prove the

correctness properties like data-race freedom for the

kernel in general during static analysis.

A concolic testing based technique is presented

in GKLEE (Li et al., 2012b; Li et al., 2012a) which

looks into both correctness and various performance

issues of GPU programs.

Researchers have used ﬁnite state model-based

analysis to detect occurrences of conﬂicts among ato-

mic synchronization commands in (Chiang et al.,

2013) which can help in generating alternate delay ba-

sed scheduling of GPU programs.

The concept of barrier invariants is presented

in (Chong et al., 2013) to enable veriﬁcation for data

dependent kernels, specially when the data are in the

shared memory of a GPU. However, the approach

does not directly address the data race situation.

In (Collingbourne et al., 2013), a lock-step seman-

tics for GPU kernels has been proposed to deal with

unstructured GPU kernels.

Permission-based separation logic presented

in (Huisman and Mihel

c, 2013; Blom and Huisman,

2014), has been used for veriﬁcation of OpenCL

programs. Here, the program is annotated by read

and write permissions, where at most one thread can

hold write permission to a location, and multiple

threads can hold read permission simultaneously.

Bardsley et al.(Bardsley and Donaldson, 2014)

presented a method to analyze atomics and warp s sy-

nchronization for GPU kernels without explicit bar-

riers using a two-pass and a re-sync approach. The

approach has been proposed to analyze the cases of

inter and intra-warp execution.

The problem of optimal barrier in SPMD and

multi-threaded programs has been studied in (St

ohr

s h a r e d i n t A[ 1 0 2 4 ] , i n t B [ 1 0 2 4 ] ,

i n t C [ 1 0 2 4 ] ;

d e v i c e v o i d d a t a r a c e ( ) {

i n t i , t i d ;

t i d = t h r e a d I d x . x ; / / IR 1 .

A[ t i d % 2]= 1 0 0 ; / / IR 2 .

B[ t i d ]= A[ t i d ] + 5 ; / / IR 3 .

i f ( ( t i d % 2 ==0) {

s y n c t h r e a d s ( ) ; / / IR 4 .

B[ t i d ]= A[ t i d ] + 1 0 ; / / IR 5 .

}

A[ t i d +1 ] = 1 0 ; / / IR 6 , IR 7

s y n c t h r e a d s ( ) ;

B[ t i d %3] = A[ t i d ] ∗B[ t i d ] ;

f o r ( i = 0 ; i <100; i ++) { / / IR 8

A[ i % t i d ] = i + 10 ; / / IR 9

}

Figure 1: A Schematic Diagram of Our tool.

Figure 2: GUI of the tool.

and O’Boyle, 1997; Oboyle et al., 1995; Darte and

Schreiber, 2005; Dhok et al., 2015), where it has been

shown that the problem of identifying the optimal pla-

cement of a barrier is tractable ensuring that the pro-

blem can be solved for practical purposes.

ICSOFT 2018 - 13th International Conference on Software Technologies

462

3 PROPOSED APPROACH

Here we translate a CUDA kernel program into a set

of well-formed logical formulae (WFF) in Static Sin-

gle Assignment (SSA) form and then perform con-

straint solving using an SMT solver to check for cor-

rectness properties as a part of the symbolic execu-

tion. Our approach is based on symbolic analysis,

where once a CUDA kernel program is translated into

a set of logical well-formed formulas, we use them

as constraints for a generic thread. We then use the

concept of two threaded execution and generate ap-

propriate additional constraints that should hold if a

given operation has a data race. These constraints are

then fed to the Z3 SMT solver

to see if it can ﬁnd

a satisfying assignment to the constraints, if yes, then

the tool returns the assignments as a witness for the

data race condition, else reports the inability in ve-

rifying the data race condition. Our approach aims

to detect all the pairs of statements that can lead to

data race and all the instances of the barriers where

the barrier divergence is possible. The upper part of

Figure 1 shows a sample CUDA code which we use

as a running example. The lower part of the ﬁgure

is a schematic diagram of our approach comprising

of two modules, the translation, and the veriﬁcation

module. The translation module uses CLANGs (Latt-

ner and Adve, 2004) Abstract Syntax Tree (AST) to

convert the CUDA kernel source ﬁle to a set of SMT

equations. The Veriﬁcation module takes the transla-

ted set and creates the SMT equations encoded with

the properties that need to be veriﬁed. The satisﬁa-

bility for these equations is checked using the SMT

Solver, where a witness is returned if the equations

are satisﬁable and imply the presence of data races

or barrier divergences. All such violations of correct-

ness properties are collected and reported back to the

user along with their location in the source code. Fi-

gure 2 shows the GUI of the tool that highlights the

statements having potential data races and barrier di-

vergence.

Example 1. Let us consider the sample CUDA ker-

nel shown in Figure 1, that uses three arrays shared

by multiple threads. Since each thread (with unique

thread id threadIdx.x) accesses one element of the ar-

ray A, threads with even thread-ids 0, 2, and 4 access

the same shared memory location, resulting in a con-

ﬂict in line with the comment “IR 2”. Similarly, there

is a chance of a read-write conﬂict between lines IR 2

and IR 3.

The code has a barrier call in line 13 which ensu-

res that all the threads are synchronized at the barrier

before proceeding for the next statement. We can see

https://github.com/Z3Prover/z3

that there can be a read-write conﬂict between two

threads in line IR 5 and IR 6.

There is a barrier call in line IR 4 inside the if-

condition that creates a barrier divergence since all

threads do not reach the barrier since the barrier

function is inside the if condition. In the paper, we

will illustrate only IR 5 and 6.

4 SOURCE PROGRAM TO WELL

FORMED FORMULAE

Our translation module can handle most of the

CUDA-C grammar necessary for writing kernels.

In the current implementation, we are not hand-

ling general function calls besides the call to

syncthreads(), nor pointers and assignments.

4.1 Translation Rules

The translation module shown in Figure 1, uses

CLang infrastructure to deﬁne user-deﬁned semantic

actions for each AST node. We have deﬁned speciﬁc

semantic actions that convert each source code state-

ment of a CUDA kernel into a predicated execution

form; formally represented as predicate → action.

Here, both the predicate and the action are well-

formed formulae generated using the translation ru-

les. Before converting a statement into a WFF, the

translation module converts each variable into its sta-

tic single assignment (SSA) form. Hereinafter, we re-

fer to the SSA form of a variable as the “augmented

variable”.

Assignments. For an assignment statement, we cre-

ate an action as (== augmented variable lvalue) and

return to the parent AST node. If there are multiple

declaration and assignments such as int x = 10, y

= 10;, the translator will create action expression for

each of them.

Important Unary and Binary Operators. The

tool supports arithmetic and logical operators such as

the not, “!”, the bitwise inversion “¬”, unary minus

“-”, incr and decr (both postﬁx and preﬁx), “++” and

“–” operators. The parser also allows the usual bi-

nary arithmetic and logical operators. In this context,

we highlight a special case of the assignment opera-

tor, “=”. Here the translation process ensures that the

SSA index of the lvalue of the assignment expression

is incremented.

Analysis of GPGPU Programs for Data-race and Barrier Divergence

463

Array Expression. Consider the expression A[tid

% 2] in line 4, in Example 1 of the CUDA pro-

gram. For this array variable, the translation mecha-

nism needs to consider the variable and the index ex-

pression that points to a speciﬁc element in the array.

When multiple threads access different elements of

an array simultaneously, there can be a possibility of

access conﬂict. The translator module creates a well-

formed formula combining the array variable (in SSA

form) and the index expression to detect such a poten-

tial conﬂict. Here we make use of McCarthy’s array

access (select-store) axiom (McCarthy, 1962; Manna,

1974) to construct the formula. We explain this later

towards the end of the section.

Conditional Statements. Conditional statements

of a CUDA program are tricky. Here, the translation

of an if and if-else statement to a WFF must consider

the fact that multiple threads of a GPGPU can take

different paths during the execution of the CUDA ker-

nel. Since we are using SSA representation, the SSA

indices might have different values in the then and

the else blocks. This is a classic compiler translation

problem, which is handled by creating a φ statement

which acts as a multiplexer to select the appropriate

augmented variable. Furthermore, a conditional sta-

tement will have a predicated execution, where the

predicate is created from the conditional expression.

The idea is illustrated below. Note that SSA indices

of the variable y change inside the if block.

CUDA program

fragment

i n t x = 1 0 ;

i n t y = 4 ;

i f ( x > 9 ) {

y = 6 ;

}

Predicated form

(true) =⇒ x1 = 10

(true) =⇒ y1 = 4

(x1 > 9) =⇒ y2 = 6

!(x1 > 9) =⇒ y2 = y1

For Loop. A CUDA kernel can have a for-loop,

which implies that each thread will execute the for-

loop. The translator model allows a simpliﬁed for

loop construct with an initialization expression, the

guard condition and the increment operation, of the

form “var = var op k”, where k is a constant value

or an expression that does not change over iterations,

and op is one of the +,-, <<, >> operations. We

construct predicates to model an arbitrary and valid

iteration of the loop out of the three expressions and

use it for the statements inside the loop body.

CUDA program

fragment

i n t y = 0 ;

f o r ( i n t i = 0;

i < 100; i ++) {

y = i + 9; }

Predicated form

(true) =⇒ y0 = 0

(true) =⇒ i1 = 0

(true) =⇒ i2 = (i1 + i2k × 1)

∧(i2k ≥ 0)

(i2 < 100) =⇒ y1 = i2 + 9

In this particular illustration, ‘i2’ represents the

value of the index ‘i’ at an arbitrary iteration of the

loop. Note that the value of ‘i2’ depends on a free

variable ‘i2k’, which models an arbitrary kZ

itera-

tion. Since the iteration is modeled using a free vari-

able, the WFF allows modeling of conﬂict scenarios

where different threads execute different iterations of

the loop.

4.2 Translated Well Formed Formulae

(WFF)

The program statements are converted into a set of

SMT compliant well formed formulae, which we can

express in Backus Naur Form as:

expr := num | var | (unaryop expr) | (binop expr1 expr2)

| (store var expr1 expr2) | (select var expr)

Here var is a variable, num ∈ Z, unaryop∈ {!,˜}

and binaryop ∈ {=,+,-,*,/,%,&&,||,ˆ,&,|}.

The select operation is used to read a value at an

index i from array variable. The expression (select

A i) corresponds to the value stored in the array A at

position i.

The store operation is used to express a writing a

value v in an array at a particular index i. For exam-

ple, A[i] = v is converted to an expression (store

A i v).

Example 2. Consider the code snippet shown in Fi-

gure 1. The translator module will create a set of

WFFs. For brevity, we show the translated WFF for

the lines annotated with IR 4, 5, and 6.

4) (’==’, (’%’,’tid4132775177_1’,’2’), ’0’)

=> BAR

5) (’==’, (’%’,’tid4132775177_1’,’2’), ’0’)

=> (’==’, ’B1306833491_2’, (’store’,

’B1306833491_1’, ’tid4132775177_1’,

(’+’, (’select’, ’A1271100794_1’,

’tid4132775177_1’), ’10’)))

6) (’==’, (’%’,’tid4132775177_1’,’1’), ’0’)

=>(’==’, ’A1271100794_2’,(’store’,

’A1271100794_1’,(’+’, ’tid4132775177_1’,

’1’), ’10’))

Now consider the body of the for-loop (IR 8 and 9).

The translator module will create a predicated form

for the loop and the loop body.

ICSOFT 2018 - 13th International Conference on Software Technologies

464

8) True => (’and’, (’>=’,’i2473756790_2_k’, 0),

(’==’,’i2473756790_2’,(’+’,

’i2473756790_1’,(’*’,’i2473756790_2_k’,

’1’))))

9) (’and’, ’True’,(’<’,’i2473756790_2’,

’100’))=> (’==’, ’A1271100794_3’,

(’store’,’A1271100794_2’,(’%’,

’i2473756790_2’,’tid4293592209_1’),

(+ i2473756790_2, ’10’)))

As discussed earlier, since i2 represents a valid value

of the loop index variable ‘i’ at an arbitrary loop ite-

ration, the generated WFF considers all the possible

cases across the array. Thus, the generated WFF#8

means that ‘i2’ can take any value as long as there

exists a valid ‘i2k’, and in conjunction with WFF#9,

the value of ’‘i2k’ is bounded such that we deal with

only valid iterations.

5 VERIFICATION METHOD

We have decoupled the veriﬁcation module from the

translator module to provide greater ﬂexibility and ex-

tensibility to the veriﬁcation module. For instance,

one can add multiple correctness checking procedures

working on the same formulas with the same or diffe-

rent degree of granularity if such a need arises. As a

concrete example, we intend to introduce intra-warp

bank conﬂicts detection module as a future extension

of our current tool.

The basic workﬂow of the veriﬁcation process in-

volves taking translated statements from the transla-

tion module as input, along with the GlobalVariable-

Map. The next step is to generate statements corre-

sponding to a single, generic threaded execution. A

two threaded execution model is then created using

this generic threaded statement. Here we would like

to highlight a crucial distinction between our appro-

ach from the general symbolic execution approach. In

our approach, we are translating the program to ge-

nerate a set of constraints, and these constraints are

solved for veriﬁcation of a correctness property. Like

a symbolic execution approach, these constraints are

generated by treating input as symbolic values; howe-

ver, unlike a symbolic execution, we do not deﬁne any

operational semantics as far as execution of the state-

ment is concerned. This makes our approach more

lightweight that a full-ﬂedged operational semantics

driven approach. In our approach, we iterate over

each pair of possibly conﬂicting statements and test

it for the correctness by considering all the necessary

constraints on which the statements depend.

We now deﬁne three properties that our veriﬁca-

tion method is based on.

Property 1. If a SMT solver generates a model M

for a Q of WFFs then M |= Q. This property implies

the following. Let P be a program, and P is translated

into the set M of well-formed formula (WFF). This set

of WFF represents the model (of the program) given

to the SMT solver.

Let Q be the set of properties that we want to ve-

rify. In our case, the set Q contains both the data race

and barrier divergent property which is in the form of

the set of WFF. This property means that if M satis-

ﬁes Q, then the program contains both data race and

barrier divergence.

Property 2. A kernel is free of data races iff any two

arbitrary threads of execution of the kernel are free of

data races.

Property 3. A kernel is free of barrier divergence if,

for any two threads in the block, an arbitrary pair of

thread do not diverge.

The entire veriﬁcation approach follows 2-

threaded execution model, where to verify a correct-

ness property, the veriﬁer module considers a pair of

arbitrary threads. This approach is valid because the

correctness properties like data races and barrier di-

vergence are pair-wise properties, that is, in a given

set of threads, unless all the pairs of threads can be

proved to behave correctly, the correctness property

of the entire program cannot be proved. Therefore,

if one wants to verify the program for the absence of

data race and barrier divergence, considering an ar-

bitrary pair of distinct threads and checking if they

violate the correctness, sufﬁces the purpose.

5.1 Data Race Detection

The veriﬁcation module handles the veriﬁcation pro-

cess of different correctness properties. As mentioned

earlier, this process takes a set of translated statements

and the GlobalVariableMap as input from the trans-

lator module. The veriﬁcation process ﬁrst converts

each translated statement into a threaded form where

each variable which is not global or shared, is repla-

ced with a thread local variable. This ensures that du-

ring the two threaded analysis, the local variables do

not affect other variables due to their respective con-

straints. This process is done by recursively visiting

each expression in the well-formed formula and con-

verting it to its corresponding threaded version. The

function that converts translate statements into a thre-

aded form is named as ThreadIFY() in this paper.

Once the threaded version is generated, the veri-

ﬁcation module creates thread speciﬁc copies for two

symbolically parameterized threads, t

and t

which

Analysis of GPGPU Programs for Data-race and Barrier Divergence

465

are used in veriﬁcation of correctness properties, dis-

cussed in the following subsections.

Condition for data race:A data race occurs when

two distinct threads access the same location such that

one of the accesses is a write operation and these two

accesses are within the same barrier interval. It may

be noted that a barrier interval is the sequence of pro-

gram statements between two barrier synchronization

calls.

Barrier Synchronization: CUDA provides the bar-

rier synchronization as the only synchronization pri-

mitive. The underlying principle of a barrier synchro-

nization is that a thread upon encountering a barrier

call waits until all the threads in the thread block re-

ach the barrier, after which they resume their execu-

tion. Barriers ensure that all the changes done to the

shared memory space are registered. The reason why

access to the same location results in a data race is

that the CUDA does not provide any guarantee about

when the changes to a shared location will be reﬂected

within a barrier interval, also, since multiple threads

are accessing the same location, even if one guaran-

tees that the changes will be reﬂected immediately,

the access can follow a different order, which can re-

sult in an inconsistency in the value read by different

threads. Consider the code snippet shown in Exam-

ple 1. The translator module will create the following

set of WFF. There are auxiliary data structures that are

created in this stage too, which, for the sake of brevity,

are not described here. Here we show the translated

WFF for the lines annotated with IR 1 to IR 7.

Encoding Data race: Let us consider the two thre-

aded execution model where two threads t

and t

encounters a data race, in a pair of statements, and

the operations that conﬂict, involve conﬂicting access

such as A[tid] = 100, and y =A[tid+10] + y. Let

us assume without loss of generality that the ﬁrst

access is a write access, taking place in the thread t

and the second access is a read access taking place

in another thread t

, also, let us assume that the

variable y in the second access is a local variable,

hence the accesses to the array, A are the only con-

ﬂicting operations in this example. A data race will

occur if the location A + tid

and A + tid

+ 10 are

same, provided that tid

6= tid

. Hence, there con-

dition for checking data race roughly translates to,

tid

6= tid

∧ tid

== (tid

+ 10). Feeding this to the

constraint solver, we are likely to get a satisfying mo-

del where tid

= tid

+10. Since an operation is a part

of a statement, and each statement is in a predicated

form, we need to also make sure that:

a) the predicate evaluates to true for both of the

statements, otherwise a data race cannot take place,

b) since the operations in the current statement

might involve certain symbols that are deﬁned in pre-

ceding statements, we need to include them as well.

Let us represent the predicates deﬁned in the prece-

ding statements are P

and P

for the thread t

and t

re-

spectively. The veriﬁcation condition then checks for

the satisﬁability of the following equation (Φ) where,

Φ : (tid

6= tid

) ∧ P

∧ P

∧ (tid

== tid

+ 10). More

generally, assuming that the conﬂicting operations in-

volve accesses on indices index

and index

, the data

race condition becomes Φ : (tid

6= tid

) ∧ P

∧ P

∧

(index

== index

). If Φ has a satisfying assignment,

the data race can be conﬁrmed. However, if the SMT

solver is not able to ﬁnd any satisfying assignment,

it may be due to the solver time-out. In such a case,

one can only state that the veriﬁcation module was not

able to prove the existence of data race, rather than

saying that there is no data race. This is commonly

known as the soundness property of the algorithm,

which we prove in Section 5.3.2. The pseudo code for

data race detection algorithm is given in Algorithm 1.

Algorithm 1: DataraceDetection (Translated-statements).

1: raceing-statement=

2: conﬂicting-statements = {hs

, s

i | s

, s

∈ Translated-

statement ∧s

conﬂicts with s

}

3: thread1 = ThreadIFY(Translated-statements,1)

4: thread2 = ThreadIFY(Translated-statements,2)

5: for hs

, s

i ∈ conﬂicting-statements do

6: conﬂicting-op= {hop

, op

i | op

∈

theread1.s

, op

∈ theread2.s

and op

conﬂicts

with op

}

7: for hop

, op

i ∈ conﬂicting-op do

8: smt-eqn = Generate-SMT-EQN

(op

, op

,datarace)

9: if CHECK-SAT(sat-eqn)==SAT then

10: raceing-statement = raceing-statement ∪

hop

, op

11: end if

12: conﬂicting-op=conﬂicting-op \hop

, op

13: end for

14: conﬂicting-statements= conﬂicting-

statements\hs

, s

15: end for

16: return racing-statement

5.2 Barrier Divergence Detection

Barrier divergence occurs when for a barrier some

threads in a block execute the call, while others do

not. This can happen in cases when a barrier call is

executed only in some scenarios. For instance, when

a barrier is present within a conditional statement, the

condition might not hold true for some threads. CU-

DAs programming guide speciﬁes that the behavior in

such scenarios is not deﬁned. It is very likely that such

ICSOFT 2018 - 13th International Conference on Software Technologies

466

a call might result in a deadlock, and hence can im-

pact the correctness of the program. We need to verify

if it is possible that for a given barrier synchronization

call, one thread will execute it while others will not, if

yes, then we have a possible barrier divergence. Since

we are tracking the predicates for each statement se-

parately, drawing the terminology from the data race

section, this gives us the equation Ψ to be satisﬁed

where, Ψ : (tid

6= tid

)∧(P

⊕P

) This means that for

two different threads Ψ is true if either P

or P

is true.

If Ψ has a satisfying assignment, the current barrier

call is divergent, and the model represents the witness

to it. However, if there is no satisfying assignment,

as in the case for data race, here too, we cannot say

that the current barrier indeed is free of divergence.

We prove the soundness of the barrier divergence ap-

proach in Section 5.3.3. The pseudo code for barrier

divergence detection is given in Algorithm 2.

Algorithm 2: BarrierDivergence (Translated-statements).

1: diverging-barriers=

2: barrier-list = {s

| ∀s

∈ Translated-statement ∧s

syncthreads() }

3: thread

= ThreadIFY(Translated-statements, i)

4: thread

= ThreadIFY(Translated-statements, j)

5: barrier-pairs={hb

, b

i | b is barrier-list and b

is b in

thread

, b

is b in thread

}

6: for hb

, b

i ∈ barrier-pairs do

7: smt-eqn = Generate-SMT-EQN (b

, b

,barrier)

8: if CHECK-SAT(sat-eqn)==SAT then

9: diverging-barriers = diverging-barriers ∪

hop

, op

10: end if

11: barrier-pairs=barrier-pairs\hb

, b

12: end for

13: return diverging-barriers

Example 3. Consider the CUDA kernel shown in Fi-

gure 1, Example 1 and the set of WFF in Example 2.

These WFF represent the model of the program which

is veriﬁed against the data race and barrier diver-

gence conditions. Let us consider the data race de-

tection. The veriﬁcation module detects a data race

condition for various conﬂicting operations by crea-

ting two-threaded SMT equations. For instance, let

us consider the conﬂict between IR 5. And IR 6. The

main clause to encode the data race condition for the

two threaded version will have an extra predicate for

IR 6 (for the enclosing if condition), which needs to

be true as well. The clause will look like:

True AND (t_2__tid4132775177_1 % 1))==(0)

AND (t_1__tid4132775177_1 ==

(t_2__tid4132775177_1+1)

Reporting violations:

The generation of witnesses occur for that portion of

the code where the tool is able to detect a data ra-

ces or a barrier divergence. The veriﬁcation module

reports them along with the line number and column

number for the instances of variables (line 18 and 20)

as shown in Figure 1.

5.3 Formal Analysis of Our Approach

In this section, we discuss two important aspects of

our approach, namely the termination of our algo-

rithm and the soundness. We analyze these two as-

pects for both data-race as well as barrier divergence

algorithms.

5.3.1 Soundness of Translation

Theorem 1. If no satisfying assignment for the WFFs

and Φ

of the two threads i and j and the condition

linking to statements φ

i, j

exists, then no execution of

the two threads i and j leading to a situation fulﬁlling

the condition linking to statements φ

i, j

exists.

Proof. Assuming that an execution of the two thre-

ads i and j leading to a situation fulﬁlling the con-

dition linking to statements φ

i, j

exists, we will show

that then also a satisfying assignment for the well for-

med formulas Φ

and Φ

of the two threads i and j

and the condition linking to statements φ

i, j

exists via

induction over the length of the trace execution.

We consider an interleaved trace of the two thre-

ads i and j that leads to a ﬁnal variable assignment

fulﬁlling the condition linking to statements φ

i, j

bac-

kwards and show that for each postﬁx of the trace

the existence implies the existence of the related sa-

tisfying assignment for the well formed formulas Φ

and Φ

of postﬁxes of the two threads i and j related

to the postﬁx of the trace considered so far. This in-

duction terminates as for any invalid safety property a

ﬁnite trace invalidating it must exist.

Basis. For an empty statement for both thread any

property that is not unsatisﬁable can be fulﬁlled.

Induction Hypothesis. Let n be the length of trace

execution. Then ∀n, 1 ≤ i ≤ n, φ

i, j

≡ T

Induction Steps. The induction hypothesis holds for

length of trace execution and it contains all essen-

tial program constructors like assignment statement,

important unary and binary operations, array expres-

sion, conditional statement, for loop etc. Here we

show that the induction step holds for the assignment

and the for-loop for brevity. Similar proofs hold for

the rest of the program constructs.

Assignment. Let η

and η

be the the assignments

statements of thread i and thread j respectively.

The constructed well formed formulae Φ

and Φ

Analysis of GPGPU Programs for Data-race and Barrier Divergence

467

has been generated by the range of the induction

hypothesis, i.e., ∀n, 1 ≤ i, j ≤ n. For (n + 1)

iteration, the well formed formulae must be con-

structed as the induction hypothesis holds.

For-Loop. Let η

and η

be the the assignments sta-

tements of therad i and thread j respectively. Let

the loop has been unrolled n times and during un-

rolling the induction hypothesis holds. For unrol-

ling once more times, the assertion becomes true.

For (n + 1)

length the rest of program contracts are

also available. Hence the induction hypothesis is true

for (n + 1)

length.

We now discuss ﬁrst on the basis of this general

results the termination and soundness of the data race

detection algorithm.

5.3.2 Data Race: Soundness

Theorem 2. If the function DataraceDetection

(Algorithm 1) reports a data race for operation pair

hop

, op

i where op

be the operation i in l

state-

ment and op

be the operation j in the m

statement,

then there is a data race between l

and j

state-

ments.

Proof. Let the step 10 of the function

DataraceDetection (Algorithm 1) be true.

However, the algorithm reports that there is no

data race. Therefore, according to Theorem 1 the

welformed formulae Φ

and Φ

are not formed for

the threads i and j respectively. Therefore, step 9

is not satisﬁed by the condition and if step 9 is not

satisﬁed, then the control does not go to step 10.

Hence contradiction.

We now discuss the termination and soundness of

the barrier divergence detection algorithm.

5.3.3 Barrier-divergence: Soundness

Theorem 3. If the function BarrierDivergence

reports a divergent barrier at statement s

, then there

must be a barrier call at s

which is indeed divergent.

Proof. Assume that the function

BarrierDivergence reports no barrier diver-

gence. Let us also assume that the statement s

a barrier statement. As s

is a barrier statement,

step 9 of the function BarrierDivergence must

be satisﬁed. Therefore, according to Theorem 1

the welformed formulae Φ

and Φ

are not formed

for the threads i and j respectively. Therefore, step

8 is not satisﬁed by the condition and if step 8 is

not satisﬁed, then the control does not go to step 9.

Hence contradiction.

Table 1: Results for Data race and Barrier divergence free

detection times in seconds.

Benchmarks LOC #Kernels #Conjuncts DR BD DR Time (sec) BD Time (sec)

CG 100 12 124 NO NO 0.2314 0.1235

BCM 53 3 78 NO NO 0.0031 0.0014

MIN MAX 80 4 123 NO NO 0.0120 0.0064

LUP 136 18 167 NO NO 0.5195 0.3432

K-Means 142 34 231 NO NO 0.8342 0.4123

HC 132 26 143 NO NO 0.4167 0.2437

DCT 30 2 53 NO NO 0.0075 0.0042

LRU 190 23 272 NO NO 0.9198 0.6342

FFT 1002 47 1845 NO NO 1.83455 0.9943

UA 1234 172 1324 NO NO 2.3412 2.0034

5.4 Complexity Analysis

We discuss the complexity of the veriﬁcation algo-

rithms for data race detection and barrier divergence,

Algorithm 1 and Algorithm 2 respectively in a ri-

gorous way.

Algorithm 1. Step 1 takes O(1) time. Step 2 com-

putes pair of conﬂict statements and it takes O(n

)

time. Step 3 and 4 take O(1) times. Step 6 compu-

tes pairs of conﬂicting operators and if the number of

operators are |F|, then it takes O(|F|

) time. SMT

solver is called in Step 8 for generating SMT equa-

tion which takes O(2

|F|

) where F is the length of the

constraint. Only addition operation takes place in the

if-block (Steps 9-11) and takes O(1) time as we main-

tained a value indexed data structure. In Step 12, only

deletion operation takes place and takes O(1) time.

The loop in step 7 iterates |F| times. Hence over-

all complexity is O(|F|.2

|F|

) times. Similarly, in step

5, the loop iterates n times; the total complexity is

O(n.|F|.2

|F|

) ' O(2

|F|

Algorithm 2. Step 1 takes O(1) time. Step 2 com-

putes all the barriers for a program and it takes O(n)

time. Step 3 and 4 take O(1) times. Step 5 com-

putes pairs of barrier and it takes O(n

) time. SMT

solver is called in Step 7 for generating SMT equa-

tion which takes O(2

|F|

) where F is the length of the

constraint. Only addition operation takes place in the

if-block (Steps 8-10) and takes O(1) time as we main-

tained a value indexed data structure. In Step 11, only

deletion operation takes place and takes O(1) time.

The loop in step 6 iterates n times. Hence the overall

complexity is O(n.2

|F|

) ' O(2

|F|

) times.

6 EXPERIMENTAL RESULTS

The veriﬁcation algorithm is implemented in Python

and tested on several parallel benchmarks on a 2.0

GHz Intel(R) Core(TM)2 Duo CPU machine (using

only a single core). The set of benchmarks is availa-

ble in (Bondhugula et al., 2008; Che et al., 2009). The

steps for carrying out the experimentation are outlined

below.

ICSOFT 2018 - 13th International Conference on Software Technologies

468

1. Preparation of the Benchmarks Suite. The

source code and the documentation related to the

list of source programs are given in (Bondhugula

et al., 2008; Che et al., 2009). It is to be noted that

some of these examples such as K-Means, HC,

DCT, LRU, UA, and FFT are widely used in va-

rious application domains such as bioinformatics,

image processing, computational ﬂuid dynamics

(CFD), etc.

2. Program to WFF. Each of the above programs is

then translated into the well-formed formulae by

the translation module of the tool as described in

Section 4. Next, the set of well-formed formulae

is used by the veriﬁcation modules to detect data

race and barrier divergence conditions.

3. Reporting of Results. For each test case, we ob-

served the numbers of kernels, the number of con-

juncts in SMT equations and the number of data

race and barrier divergence are encountered. For

small examples like DCT and MINANDMAX,

the data race and barrier divergence are manually

veriﬁed for correctness.

6.1 Analysis

Table 1 represents the description of the parallel ben-

chmarks concerning the number of lines of code

(LOC), number of kernels and number of conjuncts.

It is to be noted that the number of conjuncts is al-

ways greater than the number of lines of codes be-

cause there exists at least one conjunct in each line

of code and at least one additional conjunct should

be present between any two lines of code. Table 1

also depicts that there is no data race and barrier di-

vergence for each benchmark. Last two columns of

Table 1 depicts the data race and barrier divergence

detection time. In all the cases, it is also worth no-

ting that detection of barrier divergence time is smal-

ler than the data race detection. If barrier divergence

occurs, then there is a possibility for data race. Ho-

wever, the reverse is not true.

Finally, we take some applications from the ben-

chmark suites and manually inject some errors at the

code level. The objective of this line of experimenta-

tion is to check the efﬁcacy of the veriﬁcation module

in detecting data race and barrier divergence respecti-

vely. We have introduced the following types of error.

Type 1. Removal of barrier where read after write

operation occurs between two threads; this has

been injected in the HC and DCT benchmarks.

Type 2. Removal of the barrier where a write after

a read operation occurs between two threads; this

has been injected in the LUP benchmark.

Table 2: Results for Data race and Barrier divergence de-

tection times.

Error Benchmarks #DR #BD DR Time (sec) BD Time (sec)

Type 1 HC 6 5 0.8314 0.5235

DCT 2 2 0.0234 0.0123

Type 2 LUP 5 3 0.6298 0.4231

Type 3 HC 6 5 0.7217 0.6355

Type 4 MINANDMAX 3 1 0.1237 0.0934

UA 5 4 0.3256 0.2651

Type 3. Removal of the barrier where write after

write operation occurs between two threads; this

has been injected in the HC benchmark.

Type 4. data-locality error which introduce false

data-locality in the conditional branch in

MINANDMAX and UA benchmarks.

Table 3: Characteristics of counter examples.

Error Benchmarks Counter Model size

# formulae # conjuncts

Type 1 HC 12 45

DCT 7 24

Type 2 LUP 5 16

Type 3 HC 12 41

Type 4 MINANDMAX 3 9

UA 24 89

Table 3 only considered the part where the errors

have been injected rather than the entire code. For

this reason, the number of formulae or the number of

conjuncts vary from Table 2.

6.2 Analysis of Erroneous Benchmarks

Table 2 depicts the descriptions of the errors introdu-

ced in the benchmarks, the number of data races and

barrier divergences and the execution times taken by

the data race detection module and by the barrier di-

vergence module; (in each cases, the data race and

barrier divergence have been detected by the modu-

les successfully;) The last two columns of the Table

2 record the positive data race and barrier divergence

detection time. It is to be noted that in all cases, the

actual data race and barrier divergence detection time

is higher than the data race free and barrier divergence

free detection time. If the veriﬁcation module detects

that there is a data race or barrier divergence, the tool

constructs the corresponding counter models and gi-

ves the counter model as the output. For this rea-

son, real data race and barrier divergence detection

time are much higher. Table 3 depicts the size of

the counter model concerning the number of formu-

lae and number of conjuncts.

Analysis of GPGPU Programs for Data-race and Barrier Divergence

469

6.3 Analysis of Comparative Study

We have compared the performance of our tool with

the online version of PUG as shown in Table 4. The

noteworthy fact is that our tool can identify a set of

barrier divergence and data races that span across

multiple barriers in a single iteration. Since PUG

stops after detecting barrier divergences and data ra-

ces present within the ﬁrst barrier it encounters, it re-

quires multiple runs (after correcting the already en-

countered problems) to discover additional barriers

spanning across multiple barriers. From the develo-

per productivity point of view, our approach certainly

helps to repair the program faster.

Table 4: Comparative results.

Error Benchmarks # Barrier # Data #Iterations

Divergence Races PUG Our Tool

Type 1 HC 6 5 2 1

DCT 2 2 1 1

Type 2 LUP 5 3 2 1

Type 3 HC 6 5 3 1

Type 4 MIN MAX 3 1 2 1

UA 5 4 3 1

7 CONCLUSION

In this paper, we have proposed an approach for sta-

tic veriﬁcation of GPU kernels programmed using

CUDA. Our approach veriﬁes whether the program

has the data race or has barrier divergence issues. Our

approach can identify all the violations of these across

the barriers for which we can generate witnesses in

the code. We have elaborated the implementation

of our approach and provided a formal proof of the

soundness of the algorithm. Thus, if our tool reports

a data race or a barrier divergence, these conditions

do exist in the code. Like other reported approaches,

our tool does not have a completeness property, i.e.,

it cannot report all possible data-races or barrier di-

vergences. The experimental results indicate that the

proposed tool has a minimal performance impact.

Producing a list of conﬂicts in one go has bene-

ﬁts for veriﬁcation tool developers, where the tool

can take advantage of the list of conﬂicts and barrier

divergences and suggest placement of barriers in the

code to remove the conﬂicts. This assistance mecha-

nism can help programmers to implement conﬂict free

CUDA programs more efﬁciently.

We are working to include veriﬁcation of additio-

nal properties such as shared memory bank conﬂicts

and global coalesced memory accesses as they can

signiﬁcantly affect the performance of the kernel and

thus impact the observed speedup. We also plan to

build this veriﬁer as an integral part of the design pro-

cess so that the designers can check for these safety

properties as they develop the code for GPGPU.

ACKNOWLEDGEMENTS

This work is ﬁnancially supported by Science and En-

gineering Research Board, Govt. of India funding

(SB/S3/EECE/0170/2014).

REFERENCES

(2017). Cuda c programming guide. Technical Report PG-

02829-001 v8.0, NVIDIA Corporation.

Bandyopadhyay, S., Sarkar, D., and Mandal, C. (2018).

Equivalence checking of petri net models of programs

using static and dynamic cut-points. Acta Informatica.

Bandyopadhyay, S., Sarkar, D., Mandal, C. A., Banerjee,

K., and Duddu, K. R. (2016). A path construction

algorithm for translation validation using PRES+ mo-

dels. Parallel Processing Letters, 26(2):1–25.

Bandyopadhyay, S., Sarkar, S., Sarkar, D., and Mandal,

C. A. (2017). Samatulyata: An efﬁcient path based

equivalence checking tool. In Automated Technology

for Veriﬁcation and Analysis - 15th International Sym-

posium, ATVA 2017, Pune, India, October 3-6, 2017,

Proceedings, pages 109–116.

Bardsley, E. and Donaldson, A. F. (2014). Warps and Ato-

mics: Beyond Barrier Synchronization in the Veriﬁca-

tion of GPU Kernels, pages 230–245. Springer Inter-

national Publishing, Cham.

Betts, A., Chong, N., Donaldson, A. F., Ketema, J., Qa-

deer, S., Thomson, P., and Wickerson, J. (2015). The

design and implementation of a veriﬁcation technique

for gpu kernels. ACM Trans. Program. Lang. Syst.,

37(3):10:1–10:49.

Blom, S. and Huisman, M. (2014). The VerCors Tool for

Veriﬁcation of Concurrent Programs, pages 127–131.

Springer International Publishing.

Bondhugula, U., Hartono, A., Ramanujam, J., and Sadayap-

pan, P. (2008). Pluto: A practical and fully automatic

polyhedral program optimization system. In PLDI 08.

Boyer, M., Skadron, K., and Weimer, W. (2008). Automa-

ted Dynamic Analysis of CUDA Programs. In Third

Workshop on Software Tools for MultiCore Systems.

Che, S., Boyer, M., et al. (2009). Rodinia: A Benchmark

Suite for Heterogeneous Computing. In IEEE Intl.

Symp. on Workload Characterization (IISWC).

Chiang, W.-F., Gopalakrishnan, G., Li, G., and Rakamari

Z. (2013). Formal Analysis of GPU Programs with

Atomics via Conﬂict-Directed Delay-Bounding, pages

213–228. Springer Berlin Heidelberg, Berlin, Heidel-

berg.

Chong, N., Donaldson, A. F., Kelly, P. H., Ketema, J., and

Qadeer, S. (2013). Barrier invariants: A shared state

ICSOFT 2018 - 13th International Conference on Software Technologies

470

abstraction for the analysis of data-dependent gpu ker-

nels. OOPSLA ’13, pages 605–622, New York, NY,

USA. ACM.

Collingbourne, P., Cadar, C., and Kelly, P. H. J. (2012).

Symbolic testing of opencl code. In Proceedings of

the 7th International Haifa Veriﬁcation Conference

on Hardware and Software: Veriﬁcation and Tes-

ting, HVC’11, pages 203–218, Berlin, Heidelberg.

Springer-Verlag.

Collingbourne, P., Donaldson, A. F., Ketema, J., and Qa-

deer, S. (2013). Interleaving and Lock-Step Semantics

for Analysis and Veriﬁcation of GPU Kernels, pages

270–289. Springer Berlin Heidelberg, Berlin, Heidel-

berg.

Darte, A. and Schreiber, R. (2005). A linear-time algorithm

for optimal barrier placement. In Proceedings of the

Tenth ACM SIGPLAN Symposium on Principles and

Practice of Parallel Programming, PPoPP ’05, pages

26–35, New York, NY, USA. ACM.

Dhok, M., Mudduluru, R., and Ramanathan, M. K. (2015).

Pegasus: Automatic barrier inference for stable mul-

tithreaded systems. In Proceedings of the 2015 In-

ternational Symposium on Software Testing and Ana-

lysis, ISSTA 2015, pages 153–164, New York, NY,

USA. ACM.

Ernstsson, A., Li, L., and Kessler, C. (2017). SkePU 2:

Flexible and type-safe skeleton programming for he-

terogeneous parallel systems. International Journal of

Parallel Programming.

Hathhorn, C., Becchi, M., Harrison, W. L., and Procter,

A. M. (2012). Formal semantics of heterogeneous

CUDA-C: A modular approach with applications. In

Proceedings Seventh Conference on Systems Software

Veriﬁcation, SSV 2012, Sydney, Australia, 28-30 No-

vember 2012., pages 115–124.

Huisman, M. and Mihel

c, M. (2013). Speciﬁcation and

veriﬁcation of GPGPU programs using permission-

based separation logic. In 8th Workshop on Bytecode

Semantics, Veriﬁcation, Analysis and Transformation

(BYTECODE 2013).

Kirk, D. and Hwu, W.-M. W. (2016). Programming Mas-

sively Parallel Processors: A Hands-on Approach.

Morgan Kaufmann, 3 edition.

Lamport, L. (1977). Proving the correctness of multprocess

programs. IEEE Trans. Software Engineering, 3(2).

Lattner, C. and Adve, V. (2004). LLVM: A Compilation

Framework for Lifelong Program Analysis & Trans-

formation. In Proc. Intl. Symp. on Code generation

and optimization.

Leung, A., Gupta, M., Agarwal, Y., Gupta, R., Jhala, R.,

and Lerner, S. (2012). Verifying gpu kernels by test

ampliﬁcation. In Proceedings of the 33rd ACM SIG-

PLAN Conference on Programming Language Design

and Implementation, pages 383–394.

Li, G. (2010). Formal veriﬁcation of programs and their

transformations. PhD thesis.

Li, G. and Gopalakrishnan, G. (2010). Scalable smt-based

veriﬁcation of gpu kernel functions. In Proceedings

of the Eighteenth ACM SIGSOFT International Sym-

posium on Foundations of Software Engineering (FSE

’10), pages 187–196.

Li, G. and Gopalakrishnan, G. (2012). Parameterized veri-

ﬁcation of gpu kernel programs. In Proceedings of the

2012 IEEE 26th International Parallel and Distribu-

ted Processing Symposium Workshops & PhD Forum,

IPDPSW ’12, pages 2450–2459.

Li, G., Li, P., Sawaya, G., Gopalakrishnan, G., Ghosh, I.,

and Rajan, S. P. (2012a). GKLEE: Concolic Veriﬁca-

tion and Test Generation for GPUs. In Proceedings

of the 17th ACM SIGPLAN Symposium on Principles

and Practice of Parallel Programming, PPoPP ’12,

pages 215–224.

Li, P., Li, G., and Gopalakrishnan, G. (2012b). Parametric

ﬂows: Automated behavior equivalencing for symbo-

lic analysis of races in cuda programs. In Procee-

dings of the International Conference on High Perfor-

mance Computing, Networking, Storage and Analysis,

SC ’12, pages 29:1–29:10.

Lin, Z., Gao, X., Wan, H., and Jiang, B. (2015). GLES: A

Practical GPGPU Optimizing Compiler Using Data

Sharing and Thread Coarsening, pages 36–50. Sprin-

ger, Cham.

Lv, J., Li, G., Humphrey, A., and Gopalakrishnan, G.

(2011). Performance Degradation Analysis of GPU

Kernels. In CAV EC

Manna, Z. (1974). Mathematical Theory of Computation.

McGraw-Hill Kogakusha, Tokyo.

McCarthy, J. (1962). Towards a Mathematical Science of

Computation. In IFIP Congress, pages 21–28.

Nickolls, J. and Dally, W. J. (2010). The GPU computing

era. IEEE Micro, 30(2).

Oboyle, M., Kervella, L., and Bodin, F. (1995). Synchroni-

zation minimization in a spmd execution model. Jour-

nal of Parallel and Distributed Computing, 29(2):196

– 210.

Said, M., Wang, C., Yang, Z., and Sakallah, K. (2011). Ge-

nerating data race witnesses by an smt-based analysis.

In Proceedings of the Third International Conference

on NASA Formal Methods, NFM’11, pages 313–327.

Springer-Verlag.

ohr, E. A. and O’Boyle, M. F. P. (1997). A graph based

approach to barrier synchronisation minimisation. In

Proceedings of the 11th International Conference on

Supercomputing, ICS ’97, pages 156–163.

Zheng, M., Ravi, V. T., Qin, F., and Agrawal, G. (2011).

GRace: A Low-overhead Mechanism for Detecting

Data Races in GPU Programs. In Proceedings of the

16th ACM Symposium on Principles and Practice of

Parallel Programming, PPoPP ’11, pages 135–146.

Zheng, M., Ravi, V. T., Qin, F., and Agrawal, G. (2014).

Gmrace: Detecting data races in gpu programs via a

low-overhead scheme. IEEE Transactions on Parallel

and Distributed Systems, 25(1):104–115.

Analysis of GPGPU Programs for Data-race and Barrier Divergence

471