Schfuzz: Detecting Concurrency Bugs with Feedback-Guided Fuzzing
Hiromasa Ito², Yutaka Matsubara² and Hiroaki Takada¹,²ᵃ
¹ Institute of Innovation for Future Society, Nagoya University, Aichi, Japan
² Graduate School of Informatics, Nagoya University, Aichi, Japan
ᵃ https://orcid.org/0000-0003-3544-2397
Keywords: Fuzzing, Concurrency Testing, Concurrency Bug Detection, Feedback-Guided Fuzzing, Memory-Access-Guided Fuzzing.
Abstract: It is challenging to detect concurrency bugs with fuzzing, for two main reasons. First, manifesting them by exploring the input space is inefficient because they only occur under specific interleavings. Second, re-running a program with an input that detected a bug in a fuzzing campaign does not necessarily reproduce the bug because typical runtimes do not schedule threads deterministically. This research proposes Schfuzz, a novel approach for detecting concurrency bugs with feedback-guided fuzzing. This approach executes programs under test deterministically based on test cases generated by fuzzers. In addition, it feeds back dynamic memory-access orders to aid fuzzers in detecting concurrency bugs more efficiently and effectively. We evaluate Schfuzz with a hand-made motivating example and four benchmark programs from SCTBench (Thomson et al., 2016). The results show that it can detect concurrency bugs more efficiently and effectively than traditional feedback-guided fuzzing.
1 INTRODUCTION
System developers have to design concurrent systems
because it is challenging to increase the density of
semiconductor integration further in the post-Moore
era. Although concurrent systems enable us to use computational resources effectively, developers often suffer from concurrency bugs: errors in concurrent systems that manifest when the systems schedule threads in particular interleavings. We need an effective method to detect concurrency bugs because they are more challenging to detect than sequential ones.
Although fuzzing is one of the most successful
automated software testing methods, detecting con-
currency bugs with fuzzing is not so straightforward.
There are two main reasons for this. First, exploring
input space is inefficient for hunting concurrency bugs
because most inputs may run under the same inter-
leaving. Second, bugs detected in fuzzing campaigns
are hard to reproduce because typical runtimes sched-
ule threads non-deterministically.
This research proposes Schfuzz, a novel method
for detecting concurrency bugs with feedback-guided
fuzzing. This approach has two key points. First, it executes programs under test deterministically based on test cases generated by fuzzers. Second, it feeds back dynamic memory-access orders to aid fuzzers in detecting concurrency bugs more efficiently and effectively.
The contributions of this research are as follows:
- We propose Schfuzz, a novel approach to detecting concurrency bugs with feedback-guided fuzzing.
- We implement prototypes of Schfuzz and statistically compare it with traditional fuzzing methods based on quantitative metrics retrieved from the experimental results.
2 BACKGROUND AND
MOTIVATION
2.1 Concurrency Bug
A concurrency bug (e.g., data race, deadlock, and
atomicity violation) is one of the errors that can oc-
cur in concurrent systems. If a system has latent con-
currency bugs, the bugs can occur when the system
schedules threads in particular interleavings. The oc-
curred bug can propagate and finally fail the system.
Even if a system has latent concurrency bugs, it
is not easy to manifest them artificially. In order to
manifest them, the system must run under specific in-
terleavings. The number of such interleavings is tiny
relative to the size of the interleaving space. In ad-
dition, the input which manifested the bug does not
necessarily reproduce the bug because most runtimes
do not schedule threads deterministically. Therefore,
there is a need for a novel method that can 1) effi-
ciently explore interleaving space, 2) effectively de-
tect concurrency bugs, and 3) reproduce detected bugs
without extra effort.
2.2 Fuzzing
Fuzzing is an automated software testing method. Gen-
erally, it repeats the following three steps:
1. A test case generator (fuzzer) generates a test case
and feeds it to a program under test (PUT).
2. The PUT runs with the test case.
3. The fuzzer checks the PUT’s liveness (e.g., status
code).
Although the concept of fuzzing proposed by Miller et al. (1990) is simple, it detected many bugs in UNIX utility programs. In recent years, many
researchers have studied how to detect various bugs
more efficiently and effectively with fuzzing (Bohme
et al., 2017; Rawat et al., 2017; Stephens et al., 2016;
Yun et al., 2018).
Feedback-guided fuzzing feeds back runtime in-
formation in the fuzzing loop. A feedback-guided
fuzzer evaluates a test case using the runtime infor-
mation gathered when the PUT ran with the test case.
It stores highly-rated test cases as seeds and uses them
to generate beneficial descendant test cases. For ex-
ample, a code-coverage-guided fuzzer uses test cases
with improved code coverage in a fuzzing campaign
as seeds and generates new ones by mutating the
seeds. American Fuzzy Lop¹ (AFL) and libFuzzer² are typical code-coverage-guided fuzzers that have detected numerous bugs and vulnerabilities in real-world programs.
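As a minimal sketch of this loop, the snippet below keeps a test case as a seed only when it covers something new; the toy cover and mutate functions stand in for running the PUT and for AFL-style mutations and are illustrative assumptions, not part of AFL or libFuzzer.

import random

def cover(data: bytes) -> frozenset:
    # Hypothetical stand-in for "run the PUT and collect covered branches";
    # here, coverage is derived directly from the input bytes.
    return frozenset(b >> 4 for b in data)

def mutate(data: bytes) -> bytes:
    # Flip one random bit, as a stand-in for AFL-style havoc mutations.
    if not data:
        return bytes([random.randrange(256)])
    i = random.randrange(len(data))
    return data[:i] + bytes([data[i] ^ (1 << random.randrange(8))]) + data[i + 1:]

seeds = [b"\x00"]      # initial seed corpus
seen = set()           # coverage observed so far in the campaign

for _ in range(1000):
    test_case = mutate(random.choice(seeds))
    cov = cover(test_case)
    if not cov <= seen:            # did the test case cover anything new?
        seen |= cov
        seeds.append(test_case)    # keep it as a seed for further mutation

Real coverage-guided fuzzers add many refinements (seed energy, trimming, persistent execution), but this keep-if-new-coverage decision is the core of the strategy.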
Although existing fuzzing methods and various
fuzzers have detected many sequential bugs, detecting
concurrency bugs is not straightforward. First, explor-
ing input space is inefficient for detecting concurrency
bugs. Although varying inputs explores the interleaving space indirectly, since PUTs may run under different interleavings for different inputs, this is less efficient than di-
rect exploration. Second, even if an input in a fuzzing
campaign manifests a bug, the input does not neces-
sarily reproduce the bug due to the non-determinism
of thread scheduling. Therefore, there is a need for
a novel approach to detecting concurrency bugs with
fuzzing.
¹ https://lcamtuf.coredump.cx/afl/
² https://llvm.org/docs/LibFuzzer.html
void
main_task(intptr_t exinf)
{
    while (1) {
        char c; // character input
        if (!read_data(&c, sizeof(char))) {
            continue;
        }
        if ((c & 0b00001111) == 0b1111) {
            consume_time(1000);

            // acquire lock
            loc_cpu();
            lock = 1;
            unl_cpu();

            // do nothing

            // release lock
            loc_cpu();
            lock = 0;
            unl_cpu();

            consume_time(1000);
        } else {
            consume_time(1000);
        }
    }
}
Figure 1: Main task of the motivating example.
2.3 Motivating Example
Figures 1 and 2 show the source code of an application program for TOPPERS/ASP3³, an open-source real-time operating system that runs on GR-PEACH⁴, an RZ/A1H SoC development board. The application consists of a task and an interrupt service routine (ISR).
Figure 1 shows the code for the main task of the
example. It consists of a single infinite loop. In the
loop, it repeatedly reads input, and if its last four bits
are all one, it acquires and releases the lock repre-
sented by the variable lock.
Figure 2 shows the code for the ISR of the exam-
ple. It checks the state of the lock, and if its value is
one, it increments the value of the global variable shm.
Because the access to shm is not synchronized, other
threads (i.e., ISR instances) can overwrite its value be-
tween its increment and its reload for assertion check-
ing. In other words, a data race can occur because the ISR is not reentrant.
In order to manifest the bug, threads must inter-
leave to satisfy the following two conditions:
³ https://toppers.jp/asp3-kernel.html
⁴ https://www.renesas.com/us/en/products/gadget-renesas/boards/gr-peach
void
intno1_isr(intptr_t exinf)
{
    intno1_clear();

    unsigned int lock_state = 0;

    // check the lock state
    iloc_cpu();
    lock_state = lock;
    iunl_cpu();

    // iff. the task has the lock, increment shm.
    if (lock_state) {
        unsigned int buf = 0;
        buf = shm;
        shm++;
        assert(shm == (buf + 1));
    }
}
Figure 2: Interrupt service routine (ISR) of the motivating
example.
1. Firing the first interrupt when the value of lock is
one.
2. Firing the second interrupt between incrementing
shm and reloading it for assertion checking.
The number of such interleavings is minimal relative
to the size of the interleaving space. In addition, since
input rarely contributes to interleaving space explo-
ration in this example, it is hard to satisfy condition
2 with traditional fuzzing methods. In order to efficiently detect the latent bug in this example, fuzzers should also explore the interleaving space.
3 SCHFUZZ
This research proposes Schfuzz, a novel method
to detect concurrency bugs with feedback-guided
fuzzing. The method has two critical points:
- It executes PUTs deterministically based on test cases generated by fuzzers.
- It feeds back dynamic memory-access orders to fuzzers.
Figure 3 shows an overview of Schfuzz. A sched-
uler receives a test case generated by a fuzzer and
interprets it as a series of schedules. A runtime en-
vironment (e.g., a system emulator) executes a PUT
under the schedule, records orders of shared-memory
access, and feeds them back to the fuzzer.
Algorithm 1 shows how schedulers and runtime
environments execute PUTs deterministically based
on test cases generated by fuzzers. If a bug occurs
with a test case, running the PUT with the test case
can reproduce the bug. The algorithm repeats until
it exhausts the data in a test case (lines 2-9). A schedule consists of addr, maxHitCount, and maxInstCount, and a scheduler retrieves it from the test case (line
4). A runtime environment executes a PUT under the
schedule and pauses the execution once the program counter has matched the value of addr as many times as the value of maxHitCount (line 6). The value
of maxInstCount is the upper bound of the number
of instructions executed before pausing. After that,
the runtime environment saves the current thread con-
text (line 7) and gets the next available thread (line
8). Note that the function transformToValidAddr val-
idates the value of addr because it is likely to be un-
reachable if addr is allowed to take any value. If addr
takes an invalid value, transformToValidAddr trans-
forms its value into a valid one (line 5). Testers must
determine the valid value range because it depends on
PUTs.
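As a concrete illustration, the following is a minimal sketch of one possible transformToValidAddr in Python; the modulo-based mapping and the 4-byte instruction alignment are our assumptions, since the approach only requires that the result fall within the tester-supplied valid range.

def transform_to_valid_addr(addr: int, text_start: int, text_end: int,
                            align: int = 4) -> int:
    # Map an arbitrary value taken from the test case onto an aligned
    # address inside the tester-supplied range [text_start, text_end).
    slots = (text_end - text_start) // align
    return text_start + (addr % slots) * align

# Example: fold the raw value 0xdeadbeef into a hypothetical code range.
print(hex(transform_to_valid_addr(0xdeadbeef, 0x18000000, 0x18004000)))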
Schfuzz feeds back shared-memory access or-
ders to fuzzers for detecting concurrency bugs ef-
fectively and efficiently. Typical feedback-guided
fuzzing feeds back basic-block level paths and prefers
test cases that improve code coverage. However,
this strategy is not suited for detecting concurrency
bugs. There are two main reasons for this. First, a
context switch can occur in concurrent programs in
the middle of a basic block. Because of this fea-
ture, static-basic-block level paths imprecisely rep-
resent the PUT's executions. Likewise, the number of dynamic-basic-block level paths increases exponentially with the number of instructions and threads. Second, a concurrency bug occurs when threads access shared memory in particular orders. Generally, im-
proving code coverage helps to detect latent concur-
rency bugs. Meanwhile, even if a test case improves
the code coverage of a fuzzing campaign, it does
not contribute to concurrency bug detection if the
newly covered code does not access shared memory.
Therefore, Schfuzz feeds back access orders to shared
memory and prefers test cases that access the memory
in an order not yet observed in a fuzzing campaign.
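One way such access-order feedback could be encoded for an AFL-style evaluator is sketched below: each pair of consecutive shared-memory accesses is hashed into a 64 KiB bitmap, in the spirit of AFL's edge hashing. The concrete encoding used by the prototypes is not described here, so this layout is an assumption.

MAP_SIZE = 1 << 16            # size of an AFL-style feedback bitmap
bitmap = bytearray(MAP_SIZE)

def fill_bitmap(access_log):
    # access_log is a list of (thread_id, address, is_write) tuples recorded
    # by the runtime environment. Hashing consecutive pairs means that the
    # same accesses in a new order light up new bitmap entries.
    prev = 0
    for event in access_log:
        cur = hash(event) & (MAP_SIZE - 1)
        idx = (cur ^ prev) & (MAP_SIZE - 1)
        bitmap[idx] = min(bitmap[idx] + 1, 255)
        prev = cur >> 1

# Two ISR-like accesses to the same shared address, read before write.
fill_bitmap([(1, 0x20000000, False), (2, 0x20000000, True)])

A test case is then preferred as a seed whenever it sets a bitmap entry that no earlier test case in the campaign has set.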
For the evaluation described in section 4, we have
implemented two prototypes of Schfuzz. They use
American Fuzzy Lop (ver. 2.52b) as a feedback-
guided fuzzer and Unicorn (ver. 1.0.3) and Qiling
(ver. 1.4.3) as runtime environments whose execution can be finely controlled based on schedules. How to control executions and gather runtime information depends on the runtime environment. In order to run fuzzing for our moti-
vating example, we have implemented the first proto-
(Diagram: a feedback-guided fuzzer with mutator, evaluator, and seed pool generates test cases; a scheduler interprets each test case as schedules consisting of addr, maxHitCount, and maxInstCount; the runtime environment executes the PUT and returns access logs to shared memory.)
Figure 3: Overview of Schfuzz. For prototyping, we use American Fuzzy Lop (AFL) as a feedback-guided fuzzer and Unicorn
and Qiling as runtime environments for PUTs. For the first prototype, we have implemented the scheduler in the code of the
test driver. On the other hand, for the second prototype, we have integrated the scheduler into Qiling’s code.
Input: testCase, mainThread
1: currentThread ← mainThread
2: while testCase is not EOF do
3:     ctx ← loadContext(currentThread)
4:     addr, maxHitCount, maxInstCount ← getSchedule(testCase)
5:     validAddr ← transformToValidAddr(addr)
6:     newCtx ← run(ctx, validAddr, maxHitCount, maxInstCount)
7:     saveContext(newCtx)
8:     currentThread ← getNextThread()
9: end while
Algorithm 1: Deterministic execution based on a test case.
type with Unicorn⁵, a multi-architecture CPU emula-
tion framework. We use its APIs to implement Algo-
rithm 1 because they enable us to control and execute
PUTs at the instruction level. In addition, they en-
able us to hook various runtime events (e.g., memory
access). The prototype records runtime information
with the APIs. Although the first prototype enables us
to run fuzzing for the motivating example, we should
apply Schfuzz to a wider variety of PUTs for evaluation.
Many real-world concurrent programs run on the Lin-
ux/pthread environment, but it is not straightforward
to run Linux/pthread software on vanilla Unicorn be-
cause it is just a CPU emulator (i.e., an instruction-set
simulator). Hence we have implemented the second
prototype with Qiling⁶, a binary emulation framework
that uses Unicorn as its backend. Although we cannot control a PUT's execution via Qiling as finely as via Unicorn, we modified it to enable test-case-based deterministic execution. The second prototype records runtime information with Qiling's APIs, just as the first prototype does with Unicorn's.
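The fragment below is a rough sketch, under simplifying assumptions, of how one schedule could be executed with Unicorn's Python bindings in the spirit of the first prototype; the mapped address ranges, the monitored shared-variable addresses, and the schedule values are illustrative only, and the actual prototype code is not reproduced here.

from unicorn import (Uc, UC_ARCH_ARM, UC_MODE_ARM, UC_HOOK_CODE,
                     UC_HOOK_MEM_READ, UC_HOOK_MEM_WRITE)

uc = Uc(UC_ARCH_ARM, UC_MODE_ARM)
uc.mem_map(0x18000000, 0x10000)   # application code (example address range)
uc.mem_map(0x20000000, 0x1000)    # RAM holding the shared variables
# (a real harness would first load the PUT image with uc.mem_write)

access_log = []                   # shared-memory access order fed back to the fuzzer
state = {"hits": 0}

def on_code(uc, address, size, sched):
    # Pause once the program counter has hit sched["addr"] maxHitCount times.
    if address == sched["addr"]:
        state["hits"] += 1
        if state["hits"] >= sched["max_hit_count"]:
            uc.emu_stop()

def on_mem(uc, access, address, size, value, user_data):
    # Record accesses to the monitored shared variables (addresses assumed).
    if address in (0x20000000, 0x20000004):   # e.g., lock and shm
        access_log.append((access, address))

sched = {"addr": 0x18000010, "max_hit_count": 1, "max_inst_count": 0xffff}
uc.hook_add(UC_HOOK_CODE, on_code, sched)
uc.hook_add(UC_HOOK_MEM_READ | UC_HOOK_MEM_WRITE, on_mem)
# Run the current thread until the pause condition or the instruction budget.
uc.emu_start(0x18000000, 0x18010000, count=sched["max_inst_count"])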
⁵ https://github.com/unicorn-engine/unicorn
⁶ https://github.com/qilingframework/qiling
4 EVALUATION
4.1 Experimental Setup
We experimentally evaluate Schfuzz. It is clear that
Schfuzz has better reproducibility of bugs than exist-
ing fuzzing methods because it deterministically exe-
cutes PUTs based on test cases generated by fuzzers.
In the experiment, we evaluate the effectiveness of us-
ing access orders to shared memory as runtime feed-
back. We run fuzzing campaigns many times for five
PUTs using three different kinds of runtime feedback and discuss the effectiveness of Schfuzz based on quantitative metrics of the results. All of the experiments run on Ubuntu 18.04 (Linux 5.4.0) on an Intel(R) Core(TM) i9-9960X CPU @ 3.10 GHz with 32 GiB of RAM.
Research Question. There are two research ques-
tions in the experiment.
RQ1. Does Schfuzz improve the efficiency of
detecting concurrency bugs compared to ex-
isting fuzzing methods?
RQ2. Does Schfuzz improve the effective-
ness of detecting concurrency bugs compared
to existing fuzzing methods?
We measure efficiency by the time and the number of test cases generated before detecting the bugs, and effectiveness by the bug-detection rate.
PUT. In the experiment, we use five concurrent programs as PUTs. One is the motivating example shown
in section 2.3, and we run 100 10-minute fuzzing
campaigns for it with each method described in the
paragraph below. The others (reorder 10 bad, re-
order 20 bad, safestack, and twostage 100 bad) are
benchmark programs in SCTBench (Thomson et al.,
2016), a set of buggy concurrent software run on the
Linux/pthread environment. Thomson et al. (2016)
have evaluated various methods to detect concurrency
bugs with SCTBench. In their evaluation, fewer methods detected the bugs in these four programs than in the other programs. In other words, they are the four programs in SCTBench whose bugs are the most challenging to detect. They all have
assertion checks, and their checks can fail when they
run under specific interleavings. We run 30 6-hour
fuzzing campaigns for them with each method.
Method. We compare three methods, each of which uses the following runtime information:
- Shared-memory access orders (Schfuzz).
- Basic-block level paths (code-coverage-guided fuzzing).
- None (dumb fuzzing).
As shown in Algorithm 1, a test case is interpreted
as a series of schedules. However, when the PUT is
the motivating example, it is interpreted as a series of
inputs and schedules because we must give specific
inputs to manifest the bug. Table 1 shows the size of
each element that makes up a schedule and an input.
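For illustration, a test case for the pthread PUTs could be decoded into a series of schedules as sketched below; the field sizes follow Table 1, whereas the little-endian byte order and the fixed-size record packing are our assumptions.

import struct
from typing import Iterator, NamedTuple

class Schedule(NamedTuple):
    addr: int            # 4 bytes (Table 1)
    max_hit_count: int   # 1 byte
    max_inst_count: int  # 2 bytes

RECORD = "<IBH"                          # little-endian packing is an assumption
RECORD_SIZE = struct.calcsize(RECORD)    # 7 bytes per schedule

def iter_schedules(test_case: bytes) -> Iterator[Schedule]:
    # Consume the fuzzer-generated byte stream as fixed-size schedule records;
    # trailing bytes that do not form a complete record are ignored.
    for off in range(0, len(test_case) - RECORD_SIZE + 1, RECORD_SIZE):
        yield Schedule(*struct.unpack_from(RECORD, test_case, off))

for s in iter_schedules(bytes(range(21))):   # three 7-byte example records
    print(hex(s.addr), s.max_hit_count, s.max_inst_count)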
Limitations of the Prototypes. There are two lim-
itations of the prototypes.
1. A tester must provide shared memory addresses
to be monitored.
2. A tester must provide a range for the function
transformToValidAddr in Algorithm 1.
(Box plot; y-axis: time to detect the bugs [s]; runtime feedback: bb vs. shm.)
Figure 4: Time distributions to detect the bug in the moti-
vating example.
Table 2 shows the shared variables to be monitored
in the experiments for each PUT. The range of trans-
formToValidAddr in the experiments is the address range of each PUT's application code. Note that we modi-
fied the source code of safestack because of limitation
1. Since safestack dynamically allocates the memory
to be monitored and the prototypes cannot monitor
memory whose address is determined at runtime, we
modified the code to allocate it statically.
4.2 Result
Table 3 shows each method’s time to detect the la-
tent bugs in each PUT. In the fuzzing campaigns for
the motivating example, Schfuzz detected the bug on
average 2.36 times faster than code-coverage-guided
fuzzing. Figure 4 shows the box plot of the time dis-
tributions to detect the bug in the motivating example.
The p-value calculated by the Brunner-Munzel test (Brunner and Munzel, 2000) for these distributions is less than 2.2e-16, which indicates a significant difference between them.
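For reference, such a comparison can be computed with SciPy's implementation of the Brunner-Munzel test; the sample values below are placeholders for illustration only, not the measured detection times from Table 3.

from scipy.stats import brunnermunzel

# Placeholder time-to-detection samples [s]; the real data are the per-campaign
# detection times summarized in Table 3 (93 samples for shm, 33 for bb).
shm_times = [10, 85, 108, 176, 240, 390, 593]
bb_times = [171, 320, 400, 441, 470, 510, 588]

stat, p = brunnermunzel(shm_times, bb_times)
print(f"Brunner-Munzel statistic = {stat:.3f}, p-value = {p:.3g}")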
Table 4 shows the number of test cases each
method generates to detect the latent bugs in each
PUT. Note that the values are estimated by integrating
the number of test cases executed per second. In the
fuzzing campaigns for the motivating example, Sch-
fuzz detected the bug in 2.94 times fewer test cases on
average than code-coverage-guided fuzzing. Figure
5 shows the box plot of the distributions of the esti-
mated number of test cases generated to detect the bug
in the motivating example. The p-value calculated by
the Brunner-Munzel test (Brunner and Munzel, 2000)
for these distributions is 6.6e-16, which indicates a significant difference between them.
Table 5 shows the bug detection rate of each
method in each PUT. Schfuzz has a higher bug de-
tection rate than the other methods for the motivating
Table 1: Size of each element that makes up a schedule and an input.
                     Size [byte]
PUT                  numberOfInput  input          addr  maxHitCount  maxInstCount
motivating example   1              numberOfInput  4     1            2
reorder 10 bad       0              0              4     1            2
reorder 20 bad       0              0              4     1            2
safestack            0              0              4     1            2
twostage 100 bad     0              0              4     1            2
Table 2: Shared variables to be monitored in the experi-
ments for each PUT.
PUT Vars to be monitored
motivating example lock, shm
reorder 10 bad a, b
reorder 20 bad a, b
safestack stack_items, stack
twostage 100 bad data1Value, data2Value
(Box plot; y-axis: estimated number of test cases; runtime feedback: bb vs. shm.)
Figure 5: Distributions of the estimated number of test cases
generated to detect the bug in the motivating example.
example, reorder 10 bad, and reorder 20 bad. On the other hand, like the other methods, it could not detect the bugs in safestack and twostage 100 bad. Code-coverage-guided fuzzing detected only the bug in the motivating example, in 33% of the fuzzing campaigns. Dumb fuzzing could not detect any bugs in the PUTs.
4.3 Discussion
We discuss the following three topics based on the
results:
1. The bugs which Schfuzz could not detect.
2. The answers to the research questions.
3. The threats to validity.
Undetected Bugs. We should discuss the bugs
which Schfuzz could not detect. It could not detect
the bugs in safestack and twostage 100 bad with 6-
hour fuzzing campaigns. Although the bug in safestack is not a false positive, it is known to be challenging
to detect. No method could detect it in the evalua-
tion by Thomson et al. (2016). The most recent work
on concurrency bug detection (Wen et al., 2022) also
failed to detect the bug. twostage 100 bad concurrently runs 100 threads, which access the shared vari-
ables. A large number of threads has two adverse ef-
fects on Schfuzz. First, it slows down fuzzing speed.
In other words, PUTs run with fewer test cases. Sec-
ond, it makes detecting bugs more difficult because it
increases the size of interleaving space.
Answers to Research Questions. We should an-
swer the research questions based on the results.
A1. Yes. Schfuzz can detect concurrency
bugs more efficiently than the existing fuzzing
methods.
In the experiment, Schfuzz detected the bugs in the
three PUTs faster and with fewer test cases than the
other methods. In addition, there are significant dif-
ferences among the distributions of time and the num-
ber of test cases to detect the bugs.
A2. Yes. Schfuzz can detect concur-
rency bugs more effectively than the existing
fuzzing methods.
Schfuzz showed a higher bug-detection rate in the ex-
periment than the existing method for the three PUTs.
Threats to Validity. We should discuss the threats
to validity:
- Fuzzing has randomness, so it is impossible to reproduce exactly the same results. In order to reduce the effect of the randomness, we analyzed statistics over many fuzzing campaigns.
- The efficiency and effectiveness of Schfuzz de-
pend on the property and amount of knowledge
given by testers. For example, testers give knowl-
edge about shared-memory addresses to be mon-
itored and the range of the function transformTo-
Table 3: Statistics on time to detect the bugs. “Runtime info.” indicates the methods. “shm” denotes Schfuzz, “bb” denotes
code-coverage-guided fuzzing, and “none” denotes dumb fuzzing. “#samples” means the number of the fuzzing campaigns
that detected the bugs. In the columns of the statistical values, “NA” means no values.
PUT Runtime info. #samples Mean [s] S.D. [s] Median [s] Min [s] Max [s]
motivating example
shm 93 176.96 165.14 108 10 593
bb 33 418.28 127.31 441 171 588
none 0 NA NA NA NA NA
reorder 10 bad
shm 20 8546.10 5147.28 7554 1043 19593
bb 0 NA NA NA NA NA
none 0 NA NA NA NA NA
reorder 20 bad
shm 21 10240.33 5953.48 8574 1843 19050
bb 0 NA NA NA NA NA
none 0 NA NA NA NA NA
safestack
shm 0 NA NA NA NA NA
bb 0 NA NA NA NA NA
none 0 NA NA NA NA NA
twostage 100 bad
shm 0 NA NA NA NA NA
bb 0 NA NA NA NA NA
none 0 NA NA NA NA NA
Table 4: Statistics on the estimated number of test cases generated to detect the bugs. “Runtime info.” indicates the methods.
“shm” denotes Schfuzz, “bb” denotes code-coverage-guided fuzzing, and “none” denotes dumb fuzzing. “#samples” means
the number of the fuzzing campaigns that detected the bugs. In the columns of the statistical values, “NA” means no values.
PUT Runtime info. #samples Mean S.D. Median Min Max
motivating example
shm 93 30788.32 27817.70 17895 3250 133884
bb 33 61217.48 18446.95 63653 23165 88163
none 0 NA NA NA NA NA
reorder 10 bad
shm 20 10134.85 5949.11 9276 1196 22791
bb 0 NA NA NA NA NA
none 0 NA NA NA NA NA
reorder 20 bad
shm 21 6769.76 3721.73 6693 1189 12216
bb 0 NA NA NA NA NA
none 0 NA NA NA NA NA
safestack
shm 0 NA NA NA NA NA
bb 0 NA NA NA NA NA
none 0 NA NA NA NA NA
twostage 100 bad
shm 0 NA NA NA NA NA
bb 0 NA NA NA NA NA
none 0 NA NA NA NA NA
ValidAddr. The nature and amount of this knowledge would affect the results.
- Concurrency bugs must be observable to detect them with Schfuzz. Although we can observe the bugs in the experimental PUTs as assertion failures, for some concurrency bugs, such as deadlocks and starvation, it is hard to generalize how to observe them because their definitions depend on system requirements.
- Although we evaluated Schfuzz with the five
PUTs, more experiments with more realistic
PUTs would allow us to evaluate it more precisely.
5 RELATED WORK
5.1 Concurrency Bug Detection
Many researchers have published works to detect con-
currency bugs. According to Fu et al. (2018), we can
classify them into the following three categories:
- Random delay disturbance (Park et al., 2009; Chew and Lie, 2010).
- Thread scheduling/switching (Musuvathi et al., 2008; Musuvathi and Qadeer, 2007; Yu et al., 2012; Burckhardt et al., 2010).
- Fuzzing (Sen, 2008; Joshi et al., 2009; Cai et al., 2014).
Table 5: Detection rates of the bugs. “Runtime info.” in-
dicates the methods. “shm” denotes Schfuzz, “bb” denotes
code-coverage-guided fuzzing, and “none” denotes dumb
fuzzing.
PUT Runtime info. Rate [%]
motivating example
shm 93.0
bb 33.0
none 0.0
reorder 10 bad
shm 66.7
bb 0.0
none 0.0
reorder 20 bad
shm 70.0
bb 0.0
none 0.0
safestack
shm 0.0
bb 0.0
none 0.0
twostage 100 bad
shm 0.0
bb 0.0
none 0.0
Random delay disturbance (Park et al., 2009;
Chew and Lie, 2010) is a category of methods that
randomly insert delays in PUT’s execution. Delays
induce various interleavings and can manifest latent
concurrency bugs. Stress testing is a common prac-
tice in this category. Nevertheless, methods in this
category are considered inadequate because they tend
to execute PUTs under similar interleavings.
Thread scheduling/switching (Musuvathi et al.,
2008; Musuvathi and Qadeer, 2007; Yu et al., 2012;
Burckhardt et al., 2010) is a category of methods that
control thread scheduling to execute PUTs under var-
ious interleavings. Schfuzz is classified into this cate-
gory because it controls thread scheduling based on
test cases generated by fuzzers. This category has
two sub-categories, systematic and non-systematic.
Systematic methods (Musuvathi et al., 2008; Musu-
vathi and Qadeer, 2007) explore interleaving space
systematically (i.e., exhaustively). However, it is in-
feasible to explore interleaving spaces of large-scale
real-world programs because the size of the interleaving space increases exponentially with the number of threads and instructions in PUTs. Hence they often
have some budget limits (e.g., a time limit for explo-
ration) in practice. Although Thomson et al. (2016)
applied some systematic methods to the four PUTs we
used in our experiments to detect the bugs, no meth-
ods succeeded in detecting the bugs within a sched-
ule bound of 100,000. Non-systematic methods (Bur-
ckhardt et al., 2010) do not exhaustively explore in-
terleaving space. Schfuzz is non-systematic because
fuzzing has randomness and does not guarantee exhaustive exploration of the space. Although it is similar to random schedule generators, we can expect it to detect concurrency bugs more efficiently and effectively
than random schedulers when we target complicated
programs because it is feedback-guided. In the exper-
iment by Thomson et al. (2016), a random scheduler
failed to detect the bugs in the four PUTs we used in
our experiments.
Fuzzing (Sen, 2008; Joshi et al., 2009; Cai et al.,
2014) is a category of methods to manifest potential
bugs reported by other methods, such as data-race de-
tectors. Although Fu et al. (2018) named the category
“fuzzing,” note that its meaning differs from that used
in the context of bug detection in sequential programs.
The efficiency and effectiveness of the methods in this category at manifesting bugs depend on the methods used to report potential bugs.
Our research aims to apply feedback-guided
fuzzing to concurrency bug detection. Although it is
also an exciting research topic to compare Schfuzz to these methods with quantitative metrics, that is currently out of scope.
5.2 Fuzzing Concurrent Software
As mentioned in section 2.2, though most fuzzing
studies aim to detect sequential bugs, some works tar-
get concurrency bugs like Schfuzz:
- ConAFL (Liu et al., 2018).
- MUZZ (Chen et al., 2020).
- AutoInter-fuzzing (Ko et al., 2022).
- CONZZER (Jiang et al., 2022).
- ConFuzz (Vinesh and Sethumadhavan, 2020).
Table 6 summarizes a comparison of these studies with Schfuzz. Unfortunately, it is almost impossible to verify them and to fairly compare them to Schfuzz with quantitative metrics because their tools, except ConAFL⁷, are publicly unavailable now⁸. In addition, ConAFL targets programs that process some input and run on Linux/pthread, so our motivating example and the programs in SCTBench (Thomson et al., 2016) are out of its scope.
ConAFL (Liu et al., 2018) runs fuzzing campaigns for PUTs, which are instrumented to run under various interleavings and to induce potential concurrency vulnerabilities to manifest. There are two main differences between ConAFL and Schfuzz:
- ConAFL uses test cases generated by fuzzers only for input-space exploration and explores interleaving space by randomly varying threads’ priorities. On the other hand, Schfuzz uses them to explore interleaving space. It deterministically decides when to switch contexts based on test cases.
⁷ https://github.com/Lawliar/ConAFL
⁸ As of Feb. 06, 2023.
Table 6: Comparing Schfuzz with the existing fuzzing studies. A partial mark means partially yes because AutoInter-fuzzing controls thread scheduling when it runs fuzzing in its interleaving mode.
Property ConAFL MUZZ AutoInter-fuzzing CONZZER ConFuzz Schfuzz
Directly exploring interleaving space
Handling non-determinism in fuzzing
Reproducibility of detected bugs
- ConAFL is a typical code-coverage-guided
fuzzer. It prefers test cases that improve the code
coverage in a fuzzing campaign.
MUZZ (Chen et al., 2020) runs fuzzing campaigns
for PUTs, which are instrumented to run under vari-
ous interleavings and to record runtime information
about thread interleavings. There are three main dif-
ferences between MUZZ and Schfuzz:
- MUZZ does not exclude the non-determinism of
thread scheduling, and Chen et al. (2020) do not
mention the reproducibility of detected bugs.
- MUZZ induces various interleavings by randomly
varying threads’ priorities. This strategy corre-
sponds to random delay disturbance in the clas-
sification of Fu et al. (2018) and is commonly in-
efficient.
- Like ConAFL, MUZZ uses test cases to explore input space.
AutoInter-fuzzing (Ko et al., 2022) prefers test
cases with which PUTs access shared memory from
many dispersed threads. There are two main differ-
ences between AutoInter-fuzzing and Schfuzz:
- AutoInter-fuzzing does not control thread
scheduling when it runs fuzzing in its normal
mode, so its non-determinism affects fuzzing
campaigns. Therefore, it can pick up a useless test
case as a seed by chance, resulting in inefficient
and ineffective exploration.
- Like ConAFL and MUZZ, AutoInter-fuzzing uses test cases to explore input space.
CONZZER (Jiang et al., 2022) explores data races
by mutating pairs of concurrently executable function calls (concurrent call pairs). There are two main
differences between CONZZER and Schfuzz:
- Although CONZZER tries to manifest data races
by executing various concurrent call pairs, it
might waste time exploring codes that have noth-
ing to do with data races because it does not con-
sider whether an executed concurrent call pair can
be involved in the data race (i.e., whether it ac-
cesses shared memory). On the other hand, Sch-
fuzz prefers to explore codes involved in concur-
rency bugs because it employs test cases that ac-
cess shared memory in an unobserved order as
seeds.
- Like the authors of MUZZ, Jiang et al. (2022) do not mention the reproducibility of detected bugs.
ConFuzz (Vinesh and Sethumadhavan, 2020) stat-
ically assigns weights to each basic block based on
distances from thread function call sites (e.g., mutex
lock/unlock). The weight of a test case is the sum of
the weights of executed basic blocks, and ConFuzz
prefers high-weight test cases. There are two main
differences between ConFuzz and Schfuzz:
- ConFuzz does not handle the non-determinism of thread scheduling. Like the authors of MUZZ and CONZZER, Vinesh and Sethumadhavan (2020) do not mention the reproducibility of detected bugs.
- Like ConAFL, MUZZ, and AutoInter-fuzzing, ConFuzz uses test cases to explore input space.
6 CONCLUSION
This research proposed Schfuzz, a novel method
to detect concurrency bugs with feedback-guided
fuzzing. It has two key points. First, it controls PUTs’
execution with test cases generated by fuzzers and
guarantees the reproducibility of detected bugs. Sec-
ond, it feeds back dynamic orders of shared-memory
access to explore interleaving space efficiently and ef-
fectively. For evaluation, we ran fuzzing campaigns
for the five concurrent programs with Schfuzz, code-
coverage-guided fuzzing, and dumb fuzzing. Schfuzz
detected the bugs in three of the five PUTs most ef-
fectively and efficiently. Although Schfuzz failed to detect the bugs in the remaining two PUTs, the other methods also failed to detect them. This result shows that Sch-
fuzz can detect concurrency bugs more effectively and
efficiently than the other methods.
There are several directions for future work. First, how can we improve the efficiency and effectiveness of Schfuzz? It failed to detect the bugs in two of the PUTs in the experiment, which means there is room to improve it further. Second, how can we improve the applicability of Schfuzz? Source code or binary analysis may enable Schfuzz to automatically retrieve the information about PUTs that testers currently provide. Finally, can we generalize the results of this research? We evaluated Schfuzz with the hand-made
motivating example and the four most challenging
programs in SCTBench (Thomson et al., 2016) to de-
tect bugs. Experiments with various real-world soft-
ware will let us evaluate its generality more precisely.
REFERENCES
Bohme, M., Pham, V.-T., and Roychoudhury, A.
(2017). Coverage-Based Greybox Fuzzing as Markov
Chain. IEEE Transactions on Software Engineering,
45(5):489–506.
Brunner, E. and Munzel, U. (2000). The Nonparametric
Behrens-Fisher Problem: Asymptotic Theory and a
Small-Sample Approximation. Biometrical Journal,
42(1):17–25.
Burckhardt, S., Kothari, P., Musuvathi, M., and Na-
garakatte, S. (2010). A randomized scheduler with
probabilistic guarantees of finding bugs. In Proceed-
ings of the Fifteenth International Conference on Ar-
chitectural Support for Programming Languages and
Operating Systems, ASPLOS XV, pages 167–178,
New York, NY, USA. Association for Computing Ma-
chinery.
Cai, Y., Wu, S., and Chan, W. K. (2014). ConLock: A
constraint-based approach to dynamic checking on
deadlocks in multithreaded programs. In Proceed-
ings of the 36th International Conference on Software
Engineering, ICSE 2014, pages 491–502, New York,
NY, USA. Association for Computing Machinery.
Chen, H., Guo, S., Xue, Y., Sui, Y., Zhang, C., Li, Y.,
Wang, H., and Liu, Y. (2020). MUZZ: Thread-aware
Grey-box Fuzzing for Effective Bug Hunting in Mul-
tithreaded Programs. In Proceedings of the 29th
USENIX Security Symposium, SEC’20, pages 2325–
2342, USA. USENIX Association.
Chew, L. and Lie, D. (2010). Kivati: Fast detection and pre-
vention of atomicity violations. In Proceedings of the
5th European Conference on Computer Systems, Eu-
roSys’10, pages 307–320, New York, NY, USA. As-
sociation for Computing Machinery.
Fu, H., Wang, Z., Chen, X., and Fan, X. (2018). A system-
atic survey on automated concurrency bug detection,
exposing, avoidance, and fixing techniques. Software
Quality Journal, 26(3):855–889.
Jiang, Z.-M., Bai, J.-J., Lu, K., and Hu, S.-M.
(2022). Context-Sensitive and Directional Concur-
rency Fuzzing for Data-Race Detection. In Proceed-
ings of the 29th Annual Network and Distributed Sys-
tem Security Symposium, NDSS 2022, Reston, VA. In-
ternet Society.
Joshi, P., Park, C.-S., Sen, K., and Naik, M. (2009). A
randomized dynamic program analysis technique for
detecting real deadlocks. ACM SIGPLAN Notices,
44(6):110–120.
Ko, Y., Zhu, B., and Kim, J. (2022). Fuzzing with automat-
ically controlled interleavings to detect concurrency
bugs. Journal of Systems and Software, 191:111379.
Liu, C., Zou, D., Luo, P., Zhu, B. B., and Jin, H. (2018).
A Heuristic Framework to Detect Concurrency Vul-
nerabilities. In Proceedings of the 34th Annual Com-
puter Security Applications Conference, pages 529–
541, New York, NY, USA. Association for Computing
Machinery.
Miller, B. P., Fredriksen, L., and So, B. (1990). An empiri-
cal study of the reliability of UNIX utilities. Commu-
nications of the ACM, 33(12):32–44.
Musuvathi, M. and Qadeer, S. (2007). Iterative context
bounding for systematic testing of multithreaded pro-
grams. ACM SIGPLAN Notices, 42(6):446–455.
Musuvathi, M., Qadeer, S., Nainar, P. A., Ball, T., Basler,
G., and Neamtiu, I. (2008). Finding and reproducing
heisenbugs in concurrent programs. In Proceedings
of the 8th USENIX Conference on Operating Systems
Design and Implementation, OSDI’08, pages 267–
280, USA. USENIX Association.
Park, S., Lu, S., and Zhou, Y. (2009). CTrigger: Ex-
posing atomicity violation bugs from their hiding
places. ACM SIGARCH Computer Architecture News,
37(1):25–36.
Rawat, S., Jain, V., Kumar, A., Cojocar, L., Giuffrida, C.,
and Bos, H. (2017). VUzzer: Application-aware Evo-
lutionary Fuzzing. In Proceedings of the 24th Annual
Network and Distributed System Security Symposium,
NDSS 2017, Reston, VA. Internet Society.
Sen, K. (2008). Race directed random testing of concurrent
programs. ACM SIGPLAN Notices, 43(6):11–21.
Stephens, N., Grosen, J., Salls, C., Dutcher, A., Wang, R.,
Corbetta, J., Shoshitaishvili, Y., Kruegel, C., and Vi-
gna, G. (2016). Driller: Augmenting Fuzzing Through
Selective Symbolic Execution. In Proceedings of the
23rd Annual Network and Distributed System Security
Symposium, NDSS 2016, Reston, VA. Internet Soci-
ety.
Thomson, P., Donaldson, A. F., and Betts, A. (2016). Con-
currency testing using controlled schedulers: An em-
pirical study. ACM Transactions on Parallel Comput-
ing, 2(4):1–37.
Vinesh, N. and Sethumadhavan, M. (2020). ConFuzz—A
Concurrency Fuzzer. In Proceedings of the 1st In-
ternational Conference on Sustainable Technologies
for Computational Intelligence, volume 1045 of Ad-
vances in Intelligent Systems and Computing, pages
667–691, Singapore. Springer Singapore.
Wen, C., He, M., Wu, B., Xu, Z., and Qin, S. (2022). Con-
trolled concurrency testing via periodical scheduling.
In Proceedings of the 44th International Conference
on Software Engineering, ICSE 2022, pages 474–486,
Pittsburgh Pennsylvania. Association for Computing
Machinery.
Yu, J., Narayanasamy, S., Pereira, C., and Pokam, G.
(2012). Maple: A coverage-driven testing tool for
multithreaded programs. ACM SIGPLAN Notices,
47(10):485–502.
Yun, I., Lee, S., Xu, M., Jang, Y., and Kim, T. (2018).
QSYM: A practical concolic execution engine tai-
lored for hybrid fuzzing. In Proceedings of the 27th
USENIX Security Symposium, SEC’18, pages 745–
761, USA. USENIX Association.