formance in the experiment, because its average execution time is only 0.8 seconds, the lowest among the three tools.
However, it is worth mentioning that Slither defines vulnerabilities more broadly and pre-defines a large number of detectors for issues of extremely low severity that appear frequently in smart contracts. It categorises the severity of vulnerabilities into five levels: High, Medium, Low, Informational, and Optimisation, whereas Mythril uses only three: High, Medium, and Low. Vulnerabilities at the Informational or Optimisation level usually pose no threat to security, yet cannot be refuted; for example, a variable name that violates the Solidity naming convention is reported as an Informational-level finding. Over 50% of the vulnerabilities reported by Slither fall into the Informational or Optimisation levels. This may directly lead to a falsely low false positive rate, so the experiment might be fairer if these types of reported vulnerabilities were eliminated before evaluation.
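If such findings were to be eliminated before evaluation, a small post-processing step over the tool's machine-readable report would suffice. Below is a minimal sketch in Python, assuming Slither is run with its --json option and that each finding carries an impact field; the report path and the exact JSON layout are illustrative assumptions rather than a verified schema.

```python
import json

# Severity levels whose findings are dropped before evaluation. The labels
# follow Slither's five impact levels; the JSON layout below is an assumed
# structure for illustration, not a verified schema.
EXCLUDED_IMPACTS = {"Informational", "Optimization"}

def security_relevant_findings(report_path):
    """Return only the findings whose impact level may pose a real threat."""
    with open(report_path) as fh:
        report = json.load(fh)
    # Assumed layout: {"results": {"detectors": [{"impact": "...", ...}]}}
    findings = report.get("results", {}).get("detectors", [])
    return [f for f in findings if f.get("impact") not in EXCLUDED_IMPACTS]

if __name__ == "__main__":
    kept = security_relevant_findings("slither_report.json")  # hypothetical path
    print(f"{len(kept)} security-relevant findings retained")
```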
Securify. According to the experimental data in the table, Securify shows the least ideal robustness: it failed to analyse 25 smart contracts for a variety of reasons, such as unsupported Solidity versions or tokens the tool could not recognise. Its performance is nevertheless relatively high, with an average execution time of around 2.1 seconds. Securify also shows relatively high accuracy owing to the lowest false negative rate among the three tools, which can be explained by the fact that Securify supports the detection of more vulnerability types than Mythril and Slither.
Similar to Slither, Securify categorises the severity of the vulnerabilities it can detect into five levels: Critical, High, Medium, Low, and Info. The majority of the vulnerabilities reported by Securify fall into the Low or Info level, which may likewise lead to a falsely low false positive rate.
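For clarity about what these rates mean here: true negatives are hard to define for vulnerability detection, so evaluations of this kind often compute the false positive rate over the set of reported findings and the false negative rate over the labelled ground truth. A hedged sketch of definitions consistent with the discussion above (how this study computes them is an assumption, not restated from its text):

FPR = FP / (FP + TP),    FNR = FN / (FN + TP)

Under this reading, a flood of irrefutable Informational- or Info-level findings inflates TP and thereby deflates the measured FPR, which is the "falsely low" effect described above.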
4.3 Future Directions
This section proposes three future directions, based on key observations made in this research and on the intention to address some limitations of the study.
1. For future studies that attempt to improve the accuracy of analysis tools, reducing the false negative rate might be the more valuable direction, as it offers a higher marginal benefit: the experimental data show that the false negative rate of current analysis tools is generally much higher than the false positive rate. In contrast, reducing the false positive rate is friendlier to beginners and less time-consuming, because developers only need to refine existing detectors, whereas reducing the false negative rate may require considerably more work on developing new detectors.
2. As mentioned in the previous section, a limitation of this study identified during the Slither and Securify evaluations is that the severity factor was neglected when conducting the experiments, which might have resulted in a falsely low false positive rate estimation. In future experiments, it would therefore be better to classify the reported vulnerabilities by severity and compute the proportion of each severity level in the total. Setting a vulnerability-severity threshold based on the computed proportions could then prevent flooding data from affecting the fairness of the experiment (a minimal sketch of this procedure is given after this list).
3. Due to the time constraints of this study, the total number of sample smart contracts used in the experiments was kept below 200. This limitation might reduce the generalisability of the study across analysis tools and increase the margin of error, because some characteristics of the analysis tools cannot be revealed when analysing such a small sample.
   Therefore, the most straightforward way to address this limitation in future studies is to introduce more sample smart contracts into the experiments. Moreover, in order to reduce bias in the evaluation, these smart contracts could be diverse in application category, such as DeFi, Decentralized Exchange (DEX), Gaming, etc.
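To make the second direction concrete, the following is a minimal sketch, assuming each reported finding is a dict with an impact label as in the earlier Slither example; the unified severity scale merging the two tools' level names is itself an assumption made for illustration.

```python
from collections import Counter

# Unified scale from least to most severe; merging Slither's and Securify's
# level names into one ordering is an assumption made for illustration.
SEVERITY_ORDER = ["Optimization", "Informational", "Info", "Low",
                  "Medium", "High", "Critical"]
RANK = {level: i for i, level in enumerate(SEVERITY_ORDER)}

def severity_profile(findings):
    """Proportion of reported findings at each severity level."""
    counts = Counter(f.get("impact", "Unknown") for f in findings)
    total = sum(counts.values()) or 1  # avoid division by zero
    return {level: counts.get(level, 0) / total for level in SEVERITY_ORDER}

def apply_threshold(findings, minimum="Low"):
    """Drop findings ranked below the chosen severity threshold."""
    cutoff = RANK[minimum]
    return [f for f in findings if RANK.get(f.get("impact"), -1) >= cutoff]
```

Computing the per-level proportions first makes the threshold choice data-driven: if, say, more than half of a tool's reports sit at the Informational level or below, setting the threshold at Low removes exactly the flooding findings described earlier without touching anything that could plausibly be a security issue.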
5 CONCLUSIONS
In general, the existing smart contract analysis tools can indeed effectively detect certain categories of vulnerabilities in smart contracts, but may lack reliability when attempting to detect more complicated vulnerabilities or deeply hidden defects, such as the integer overflow defect in the BEC Token contract. More importantly, despite their advantages in time efficiency and low cost, smart contract analysis tools cannot fully replace the manual auditing performed by professional audit teams, because the analysis tools can only detect a certain number of vulnerabilities and defects via predefined logic or processes, which often covers only a small part of the