APPENDIX
Objective: The objective here is to answer the following questions:
1. For reliably detecting a library, what is the acceptable value of Jar2Vec similarity scores for source code and bytecode inputs?
2. Does the threshold similarity score ($\hat{\alpha}$) vary with the input parameters (β, γ, and ψ) of PVA?
3. What are the optimal values for the PVA tuning parameters β, γ, and ψ?
Please refer to Table-1 for notation definitions.
Table 4: Scenarios for training Jar2Vec models using PVA.
                    Parameters varied
Epochs (β)                   Vector size (γ)              Training samples (ψ)                        Models
Fixed at 10                  Fixed at 10                  Vary 5000 to CorpusSize in steps of 5000    CorpusSize ÷ 5000
Vary 5 to 50 in steps of 5   Fixed at 10                  Fixed at CorpusSize                         10
Fixed at 10                  Vary 5 to 50 in steps of 5   Fixed at CorpusSize                         10
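The three scenarios of Table 4 can be enumerated programmatically. The following is a minimal sketch, not the authors' implementation: the function name `table4_scenarios`, the dictionary keys, and the example corpus size are assumptions made for illustration. It also assumes that CorpusSize is a multiple of 5000; otherwise the first scenario stops at the largest multiple below CorpusSize.

```python
def table4_scenarios(corpus_size):
    """Enumerate (epochs, vector size, training samples) settings per Table 4.

    Hypothetical helper: names and dictionary keys are illustrative only.
    """
    scenarios = []

    # Scenario 1: vary training samples (psi) from 5000 to corpus_size in
    # steps of 5000; epochs (beta) and vector size (gamma) are fixed at 10.
    for psi in range(5000, corpus_size + 1, 5000):
        scenarios.append({"epochs": 10, "vector_size": 10, "samples": psi})

    # Scenario 2: vary epochs (beta) from 5 to 50 in steps of 5; vector size
    # is fixed at 10 and the whole corpus is used for training.
    for beta in range(5, 51, 5):
        scenarios.append({"epochs": beta, "vector_size": 10, "samples": corpus_size})

    # Scenario 3: vary vector size (gamma) from 5 to 50 in steps of 5; epochs
    # are fixed at 10 and the whole corpus is used for training.
    for gamma in range(5, 51, 5):
        scenarios.append({"epochs": 10, "vector_size": gamma, "samples": corpus_size})

    return scenarios


if __name__ == "__main__":
    # Example corpus size chosen only for illustration.
    for setting in table4_scenarios(20000):
        print(setting)
```

With a corpus of 20000 samples, this yields 4 + 10 + 10 = 24 settings, matching the per-scenario model counts in the table (CorpusSize ÷ 5000, 10, and 10). The setting with β = 10, γ = 10, and ψ = CorpusSize occurs once in each scenario, so a real pipeline would likely train it only once.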
Test-bed Setup: Using the test partition of the test-bed developed in §4.1, we generate a test dataset (Y) containing "same" and "different" file pairs in a 50:50 ratio. Further, to check whether our tool is resilient to source code transformations, we test it under the following three scenarios:
1. Package-name Transformations: Package names of the classes present in TPLs are modified.
2. Function (or Method) Transformations: Function names are changed in the source code of the constituent classes, and function bodies are relocated within a class.
3. Source Code Transformations: Names of various variables are changed, and source code statements are inserted, deleted, or modified in ways that do not alter the semantics of the source code. For instance, print statements are added at various places in the source file.
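To make the first two transformation scenarios concrete, the sketch below rewrites a Java source string with regular expressions. It is only an illustration under stated assumptions: the helper names `rename_package` and `rename_identifier` are hypothetical, and a regex-based rename also touches matching words inside string literals and comments, so a real transformation tool would more likely work on a parsed syntax tree.

```python
import re


def rename_package(java_source, new_package):
    """Replace the first package declaration (package-name transformation).

    Hypothetical helper for illustration; not the authors' tooling.
    """
    pattern = r"^[ \t]*package\s+[\w.]+\s*;"
    # A function replacement avoids any escape processing of new_package.
    return re.sub(pattern, lambda m: f"package {new_package};",
                  java_source, count=1, flags=re.MULTILINE)


def rename_identifier(java_source, old_name, new_name):
    """Rename whole-word occurrences of an identifier (function or variable).

    Caveat: this also renames matches inside string literals and comments.
    """
    pattern = rf"\b{re.escape(old_name)}\b"
    return re.sub(pattern, lambda m: new_name, java_source)


if __name__ == "__main__":
    original = (
        "package com.example.lib;\n"
        "\n"
        "public class Util {\n"
        "    int add(int a, int b) { return a + b; }\n"
        "}\n"
    )
    transformed = rename_identifier(rename_package(original, "a.b.c"), "add", "f0")
    print(transformed)
```

The transformed file keeps the same behaviour as the original while differing in its package and method names, which is the kind of variation the scenarios above are meant to exercise.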
We test Jar2Vec models’ efficacy in detecting similar
source code pairs (or bytecode pairs) using Y.
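A balanced test set such as Y could be assembled along the following lines. This is a sketch under assumptions that the text does not state: a "same" pair is taken to be a file together with a transformed variant of itself, a "different" pair is taken to be two distinct files, and the function name `build_test_pairs` and its input layout are hypothetical.

```python
import random


def build_test_pairs(variants, n_pairs, seed=0):
    """Build a shuffled list of (text_a, text_b, label) with a 50:50 label split.

    variants: dict mapping a file id to (original_text, transformed_text).
    Assumption: "same" = file vs. its transformed variant,
                "different" = originals of two distinct files.
    """
    if n_pairs % 2 != 0:
        raise ValueError("n_pairs must be even for an exact 50:50 split")
    ids = sorted(variants)
    if len(ids) < 2:
        raise ValueError("at least two files are needed for 'different' pairs")

    rng = random.Random(seed)  # seeded for reproducible test sets
    pairs = []

    # Half of the pairs: a file and its own transformed variant.
    for _ in range(n_pairs // 2):
        file_id = rng.choice(ids)
        original, transformed = variants[file_id]
        pairs.append((original, transformed, "same"))

    # Other half: originals of two distinct files.
    for _ in range(n_pairs // 2):
        first, second = rng.sample(ids, 2)
        pairs.append((variants[first][0], variants[second][0], "different"))

    rng.shuffle(pairs)
    return pairs


if __name__ == "__main__":
    demo = {
        "A.java": ("class A {}", "class A { /* renamed */ }"),
        "B.java": ("class B {}", "class B { /* renamed */ }"),
        "C.java": ("class C {}", "class C { /* renamed */ }"),
    }
    for text_a, text_b, label in build_test_pairs(demo, 6):
        print(label, "|", text_a, "|", text_b)
```

Each pair can then be scored by a trained model and the score compared against the threshold to decide whether the pair counts as similar.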
Procedure: The salient steps are:
1. $F_{bc}$, $F_{sc}$ := Obtain the textual forms of bytecode and source code present in source files of training JARs of the test-bed (developed in §4.1).
2. For each parameter combination π ∈ Z (listed in Table-4):
(a) $S^{\pi}_{sc}$ := $S^{\pi}_{bc}$ := NULL
ENASE 2021 - 16th International Conference on Evaluation of Novel Approaches to Software Engineering