Table 4: Evaluation results of three different searching approaches.

Method     Average Score   Average Hits                               MAP@10
                           @Top10   @1-3   @4-6   @7-9   @10          Initial   Gold
Words      5.205           3.89     1.95   0.98   0.64   0.32         0.2929    0.389
Semantic   4.312           3.66     0.34   0.84   1.11   1.37         0.291     0.366
Category   5.258           4.89     0.71   1.18   1.25   1.75         0.404     0.489
Table 5: MAP values of different searching result lists.

MAP@Top   Before Ranking                                 After Ranking
          WCS     CWS    SCW    CSW    SWC    WSC        WCS      CWS     SCW     CSW    SWC    WSC    Auto
10        0.2929  0.404  0.291  0.404  0.291  0.2929     0.2926↓  0.406↑  0.303↑  0.406  0.303  0.292  0.407
20        0.273   0.303  0.235  0.273  0.227  0.233      0.272    0.304   0.241   0.275  0.233  0.234  0.301
30        0.226   0.247  0.213  0.239  0.212  0.216      0.227    0.248   0.217   0.240  0.216  0.217  0.247
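The MAP@Top-k values in Tables 4 and 5 follow the usual mean average precision computation over the ranked project lists. The following is a minimal sketch of that computation, assuming a standard definition; the function names and the normalization by min(|relevant|, k) are illustrative choices, not code taken from this paper.

    def average_precision_at_k(retrieved, relevant, k=10):
        """Average precision of one ranked result list, cut off at rank k."""
        relevant = set(relevant)
        if not relevant:
            return 0.0
        hits = 0
        precision_sum = 0.0
        for rank, project in enumerate(retrieved[:k], start=1):
            if project in relevant:
                hits += 1
                precision_sum += hits / rank  # precision at this rank
        return precision_sum / min(len(relevant), k)

    def map_at_k(all_retrieved, all_relevant, k=10):
        """Mean of the per-query average precision values, i.e. MAP@k."""
        scores = [average_precision_at_k(r, rel, k)
                  for r, rel in zip(all_retrieved, all_relevant)]
        return sum(scores) / len(scores) if scores else 0.0

For instance, calling map_at_k(result_lists, gold_lists, k=10) over all test queries would yield a value comparable to the MAP@Top-10 row of Table 5.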
experimental results show that the ranker yields a slight improvement in MAP over the top-10 projects. Future work includes designing more effective text features for training the ranker, analyzing the concrete effectiveness of the individual features, finding ways to reduce the impact of the imbalanced project distribution so as to train better category classifiers, and taking user requirements into account when measuring project similarity.
ACKNOWLEDGEMENTS
This work was supported by the National Natural Science Foundation of China (No. 61802167), the Open Foundation of the State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications) (SKLNST-2019-2-15), and the Fundamental Research Funds for the Central Universities. Jidong Ge is the corresponding author.