To determine whether a file might be malicious,
we searched the VirusTotal malware information ser-
vice that aggregates the detection results of 76 anti-
virus (AV) products (VirusTotal, 2020b). Any reg-
istered user can submit a sample to VirusTotal for
analysis. The detection results and other file infor-
mation are available to anyone for subsequent query,
by submitting a cryptographic hash of the file. VirusTotal’s
Application Programming Interface (API) supports
rescan requests, which return results from the most
up-to-date AV products, and provides a wide range of
threat intelligence data related to malware (VirusTotal, 2020a).
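As an illustration of the hash-based lookup described above, the following minimal sketch computes a SHA-256 digest and prepares a file-report request. It assumes VirusTotal's public v3 REST endpoint and its `x-apikey` header, which may differ from the API version used in this study; the helper names and the placeholder API key are hypothetical.

```python
import hashlib
import urllib.request

# Assumption: VirusTotal's public v3 file-report endpoint.
VT_FILE_REPORT_URL = "https://www.virustotal.com/api/v3/files/{file_hash}"


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest used to identify a file."""
    return hashlib.sha256(data).hexdigest()


def build_report_request(file_hash: str, api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) a GET request for an existing report."""
    url = VT_FILE_REPORT_URL.format(file_hash=file_hash)
    # Assumption: the API key is passed in the x-apikey header.
    return urllib.request.Request(url, headers={"x-apikey": api_key})


if __name__ == "__main__":
    sample = b"example file contents"
    request = build_report_request(sha256_of_bytes(sample), "YOUR_API_KEY")
    print(request.full_url)
    # Sending is left out on purpose; urllib.request.urlopen(request)
    # would perform the lookup and return a JSON body.
```

Only the hash leaves the machine in this lookup, so a file that was never submitted to VirusTotal simply has no report.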
The contribution of this paper is a methodology
for investigating the presence of malware over all the
commits in the lifetime of a GitHub repository. While
it is straightforward to clone a repository at a specific
point in time (e.g., the current head state or some
arbitrary branch in the past), our approach investigates
all of the commits throughout the history of the
repository to identify files for analysis. We use the
well-established method of VirusTotal anti-virus en-
gine results to assess maliciousness of a particular file
type (Windows portable executable binaries), and we
apply our methodology to a subset of GitHub repositories
(Windows C and C++ repositories) in this preliminary
investigation. However, this methodology
could be applied to additional populations of GitHub
repositories, extended to other file types of interest
found throughout repository lifetimes, and combined
with other malware analysis methods.
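One way to picture the idea of examining a repository's full history is sketched below. This is not the tooling used in this paper; it is a minimal illustration that assumes a local clone and the `git` command-line client, lists every object reachable from any ref, and keeps blobs whose bytes carry the Windows portable executable signature. The function names are hypothetical.

```python
import struct
import subprocess
from typing import Iterator, Tuple


def looks_like_pe(data: bytes) -> bool:
    """Check the DOS 'MZ' magic and the 'PE\\0\\0' signature it points to."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # The 4-byte little-endian field at 0x3C holds the PE header offset.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"


def git(repo: str, *args: str) -> bytes:
    """Run a git command inside the given repository and return stdout."""
    result = subprocess.run(
        ["git", "-C", repo, *args], check=True, capture_output=True
    )
    return result.stdout


def pe_blobs_in_history(repo: str) -> Iterator[Tuple[str, str]]:
    """Yield (object id, path) for PE blobs reachable from any ref."""
    listing = git(repo, "rev-list", "--all", "--objects")
    for line in listing.decode("utf-8", "replace").splitlines():
        object_id, _, path = line.partition(" ")
        if not path:
            continue  # commits and root trees are listed without a path
        if git(repo, "cat-file", "-t", object_id).strip() != b"blob":
            continue  # skip subdirectory trees
        if looks_like_pe(git(repo, "cat-file", "blob", object_id)):
            yield object_id, path
```

Because the listing covers every reachable object, a binary that was committed and later deleted still appears, which is the case a checkout of the current head would miss.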
In this paper, we present our preliminary inves-
tigation into the presence of malware files in Win-
dows C/C++ GitHub repositories. Section 2 provides
background on GitHub and related work in VirusTotal
malware research. We describe our approach to mine
Windows binary files from GitHub and then query
VirusTotal for malware detection results in Section 3.
Section 4 presents our initial VirusTotal analysis re-
sults for the Windows files that we mined from our
GitHub repositories of interest. Section 5 provides a
discussion and more detailed analysis of our results.
We present our conclusions and directions for future
research in Section 6.
2 BACKGROUND AND RELATED WORK
GitHub is known to host malware, both legitimately
(i.e., in compliance with GitHub’s terms of use) and
illegitimately. GitHub prohibits content that “con-
tains or installs any active malware or exploits, or
uses our platform for exploit delivery” (GitHub.com,
2020b). An example of GitHub hosting malware in
violation of this policy occurred in March 2018, when
cybercriminals uploaded cryptocurrency mining mal-
ware to forked GitHub projects and used phishing ads
to download and execute the malware (Avast Threat
Intelligence Team, 2018). More recently, 26 open
source projects were discovered to have backdoors in-
serted by the Octopus malware, which used the build
process to spread to other NetBeans projects (Munoz,
2020). GitHub appears to allow executable malware
in curated malware collections. A search for “mal-
ware samples” returns over 250 repositories. Al-
though many repository descriptions suggest analy-
sis tools or malware-related resources, some explic-
itly indicate that they include malware samples.
Only recently have two efforts systematically studied
the detection of malware or malicious repositories
in GitHub. Rokon et al. developed a methodology for
finding malware source code within GitHub projects and
identified 7,504 malware source repositories (Rokon
et al., 2020). While the findings from this work can be
used to search for malware binaries in GitHub as well,
our work seeks to find malicious binaries in GitHub
repositories that are not necessarily purporting to con-
tain malware. Zhang et al. developed a deep neural
network approach to detect malicious GitHub reposi-
tories using content-based features from source code
files, investigating a population of blockchain and
cryptocurrency repositories (Zhang et al., 2020). They
used VirusTotal as part of their evaluation process
for comparison purposes, ultimately labeling 1,492
repositories as malicious out of their population of
3,729 repositories, but again this work was more fo-
cused on malicious source code in GitHub.
Many previous research efforts have used Virus-
Total to support malware detection and analysis
in the domains of malware binaries run in dy-
namic analysis sandboxes (Graziano et al., 2015),
signed malware binaries (Kim et al., 2018), and mobile
applications (Hurier et al., 2017; Pendlebury et al., 2019;
Salem et al., 2019; Suciu et al., 2018; Wang et al.,
2019). VirusTotal can also be
used for analysis of malicious web addresses, i.e.,
Uniform Resource Locators (URLs), such as those
used in phishing campaigns (Peng et al., 2019). These
research efforts and others each use VirusTotal in
different ways: some apply a threshold on the
number of VirusTotal engines needed to consider a
sample malicious (e.g., 1, 5, or 10), some apply a
threshold based on the percentage of engines (e.g., 50%),
and others use results from a subset of engines chosen
for high reputation or
market share. In short, there is little consensus on
how to definitively interpret VirusTotal results to de-
termine whether a sample is malicious.
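The three labeling strategies described above can be compared with a short sketch. The engine names and verdicts below are made up for illustration, and the functions are hypothetical helpers rather than part of any VirusTotal library; each takes a mapping from engine name to whether that engine flagged the sample.

```python
from typing import Dict, Iterable


def flagged_by_count(results: Dict[str, bool], min_engines: int) -> bool:
    """Malicious if at least min_engines engines flagged the sample."""
    return sum(results.values()) >= min_engines


def flagged_by_fraction(results: Dict[str, bool], min_fraction: float) -> bool:
    """Malicious if the flagged share of engines reaches min_fraction."""
    if not results:
        return False
    return sum(results.values()) / len(results) >= min_fraction


def flagged_by_trusted(results: Dict[str, bool], trusted: Iterable[str]) -> bool:
    """Malicious if any engine in the trusted subset flagged the sample."""
    return any(results.get(engine, False) for engine in trusted)


if __name__ == "__main__":
    # Hypothetical verdicts from four engines.
    verdicts = {"EngineA": True, "EngineB": False,
                "EngineC": True, "EngineD": False}
    print(flagged_by_count(verdicts, 1))
    print(flagged_by_count(verdicts, 5))
    print(flagged_by_fraction(verdicts, 0.5))
    print(flagged_by_trusted(verdicts, ["EngineB"]))
```

The same set of verdicts can be labeled malicious under one rule and benign under another, which is the lack of consensus noted above.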
ICISSP 2021 - 7th International Conference on Information Systems Security and Privacy