Analyzing Distributions of Emails and Commits from OSS Contributors

through Mining Software Repositories

An Exploratory Study

Mário Farias 1,2, Renato Novais 1,4, Paulo Ortins, Methanias Colaço and Manoel Mendonça 1,4

Federal Institute of Bahia, Salvador, Brazil

Federal Institute of Sergipe, São Cristóvão, Brazil

Federal University of Sergipe, São Cristóvão, Brazil

Fraunhofer Project Center for Software and Systems Engineering, Bahia, Brazil

Keywords:

Software Repository Mining, Open Source Contributions, Experimental Software Engineering, Software

Visualization, Preferred Representational Systems.

Abstract:

Context: Distributed software development is a common practice in the software industry, especially in the Open Source Software (OSS) community, where developers are normally distributed around the world. In addition, most of them work for free and with little or no coordination. Understanding developers' practices in those projects may help communities manage their projects successfully. Goal: We mined two repositories of the Apache Httpd project to gather information about its developers' behavior. Method: We developed an approach that crosses data gathered from the mailing list and the source code repository through mining techniques. The approach uses software visualization to analyze the mined data. We conducted an experimental evaluation of the approach to assess behavioral patterns in the OSS development community. Results: Our results show behavior patterns of Apache developers. In addition, we deepen the analysis of the Preferred Representational System of four top developers presented by Colaço et al. (Colaço et al., 2010). Conclusion: The use of data mining and software visualization to analyze data from different sources can reveal important properties of development processes.

1 INTRODUCTION

A challenge for software engineers and programmers is dealing with complex issues and large software systems while evolving their projects (Sjoberg et al., 2013). Large systems are difficult to understand because of the size and complexity that software has reached. Software engineers frequently face maintenance tasks that require understanding unfamiliar software artifacts. Such tasks often involve communication, compatibility, and complexity problems (Lanza et al., 2005). Therefore, ensuring the maintainability of software systems is costly, and improving this process is a continuing research topic in the software maintainability area. One possible way to deal with this complex scenario is to understand the behavior of the development community through the analysis of software repository data, and so improve software development processes and practices (Heller et al., 2011).

Software repositories have been used to discover useful knowledge about the development, maintenance and evolution of software. However, some of these data sources (e.g., mailing lists) are not built in a structured and organized way, so gathering evidence from them takes considerable effort. To this end, researchers have been developing different approaches (Licorish and MacDonell, 2014; Heller et al., 2011; Novais et al., 2013a; Canfora et al., 2011; Eyolfson et al., 2011). They use data mining, software visualization, text mining, and mining of software repositories. Some of them analyze each repository separately, even though combining different techniques is a promising approach (Novais et al., 2013b).

Combining approaches from different areas may lead to better results. For example, enriching those techniques with information visualization may reveal valuable hidden software properties. Based on this premise, this paper presents an exploratory study that uses data mining and software visualization techniques to analyze open source software (OSS) developers' behavior. The study is particularly interested in analyzing discussion mailing lists and source code repositories through the use of software visualization. It integrates and analyzes data from the Apache Httpd mailing list and source code repository.

Farias M., Novais R., Ortins P., Colaço M. and Mendonça M.
Analyzing Distributions of Emails and Commits from OSS Contributors through Mining Software Repositories - An Exploratory Study.
DOI: 10.5220/0005368603030310
In Proceedings of the 17th International Conference on Enterprise Information Systems (ICEIS-2015), pages 303-310
ISBN: 978-989-758-097-0
© 2015 SCITEPRESS (Science and Technology Publications, Lda.)

Our approach extends the empirical studies presented in previous works (Colaço et al., 2012; Farias et al., 2014). The first work proposed strategies to mine Apache server mailing lists, aiming to classify the top Apache Httpd contributors according to their Preferred Representational Systems (PRS). The second work analyzed Apache developers' behavior by mining mailing lists and source configuration management data (SCM data).

The Preferred Representational System is the preferred way in which a person uses one or more basic systems to communicate and learn. The basic systems usually discussed in the literature are: (1) Visual, (2) Auditory and (3) Kinesthetic. For more detail, see (Colaço et al., 2012).

In summary, this study deepens one of the research questions presented by (Colaço et al., 2010), introduces two new research questions and, when possible, compares the outcomes of both studies.

This paper is organized as follows. Section 2 presents related work. Section 3 describes our experimental evaluation. Section 4 reports and discusses our findings. Section 5 discusses some threats to validity. Finally, Section 6 concludes the paper and highlights future work.

2 RELATED WORK

This section discusses related work concerned with identifying patterns in OSS development communities through mining software repositories or software visualization.

Heller et al. (Heller et al., 2011) proposed a strategy that mined GitHub repository metadata and used visualization techniques to identify patterns in the OSS development community. The study focused on specific patterns, such as the effect of geographic distance on developer relationships, social connectivity and influence among cities, and variation in project-specific contribution styles. From the standpoint of behavior patterns, the authors of (Murgia et al., 2014) analyzed whether development artifacts like issue reports carry any emotional information about software development. The work analyzed the Apache Software Foundation issue tracking system. The analysis shows that developers do express emotions (in particular gratitude, joy and sadness). Based on their findings, issue comments have potential as a data source for emotion mining.

Some works have already used email-specific analysis to study the OSS development process and people's behavior (Rigby and Hassan, 2007; Gill and Oberlander, 2003). Rigby and Hassan (Rigby and Hassan, 2007) analyzed OSS mailing list content to find developers' personalities and general emotional content. In (Gill and Oberlander, 2003), the authors investigated the impact of computer-mediated interaction on person perception. In particular, they studied how traits that are important for socialization and collaboration may be detected from the text of an email. To this end, they analyzed emails from 30 students at the University of Edinburgh.

Other works focus on the use of software visualization to understand OSS developers' behavior. They usually propose new visual metaphors to analyze OSS developers' contributions. In (Licorish and MacDonell, 2014), the authors used psycholinguistics, text mining and visualization to examine repository data. They also demonstrated the utility of combining these approaches to illuminate details of teams' behavioral processes evident in their artifacts. Müller et al. (Müller et al., 2010) presented a visualization and statistics system called Subversion Statistics Sifter. It explores the structure and evolution of data contained in Subversion repositories. They use statistical graphics and graph plots to analyze both developer activity and source code changes.

Two works are closest to the research presented here. First, Canfora et al. (Canfora et al., 2011) mined explicitly documented cross-system bug fixes from versioning repositories and data from two project mailing lists. They tried to identify cross-system bug-fixing activities between FreeBSD and OpenBSD. They also investigated the social role of the developers performing such activities by means of social network analysis. We based our procedure for crossing mailing list and repository data on this work. Second, in (Colaço et al., 2010), the authors introduced a psychometrically based neurolinguistic analysis tool to classify developers through email mining. They conducted an experiment to assess the Preferred Representational Systems of top developers on the Apache server mailing lists. In our study, we extended their analysis of emails and Preferred Representational Systems.

3 EXPERIMENTAL EVALUATION

This section describes the planning and operation of the experimental evaluation we conducted to validate our approach. The experimental process follows the guidelines of Wohlin et al. (Wohlin et al., 2012). The next section presents the gathered evidence and the results.

3.1 Goal Definition

The main goal of our study is to reveal interesting behavioral patterns in open source software contributions, such as the geographic distribution of emails and commits and the correlation over time between the emails and commits of OSS project developers. To reach this goal, we gathered information through preprocessing and text mining, mining of software repositories, and software visualization, in order to analyze mailing lists, project commits, and the geographic location (geo-location) of contributions in OSS projects.

3.2 Planning

3.2.1 Context Selection

The experiment context was open source project repositories. These repositories hold a large number of emails and commits. Commonly, the data is not ready to use, and it must be cleaned to avoid misleading conclusions. For that, we developed computational procedures following (Colaço et al., 2010; Colaço et al., 2012). On top of that, we did a detailed manual analysis of the committers' profiles in order to gather geographic information. The approach of this study followed three steps. First, we extracted data from: a) Apache's commit repository; b) the Apache developers' mailing list; and c) geo-location services, for geographic information. Second, we crossed the data collected in the previous step in order to associate each item with the developer who produced it. Finally, we built interactive visualizations that help users discover relevant information. Next, we present each of these steps in detail.

We developed four modules for this study in order to provide a suitable environment to integrate and analyze two repositories of the same system. The first is a module for Extraction, Transformation and Load (ETL) of emails. The second module mines source code repositories. The third is the integrator module. Finally, the fourth is a visualization module. The visualizations are publicly available at (http://goo.gl/RSs4VR).

3.2.2 Research Questions

This work aims to investigate OSS developers' behavior. To do this, we mined two software repositories. Our efforts focused on the relation between the PRS presented by Colaço et al. (Colaço et al., 2010) and the contributions made by each developer to the Apache project (Question ii). In addition, we pose one question addressing the distribution of emails and commits over time in the project (Question i). Our research questions are as follows:

i. How are commits and emails distributed over time among the Apache Project community?

ii. Is there a link between OSS developers' context-specific PRS (Colaço et al., 2010) and the contributions made by each developer?

3.2.3 Participant and Artifact Selection

In this study, we used the Apache Httpd Project (http://httpd.apache.org/) as the object of analysis. Apache Httpd is an HTTP server that aims to offer a robust, efficient and bug-free implementation of HTTP services. According to (NETCRAFT, 2013), it is used by more than 300 million websites, which represents almost 40% of the world's websites. Apache Httpd is maintained by dozens of developers through the Apache Software Foundation (ASF). This foundation comprises a community of developers and users that supports more than 100 well-known open source projects.

Over its 17 years of development, the project received more than 60,000 commits, totaling more than two million lines of code written by more than 100 developers around the world. These developers use a mailing list to communicate with each other. They send emails to discuss several activities, such as the development of new features, bug fixes, user problems, and so on. This data can provide useful information about the project's evolution and developers' behavior. Because of that, it may be used in several studies on mining software repositories.

To answer our research questions, we extracted and analyzed the body of 100,479 email messages and 33,586 commits from the Apache repositories between 1995 and 2005. We selected the four developers who had the greatest number of commits. We refer to these developers as "Dev A", "Dev B", "Dev C" and "Dev D". We also grouped all the other developers and refer to them as "Cluster"; these developers represent the rest of the population. We analyzed the same developers and the same period used in the related study (Colaço et al., 2010).

3.2.4 Preparation

We prepared a pilot to test our approach. The pilot study was carried out using a small sample of emails and commits, chosen at random. The pilot helped us calibrate some specific characteristics of our modules and find points for improvement, such as the performance of the data crossing and of the geographic information retrieval.

3.3 Experiment Execution

To conduct this research, we developed four modules. The modules were necessary to provide a suitable environment to integrate and analyze two different source repositories. The first is a module for Extraction, Transformation and Load (ETL) of the emails. The second module is for Mining Source Code Repositories. The third is the integrator module. Finally, the fourth is the visualization module. They are described as follows.

ETL OF THE EMAILS: Our approach uses Text Mining (TM). Similar to conventional data mining, text mining consists of the phases inherent to the knowledge discovery process (Fayyad et al., 1996). We need to pay special attention to preprocessing, because the data is unstructured from the point of view of computer analysis. This means that, before the text data can be mined, each document must be converted to a suitable format. Our approach treated a set of emails organized by month as a text document.

We based our mining of the email list on the processes of (Colaço et al., 2010; Colaço et al., 2012). Those processes consist of several steps, all of which aim to produce data collections with high semantic content. Next, we briefly present the steps. For more details on how to preprocess and clean messages, we suggest reading the references used (Rigby and Hassan, 2007; Witte et al., 2008).

Step 1: Convert the documents to a single format. The original documents are not always represented in a single textual format, so they must be converted to one. To do so, we eliminated presentation and formatting elements, such as footers, signatures, source code and attachments.

Step 2: Change the letter case to upper case or lower case. This facilitates the matching process. We changed the letter case to lower case.

Step 3: Separate each email message. The goal of this step is to recover important properties, such as "from", "to", "subject", "date", "time of the day" and "weekday". Identifying the start and the end of each message is a challenging task. In general, there is no single header pattern throughout the mailing list's lifecycle. (A header pattern is a line that indicates the start of an email, e.g., "From owner-new-httpd@hyperreal.com Wed Nov 1 12:13:11 1995" or "From dev-return-62023-apmail-httpd-dev-archive=httpd.apache.org@httpd.apache.org Mon Sep 01 06:08:51 2008".) We performed a qualitative analysis on a sample of emails and created a heuristic to detect the headers. After that, we analyzed the outcomes in order to validate our heuristic.

Step 4: Group the mined data, summarize it and store it in a database.

We applied all these steps to 100,479 emails sent between 1995 and 2005.
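The preprocessing steps above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the heuristic actually used in the study: it assumes an mbox-like archive in which every message starts with a separator line of the form "From <address> <weekday> <month> <day> <time> <year>", and the names `split_messages`, `extract_properties`, `SEPARATOR` and `WEEKDAYS` are hypothetical.

```python
import re
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Assumed separator shape, e.g.
# "From owner-new-httpd@hyperreal.com Wed Nov 1 12:13:11 1995"
SEPARATOR = re.compile(
    r"^From \S+ +[A-Z][a-z]{2} +[A-Z][a-z]{2} +\d{1,2} "
    r"+\d{2}:\d{2}:\d{2} +\d{4}\s*$"
)


def split_messages(archive_text):
    """Step 3: cut a raw archive into messages at separator lines."""
    messages, current = [], None
    for line in archive_text.splitlines():
        if SEPARATOR.match(line):
            if current is not None:
                messages.append("\n".join(current))
            current = []
        elif current is not None:
            current.append(line)
    if current is not None:
        messages.append("\n".join(current))
    return messages


def extract_properties(raw_message):
    """Steps 2 and 3: recover lower-cased sender, subject and time fields.

    Assumes the message carries a well-formed Date header.
    """
    msg = message_from_string(raw_message)
    name, address = parseaddr(msg.get("From", ""))
    sent = parsedate_to_datetime(msg["Date"])
    return {
        "from_name": name.lower(),
        "from_address": address.lower(),
        "subject": (msg.get("Subject") or "").lower(),
        "date": sent.date().isoformat(),
        "hour": sent.hour,
        "weekday": WEEKDAYS[sent.weekday()],
    }
```

A single regular expression is unlikely to cover every header variant in a ten-year archive, which is presumably why a heuristic validated on a sample was needed; the dictionaries returned here would then be grouped and stored as in Step 4.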

MINING SOURCE CODE REPOSITORIES: Apache Httpd uses an SVN repository. We used an automated procedure to extract the data from the repository. In the studied period, there are 33,586 commits made by 110 different developers. We built a parser to extract data from commits. From each commit, we extract the "author name", the "date", "time of the day" and "weekday" on which the commit was made, and the files changed by the commit.
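A commit parser of this kind can be sketched as below. This is an illustrative sketch, not the parser built for the study: it assumes the history was exported beforehand with `svn log --xml -v`, and the function name `parse_svn_log` and the record layout are hypothetical. The file extension of each changed path is also collected, since the bubble charts described later group changes by file type.

```python
import os
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def parse_svn_log(xml_text):
    """Turn the XML printed by `svn log --xml -v` into commit records."""
    commits = []
    for entry in ET.fromstring(xml_text).iter("logentry"):
        # SVN reports commit dates in UTC, e.g. 1999-06-15T10:30:00.000000Z
        when = datetime.strptime(
            entry.findtext("date"), "%Y-%m-%dT%H:%M:%S.%fZ"
        ).replace(tzinfo=timezone.utc)
        files = [path.text for path in entry.iter("path")]
        commits.append({
            "revision": entry.get("revision"),
            "author": entry.findtext("author", default=""),
            "date": when.date().isoformat(),
            "hour": when.hour,
            "weekday": WEEKDAYS[when.weekday()],
            "files": files,
            "extensions": sorted({os.path.splitext(f)[1] for f in files}),
        })
    return commits
```

The hour and weekday here are still in UTC; shifting them to each developer's time zone is a separate step.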

INTEGRATOR MODULE: We developed an integrator module to link the data extracted from emails and commits. For this, at least one property must be shared by both data sources, as shown in Figure 1. For example, either an ID or an email address should be equal in both data sources. However, this was not possible, since the data sources in the Apache Httpd project have different user profiles. This was another challenge in our approach. It was necessary to match different kinds of data, e.g., email address with nickname and name with nickname. To perform this task, we adapted the approach proposed by (Canfora et al., 2011). Our integrator module is composed of the following steps:

Figure 1: Module Integrator.

(i) Split Email Sender into Developer Name and Email Address. The email sender field is composed of a name and an email address. These properties had to be split. For example, an entry like "john lennon <john@email.com>" would result in "john lennon" and "john@email.com".


ICEIS2015-17thInternationalConferenceonEnterpriseInformationSystems

306

(ii) Adjust Special Characters and Upper Case Letters. In this step, names are converted to lower case and special characters are removed. For example, an entry like "Mário L. Castrol" would result in "mario l castrol".

(iii) Match Email and Commit Identifiers. To perform this task, we used the Levenshtein distance to measure the similarity between them. After several tests, we adopted the following heuristics: (1) there is a low probability that an email sender and a repository committer are the same person if the Levenshtein score is less than or equal to 0.7; (2) there is a high probability that an email sender and a repository committer are the same person if the Levenshtein score is greater than or equal to 0.9. For scores in the interval ]0.7..0.9[, we applied the following steps.

a) Abbreviation of middle name is ignored. For developers and committers with two or more words in the name, we consider only the first and the last ones. If they then match, we consider them the same person, unless there is another developer with the same first and last name. For example, the entries "mario l castrol" and "mario t. castrol" are considered the same person if there is no other "mario castrol" in either repository.

b) First name abbreviated. When a name is composed of an abbreviated first name plus a last name (e.g., "j. lennon"), we consider it equal to another name composed of the same last name and a first name that starts with the abbreviated letter. As in the previous step, this applies only when there is no other developer with the same characteristics.

c) Only last name. If there is only a last name and the data collection contains one, and only one, last name equal to it, we consider them the same person. For example, the entries "gonzalez" and "johnson gonzalez" are considered the same person if there is no other person with the last name "gonzalez".

d) Only initials. If a name is composed only of initials and they match another name, we consider them the same person. For example, the entries "mwvd" and "michael-willy van dick" are considered the same person. The exception is when another developer has the same initials, e.g., "michel-willy victor dagg".

(iv) Match Email Address. Finally, we compared the nickname in the email address (the part that comes before the '@' symbol) with the committer's name. If they match, we consider them the same person. For example, the entry "jonnylennon@software.net" can be matched to "Jonny Lennon".

Footnote: http://lucene.apache.org
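Steps (i) to (iv) can be sketched as below. This is a minimal sketch under assumptions, not the integrator module itself. Because the thresholds 0.7 and 0.9 lie between 0 and 1, the sketch assumes that the score is a similarity obtained by normalizing the Levenshtein distance by the length of the longer string; the paper does not state the normalization. All function names are hypothetical, and the name heuristics a) to d) are left out.

```python
import re
import unicodedata
from email.utils import parseaddr


def split_sender(sender):
    """Step (i): split 'name <address>' into (name, address)."""
    return parseaddr(sender)


def normalize_name(name):
    """Step (ii): strip accents and special characters, lower-case."""
    ascii_name = (unicodedata.normalize("NFKD", name)
                  .encode("ascii", "ignore").decode("ascii"))
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", ascii_name.lower()).split())


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def classify(score, low=0.7, high=0.9):
    """Step (iii): apply the two thresholds reported in the text."""
    if score >= high:
        return "same"
    if score <= low:
        return "different"
    return "undecided"  # handled by the name heuristics a) to d)


def nickname_matches(address, committer_name):
    """Step (iv): compare the part before '@' with the committer's name."""
    nickname = address.split("@", 1)[0].lower()
    return nickname == normalize_name(committer_name).replace(" ", "")
```

Under this normalization, "mario l castrol" and "mario castrol" differ by two edits over 15 characters, so the pair falls between the two thresholds and would be passed on to heuristic a).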

Even after the whole data-crossing process, it was not possible to find a match between some emails and commits. In those cases, the data was ignored; this corresponded to 29,698 emails. Thus, our analysis considered only 70,781 emails.

INTERACTIVE VISUALIZATIONS: Our approach uses six visualizations to help analyze the extracted data. The first two are heat maps showing the distribution of commits and emails around the world (Figure 3). In these heat maps, the concentration of commits and emails is represented by a gradient of 11 colors varying from green (small amount) to yellow (medium amount) and from yellow to red (large amount). Each color is assigned according to a dynamic scale that ranges from zero to the greatest concentration of points. In Figure 3, one can easily spot the places with more commits. There are also three charts showing the distribution of commits and emails over the years. Figure 2 displays all the commits and emails for the whole period, considering the evolution over time, the hour of the day and the day of the week. Figures 4 and 5 are bubble chart visualizations. We used them to represent the file types that were modified most over time. The larger the bubble, the larger the number of modifications made to a specific file type.
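The heat map color scale can be illustrated with a short sketch. It is an assumption-laden example rather than the tool's code: the exact RGB values and the linear interpolation are assumptions, and the names `build_gradient` and `color_for` are hypothetical. Only the 11 steps, the green-yellow-red ordering and the scale from zero to the largest concentration come from the description above.

```python
def build_gradient(steps=11):
    """Interpolate green -> yellow -> red as (R, G, B) tuples.

    Assumes an odd number of steps >= 3 so that yellow sits in the middle.
    """
    half = (steps - 1) / 2
    colors = []
    for i in range(steps):
        if i <= half:
            # green to yellow: raise the red channel
            red, green = round(255 * i / half), 255
        else:
            # yellow to red: lower the green channel
            red, green = 255, round(255 * (steps - 1 - i) / half)
        colors.append((red, green, 0))
    return colors


def color_for(count, max_count, gradient):
    """Pick a color on a dynamic scale running from 0 to max_count."""
    if max_count <= 0:
        return gradient[0]
    index = round((len(gradient) - 1) * count / max_count)
    return gradient[min(max(index, 0), len(gradient) - 1)]
```

Because the scale is recomputed from the current maximum, the same location can change color when the period or developer filter changes the data being displayed.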

Besides the graphics, the tool also provides some interaction mechanisms that allow us to filter data. The first is a range slider that filters data by a specific period. The second is a list from which we can choose a specific developer to filter by. There are also the maps' built-in interactions, such as zooming and dragging.

After that, we retrieved the geographic information (latitude/longitude and time offset) for each committer, aiming to know the origin of commits and emails. The Apache Httpd project does not have this data for all developers. It provides this information only for the core committers (the ones who contribute most to the project).

In these cases, the Apache Httpd project provides a page with complementary information about them. Unfortunately, core committers represent only 63 of the 110 developers. For the others, we had to retrieve their geographic information manually. Developers also have different time offsets. This raises another issue, since the time zone must be considered when collecting the weekday and time of each commit. We therefore obtained each developer's time zone and adjusted the times relative to the Apache server time.
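The time adjustment can be sketched as follows. This is a sketch under assumptions: it assumes that server timestamps are in UTC and that each developer is described by a fixed offset in hours (so daylight saving time is ignored), and it converts server time to the developer's local time. The function name `to_local` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def to_local(server_time_utc, utc_offset_hours):
    """Shift a UTC timestamp to a developer's local hour and weekday."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    local = server_time_utc.astimezone(local_zone)
    return {"hour": local.hour, "weekday": WEEKDAYS[local.weekday()]}


# A commit recorded early on a Tuesday (UTC) by a developer at UTC-8
commit_time = datetime(1999, 6, 15, 2, 30, tzinfo=timezone.utc)
local_view = to_local(commit_time, -8)
```

For a developer eight hours behind UTC, this commit falls on the previous evening in local time, which shows why the weekday and hour-of-day charts depend on the offset.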

After retrieving geographic information, we found 27 more profiles, totaling 90 of the 110 available developers. We decided to remove the commits from the developers for whom we could not find geographic information. This reduced the number of analyzed commits from 33,586 to 31,611.

Figure 2: Interactions over time between emails and commits of developer population.

Footnote (range slider): http://jqueryui.com/slider

4 RESULTS AND DATA ANALYSIS

The collected data form a rich set of information. Nonetheless, data extracted from software repositories are generally too difficult to analyze in the state in which they were stored (Mazza, 2009). Thus, we decided to use visualizations to reorganize them in such a manner that users can easily understand the whole database. We now discuss the results of this study. To answer the research questions, we analyzed the data taking into consideration (i) the relation between emails and commits of the Apache Project developers, (ii) the beginning of the Apache Project, and (iii) OSS developers' contributions and Preferred Representational Systems.

i. Relation between Emails and Commits of the Apache Project Community: By using the period filter to generate heat maps over time, we perceived that the heat zones (regions where contributions were made) tended to appear first in the emails map and afterwards in the commits map. This is evidence that developers in this project first interact on the email list and afterwards commit code to the repository.

We could confirm this behavior on the Apache Httpd website. According to the site, changes to the code are proposed and voted on in the mailing list, and only after they are approved are they committed to the repository. On top of that, we could also identify that the regions with more participation in the email list are also the regions with more participation in the code repository (see Figure 3). An exception is the behavior of the Japanese developers. In this case, there is a considerable number of commits (bottom of Figure 3) but low participation in the discussion mailing list (top of Figure 3). This may suggest an introverted behavior due to cultural factors.

Figure 3: Heat maps showing amount of (a) emails and (b) commits.

We also perceived that there is a correlation between email and commit timestamps. Developers normally commit code and take part in list discussions at the same times of day and on the same weekdays. Figure 2 shows the interactions over time between emails and commits of the Apache Project developers.
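The correlation above was perceived visually. One hypothetical way to quantify it, not used in this study, is to compute the Pearson coefficient between email and commit counts per weekday or per hour. The sketch below assumes such counts are already available; the sample numbers are made up for illustration and are not results from the Apache data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / sqrt(var_x * var_y)


# Made-up counts for Monday..Sunday, for illustration only
emails_per_weekday = [120, 135, 140, 138, 110, 40, 35]
commits_per_weekday = [60, 70, 72, 69, 55, 18, 15]
coefficient = pearson(emails_per_weekday, commits_per_weekday)
```

A coefficient close to 1 would support the visual impression that both activities rise and fall together; it would not by itself show that one precedes the other.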

ii. The Apache Project Beginning: Carefully analyzing the evolution of the data in Figure 2, we can see that the discussion in the email list started a year before the first commits. This is an interesting behavior, since the developers had time to discuss before starting the implementation. However, it is common to observe OSS projects starting the other way around. First, a project is created with a few developers. They start to commit and create the first release. After that, users start to use it. As the software is used, bug fixes and requests for new features arise, so users use a bug tracking system to report the issues. Only then do the developers start the discussions in the email list. We decided to investigate why Apache Httpd evolved in this way. We looked at the website and discovered that Apache Httpd was a continuation of NCSA Httpd, whose development stopped when Rob McCool left NCSA in 1994. A group of webmasters then started to develop their own extensions and bug fixes. In 1995, they decided to join all these features and bug fixes in a single distribution, and the Apache Group was created.

iii. OSS Developers' Contribution and Preferred Representational Systems: The PRS is the system that a person tends to use more than the others to create his/her internal representation (Colaço et al., 2010). In this respect, we assume that developers with a kinesthetic profile have more contributions related to code files (emotions and physical experience, holding and doing practical hands-on experiences), while developers with an auditory or visual profile have more contributions related to documentation and architecture (creation of internal images/sounds and the use of seen or observed things, including diagrams, demonstrations, flip-charts, sound reminders, etc.). To confirm or refute the findings presented by Colaço et al. (Colaço et al., 2010), we proceeded with the analysis for each developer as follows.

Dev A and Dev C had a dominant kinesthetic profile (Colaço et al., 2010). This suggests that these developers should work directly with code. We can check this in Figure 4 (Dev A and Dev C): Dev A has many commits changing header files (.h) and C files (.c). Dev B was identified as having a visual profile (Colaço et al., 2010), which suggests contributing more to documentation and architecture. However, we found that this developer has made really valuable contributions in commits. This is evidence that he also contributes source code to the project, a fact consistent with his second-highest score (kinesthetic) presented by (Colaço et al., 2010). Among all the developers analyzed, he is the one with the most contributions to code files (see Figure 4, Dev B).

Dev D has an auditory profile and also a strong visual profile. Furthermore, he/she is the one with the most balanced scores across the different profiles. Figure 4 (Dev D) shows the distribution of files changed by this developer. He/she was the developer who contributed to the most different file types. In addition, we discovered that this developer contributes to the project documentation, translations and so on. This was confirmed by the predominance of contributions involving XML files (.xml), HTML files (.html) and image files (.gif and .png).

Colaço et al. (Colaço et al., 2010) also created a Cluster composed of all developers except the top committers (Dev A, B, C and D). In their study, the Cluster has a kinesthetic profile, which was also confirmed by our study (Figure 5): there are many commits made by these developers related to code files (.h and .c).

5 THREATS TO VALIDITY

Apache Httpd is one of the largest and most mature OSS projects. As shown in Section 3, we faced several challenges in the process used to analyze its data. We tried to overcome these issues by following approaches found in the literature. However, there are still threats to validity.

We do not consider the number of developers analyzed sufficient to generalize these results to other OSS projects. So, our approach needs further investigation to ensure external validity.

Figure 4: Bubble chart: Apache project's contributions.

We obtained the geographic locations from public profiles. These data may not represent the actual residence of each developer at the moment they contributed to the project, so we assumed that commits and emails were sent from the developer's listed location. To reduce this threat to internal validity, we tried to gather additional information through a questionnaire sent via email to some developers but, unfortunately, we did not receive many replies.

Figure 5: Bubble Chart: Clusters’ Contributions.

After retrieving the geographic information and processing the data, some commits and e-mails were disregarded, totaling 18.8% of developers, 5.88% of commits, and 29.55% of e-mails analyzed. Aiming to reduce this threat, we performed a deep qualitative analysis to uncover geographic information. In the end, the discarded data represent a small percentage of our sample, which does not compromise our analysis.

AnalyzingDistributionsofEmailsandCommitsfromOSSContributorsthroughMiningSoftwareRepositories-An

ExploratoryStudy

309

6 CONCLUSION AND FURTHER WORK

In this paper, we presented a useful and innovative approach that extracts information from two important software project data sources. We mined and tried to match mailing list and source code repository data. This approach can be used to discover hidden behavioral patterns in unstructured data from software repositories. We also believe that OSS leaders can use our approach to increase developers’ contributions or to retain contributors in their projects. OSS managers can also use our approach to split tasks according to each developer’s profile or to track a team’s contributions over time, considering weekdays and periods of the day.

We have evidence that discussion lists and repositories can be used to measure project activity or to predict each other. We now answer the research questions stated in Section 3. Regarding RQ1, we can confirm that commits and emails follow the same distribution pattern throughout the evolution of Apache. With respect to RQ2, our analysis confirmed the findings discussed by Colaço et al. (2010) for developers A, C, D, and the Cluster, and refuted those for developer B. However, we found that this developer has made really valuable contributions in commits; this situation was also addressed by Colaço et al. (2010).
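One simple way to put a number on how closely two activity distributions follow each other is a Pearson correlation over matching time buckets. The sketch below is not taken from this study: the weekday counts are hypothetical, and the choice of Pearson correlation and the function name `pearson` are assumptions made only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical counts for Monday..Sunday; not data from the study.
commits_per_weekday = [120, 135, 140, 130, 110, 40, 35]
emails_per_weekday = [900, 980, 1010, 950, 800, 300, 280]

r = pearson(commits_per_weekday, emails_per_weekday)
print(f"Pearson r between weekday commit and email counts: {r:.3f}")
```

A coefficient close to 1 would indicate that the two series rise and fall together across the buckets. It says nothing about causation, and it would complement rather than replace a visual comparison of the distributions.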

Our future work will address three key issues: (1) improving our approach by extracting other relevant data from other OSS projects, which is in progress; (2) extending this study to mine PostgreSQL emails and commits, aiming to compare the results with the findings of (Colaço et al., 2012); and (3) developing new interactive visualizations.

REFERENCES

Canfora, G., Cerulo, L., Cimitile, M., and Di Penta, M. (2011). Social interactions around cross-system bug fixings: The case of FreeBSD and OpenBSD. In MSR, pages 143–152.

Colaço, M., Mendonça, M., and Paulo Henrique, M. F., and Corumba, D. (2012). A neurolinguistic method for identifying OSS developers’ context-specific preferred representational systems. Pages 112–121.

Colaço, M., Mendonça, M., Farias, M., and Henrique, P. (2010). OSS developers context-specific preferred representational systems: A initial neurolinguistic text analysis of the Apache mailing list. MSR, pages 126–129.

D’Ambros, M., Lanza, M., and Robbes, R. (2010). Commit 2.0. In WW2SE, pages 14–19. ACM.

Eyolfson, J., Tan, L., and Lam, P. (2011). Do time of day and developer experience affect commit bugginess? In Proceedings of the 8th Working Conference on Mining Software Repositories, MSR, pages 153–162.

Farias, M. A. F., Ortins, P., Novais, R., Colaço, M. J., and Mendonça, M. (2014). Recovering valuable information behaviour from OSS contributors: An exploratory study. In SEKE, pages 474–478.

Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. Commun. ACM, 39(11):27–34.

Gill, A. J. and Oberlander, J. (2003). Perception of e-mail personality at zero-acquaintance: Extraversion takes care of itself; neuroticism is a worry.

Heller, B., Marschner, E., Rosenfeld, E., and Heer, J. (2011). Visualizing collaboration and influence in the open-source software community. In MSR, pages 223–226.

Lanza, M. and Ducasse, S. (2003). Polymetric views-a lightweight visual approach to reverse engineering. IEEE TSE, 29(9):782–795.

Lanza, M., Marinescu, R., and Ducasse, S. (2005). Object-Oriented Metrics in Practice.

Licorish, S. A. and MacDonell, S. G. (2014). Combining text mining and visualization techniques to study teams’ behavioral processes. In MUD, pages 16–20.

Mazza, R. (2009). Introduction to Information Visualization.

Müller, C., Reina, G., Burch, M., and Weiskopf, D. (2010). Subversion statistics sifter. In ICAVC, pages 447–457. Springer-Verlag.

Murgia, A., Tourani, P., Adams, B., and Ortu, M. (2014). Do developers feel emotions? an exploratory analysis of emotions in software artifacts. In MSR, pages 262–271. ACM.

NETCRAFT (2013). Web Server Survey. NetCraft Website. http://news.netcraft.com/archives/2013/06/06/june-2013-web-server-survey-3.html/.

Novais, R., Nunes, C., Garcia, A., and Mendonça, M. (2013a). Sourceminer evolution: A tool for supporting feature evolution comprehension. In ICSM, pages 508–511.

Novais, R. L., Torres, A., Mendes, T. S., Mendonça, M., and Zazworka, N. (2013b). Software evolution visualization: A systematic mapping study. IST, 55(11):1860–1883.

Pattison, D. S., Bird, C. A., and Devanbu, P. T. (2008). Talk and work: A preliminary report. In MSR, pages 113–116. ACM.

Rigby, P. C. and Hassan, A. E. (2007). What can OSS mailing lists tell us? a preliminary psychometric text analysis of the Apache developer mailing list. In MSR. IEEE Computer Society.

Sjoberg, D., Yamashita, A., Anda, B., Mockus, A., and Dyba, T. (2013). Quantifying the effect of code smells on maintenance effort. TSE, 39(8):1144–1156.

Witte, R., Li, Q., Zhang, Y., and Rilling, J. (2008). Text mining and software engineering: an integrated source code and document analysis approach. Soft. IET, 2(1):3–16.

Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., and Wesslén, A. (2012). Experimentation in Software Engineering: An Introduction. Springer.

ICEIS2015-17thInternationalConferenceonEnterpriseInformationSystems
