the Dynegy energy company fell through and Enron
filed for Chapter 11 bankruptcy on December 2, 2001
(McLean, Elkind, 2003). The challenge was how to
classify this information in a meaningful way.
One key focus of this analysis has been the extraction
of patterns and meanings from these datasets. Pattern
discovery is particularly difficult in such datasets
because the patterns are usually small in scale and
hard to pick out against the background of normal
everyday behavior. This creates difficult new problems
for analysis techniques: pragmatic problems, caused
by the sheer size and complexity of the data, and
discrimination problems, that is, determining when some
small variation in the structure of the data is
potentially interesting. The situation is further
complicated by the fact that such data is inherently
messy, reflecting the vast array of original data
sources (e.g., news plus web plus email), biases in
data collection, and intentional ambiguities (such as
false identities).
All of the techniques described in this paper are
general. They could equally well have been applied
to counter-narcotics, counter-terrorism, money
laundering, or other activities. Thus, although the
techniques are used on Enron, they are equally
applicable to analyzing virtual data generated by
simulations and real data extracted from various sources.
That said, the Enron corpus has its own unique
difficulties and features. The data is time-stamped,
but actors have multiple aliases (email accounts),
many messages are duplicated, and so on. The sheer
volume of data cleaning required is immense.
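The two cleaning problems named above, multiple aliases per actor and duplicated messages, can be illustrated with a minimal sketch. It is not the preprocessing used in this paper: the Message record, the ALIASES table, and the example.com addresses are hypothetical, and treating two messages as duplicates when sender, recipients, timestamp, and body all match is an assumption.

```python
from collections import namedtuple

# Hypothetical message record; the real corpus carries many more header fields.
Message = namedtuple("Message", "sender recipients timestamp body")

# Hypothetical alias table mapping each known address to one canonical actor.
ALIASES = {
    "j.doe@example.com": "jane.doe",
    "jdoe@example.com": "jane.doe",
}

def canonical(address, aliases=ALIASES):
    """Return the canonical actor for an address (the normalized address if unknown)."""
    key = address.strip().lower()
    return aliases.get(key, key)

def clean(messages, aliases=ALIASES):
    """Resolve aliases and drop messages that duplicate an earlier one."""
    seen = set()
    result = []
    for m in messages:
        sender = canonical(m.sender, aliases)
        recipients = tuple(sorted({canonical(r, aliases) for r in m.recipients}))
        # Assumption: identical sender, recipients, timestamp and body means duplicate.
        key = (sender, recipients, m.timestamp, m.body.strip())
        if key in seen:
            continue
        seen.add(key)
        result.append(Message(sender, recipients, m.timestamp, m.body))
    return result
```

Resolving aliases before the duplicate check matters here: two copies of one message sent from different accounts of the same person are only recognized as duplicates once both senders map to the same actor.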
Within Enron, the questions asked include, but are not
limited to, the following. What do groups look like?
What is the inter-organizational profile of a company
as it moves toward crisis? How does message traffic
change over a corporate lifetime? Can communities
of interest be identified? Which people are important
in these communities? These and many other
complex real-world problems can be addressed by
analyzing large, complex, and messy datasets such as
the Enron mail corpus. There are two broad kinds of
analysis that can help in addressing such problems.
The first looks at the properties of individual objects,
perhaps people or messages or journeys, and tries to
detect those that are anomalous in some useful way.
The second looks at the relationships between objects,
and tries to find patterns in their connections that are
anomalous. Similarly, there are two broad kinds of
approaches. The first focuses on streams of data and
tries to locate anomalies as new data arrives. The
second treats the data as though it were a single block
in time, a snapshot of the world, and tries to locate
anomalies within this snapshot.
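The difference between the stream and snapshot approaches can be made concrete with a small sketch. It is an illustration only, not a technique from this paper: the input is assumed to be a list of per-period message counts, anomalies are assumed to be values more than a fixed number of standard deviations from the mean, and the function names, threshold, and warm-up length are hypothetical.

```python
import math
from statistics import mean, pstdev

def snapshot_anomalies(counts, threshold=2.0):
    """Snapshot view: score every value against statistics of the whole series."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        return []
    return [i for i, c in enumerate(counts) if abs(c - mu) / sigma > threshold]

def stream_anomalies(counts, threshold=2.0, warmup=3):
    """Stream view: score each arriving value against the values seen before it."""
    n, mu, m2 = 0, 0.0, 0.0  # running count, mean, and sum of squared deviations
    flagged = []
    for i, c in enumerate(counts):
        if n >= warmup:
            sigma = math.sqrt(m2 / n)
            if sigma > 0 and abs(c - mu) / sigma > threshold:
                flagged.append(i)
        # Welford's online update of the running mean and squared deviations.
        n += 1
        delta = c - mu
        mu += delta / n
        m2 += delta * (c - mu)
    return flagged
```

The stream version never looks ahead, so it can raise a flag as soon as a value arrives, while the snapshot version needs the whole series but judges every value against the same baseline.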
The paper is organized as follows. Section 2 describes
the state of the art relevant to the Enron corpus.
The detection of communities within a social network
is presented in section 3. The use of the h-index to
measure the importance of a person is described in
section 4. The results of the experimental calculations
are given in section 5. Conclusions and prospects for
further research are described in section 6.
2 STATE OF THE ART
One of the key aspects of the Enron corpus is that
it is a product of email communication. This fact has
several consequences.
2.1 Language Processing
It is clear that email is not quite like either spoken
or formal written communication. Email tends to occupy
a middle ground: less formal than other forms
of writing, but more formal than speech. The Enron
emails provide a chance to investigate empirically
what the language of email is like.
Keila and Skillicorn (Keila, Skillicorn, 2005)
study the structure of the bodies of individual emails
using singular value decomposition and semidiscrete
decomposition. The vocabulary used in emails has
specific features, especially word frequencies, that
differ from those of standard English.
2.2 Structural Patterns
Emails have a sender and one or more receivers, and
so represent a form of connection between people. It
is natural to build various forms of graphs to capture
these connections, and then to see what they can tell
us about how communication works, and how it con-
nects to relationships and to power.
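One simple way to build such a graph is sketched below. It is only an illustration under stated assumptions: each email is assumed to be reduced to a (sender, receivers) pair, edges are directed and weighted by message count, self-addressed copies are ignored, and the function names build_graph and degrees are hypothetical.

```python
from collections import defaultdict

def build_graph(emails):
    """Build a directed, weighted graph {sender: {receiver: message count}}."""
    graph = defaultdict(lambda: defaultdict(int))
    for sender, receivers in emails:
        for receiver in receivers:
            if receiver != sender:  # assumption: ignore self-addressed copies
                graph[sender][receiver] += 1
    return graph

def degrees(graph):
    """Return (out_degree, in_degree) counting distinct correspondents per person."""
    out_deg = {sender: len(nbrs) for sender, nbrs in graph.items()}
    in_deg = defaultdict(int)
    for nbrs in graph.values():
        for receiver in nbrs:
            in_deg[receiver] += 1
    return out_deg, dict(in_deg)
```

Keeping the direction and weight of each edge preserves who writes to whom and how often, which simple measures such as in-degree and out-degree can then summarize per person.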
McCallum et al. (McCallum, Corrada-Emmanuel,
Wang, 2005) combined social network information
extracted from sender-recipient relations with
information on the topics of emails, identified by
statistical analysis of word distributions, in the
Author-Recipient-Topic (ART) model. They extended the
ART model to determine people’s roles (the RART model)
and showed experimentally that this combination of
evidence provides a better prediction of similarities
among people with the same roles than traditional
block modeling. Chapanond, Krishnamoorthy and Yener
(Chapanond, Krishnamoorthy, Yener, 2005) detect social
communities using sender-recipient relationships.