‘panic’ phrases. This contrasts with the training sets
of hundreds or sometimes thousands of phrases that
are more usually seen in applications of CRFs. That
relatively broad categorisations of phrases were able
to approximately reflect the timeline of the
investigation into Enron means the method could be
extended in many ways.
The approach detailed in this study has many
limitations. Selection of phrases corresponding to the
three categories studied was entirely subjective, so
there was a risk of bias in model training.
Additionally, although there were extensive attempts
to clean the dataset, many artifacts of email in its
raw form remain in the corpus (e.g. spam, and multiple
quoting that biases counts). The precise nature of the
association between phrase use and actual events in
Enron’s history can only be guessed at; more
information regarding the detailed course of events
would be required to validate the accuracy and
sensitivity of the association detailed here. The a
priori nature of CRF model training in this instance
virtually guarantees bias.
There are also general limitations in probabilistic
topic models which may affect inferred results. Topic
models are prone to overfitting, in that the way an
individual document’s topic mixture is established is
not robust to the addition of new documents to the
trained corpus. Relatedly, the number of free model
parameters increases linearly with the number of
training documents, making re-training a
computationally expensive exercise.
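The linear growth in parameters can be illustrated with a simple count. The sketch below assumes a pLSA-style topic model, in which every training document carries its own vector of topic proportions. The choice of model, the function name and the example sizes are assumptions made for illustration and are not taken from this study.

```python
def plsa_free_parameters(n_topics: int, vocab_size: int, n_docs: int) -> int:
    """Count free parameters in a pLSA-style topic model (illustrative assumption).

    Each topic is a distribution over the vocabulary (vocab_size - 1 free values),
    and each training document has its own topic mixture (n_topics - 1 free values).
    """
    topic_word = n_topics * (vocab_size - 1)
    doc_topic = n_docs * (n_topics - 1)
    return topic_word + doc_topic


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the trend as documents are added.
    for n_docs in (1_000, 10_000, 100_000):
        total = plsa_free_parameters(n_topics=20, vocab_size=5_000, n_docs=n_docs)
        print(n_docs, total)
```

Under these assumptions the topic-word term is fixed once the vocabulary and number of topics are chosen, so every additional training document adds a further n_topics - 1 parameters.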
Possible extensions to the software are many and
varied. Results from what was a relatively lightly
trained CRF model seemed reasonable, but it was
trained only on binary data with a first-order model.
The use of a higher-order model will likely increase
the precision of phrase tagging as more information
about context is modelled. This may also allow
finer-grained training on more specific phrases, as
‘slacker’, ‘aggressive’, etc. are relatively broad
terms for the language being modelled.
Instead, sub-types of slacker/aggressive/panic
phrases could be tagged. The results of a topic
model could also be used to inform the tagging of
phrases, rather than the a priori method detailed in
this study. Ideally, a formal evaluation of the
predictive accuracy of tagging could be conducted on
non-Enron emails, or on Enron emails using k-fold
cross-validation with attendant measures of fit (e.g.
positive predictive value).
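A minimal sketch of such an evaluation is given below. It is not the procedure used in this study: the function names are hypothetical, `fit` stands in for any routine that trains a tagger and returns a prediction function, and the data are assumed to have been shuffled beforehand because the folds are taken as contiguous blocks.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def k_fold_indices(n_items: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_items) into k contiguous folds; return (train, test) index lists."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    # Earlier folds absorb the remainder so fold sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        splits.append((train, test))
        start += size
    return splits


def positive_predictive_value(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Optional[float]:
    """PPV = TP / (TP + FP); returns None when nothing was predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return tp / (tp + fp) if (tp + fp) else None


def cross_validated_ppv(
    items: Sequence[str],
    labels: Sequence[int],
    fit: Callable[[List[str], List[int]], Callable[[str], int]],
    k: int = 5,
) -> List[Optional[float]]:
    """Train on each training split and score PPV on the held-out fold."""
    scores = []
    for train, test in k_fold_indices(len(items), k):
        predict = fit([items[i] for i in train], [labels[i] for i in train])
        preds = [predict(items[i]) for i in test]
        scores.append(positive_predictive_value([labels[i] for i in test], preds))
    return scores
```

Reporting the per-fold values rather than a single pooled figure shows how much the estimate varies between held-out subsets, which matters when the training set is as small as the one described here.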
5 CONCLUSIONS
The method as detailed provides a broad approach to
the descriptive analysis of email data by tagging
phrases that are semantically interesting. That this
exercise even broadly reflects the timeline of the
investigation validates the use of both a sub-set of
the full Enron corpus and the method used to tag
information of interest. This suggests that the
performance of even a lightly trained model may be
acceptable on a far smaller test set than would be
the case were it exhaustively trained on the full
Enron Email Dataset.
ICAART 2015 - International Conference on Agents and Artificial Intelligence