ted with building a training set for blogs and so-
cial media, while striving for comparable results with
these state-of-the-art systems.
Transfer Learning. Transfer Learning allows the
domains, tasks and distributions for a classifier’s
training and test data to be different (Pan and Yang,
2009). The sub-area of transfer learning most similar
to our work is transductive transfer learning, where
neither the source nor target data is labeled. In this
case, methods are sought to first automatically label
the source data.
Automatic Labeling. Work in several areas
(Tomasic et al., 2007; Fuxman et al., 2009) has sought
to reduce the human labeling effort, achieving
automatic labeling through weak labeling. In
one such work (Tomasic et al., 2007), wild labels
(obtained from observing users) provide the basis for
generating weak labels. As in our work, weak
labels are distinguished from gold labels, which are
generated by a human expert. The weakly-labeled
corpus is used to train machine-learning algorithms
that are capable of predicting the sequence and pa-
rameter values for the actions a user will take on a
new request. In other work, automatic labeling is
accomplished by first defining a set of criteria that a
candidate corpus must satisfy in order to support the
automatic labeling process (Fuxman et al., 2009). To
date, none of the work based on automatic labeling or
transfer learning has considered the task of Epidemic
Intelligence.
5 CONCLUSIONS AND FUTURE WORK
In this paper we have demonstrated that, with our
Cross-Classification framework, it is possible to use
comparable text, such as outbreak reports, as
automatically labeled data for training a classifier
capable of detecting the victim-reporting sentences in
a blog.
As with any automated labeling process, the
examples are subject to noise and error. We investigated
how this noise can be reduced and evaluated the
quality of such weakly labeled sentences using three
properties: Sentence Position, Sentence Length and
Sentence Semantics. With no human labeling effort
and minimal feature engineering, we were able to
build a Cross-Classifier that achieved a precision
as high as 88%. The impact of this work is that the
noisy sentences in blogs, and possibly other types of
social media, can be appropriately filtered to support
epidemic investigation.
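As an illustration only, the following minimal sketch shows how two of the three properties (Sentence Position and Sentence Length) could be used to filter weakly labeled sentences, and how precision could then be computed against gold labels. The class name, function names, and all thresholds are hypothetical and are not taken from our implementation; Sentence Semantics is omitted because it depends on resources not described in this section.

```python
from dataclasses import dataclass


@dataclass
class WeakSentence:
    text: str
    position: int  # 0-based index of the sentence within its source document
    label: int     # weak label: 1 = victim-reporting, 0 = other


# Hypothetical thresholds, chosen only for illustration.
MAX_POSITION = 2
MIN_TOKENS = 5
MAX_TOKENS = 40


def keep_positive(s: WeakSentence) -> bool:
    """Keep a weak positive only if it is near the start and of moderate length."""
    n_tokens = len(s.text.split())
    return s.position <= MAX_POSITION and MIN_TOKENS <= n_tokens <= MAX_TOKENS


def filter_weak_labels(sentences):
    """Drop weak positives that fail the heuristics; keep negatives unchanged."""
    return [s for s in sentences if s.label == 0 or keep_positive(s)]


def precision(predicted, gold):
    """Fraction of predicted positives that are also positive in the gold labels."""
    tp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 0)
    return tp / (tp + fp) if (tp + fp) else 0.0
```

In this sketch only weak positives are filtered, on the assumption that noise in the positive class is the main concern; a symmetric filter for negatives would follow the same pattern.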
Cross-Classification has been shown to be promising
for data in which the topic is rather focused. As future
work, we will apply the approach to more diverse and
topic-drifting blog posts. Further, we seek to
generalize the results presented here by describing the
conditions under which corpora can be considered
comparable. This would help in automatically selecting
the appropriate auxiliary and target corpora for
Cross-Classification. In this work, we have assumed the
presence of a high-quality data set that lends itself to
weak labeling. As further work, we plan to consider
cases in which a Cross-Classifier can be built from
a smaller volume of data, for example, by using a
bootstrapping approach.
REFERENCES
Conway, M., Collier, N., and Doan, S. (2009). Using hedges
to enhance a disease outbreak report text mining sys-
tem. In BioNLP ’09: Proceedings of the Workshop on
BioNLP, pages 142–143, Morristown, NJ, USA. As-
sociation for Computational Linguistics.
Fuxman, A., Kannan, A., Goldberg, A. B., Agrawal, R.,
Tsaparas, P., and Shafer, J. (2009). Improving classi-
fication accuracy using automatically extracted train-
ing data. In KDD ’09: Proceedings of the 15th ACM
SIGKDD international conference on Knowledge dis-
covery and data mining, pages 1145–1154, New York,
NY, USA. ACM.
Hartley, D., Nelson, N., Walters, R., Arthur, R., Yangarber,
R., Madoff, L., Linge, J., Mawudeku, A., Collier, N.,
Brownstein, J., Thinus, G., and Lightfoot, N. (2009).
The landscape of international event-based biosurveil-
lance. Emerging Health Threats.
Lam-Adesina, A. M. and Jones, G. J. F. (2001). Ap-
plying summarization techniques for term selection
in relevance feedback. In Proceedings of the 24th
annual international ACM SIGIR conference on Re-
search and development in information retrieval, SI-
GIR ’01, pages 1–9, New York, NY, USA. ACM.
Moens, M.-F. (2009). Information extraction from blogs. In
Jansen, B. J., Spink, A., and Taksa, I., editors, Hand-
book of Research on Web Log Analysis, pages 469–
487. IGI Global.
Moschitti, A. (2006). Making tree kernels practical for nat-
ural language learning. In Proceedings of the 11th
Conference of the European Chapter of the Associa-
tion for Computational Linguistics.
Pan, S. J. and Yang, Q. (2009). A survey on transfer learn-
ing. IEEE Transactions on Knowledge and Data En-
gineering, 99.
Tomasic, A., Simmons, I., and Zimmerman, J. (2007).
Learning information intent via observation. In WWW
’07: Proceedings of the 16th international conference
on World Wide Web, pages 51–60, New York, NY,
USA. ACM.
Zhang, Y. (2008). Automatic Extraction of Outbreak Infor-
mation from News. PhD thesis, University of Illinois
at Chicago.
WEBIST 2011 - 7th International Conference on Web Information Systems and Technologies