3 TESTS ON CLASSIFICATION
On FedBizOpps (FBO), calls for tenders have been
manually classified according to two classification
schemas, FCS (Federal Supply Code) and NAICS
(North American Industry Classification System,
http://www.census.gov/naics). We can therefore use
them to test the accuracy of classification. In our
test, we only consider the first three digits of
NAICS, i.e. the corresponding sector. There are 92
such categories.
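The reduction of a full NAICS code to its three-digit category can be sketched as follows. This is a minimal illustration, not the code used in our tests: the function name `naics_category` and the example codes are assumptions introduced here, and the codes are not taken from the FBO collection.

```python
from collections import Counter


def naics_category(code: str, digits: int = 3) -> str:
    """Return the leading `digits` digits of a NAICS code as a class label."""
    cleaned = code.strip()
    prefix = cleaned[:digits]
    # Reject codes that are too short or whose prefix is not numeric.
    if len(prefix) < digits or not prefix.isdigit():
        raise ValueError(f"not a usable NAICS code: {code!r}")
    return prefix


# Illustrative codes only; they are not taken from the FBO collection.
example_codes = ["541511", "541512", "332710", "336411"]
labels = [naics_category(code) for code in example_codes]
label_counts = Counter(labels)
```

Here the two codes starting with 541 fall into the same category, so the classifier sees three classes instead of four.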
We collected 21,945 CFTs from FBO, covering
the period of September 2000 to October 2003. This
collection is split into two parts: 60% for training and
40% for testing. We used the Rainbow package
(McCallum, 1996) to perform NB classification.
Table 2 shows a comparison of the classification
results with and without sentence filtering. The
most interesting measure to observe is micro-F1.
The sentence filtering reduces the size of the
whole collection from around 600,000 sentences to
96,811. The results, identified in the table as sent.filt.,
show a strong increase in the micro-F1 measure
(+7.6%). This shows that sentence filtering can be
highly useful for the classification of CFTs: it
removes many procedural sentences that are
not directly related to the subject of the CFT.
Table 2: Classification on FBO.
method      macro-F1  micro-F1
baseline    .3297     .5498
sent.filt.  .3223     .5918 (+7.6%)
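The difference between the two measures in Table 2 can be illustrated with a short sketch. It assumes the standard single-label definitions (macro-F1 averages the per-category F1 scores, micro-F1 pools the counts over all categories); the function names and the toy labels are assumptions made for illustration and are not part of our system. The last function computes a relative change, which is how the +7.6% figure relates to the two micro-F1 values in the table.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(gold, pred):
    """Return (macro-F1, micro-F1) for single-label classification."""
    classes = sorted(set(gold) | set(pred))
    per_class = []
    tp_sum = fp_sum = fn_sum = 0
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        per_class.append(f1(tp, fp, fn))
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    macro = sum(per_class) / len(per_class)  # every category weighs the same
    micro = f1(tp_sum, fp_sum, fn_sum)       # every document weighs the same
    return macro, micro


def relative_change(old: float, new: float) -> float:
    """Relative change from `old` to `new`, as a fraction of `old`."""
    return (new - old) / old


# Toy labels (hypothetical): one frequent category and two rare ones.
gold = ["541", "541", "541", "541", "332", "336"]
pred = ["541", "541", "541", "541", "541", "336"]
macro, micro = macro_micro_f1(gold, pred)

# Micro-F1 values from Table 2: baseline vs. sentence filtering.
micro_gain = relative_change(0.5498, 0.5918)
```

In the toy example the single error falls on a rare category, so macro-F1 is penalized more than micro-F1. This is one reason why the two columns of Table 2 need not move in the same direction.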
4 CONCLUSION
The system we described in this paper has been in
use by our commercial partners, and deployed in
several applications: as an aid for business
opportunities watch, as a CFT search facility for
Canada's metal industry portal, and as a thematic
watch for the travel industry. The system has been
found very useful in all these applications, which
shows that such a system would be of great help to
facilitate the distribution of business information.
We believe that the finding of relevant business
opportunities is the first step to a business success.
This is part of e-Business.
From a technical point of view, our study shows
that sentence filtering brings a strong increase to
classification accuracy (Micro-F1). User/domain
profiles seem to be useful. Their usefulness has been
formally tested in another study (Bai et al. 2007). All
our results indicate that both information extraction
and filtering, and user/domain profiles are highly
useful for retrieving CFTs.
The system can be improved in several aspects:
the translation module can be more precise; we can
use a more effective classification approach such as
SVM. However, the general approach presented here
seems promising for business intelligence.
REFERENCES
Aggarwal, C.C., Al-Garawi, F. and Yu, P.S. Intelligent
crawling on the world wide web with arbitrary
predicates. WWW Conference, 2001.
Bai, J., Nie, J., Cao, G. Using query contexts in
information retrieval. Proc. SIGIR, 2007, to appear.
Betts, M. The future of business intelligence. Computer
World, 14 April 2003.
Brown, P.F., Pietra, S.A.D., Pietra, V.D.J. and Mercer,
R.L. The mathematics of machine translation:
Parameter estimation. Computational Linguistics, 19:
263-312, 1992.
Cai, L. and Hofmann, T. Text categorization by boosting
automatically extracted concepts. Proceedings of
SIGIR, pp. 182-189, 2003.
Chau, M., Zeng, D., Chen, H., Huang, M. and
Hendriawan, D. Design and evaluation of a multi-agent
collaborative web mining system. Decision
Support Systems, 35(1):167-183, 2003.
Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.
GATE: A Framework and Graphical Development
Environment for Robust NLP Tools and Applications.
ACL, 2002.
Jason, D. M., Rennie, Lawrence, Shih, J. T., & Karger, D.
R. Tackling the poor assumptions of Naive Bayes text
classifiers. ICML, 2003.
Maynard, D., Tablan, V., Ursu, C., Cunningham, H., and
Wilks, Y. Named entity recognition from diverse text
types. Recent Advances in Natural Language
Processing, pages 257-274, 2001.
McCallum, A. K. Bow: A toolkit for statistical language
modeling, text retrieval, classification and clustering,
1996. http://www.cs.cmu.edu/~mccallum/bow
Peters, C., Braschler, M., Gonzalo, J. and Kluck, M.,
editors. Advances in Cross-Language Information
Retrieval Systems. Springer, 2003.
Ponte, J. and Croft, W.B. A language modeling approach
to information retrieval. Proceedings of SIGIR, pp.
275-281, 1998.
Soderland, S. Learning information extraction rules for
semi-structured and free text. Machine Learning,
44(1), 1999.
Yang, Y. An evaluation of statistical approaches to text
categorization. Journal of Information Retrieval,
1(1/2):67-88, 1999.
Zhai, C. and Lafferty, J. A Study of Smoothing Methods
for Language Models Applied to Information
Retrieval. Proc. SIGIR, pp. 334-342, 2001.
FACILITATING E-BUSINESS BY RETRIEVING RELEVANT BUSINESS OPPORTUNITIES ON THE INTERNET