
 
3  TESTS ON CLASSIFICATION 
On FedBizOpps (FBO), calls for tenders have been 
manually classified according to two classification 
schemas, FCS (Federal Supply Code) and NAICS 
(North American Industry Classification System, 
http://www.census.gov/naics). We can therefore use 
these manual labels to test the accuracy of 
classification. In our test, we only consider the first 
three digits of NAICS, i.e. the corresponding sector. 
There are 92 such categories. 
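As an illustration of the label reduction described above, the following minimal sketch truncates full NAICS codes to their first three digits and counts the resulting categories. The sample codes and the helper name `naics_sector` are assumptions made for illustration; they are not taken from the FBO collection or from our system.

```python
from collections import Counter


def naics_sector(code: str) -> str:
    """Return the first three digits of a NAICS code (hypothetical helper)."""
    digits = code.strip()
    if len(digits) < 3 or not digits[:3].isdigit():
        raise ValueError(f"not a usable NAICS code: {code!r}")
    return digits[:3]


# Illustrative codes only, not drawn from the FBO data.
sample_codes = ["541511", "541330", "336411", "236220", "336413"]

# Each CFT is assigned the three-digit prefix of its NAICS code as its class.
category_counts = Counter(naics_sector(c) for c in sample_codes)
print(sorted(category_counts.items()))
```

Grouping by prefix in this way is what reduces the full NAICS code space to the 92 categories used in the test.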
We collected 21,945 CFTs from FBO, covering 
the period from September 2000 to October 2003. This 
collection is split into two parts: 60% for training 
and 40% for testing. We used the Rainbow package 
(McCallum, 1996) to perform NB classification. 
Table 2 shows a comparison of the classification 
results with and without sentence filtering. The most 
interesting measure to observe is micro-F1. 
Sentence filtering reduces the size of the 
whole collection from around 600,000 sentences to 
96,811. The results, identified in the table as sent.filt., 
show a strong increase in the micro-F1 measure 
(+7.6%), while macro-F1 remains nearly unchanged 
(.3297 vs. .3223). This shows that sentence filtering 
can be highly useful for the classification of CFTs: it 
removes many procedural sentences that are 
not directly related to the subject of the CFT. 
Table 2: Classification on FBO. 
method      macro-F1   micro-F1 
baseline    .3297      .5498 
sent.filt.  .3223      .5918 (+7.6%) 
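To make the two measures in Table 2 concrete, here is a minimal sketch of how macro-F1 and micro-F1 can be computed, together with the relative gain reported for micro-F1. It assumes single-label classification (one category per CFT); the function names and the toy labels are hypothetical, and this is not the evaluation code used with Rainbow.

```python
from collections import Counter


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(gold, pred):
    """Return (macro-F1, micro-F1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[g] += 1  # true class gets a false negative
    labels = set(gold) | set(pred)
    # Macro: average the per-category F1, every category weighted equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all categories, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


def relative_gain(before: float, after: float) -> float:
    """Relative change of a score with respect to its baseline value."""
    return (after - before) / before


# Toy labels for illustration only (three-digit category strings).
gold = ["541", "541", "541", "336", "236"]
pred = ["541", "541", "336", "336", "541"]
macro, micro = macro_micro_f1(gold, pred)
print(f"macro-F1={macro:.4f} micro-F1={micro:.4f}")

# Micro-F1 values taken from Table 2: baseline vs. sentence filtering.
print(f"{relative_gain(0.5498, 0.5918):+.1%}")
```

Under the single-label assumption, micro-F1 coincides with accuracy and is therefore dominated by frequent categories, whereas macro-F1 gives every category the same weight. This is one possible reading of why the two columns of Table 2 move differently.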
4  CONCLUSION 
The system described in this paper has been in 
use by our commercial partners and deployed in 
several applications: as an aid for business 
opportunity watch, as a CFT search facility for 
Canada's metal industry portal, and as a thematic 
watch for the travel industry. The system has been 
found very useful in all these applications, which 
shows that such a system would be of great help to 
facilitate the distribution of business information. 
We believe that the finding of relevant business 
opportunities is the first step to a business success. 
This is part of e-Business. 
From a technical point of view, our study shows 
that sentence filtering brings a strong increase to 
classification accuracy (micro-F1). User/domain 
profiles seem to be useful. Their usefulness has been 
formally tested in another study (Bai et al. 2007). All 
our results indicate that both information extraction 
and filtering, and user/domain profiles are highly 
useful for retrieving CFTs. 
The system can be improved in several aspects: 
the translation module can be more precise; we can 
use a more effective classification approach such as 
SVM. However, the general approach presented here 
seems promising for business intelligence. 
REFERENCES 
Aggarwal, C.C., Al-Garawi, F. and Yu, P.S. Intelligent 
crawling on the world wide web with arbitrary 
predicates. WWW Conference, 2001. 
Bai, J., Nie, J. and Cao, G. Using query contexts in 
information retrieval. Proc. SIGIR, 2007, to appear. 
Betts, M. The future of business intelligence. Computer 
World, 14 April 2003. 
Brown, P.F., Pietra, S.A.D., Pietra, V.D.J. and Mercer, 
R.L. The mathematics of machine translation: 
Parameter estimation. Computational Linguistics, 19: 
263-312, 1992. 
Cai, L. and Hofmann, T. Text categorization by boosting 
automatically extracted concepts. Proceedings of 
SIGIR, pp. 182-189, 2003. 
Chau, M., Zeng, D., Chen, H., Huang, M. and 
Hendriawan, D. Design and evaluation of a multi-agent 
collaborative web mining system. Decision 
Support Systems, 35(1):167-183, 2003. 
Cunningham, H., Maynard, D., Bontcheva, K. and Tablan, V. 
GATE: A Framework and Graphical Development 
Environment for Robust NLP Tools and Applications. 
ACL, 2002. 
Maynard, D., Tablan, V., Ursu, C., Cunningham, H. and 
Wilks, Y. Named entity recognition from diverse text 
types. Recent Advances in Natural Language 
Processing, pages 257-274, 2001. 
McCallum, A.K. Bow: A toolkit for statistical language 
modeling, text retrieval, classification and clustering, 
1996. http://www.cs.cmu.edu/~mccallum/bow 
Peters, C., Braschler, M., Gonzalo, J. and Kluck, M., 
editors. Advances in Cross-Language Information 
Retrieval Systems. Springer, 2003. 
Ponte, J. and Croft, W.B. A language modeling approach 
to information retrieval. Proceedings of SIGIR, pp. 
275-281, 1998. 
Rennie, J.D.M., Shih, L., Teevan, J. and Karger, D.R. 
Tackling the poor assumptions of Naive Bayes text 
classifiers. ICML, 2003. 
Soderland, S. Learning information extraction rules for 
semi-structured and free text. Machine Learning, 
44(1), 1999. 
Yang, Y. An evaluation of statistical approaches to text 
categorization. Journal of Information Retrieval, 
1(1/2):67-88, 1999. 
Zhai, C. and Lafferty, J. A Study of Smoothing Methods 
for Language Models Applied to Information 
Retrieval. Proc. SIGIR, pp. 334-342, 2001. 