good QAPs is to dissociate the original question from
these clusters.
7 CONCLUSION AND FUTURE WORK
We presented a novel method to extract the rich
question-answering information from community-
based QA web sites and generate a large number of
QAPs. Experiments confirm that our method extracts
information with high accuracy and constructs
high-quality QAPs suitable for chatbots.
We plan to investigate the following issues in future
studies: (1) Improve the ratio of good QAPs by
dissociating certain clusters from the original question.
(2) Generate an aggregate answer from the different
answers in a cluster to a sub-question, instead of just
choosing the best one. (3) Devise a spam detection
mechanism to filter out spam from the data collected
from CQAW sites. (4) Compare the number of clusters
generated by HDP-LDA with the number of clusters
generated by our algorithm.
ACKNOWLEDGEMENTS
This work was supported in part by Eola Solutions
Inc. The authors are grateful to members of the Text
Automation Lab at UMass Lowell for discussions.
REFERENCES
Arasu, A. and Garcia-Molina, H. (2003). Extracting struc-
tured data from web pages. In Proceedings of the 2003
ACM SIGMOD international conference on Manage-
ment of data, pages 337–348. ACM.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent
Dirichlet allocation. Journal of Machine Learning Research,
3(Jan):993–1022.
Chang, C.-H. and Lui, S.-C. (2001). IEPAD: Information
extraction based on pattern discovery. In Proceedings
of the 10th international conference on World Wide
Web, pages 681–688. ACM.
Huang, J., Zhou, M., and Yang, D. (2007). Extracting chat-
bot knowledge from online discussion forums. In IJ-
CAI, volume 7, pages 423–428.
Jin, O., Liu, N. N., Zhao, K., Yu, Y., and Yang, Q. (2011).
Transferring topical knowledge from auxiliary long
texts for short text clustering. In Proceedings of the
20th ACM international conference on Information
and knowledge management, pages 775–784. ACM.
Liu, B., Grossman, R., and Zhai, Y. (2003). Mining data
records in web pages. In Proceedings of the 9th ACM
SIGKDD international conference on Knowledge dis-
covery and data mining, pages 601–606. ACM.
Mihalcea, R. and Tarau, P. (2004). TextRank: Bringing order
into text. In EMNLP, volume 4, pages 404–411.
Reis, D. d. C., Golgher, P. B., Silva, A. S., and Laender, A.
(2004). Automatic web news extraction using tree edit
distance. In Proceedings of the 13th international con-
ference on World Wide Web, pages 502–511. ACM.
Surdeanu, M., Ciaramita, M., and Zaragoza, H. (2008).
Learning to rank answers on large online QA collections.
In 2008 Annual Meeting of the Association for
Computational Linguistics, volume 8, pages 719–727.
Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M.
(2005). Sharing clusters among related groups:
Hierarchical Dirichlet processes. In Advances in neural
information processing systems, pages 1385–1392.
Wang, B., Liu, B., Sun, C., Wang, X., and Sun, L. (2009).
Extracting Chinese question-answer pairs from online
forums. In Systems, Man and Cybernetics, 2009. SMC
2009. IEEE International Conference on, pages 1159–
1164. IEEE.
Wang, J. and Wang, J. (2015). qRead: A fast and accurate
article extraction method from web pages using parti-
tion features optimizations. In Knowledge Discovery,
Knowledge Engineering and Knowledge Management
(IC3K), 2015 7th International Joint Conference on,
volume 1, pages 364–371. IEEE.
Yang, J.-M., Cai, R., Wang, Y., Zhu, J., Zhang, L., and Ma,
W.-Y. (2009). Incorporating site-level knowledge to
extract structured data from web forums. In Proceed-
ings of the 18th international conference on World
Wide Web, pages 181–190. ACM.
Yu, H., Han, J., and Chang, K. C.-C. (2002). PEBL: Positive
example based learning for web page classification us-
ing SVM. In Proceedings of the 8th ACM SIGKDD
International conference on Knowledge discovery and
data mining, pages 239–248. ACM.
Zhang, C. and Wang, J. (2017). RQAS: A rapid QA scheme
with exponential elevations of keyword rankings. In
Knowledge Discovery, Knowledge Engineering and
Knowledge Management (IC3K), 2017 9th Interna-
tional Joint Conference on.