
GPT-3.5 and GPT-4. This suggests that distilling knowledge from larger models holds promise for practical language model applications, and it underscores the importance of further analyzing how the synthetic data enhanced our model. However, given the broader implications, we chose not to open-source the synthetic dataset. Our primary concern is that these data could inadvertently contaminate future web-scraping efforts, compromising the quality of the data available for training upcoming language models.
REFERENCES
Amati, G. (2009). BM25. In Encyclopedia of Database Systems, pages 257–260. Springer US, Boston, MA.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. (2020). Language models are few-shot learners.
Chen, S., Neves, L., and Solorio, T. (2022). Style transfer as data augmentation: A case study on named entity recognition. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1827–1841, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Chen, X., Zhao, Z., Chen, L., Ji, J., Zhang, D., Luo, A., Xiong, Y., and Yu, K. (2021). WebSRC: A dataset for web-based structural reading comprehension. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4173–4185, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S. S., Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Castro-Ros, A., Pellat, M., Robinson, K., Valter, D., Narang, S., Mishra, G., Yu, A., Zhao, V., Huang, Y., Dai, A., Yu, H., Petrov, S., Chi, E. H., Dean, J., Devlin, J., Roberts, A., Zhou, D., Le, Q. V., and Wei, J. (2022). Scaling instruction-finetuned language models.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding.
Hao, Q. (2011). Structured web data extraction dataset (SWDE).
Li, J., Xu, Y., Cui, L., and Wei, F. (2022). MarkupLM: Pre-training of text and markup language for visually rich document understanding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6078–6087, Dublin, Ireland. Association for Computational Linguistics.
Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
Møller, A. G., Dalsgaard, J. A., Pera, A., and Aiello, L. M. (2023). Is a prompt and a few samples all you need? Using GPT-4 for data augmentation in low-resource classification tasks.
Myers, D. and McGuffee, J. W. (2015). Choosing Scrapy. J. Comput. Sci. Coll., 31(1):83–89.
Nakayama, H., Kubo, T., Kamura, J., Taniguchi, Y., and Liang, X. (2018). doccano: Text annotation tool for human. Software available from https://github.com/doccano/doccano.
OpenAI (2023). GPT-4 technical report.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318, USA. Association for Computational Linguistics.
Sharma, M. (2014). Selenium tool: A web-based automation testing framework.
Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. (2023). Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 6000–6010, Red Hook, NY, USA. Curran Associates Inc.
Vural, A. G., Cambazoglu, B. B., and Karagoz, P. (2014). Sentiment-focused web crawling. ACM Trans. Web, 8(4).
Wang, Q., Fang, Y., Ravula, A., Feng, F., Quan, X., and Liu, D. (2022). WebFormer: The web-page transformer for structure information extraction. In Proceedings of the ACM Web Conference 2022, WWW ’22, pages 3124–3133, New York, NY, USA. Association for Computing Machinery.
Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. (2023). Self-Instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, Toronto, Canada. Association for Computational Linguistics.
Xie, C., Huang, W., Liang, J., Huang, C., and Xiao, Y. (2021). WebKE: Knowledge extraction from semi-structured web with pre-trained markup language model. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, CIKM ’21, pages 2211–2220, New York, NY, USA. Association for Computing Machinery.