Table 6: Accuracy scores on two datasets showing the effectiveness of Randout-KD. The baseline model is data2vec (Baevski et al., 2022).

Dataset                         Baseline   Randout   KD      Randout-KD
CODA-19 (Huang et al., 2020)    73.27      73.62     73.62   73.82
RHMD (Naseem et al., 2022a)     78.17      78.90     76.31   79.37
learning rate, no. of epochs, etc., for a fair comparison.
6 CONCLUSIONS
In this paper, we presented “Randout-KD”, a method for finetuning foundation models that combines a new noise injection technique with knowledge distillation. During finetuning of the student model, we stochastically replace hidden-representation units of tokens with random noise. We evaluated the proposed method on two multi-class text classification datasets, where it improved performance over the baseline models on both. In future work, we will explore this method with other variants of knowledge distillation.
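For concreteness, the following is a minimal PyTorch-style sketch of the two ingredients summarized above: stochastic replacement of hidden-representation units with random noise, and a standard soft-label distillation objective. The function names, replacement probability, noise scale, temperature, and loss weighting are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn.functional as F

def randout_noise(hidden_states, replace_prob=0.1, noise_std=1.0):
    """Stochastically replace units of the token hidden representations
    with random noise (illustrative probability and scale)."""
    # hidden_states: (batch, seq_len, hidden_dim)
    mask = (torch.rand_like(hidden_states) < replace_prob).float()
    noise = torch.randn_like(hidden_states) * noise_std
    return (1.0 - mask) * hidden_states + mask * noise

def kd_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Standard knowledge-distillation objective: cross-entropy on the
    gold labels plus KL divergence to the teacher's softened outputs."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kl
```

In this sketch, the perturbation is applied only to the student's hidden states at training time, while the teacher's predictions are computed without perturbation.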
REFERENCES
Afzal, M., Alam, F., Malik, K. M., Malik, G. M., et al. (2020). Clinical context-aware biomedical text summarization using deep neural network: model development and validation. Journal of Medical Internet Research, 22(10):e19810.
Baevski, A., Hsu, W.-N., Xu, Q., Babu, A., Gu, J., and Auli, M. (2022). data2vec: A general framework for self-supervised learning in speech, vision and language. arXiv preprint arXiv:2202.03555.
Beltagy, I., Lo, K., and Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676.
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
Bucilua, C., Caruana, R., and Niculescu-Mizil, A. (2006). Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535–541.
Bulò, S. R., Porzi, L., and Kontschieder, P. (2016). Dropout distillation. In International Conference on Machine Learning, pages 99–107. PMLR.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama, N., Liu, X., Naumann, T., Gao, J., and Poon, H. (2021). Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23.
He, R., Cai, S., Ming, Z., and Zhang, J. (2022). Weighted self distillation for Chinese word segmentation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1757–1770.
Hinton, G., Vinyals, O., Dean, J., et al. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2(7).
Huang, T.-H., Huang, C.-Y., Ding, C.-K. C., Hsu, Y.-C., and Giles, C. L. (2020). CODA-19: Using a non-expert crowd to annotate research aspects on 10,000+ abstracts in the COVID-19 open research dataset. arXiv preprint arXiv:2005.02367.
Ibrahim, M. A., Khan, M. U. G., Mehmood, F., Asim, M. N., and Mahmood, W. (2021). GHS-NET a generic hybridized shallow neural network for multi-label biomedical text classification. Journal of Biomedical Informatics, 116:103699.
Khan, P. I., Razzak, I., Dengel, A., and Ahmed, S. (2022a). A novel approach to train diverse types of language models for health mention classification of tweets. arXiv preprint arXiv:2204.06337.
Khan, P. I., Siddiqui, S. A., Razzak, I., Dengel, A., and Ahmed, S. (2022b). Improving health mention classification of social media content using contrastive adversarial training. IEEE Access, 10:87900–87910.
Kim, Y. and Rush, A. M. (2016). Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947.
Kitada, S. and Iyatomi, H. (2021). Attention meets perturbations: Robust and interpretable attention with adversarial training. IEEE Access, 9:92974–92985.
Lee, C., Cho, K., and Kang, W. (2019). Mixout: Effective regularization to finetune large-scale pretrained language models. arXiv preprint arXiv:1909.11299.
Liu, X., He, P., Chen, W., and Gao, J. (2019a). Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv preprint arXiv:1904.09482.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019b). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Liu, Y., Shen, S., and Lapata, M. (2020). Noisy self-knowledge distillation for text summarization. arXiv preprint arXiv:2009.07032.
Naseem, U., Khushi, M., Kim, J., and Dunn, A. G. (2022a). RHMD: A real-world dataset for health mention classification