
task following Shao et al. (2023), using a dataset synthesized for this purpose. Our model classifies unknown acoustic patterns into six anomaly classes, and the student-teacher transformer enables it to learn long-term temporal dependencies. When trained and fine-tuned on the synthetic dataset generated from real traffic audio, the model achieved an overall accuracy of 99.33% on unseen audio. As an added advantage, its performance degrades gracefully with the distance of the anomaly source.
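The student-teacher training described above can be illustrated with a minimal sketch. This is not the paper's implementation: the temperature value, function names, and the use of a plain KL-divergence matching term over the six anomaly classes are illustrative assumptions about a generic distillation setup.

```python
import math

NUM_CLASSES = 6  # six anomaly classes, as in the paper


def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution (optionally softened)."""
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student class distributions.

    In a student-teacher setup, this term pushes the student to mimic the
    teacher's softened predictions; it would typically be combined with a
    cross-entropy term on ground-truth labels during fine-tuning.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student))
```

When the student's logits match the teacher's, the loss is zero; any disagreement yields a positive penalty, which is what drives the student toward the teacher's learned representation.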
REFERENCES
Dahua security network cameras. https://zenodo.org/records/3519845. Accessed: October 22, 2024.
Dahua security network cameras. https://www.dahuasecurity.com/in/products/All-Products/Network-Cameras/WizMind-Series/5-Series/2MP/DH-IPC-HF5231EP-E. Accessed: October 22, 2024.
DCASE community. https://dcase.community/. Accessed: October 22, 2024.
DESED: Domestic environment sound event detection. https://project.inria.fr/desed/. Accessed: October 22, 2024.
MIVIA audio events dataset. https://mivia.unisa.it/datasets/audio-analysis/mivia-audio-events. Accessed: October 22, 2024.
MIVIA road audio events dataset. https://mivia.unisa.it/datasets/audio-analysis/mivia-road-audio-events-data-set. Accessed: October 22, 2024.
YAMNet. https://www.tensorflow.org/hub/tutorials/yamnet. Accessed: October 22, 2024.
(2017). DCASE 2017 challenge: Rare sound event detection. https://dcase.community/challenge2017/task-rare-sound-event-detection. Accessed: October 22, 2024.
(2023). DCASE 2023 challenge: Sound event detection with weak and soft labels. https://dcase.community/challenge2023/task-sound-event-detection-with-weak-and-soft-labels. Accessed: October 22, 2024.
Bilen, C., Ferroni, G., Tuveri, F., Azcarreta, J., and Krstulovic, S. (2020). A framework for the robust evaluation of sound event detection. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 61–65.
Chen, S., Wu, Y., Wang, C., Liu, S., Tompkins, D., Chen, Z., Che, W., Yu, X., and Wei, F. (2023). BEATs: Audio pre-training with acoustic tokenizers. In International Conference on Machine Learning (ICML), pages 5178–5193.
Foggia, P., Petkov, N., Saggese, A., Strisciuglio, N., and Vento, M. (2016). Audio surveillance of roads: A system for detecting anomalous sounds. IEEE Transactions on Intelligent Transportation Systems, 17(1):279–288.
Gemmeke, J. F., Ellis, D. P. W., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., and Ritter, M. (2017). Audio Set: An ontology and human-labeled dataset for audio events. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 776–780.
Giri, R., Tenneti, S. V., Cheng, F., Helwani, K., Isik, U., and Krishnaswamy, A. (2020). Self-supervised classification for detecting anomalous sounds. In Proc. DCASE, pages 46–50.
Howard, A., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
Ito, A., Aiba, A., Ito, M., and Makino, S. (2009). Detection of abnormal sound using multi-stage GMM for surveillance microphone. In Proc. IAS, volume 1, pages 733–736.
Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., and Krishnan, D. (2020). Supervised contrastive learning. In Advances in Neural Information Processing Systems.
Koizumi, Y., Saito, S., Uematsu, H., Harada, N., and Imoto, K. (2019a). ToyADMOS: A dataset of miniature-machine operating sounds for anomalous sound detection. In Proc. of the Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).
Koizumi, Y., Saito, S., Uematsu, H., Kawachi, Y., and Harada, N. (2019b). Unsupervised detection of anomalous sound based on deep learning and the Neyman–Pearson lemma. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(1):212–224.
Koizumi, Y., Yasuda, M., Murata, S., Saito, S., Uematsu, H., and Harada, N. (2020). SPIDERnet: Attention network for one-shot anomaly detection in sounds. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 281–285, Barcelona, Spain.
Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., and Plumbley, M. D. (2020). PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. arXiv preprint arXiv:1912.10211.
Li, X., Shao, N., and Li, X. (2023). Self-supervised audio teacher-student transformer for both clip-level and frame-level tasks. arXiv preprint arXiv:2306.04186.
Neyman, J. and Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philos. Trans. Roy. Soc. London, 231:289–337.
Purohit, H., Tanabe, R., Ichige, K., Endo, T., Nikaido, Y., Suefusa, K., and Kawaguchi, Y. (2019). MIMII dataset: Sound dataset for malfunctioning industrial machine investigation and inspection. In Proc. Detection Classification Acoustic Scenes Events Workshop (DCASE), page 209.
ICPRAM 2025 - 14th International Conference on Pattern Recognition Applications and Methods