5 CONCLUSION
Video classification is becoming increasingly relevant due to the growing volume and variety of video content. In this paper, we described a method for analyzing a video sequence with machine learning algorithms, together with a new approach to data preprocessing. The approach was successfully implemented and tested on the problem of detecting and classifying chromakey content in video, where its results outperformed those of existing algorithms. Our approach can be easily applied to other video classification problems and can also be scaled to estimate longer video spans, depending on the specific task. In future work, we plan to conduct more thorough comparisons on chromakey sequences of different lengths. Chromakey technology is used in many ways, and each use has its own specifics and artifacts; we therefore also plan to study all chromakey types in detail and to select the appropriate preprocessing for each type individually, which will help to handle edge cases better.
ACKNOWLEDGEMENTS
The research was supported by ITMO University, project 623097 "Development of libraries containing perspective machine learning methods".
VISAPP 2024 - 19th International Conference on Computer Vision Theory and Applications