5 CONCLUSION
Video classification is becoming increasingly relevant due to the growing volume and variety of video content. In this paper, we described a method for analyzing a video sequence with machine learning algorithms, together with a new approach to data preprocessing. The approach was successfully implemented and tested on the problem of detecting and classifying chromakey content in video, where its results outperformed those of existing algorithms. Our approach can be easily applied to other video classification problems and can also be scaled to estimate longer video spans, depending on the specific task. In future work, we plan to conduct more thorough comparisons on chromakey sequences of different lengths. Chromakey technology is used in many ways, and each use has its own specifics and artifacts; we therefore also plan to study all chromakey types in detail and to select the appropriate preprocessing for each type individually, which will help to handle edge cases better.
ACKNOWLEDGEMENTS
The research was supported by ITMO University, project 623097 "Development of libraries containing perspective machine learning methods".
VISAPP 2024 - 19th International Conference on Computer Vision Theory and Applications