
ditions are studied by repeatedly capturing images of
actual products. Because this process is time-consuming
and costly, a more efficient approach is needed. In this
study, we propose a method for automatically optimizing
the imaging conditions.
To this end, we use a display of the kind normally used
for video presentation as a light source to illuminate
the object, and we optimize the illumination patterns
shown on it to enable highly accurate anomaly inspection.
The number of patterns that such a display can show is
almost unlimited, so a very complex illumination
environment can be created. On the other hand, this
number of variations makes trial-and-error optimization
impractical. The proposed method, which optimizes the
lighting environment automatically, therefore has a great
advantage when constructing an anomaly inspection system.
In addition, the proposed method trains the neural network
used for anomaly detection in conjunction with the
lighting pattern optimization. Using this network together
with the derived patterns, the proposed method can detect
anomalies with extremely high accuracy.
2 ANOMALY DISCRIMINATION USING VISION TRANSFORMER (ViT)
We first describe methods for detecting anomalies from
image information. Image anomaly detection can be
approached in two ways: one uses only normal images and
treats deviations from them as anomalies (Perera and
Patel, 2019; Bergmann et al., 2019); the other collects
both anomalous and normal images in advance and separates
them by 2-class discrimination. Training data for the
former are easy to collect, but its discrimination
accuracy is harder to improve than that of a method that
uses anomaly data directly. In this paper, we assume that
anomaly images can be collected to some extent and focus
on anomaly detection by 2-class discrimination.
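The contrast between the two formulations can be illustrated with a small sketch. It is not the method of this paper: the 2-D Gaussian "features", the Mahalanobis-distance rule for the normal-only case, the 95% threshold, and the use of Fisher's linear discriminant for the 2-class case are all assumptions chosen only to make the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D features (hypothetical stand-ins for real image descriptors).
normal_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
anomaly_train = rng.normal(loc=3.0, scale=1.0, size=(200, 2))
normal_test = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
anomaly_test = rng.normal(loc=3.0, scale=1.0, size=(200, 2))


def mahalanobis_sq(x, mean, cov_inv):
    """Squared Mahalanobis distance of each row of x from mean."""
    diff = x - mean
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)


def fit_one_class(normal):
    """Normal-only model: mean, inverse covariance, and a distance threshold."""
    mean = normal.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(normal, rowvar=False))
    # Assumed rule: flag anything farther than 95% of the normal training data.
    threshold = np.quantile(mahalanobis_sq(normal, mean, cov_inv), 0.95)
    return mean, cov_inv, threshold


def fit_two_class(normal, anomaly):
    """Fisher's linear discriminant with the boundary at the class midpoint."""
    mu_n, mu_a = normal.mean(axis=0), anomaly.mean(axis=0)
    # Within-class scatter, taken here as the sum of the class covariances.
    s_w = np.cov(normal, rowvar=False) + np.cov(anomaly, rowvar=False)
    w = np.linalg.solve(s_w, mu_a - mu_n)
    bias = -0.5 * w @ (mu_n + mu_a)
    return w, bias


def accuracy(flag_normal, flag_anomaly):
    """Fraction correct; each array is True where a sample is flagged anomalous."""
    correct = np.sum(~flag_normal) + np.sum(flag_anomaly)
    return correct / (len(flag_normal) + len(flag_anomaly))


# (a) Normal-only: deviation from the statistics of normal data.
mean, cov_inv, thr = fit_one_class(normal_train)
one_class_acc = accuracy(
    mahalanobis_sq(normal_test, mean, cov_inv) > thr,
    mahalanobis_sq(anomaly_test, mean, cov_inv) > thr,
)

# (b) 2-class: a decision boundary learned from both classes.
w, b = fit_two_class(normal_train, anomaly_train)
two_class_acc = accuracy(normal_test @ w + b > 0, anomaly_test @ w + b > 0)

print(f"normal-only accuracy: {one_class_acc:.3f}")
print(f"2-class accuracy:     {two_class_acc:.3f}")
```

The normal-only model never sees anomaly samples, so its threshold has to be chosen from normal data alone, whereas the 2-class model places its boundary using both classes. The numbers printed on this toy data say nothing about real inspection images.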
Before deep learning, anomaly detection classified
abnormal and normal images based on the statistical
properties of image features, using techniques such as
principal component analysis and linear discriminant
analysis. These analyses used not only image intensities
directly but also information such as edges obtained by
filtering. In recent years, methods based on neural
networks, typified by CNNs (Convolutional Neural
Networks), have been widely used in anomaly detection
(Perera and Patel, 2019). More recently, network
structures that do not use convolution, in particular the
Vision Transformer (ViT), have come into wide use
(Dosovitskiy et al., 2021). The Transformer architecture
was originally proposed in the field of natural language
processing; the Vision Transformer applies it to image
information and is widely used for image classification.
Figure 2: Representative architecture of Vision Transformer.
The Vision Transformer divides the input image into a set
of small patches and processes them as tokens using the
Transformer structure. Each token is input to a network
structure called the ViT Encoder, as shown in Fig. 2. The
ViT Encoder uses a structure called “multi-head attention”
to represent the relationships between tokens, which
enables it to capture global image features even from a
finely segmented patch set. The features obtained by the
Encoder are input to an MLP (Multi-Layer Perceptron),
which calculates the final classification result.
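The data flow described above (patches, tokens, multi-head attention, MLP) can be sketched in a few lines of NumPy. This is a minimal illustration with random, untrained weights, not the network used in this paper. The image size, patch size, embedding width, and head count are assumed values, and position embeddings, layer normalization, residual connections, and the class token of a full ViT are omitted. Mean pooling over tokens before the MLP is likewise an assumption.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)          # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(tokens, w_q, w_k, w_v, w_o, heads):
    """Scaled dot-product self-attention over all tokens, split into heads."""
    n, d = tokens.shape
    dh = d // heads
    # Project, then reshape to (heads, n, dh) so each head attends separately.
    q = (tokens @ w_q).reshape(n, heads, dh).transpose(1, 0, 2)
    k = (tokens @ w_k).reshape(n, heads, dh).transpose(1, 0, 2)
    v = (tokens @ w_v).reshape(n, heads, dh).transpose(1, 0, 2)
    # attn[h, i, j]: how much token i attends to token j in head h.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return out @ w_o, attn


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))              # toy input image
patch, dim, heads = 8, 16, 4                 # assumed sizes

tokens = patchify(image, patch)              # 16 patches of 8*8*3 values
w_embed = rng.normal(0.0, 0.02, (tokens.shape[1], dim))
embedded = tokens @ w_embed                  # linear patch embedding

w_q, w_k, w_v, w_o = (rng.normal(0.0, 0.1, (dim, dim)) for _ in range(4))
encoded, attn = multi_head_attention(embedded, w_q, w_k, w_v, w_o, heads)

# MLP head on mean-pooled encoder features (pooling choice is an assumption).
pooled = encoded.mean(axis=0)
w1 = rng.normal(0.0, 0.1, (dim, 32))
w2 = rng.normal(0.0, 0.1, (32, 2))
probs = softmax(np.maximum(pooled @ w1, 0.0) @ w2)   # normal vs. anomaly

print(tokens.shape, attn.shape, probs.shape)
```

Every row of `attn` spans all 16 tokens, which is how a patch in one corner of the image can influence the representation of a patch in the opposite corner.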
Compared with a CNN, a discriminator built on the Vision
Transformer is known to require a large amount of
training data to achieve sufficient performance. However,
this problem can be avoided by pre-training the network
on a large amount of image data and then adapting it to
the target task by transfer learning. In this way, the
Vision Transformer has been shown to be applicable to a
variety of tasks even when training data are limited
(Dosovitskiy et al., 2021). For the discriminator used in
this study, we aim to achieve highly accurate anomaly
discrimination by taking a network pre-trained on a large
amount of data and applying transfer learning with the
target data.
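One common form of transfer learning can be sketched as follows. The text does not state whether the pre-trained weights are frozen or fine-tuned, so the frozen-backbone setup here is an assumption made for simplicity. The "pre-trained encoder" is only a stand-in (a fixed random projection named `W_BACKBONE`), and the toy data, learning rate, and step count are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained encoder: a FIXED projection followed by tanh.
# In practice this would be a ViT whose weights come from large-scale
# pre-training; here the weights are random and purely illustrative.
W_BACKBONE = rng.normal(0.0, 20 ** -0.5, size=(20, 8))


def backbone(x):
    """Frozen feature extractor; its weights are never updated below."""
    return np.tanh(x @ W_BACKBONE)


def make_data(n):
    """Toy target-task data: label 0 = normal, 1 = anomaly (shifted inputs)."""
    normal = rng.normal(0.0, 1.0, size=(n, 20))
    anomaly = rng.normal(1.5, 1.0, size=(n, 20))
    x = np.vstack([normal, anomaly])
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return x, y


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cross_entropy(p, y):
    eps = 1e-12  # avoids log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))


x_train, y_train = make_data(100)            # small target dataset
x_test, y_test = make_data(100)
f_train, f_test = backbone(x_train), backbone(x_test)

# New 2-class head, trained from scratch on the target data only.
w_head, b_head = np.zeros(8), 0.0
lr, steps = 0.3, 300                         # assumed hyperparameters
initial_loss = cross_entropy(sigmoid(f_train @ w_head + b_head), y_train)
for _ in range(steps):
    p = sigmoid(f_train @ w_head + b_head)
    grad = p - y_train                       # gradient of the loss w.r.t. logits
    w_head -= lr * f_train.T @ grad / len(y_train)
    b_head -= lr * grad.mean()

final_loss = cross_entropy(sigmoid(f_train @ w_head + b_head), y_train)
test_acc = np.mean((sigmoid(f_test @ w_head + b_head) > 0.5) == y_test)
print(f"loss {initial_loss:.3f} -> {final_loss:.3f}, test accuracy {test_acc:.3f}")
```

Only the nine head parameters are fitted to the target data, which is why a small dataset can suffice. Fine-tuning the backbone as well is the other common option and would replace the fixed `backbone` with trainable weights.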
VISAPP 2025 - 20th International Conference on Computer Vision Theory and Applications