REFERENCES
Azuma, R. (1993). Tracking requirements for augmented
reality. Communications of the ACM, 36(7):50–51.
Boiński, T., Zawora, K., and Szymański, J. (2022). How
to Sort Them? A Network for LEGO Bricks Classi-
fication. International Conference on Computational
Science, 22:627–640.
Caudell, T. P. and Mizell, D. W. (1992). Augmented Real-
ity: An Application of Heads-Up Display Technology
to Manual Manufacturing Processes. Proceedings of
the Hawaii International Conference on System Sci-
ences, volume 2, pages 659–669.
Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le,
Q. V. (2019). AutoAugment: Learning Augmentation
Strategies from Data. IEEE Conference on Computer
Vision and Pattern Recognition, 32:113–123.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei,
L. (2009). ImageNet: A Large-Scale Hierarchical Im-
age Database. IEEE Conference on Computer Vision
and Pattern Recognition, 22:248–255.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby,
N. (2021). An Image is Worth 16×16 Words: Trans-
formers for Image Recognition at Scale. International
Conference on Learning Representations, 9:1–21.
Evans, G., Miller, J., Pena, M. I., MacAllister, A., and
Winer, E. (2017). Evaluating the Microsoft HoloLens
through an augmented reality assembly application.
SPIE Defense + Security, 10197(101970V):282–297.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep
Learning. MIT Press.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Ele-
ments of Statistical Learning: Data Mining, Inference,
and Prediction, volume 2. Springer.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep
Residual Learning for Image Recognition. IEEE Con-
ference on Computer Vision and Pattern Recognition,
29:770–778.
Kress, B. C. and Cummings, W. J. (2017). Towards the Ul-
timate Mixed Reality Experience: HoloLens Display
Architecture Choices. SID Symposium Digest of Tech-
nical Papers, 48(1):127–131.
Krizhevsky, A. (2009). Learning Multiple Layers of Fea-
tures from Tiny Images. Technical Report, University
of Toronto.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
ageNet Classification with Deep Convolutional Neural
Networks. Communications of the ACM, 60:84–90.
Loch, F., Quint, F., and Brishtel, I. (2016). Comparing
Video and Augmented Reality Assistance in Manual
Assembly. International Conference on Intelligent
Environments, 12:147–150.
Loshchilov, I. and Hutter, F. (2019). Decoupled Weight
Decay Regularization. International Conference on
Learning Representations, 7:1–8.
Minaee, S., Boykov, Y., Porikli, F., Plaza, A., Kehtarnavaz,
N., and Terzopoulos, D. (2021). Image Segmenta-
tion Using Deep Learning: A Survey. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence,
44(7):3523–3542.
Niu, S., Liu, Y., Wang, J., and Song, H. (2020). A Decade
Survey of Transfer Learning (2010–2020). IEEE
Transactions on Artificial Intelligence, 1(2):151–166.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-
stein, M., et al. (2015). ImageNet Large Scale Vi-
sual Recognition Challenge. International Journal of
Computer Vision, 115:211–252.
Schinko, C., Ullrich, T., and Fellner, D. W. (2011). Simple
and efficient normal encoding with error bounds. The-
ory and Practice of Computer Graphics Conference,
29:63–65.
Schoosleitner, M. and Ullrich, T. (2021). Scene Under-
standing and 3D Imagination: A Comparison between
Machine Learning and Human Cognition. Interna-
tional Conference on Computer Graphics Theory and
Applications, 16:231–238.
Srinivas, A., Lin, T.-Y., Parmar, N., Shlens, J., Abbeel, P.,
and Vaswani, A. (2021). Bottleneck Transformers for
Visual Recognition. IEEE Conference on Computer
Vision and Pattern Recognition, 34:16519–16529.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
novich, A. (2015). Going Deeper with Convolutions.
IEEE Conference on Computer Vision and Pattern
Recognition, 28:1–9.
Tan, M. and Le, Q. (2019). EfficientNet: Rethinking Model
Scaling for Convolutional Neural Networks. Inter-
national Conference on Machine Learning, 36:6105–
6114.
Tan, M. and Le, Q. (2021). EfficientNetV2: Smaller Mod-
els and Faster Training. International Conference on
Machine Learning, 38:10096–10106.
Tang, A., Owen, C., Biocca, F., and Mou, W. (2003). Com-
parative Effectiveness of Augmented Reality in Object
Assembly. Proceedings of the SIGCHI Conference on
Human Factors in Computing Systems, 5:73–80.
Vidal, J., Vallicrosa, G., Marti, R., and Barnada, M. (2023).
Brickognize: Applying Photo-Realistic Image Synthe-
sis for LEGO Bricks Recognition with Limited Data.
Sensors, 23(4):1898.
Wiedenmaier, S., Oehme, O., Schmidt, L., and Luczak,
H. (2003). Augmented Reality (AR) for Assembly
Processes Design and Experimental Evaluation. In-
ternational Journal of Human-Computer Interaction,
16(3):497–514.
Deep Learning-Powered Assembly Step Classification for Intricate Machines