sign choice. The major contribution of applying PCA whitening in this work is that it speeds up the training process by more than 30% per epoch on a GPU. At the same time, a balanced intermediate code vector, with similar variance for each parameter, benefits the performance of the neural network. The proposed HERA system takes ∼70 seconds to train one epoch on an NVIDIA RTX 2080 and requires ∼350 epochs to train the whole network. After training, the network processes a single image in 6 ms.
When trained without the pixel loss, as shown in Figure 7, the overall appearance of the rendered ear image differs from the input ear image, especially around the helix. Training without the pixel loss makes the model focus on lowering the landmark alignment error regardless of the overall appearance of the ear; it is therefore necessary to utilise the pixel loss. This set of figures also illustrates the pose ambiguity of the system caused by orthographic projection: for a given set of ear 3DMM parameters, there exist two different rotations that result in the same projected 2D landmarks. In one case, such as Figure 7 (1), the external auditory canal of the ear is visible; in the other case, as in the other rendered images in this paper, the external auditory canal is occluded by the ear itself. This ambiguity may affect downstream applications that relate the reconstructed 3D ear to other 3D objects, such as the 3D head, but a simple 3D registration step can resolve the rotational ambiguity, if required. Alternatively, restrictions on the rotations can be applied during the training phase so that the results fall into the desired range.
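The two-rotation ambiguity can be checked numerically. The following is a small NumPy/SciPy sketch under the assumption that the ear landmarks lie (approximately) in a plane: with S = diag(1, 1, -1), the rotations R and SRS are distinct, yet project planar points to identical 2D locations under orthographic projection.

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Illustrative landmarks placed exactly in the z = 0 plane; real ear
# landmarks are only roughly planar, so the ambiguity is approximate.
rng = np.random.default_rng(0)
pts = rng.normal(size=(55, 3))
pts[:, 2] = 0.0

R = Rotation.from_euler('xyz', [20, 35, -10], degrees=True).as_matrix()
S = np.diag([1.0, 1.0, -1.0])   # reflection that flips the depth axis
R_flip = S @ R @ S              # a second, genuinely different rotation

project = lambda X: X[:, :2]    # orthographic projection: drop z

print(np.allclose(R, R_flip))                                    # False
print(np.allclose(project(pts @ R.T), project(pts @ R_flip.T)))  # True
```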
When training without data augmentation, the 2D landmark localisation performance drops by a small amount, mainly due to the reduced variety in ear rotation, as shown in Figure 8. When training without the landmark loss, the predicted landmarks are not accurate enough, as also shown in Figure 8. As a result, the reconstructed 3D ears are not accurately aligned with the 2D ears, especially along the ear contours.
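For reference, a minimal PyTorch sketch of how the landmark and pixel terms might be combined is given below; the function names, loss forms, and weights are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def landmark_loss(pred_lms, gt_lms):
    # Mean Euclidean distance between the 2D landmarks projected from the
    # fitted 3D ear and the ground-truth 2D landmarks.
    return torch.norm(pred_lms - gt_lms, dim=-1).mean()

def pixel_loss(rendered, image, mask):
    # Photometric error restricted to the rendered ear region; without
    # this term the network can match landmarks while ignoring appearance.
    per_pixel = torch.abs(rendered - image).sum(dim=1) * mask
    return per_pixel.sum() / mask.sum().clamp(min=1.0)

def total_loss(pred_lms, gt_lms, rendered, image, mask,
               w_lmk=1.0, w_pix=1.0):       # weights are illustrative
    return (w_lmk * landmark_loss(pred_lms, gt_lms)
            + w_pix * pixel_loss(rendered, image, mask))
```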
5 CONCLUSION
While a large proportion of human-related 3D reconstruction approaches focus on the human face, 3D ear reconstruction, an equally human-centred task, has attracted far less work. In this paper, we propose a self-supervised deep autoencoder that reconstructs the 3D ear from a single image. Our model recovers a 3D ear mesh with a plausible appearance and accurate dense alignment, as evidenced by the close agreement with ground-truth landmarks. A comprehensive evaluation shows that our method achieves state-of-the-art performance in both 3D ear reconstruction and 3D ear alignment.