Homography VAE: Automatic Bird’s Eye View Image Reconstruction
from Multi-Perspective Views
Keisuke Toida¹, Naoki Kato², Osamu Segawa², Takeshi Nakamura² and Kazuhiro Hotta¹
¹ Meijo University, 1-501 Shiogamaguchi, Tempaku-ku, Nagoya 468-8502, Japan
² Chubu Electric Power Co., Inc., 1-1 Higashishin-cho, Higashi-ku, Nagoya 461-8680, Japan
Keywords:
Variational Autoencoder, Homography Transformation, Unsupervised Learning.
Abstract:
We propose Homography VAE, a novel architecture that combines Variational AutoEncoders with Homography transformation for unsupervised standardized view image reconstruction. By incorporating coordinate transformation into the VAE framework, our model decomposes the latent space into feature and transformation components, enabling the generation of consistent standardized views from multi-viewpoint images without explicit supervision. The effectiveness of our approach is demonstrated through experiments on the MNIST and GRID datasets, where the standardized reconstructions show significantly improved consistency across all evaluation metrics. On the MNIST dataset, the cosine similarity among standardized views reaches 0.66, while the original and transformed views reach only 0.29 and 0.37, respectively. The number of PCA components required to explain 95% of the variance decreases from 193.5 to 33.2, indicating more consistent representations. Even more pronounced improvements are observed on the GRID dataset, where the standardized views achieve a cosine similarity of 0.92 and require only 7 PCA components, compared to 167 for the original images. Furthermore, the first principal component of the standardized views explains 71% of the total variance, suggesting highly consistent geometric patterns. These results validate that Homography VAE successfully learns to generate consistent standardized view representations from various viewpoints without requiring ground truth Homography matrices.
1 INTRODUCTION
In recent years, the importance of techniques for
transforming images captured from various view-
points into a standardized perspective has grown
significantly in fields such as autonomous driving
and surveillance camera systems. Bird’s Eye View
(BEV), which provides a top-down perspective of
a scene, is particularly important for these applica-
tions as it enables better understanding of spatial re-
lationships and object positions. Although Homog-
raphy transformation has been widely used for such
viewpoint transformations, manual estimation of Ho-
mography parameters requires significant human ef-
fort and expertise, making it impractical for large-
scale applications. In the field of deep learning, Vari-
ational Autoencoders (VAE) (Kingma and Welling,
2014) have demonstrated excellent performance in
image generation and reconstruction. Although VAEs
can encode input data into a low-dimensional latent
space and reconstruct the original data from it, con-
ventional VAEs struggle to directly reconstruct stan-
dardized view images from perspective-transformed
images.
To address this limitation and enable unsuper-
vised learning of viewpoint transformations, we pro-
pose Homography VAE, a novel architecture that in-
corporates Homography transformation into the VAE
framework through coordinate transformation. Our
model learns to decompose the latent space into fea-
ture and transformation components, enabling the reconstruction of both the input view and the standardized view within a single framework.
We demonstrated the effectiveness of our pro-
posed method through experiments on the MNIST
and synthetic GRID datasets. Our results show
that the proposed method successfully generates con-
sistent standardized view reconstructions, achieving
higher pairwise cosine similarity and lower L2 dis-
tance compared to input and transformed views. Fur-
thermore, the significant reduction in PCA compo-
nents indicates the model’s ability to learn compact
and consistent representations.
This paper is organized as follows. Section 2 re-
views related works in variational autoencoders and
Homography estimation. Section 3 describes the de-
tails of our proposed method. Section 4 presents ex-
perimental results and analysis. Finally, Section 5
concludes our paper.
2 RELATED WORKS
The research related to our work involves variational autoencoders, transformation-aware autoencoders, and deep learning-based Homography estimation. VAE (Kingma and Welling, 2014) combines variational inference with deep neural networks to learn latent representations of data. This framework has been widely adopted for various image generation and reconstruction tasks. β-VAE (Higgins et al., 2017) extends this framework by introducing a hyperparameter to control the capacity of the latent bottleneck. In the context of transformation-aware architectures, Affine VAE (Bidart and Wong, 2019) incorporates affine transformation awareness into the VAE framework, demonstrating improved generalization and robustness to distribution shifts. Similarly, Spatial Transformer Networks (Jaderberg et al., 2015) introduced a differentiable module for spatial transformations within neural networks, though not specifically in a VAE context.
In the field of Homography estimation, deep learning approaches have shown promising results. Deep Image Homography Estimation (DeTone et al., 2016) demonstrated the first successful application of deep learning to direct Homography parameter estimation from image pairs. This approach was extended to dynamic scenes (Le et al., 2020), incorporating temporal consistency. Self-supervised approaches (Wang et al., 2019) have further eliminated the need for manual annotations in Homography estimation.
However, these existing approaches have sev-
eral limitations. Deep learning-based methods typi-
cally require ground truth Homography matrices for
training, which are often costly to obtain. Further-
more, while various methods have been proposed
for Homography estimation or image transformation,
none have specifically addressed the challenge of re-
constructing standardized view images from multi-
viewpoint datasets without explicit supervision. Our
proposed Homography VAE addresses these lim-
itations by incorporating Homography transforma-
tion into the VAE framework, enabling unsupervised
learning of viewpoint transformations.
3 PROPOSED METHOD
We propose Homography VAE, a novel architecture
that combines VAE with Homography transforma-
tion to enable standardized view image reconstruc-
tion from multi-viewpoint images. As shown in Fig-
ure 1, our method consists of three main components:
an encoder for latent representation, a Homography
transformation module, and a decoder for image re-
construction.
3.1 Model Architecture
3.1.1 Image Encoder
Let $x \in \mathbb{R}^{H \times W \times C}$ be an input image captured from an arbitrary viewpoint, where $H$, $W$, and $C$ denote the height, width, and number of channels, respectively. The encoder $E(\cdot)$ maps $x$ to a latent representation $z$:

$$z = E(x) \qquad (1)$$

The latent space $z$ is designed to contain both image feature information and Homography parameters. Specifically, we partition $z$ into two parts:

$$z = [z_{feat}, z_{homo}] \qquad (2)$$

where $z_{feat} \in \mathbb{R}^{d}$ represents $d$-dimensional image features and $z_{homo} \in \mathbb{R}^{8}$ contains the information for computing the Homography transformation matrix $H \in \mathbb{R}^{3 \times 3}$.
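As a concrete illustration of this encoder and latent split, a minimal PyTorch sketch is given below; the convolutional layer sizes, the feature dimension, and the class name HomographyVAEEncoder are our own assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class HomographyVAEEncoder(nn.Module):
    """Sketch of an encoder producing a latent vector z = [z_feat, z_homo] (Eqs. 1-2).

    Layer sizes and the feature dimension are illustrative assumptions.
    """
    def __init__(self, in_channels=1, feat_dim=32):
        super().__init__()
        self.feat_dim = feat_dim
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        latent_dim = feat_dim + 8              # 8 extra units parameterize the Homography
        self.fc_mu = nn.Linear(64, latent_dim)
        self.fc_logvar = nn.Linear(64, latent_dim)

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)      # reparameterization trick
        z_feat, z_homo = z[:, :self.feat_dim], z[:, self.feat_dim:]  # Eq. (2): z = [z_feat, z_homo]
        return z_feat, z_homo, mu, logvar
```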
3.1.2 Homography Transformation Module
From $z_{homo}$, we compute the Homography transformation matrix $H$ that represents the viewpoint transformation from the standard coordinate system to the input image's perspective. A Homography transformation can be represented by a $3 \times 3$ matrix:

$$H = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix} \qquad (3)$$

where $h_{33}$ is typically set to 1 as the matrix is defined up to a scale factor. The standard coordinates $C_{std}$ are defined as a regular grid in normalized coordinates:

$$C_{std} := \{ x \in \mathbb{R}^{3 \times h \times w} \mid -1 \le x \le 1 \} \qquad (4)$$

For a point in homogeneous coordinates $p = (x, y, 1)^{\top}$, the transformation is computed by Equation 5:

$$p' = H p = (x', y', w')^{\top} \qquad (5)$$
Figure 1: Overview of Homography VAE architecture. The encoder maps input images to a latent representation that is split into feature and transformation components. The decoder employs a dual-branch strategy: the first branch (blue arrow) uses transformed coordinates $C_{trans}$ to reconstruct the input viewpoint, while the second branch (red arrow) uses standard coordinates $C_{std}$ to generate the standardized view. Both branches share the same feature representation $z_{feat}$ but use different coordinate information.
The homogeneous coordinates are converted back to Euclidean coordinates through perspective division in Equation 6:

$$(x'', y'') = \left( \frac{x'}{w'}, \frac{y'}{w'} \right) \qquad (6)$$

Applying these transformations to all points in $C_{std}$, we obtain the transformed coordinates $C_{trans}$:

$$C_{trans} = H \cdot C_{std} \qquad (7)$$
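As an illustration of Equations (3)–(7), the sketch below builds the $3 \times 3$ matrix from the eight predicted parameters with $h_{33}$ fixed to 1, applies it to a normalized coordinate grid, and performs the perspective division; the grid construction helper and the small epsilon added for numerical stability are our own assumptions.

```python
import torch

def build_homography(z_homo):
    """Map z_homo of shape (B, 8) to H of shape (B, 3, 3) with h33 fixed to 1 (Eq. 3)."""
    ones = torch.ones(z_homo.size(0), 1, device=z_homo.device)
    return torch.cat([z_homo, ones], dim=1).view(-1, 3, 3)

def standard_grid(h, w, device="cpu"):
    """Regular grid in normalized coordinates [-1, 1] in homogeneous form (Eq. 4), shape (3, h, w)."""
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=device),
        torch.linspace(-1, 1, w, device=device),
        indexing="ij",
    )
    return torch.stack([xs, ys, torch.ones_like(xs)], dim=0)

def transform_grid(H, c_std, eps=1e-8):
    """Apply p' = H p to every grid point and divide by w' (Eqs. 5-7), giving C_trans of shape (B, 2, h, w)."""
    B = H.size(0)
    pts = c_std.reshape(3, -1)                      # (3, h*w) homogeneous points
    p = H @ pts.unsqueeze(0).expand(B, -1, -1)      # batched matrix product: (B, 3, h*w)
    xy = p[:, :2] / (p[:, 2:3] + eps)               # perspective division (Eq. 6)
    return xy.reshape(B, 2, *c_std.shape[1:])
```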
3.1.3 Decoder and Image Reconstruction
The decoder $D(\cdot)$ takes both the image features $z_{feat}$ and coordinate information to reconstruct the image. For reconstructing the input viewpoint image, we use

$$x_{rec} = D(z_{feat}, C_{trans}) \qquad (8)$$

To reconstruct the standardized view image, we use the standard coordinates $C_{std}$ instead of the transformed coordinates:

$$x_{std} = D(z_{feat}, C_{std}) \qquad (9)$$

This key feature allows our model to reconstruct standardized view images without explicit supervision of the transformation parameters. The decoder learns to associate the standard coordinate system with the standardized view perspective through the training process.
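One possible way to condition the decoder on coordinates, shown purely as a sketch: the feature vector is broadcast over the spatial grid and concatenated with a 2-channel coordinate map before a small convolutional head, so the same decoder produces either reconstruction depending on which coordinates it receives. The paper does not specify the exact conditioning mechanism, so this design and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class CoordinateDecoder(nn.Module):
    """Sketch of a coordinate-conditioned decoder D(z_feat, C) (Eqs. 8-9); layer sizes are assumptions."""
    def __init__(self, feat_dim=32, out_channels=1):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(feat_dim + 2, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_channels, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, z_feat, coords):
        B, _, h, w = coords.shape
        feat_map = z_feat[:, :, None, None].expand(B, -1, h, w)   # broadcast features over the grid
        return self.head(torch.cat([feat_map, coords], dim=1))

# Usage (sketch):
#   x_rec = decoder(z_feat, c_trans)                                       # Eq. (8): input viewpoint
#   x_std = decoder(z_feat, c_std[:2].unsqueeze(0).expand(B, -1, -1, -1))  # Eq. (9): standardized view
```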
3.2 Training Objective
The model is trained using the standard VAE objective function with a reconstruction loss and KL divergence term:

$$\mathcal{L}(\theta, \phi; x, z) = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{KL}(q_{\phi}(z|x) \,\|\, p(z)) \qquad (10)$$

where $q_{\phi}(z|x)$ and $p_{\theta}(x|z)$ denote the encoder and decoder distributions respectively, with $p(z) = \mathcal{N}(0, 1)$. Here, $\phi$ and $\theta$ are learnable parameters of the neural networks. The encoder $q_{\phi}(z|x)$ outputs the parameters of a Gaussian distribution $\mathcal{N}(\mu_{\phi}(x), \sigma^{2}_{\phi}(x))$, where $\mu_{\phi}(x)$ and $\sigma^{2}_{\phi}(x)$ are learned through the neural network. The KL divergence term $D_{KL}$ measures the difference between the encoder's distribution and the prior distribution $p(z)$.
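Concretely, Equation (10) is optimized by minimizing the negative ELBO; a minimal sketch of the per-batch loss, assuming a Bernoulli reconstruction term (our assumption for these near-binary images) and the closed-form Gaussian KL, is:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_rec, mu, logvar):
    """Negative ELBO of Eq. (10): reconstruction term plus KL(q_phi(z|x) || N(0, 1))."""
    rec = F.binary_cross_entropy(x_rec, x, reduction="sum")         # -E_q[log p_theta(x|z)] (Bernoulli assumption)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())    # closed-form Gaussian KL divergence
    return rec + kl
```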
The key advantage of our proposed method is that
it learns to estimate Homography transformations in
an unsupervised manner while simultaneously encod-
ing image features in the latent space. By decom-
posing the latent representation into feature and trans-
formation components, the model can reconstruct im-
ages from both the input and standardized view using
a single framework. This unified approach enables
viewpoint transformation without requiring ground
truth Homography matrices during training.
4 EXPERIMENTS
4.1 Datasets
We evaluate our proposed method on two different
datasets. The first dataset is MNIST, which consists of
handwritten digits with a resolution of 28×28 pixels.
The second dataset comprises synthetically generated
GRID images with a resolution of 64×64 pixels con-
taining grid patterns. For both datasets, we apply
random Homography transformations to the original
images during training and testing to simulate multi-
viewpoint inputs.
Table 1: Detailed evaluation results for each digit class on the MNIST dataset. Mean ± std are shown for cosine similarity and L2 distance.
Class | Original (Cos.Sim., L2 Dist., PCA) | Transformed (Cos.Sim., L2 Dist., PCA) | Standardized (Cos.Sim., L2 Dist., PCA)
0 | 0.31±0.14, 13.49±1.75, 185 | 0.38±0.16, 11.75±1.94, 59 | 0.69±0.15, 8.86±2.20, 26
1 | 0.22±0.20, 9.44±1.62, 154 | 0.25±0.22, 8.34±1.68, 47 | 0.75±0.13, 4.96±1.47, 23
2 | 0.30±0.13, 12.58±1.59, 207 | 0.39±0.16, 10.25±1.77, 65 | 0.62±0.13, 8.68±1.86, 39
3 | 0.29±0.14, 12.25±1.57, 205 | 0.38±0.16, 9.76±1.68, 68 | 0.62±0.14, 8.15±1.74, 38
4 | 0.30±0.13, 11.36±1.46, 202 | 0.39±0.16, 8.97±1.54, 61 | 0.65±0.13, 6.92±1.49, 35
5 | 0.27±0.12, 11.91±1.52, 203 | 0.37±0.14, 9.38±1.59, 65 | 0.58±0.15, 7.79±1.68, 35
6 | 0.32±0.14, 12.11±1.60, 188 | 0.41±0.16, 10.19±1.78, 62 | 0.65±0.15, 8.17±1.91, 32
7 | 0.26±0.15, 11.22±1.56, 187 | 0.32±0.17, 9.44±1.67, 56 | 0.64±0.16, 7.14±1.72, 31
8 | 0.35±0.13, 12.21±1.56, 210 | 0.46±0.16, 9.58±1.77, 69 | 0.69±0.10, 7.33±1.64, 40
9 | 0.31±0.14, 11.35±1.45, 194 | 0.39±0.16, 9.39±1.52, 68 | 0.66±0.13, 7.42±1.56, 33
4.2 Implementation Details
The encoder and decoder networks are implemented
using convolutional neural networks. We train our
model using the Adam optimizer with a learning rate
of 0.001. To stabilize the training, we employ cyclic
KL annealing (Fu et al., 2019) for mitigating KL col-
lapse and gradient clipping (Pascanu et al., 2013) with
a maximum norm of 1.0. During training and testing,
we randomly sample Homography transformation pa-
rameters within a predetermined range to generate di-
verse viewpoint variations. Specifically, we perturb
the four corner points of the input image with ran-
dom displacements to create the transformation ma-
trix. The input image is then warped using homogra-
phy transformation with the obtained transformation
matrix.
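A sketch of this corner-perturbation scheme using OpenCV is shown below; the displacement range max_shift and the uniform sampling are illustrative assumptions, not the exact values used in the experiments.

```python
import cv2
import numpy as np

def random_homography_warp(img, max_shift=8, rng=None):
    """Perturb the four image corners by random displacements and warp (max_shift is illustrative)."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = img.shape[:2]
    src = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
    dst = src + rng.uniform(-max_shift, max_shift, size=(4, 2)).astype(np.float32)
    H = cv2.getPerspectiveTransform(src, dst)       # 3x3 Homography from the corner correspondences
    warped = cv2.warpPerspective(img, H, (w, h))    # warp the input image with the sampled Homography
    return warped, H
```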
4.3 Evaluation Metrics
To quantitatively evaluate the effectiveness of our
model, we employ four metrics to assess the consis-
tency of the reconstructed images within each class.
First, we compute the mean pairwise cosine similar-
ity, measuring the average directional similarity be-
tween image pairs. Second, we calculate the mean
pairwise L2 distance to quantify pixel-level differ-
ences between images. Third, we analyze the num-
ber of principal components required to explain 95%
of the total variance in the PCA space, where fewer
components indicate more compact representations.
Finally, we evaluate the first principal component ra-
tio, which quantifies how much of the total variance
is captured by the most significant direction of varia-
tion. All metrics are computed separately for original,
transformed, and standardized images to enable com-
prehensive comparison.
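For reference, these four metrics can be computed as in the sketch below, where images of one class are flattened to vectors and scikit-learn's PCA is used; the exact implementation details may differ from ours.

```python
import numpy as np
from sklearn.decomposition import PCA

def consistency_metrics(images, variance_threshold=0.95):
    """Compute the four consistency metrics for an (N, H, W) array of same-class images."""
    X = images.reshape(len(images), -1).astype(np.float64)
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    iu = np.triu_indices(len(X), k=1)                                      # indices of unique image pairs
    mean_cos = (Xn @ Xn.T)[iu].mean()                                      # mean pairwise cosine similarity
    mean_l2 = np.linalg.norm(X[:, None] - X[None, :], axis=-1)[iu].mean()  # mean pairwise L2 distance
    pca = PCA().fit(X)
    cum = np.cumsum(pca.explained_variance_ratio_)
    n_components = int(np.searchsorted(cum, variance_threshold) + 1)       # components for 95% variance
    first_ratio = float(pca.explained_variance_ratio_[0])                  # first principal component ratio
    return mean_cos, mean_l2, n_components, first_ratio
```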
4.4 Results and Analysis
Our experimental results on the MNIST dataset demonstrate that the standardized view reconstructions achieve significantly higher consistency than both the original and transformed images, as shown in Tables 1 and 2. The comparison can be analyzed from three perspectives. First, the cosine similarity metric indicates that the standardized views maintain higher directional consistency across samples than both the original and transformed images. Second, the lower L2 distance of the standardized views suggests that our model successfully reduces pixel-wise variations while preserving essential image features. Third, the analysis of PCA components reveals that the standardized views can be represented in a significantly lower-dimensional space than the original and transformed images, indicating that our model successfully learns to generate consistent standardized view reconstructions.
Notably, this improvement in consistency is observed across all digit classes in the MNIST dataset, as detailed in Table 1. The standardized views consistently show better performance on all metrics, with particularly strong results for simpler digits such as ”1”. Even for more complex digits with higher inherent variability, our model maintains improved consistency while preserving the distinctive features of each class.
Furthermore, we evaluated our model on the synthetic GRID dataset, which contains more structured patterns than MNIST. As shown in Table 3, the results on GRID images demonstrate even more pronounced improvements in the standardized view reconstructions. While the performance on MNIST is lower than on the GRID dataset, this is primarily because the MNIST model needs to handle multiple digit classes simultaneously, which requires it to learn class-specific features along with the viewpoint transformations.
Table 2: Comparison of average metrics across different reconstruction types on the MNIST dataset.
Metric | Original | Transformed | Standardized
Cosine Similarity | 0.29 | 0.37 | 0.66
L2 Distance | 11.79 | 9.70 | 7.54
PCA Components | 193.50 | 62.00 | 33.20
First Component Ratio | 0.11 | 0.15 | 0.22
Table 3: Comparison of average metrics across different reconstruction types on the GRID dataset.
Metric | Original | Transformed | Standardized
Cosine Similarity | 0.15±0.05 | 0.24±0.06 | 0.92±0.09
L2 Distance | 19.86±1.05 | 16.53±1.10 | 5.78±2.99
PCA Components | 167 | 156 | 7
First Component Ratio | 0.03 | 0.04 | 0.71
Figure 2: Qualitative results of image reconstruction. For each dataset, we show (a) original input images with various
viewpoint transformations, (b) transformed view reconstructions that preserve the input perspective, and (c) standardized
view reconstructions that consistently align to a frontal viewpoint regardless of input variations. Left column shows the
results on MNIST dataset. Right column shows the results on GRID dataset. Our model successfully generates consistent
standardized view while maintaining the structural integrity of the patterns.
In contrast, the GRID dataset contains only single-class patterns, allowing the model to focus solely on learning viewpoint transformations. Particularly notable is the dramatic reduction in the number of required PCA components, indicating that our model achieves remarkably consistent standardized view reconstructions for structured grid patterns. The high cosine similarity and low L2 distance of the standardized views further support this finding.
The qualitative results shown in Figure 2 demon-
strate our model’s ability to generate visually consis-
tent reconstructions. While the transformed views accurately preserve the perspective of the input images, the standardized views exhibit consistent frontal representations regardless of the input viewpoint. Figure 3 shows the reconstruction results specifically for MNIST digit ”4”. Despite its relatively complex structure, our model successfully generates consistent standardized views while preserving the features of this digit class.
The consistency of these reconstructions is quan-
titatively validated through pairwise cosine similarity
analysis. Figure 4 visualizes the similarity matrices
computed for digit ”4”, where brighter colors indi-
cate higher similarity values. These matrices show
notably higher and more uniform similarity values
in standardized view compared to both original and
transformed views, as indicated by the consistently
brighter colors.
These results validate that our Homography VAE
successfully learns to generate consistent standard-
ized view representations without explicit supervision
of transformation parameters.
5 CONCLUSIONS
In this paper, we presented Homography VAE, a novel
unsupervised framework for standardized view image
reconstruction from multi-viewpoint input. Our main
contributions include a novel architecture that incor-
porates Homography transformation into the VAE
framework through coordinate transformation, en-
abling unsupervised learning of viewpoint transfor-
mations. We demonstrated that decomposing the latent space into feature and transformation components allows for the effective generation of both the input and standardized views within a single framework.
Figure 3: Reconstruction results for MNIST digit ”4”. Given original input images with various viewpoint transformations
(a), our model generates two types of reconstructions. (b) are transformed view reconstructions that preserve the original
perspective of each input. (c) are standardized view reconstructions that align all outputs to a consistent frontal viewpoint.
The results (c) demonstrate that our model successfully handles complex digit structures while maintaining consistent stan-
dardization of viewpoint.
Figure 4: Visualization of pairwise cosine similarity matrices computed from 50 samples within the same class. Results
shown are from MNIST digit ”4”. For each type of images ((a) original, (b) transformed reconstructions, and (c) standardized
reconstructions), we compute the cosine similarity between all pairs of images. The color intensity represents the similarity
value, where brighter colors indicate higher similarity. The more uniform and brighter patterns in (c) demonstrate that stan-
dardized reconstructions achieve consistently higher similarity across all pairs, validating the effectiveness of our approach in
generating consistent representations.
Furthermore, experimental results show that our method
achieves significantly higher consistency in standard-
ized view reconstruction compared to input and trans-
formed views, without requiring ground truth Ho-
mography matrices.
For future work, extending our method to handle
real-world scenes with multiple objects, varying light-
ing conditions, and higher resolution images would
enhance its practical applications. Additionally, in-
vestigating more complex geometric transformations
beyond Homography would further expand the capa-
bility of our framework.
REFERENCES
Bidart, R. and Wong, A. (2019). Affine variational autoen-
coders. In Image Analysis and Recognition: 16th In-
ternational Conference, ICIAR 2019, Waterloo, ON,
Canada, August 27–29, 2019, Proceedings, Part I 16,
pages 461–472. Springer.
DeTone, D., Malisiewicz, T., and Rabinovich, A.
(2016). Deep image homography estimation. ArXiv,
abs/1606.03798.
Fu, H., Li, C., Liu, X., Gao, J., Celikyilmaz, A., and Carin,
L. (2019). Cyclical annealing schedule: A simple ap-
proach to mitigating KL vanishing. In Burstein, J.,
Doran, C., and Solorio, T., editors, Proceedings of
the 2019 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long and
Short Papers), pages 240–250, Minneapolis, Min-
nesota. Association for Computational Linguistics.
Higgins, I., Matthey, L., Pal, A., Burgess, C. P., Glorot,
X., Botvinick, M. M., Mohamed, S., and Lerchner, A.
(2017). beta-vae: Learning basic visual concepts with
a constrained variational framework. ICLR (Poster),
3.
Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015).
Spatial transformer networks. Advances in neural in-
formation processing systems, 28.
Kingma, D. P. and Welling, M. (2014). Auto-Encoding
Variational Bayes. In 2nd International Conference
on Learning Representations, ICLR 2014, Banff, AB,
Canada, April 14-16, 2014, Conference Track Pro-
ceedings.
Le, H., Liu, F., Zhang, S., and Agarwala, A. (2020). Deep
homography estimation for dynamic scenes. In Pro-
ceedings of the IEEE/CVF conference on computer vi-
sion and pattern recognition, pages 7652–7661.
Pascanu, R., Mikolov, T., and Bengio, Y. (2013). On the dif-
ficulty of training recurrent neural networks. In Das-
gupta, S. and McAllester, D., editors, Proceedings of
the 30th International Conference on Machine Learn-
ing, volume 28 of Proceedings of Machine Learning
Research, pages 1310–1318, Atlanta, Georgia, USA.
PMLR.
Wang, C., Wang, X., Bai, X., Liu, Y., and Zhou, J. (2019).
Self-supervised deep homography estimation with in-
vertibility constraints. Pattern Recognition Letters,
128:355–360.