
Table 1: Detailed evaluation results for each digit class in the MNIST dataset. Mean ± std are shown for cosine similarity and L2 distance; the PCA column reports the number of principal components required to explain 95% of the total variance.
Class   Original                        Transformed                     Standardized
        Cos.Sim.    L2 Dist.    PCA     Cos.Sim.    L2 Dist.    PCA     Cos.Sim.    L2 Dist.    PCA
0       0.31±0.14   13.49±1.75  185     0.38±0.16   11.75±1.94   59     0.69±0.15   8.86±2.20   26
1       0.22±0.20    9.44±1.62  154     0.25±0.22    8.34±1.68   47     0.75±0.13   4.96±1.47   23
2       0.30±0.13   12.58±1.59  207     0.39±0.16   10.25±1.77   65     0.62±0.13   8.68±1.86   39
3       0.29±0.14   12.25±1.57  205     0.38±0.16    9.76±1.68   68     0.62±0.14   8.15±1.74   38
4       0.30±0.13   11.36±1.46  202     0.39±0.16    8.97±1.54   61     0.65±0.13   6.92±1.49   35
5       0.27±0.12   11.91±1.52  203     0.37±0.14    9.38±1.59   65     0.58±0.15   7.79±1.68   35
6       0.32±0.14   12.11±1.60  188     0.41±0.16   10.19±1.78   62     0.65±0.15   8.17±1.91   32
7       0.26±0.15   11.22±1.56  187     0.32±0.17    9.44±1.67   56     0.64±0.16   7.14±1.72   31
8       0.35±0.13   12.21±1.56  210     0.46±0.16    9.58±1.77   69     0.69±0.10   7.33±1.64   40
9       0.31±0.14   11.35±1.45  194     0.39±0.16    9.39±1.52   68     0.66±0.13   7.42±1.56   33
4.2 Implementation Details
The encoder and decoder networks are implemented as convolutional neural networks. We train the model with the Adam optimizer at a learning rate of 0.001. To stabilize training, we employ cyclic KL annealing (Fu et al., 2019) to mitigate KL collapse, and gradient clipping (Pascanu et al., 2013) with a maximum norm of 1.0. During both training and testing, we randomly sample homography transformation parameters within a predetermined range to generate diverse viewpoint variations. Specifically, we perturb the four corner points of the input image with random displacements, compute the corresponding transformation matrix, and warp the input image with the obtained homography.
4.3 Evaluation Metrics
To quantitatively evaluate the effectiveness of our
model, we employ four metrics to assess the consis-
tency of the reconstructed images within each class.
First, we compute the mean pairwise cosine similar-
ity, measuring the average directional similarity be-
tween image pairs. Second, we calculate the mean
pairwise L2 distance to quantify pixel-level differ-
ences between images. Third, we analyze the num-
ber of principal components required to explain 95%
of the total variance in the PCA space, where fewer
components indicate more compact representations.
Finally, we evaluate the first principal component ra-
tio, which quantifies how much of the total variance
is captured by the most significant direction of varia-
tion. All metrics are computed separately for original,
transformed, and standardized images to enable com-
prehensive comparison.
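As an illustration, the sketch below computes these four consistency metrics for a set of images belonging to one class; the function name and the use of scikit-learn's PCA are our own choices for exposition, not part of the evaluation code itself.

import numpy as np
from sklearn.decomposition import PCA

def consistency_metrics(images):
    """Compute the four per-class consistency metrics described above.
    `images` is an (N, H, W) array of images from a single class."""
    X = images.reshape(len(images), -1).astype(np.float64)
    iu = np.triu_indices(len(X), k=1)  # indices of all distinct pairs

    # Mean pairwise cosine similarity.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    mean_cos = (Xn @ Xn.T)[iu].mean()

    # Mean pairwise L2 distance, via ||x-y||^2 = ||x||^2 + ||y||^2 - 2 x.y.
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    mean_l2 = np.sqrt(d2)[iu].mean()

    # Number of principal components explaining 95% of the total variance,
    # and the variance ratio captured by the first principal component.
    pca = PCA().fit(X)
    cum = np.cumsum(pca.explained_variance_ratio_)
    n_95 = int(np.searchsorted(cum, 0.95) + 1)
    first_pc_ratio = float(pca.explained_variance_ratio_[0])

    return mean_cos, mean_l2, n_95, first_pc_ratio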
4.4 Results and Analysis
Our experimental results on the MNIST dataset demonstrate that the standardized view reconstructions achieve significantly higher consistency than both the original and the transformed images, as shown in Tables 1 and 2. The comparison can be analyzed from three perspectives. First, the cosine similarity metric indicates that the standardized views maintain higher directional consistency across samples than both the original and transformed images. Second, the lower L2 distance of the standardized views suggests that our model successfully reduces pixel-wise variation while preserving essential image features. Third, the analysis of PCA components reveals that the standardized views can be represented in a significantly lower-dimensional space than the original and transformed images, indicating that our model successfully learns to generate consistent standardized view reconstructions.
Notably, this improvement in consistency is observed across all digit classes in the MNIST dataset, as detailed in Table 2. The standardized views consistently perform better on all metrics, with particularly strong results for simpler digits such as "1". Even for more complex digits with higher inherent variability, our model maintains improved consistency while preserving the distinctive features of each class.
Furthermore, we evaluated our model on the synthetic GRID dataset, which contains more structured patterns than MNIST. As shown in Table 3, the results on GRID images demonstrate even more pronounced improvements in the standardized view reconstructions. While the performance on MNIST is relatively lower than on the GRID dataset, this is primarily because the MNIST model must handle multiple digit classes simultaneously, which requires it to learn class-specific features along with viewpoint transformations. In contrast, GRID