93.7% of the instances, making it also highly imbalanced. The synthetic dataset generated by Tab-VAE preserves this ratio, whilst the one generated without Gumbel-softmax assigns 97.5% of the instances to this class, again suppressing the information of the minority class. All of these factors contributed to the very low score on the census dataset in the ablation analysis.
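As an illustration of the kind of check behind these numbers, the majority-class share of a real and a synthetic dataset can be compared directly. The following is a minimal sketch on toy stand-in labels, not the paper's evaluation code; the `majority_share` helper and the hard-coded label counts are illustrative assumptions.

```python
from collections import Counter

def majority_share(labels):
    """Fraction of rows belonging to the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Toy stand-in labels; a real check would use the census target column.
real      = ["low"] * 937 + ["high"] * 63   # ~93.7% majority class
synthetic = ["low"] * 975 + ["high"] * 25   # ~97.5% majority class

print(majority_share(real))       # share in the real data
print(majority_share(synthetic))  # share in the synthetic data
```

A large gap between the two shares indicates that the generator has further suppressed the minority class.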
For the remaining three datasets, the performance difference is not as significant, as these datasets rely less on categorical variables for encoding information. Interestingly, the adult dataset also has one multi-class categorical column with 41 classes, of which the model without Gumbel-softmax generates only 9. Still, the impact of this column is not as prevalent as in the credit and census datasets. Like those two, adult is a binary classification dataset, but its two classes have a more balanced 75%-25% ratio. The magnitude of the obtained results therefore highlights the importance of Gumbel-softmax for modeling categorical columns in tabular datasets.
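To make the role of Gumbel-softmax concrete, the trick can be sketched in a few lines: Gumbel(0, 1) noise is added to the category logits, and a temperature-scaled softmax yields a differentiable, approximately one-hot sample. This is a generic NumPy sketch of the standard reparameterization, not Tab-VAE's implementation; the function name and the temperature value are illustrative.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Draw a differentiable (soft) sample from a categorical distribution.

    Adds Gumbel(0, 1) noise to the logits and applies a softmax with
    temperature tau; as tau -> 0 the output approaches a one-hot vector.
    """
    rng = np.random.default_rng() if rng is None else rng
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / tau
    y = y - y.max()          # subtract max for numerical stability
    exp_y = np.exp(y)
    return exp_y / exp_y.sum()

rng = np.random.default_rng(0)
logits = np.log(np.array([0.7, 0.2, 0.1]))  # three-class column
sample = gumbel_softmax(logits, tau=0.5, rng=rng)
print(sample)  # soft one-hot vector summing to 1
```

Because the sample is a smooth function of the logits, gradients can flow through it during training, which is what allows the decoder to learn multi-class categorical columns.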
7 CONCLUSIONS AND FUTURE WORK
In this paper, we introduced Tab-VAE, which addresses the challenge of modeling multi-class categorical variables in tabular data using a VAE generative model. Our approach is motivated by the belief that VAEs can generate high-quality synthetic data while being simpler, more stable, and more computationally efficient. We corroborated this claim by comparing our model against a host of state-of-the-art models using two evaluation frameworks and an ablation analysis. Tab-VAE consistently showed strong performance across multiple datasets, outperforming state-of-the-art GAN- and VAE-based models. In future work, the model can be made more robust by incorporating additional encoding methods for other types of data, such as mixed data. Overall, Tab-VAE represents a significant advancement in tabular data generation and has the potential for broad application in various domains.
Tab-VAE: A Novel VAE for Generating Synthetic Tabular Data