FSL-LFMG: Few-Shot Learning with Augmented Latent Features and Multitasking Generation for Enhancing Multiclass Classification on Tabular Data

Aviv A. Nur, Chun-Kit Ngan and Rolf Bardeli

Data Science Program, Worcester Polytechnic Institute, 100 Institute Rd., Worcester, MA, U.S.A.

thyssenkrupp Materials Services GmbH, Essen, Germany

{aanur, cngan}@wpi.edu, rolf.bardeli@thyssenkrupp-materials.com

Keywords:

Few-shot Learning, Machine Learning, Deep Learning, Multiclass Classiﬁcation, Autoencoders, Random

Forest, CatBoost, One vs Rest Classiﬁer, STUNT, Prototypical Network, Tabular Data.

Abstract:

In this work, we propose advancing ProtoNet that employs augmented latent features (LF) by an autoencoder

and multitasking generation (MG) by STUNT in the few-shot learning (FSL) mechanism. Speciﬁcally, the

achieved contributions to this work are threefold. First, we propose an FSL-LFMG framework to develop

an end-to-end few-shot multiclass classiﬁcation workﬂow on tabular data. This framework is composed of

three main stages that include (i) data augmentation at the sample level utilizing autoencoders to generate

augmented LF, (ii) data augmentation at the task level involving self-generating multitasks using the STUNT

approach, and (iii) the learning process taking place on ProtoNet, followed by various model evaluations in

our FSL mechanism. Second, due to the outlier and noise sensitivity of K-means clustering and the curse of

dimensionality of Euclidean distance, we enhance and customize the STUNT approach by using K-medoids

clustering that is less sensitive to noisy outliers and Manhattan distance that is the most preferable for high-

dimensional data. Finally, we conduct an extensive experimental study on four diverse domain datasets—Net

Promoter Score segmentation, Dry Bean type, Wine type, and Forest Cover type—to prove that our FSL-

LFMG approach on the multiclass classiﬁcation outperforms the Tree Ensemble models and the One-vs-the-

rest classiﬁers by 7.8% in 1-shot and 2.5% in 5-shot learning.

1 INTRODUCTION

The increasing volume of data across various sectors,

such as telecommunications, agriculture, and ﬁnance,

has led to a pressing demand for effective multiclass

classiﬁcation (Hollmann et al., 2022). For instance,

the telecom industry has utilized many machine-

learning (ML) models, such as random forest, deci-

sion trees, and discriminant feature analysis, to fore-

cast customer attrition and enhance investment opti-

mization. These techniques strive to predict customer

behavior and enhance investment choices (Sikri et al.,

2024; Şahin, 2023; Abdulsalam et al., 2022; Loukili et al., 2022). In the field of agriculture, ML and

deep learning (DL) improve crop monitoring, yield

estimation, and productivity, showcasing their essen-

tial impact on improving farm management and pro-

ductivity (Attri et al., 2024; Khan et al., 2023; Adebiyi

et al., 2020). In the ﬁnance industry, ML and DL ad-

dress tasks such as risk assessment, pricing, and the

development of optimal insurance packages through

various methodologies such as artiﬁcial neural net-

works and clustering algorithms. These approaches

underscore the crucial role of data and the necessity

to adapt to changing ﬁnancial patterns for improved

decision-making and efﬁciency (Matloob et al., 2021;

Blier-Wong et al., 2020).

The widespread use of data in multiple industries typically involves tabular data, as demonstrated by a 2023 Kaggle survey of 14,000 data scientists. The poll indicated that a substantial portion of professionals within those industries, ranging from 50% to 90%, relied on tabular data in their work environments (Tunguz et al., 2023; Sun et al., 2019). The inclination towards tabular

data presents distinct challenges such as high dimen-

sionality, heterogeneity, and critical interdependen-

cies among features, which are not found in images or

other data modalities (Borisov et al., 2022). Despite

these challenges, the adoption of innovative multiclass classification methods is still growing, demonstrating the importance of those methods in enhancing

Nur, A., Ngan, C. and Bardeli, R.

FSL-LFMG: Few-Shot Learning with Augmented Latent Features and Multitasking Generation for Enhancing Multiclass Classiﬁcation on Tabular Data.

DOI: 10.5220/0012934200003837

In Proceedings of the 16th International Joint Conference on Computational Intelligence (IJCCI 2024), pages 531-542

ISBN: 978-989-758-721-4; ISSN: 2184-3236


decision-making and operational efﬁciency across in-

dustries.

Presently, the existing approaches for multiclass

classiﬁcation on tabular data can be broadly divided

into two categories: DL Models and Tree Ensemble

(TE) Models (Shwartz-Ziv and Armon, 2022). Re-

cent advances in DL models, such as TabNet, Neu-

ral Oblivious Decision Ensembles (NODE), and Dis-

junctive Normal Formulas (DNF-Net), have demon-

strated exceptional outcomes across diverse domain

datasets (Arik and Pﬁster, 2021; Katzir et al., 2020;

Popov et al., 2019). These models possess the abil-

ity to delve into intricate connections among features,

resulting in heightened efﬁciency and performance

for tasks involving high-dimensional, structured data.

Each model utilizes distinct mechanisms for process-

ing feature selection, which further improves their

overall effectiveness. However, these models present

challenges in terms of complexity and computation,

as well as interpretability. On the other hand, the

TE models, including Random Forest and Gradient

Boosting, offer enhanced interpretability and reduced

computational complexity. In particular, Gradient

Boosting, such as XGBoost, exhibits signiﬁcantly

better performance in tabular data compared to DL

models (Borisov et al., 2022; Shwartz-Ziv and Armon, 2022). However, the remarkable performance of these models relies heavily on copious amounts of training data, which are unavailable in some domains and require substantial storage space (Wang et al., 2021; Tian et al., 2020). Additionally, if the amount of training data is insufficient, the result is an overfitted model that lacks generalizability.

Few-shot learning (FSL) is an ML technique that

trains on a small number of labeled samples, typically

one to ﬁve samples per class, providing a potential so-

lution to the aforementioned issues (Li et al., 2023;

Wang et al., 2020). This technique enables efﬁcient

learning of multiclass classiﬁcation tasks with only

a limited amount of data (Parnami and Lee, 2022).

Although FSL has achieved noteworthy success in the domain of image classification, research on these techniques for tabular data remains largely underexplored (Nam et al., 2023). Furthermore, applying FSL in conjunction with TE models on tabular data is very challenging because of the models' limitations in generalizing from a few data samples per class.

An effort to address the limitations of TE mod-

els led to the implementation of the One-vs-the-rest

(OvR) multiclass technique (sklearn, 2024). This

method is speciﬁcally designed for multiclass clas-

siﬁcation and involves dividing the tasks into a se-

ries of binary tasks. The OvR classiﬁer strategy can

be integrated into various existing conventional ML models, including TE models, as the base estimators. This technique is expected to enhance the classification capabilities of tree-based models by splitting tasks into binary tasks. However, there is a risk that this technique provides suboptimal results due to the potential loss of significant data characteristics, such as complex inter-class correlation and interaction, which could result in unsatisfactory performance in classification tasks.

One way to improve the generalization ability of models is to augment the data, thereby increasing data variability. Data augmentation can be conducted at either the sample or the task level (Zhang and Liu, 2023).

At the sample level, typical methods for image data

involve modifying pixel properties through actions,

such as rotation, scaling, cropping, and other similar

approaches. These actions are performed to increase

the variety of data. In contrast, for tabular data there is currently no recognized approach for this data augmentation task. To tackle this issue, we investigate the use of autoencoders to extract significant latent features. Two methods have been tried in this context: one is to directly apply the extracted latent features to a classifier, and the other is to concatenate these encoded latent features with the original data in order to increase the number of features. The main idea here is to utilize the knowledge found in large datasets to improve the accuracy of multiclass classification. Despite its potential merits, this methodology is not reliable for comprehending new tasks, as it primarily concentrates on a single task, i.e., the process of learning to predict a single outcome (binary, multiclass, or continuous) from a labeled dataset. Hence, in addition to latent features, it is essential to employ task-level data augmentation methods that can improve the precision of classification while also enabling the model to learn new tasks efficiently. Task-level augmentation entails generating new tasks to offer the model a broader range of learning experience.

The incorporation of Self-generated Tasks from

unlabeled Tables (STUNT) is recognized as a promi-

nent task-level data augmentation strategy (Nam

et al., 2023). Through the treatment of data as un-

labeled, this technique has the potential to generate

various tasks for a single dataset. This outcome is

attained by applying the K-means clustering method

to create new labels. It is anticipated that the gen-

eration of self-tasks leads to effective generalization,

as the model acquires knowledge from multiple tasks,

i.e., the process of jointly learning to predict multi-

NCTA 2024 - 16th International Conference on Neural Computation Theory and Applications


ple outcomes on inputs of the same dataset, simul-

taneously. In order to capture the generalized knowl-

edge, a meta-learning scheme called Prototypical Net-

works (ProtoNet) is utilized as a classiﬁer (Snell et al.,

2017). In contrast to ProtoNet, tree-based models

lack capabilities to perform generalization on small

datasets because their complicated structures tend to

overﬁt speciﬁc training samples instead of capturing

broader patterns. A lack of data also makes methods,

such as bagging, less useful, resulting in trees that are

not varied and less-than-ideal decisions at splits (Biau

and Scornet, 2016). ProtoNet has proven to be highly

accurate and effective across various types of data

(Nam et al., 2023; Yu et al., 2022; Snell et al., 2017).

This approach successfully generates representative

prototypes or mean embeddings for each class by uti-

lizing Euclidean distance to determine the proximity

of a target task to its prototype.

To incorporate the advantages of the above meth-

ods into our approach, we propose advancing Pro-

toNet that employs augmented latent features (LF) by

an autoencoder and multitasking generation (MG) by

STUNT in the few-shot learning mechanism. Specif-

ically, the achieved contributions to this work are

threefold. First, we propose an FSL-LFMG frame-

work to develop an end-to-end few-shot multiclass

classiﬁcation workﬂow on tabular data. This frame-

work is composed of three main stages that include

(i) data augmentation at the sample level utilizing au-

toencoders to generate augmented LF, (ii) data aug-

mentation at the task level involving self-generating

multitasks using the STUNT approach, and (iii) the

learning process taking place on ProtoNet, followed

by various model evaluations in our FSL mechanism.

Second, due to the outlier and noise sensitivity of K-

means clustering (Arora et al., 2016) and the curse

of dimensionality of Euclidean distance (Yu et al.,

2022), we enhance and customize the STUNT ap-

proach by using K-medoids clustering that is less sen-

sitive to noisy outliers and Manhattan distance that

is the most preferable for high-dimensional data. Fi-

nally, we conduct an extensive experimental study on

four diverse domain datasets—Net Promoter Score

(NPS) segmentation, Dry Bean type, Wine type, and

Forest Cover type—to prove that our FSL-LFMG ap-

proach on the multiclass classiﬁcation outperforms

the TE models and the OvR classiﬁers by 7.8% in 1-

shot and 2.5% in 5-shot learning.

The remainder of this paper is organized as fol-

lows: Section 2 introduces our proposed FSL-LFMG

framework. Section 3 describes the process that

learns the LF by using autoencoders. Section 4 ex-

plains the MG approach by using the STUNT. Sec-

tion 5 explains the meta learning process using Pro-

toNet. Section 6 details the experimental results, anal-

yses, and discussion. Finally, in Section 7, we pro-

vide a conclusion and outline our future work for this

project.

2 FSL-LFMG FRAMEWORK

In this section, we describe and explain our proposed

FSL-LFMG framework that is an end-to-end pipeline

consisting of four main modules shown in Figure 1.

The modules include Data Preprocessing (DP), Latent

Features Augmentation (LFA), Multitasking Genera-

tion (MG), and Prototypical Network (PN). First, raw

tabular data is passed into the DP module that pro-

cesses and cleans the data in three separate steps in

sequence. In STEP 1, the data is divided into three

parts, i.e., training set, validation set, and test set. The

ratio among them is 64:16:20. For instance, in the

NPS telecom dataset, which comprises 100,000 sam-

ples, the dataset is divided into 64,000 samples for

the training set, 16,000 samples for the validation set,

and 20,000 samples for the test set. In STEP 2, to

deal with numerical features in the dataset, we apply

the Min-Max scaler to ensure that all those features

are on the same scale, for instance, between 0 and

1. The Min-Max scaling is deﬁned by the following

equation:

$$x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \tag{1}$$

where $x_{\text{scaled}}$ is the new scaled value, $x_{\min}$ is the minimum value of the feature, $x_{\max}$ is the maximum value of the feature, and $x$ is the original value. This approach helps minimize the influence of varying scales

and measurements among different numerical fea-

tures. For example, the ’data usage’ feature of our

NPS dataset has a range of 100 to 90,000, while the

’upload’ feature has a range of 1 to 1,250. By us-

ing the Min-Max scaler, we transform these features

into the same range of 0 to 1. In STEP 3, to manage

categorical features, we employ the one-hot encoding

technique to encode categorical data into numerical

ones suitable for ML models to understand. For in-

stance, the ’tariff’ feature has three unique categorical

values, i.e., Level 1, Level 2, and Level 3. By per-

forming the one-hot encoding technique on this fea-

ture, we convert the values in the categorical variable

into a numeric form, i.e., the binary variable (1/0),

which can be understood by the model while main-

taining its categorical nature.
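The three preprocessing steps can be illustrated with a minimal, dependency-free sketch. The function names, the fixed random seed, and the choice to compute the minimum and maximum from whatever column is passed in are assumptions made for illustration only; they are not taken from the authors' implementation.

```python
import random

def split_indices(n, seed=0):
    """STEP 1: shuffle row indices and split them 64:16:20."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = n * 64 // 100
    n_val = n * 16 // 100
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def min_max_scale(column):
    """STEP 2: Equation (1), mapping a numerical column onto [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def one_hot(column):
    """STEP 3: encode a categorical column as 0/1 indicator vectors."""
    categories = sorted(set(column))
    encoded = [[1 if v == c else 0 for c in categories] for v in column]
    return encoded, categories

train, val, test = split_indices(100_000)
print(len(train), len(val), len(test))
print(min_max_scale([100, 45_050, 90_000]))
print(one_hot(["Level 1", "Level 3", "Level 2"]))
```

In practice the scaling bounds would usually be estimated on the training split and reused for the validation and test splits; the text does not state which variant is used.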

After the data is cleaned, it is passed into the

LFA module that augments the existing features with

the latent features learned from the autoencoder. This


Figure 1: FSL-LFMG Framework.

module aims to increase the data variation at the sam-

ple level by increasing the number of input features.

Speciﬁcally, the core of the LFA module involves

training the autoencoder and then utilizing the trained

encoder to extract the most important and compact

latent features, which are then concatenated with the

existing features to obtain a larger number of features.

This process is further detailed in Section 3.

After the features are augmented, they are fed

into the MG module, where the STUNT methodol-

ogy (Nam et al., 2023) is implemented to facilitate

the multitasking generation process. This approach is

designed to address the challenges of FSL and aims to

enhance the diversity of data at the task level by gen-

erating a range of diverse tasks. In each task, a ran-

dom selection process is conducted with a speciﬁed

number of samples according to the designated sup-

port and query sets. The support set consists of exam-

ples that are used for training, while the query set con-

tains examples that are used for testing. For instance,

in the i-th task, where i = 1, ..., m, in the 1-shot setting with three classes, one sample is randomly selected from each class, so the support set contains three samples. The number of queries is then set at fifteen samples per class to evaluate the performance of the model on the i-th task. This process is repeated for the predetermined number of tasks. In our work, we employ

K-medoids rather than K-means for the task genera-

tion process, as K-medoids is a more robust method

for overcoming the inﬂuence of noisy outliers in the

dataset. In contrast to the original method, which uti-

lizes K-means for segmentation and produces pseudo

labels that resemble existing labels, our work employs

K-medoids to improve the accuracy and reliability of

the results. This process is thoroughly explained in

Section 4.
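The support/query sampling described in this paragraph can be sketched as follows. The function name `sample_episode`, its parameters, and the toy label list are hypothetical; the sketch only assumes that every class has at least `n_shot + n_query` samples.

```python
import random

def sample_episode(labels, n_shot=1, n_query=15, seed=0):
    """Draw one task: n_shot support and n_query query indices per class."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    support, query = [], []
    for members in by_class.values():
        # Sample without replacement so support and query never overlap.
        chosen = rng.sample(members, n_shot + n_query)
        support.extend(chosen[:n_shot])
        query.extend(chosen[n_shot:])
    return support, query

# Toy data: three classes with 100 samples each.
labels = [i % 3 for i in range(300)]
support, query = sample_episode(labels)
print(len(support), len(query))
```

With three classes in the 1-shot setting this yields three support indices and forty-five query indices, matching the counts given in the text.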

Finally, the tasks corresponding to the selected

data and features are passed into our meta-learning

paradigm of the PN module, which generalizes from minimal examples by shaping a metric space conducive to distance-based classification, as explained in detail in Section 5. This holistic

framework not only addresses the complexities inher-

ent in FSL but also sets a new benchmark for process-

ing tabular datasets more efﬁciently and accurately.

3 LATENT FEATURES LEARNING AND AUGMENTATION

The process of latent features learning and augmenta-

tion consists of three steps, as follows:


Figure 2: A high-level architecture of autoencoders adapted from Ye and Wang (2023).

Step 1: Encoding. Autoencoders are employed in

high-dimensional data for feature extraction to gener-

ate compact representations that accurately reﬂect the

original data (Ye and Wang, 2023). This technique is

particularly advantageous for image and video data,

as it minimizes storage requirements. For tabular

data, autoencoders extract critical features that aim to

replicate the original dataset’s characteristics fully.

The training of autoencoders shown in Figure 2

utilizes a substantial portion of data to capture preva-

lent attributes, yielding the encoder, depicted by green

layers that consist of three hidden layers (i.e., $h_1$, $h_2$, and $h_3$) to map the input features $x$ to a latent representation $z$ that extracts the significant features. The decoder, shown in the blue layers that consist of three hidden layers symmetrical to the encoder's hidden layers (i.e., $h'_1$, $h'_2$, and $h'_3$), reconstructs the input features $\hat{x}$ from $z$, aiming to minimize the difference between $x$ and $\hat{x}$. The trained encoder can then transform new input data into latent representations $z$ that are useful for downstream tasks, including data augmentation and classification.

Mathematically, the encoder E and decoder D can

be formulated in the following transformations:

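$$\text{Encoder}: \begin{cases} h_1 = \sigma(W_1^{(E)} x + b_1^{(E)}) \\ h_2 = \sigma(W_2^{(E)} h_1 + b_2^{(E)}) \\ h_3 = \sigma(W_3^{(E)} h_2 + b_3^{(E)}) \\ z = \sigma(W_4^{(E)} h_3 + b_4^{(E)}) \end{cases}, \tag{2}$$

$$\text{Decoder}: \begin{cases} h'_1 = \sigma(W_1^{(D)} z + b_1^{(D)}) \\ h'_2 = \sigma(W_2^{(D)} h'_1 + b_2^{(D)}) \\ h'_3 = \sigma(W_3^{(D)} h'_2 + b_3^{(D)}) \\ \hat{x} = \sigma(W_4^{(D)} h'_3 + b_4^{(D)}) \end{cases}, \tag{3}$$

where $x$ is a set of input features, $z$ is a set of latent features, $\hat{x}$ is a set of reconstructed input features, $W_i$ is a weight matrix, $b_i$ is a bias, $h_j$ is a hidden layer, and $\sigma$ is the ReLU activation function shown in Equation (4), for $i = 1, 2, 3, 4$ and $j = 1, 2, 3$,

$$\sigma = \text{ReLU}(x) = \max(0, x). \tag{4}$$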
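The transformations in Equations (2) to (4) amount to four ReLU layers in the encoder and four in the decoder. The following forward-pass sketch uses plain Python lists; the layer widths, the random initialization, and all function names are hypothetical and chosen only to show the symmetric shape of the architecture, not to reproduce the trained model.

```python
import random

def relu(v):
    """Equation (4), applied element-wise."""
    return [max(0.0, a) for a in v]

def dense(W, b, v):
    """One layer: sigma(W v + b) with sigma = ReLU."""
    return relu([sum(w * a for w, a in zip(row, v)) + bias
                 for row, bias in zip(W, b)])

def init_layers(sizes, rng):
    """Random (W, b) pairs for consecutive layer sizes."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((W, [0.0] * n_out))
    return layers

def forward(layers, v):
    for W, b in layers:
        v = dense(W, b, v)
    return v

rng = random.Random(0)
enc_sizes = [8, 6, 5, 4, 3]                  # x -> h1 -> h2 -> h3 -> z (assumed widths)
encoder = init_layers(enc_sizes, rng)        # Equation (2)
decoder = init_layers(enc_sizes[::-1], rng)  # Equation (3), mirrored widths

x = [rng.random() for _ in range(8)]
z = forward(encoder, x)      # latent features
x_hat = forward(decoder, z)  # reconstruction
print(len(z), len(x_hat))
```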

In this work, we develop a three-layer symmet-

ric autoencoder architecture with the ReLU activation

functions to regularize the process. During the learn-

ing process, we utilize the mean squared error (MSE)

loss criterion to optimize the architecture by using the

Adam optimizer with a learning rate set at $10^{-3}$. To prevent overfitting, early stopping is implemented

with a patience parameter of 5 that not only ensures

the optimal model performance but also reduces the

likelihood of training divergence. The optimization of

the autoencoder, quantified using the MSE between the original inputs $x$ and the reconstructed outputs $\hat{x}$, can be defined as follows:

$$\text{MSE} = L(x, \hat{x}) = \frac{1}{N} \sum_{i=1}^{N} \| x_i - \hat{x}_i \|^2. \tag{5}$$

This loss function MSE shown in Equation (5) guides

the training process that encourages the model to ﬁnd

the most representative latent features, where N is the

total number of data instances in the training set. The

Adam optimizer, which adaptively adjusts the learn-

ing rate for each parameter based on the estimations

of ﬁrst and second moments of the gradients and the

learning rate $\eta = 10^{-3}$, can be defined as follows:

$$W^{(E)}, b^{(E)}, W^{(D)}, b^{(D)} \leftarrow \text{Adam}(\nabla L, \eta). \tag{6}$$
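To make the training criterion concrete, the sketch below computes the reconstruction loss of Equation (5) and applies an early-stopping rule with a patience of 5. It assumes that the monitored quantity is a per-epoch validation loss, which the text does not specify; the function names and the example loss values are hypothetical, and the Adam update itself is not reproduced.

```python
def mse(X, X_hat):
    """Equation (5): mean over N instances of the squared L2 error."""
    n = len(X)
    total = sum(sum((a - b) ** 2 for a, b in zip(x, xh))
                for x, xh in zip(X, X_hat))
    return total / n

def early_stopping_epoch(val_losses, patience=5):
    """Return the epoch index at which training would stop."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0  # improvement: reset the counter
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1      # patience never exhausted

print(mse([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 2.0]]))
print(early_stopping_epoch([1.0, 0.9, 0.8, 0.85, 0.86, 0.9, 0.81, 0.82, 0.7]))
```

In the second call the best loss occurs at epoch 2 and the following five epochs do not improve on it, so training stops at epoch 7 and the later value of 0.7 is never reached.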

Once the encoder is trained, it transforms the

original data x to the latent features z shown in Figure 3.

Step 2: Min-Max Scaling of Latent Features.

After obtaining the latent representations z, min-max

scaling is applied, using Equation (1), to ensure that

the latent features have the same scale as the original

features. This step is crucial to maintain consistency


Figure 3: Process of augmentation latent features.

and enhance the effectiveness of the subsequent data

augmentation process.

Step 3: Concatenation with Original Input Features.

Once scaled, these latent features are concatenated

with the original dataset to generate the augmented

data, denoted as $x_{\text{augmented}}$, shown in Equation (7).

This augmented dataset enhances the overall feature

set and increases data variability, which is expected

to improve the model’s generalization capabilities.

$$x_{\text{augmented}} = [x : z] \tag{7}$$

This augmented dataset is then used in the MG

process.
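Steps 2 and 3 can be sketched in a few lines. The helper names and the toy values are hypothetical; the sketch assumes the original features `X` are already scaled to [0, 1] and that the latent features `Z` come from a trained encoder.

```python
def min_max_scale_columns(rows):
    """Step 2: apply Equation (1) to every latent column."""
    scaled_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled_cols.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(r) for r in zip(*scaled_cols)]

def augment(X, Z):
    """Step 3: Equation (7), concatenating original and latent features."""
    Z_scaled = min_max_scale_columns(Z)
    return [list(x) + z for x, z in zip(X, Z_scaled)]

X = [[0.0, 1.0], [0.5, 0.0], [1.0, 0.5]]  # already scaled original features
Z = [[2.0], [4.0], [6.0]]                 # unscaled latent features (toy values)
print(augment(X, Z))
```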

4 MULTITASKING GENERATION

Data augmentation at the task level aims to build common knowledge by learning from multiple tasks derived from a single dataset. STUNT is a specific framework that

can generate those multiple diverse tasks using the K-

means clustering as a pseudo-label generator (Nam

et al., 2023). The idea behind this approach is that

each feature can serve as a label for the other features.

For instance, Figure 4 is an original dataset with three

input features (x) (i.e., complaints, data usage, and

age) and a target binary variable (y) (i.e., cancellation

(yes/no)). In this example, we assume that there is a

positive correlation between complaints and cancel-

lation, from which we can use complaints as a new

target variable and use the other variables as the input

features. By generalizing this concept, we can ﬁrst

Figure 4: Example data from telecom dataset, adapted from

Nam et al. (2023).

treat the data as unlabeled; the STUNT method then randomly selects some features and applies the K-means algorithm to cluster the data. For each cluster, the pseudo-label is obtained from the center of the cluster, also known as the centroid. By iteratively applying the task generation process with different combinations of features, we can generate corresponding new datasets with their own labels from the original dataset.

More speciﬁcally, following Nam et al. (2023),

given a dataset X of unlabeled tabular data, we

summarize their approach and formalize the process

as follows:

Step 1: Masking Ratio Sampling. Sample a masking ratio $p$ from a uniform distribution over a hyperparameter range $[r_1, r_2]$, where $0 < r_1 < r_2 < 1$.

Step 2: Binary Mask Creation. Generate a random binary mask $m \in \{0, 1\}^d$, where $d$ is the number of features and the sum of elements in $m$ is $\lfloor dp \rfloor$, where $\lfloor \cdot \rfloor$ is the floor function applied to $dp$.

Step 3: Column Selection. Use the mask m to

select columns from the unlabeled data X. The

selected data is denoted by sq(x ◦ m), where ◦

indicates element-wise multiplication, and sq(·) rep-

resents a squeezing operation that removes elements

corresponding to zeros in m.

Step 4: K-means Clustering. Apply K-means clustering on the selected columns to generate pseudo-labels $\tilde{y}$. The objective function for the K-means is given by:

$$\min_{C \in \mathbb{R}^{\lfloor dp \rfloor \times k}} \frac{1}{N} \sum_{i=1}^{N} \min_{\tilde{y}_i \in \{0,1\}^k} \| \text{sq}(x_i \circ m) - C \tilde{y}_i \|_2^2, \quad \text{such that } \tilde{y}_i^{\top} 1_k = 1, \tag{8}$$

where $C$ is the centroid matrix, $k$ is the number of centroids, $1_k$ is a vector of ones, and $x_i$ represents the $i$-th sample in the dataset. $\| \cdot \|_2^2$ indicates the squared Euclidean distance, used here to measure the distance between the transformed data points and the cluster centroids.

Step 5: Data Perturbation. To prevent trivial learning by the classifier, perturb the selected column features by:

$$\tilde{x} := m \circ \hat{x} + (1 - m) \circ x, \tag{9}$$

where $\hat{x}$ is sampled from the empirical marginal distribution of each column feature.


Step 6: Task Definition. The generated task $T_{\text{STUNT}}$ from the process is defined as:

$$T_{\text{STUNT}} := \{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^{N}. \tag{10}$$
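The masking, column selection, and perturbation steps above can be sketched as follows. The function names and the toy data are hypothetical, and the clustering of Step 4 that produces the pseudo-labels is deliberately left out of this sketch.

```python
import math
import random

def sample_mask(d, r1, r2, rng):
    """Steps 1-2: draw p ~ U(r1, r2) and a binary mask with floor(d * p) ones."""
    p = rng.uniform(r1, r2)
    n_selected = math.floor(d * p)
    chosen = set(rng.sample(range(d), n_selected))
    return [1 if j in chosen else 0 for j in range(d)]

def squeeze(x, m):
    """Step 3: sq(x o m), keeping only the entries where the mask is 1."""
    return [v for v, keep in zip(x, m) if keep]

def perturb(X, m, rng):
    """Step 5, Equation (9): masked columns are replaced by draws from the
    empirical marginal of that column; unmasked columns stay unchanged."""
    X_tilde = []
    for x in X:
        row = [rng.choice(X)[j] if m[j] else v for j, v in enumerate(x)]
        X_tilde.append(row)
    return X_tilde

rng = random.Random(0)
X = [[1, 10, 100], [2, 20, 200], [3, 30, 300], [4, 40, 400]]
m = sample_mask(d=3, r1=0.4, r2=0.9, rng=rng)
print(m, [squeeze(x, m) for x in X])
print(perturb(X, m, rng))
```

Drawing a random row and reading off column `j` is one simple way to sample from the empirical marginal of that column; other sampling schemes would serve the same purpose.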

Figure 5: Multitasking generation using K-medoids,

adapted from Nam et al. (2023).

In our work, we enhance and customize the

STUNT approach by using the K-medoids as an al-

ternative clustering method to the K-means shown in

Figure 5 that is a high-level overview of the modiﬁed

STUNT approach. We propose this approach because

the K-medoids clustering is more robust to outliers,

as it uses the actual data points as the centers, also

known as medoids, thereby avoiding the inﬂuence of

extreme values, unlike K-means. Thus, in Equation (8) above, we replace the K-means clustering with the K-medoids clustering.

More precisely, we apply the K-medoids clustering on the selected features to generate the pseudo-label $\tilde{y}$. The objective function for the K-medoids clustering using the Manhattan distance is given by:

$$\min_{C \in \mathbb{R}^{\lfloor dp \rfloor \times k}} \frac{1}{N} \sum_{i=1}^{N} \min_{\tilde{y}_i \in \{0,1\}^k} \| \text{sq}(x_i \circ m) - C \tilde{y}_i \|_1, \quad \text{such that } \tilde{y}_i^{\top} 1_k = 1, \tag{11}$$

where $C$ is the centroid matrix, which here is the medoid matrix, $k$ is the number of medoids, and $x_i$ represents the $i$-th sample in the dataset. $\| \cdot \|_1$ represents the Manhattan distance, used to measure the distance between the transformed data points and the cluster medoids. The medoids are selected from the dataset $X$, and each data point $x_i$ is assigned to the nearest medoid based on the Manhattan distance.
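A minimal sketch of K-medoids with the Manhattan distance is given below. It uses the simple alternating scheme (assign every point to its nearest medoid, then pick the member with the smallest total distance as the new medoid), which is only one of several K-medoids variants and is not necessarily the one used in our experiments; the function names and the toy points are hypothetical.

```python
import random

def manhattan(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))

def assign(points, medoids):
    """Pseudo-label of each point: index of its nearest medoid."""
    return [min(range(len(medoids)),
                key=lambda c: manhattan(p, points[medoids[c]]))
            for p in points]

def k_medoids(points, k, n_iter=100, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)  # medoids are data indices
    for _ in range(n_iter):
        labels = assign(points, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # can happen with duplicate points
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member minimizing total distance to its cluster.
            best = min(members, key=lambda i: sum(
                manhattan(points[i], points[j]) for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(points, medoids), medoids

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, medoids = k_medoids(points, k=2)
print(labels, [points[i] for i in medoids])
```

Because every medoid is an index into the data, the cluster centers are always actual data points, which is the property that makes the method less sensitive to extreme values.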

After the completion of diverse tasks generation,

the learning process is undertaken by ProtoNet, which

aims to develop a model capable of generalizing

based on diverse inputs from various tasks.

5 PROTOTYPICAL NETWORKS

ProtoNet is a neural network that employs meta-learning to learn a variety of tasks. Specifically, after

the MG module generates the diverse tasks by Equa-

tion (10), data samples are taken from a collection of

those tasks. For each task, the support (S ) set and the

query (Q ) set are then selected. As shown in Figure

6, i.e., a high-level framework of few-shot learning

concept using ProtoNet, the model is trained on the

support set and evaluated on the query set, with the

meta-learner being updated based on the query set’s

performance. Following this, the meta-learner is uti-

lized for adaptation and prediction on a new test set

using a fresh batch of labeled data, with a small por-

tion serving as the support test set.

Several advantages have been identiﬁed by Nam

et al. (2023) regarding the use of ProtoNet as an em-

bedding function or learner in few-shot settings, in-

cluding ﬂexible centroids, agnostic application, and

optimal performance. Flexible centroids refer to the

adaptability of ProtoNet to various cases by adjust-

ing the number of k or centroids. Agnostic appli-

cation allows for the direct application of this archi-

tecture to tabular data without signiﬁcant difﬁculty.

Additionally, ProtoNet has demonstrated strong per-

formance in various modalities, as reported in some

studies (Nam et al., 2023; Yu et al., 2022; Snell et al.,

2017). This study also highlights the ﬂexibility of

ProtoNet as a beneﬁt. The original ProtoNet employs

the Euclidean distance for metric learning, but the re-

search work conducted by Yu et al. (2022) on image

classiﬁcation suggests that the Manhattan distance is

a strong substitute for this metric, potentially improv-

ing performance. It would therefore be intriguing to

apply this substitution to tabular data, as the Manhat-

tan distance has advantages over the Euclidean one,

especially in high-dimensional data. The differences

between the Euclidean and the Manhattan distance are

visually shown in Figure 7. The Euclidean distance

(i.e., the red line) measures the shortest straight-line

distance between the two points that is calculated by

using the Pythagorean theorem. The Manhattan dis-

tance (i.e., the blue path) measures the distance be-

tween the two points by summing the absolute differ-

ences of their coordinates.
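The two distances can be compared on a small example. The function names and the points are chosen for illustration only.

```python
import math

def euclidean(a, b):
    """Straight-line distance (Pythagorean theorem)."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(u - v) for u, v in zip(a, b))

p, q = (0, 0), (3, 4)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```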

In this work, the architecture of ProtoNet follows

Figure 6: Few-shot learning concept using ProtoNet

adapted from Snell et al. (2017).


Figure 7: Comparison between Euclidean distance and

Manhattan distance.

a multilayer perceptron (MLP) design that consists of

a 2-layer fully connected neural network with a hid-

den dimension of 1,024, as recommended by Nam

et al. (2023). Given a task selected by Equation (10),

we construct the classifier using episodic training. Training episodes are created by selecting random subsets of classes and examples, with some examples acting as the support set $S$ and the query set $Q$ from each task. Prototypical networks create a prototype, or an average representation, for each class using an embedding function $F_\theta$ with a learnable parameter $\theta$. Each prototype is the mean of the embedded points in its class, calculated as follows:

$$c_k = \frac{1}{|S_k|} \sum_{(\tilde{x}_i, \tilde{y}_i) \in S_k} F_\theta(\tilde{x}_i), \tag{12}$$

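where $S_k$ is the support set associated with the prototype $k$. Using a distance function $d$, which is the Manhattan distance, the network calculates the probability of a class for a query point $\tilde{x}$ by taking a softmax over distances to the prototypes:

$$p_\theta(y = k \mid \tilde{x}; S) = \frac{\exp(-d(F_\theta(\tilde{x}), c_k))}{\sum_{k'} \exp(-d(F_\theta(\tilde{x}), c_{k'}))}, \tag{13}$$

where the Manhattan distance is

$$d(F_\theta(\tilde{x}), c_k) = \sum_{i=1}^{n} \left| F_\theta(\tilde{x})_i - c_{k,i} \right|, \tag{14}$$

with the sum running over the $n$ coordinates of the embedding.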

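Next, we compute the cross-entropy loss on the classifier $p_\theta$ as follows:

$$L_{\text{CE}}(p_\theta(\tilde{x}), \tilde{y}) = - \sum_{j=1}^{k} (\tilde{y})_j \log p_\theta(y = j \mid \tilde{x}; S). \tag{15}$$

The ultimate objective is to minimize the meta-learning loss over diverse tasks generated by Equation (10) as follows:

$$L_{\text{meta}}(\theta, Q) := \sum_{(\tilde{x}, \tilde{y}) \in Q} L_{\text{CE}}(p_\theta(\tilde{x}), \tilde{y}). \tag{16}$$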
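The prototype, softmax, and loss computations can be sketched on already-embedded points. The embedding function is omitted here (inputs are treated as if they were embeddings), and the function names, class names, and toy values are hypothetical.

```python
import math

def manhattan(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))

def prototypes(support_embeddings, support_labels):
    """Equation (12): per-class mean of the embedded support points."""
    sums, counts = {}, {}
    for e, y in zip(support_embeddings, support_labels):
        acc = sums.setdefault(y, [0.0] * len(e))
        for i, v in enumerate(e):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def class_probabilities(query_embedding, protos):
    """Equation (13): softmax over negative Manhattan distances."""
    logits = {k: -manhattan(query_embedding, c) for k, c in protos.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - top) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

def cross_entropy(probs, true_label):
    """Cross-entropy for a one-hot label: minus log-probability of the true class."""
    return -math.log(probs[true_label])

protos = prototypes([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]], ["a", "a", "b"])
probs = class_probabilities([1.0, 1.0], protos)
print(protos, probs, cross_entropy(probs, "a"))
```

Summing `cross_entropy` over all query points of a task gives the quantity that the meta-learning objective minimizes.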

After we finish the model training, we adapt the obtained model with the few-shot samples $(x, y)$, where $y$ is the existing label, using 100 different seeds. Finally, using an independent test set,

we compute the mean test accuracy of this few-shot

learning process.

6 EXPERIMENTAL RESULTS, ANALYSES, AND DISCUSSION

In our experimental studies, we utilize datasets from four different domains. Three of them are publicly available from the UCI Machine Learning Repository: Wine (Aeberhard and Forina, 1991), Dry Bean (UCI, 2020), and Forest Cover Type (Blackard, 1998). The fourth is a proprietary dataset provided by a telecommunications corporation and relates to NPS segmentation. The Wine dataset contains 178 instances and 13 attributes that are used to classify wine variants. The Dry Bean dataset includes 13,611 instances and 16 attributes for classifying different types of beans. The Forest Cover Type dataset is composed of 581,012 instances and 54 attributes that are used to predict forest cover types from cartographic variables. The proprietary NPS segmentation dataset consists of customer demographic profiles and feedback data, segmented into promoters, passives, and detractors based on their likelihood to recommend the company's services. Table 1 describes these four datasets in detail, including the number of instances and attributes, as well as the primary classification objective of each dataset.

During our experimental evaluations, we examine various TE models as benchmarks, including Random Forest, CatBoost, and the One-vs-Rest (OvR) Classifier. We also combine these three baseline models with augmentation techniques utilizing autoencoders to enhance the feature representation and improve classification performance. Random Forest, known for its robustness and ease of implementation, provides a strong baseline through its ensemble of decision trees. CatBoost, a gradient boosting algorithm, is particularly effective in handling categorical features and improving accuracy. The OvR Classifier, a strategy for multiclass classification, breaks down the problem into multiple binary classification tasks. In addition to these models, we employ the standard STUNT framework as a comparison benchmark for our proposed method. The STUNT framework, known for its comprehensive approach to generating multiple tasks in the tabular data setting, serves as a rigorous benchmark for evaluating the efficacy of our proposed enhancements. Table 2 describes these baseline methods in more detail.

Table 1: Summary of Datasets.

| No | Name | # instances | # features | Description | Source |
|---|---|---|---|---|---|
| 1 | Net Promoter Score (NPS) segmentation | 100,000 | 11 | Predict a customer segmentation into three groups (promoters, passives, detractors) based on demographics and customer experiences. | Private |
| 2 | Dry Bean types | 13,611 | 16 | Predict seven various sorts of dry beans according to market conditions, including form, shape, type, and structure. | Public |
| 3 | Wine types | 178 | 13 | Predict three different types of wines using the findings of a chemical analysis of wines grown in the same region of Italy. | Public |
| 4 | Forest cover types | 581,012 | 54 | Predict seven forest cover classes based on variables such as elevation, aspect, slope, hill shade, soil type, and others. | Public |

Table 2: Baselines Details.

| No | Methods | Description |
|---|---|---|
| 1 | Random Forests (RF) | An ensemble of tree predictors, where each tree's predictions are based on the values of a random vector that is separately sampled and has the same distribution for all trees in the forest. |
| 2 | CatBoost (CB) | A gradient boosting method that utilizes binary decision trees as its base predictors. |
| 3 | Autoencoders (AE) + Classifier | Only the encoded features are used to train the classifier (RF or CB). |
| 4 | Concatenation Autoencoders (ConcatAE) + Classifier | The original and encoded features are concatenated and used to train the classifier (RF or CB). |
| 5 | One-vs-the-rest (OvR) multiclass strategy | The one-vs-the-rest (OvR) multiclass strategy, often referred to as one-vs-all, involves training a separate classifier for each class. |
| 6 | Self-generated Tasks from unlabeled Tables (STUNT) | A few-shot tabular learning system that utilizes meta-learning to train on self-generated problems derived from unlabeled tables. |
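The difference between the "AE + Classifier" and "ConcatAE + Classifier" baselines in Table 2 lies only in which feature vector is passed to the classifier. The sketch below illustrates this under a loud assumption: `encode` is a stand-in for the encoder half of a trained autoencoder (here a single linear layer with ReLU and hand-picked, untrained weights), since the actual architecture is not given in this section.

```python
def encode(x, weights):
    """Hypothetical encoder: one linear layer followed by ReLU.
    A real autoencoder's encoder would be trained, not hand-set."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

def ae_features(x, weights):
    """'AE + Classifier': only the encoded (latent) features."""
    return encode(x, weights)

def concat_ae_features(x, weights):
    """'ConcatAE + Classifier': original features followed by latent ones."""
    return list(x) + encode(x, weights)

weights = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]  # illustrative values only
x = [2.0, 4.0, 6.0]
latent = ae_features(x, weights)            # [0.0, 6.0]
augmented = concat_ae_features(x, weights)  # [2.0, 4.0, 6.0, 0.0, 6.0]
```

Either vector can then be passed to RF or CB unchanged; the concatenated variant keeps the raw attributes available to the trees while adding the latent ones.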

In 1-shot learning, we observe varying levels of performance among the baseline models in terms of mean test accuracy. The results in Table 3 illustrate several noteworthy trends across the datasets examined. First, the RF classifier outperforms the CB classifier by 0.74% on average (51.68 vs. 51.30), although CB is ahead on the NPS and Wine datasets. Second, the OvR strategy, specifically on CB, outperforms both baseline models (RF and CB) in average accuracy, improving on CB by 0.99%. Additionally, the ConcatAE approach surpasses both the OvR strategy by 2.7% on average and the base models by 2.8% on average. All percentages in this comparison are relative gains in the accuracy averaged over the four datasets. These findings suggest that employing techniques such as OvR and ConcatAE can enhance classification accuracy in 1-shot learning scenarios. Compared to standard STUNT, our method, which employs ConcatAE in conjunction with K-medoids clustering and Manhattan ProtoNet, achieves the highest mean test accuracy on all four datasets and improves the average accuracy by 4.03%, showcasing the strength of this approach in 1-shot learning classification.
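The reported 1-shot percentages can be reproduced from the Average column of Table 3 if they are read as relative gains, (new - old) / old. This reading is an assumption inferred from the numbers, as the formula is not stated explicitly; the short check below uses only values copied from Table 3.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0

# Average column of Table 3 (1-shot setting).
avg_1shot = {"RF": 51.68, "CB": 51.30, "OvR CB": 51.81,
             "STUNT": 53.39, "Ours": 55.54}

rf_vs_cb = relative_gain(avg_1shot["RF"], avg_1shot["CB"])
ovr_cb_vs_cb = relative_gain(avg_1shot["OvR CB"], avg_1shot["CB"])
ours_vs_stunt = relative_gain(avg_1shot["Ours"], avg_1shot["STUNT"])

print(round(rf_vs_cb, 2), round(ovr_cb_vs_cb, 2), round(ours_vs_stunt, 2))
# prints: 0.74 0.99 4.03
```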

In 5-shot learning, the performance patterns observed in Table 4 are similar to those seen in the 1-shot setting, both for the base models and for their combinations with augmentation techniques. For instance, the RF classifier still outperforms the CB classifier by 2.04% on average, with CB ahead only on Cover Type. The OvR strategy, specifically on CB, again outperforms both baseline models (RF and CB) in average accuracy, improving on CB by 2.13%. In addition, the ConcatAE approach surpasses both the OvR strategy by 0.99% and the base models by 1.09%. These findings suggest that employing techniques such as OvR and ConcatAE can enhance classification accuracy in 5-shot learning scenarios. Compared to standard STUNT, our method, which employs ConcatAE in conjunction with K-medoids clustering and Manhattan ProtoNet, achieves the highest mean test accuracy on 3 out of 4 datasets (NPS, Dry Bean, and Wine) and improves the average accuracy by 1.47%, showcasing the strong performance of this approach in 5-shot learning classification.

Table 3: Mean test accuracy (%) on 1-shot setting.

| Methods | NPS | Dry Bean | Wine | Cover Type | Average |
|---|---|---|---|---|---|
| RF | 32.74 | 68.41 | 81.39 | 24.19 | 51.68 |
| CB | 34.01 | 64.48 | 84.17 | 22.54 | 51.30 |
| AE + RF | 34.41 | 63.14 | 85.56 | 22.41 | 51.38 |
| AE + CB | 33.99 | 61.77 | 86.53 | 22.57 | 51.22 |
| ConcatAE + RF | 32.54 | 68.47 | 87.64 | 23.76 | 53.10 |
| ConcatAE + CB | 33.14 | 66.15 | 87.92 | 23.81 | 52.76 |
| OvR RF | 32.31 | 69.24 | 81.25 | 22.46 | 51.32 |
| OvR CB | 33.42 | 69.54 | 81.81 | 22.47 | 51.81 |
| AE + OvR RF | 33.99 | 60.49 | 86.94 | 21.38 | 50.70 |
| AE + OvR CB | 32.91 | 63.03 | 86.81 | 22.02 | 51.19 |
| ConcatAE + OvR RF | 31.77 | 68.28 | 87.36 | 22.39 | 52.45 |
| ConcatAE + OvR CB | 32.40 | 70.63 | 87.36 | 23.43 | 53.46 |
| STUNT (k-Means + Euclidean ProtoNet) | 35.69 | 67.44 | 85.75 | 24.66 | 53.39 |
| ConcatAE + STUNT | 34.47 | 70.43 | 87.64 | 23.76 | 54.07 |
| ConcatAE + k-Medoid + Manhattan ProtoNet | 36.06 | 71.17 | 88.86 | 26.07 | 55.54 |

Table 4: Mean test accuracy (%) on 5-shot setting.

| Methods | NPS | Dry Bean | Wine | Cover Type | Average |
|---|---|---|---|---|---|
| RF | 39.56 | 84.37 | 92.50 | 35.73 | 63.04 |
| CB | 38.51 | 82.68 | 88.47 | 37.44 | 61.78 |
| AE + RF | 39.47 | 80.62 | 89.58 | 31.49 | 60.29 |
| AE + CB | 39.96 | 79.84 | 90.97 | 31.78 | 60.64 |
| ConcatAE + RF | 40.52 | 84.68 | 92.92 | 34.91 | 63.26 |
| ConcatAE + CB | 40.06 | 83.71 | 89.86 | 38.07 | 62.93 |
| OvR RF | 39.55 | 84.54 | 92.92 | 33.54 | 62.64 |
| OvR CB | 39.88 | 85.25 | 91.81 | 35.44 | 63.10 |
| AE + OvR RF | 38.69 | 79.85 | 90.28 | 30.52 | 59.84 |
| AE + OvR CB | 39.47 | 81.69 | 92.64 | 31.69 | 61.37 |
| ConcatAE + OvR RF | 40.18 | 84.66 | 94.03 | 33.43 | 63.08 |
| ConcatAE + OvR CB | 40.53 | 85.53 | 93.15 | 35.82 | 63.91 |
| STUNT (k-Means + Euclidean ProtoNet) | 40.76 | 83.48 | 94.03 | 34.72 | 63.25 |
| ConcatAE + STUNT | 40.93 | 84.15 | 95.00 | 31.76 | 62.96 |
| ConcatAE + k-Medoid + Manhattan ProtoNet | 41.25 | 85.62 | 95.28 | 34.58 | 64.18 |
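The "3 out of 4 datasets" statement can be checked directly against Table 4 by taking the best method per column. The snippet below is a small verification sketch; the values are copied from Table 4, while the variable and function names are illustrative.

```python
DATASETS = ["NPS", "Dry Bean", "Wine", "Cover Type"]
OURS = "ConcatAE + k-Medoid + Manhattan ProtoNet"

# Table 4 (5-shot): per-dataset mean test accuracy in percent.
TABLE_4 = {
    "RF": [39.56, 84.37, 92.50, 35.73],
    "CB": [38.51, 82.68, 88.47, 37.44],
    "AE + RF": [39.47, 80.62, 89.58, 31.49],
    "AE + CB": [39.96, 79.84, 90.97, 31.78],
    "ConcatAE + RF": [40.52, 84.68, 92.92, 34.91],
    "ConcatAE + CB": [40.06, 83.71, 89.86, 38.07],
    "OvR RF": [39.55, 84.54, 92.92, 33.54],
    "OvR CB": [39.88, 85.25, 91.81, 35.44],
    "AE + OvR RF": [38.69, 79.85, 90.28, 30.52],
    "AE + OvR CB": [39.47, 81.69, 92.64, 31.69],
    "ConcatAE + OvR RF": [40.18, 84.66, 94.03, 33.43],
    "ConcatAE + OvR CB": [40.53, 85.53, 93.15, 35.82],
    "STUNT (k-Means + Euclidean ProtoNet)": [40.76, 83.48, 94.03, 34.72],
    "ConcatAE + STUNT": [40.93, 84.15, 95.00, 31.76],
    OURS: [41.25, 85.62, 95.28, 34.58],
}

def best_method_per_dataset(table, datasets):
    """Return the method with the highest accuracy in each dataset column."""
    best = {}
    for col, name in enumerate(datasets):
        best[name] = max(table, key=lambda method: table[method][col])
    return best

best = best_method_per_dataset(TABLE_4, DATASETS)
wins = [d for d in DATASETS if best[d] == OURS]
print(wins)                # ['NPS', 'Dry Bean', 'Wine']
print(best["Cover Type"])  # ConcatAE + CB
```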

We also observe a significant difference when comparing the scenario with augmentation, i.e., ConcatAE + k-Medoid + Manhattan ProtoNet (our approach), against the scenarios without augmentation, i.e., RF, CB, OvR RF, and OvR CB. Figure 8 shows that the method with augmentation improves accuracy in both the 1-shot and 5-shot settings. Specifically, our approach outperforms the tree ensemble (TE) models and the OvR classifiers by 7.8% in the 1-shot setting and 2.5% in the 5-shot setting. This enhancement underscores the effectiveness of our approach in multiclass classification, demonstrating better generalization than models without augmentation.
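One reading that reproduces the 7.8% and 2.5% figures is the relative gain of our approach over the mean of the four non-augmented baselines (RF, CB, OvR RF, OvR CB), using the Average columns of Tables 3 and 4. This is an assumption inferred from the numbers, not a formula stated in the text; the function name is illustrative.

```python
from statistics import mean

def gain_over_baselines(ours, baselines):
    """Relative gain (in percent) of `ours` over the mean baseline accuracy."""
    base = mean(baselines)
    return (ours - base) / base * 100.0

# Average column: RF, CB, OvR RF, OvR CB versus our approach.
one_shot = gain_over_baselines(55.54, [51.68, 51.30, 51.32, 51.81])
five_shot = gain_over_baselines(64.18, [63.04, 61.78, 62.64, 63.10])

print(round(one_shot, 1), round(five_shot, 1))
# prints: 7.8 2.5
```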

Figure 8: The effect of data augmentation techniques compared to base models.

7 CONCLUSIONS AND FUTURE WORK

In this paper, we propose an advanced ProtoNet that employs augmented LF generated by an autoencoder and MG by STUNT in the few-shot learning mechanism. Specifically, the contributions of this work are threefold. First, we propose an FSL-LFMG framework to develop an end-to-end few-shot multiclass classification workflow on tabular data. This framework is composed of three main stages: (i) data augmentation at the sample level, utilizing autoencoders to generate augmented LF, (ii) data augmentation at the task level, involving self-generated multitasks using the STUNT approach, and (iii) the learning process taking place on ProtoNet, followed by various model evaluations in our FSL mechanism. Second, because K-means clustering is sensitive to outliers and noise and Euclidean distance suffers from the curse of dimensionality, we enhance and customize the STUNT approach by using K-medoids clustering, which is less sensitive to noisy outliers, and Manhattan distance, which is preferable for high-dimensional data. Finally, we conduct an extensive experimental study on datasets from four diverse domains (NPS segmentation, Dry Bean type, Wine type, and Forest Cover type) to show that our FSL-LFMG approach to multiclass classification outperforms the TE models and the OvR classifiers by 7.8% in 1-shot and 2.5% in 5-shot learning. Moving forward, we plan to investigate more data augmentation techniques for tabular data, including variational autoencoders and generative adversarial networks. We also aim to explore more state-of-the-art few-shot learning techniques, such as meta-learning algorithms and advanced metric learning approaches, which have shown promising results on tabular data in other real-world domains.

REFERENCES

Abdulsalam, S. O., Arowolo, M. O., Saheed, Y. K., and Afolayan, J. O. (2022). Customer churn prediction in telecommunication industry using classification and regression trees and artificial neural network algorithms. Indonesian Journal of Electrical Engineering and Informatics (IJEEI), 10(2):431–440.

Adebiyi, M. O., Ogundokun, R. O., Abokhai, A. A., et al. (2020). Machine learning–based predictive farmland optimization and crop monitoring system. Scientifica, 2020.

Aeberhard, S. and Forina, M. (1991). Wine. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5PC7J.

Arik, S. O. and Pfister, T. (2021). TabNet: Attentive Interpretable Tabular Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(8):6679–6687.

Arora, P., Varshney, S., et al. (2016). Analysis of k-means and k-medoids algorithm for big data. Procedia Computer Science, 78:507–512.

Attri, I., Awasthi, L. K., and Sharma, T. P. (2024). Machine learning in agriculture: a review of crop management applications. Multimedia Tools and Applications, 83(5):12875–12915.

Biau, G. and Scornet, E. (2016). A random forest guided tour. Test, 25:197–227.

Blackard, J. (1998). Covertype. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C50K5N.

Blier-Wong, C., Cossette, H., Lamontagne, L., and Marceau, E. (2020). Machine learning in p&c insurance: A review for pricing and reserving. Risks, 9(1):4.

Borisov, V., Leemann, T., Seßler, K., Haug, J., Pawelczyk, M., and Kasneci, G. (2022). Deep neural networks and tabular data: A survey. IEEE Transactions on Neural Networks and Learning Systems.

Hollmann, N., Müller, S., Eggensperger, K., and Hutter, F. (2022). Tabpfn: A transformer that solves small tabular classification problems in a second. arXiv preprint arXiv:2207.01848.

Katzir, L., Elidan, G., and El-Yaniv, R. (2020). Net-dnf: Effective deep modeling of tabular data. In International conference on learning representations.

Khan, M. S., Nath, T. D., Hossain, M. M., Mukherjee, A., Hasnath, H. B., Meem, T. M., and Khan, U. (2023). Comparison of multiclass classification techniques using dry bean dataset. International Journal of Cognitive Computing in Engineering, 4:6–20.

Li, W., Wang, Z., Yang, X., Dong, C., Tian, P., Qin, T., Huo, J., Shi, Y., Wang, L., Gao, Y., et al. (2023). Libfewshot: A comprehensive library for few-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Loukili, M., Messaoudi, F., and El Ghazi, M. (2022). Supervised learning algorithms for predicting customer churn with hyperparameter optimization. International Journal of Advances in Soft Computing & Its Applications, 14(3).

Matloob, I., Khan, S. A., Hussain, F., Butt, W. H., Rukaiya, R., and Khalique, F. (2021). Need-based and optimized health insurance package using clustering algorithm. Applied Sciences, 11(18):8478.

Nam, J., Tack, J., Lee, K., Lee, H., and Shin, J. (2023). Stunt: Few-shot tabular learning with self-generated tasks from unlabeled tables. arXiv preprint arXiv:2303.00918.

Parnami, A. and Lee, M. (2022). Learning from few examples: A summary of approaches to few-shot learning. arXiv preprint arXiv:2203.04291.

Popov, S., Morozov, S., and Babenko, A. (2019). Neural oblivious decision ensembles for deep learning on tabular data. arXiv preprint arXiv:1909.06312.

Şahin, C. (2023). Predicting base station return on investment in the telecommunications industry: Machine-learning approaches. Intelligent Systems in Accounting, Finance and Management, 30(1):29–40.

Shwartz-Ziv, R. and Armon, A. (2022). Tabular data: Deep learning is not all you need. Information Fusion, 81:84–90.

Sikri, A., Jameel, R., Idrees, S. M., and Kaur, H. (2024). Enhancing customer retention in telecom industry with machine learning driven churn prediction. Scientific Reports, 14(1):13097.

sklearn (2024). sklearn Documentation. https://scikit-learn.org/0.15/modules/generated/sklearn.multiclass.OneVsRestClassifier.html. Accessed: April 4, 2024.

Snell, J., Swersky, K., and Zemel, R. (2017). Prototypical networks for few-shot learning. Advances in neural information processing systems, 30.

Sun, B., Yang, L., Zhang, W., Lin, M., Dong, P., Young, C., and Dong, J. (2019). Supertml: Two-dimensional word embedding for the precognition on structured tabular data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0.

Tian, Y., Wang, Y., Krishnan, D., Tenenbaum, J. B., and Isola, P. (2020). Rethinking few-shot image classification: a good embedding is all you need? In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages 266–282. Springer.

Tunguz, B., Dieter, or Tails, H., Kapoor, K., Pandey, P., Mooney, P., Culliton, P., Mulla, R., Bhutani, S., and Cukierski, W. (2023). 2023 kaggle ai report.

UCI (2020). Dry Bean. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C50S4B.

Wang, R., Pontil, M., and Ciliberto, C. (2021). The role of global labels in few-shot classification and how to infer them. Advances in Neural Information Processing Systems, 34:27160–27170.

Wang, Y., Yao, Q., Kwok, J. T., and Ni, L. M. (2020). Generalizing from a few examples: A survey on few-shot learning. ACM computing surveys (csur), 53(3):1–34.

Ye, A. and Wang, Z. (2023). Modern deep learning for tabular data: novel approaches to common modeling problems. Springer.

Yu, Z., Wang, K., Xie, S., Zhong, Y., and Lv, Z. (2022). Prototypical network based on manhattan distance. Cmes-Comput. Model. Eng. Sci, 131:655–675.

Zhang, R. and Liu, Q. (2023). Learning with few samples in deep learning for image classification, a mini-review. Frontiers in Computational Neuroscience, 16:1075294.

NCTA 2024 - 16th International Conference on Neural Computation Theory and Applications
