ConvKAN: Towards Robust, High-Performance and Interpretable Image Classification
Achref Ouni¹, Chafik Samir¹, Yousef Bouaziz¹, Anis Fradi²
¹Laboratoire LIMOS, CNRS UMR 6158, Université Clermont Auvergne, 63170 Aubière, France
²Université de Lyon, Lyon 2, ERIC EA3083, France
{achref.elouni, youssef.bouaziz, samir.chafik}@uca.fr, anis.fradi@univ-lyon2.fr
Keywords: ConvKAN, Classification, CNN, Robustness.
Abstract: This paper introduces ConvKAN, a novel convolutional model for image classification in artificial vision systems. ConvKAN integrates Kolmogorov-Arnold Networks (KANs) with convolutional layers within Convolutional Neural Networks (CNNs). We demonstrate that this combination outperforms standard CNN-MLP architectures and state-of-the-art methods. Our study investigates the impact of this integration on classification performance across benchmarks and assesses the robustness of ConvKAN models compared to established CNN architectures. Varied and extensive experimental results show that ConvKAN achieves substantial gains in accuracy, precision, and recall, surpassing current state-of-the-art methods.
1 INTRODUCTION
The primary aim of this research is to develop an
effective artificial vision solution for image clas-
sification using a novel architecture called Con-
vKAN, which integrates Kolmogorov-Arnold Net-
works (KANs) into Convolutional Neural Networks
(CNNs). This integration leverages the powerful fea-
ture extraction capabilities of CNNs and the sophisti-
cated modeling of complex relationships provided by
KANs.
Computer vision solutions have greatly benefited
from advances in deep learning techniques. Notewor-
thy examples include CNN-based solutions for ob-
ject tracking (Amosa et al., 2023; Meinhardt et al.,
2022; Chen et al., 2023), recognition (Mathis et al.,
2022; Wotton and Gunes, 2020; Yu and Xiong, 2022),
and scene understanding (Fan et al., 2023; Balazevic
et al., 2024; Shi et al., 2024). These successful sys-
tems highlight the power of deep-learning techniques
for automating image analysis. However, the inher-
ent complexities of diverse image content and con-
texts pose new challenges, requiring the development
of adaptable and robust solutions.
Traditional CNN architectures, despite their suc-
cess, face limitations in handling complex, non-
linear relationships within data. Multilayer percep-
trons (MLPs), commonly used in conjunction with
CNNs for high-level decision-making, can struggle
with these complexities. While CNNs excel at fea-
ture extraction, MLPs are tasked with classification
and regression, which can be suboptimal for intricate
data patterns.
1.1 Related Work
The field of image classification has seen signif-
icant advancements with the development of deep
learning techniques, particularly Convolutional Neu-
ral Networks (CNNs). Early architectures such as
LeNet (LeCun et al., 1998) and AlexNet (Krizhevsky
et al., 2012) laid the foundation for modern CNNs by
demonstrating their effectiveness in image recogni-
tion tasks. Subsequent architectures like VGG (Si-
monyan and Zisserman, 2014), ResNet (He et al.,
2016), and Inception (Szegedy et al., 2015) intro-
duced innovations such as deeper networks, residual
connections, and multi-scale processing, further en-
hancing performance. Recent studies have explored
the integration of Vision Transformers (ViTs) with
CNNs to leverage the strengths of both architectures
for improved image classification (Dosovitskiy et al.,
2020; Jmour et al., 2018; Nath et al., 2014; Kim et al.,
2022). Additionally, the use of MobileNetV2 and Ef-
ficientNet has been investigated for their efficiency
in mobile and edge computing environments (Sandler
et al., 2018; Tan and Le, 2019).
Kolmogorov-Arnold Networks (KANs) (Liu
et al., 2024) represent a recent innovation in neural
network architectures, leveraging the Kolmogorov-
Arnold representation theorem to model high-
dimensional functions as compositions of simpler
univariate functions. This approach has shown
promise in various applications, including satellite
image classification (Liu et al., 2024), hyperspectral
image analysis (Genet and Inzirillo, 2024), and
remote sensing (Wang et al., 2024). The integration
of KANs with CNNs has been explored to enhance
the model’s ability to capture complex patterns
and dependencies in the data, leading to improved
performance in diverse image classification tasks (Liu
et al., 2024; Genet and Inzirillo, 2024).
Building on these developments, our work inte-
grates KANs into CNN architectures, creating the
ConvKAN model. This integration aims to enhance
the model’s ability to capture complex patterns and
improve classification performance, addressing the
limitations of traditional CNN-MLP architectures.
1.2 Proposed Approach: ConvKAN
In light of these advancements, we propose Con-
vKAN, a novel architecture that integrates KANs into
CNNs. This approach aims to address the limita-
tions of traditional CNN-MLP architectures by en-
hancing the model’s ability to capture complex, non-
linear relationships within input images. By replacing
traditional fully connected layers with KAN layers,
ConvKAN leverages the strengths of both CNNs and
KANs.
The main contributions of this work are as fol-
lows:
1. Novel Architecture. Introduction of ConvKAN,
combining CNNs and KANs to improve image
classification performance.
2. Enhanced Performance. Demonstration of sig-
nificant improvements in accuracy, precision, and
recall across multiple datasets.
3. Robustness and Generalization. Evaluation of
ConvKAN’s robustness to common image distor-
tions and transformations.
4. Comprehensive Evaluation. Extensive ex-
periments and comparisons with state-of-the-art
methods to validate the effectiveness of Con-
vKAN.
The remainder of this paper is organized as fol-
lows: Section 2 reviews the background. Section 3 de-
tails the proposed ConvKAN architecture. Section 4
presents the experimental setup and results. Section 5
discusses the findings and implications. Finally, Sec-
tion 6 concludes the paper and suggests directions for
future research.
2 BACKGROUND
The Arnold-Kolmogorov superposition theorem, also
known as the Kolmogorov-Arnold representation the-
orem, is a significant result in the theory of func-
tions of several variables. It states that any continu-
ous multivariate function can be expressed as a super-
position (nested composition) of sums and univariate
functions. Formally, if f is a continuous function of n
variables, there exist continuous univariate functions
$\Phi_q$ and $\psi_{q,p}$ such that:

$$f(x_1, \ldots, x_n) = \sum_{q=0}^{2n} \Phi_q \left( \sum_{p=1}^{n} \psi_{q,p}(x_p) \right)$$
where the outer sum is over 2n + 1 terms and the in-
ner sums are over the n inputs. This theorem pro-
vides a profound insight, indicating that any continu-
ous multivariate function, regardless of its complex-
ity, can be constructed from simpler univariate build-
ing blocks. This universality has significant impli-
cations in fields such as approximation theory, neu-
ral networks, and the understanding of multidimen-
sional functions. However, this decomposition is
not unique; there are multiple ways to represent the
same multivariate function using this theorem. Addi-
tionally, while theoretically powerful, finding explicit
constructions for the inner and outer functions can be
computationally challenging.
Popular methods for approximating multivariate
functions include Least Squares Regression, Neu-
ral Networks, and Sparse Representation Techniques.
These methods offer promising solutions but still face
challenges due to the non-uniqueness of decomposi-
tions, the curse of dimensionality, and the high com-
putational cost of finding optimal parameters, partic-
ularly for complex functions. The search for effective
and practical decomposition methods remains an ac-
tive research area, with applications in machine learn-
ing, scientific computing, and data analysis.
B-splines are favored for their flexibility, numer-
ical stability, and compact representation, making
them a practical choice for decomposing multivariate
functions. They offer advantages such as adaptabil-
ity to complex functions, computational efficiency,
and interpretable models. However, challenges re-
main in knot selection and potential computational
complexity for high-dimensional functions. Despite
these challenges, B-splines are a valuable tool, with
ongoing research continually improving their effec-
tiveness.
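To make the spline machinery concrete, the following minimal NumPy sketch evaluates a univariate function as a learned linear combination of B-spline basis functions via the Cox-de Boor recursion. The knot vector, degree, and coefficient values are illustrative choices, not a configuration taken from this paper.

```python
import numpy as np

def bspline_basis(x, knots, i, k):
    """Evaluate the i-th B-spline basis function of degree k at x
    (Cox-de Boor recursion)."""
    if k == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = right = 0.0
    if knots[i + k] != knots[i]:
        left = (x - knots[i]) / (knots[i + k] - knots[i]) \
               * bspline_basis(x, knots, i, k - 1)
    if knots[i + k + 1] != knots[i + 1]:
        right = (knots[i + k + 1] - x) / (knots[i + k + 1] - knots[i + 1]) \
                * bspline_basis(x, knots, i + 1, k - 1)
    return left + right

# A univariate function phi(x) = sum_j c_j * B_j(x); in a KAN the
# coefficients c_j are the trainable parameters.
knots = np.linspace(-1.0, 1.0, 10)   # uniform knot vector
coeffs = np.random.randn(6)          # 6 cubic basis functions fit 10 knots
phi = lambda x: sum(c * bspline_basis(x, knots, j, 3)
                    for j, c in enumerate(coeffs))
print(phi(0.3))
```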
Recently, Kolmogorov-Arnold Networks (KANs)
have been proposed as an innovative neural network
architecture directly inspired by this theorem. They
offer an intriguing alternative to traditional Multi-
Layer Perceptrons (MLPs), with potential benefits in
expressiveness, interpretability, and scalability. Un-
like MLPs, where activation functions are fixed and
applied to nodes, KANs utilize learnable activation
functions on the edges (weights). These activation
functions are typically represented by B-splines, pro-
viding greater flexibility in modeling relationships be-
tween variables. A simplified version involves inner
layers that learn univariate functions (one for each
input variable) using B-splines, and an output layer
that combines the inner layers’ outputs using another
learned function, also represented by B-splines.
KAN Architecture. Consider the original two-layer KAN structure with an input dimension of $n$ and a middle-layer width of $2n+1$. Unlike traditional neural networks, activation functions in KANs are placed on the edges between nodes, with nodes performing a simple summation of incoming activations. Each univariate function $\phi_{i,j}$ within the network is parameterized using B-splines. To model more complex functions, the architecture has been generalized to create deeper and wider KANs. In this generalized form, a KAN layer is defined by a matrix $\Phi$ composed of univariate functions $\{\phi_{i,j}(\cdot)\}$. The shape of a KAN is represented as an array $[n_0, n_1, \ldots, n_L]$, where $n_i$ denotes the number of nodes in the $i$-th layer. A general KAN network is a composition of multiple KAN layers, and given an input vector $x_0 \in \mathbb{R}^{n_0}$, the output of the KAN is:

$$\text{KAN}(x) = (\Phi_{L-1} \circ \Phi_{L-2} \circ \cdots \circ \Phi_1 \circ \Phi_0)(x). \quad (1)$$
This representation allows the network to perform
successive transformations through the flexible and
powerful combination of B-splines and neural net-
work structures.
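To illustrate Eq. (1) in code, the sketch below implements a single simplified KAN layer in PyTorch: each input-output edge carries its own learnable univariate function, represented as a linear combination of fixed basis functions, and each output node sums its incoming edge activations. For brevity, Gaussian bumps stand in for B-splines here; this is a minimal sketch under that simplification, not the reference KAN implementation.

```python
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    """Simplified KAN layer: one learnable univariate function per edge,
    parameterized as a linear combination of fixed basis functions."""
    def __init__(self, in_dim, out_dim, n_basis=8):
        super().__init__()
        # Basis-function centers spread over the expected input range.
        self.register_buffer("centers", torch.linspace(-2.0, 2.0, n_basis))
        # One coefficient vector per edge: shape (out_dim, in_dim, n_basis).
        self.coeffs = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, n_basis))

    def forward(self, x):  # x: (batch, in_dim)
        # Evaluate every basis function at every input component.
        b = torch.exp(-((x.unsqueeze(-1) - self.centers) ** 2))  # (batch, in_dim, n_basis)
        # phi_{j,i}(x_i) = sum_k coeffs[j,i,k] * b_k(x_i); nodes sum over edges i.
        return torch.einsum("bik,jik->bj", b, self.coeffs)

y = KANLayer(16, 4)(torch.randn(2, 16))  # -> shape (2, 4)
```

Stacking such layers reproduces the composition of Eq. (1).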
Applications and Implications. The integration of
KANs into neural network architectures opens up
new possibilities for various applications. In machine
learning, KANs can be used to improve the accuracy
and interpretability of models in tasks such as image
classification, object detection, and time series anal-
ysis. In scientific computing, KANs can enhance the
modeling of complex physical systems by providing
more accurate approximations of multivariate func-
tions. Additionally, KANs have potential applications
in data analysis, where they can be used to uncover
hidden patterns and relationships in high-dimensional
datasets.
Challenges and Future Directions. Despite their po-
tential, KANs also present several challenges. The
computational complexity of training KANs, partic-
ularly for high-dimensional inputs, remains a signifi-
cant hurdle. Additionally, the selection of appropri-
ate B-spline parameters and the design of efficient
training algorithms are critical areas for further re-
search. Future work may focus on developing more
efficient algorithms for training KANs, exploring al-
ternative representations for the univariate functions,
and investigating the theoretical properties of KANs
in greater depth.
In summary, the Kolmogorov-Arnold representa-
tion theorem provides a powerful theoretical foun-
dation for the development of KANs, which offer a
promising alternative to traditional neural network ar-
chitectures. By leveraging the flexibility and expres-
siveness of B-splines, KANs have the potential to sig-
nificantly advance the state of the art in various fields,
from machine learning to scientific computing.
3 CONVOLUTIONAL KAN
In traditional Convolutional Neural Networks
(CNNs), convolutional layers are typically followed
by fully connected layers that handle tasks like clas-
sification or regression. Mathematically, these fully
connected layers perform a linear transformation on
the flattened output from the previous convolutional
layer, and then apply an activation function to the
result.
$$h_{\text{FC}} = \sigma(W_{\text{FC}} \, h_{\text{flat}} + b_{\text{FC}}) \quad (2)$$

where
- $h_{\text{FC}}$ denotes the output of the fully connected layer,
- $h_{\text{flat}}$ represents the flattened output of the preceding convolutional layer,
- $W_{\text{FC}}$ and $b_{\text{FC}}$ are the weight matrix and bias vector of the fully connected layer,
- $\sigma$ denotes the activation function, such as ReLU or sigmoid.
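For reference, Eq. (2) amounts to a few lines of PyTorch; the tensor shapes here are illustrative.

```python
import torch
import torch.nn as nn

h_conv = torch.randn(8, 64, 4, 4)     # example feature map from the last conv layer
h_flat = h_conv.flatten(start_dim=1)  # flattened output, shape (8, 1024)
fc = nn.Linear(1024, 10)              # weight matrix W_FC and bias vector b_FC
h_fc = torch.relu(fc(h_flat))         # activation sigma chosen as ReLU
```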
3.1 ConvKAN
Our proposed approach, as illustrated in Figure 1, en-
tails substituting the traditional fully connected layers
of a CNN with KAN layers. This novel architectural
modification is inspired by feature mapping and di-
mensionality reduction techniques commonly used in
computer vision. We hypothesize that by providing
only the most salient features to the KAN layers, we
can partially mitigate the complexity associated with
non-linear transformations. By integrating KAN lay-
ers into the network, we aim to capitalize on their
potential advantages in terms of expressiveness, in-
terpretability, and scalability. This strategic integra-
tion could lead to enhanced performance and offer
deeper insights into the learned representations within
the network.
Integrating KAN layers into CNNs involves re-
placing traditional fully connected layers with these
novel layers, allowing for the construction of deeper
architectures that enhance the network’s ability to
capture intricate data patterns and relationships.
Initially, input data is processed through several convolutional layers, responsible for extracting spatial features from the input images. The output of these convolutional layers is a feature map, denoted as $h_{\text{conv}}$. This feature map contains rich spatial information about the input data, captured through multiple convolutional operations.

To prepare this data for further processing by the KAN layers, the feature map is flattened into a one-dimensional vector, referred to as $h_{\text{flat}}$. The components of $h_{\text{flat}}$ are denoted as $h_{\text{flat},p}$, where $p$ indexes the individual elements of the flattened vector. These components serve as the input for the KAN layers, where they are processed to capture more complex and non-linear relationships in the data. The KAN layers employ univariate functions $\phi_{q,p}$ to transform the input components $h_{\text{flat},p}$; these transformations are then aggregated using additional functions $\Phi_q$. This process can be expressed mathematically as:

$$h_{\text{KAN}} = \sum_{q=0}^{2n} \Phi_q \left( \sum_{p=1}^{n} \phi_{q,p}(h_{\text{flat},p}) \right) \quad (3)$$

where
- $h_{\text{KAN}}$ is the output of the KAN layer,
- $n$ is the number of input variables (the dimensionality of $h_{\text{flat}}$),
- $\phi_{q,p}$ are the univariate functions applied to each input component,
- $\Phi_q$ are the aggregation functions that combine the transformed input components.
This approach allows the network to leverage the
power of the representation theorem, potentially lead-
ing to improved performance and a deeper under-
standing of the learned representations.
3.2 Advantages of ConvKAN
The integration of KAN layers into CNNs offers sev-
eral advantages:
Enhanced Expressiveness. KAN layers can
model complex, non-linear relationships more ef-
fectively than traditional fully connected layers.
Improved Interpretability. The use of univariate
functions and B-splines in KAN layers provides a
more interpretable framework for understanding
the transformations applied to the data.
Scalability. KAN layers can be scaled to deeper
architectures, allowing for the construction of
more powerful models.
Robustness. The ability to capture intricate data
patterns and relationships makes ConvKAN mod-
els more robust to variations in the input data.
3.3 Implementation Details
To implement ConvKAN, we replace the fully con-
nected layers in a standard CNN architecture with
KAN layers. This involves defining the univariate functions $\phi_{q,p}$ and the aggregation functions $\Phi_q$ using B-splines. The training process for ConvKAN follows the standard backpropagation algorithm, with modifications to account for the learnable parameters in the KAN layers.
$$\text{Loss} = \text{CrossEntropy}(y_{\text{true}}, y_{\text{pred}}) + \lambda \sum_{q,p} \lVert \phi_{q,p} \rVert^2 \quad (4)$$

where $y_{\text{true}}$ and $y_{\text{pred}}$ are the true and predicted labels, respectively, and $\lambda$ is a regularization parameter.
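A minimal sketch of this objective is shown below. It assumes KAN layers that expose the coefficients of their univariate functions as a `coeffs` tensor, as in the `KANLayer` sketch of Section 2; that attribute name is our convention, not a fixed API.

```python
import torch.nn.functional as F

def convkan_loss(logits, targets, kan_layers, lam=1e-4):
    """Eq. (4): cross-entropy plus an L2 penalty on the learnable
    coefficients of the univariate functions in the KAN layers."""
    ce = F.cross_entropy(logits, targets)
    reg = sum((layer.coeffs ** 2).sum() for layer in kan_layers)
    return ce + lam * reg
```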
By integrating KAN layers into CNNs, we aim to
create models that are not only more powerful and
expressive but also more interpretable and scalable.
This novel approach has the potential to significantly
advance the state of the art in various computer vision
tasks.
4 EXPERIMENTS
In this section, we present experiments to assess the effectiveness of the proposed models on the datasets summarized in Table 1. Each dataset is evaluated using mean Average Precision (mAP), except for the Places365 dataset, which is evaluated using Top-5 accuracy.
Figure 1: ConvKAN: The full architecture with KAN layers.

All experiments were implemented and tested on a desktop machine with 2 GPUs (NVIDIA P5000, 16 GB each). We report results for two different architectures as well as comparative methods: (1)
CNN-MLP, which comprises a classical CNN with
fully connected layers, and (2) CNN-KAN, integrating Kolmogorov-Arnold Network (KAN) layers into the
CNN structure. As alternatives, we also evaluate
MLP and KAN as the last layers added to pretrained
models, specifically VGG16 (Simonyan and Zisser-
man, 2014), VGG19 (Koonce and Koonce, 2021),
ResNet50 (He et al., 2016), and ResNet101 (Zhang,
2022). In these cases, we replace the conventional
fully connected layers with KAN layers, freeze the
convolutional (conv) layers, and train only the classi-
fication layers. This methodology enables us to evalu-
ate the performance enhancements achieved by incor-
porating KAN layers into these established architec-
tures for classification tasks. In all experiments, the
evaluation metrics aim to quantify the performance
improvement and the computational time of CNN-
KAN based models.
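This pretrained-backbone setup can be sketched with torchvision as follows: load pretrained weights, freeze every backbone parameter, and swap the final fully connected layer for a KAN layer (the `KANLayer` sketch from Section 2). The class count of 101 is only an example, chosen for Caltech-101.

```python
import torchvision.models as models

num_classes = 101  # e.g., Caltech-101
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
for p in model.parameters():  # freeze the convolutional backbone
    p.requires_grad = False
# Replace the conventional fully connected head; only this layer is trained.
model.fc = KANLayer(model.fc.in_features, num_classes)
```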
We now present the results obtained from experiments conducted on state-of-the-art databases. Table 2 provides a performance comparison between Convolutional Neural Networks (CNNs) combined with Kolmogorov-Arnold Networks (KANs) and Multi-Layer Perceptrons (MLPs) across four datasets: CIFAR-10, CIFAR-100, Caltech-256, and Caltech-101.
Two CNN architectures (CNN-MLP and CNN-
KAN) are developed for image classification, both
featuring three convolutional layers with SELU acti-
vation and max pooling, followed by two fully con-
nected layers. The first convolutional layer uses 32
filters, while the second and third employ 64 filters,
both with a 3 × 3 kernel size and padding of 1. Max
pooling operations with a 2 × 2 kernel size and stride
of 2 are applied after each convolutional layer. In the
CNN-KAN architecture, traditional fully connected
layers are replaced with KAN layers.
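Under these hyperparameters, the two networks can be sketched as follows. The hidden width of 256 between the two final layers is our assumption, since the paper does not state it, and `KANLayer` is the simplified sketch from Section 2.

```python
import torch.nn as nn

def conv_backbone():
    """Three conv layers (32, 64, 64 filters, 3x3 kernels, padding 1),
    each followed by SELU and 2x2 max pooling with stride 2."""
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.SELU(), nn.MaxPool2d(2, 2),
        nn.Conv2d(32, 64, 3, padding=1), nn.SELU(), nn.MaxPool2d(2, 2),
        nn.Conv2d(64, 64, 3, padding=1), nn.SELU(), nn.MaxPool2d(2, 2),
        nn.Flatten(),
    )

# For 32x32 inputs (CIFAR), the flattened feature size is 64 * 4 * 4 = 1024.
cnn_mlp = nn.Sequential(conv_backbone(),
                        nn.Linear(1024, 256), nn.SELU(), nn.Linear(256, 10))
cnn_kan = nn.Sequential(conv_backbone(),
                        KANLayer(1024, 256), KANLayer(256, 10))
```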
Table 1: Summary of datasets.

Dataset       | Classes | Images/Class | Total Images
CIFAR-10      | 10      | 6,000        | 60,000
CIFAR-100     | 100     | 600          | 60,000
Caltech-101   | 101     | 40-800       | 9,000
Caltech-256   | 256     | 80           | 30,607
EMNIST        | 62      | -            | 70,000
MNIST         | 10      | -            | 70,000
DTD           | 47      | 120          | 5,640
SVHN          | 10      | 1,000        | 73,257
STL-10        | 10      | 500          | 13,000
OxfordIIITPet | 37      | 200          | 7,349
Places365     | 365     | -            | 1,800,000

Across all datasets, CNN-KAN consistently
achieves higher Precision, Recall, and mAP com-
pared to CNN-MLP. This indicates that integrating
KAN into CNN architectures leads to improved per-
formance in image classification tasks, highlighting
the effectiveness of KAN in enhancing model accu-
racy across diverse datasets.
4.1 Analysis of Robustness
The robustness of the CNN-KAN models is evi-
dent from their consistent performance across diverse
datasets. This robustness can be attributed to the en-
hanced expressiveness and flexibility of the KAN lay-
ers, which are better suited for capturing complex,
non-linear relationships in the data.
CIFAR-10 and CIFAR-100. The CIFAR-10 and
CIFAR-100 datasets, which contain 10 and 100
classes of images respectively, show significant im-
provements in precision, recall, and mAP when using
the CNN-KAN architecture. The ability of KAN lay-
ers to model intricate patterns and dependencies in the
data contributes to these gains.
Caltech-256 and Caltech-101. Similar improvements are observed in the Caltech-256 and Caltech-101 datasets, which are known for their diverse and challenging image categories. The CNN-KAN architecture's superior performance on these datasets underscores its robustness and adaptability to various image classification tasks.

Table 2: Performance comparison between CNN-MLP and CNN-KAN across four datasets.

Dataset     | Model   | Precision | Recall | mAP
CIFAR-10    | CNN-MLP | 0.85      | 0.84   | 0.83
CIFAR-10    | CNN-KAN | 0.90      | 0.89   | 0.88
CIFAR-100   | CNN-MLP | 0.75      | 0.74   | 0.73
CIFAR-100   | CNN-KAN | 0.80      | 0.79   | 0.78
Caltech-256 | CNN-MLP | 0.65      | 0.64   | 0.63
Caltech-256 | CNN-KAN | 0.70      | 0.69   | 0.68
Caltech-101 | CNN-MLP | 0.78      | 0.77   | 0.76
Caltech-101 | CNN-KAN | 0.83      | 0.82   | 0.81
4.2 Computational Efficiency
The integration of KAN layers into CNN architec-
tures not only improves accuracy but also enhances
the model’s ability to generalize across different
datasets. This is particularly important in real-world
applications where the variability in data can signif-
icantly impact model performance. The KAN lay-
ers provide a more flexible and powerful framework
for learning complex data representations, leading to
more robust and reliable models.
While the CNN-KAN models require slightly
more computational resources due to the additional
complexity of the KAN layers, the performance gains
justify the increased computational cost. The training
times for CNN-KAN models are marginally higher
than those for CNN-MLP models, but the improve-
ments in accuracy and robustness make them a worth-
while investment for applications where precision is
critical.
Tables 4 and 5 present a comprehensive compar-
ison of performance metrics between Kolmogorov-
Arnold Networks (KAN) and Multi-Layer Percep-
trons (MLP) integrated within Convolutional Neural
Networks (CNN) across various architectures, includ-
ing ResNet50, ResNet101, VGG16, and VGG19, and
multiple datasets. The tables report precision, recall,
and mean average precision (mAP) scores for each
model and dataset. The results demonstrate that KAN
consistently outperforms MLP in terms of precision
and recall for almost all datasets and architectures.
This indicates that the integration of KAN into CNN
architectures leads to improved classification accu-
racy, showcasing the effectiveness of KAN in enhanc-
ing model performance for diverse image classifica-
tion tasks.
Figure 2: The impact of increasing KAN layers in CNN
tests on: CIFAR-100 (top left), CIFAR-10 (top right),
Caltech-256 (bottom left), and Caltech-101 (bottom right).
In Figure 2, we study the impact of varying num-
bers of KAN layers on precision, recall, and mean Av-
erage Precision (mAP) in CNN architectures. The re-
sults consistently demonstrate that while integrating
KAN layers initially improves these metrics, increas-
ing the number of KAN layers does not lead to further
significant enhancements.
In Table 6, we examine the performance of KAN
on both the MNIST and EMNIST datasets using two
approaches: (1) KAN and CNN-KAN are tested di-
rectly and compared with MLP and CNN-MLP. (2)
We apply a transformation on the MNIST and EM-
NIST datasets using a method that introduces non-
linear distortions. The process begins by setting up
a grid of control points across the image to guide how
it will deform. A parameter controls the intensity of
these deformations, dictating the extent to which each
control point shifts. Through mathematical functions,
we adjust the positions of these control points across
the image, creating a smooth, varying distortion ef-
fect. Finally, we generate a transformed version of the
image based on these adjustments, resulting in an out-
put that differs visually from the original image (Fig-
ure 3).
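A minimal sketch of such a transformation is given below: random displacements at a coarse grid of control points are bilinearly upsampled into a smooth dense flow and applied with `grid_sample`. The grid size and intensity values are illustrative, and the exact deformation functions used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def grid_distort(img, grid_size=4, intensity=0.1):
    """Warp a (C, H, W) image with a smooth displacement field driven by
    a coarse grid of control points; `intensity` scales the shifts."""
    _, h, w = img.shape
    # Random displacements at the control points (normalized coordinates).
    disp = (torch.rand(1, 2, grid_size, grid_size) - 0.5) * 2 * intensity
    # Bilinear upsampling yields a smooth, spatially varying flow field.
    flow = F.interpolate(disp, size=(h, w), mode="bilinear", align_corners=True)
    # Identity sampling grid over [-1, 1] x [-1, 1].
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0)  # (1, H, W, 2)
    grid = grid + flow.permute(0, 2, 3, 1)             # displace sample positions
    return F.grid_sample(img.unsqueeze(0), grid,
                         align_corners=True).squeeze(0)
```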
4.3 Computational Complexity
We examine the computational complexity of Multi-
Layer Perceptrons (MLPs) and Kolmogorov-Arnold
Networks (KANs) in a variety of image classification
tasks. MLPs are noted for their straightforward structure, with a time complexity of $O(N^2 L)$, where $N$ is the number of neurons per layer and $L$ is the number of layers. Conversely, KANs use non-linear univariate functions, specifically B-splines, making them more expressive but also more complex, with a complexity of $O(N^2 L G)$, where $G$ is the number of spline control points.

Table 3: Performance metrics comparison of CNN-KAN, CNN-MLP, and CNN-KAN-MLP across various datasets.

Dataset       | Model       | Precision | Recall | mAP  | Score
CIFAR-10      | CNN-KAN     | 0.71      | 0.70   | 0.69 | +6%
CIFAR-10      | CNN-MLP     | 0.67      | 0.67   | 0.63 |
CIFAR-10      | CNN-KAN-MLP | 0.69      | 0.70   | 0.69 |
CIFAR-100     | CNN-KAN     | 0.41      | 0.39   | 0.31 | +5%
CIFAR-100     | CNN-MLP     | 0.38      | 0.37   | 0.26 |
CIFAR-100     | CNN-KAN-MLP | 0.38      | 0.37   | 0.29 |
Caltech-256   | CNN-KAN     | 0.21      | 0.21   | 0.19 | +3%
Caltech-256   | CNN-MLP     | 0.21      | 0.19   | 0.16 |
Caltech-256   | CNN-KAN-MLP | 0.19      | 0.18   | 0.15 |
Caltech-101   | CNN-KAN     | 0.51      | 0.48   | 0.48 | +7%
Caltech-101   | CNN-MLP     | 0.50      | 0.43   | 0.41 |
Caltech-101   | CNN-KAN-MLP | 0.42      | 0.39   | 0.35 |
STL-10        | CNN-KAN     | 0.53      | 0.55   | 0.46 | +3%
STL-10        | CNN-MLP     | 0.51      | 0.50   | 0.43 |
STL-10        | CNN-KAN-MLP | 0.35      | 0.33   | 0.31 |
OxfordIIITPet | CNN-KAN     | 0.37      | 0.35   | 0.33 | +2%
OxfordIIITPet | CNN-MLP     | 0.28      | 0.45   | 0.31 |
OxfordIIITPet | CNN-KAN-MLP | 0.35      | 0.33   | 0.31 |
DTD           | CNN-KAN     | 0.25      | 0.46   | 0.31 | +1%
DTD           | CNN-MLP     | 0.29      | 0.18   | 0.24 |
DTD           | CNN-KAN-MLP | 0.29      | 0.24   | 0.28 |
SVHN          | CNN-KAN     | 0.87      | 0.86   | 0.83 | +5%
SVHN          | CNN-MLP     | 0.85      | 0.85   | 0.81 |
SVHN          | CNN-KAN-MLP | 0.86      | 0.86   | 0.82 |
Places365     | CNN-KAN     | 0.45      | 0.55   | 0.47 |
Places365     | CNN-MLP     | 0.55      | 0.52   | 0.52 | +5%
Places365     | CNN-KAN-MLP | 0.53      | 0.50   | 0.50 |

Figure 3: Top: Original image. Bottom: Transformed image.
For instance, using the CIFAR-10 dataset, which includes 60,000 images across 10 categories, and assuming a fixed architecture with $N = 256$, $L = 2$, and $G = 5$, the complexity for a KAN is $O(655{,}360)$ compared to $O(131{,}072)$ for an MLP. This illustrates the trade-off: KANs offer greater pattern-recognition flexibility but at the cost of higher computational demands, while MLPs remain simpler and less computationally intensive.
In Table 7, we present the learnable parameters for KAN and MLP layers across various datasets. The total learnable parameters for the KAN layer are calculated using the formula $(d_{\text{in}} \times d_{\text{out}}) \times (G + K + 3) + d_{\text{out}}$, where $G$ represents the number of spline intervals and $K$ is the degree of the polynomial basis. In contrast, the MLP layer employs the formula $(d_{\text{in}} \times d_{\text{out}}) + d_{\text{out}}$. This illustrates the KAN layer's higher complexity compared to the MLP layer, particularly when dealing with high-dimensional data.
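These formulas are easy to verify in a few lines; with G = 5 and K = 2 (so G + K + 3 = 10) they reproduce the MNIST and CIFAR-10 rows of Table 7.

```python
def kan_params(d_in, d_out, G=5, K=2):
    """KAN layer parameters: (d_in * d_out) * (G + K + 3) + d_out."""
    return d_in * d_out * (G + K + 3) + d_out

def mlp_params(d_in, d_out):
    """Linear (MLP) layer parameters: d_in * d_out + d_out."""
    return d_in * d_out + d_out

print(kan_params(784, 10), mlp_params(784, 10))    # 78410 7850   (MNIST)
print(kan_params(3072, 10), mlp_params(3072, 10))  # 307210 30730 (CIFAR-10)
```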
4.4 Time Performance
Table 8 and Figure 4 report the training time for var-
ious models: MLP, KAN, and combinations of CNN
with MLP and KAN on different datasets. The results show that MLP and CNN-MLP models are generally faster than the KAN and CNN-KAN models; for example, the CNN-MLP training time is approximately 89.1% of the CNN-KAN time on MNIST and 76.2% on CIFAR-10.
The analysis of computational time highlights the
trade-off between model complexity and training du-
ration. While KAN and CNN-KAN models require
more time to train due to their increased complexity,
they offer superior performance in terms of precision,
recall, and mAP. This trade-off is crucial for appli-
cations where higher accuracy justifies the additional
computational cost.
Table 4: Performance comparison of KANs vs MLPs for ResNet50 and ResNet101 across various datasets.

Arch.     | Model | Dataset       | P    | R    | mAP  | Score (%)
ResNet50  | KAN   | CIFAR-10      | 0.83 | 0.83 | 0.84 | +12%
ResNet50  | MLP   | CIFAR-10      | 0.77 | 0.74 | 0.72 |
ResNet50  | KAN   | CIFAR-100     | 0.64 | 0.61 | 0.55 | +9%
ResNet50  | MLP   | CIFAR-100     | 0.66 | 0.61 | 0.46 |
ResNet50  | KAN   | Caltech-101   | 0.91 | 0.91 | 0.86 | +9%
ResNet50  | MLP   | Caltech-101   | 0.89 | 0.83 | 0.77 |
ResNet50  | KAN   | Caltech-256   | 0.82 | 0.78 | 0.74 | +4%
ResNet50  | MLP   | Caltech-256   | 0.78 | 0.61 | 0.70 |
ResNet50  | KAN   | STL-10        | 0.79 | 0.89 | 0.94 | +1%
ResNet50  | MLP   | STL-10        | 0.81 | 0.88 | 0.89 |
ResNet50  | KAN   | OxfordIIITPet | 0.71 | 0.77 | 0.74 | +3%
ResNet50  | MLP   | OxfordIIITPet | 0.71 | 0.72 | 0.71 |
ResNet50  | KAN   | DTD           | 0.68 | 0.66 | 0.64 | +2%
ResNet50  | MLP   | DTD           | 0.59 | 0.63 | 0.62 |
ResNet50  | KAN   | SVHN          | 0.93 | 0.90 | 0.97 | +3%
ResNet50  | MLP   | SVHN          | 0.95 | 0.74 | 0.94 |
ResNet50  | KAN   | Places365     | 0.78 | 0.68 | 0.79 |
ResNet50  | MLP   | Places365     | 0.82 | 0.77 | 0.83 | +4%
ResNet101 | KAN   | CIFAR-10      | 0.86 | 0.85 | 0.85 |
ResNet101 | MLP   | CIFAR-10      | 0.89 | 0.89 | 0.92 | +7%
ResNet101 | KAN   | CIFAR-100     | 0.67 | 0.64 | 0.58 | +10%
ResNet101 | MLP   | CIFAR-100     | 0.65 | 0.61 | 0.48 |
ResNet101 | KAN   | Caltech-101   | 0.92 | 0.90 | 0.86 | +12%
ResNet101 | MLP   | Caltech-101   | 0.82 | 0.77 | 0.74 |
ResNet101 | KAN   | Caltech-256   | 0.82 | 0.77 | 0.74 | +12%
ResNet101 | MLP   | Caltech-256   | 0.68 | 0.61 | 0.62 |
ResNet101 | KAN   | STL-10        | 0.93 | 0.96 | 0.96 | +1%
ResNet101 | MLP   | STL-10        | 0.93 | 0.93 | 0.95 |
ResNet101 | KAN   | OxfordIIITPet | 0.74 | 0.81 | 0.75 | +4%
ResNet101 | MLP   | OxfordIIITPet | 0.76 | 0.72 | 0.71 |
ResNet101 | KAN   | DTD           | 0.55 | 0.72 | 0.67 | +2%
ResNet101 | MLP   | DTD           | 0.45 | 0.66 | 0.65 |
ResNet101 | KAN   | SVHN          | 0.99 | 0.94 | 0.98 |
ResNet101 | MLP   | SVHN          | 0.96 | 0.85 | 0.95 | +3%
ResNet101 | KAN   | Places365     | 0.75 | 0.73 | 0.74 | -
ResNet101 | MLP   | Places365     | 0.77 | 0.79 | 0.77 | +3%
5 DISCUSSION
5.1 Detailed Analysis
The increased training time for KAN and CNN-KAN
models can be attributed to the additional computa-
tions required for the B-spline functions used in KAN
layers. These functions provide greater flexibility and
expressiveness, allowing the models to capture more
complex patterns in the data. However, this comes at
the cost of increased computational overhead.
MNIST and EMNIST. For the MNIST and EM-
NIST datasets, the CNN-KAN models take signifi-
cantly longer to train compared to CNN-MLP mod-
els. This is due to the high dimensionality of the in-
put data and the complexity of the KAN layers. De-
spite the longer training times, the CNN-KAN mod-
els achieve better performance metrics, making them
suitable for applications where accuracy is more crit-
ical than training speed.
Table 5: Performance metrics comparison of KANs vs MLPs for VGG19 and VGG16 across various datasets.

Arch. | Model | Dataset       | P    | R    | mAP  | Score (%)
VGG19 | KAN   | CIFAR-10      | 0.85 | 0.85 | 0.80 | +1%
VGG19 | MLP   | CIFAR-10      | 0.80 | 0.80 | 0.79 |
VGG19 | KAN   | CIFAR-100     | 0.59 | 0.56 | 0.45 |
VGG19 | MLP   | CIFAR-100     | 0.60 | 0.59 | 0.51 | +6%
VGG19 | KAN   | Caltech-101   | 0.84 | 0.83 | 0.85 | +8%
VGG19 | MLP   | Caltech-101   | 0.89 | 0.87 | 0.77 |
VGG19 | KAN   | Caltech-256   | 0.65 | 0.60 | 0.65 | +6%
VGG19 | MLP   | Caltech-256   | 0.67 | 0.65 | 0.59 |
VGG19 | KAN   | OxfordIIITPet | 0.88 | 0.88 | 0.86 | +3%
VGG19 | MLP   | OxfordIIITPet | 0.86 | 0.86 | 0.84 |
VGG19 | KAN   | STL-10        | 0.91 | 0.89 | 0.93 | +1%
VGG19 | MLP   | STL-10        | 0.93 | 0.93 | 0.92 |
VGG19 | KAN   | DTD           | 0.59 | 0.59 | 0.52 | +1%
VGG19 | MLP   | DTD           | 0.59 | 0.59 | 0.51 |
VGG19 | KAN   | SVHN          | 0.88 | 0.91 | 0.89 | +2%
VGG19 | MLP   | SVHN          | 0.86 | 0.87 | 0.85 |
VGG19 | KAN   | Places365     | 0.71 | 0.73 | 0.75 |
VGG19 | MLP   | Places365     | 0.75 | 0.80 | 0.78 | +5%
VGG16 | KAN   | CIFAR-10      | 0.84 | 0.84 | 0.79 | +1%
VGG16 | MLP   | CIFAR-10      | 0.81 | 0.81 | 0.78 |
VGG16 | KAN   | CIFAR-100     | 0.61 | 0.56 | 0.45 |
VGG16 | MLP   | CIFAR-100     | 0.61 | 0.58 | 0.50 | +5%
VGG16 | KAN   | Caltech-101   | 0.84 | 0.81 | 0.83 | +8%
VGG16 | MLP   | Caltech-101   | 0.87 | 0.85 | 0.75 |
VGG16 | KAN   | Caltech-256   | 0.65 | 0.59 | 0.61 | +9%
VGG16 | MLP   | Caltech-256   | 0.77 | 0.76 | 0.52 |
VGG16 | KAN   | OxfordIIITPet | 0.89 | 0.85 | 0.85 | +3%
VGG16 | MLP   | OxfordIIITPet | 0.87 | 0.86 | 0.84 |
VGG16 | KAN   | STL-10        | 0.93 | 0.93 | 0.96 | +1%
VGG16 | MLP   | STL-10        | 0.93 | 0.93 | 0.95 |
VGG16 | KAN   | DTD           | 0.59 | 0.59 | 0.52 | +1%
VGG16 | MLP   | DTD           | 0.59 | 0.59 | 0.51 |
VGG16 | KAN   | SVHN          | 0.98 | 0.95 | 0.96 | +2%
VGG16 | MLP   | SVHN          | 0.91 | 0.91 | 0.93 |
VGG16 | KAN   | Places365     | 0.81 | 0.82 | 0.82 |
VGG16 | MLP   | Places365     | 0.83 | 0.85 | 0.85 | +3%
CIFAR-10. The CIFAR-10 dataset also shows a no-
ticeable increase in training time for CNN-KAN mod-
els compared to CNN-MLP models. The additional
time is justified by the improved precision, recall, and
mAP scores, demonstrating the effectiveness of KAN
layers in enhancing model performance.
5.2 Implications for Real-World
Applications
In real-world applications, the choice between MLP,
KAN, CNN-MLP, and CNN-KAN models depends
on the specific requirements of the task. For appli-
cations where quick training times are essential, MLP
and CNN-MLP models may be preferred. However,
for tasks that demand high accuracy and robustness,
the additional training time required for KAN and
CNN-KAN models is a worthwhile investment.
Table 6: Performance metrics comparison of KANs vs MLPs on the MNIST and EMNIST datasets.

Dataset  | Model   | Precision | Recall | mAP  | Score
MNIST    | MLP     | 0.97      | 0.97   | 0.95 |
MNIST    | KAN     | 0.97      | 0.97   | 0.99 | +5%
MNIST    | CNN-MLP | 0.98      | 0.98   | 0.98 |
MNIST    | CNN-KAN | 0.98      | 0.98   | 0.99 | +1%
EMNIST   | MLP     | 0.84      | 0.83   | 0.73 |
EMNIST   | KAN     | 0.82      | 0.82   | 0.78 | +5%
EMNIST   | CNN-MLP | 0.82      | 0.82   | 0.78 |
EMNIST   | CNN-KAN | 0.87      | 0.86   | 0.83 | +5%
T-MNIST  | MLP     | 0.97      | 0.97   | 0.96 |
T-MNIST  | KAN     | 0.97      | 0.97   | 0.99 | +3%
T-MNIST  | CNN-MLP | 0.98      | 0.98   | 0.98 |
T-MNIST  | CNN-KAN | 0.99      | 0.99   | 0.99 | +1%
T-EMNIST | MLP     | 0.83      | 0.82   | 0.74 |
T-EMNIST | KAN     | 0.83      | 0.83   | 0.79 | +5%
T-EMNIST | CNN-MLP | 0.84      | 0.84   | 0.80 |
T-EMNIST | CNN-KAN | 0.86      | 0.86   | 0.82 | +2%
Table 7: Learnable parameters for KAN and MLP layers across different datasets.

Dataset       | Input dim               | Output dim | KAN Parameters | MLP Parameters
MNIST         | 28 × 28 = 784           | 10         | 78,410         | 7,850
CIFAR-10      | 32 × 32 × 3 = 3,072     | 10         | 307,210        | 30,730
CIFAR-100     | 32 × 32 × 3 = 3,072     | 100        | 3,072,100      | 307,300
Caltech-101   | 224 × 224 × 3 = 150,528 | 101        | 151,058,981    | 15,105,989
Caltech-256   | 224 × 224 × 3 = 150,528 | 256        | 385,353,536    | 38,553,984
STL-10        | 96 × 96 × 3 = 27,648    | 10         | 276,490        | 276,490
OxfordIIITPet | 224 × 224 × 3 = 150,528 | 37         | 55,695,197     | 5,569,633
DTD           | 224 × 224 × 3 = 150,528 | 47         | 70,748,607     | 7,074,463
SVHN          | 32 × 32 × 3 = 3,072     | 10         | 307,210        | 30,730
iNaturalist   | 224 × 224 × 3 = 150,528 | 10         | 24,084,490     | 1,505,290
Places365     | 224 × 224 × 3 = 150,528 | 365        | 549,427,205    | 54,942,245
Table 8: Mean computational time (in seconds) for training CNN-(MLP/KAN).

Dataset  | Model   | Mean | Confidence interval
MNIST    | CNN-MLP | 288  | (275, 300)
MNIST    | CNN-KAN | 323  | (309, 336)
EMNIST   | CNN-MLP | 414  | (400, 428)
EMNIST   | CNN-KAN | 527  | (511, 543)
CIFAR-10 | CNN-MLP | 193  | (183, 203)
CIFAR-10 | CNN-KAN | 253  | (242, 264)
5.3 Future Work
Future research could focus on optimizing the train-
ing process for KAN layers to reduce computational
time without compromising performance. Techniques
such as parallel processing, more efficient algorithms
for B-spline computations, and hardware acceleration could be explored to make KAN models more feasible for time-sensitive applications.

Figure 4: Mean computational time (in seconds) for training MLP and KAN.

To summa-
rize, while KAN and CNN-KAN models require more
computational resources and longer training times,
their superior performance in terms of accuracy and
robustness makes them valuable for complex image
classification tasks. The trade-off between computa-
tional complexity and model performance should be
carefully considered based on the specific needs of
the application.
6 CONCLUSION
In this paper, we introduced a novel classification
approach that integrates Kolmogorov-Arnold Net-
works (KANs) with Convolutional Neural Networks
(CNNs), termed CNN-KAN. This innovative archi-
tecture harnesses the powerful feature extraction ca-
pabilities of CNNs and the sophisticated modeling of
complex relationships provided by KANs. Our exten-
sive evaluations across multiple datasets demonstrate
that CNN-KAN consistently outperforms traditional
CNN architectures in terms of accuracy, precision,
and recall.
We also explored the application of pretrained
models, showing that the proposed CNN-KAN mod-
els not only enhance efficiency but also improve ro-
bustness. The experimental results clearly indicate
that the integration of KAN layers into CNN architec-
tures leads to significant performance gains in diverse
image classification tasks.
In summary, the ConvKAN approach represents a
promising advancement in the field of computer vi-
sion, offering a robust and efficient solution for com-
plex image classification challenges. This work paves
the way for future research to further optimize and
expand the capabilities of CNN-KAN models.
REFERENCES
Amosa, T. I., Sebastian, P., Izhar, L. I., Ibrahim, O., Ayinla,
L. S., Bahashwan, A. A., Bala, A., and Samaila, Y. A.
(2023). Multi-camera multi-object tracking: a review
of current trends and future advances. Neurocomput-
ing, 552:126558.
Balazevic, I., Steiner, D., Parthasarathy, N., Arandjelović, R., and Henaff, O. (2024). Towards in-context scene understanding. Advances in Neural Information Processing Systems, 36.
Chen, X., Peng, H., Wang, D., Lu, H., and Hu, H. (2023).
Seqtrack: Sequence to sequence learning for visual
object tracking. In Proceedings of the IEEE/CVF con-
ference on computer vision and pattern recognition,
pages 14572–14581.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., et al. (2020). An image is
worth 16x16 words: Transformers for image recogni-
tion at scale. arXiv preprint arXiv:2010.11929.
Fan, D.-P., Ji, G.-P., Xu, P., Cheng, M.-M., Sakaridis, C.,
and Van Gool, L. (2023). Advances in deep concealed
scene understanding. Visual Intelligence, 1(1):16.
Genet, R. and Inzirillo, H. (2024). A temporal kolmogorov-
arnold transformer for time series forecasting. arXiv
preprint arXiv:2406.02486.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Jmour, N., Zayen, S., and Abdelkrim, A. (2018). Convolutional neural networks for image classification. In 2018 International Conference on Advanced Systems and Electric Technologies (IC_ASET), pages 397–402. IEEE.
Kim, H. E., Cosa-Linan, A., Santhanam, N., Jannesari, M.,
Maros, M. E., and Ganslandt, T. (2022). Transfer
learning for medical image classification: a literature
review. BMC medical imaging, 22(1):69.
Koonce, B. and Koonce, B. (2021). Vgg network. Con-
volutional Neural Networks with Swift for Tensorflow:
Image Recognition and Dataset Categorization, pages
35–50.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
agenet classification with deep convolutional neural
networks. In Advances in Neural Information Pro-
cessing Systems, volume 25, pages 1097–1105.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998).
Gradient-based learning applied to document recogni-
tion. Proceedings of the IEEE, 86(11):2278–2324.
Liu, Z., Wang, Y., Vaidya, S., Ruehle, F., Halverson, J., Soljačić, M., Hou, T. Y., and Tegmark, M. (2024). KAN: Kolmogorov-Arnold networks. arXiv preprint arXiv:2404.19756.
Mathis, A., Mamidanna, P., Cury, K. M., Abe, T., Murthy,
V. N., Mathis, M. W., and Bethge, M. (2022).
Deeplabcut: markerless pose estimation of user-
defined body parts with deep learning. Nature Neu-
roscience, 21:1546–1726.
Meinhardt, T., Kirillov, A., Leal-Taixe, L., and Feichten-
hofer, C. (2022). Trackformer: Multi-object tracking
with transformers. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recogni-
tion, pages 8844–8854.
Nath, S. S., Mishra, G., Kar, J., Chakraborty, S., and Dey,
N. (2014). A survey of image classification meth-
ods and techniques. In 2014 International conference
on control, instrumentation, communication and com-
putational technologies (ICCICCT), pages 554–557.
IEEE.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and
Chen, L.-C. (2018). Mobilenetv2: Inverted residu-
als and linear bottlenecks. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recogni-
tion, pages 4510–4520.
Shi, J.-C., Wang, M., Duan, H.-B., and Guan, S.-H.
(2024). Language embedded 3d gaussians for open-
vocabulary scene understanding. In Proceedings of
the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 5333–5343.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
novich, A. (2015). Going deeper with convolutions.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 1–9.
Tan, M. and Le, Q. V. (2019). Efficientnet: Rethink-
ing model scaling for convolutional neural networks.
In International Conference on Machine Learning,
pages 6105–6114. PMLR.
Wang, Y., Sun, J., Bai, J., Anitescu, C., Eshaghi, M. S.,
Zhuang, X., Rabczuk, T., and Liu, Y. (2024). Kol-
mogorov arnold informed neural network: A physics-
informed deep learning framework for solving pdes
based on kolmogorov arnold networks. arXiv preprint
arXiv:2406.11045.
Wotton, J. and Gunes, H. (2020). Deep learning for real-
time robust facial expression recognition. IEEE Trans-
actions on Affective Computing, 11(3):532–541.
Yu, H. and Xiong (2022). Scratch-aid, a deep learning-
based system for automatic detection of mouse
scratching behavior with high accuracy. eLife,
11:e84042.
Zhang, Q. (2022). A novel resnet101 model based on dense
dilated convolution for image classification. SN Ap-
plied Sciences, 4:1–13.