A Regression Deep Learning Approach for Fashion Compatibility
Luís Silva¹, Ivan Gomes¹, C. Mendes Araújo², Tiago Cepeda¹, Francisco Oliveira¹ and João Oliveira¹
¹Department of Digital Transition, CITEVE, Centro Tecnológico das Indústrias Têxtil e do Vestuário de Portugal, V. N. Famalicão, Portugal
²CMAT, Centro de Matemática and Departamento de Matemática, Universidade do Minho, Braga, Portugal
Keywords: Visual Search, Deep Learning, Outfit, BiLSTM, CNN, Compatibility Learning, Similarity Learning, Transformer.
Abstract: In the ever-evolving world of fashion, building the perfect outfit can be a challenge. We propose a fashion recommendation system, which we call Visual Search, that uses computer vision and deep learning to deliver a coordinated set of fashion recommendations. It looks at photos of incomplete outfits, recognizes the existing items, and suggests the most compatible missing piece. At the heart of our system lies a compatibility model, built from a Convolutional Neural Network and a bidirectional Long Short-Term Memory network, that generates a complementary missing piece. To complete the recommendation process, we incorporate a similarity model based on a Vision Transformer. This model compares the generated image to the catalog items, selecting the one that most closely matches it in terms of visual features.
1 INTRODUCTION
The phenomenal rise of e-commerce has revolution-
ized traditional distribution channels, unleashing a
wave of innovative business models that have funda-
mentally reshaped the retail landscape (Xiao et al.,
2019). However, this transformative digital revolu-
tion has not yet permeated the entire spectrum of
businesses, particularly within the traditional sector,
where many enterprises continue to lag behind in
terms of online visibility and engagement with the
digital marketplace. The COVID-19 pandemic, a
far more recent phenomenon, has had a notably pro-
found impact. The outbreak led to a precipitous de-
cline of sales in traditional retail stores, while world-
wide online sales of clothing and textiles surged to
unprecedented heights (Çiçek and Muzaffar, 2021),
compelling several companies to re-evaluate their di-
rect online sales strategies.
The digital realm is poised to become the primary
driver of growth for the fashion industry, present-
ing a wealth of opportunities for textile companies to
thrive.
Embracing this digital shift would enable the sector to
complement its traditional distribution channels with
a direct-to-consumer approach, fostering stronger
customer connections and brand loyalty. Simultane-
ously, the Business-to-Business segment should not
be overlooked, as it represents a significant source of
revenue and growth potential. Regardless of whether
the focus is on Business-to-Consumer or Business-to-
Business, many of the underlying challenges remain
the same, particularly in creating a seamless and per-
sonalized digital experience for both consumer and
business customers. In fashion analysis, visual com-
patibility refers to the extent to which clothing items
complement each other visually across different categories. For
instance, the compatibility between a “suit” and “ox-
fords” is typically higher than with “trainers”.
Visual Search emerges as a groundbreaking
pipeline of deep learning models for fashion recom-
mendation, ushering in a new era of innovation that
transforms the way individuals curate their sartorial
identities.
It comprises four key modules: the first handles image preprocessing, performing the necessary manipulations; the second, utilizing a compatibility model, assesses features against a fashion dataset to pinpoint the ideal clothing item; a third module finds real-world analogs; and the fourth validates predictions.
Users can upload multiple photos of their current out-
fit and the service will swiftly offer suggestions from the store's catalog for the missing clothing item, ensuring overall outfit coordination; alternatively, they can simply upload an image of the desired piece and find the most similar item in the store's catalog. This enhances the
overall user experience by facilitating the discovery of
items that closely align with their styling preferences.
Our main purpose with Visual Search is to op-
timize the shopping experience by functioning as a
recommendation system through sophisticated image
processing, compatibility and similarity analysis.
Recognizing each user’s distinct style fingerprint,
Visual Search tailors recommendations to the current
wardrobe. This personalized approach ensures that
users receive relevant and useful suggestions, enhanc-
ing their overall shopping experience. In the realm of
online shopping, Visual Search eliminates guesswork
and frustration. By providing tailored recommenda-
tions, the system simplifies the search for the perfect
outfit, reducing the time spent sifting through count-
less options. This heightened convenience fosters
user satisfaction and encourages repeat purchases.
The paper is organized into six sections. The second section explores related work, elucidating existing research that served as inspiration for this study. Section three presents the methodology: it describes the datasets used and details the architecture and complexity of the two main models, the compatibility model and the similarity model, expounding upon their respective architectures and the mathematical formulas behind them, while also explaining the method employed for performance evaluation. The fourth section showcases the results obtained from the implementation of each model, offering insights into the implications derived from the findings and highlighting the enhancements achieved through the proposed methodologies. In the fifth section, Conclusion, the accomplishments of the study are reviewed alongside expectations for future research. Lastly, the sixth section serves as an acknowledgment of the contributions and support received.
2 RELATED WORK
We explore various strands of research closely
associated with our methodology.
Recommendation Systems in Fashion. Various
methodologies have been proposed for suggesting
fashion items (Hwangbo et al., 2018; Yethindra and
Deepak, 2021; Bellini et al., 2023). (Hwangbo et al.,
2018) introduced a recommendation system that
compiles data from online shopping mall databases, gathering purchase history (offline) and click history (online) to feed a so-called "K-RecSys" model that takes these parameters into consideration. (Yethindra and Deepak, 2021) provide personalized clothing recommendations for men using logistic regression
classification and semantic similarity computation
through fashion ontology. (Bellini et al., 2023)
introduced a recommendation system tailored for
fashion retail shops. It employs a multi-clustering
approach, considering items and users’ profiles
across both online and physical stores. By leveraging
mining techniques, the system predicts the purchase
behavior of newly acquired customers.
Visual Compatibility Extraction. In this field
(Yin et al., 2019) proposed a fashion compatibil-
ity knowledge learning method that incorporates
visual compatibility relationships as well as style
information using a Convolutional Neural Network
(CNN); the 'convolutional' part refers to the use of convolutional layers, which apply convolution operations to detect and extract features from input data. (Han et al., 2017) employed a Bidirectional Long Short-Term Memory (BiLSTM) network to capture the compatibility relationships of fashion items by considering an outfit as a sequence from top to bottom and then accessories, with the images in the collection as individual time steps; the LSTM is named Long Short-Term Memory because of its ability to capture and retain long-term dependencies in data while handling short-term information through memory cells. (Revanur et al., 2021) used a semi-supervised learning approach that leverages a large unlabeled fashion corpus to create pseudo positive and
negative outfits on the fly during training. For each
labeled outfit in a training batch, a pseudo-outfit is
obtained by matching each item in the labeled outfit
with unlabeled items. More recently, (Jing et al.,
2023) delved into a fashion compatibility modeling
approach with a category-aware multimodal attention
network, termed FCM-CMAN. In the present paper, the
focus is on the visual compatibility of entire outfits,
where items in a fashion collection are expected to
exhibit similar styles, forming a cohesive and stylish
composition. To achieve this, a BiLSTM model
is employed to discern compatibility relationships
within outfits, capturing the dependencies among
various fashion items. This approach goes further by using a visual-semantic embedding. This capability
enhances individual item recommendations by under-
standing and revealing their relationships within the
given context.
Similarity in Fashion. The focus is on identifying items that are similar to those shown.
This entails uncovering apparel pieces that share
common visual attributes or style elements, providing
customers with recommendations that align closely
with their preferences (Dong et al., 2021; Gao et al.,
2020; Manandhar et al., 2018). (Dong et al., 2021)
introduce an Attribute-Specific Embedding Network
to predict fine-grained fashion similarity by jointly
learning multiple attribute-specific embeddings.
(Gao et al., 2020) propose a novel graph reasoning
network (GRNet) on a similarity pyramid, which
learns similarities between a query and a gallery cloth
by using both initial pairwise multi-scale feature rep-
resentations and matching propagation for unaligned
representations. (Manandhar et al., 2018) introduced
a new attribute-guided metric learning (AGML) with
multitask CNN that jointly learns fashion attributes
and image embeddings while taking category and
brand information into account.
3 APPROACH/METHODS
In our work, we use this concept of similarity to en-
hance our compatibility learning. After generating a
fashion item that is compatible within a given context,
we use similarity metrics to identify real-world items
that closely resemble our generated prediction.
A CNN approach was employed to capture and
extract features from images, specifically focusing on
generating an unseen image from a given contextual
representation, in this case, an outfit represented as
a sequence. Subsequently, we used a BiLSTM for
fashion compatibility modeling, which processes the
sequential nature of the outfit to generate the final im-
age. Following the generation process, a pre-trained
model was employed to identify the most similar real
image to our generated one, completing the compre-
hensive workflow of our approach.
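To make this workflow concrete, the sketch below outlines the pipeline in plain Python, with each stage passed in as a callable so the concrete models of Sections 3.2 and 3.3 can be plugged in; the function and parameter names are our own illustrative choices, not an interface published with this work.

```python
# A minimal sketch of the Visual Search pipeline described above
# (hypothetical names; a sketch, not the system's actual implementation).
from typing import Callable, List, Sequence
import numpy as np

def visual_search(
    outfit_images: Sequence[np.ndarray],      # photos of the incomplete outfit
    catalog_images: Sequence[np.ndarray],     # the store's catalog
    preprocess: Callable[[np.ndarray], np.ndarray],
    generate_missing_item: Callable[[List[np.ndarray]], np.ndarray],  # CNN + BiLSTM (Sec. 3.2)
    embed: Callable[[np.ndarray], np.ndarray],                        # ViT feature extractor (Sec. 3.3)
) -> int:
    """Return the index of the catalog item most similar to the generated image."""
    sequence = [preprocess(img) for img in outfit_images]
    generated = generate_missing_item(sequence)           # compatibility model output

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    query = embed(generated)
    scores = [cosine(query, embed(c)) for c in catalog_images]
    return int(np.argmax(scores))                         # closest real-world item
```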
3.1 Dataset
The Cleaned Maryland dataset was developed by the
Fashion Team at the Laboratory for Artificial Intelligence, and it is a clean version of the Maryland Polyvore dataset. It
was also used in (Han et al., 2017; Zou et al., 2022).
Polyvore, a widely used fashion website, serves as a
platform where users share outfits, providing informa-
tion including images, descriptions, likes, hashtags,
and more. The Cleaned Maryland dataset comprises
21,889 outfits carefully extracted from Polyvore, hav-
ing the fashion items been re-organized into 20 cate-
gories. For the purpose of our specific study, however,
we adopted a more focused approach. Instead of uti-
lizing all 20 categories, we narrowed down our selec-
tion to a more streamlined set of four key categories:
‘top‘, ‘bottom‘, ‘other‘, and ‘feet‘. This reduction was
a deliberate choice aimed at simplifying the dataset
and transforming it into a sequence of items, aligning
with the learning capabilities of our BiLSTM model.
As a result of this, the outfits in our dataset have been
condensed to a more manageable number of 1356.
One notable feature of the dataset is the exclusion of
background information. By removing unnecessary
background details, the dataset minimizes extraneous
noise. An example of an outfit in these categories is shown in Figure 1.
Figure 1: Compatible Outfit from Maryland dataset.
In addition to the evaluation on this dataset, testing is extended to another dataset named 'Community Pictures'. This dataset comprises 5000 images across 20 categories and was collaboratively constructed
by the community, with members contributing images
of their clothing items.
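As an illustration of this preprocessing step, the sketch below maps fine-grained labels onto the four coarse categories and builds the fixed-order sequence fed to the BiLSTM; the specific label-to-category mapping and the item record layout are assumptions made for the example, not the dataset's actual schema.

```python
# A minimal sketch of reducing an outfit to the four-category sequence
# used by the compatibility model (mapping and record fields are assumed).
from typing import Dict, List, Optional

SEQUENCE_ORDER = ["top", "bottom", "other", "feet"]

# Hypothetical mapping from fine-grained labels to the four coarse categories.
COARSE_CATEGORY: Dict[str, str] = {
    "t-shirt": "top", "shirt": "top", "sweater": "top",
    "jeans": "bottom", "skirt": "bottom", "trousers": "bottom",
    "bag": "other", "hat": "other", "jewellery": "other",
    "sneakers": "feet", "boots": "feet", "oxfords": "feet",
}

def outfit_to_sequence(items: List[Dict[str, str]]) -> Optional[List[str]]:
    """Turn a list of {'category': ..., 'image_path': ...} items into a
    fixed-order sequence of image paths, or None if a slot is missing."""
    slots: Dict[str, str] = {}
    for item in items:
        coarse = COARSE_CATEGORY.get(item["category"])
        if coarse is not None and coarse not in slots:
            slots[coarse] = item["image_path"]
    if set(slots) != set(SEQUENCE_ORDER):
        return None  # outfits without all four slots are discarded
    return [slots[c] for c in SEQUENCE_ORDER]
```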
3.2 Fashion Outfit Compatibility Model
Our compatibility model initiates with a CNN as a
feature extractor. The primary goal of the CNN is to
capture visual features from input images, with a spe-
cific focus on identifying and extracting key features
crucial for understanding the outfit sequence, deter-
mining their importance before they are fed into the BiLSTM. The CNN model is extended to operate on
each image in the sequence independently. The fea-
ture maps acquired from individual images are subse-
quently either combined or further processed to cap-
ture temporal dependencies.
Let $I_t$ represent the $t$-th image in the input sequence, and $F(I_t)$ be the feature map obtained after passing $I_t$ through the CNN. The convolutional operation is as follows:

$$F(I_t)_{i,j,k} = \sigma\!\left(\sum_{m}\sum_{n}\sum_{p} W_{m,n,p,k} \cdot I_{t,\,i+m,\,j+n,\,p} + b_k\right)$$

where $F(I_t)_{i,j,k}$ is the activation at position $(i, j)$ in the $k$-th feature map for the $t$-th image, $W_{m,n,p,k}$ is the weight of the $k$-th filter at position $(m, n)$ in channel $p$, $I_{t,\,i+m,\,j+n,\,p}$ denotes the pixel intensity at position $(i+m, j+n)$ in channel $p$ of the $t$-th image, $b_k$ is the bias term for the $k$-th filter, and σ is the ReLU activation function.
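A minimal PyTorch sketch of such a per-image feature extractor is given below; the layer sizes, pooling choices, and output dimension are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# A minimal sketch of the per-image CNN feature extractor described above
# (sizes are illustrative assumptions, not the paper's exact configuration).
import torch
import torch.nn as nn

class OutfitCNN(nn.Module):
    def __init__(self, feature_dim: int = 512):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),  # W * I + b, then sigma = ReLU
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),                                # one vector per image
        )
        self.proj = nn.Linear(64, feature_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, T, 3, H, W); each image in the sequence is encoded independently
        b, t, c, h, w = images.shape
        maps = self.features(images.view(b * t, c, h, w)).flatten(1)
        return self.proj(maps).view(b, t, -1)                       # (batch, T, feature_dim)
```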
Figure 2: Model architecture.

The output of the CNN, $F(I_t)$, serves as input to the BiLSTM model. The BiLSTM then processes the
sequence of feature maps to generate predictions for
the next item in the outfit sequence. The loss function $E_f(F; \Theta_f)$ is computed based on the negative log probability of observing the next item $x_{t+1}$ given the previous items $x_1, \dots, x_t$. The LSTM equations involve the use of the ReLU activation function:

$$E_f(F; \Theta_f) = -\frac{1}{N} \sum_{t=1}^{N} \log \Pr(x_{t+1} \mid x_1, \dots, x_t; \Theta_f)$$

where $\Theta_f$ denotes the model parameters of the forward prediction model, and $\Pr(\cdot)$, computed by the LSTM model, is the probability of observing $x_{t+1}$ conditioned on the previous inputs.
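One plausible way to realize this loss in PyTorch is sketched below, scoring each hidden state against a set of candidate item features and averaging the negative log-probability of the true next item; the dot-product scoring is an assumption for illustration, since the exact scoring scheme is not detailed here.

```python
# A minimal sketch of the forward next-item loss E_f under the stated assumption
# that Pr(x_{t+1} | x_1..x_t) is a softmax over dot-product scores.
import torch
import torch.nn.functional as F

def forward_prediction_loss(hidden: torch.Tensor, item_feats: torch.Tensor) -> torch.Tensor:
    """hidden:     (N, d) hidden states predicting the next item,
       item_feats: (N, d) features of the true next items, also used as the candidate set."""
    logits = hidden @ item_feats.t()              # (N, N) scores of each state against all candidates
    targets = torch.arange(hidden.size(0))        # the t-th state's true next item is candidate t
    return F.cross_entropy(logits, targets)       # -(1/N) * sum_t log Pr(x_{t+1} | x_1..x_t)
```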
More specifically, the LSTM model maps an input sequence $\{x_1, x_2, \dots, x_N\}$ to outputs via a sequence of hidden states by computing the following equations recursively from $t = 1$ to $t = N$:

$$\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i),\\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \sigma(W_{xc} x_t + W_{hc} h_{t-1} + b_c),\\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o),\\
h_t &= o_t \odot \sigma(c_t),
\end{aligned}$$

where $x_t$ is the input at time $t$, $h_t$ is the hidden state at time $t$, $c_t$ is the cell state at time $t$, $i_t$, $f_t$, $o_t$ are the input, forget, and output gates' activations, σ is the ReLU activation function, and $W$ and $b$ are the weight matrices and bias vectors for the different gates in the LSTM.
Forward LSTM:

$$\overrightarrow{H}_t = \mathrm{LSTM}_{\mathrm{forward}}(X_t, \overrightarrow{H}_{t-1})$$

where $X_t$ is the input at time $t$ and $\overrightarrow{H}_t$ is the forward hidden state at time $t$. The forward LSTM processes the input sequence from the beginning to the end, capturing dependencies in the forward direction.

Backward LSTM:

$$\overleftarrow{H}_t = \mathrm{LSTM}_{\mathrm{backward}}(X_t, \overleftarrow{H}_{t+1})$$

where $X_t$ is the input at time $t$ and $\overleftarrow{H}_t$ is the backward hidden state at time $t$. The backward LSTM processes the input sequence from the end to the beginning, capturing dependencies in the backward direction.

Final hidden state:

$$H_t = [\overrightarrow{H}_t; \overleftarrow{H}_t]$$

The final hidden state at time $t$ is the concatenation of the forward and backward hidden states. This combined representation captures both forward and backward context, enabling the model to understand the sequential dependencies within the input sequence.
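The sketch below shows how the bidirectional recurrence and the concatenated hidden state can be wired up in PyTorch on top of precomputed per-item CNN features; the hidden size, the linear decoding head, and the use of the last time step's state are illustrative assumptions, not the exact configuration of our model.

```python
# A minimal sketch of the BiLSTM stage of the compatibility model, operating on
# per-item CNN features (e.g. the output of a feature extractor such as the
# OutfitCNN sketch above).
import torch
import torch.nn as nn

class OutfitBiLSTM(nn.Module):
    def __init__(self, feature_dim: int = 512, hidden_dim: int = 256, out_dim: int = 512):
        super().__init__()
        self.bilstm = nn.LSTM(feature_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.decoder = nn.Linear(2 * hidden_dim, out_dim)   # acts on [H_forward ; H_backward]

    def forward(self, item_features: torch.Tensor) -> torch.Tensor:
        # item_features: (batch, T, feature_dim) -- items already present in the outfit
        states, _ = self.bilstm(item_features)               # (batch, T, 2*hidden_dim)
        return self.decoder(states[:, -1])                   # prediction for the missing item

# e.g. OutfitBiLSTM()(torch.randn(2, 3, 512)) -> tensor of shape (2, 512)
```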
3.3 Fashion Similarity Model
Our similarity approach takes the compatibility model output, our generated image, and identifies the most visually similar image from a pre-loaded set.
This process is orchestrated with a pre-trained Vision
Transformer (ViT) as the backbone for similarity as-
sessment. The vit_base_patch16_224.mae model follows the ViT architecture, which represents a departure from CNNs. Vision Transformers use a transformer-based
architecture, originally designed for natural language
processing tasks, to process image data. Notably,
the model employs a patch-based approach, breaking
down the input image into smaller patches and treat-
ing them as a sequence for processing.
Parameters (M): 85.8
GMACs: 17.6
Activations (M): 23.9
Trained Images: 224x224
The number of parameters (85.8M) indicates the
model’s complexity, while the GMACs (17.6) re-
flect its computational workload in terms of Giga
Multiply-Accumulates. Additionally, the activations
(23.9M) represent the total number of activations dur-
ing inference, offering insights into the model’s com-
putational efficiency. Lastly, the size of trained im-
ages (224x224) underscores the scale at which the
model operates.
One of the features of vit_base_patch16_224.mae is
its pretraining methodology. The Self-Supervised
Masked Autoencoder (MAE) technique involves
training the model to predict masked-out portions of
the input image.
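A minimal sketch of this similarity step is given below, loading the pretrained vit_base_patch16_224.mae backbone through the timm library and ranking catalog images by cosine similarity to the generated image; the use of cosine similarity for ranking at this stage is an assumption made for illustration.

```python
# A minimal sketch of the ViT-based similarity step, assuming cosine similarity
# is used to rank catalog images against the generated image.
from typing import List
import timm
import torch
import torch.nn.functional as F
from PIL import Image

model = timm.create_model("vit_base_patch16_224.mae", pretrained=True, num_classes=0)
model.eval()
config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**config, is_training=False)

@torch.no_grad()
def embed(image: Image.Image) -> torch.Tensor:
    return model(transform(image).unsqueeze(0)).squeeze(0)   # pooled ViT feature vector

def most_similar(generated: Image.Image, catalog: List[Image.Image]) -> int:
    query = embed(generated)
    scores = torch.stack([F.cosine_similarity(query, embed(c), dim=0) for c in catalog])
    return int(scores.argmax())                               # index of the closest catalog item
```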
3.4 Performance Evaluation
In evaluating the accuracy of our approach, we em-
ploy the Universal Image Quality Index (UIQI) as the
metric used for this task. The UIQI measures the sim-
ilarity and quality of images, enabling an evaluation
beyond a mere binary comparison (Wang and Bovik,
2002).
The UIQI is calculated using the following formula:

$$\mathrm{UIQI} = \frac{4 \cdot \mathrm{cov}(I_1, I_2) \cdot \mathrm{mean}(I_1) \cdot \mathrm{mean}(I_2)}{\bigl(\mathrm{var}(I_1) + \mathrm{var}(I_2)\bigr) \cdot \bigl(\mathrm{mean}^2(I_1) + \mathrm{mean}^2(I_2)\bigr)}$$

where $I_1$ and $I_2$ are the intensity values of the two images being compared.
Approach and Threshold
To evaluate the accuracy of our approach, we consider
the most similar image identified by our similarity
model. We calculate the UIQI between this identified image and the test image that originally belonged to the outfit. Instead of relying on a simplistic comparison
of identical images, we set a threshold for the UIQI.
If the UIQI value exceeds this threshold, we classify
the prediction as accurate.
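A minimal NumPy sketch of the UIQI computation and the thresholded accuracy check follows; the threshold value of 0.7 is only an illustrative placeholder (Section 4.2 reports an example where a UIQI of 0.71 exceeds our threshold, but the exact threshold is not specified here).

```python
# A minimal sketch of the UIQI formula above plus the thresholded accuracy
# check; 0.7 is an assumed placeholder threshold.
import numpy as np

def uiqi(img1: np.ndarray, img2: np.ndarray) -> float:
    """Universal Image Quality Index over intensity arrays of equal shape."""
    x = img1.astype(np.float64).ravel()
    y = img2.astype(np.float64).ravel()
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float(4 * cov * mx * my / ((vx + vy) * (mx**2 + my**2) + 1e-12))

def is_accurate(recommended: np.ndarray, ground_truth: np.ndarray, threshold: float = 0.7) -> bool:
    # A prediction counts as accurate when the recommended item is similar
    # enough to the item that actually completed the outfit.
    return uiqi(recommended, ground_truth) >= threshold
```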
Why UIQI over Image Similarity
The choice of UIQI over a straightforward image sim-
ilarity check is motivated by the nature of fashion.
Similar images may exhibit subtle differences, such as
distinct patterns or textures, yet still be suitable for the
same outfit. For instance, two white t-shirts might dif-
fer in design but share compatibility within the con-
text of an outfit. The UIQI accounts for this, providing
a more refined assessment of image quality and simi-
larity (Wang and Bovik, 2002). This approach allows
us to capture the essence of fashion compatibility be-
yond strict visual identity.
4 RESULTS
In this section, we discuss the benchmark and con-
clusions aimed at enhancing the accuracy of our ap-
proach. We systematically compare results across
various variables, offering insights into the effective-
ness of our method.
4.1 Compatibility
Table 1: Per-category accuracy of the CNN + BiLSTM model on the Maryland and Community Pictures (CP) datasets.
CNN + BiLSTM Maryland CP
top 91.18% 42.64%
bottom 84.24% 49.49%
other 79.56% 82.64%
feet 77.06% 83.74%
Based on the evaluation results, it is evident that the
model’s performance varies significantly across the
two datasets, Maryland and Community Pictures, and
their respective categories.
In the Maryland dataset, where outfits are made
with purpose and exhibit a logical and consistent com-
position, the model consistently demonstrates accu-
racy ranging from 75% to 90% across all four cat-
egories (‘top‘, ‘bottom‘, ‘other‘, and ‘feet‘). This
consistent performance suggests that the model ef-
fectively generalizes to the structured composition of
the Maryland dataset. The variation in accuracy from
'top' and 'bottom' to 'feet' and 'other' could stem from the fact that 'feet' and 'other' have more variability in
their shapes and colors.
On the other hand, in the Community Pictures
dataset, which consists of randomly assembled outfits with no predetermined logic (it was built from arbitrary images contributed by the community), the model exhibits notable differences in
accuracy among the categories. Remarkably high ac-
curacy is observed for ‘other‘ and ‘feet‘, where in-
terpolation was applied due to a reduced number of
images in these categories. The interpolation led
to the model encountering the same images multiple
times, enabling it to recognize and classify these cat-
egories effectively, resulting in high accuracy. How-
ever, the categories ‘top‘ and ‘bottom‘ show consid-
erably lower accuracy. This can be attributed to the
inherent randomness and lack of consistency in the
outfit compositions within the Community Pictures
dataset. As outfits were generated by mixing clothes
without a structured approach, the model struggled to
find meaningful patterns in these categories, leading
to low accuracy.
We present the results obtained from our exper-
imental evaluation of different compositional models
applied to the task at hand. Table 2 showcases the per-
formance metrics in terms of accuracy for each model
configuration.
VGG is a CNN architecture introduced by the Vi-
sual Geometry Group at the University of Oxford.
The key characteristic of VGG is its simplicity and
uniform architecture. The network consists of multi-
ple layers with small receptive fields. Proposed by (He et al., 2016), ResNet introduces the concept of residual
learning, where shortcut connections allow the net-
work to learn the residual functions, making it easier
to train extremely deep networks.
Table 2: Models Benchmark Accuracy Results.
Comp Model CNN ResNet ResNet+CNN VGG VGG+CNN
LSTM 81.33% 80.29% 78.56% 80.29% 78.45%
BiLSTM 91.18% 85.65% 80.92% 84.56% 79.37%
No LSTM 65.29% 64.56% 62.11% 62.35% 61.98%
The LSTM model performs significantly better
than the No LSTM counterpart across all the com-
pared models. This suggests that the inclusion of
LSTM layers in the model architecture contributes
positively to the overall accuracy. Models with-
out LSTM exhibit lower accuracy compared to their
LSTM counterparts. This indicates that the tempo-
ral dependencies captured by LSTM layers are bene-
ficial for this task. BiLSTM consistently outperforms
LSTM, achieving the highest accuracy among all the
models. This indicates that bidirectional temporal
context is crucial for the task, as BiLSTM considers
information from both past and future time steps (Han
et al., 2017). The use of pre-trained models (ResNet
and VGG) did not lead to an improvement in accu-
racy compared to the standalone CNN model. This
unexpected result suggests that, in this particular task,
the transfer learning process may not have effectively
leveraged the pre-learned features from these archi-
tectures. It could be related to the domain of the pre-
trained models, the specifics of the transfer learning
process, or the characteristics of the dataset.
4.2 Similarity
To determine the best-performing model among a
batch of 600 backbones, an evaluation process was
undertaken. The evaluation aimed to assess each
model's ability to identify the five most similar im-
ages from a large batch of diverse data. For each eval-
uation instance, a single image was provided as input
to the model, which was then tasked with retrieving
the five most similar images from the given dataset.
This process was repeated for multiple images. To quantify the performance of each model, the Universal Image Quality Index (UIQI) was employed to compute
the mean similarity index for the top five retrieved im-
ages across all input images. The decision to evalu-
ate the top five images comes from the nature of rec-
ommending items for outfits. In fashion, there of-
ten exist multiple suitable clothing options that can
complement a particular look. By considering the
top five recommendations, the evaluation process ac-
knowledges the variability and subjective nature of
those recommendations. Based on the evaluation re-
sults of the similarity models, the performance varies
across different resolutions and models. The table be-
low presents the UIQI mean values for the top four
similarity models at various image resolutions.
Figure 3: Generated Image as input of Similarity model.
Figure 3 shows the generated image being compared with real-world images: the similarity model compares it against a database of real images, evaluating features, textures, colors, and overall composition to determine the closest match. In this example, the model determined that the most similar image had a UIQI of 0.71 with respect to ours. Since this value exceeds our threshold, the prediction is labeled as accurate.
Table 3: Mean UIQI of the top four similarity models at different input resolutions.
Resolution vit tresnet convnext efficientnet
128x128 0.8729 0.8646 0.8634 0.8603
256x256 0.8744 0.8678 0.8659 0.8625
364x364 0.8749 0.8691 0.8662 0.8634
512x512 0.8749 0.8693 0.8674 0.8640
While higher resolutions generally lead to better
UIQI mean values (Wang and Bovik, 2002), it’s es-
sential to note that the improvements in model accu-
racy may not always justify the increased complexity
and time consumption associated with handling those
highest resolutions. Notably, the observations indi-
cate that the 512x512 resolution tends to exhibit the
highest UIQI mean, underscoring the influence of resolution on model performance; however, the improvement is not large enough to justify the use of higher resolutions, since it would slow down the pipeline considerably.
5 CONCLUSION
This paper presents an approach to fashion compatibility learning based on simultaneously training a CNN and a BiLSTM model. The method treats an outfit as a se-
quence, with each item serving as a time step. What is new about this approach is that the model's task is to generate an image by predicting pixel values, i.e., a regression. Additionally, a similarity model is
used on top of this generated image to recommend
a real-world image. To validate the accuracy of this
approach, the Universal Image Quality Index is em-
ployed on the recommended image and the actual im-
age. This metric serves as a measure of how closely
our recommendation aligns with real outfits.
The outcomes showcase the effectiveness of this
approach in learning the compatibility of fashion out-
fits. Recognizing that fashion compatibility is sub-
jective, varying from one individual to another, our
future research will explore modeling user-specific
compatibility and style preferences; our goal is to con-
struct a more personalized system that caters to indi-
vidual tastes and preferences, thereby enhancing the
overall user experience.
Expanding our approach to include keywords such as style, mood, artist, material, texture, and brand would add depth to our model's decision-making process, enriching the latent space of extracted features. This direction holds great promise in the realm of generative AI, particularly in image generation. Our aspirations also go beyond outfits, because compatibility is not exclusive to fashion: just as there are compatible pieces in clothing, there are compatible combinations in other industries too. The principles guiding our recommendation system can therefore be applied across various sectors, whether in home goods, technology, or beyond.
ACKNOWLEDGEMENTS
TexP@CT Mobilizing Pact - Innovation Pact for
the Digitalization of Textiles and Clothing, project
no. 61, to Reinforce the Competitiveness and
Resilience of the National Economy, financed
through Component 5 - Capitalization and Business
Innovation, of the European funds allocated to
Portugal by the Recovery and Resilience Plan (PRR),
under the European Union’s (EU) Recovery and
Resilience Mechanism, as part of Next Generation
EU (https://recuperarportugal.gov.pt/ ), for the period
2021 - 2026. CMAT: partially supported by FCT - 'Fundação para a Ciência e a Tecnologia', within projects UIDP/00013/2020 and UIDB/00013/2020 (DOI 10.54499/UIDP/00013/2020 and DOI 10.54499/UIDB/00013/2020).
REFERENCES
Bellini, P., Palesi, L. A. I., Nesi, P., and Pantaleo, G.
(2023). Multi clustering recommendation system for
fashion retail. Multimedia Tools and Applications,
82(7):9989–10016.
Dong, J., Ma, Z., Mao, X., Yang, X., He, Y., Hong, R., and
Ji, S. (2021). Fine-grained fashion similarity predic-
tion by attribute-specific embedding learning. IEEE
Transactions on Image Processing, 30:8410–8425.
Gao, Y., Kuang, Z., Li, G., Luo, P., Chen, Y., Lin, L., and
Zhang, W. (2020). Fashion retrieval via graph reason-
ing networks on a similarity pyramid. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence.
Han, X., Wu, Z., Jiang, Y.-G., and Davis, L. S. (2017).
Learning fashion compatibility with bidirectional
lstms. In Proceedings of the 25th ACM international
conference on Multimedia, pages 1078–1086.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR).
Hwangbo, H., Kim, Y. S., and Cha, K. J. (2018). Rec-
ommendation system development for fashion retail
e-commerce. Electronic Commerce Research and Ap-
plications, 28:94–101.
Jing, P., Cui, K., Guan, W., Nie, L., and Su, Y. (2023).
Category-aware multimodal attention network for
fashion compatibility modeling. IEEE Transactions
on Multimedia.
Manandhar, D., Bastan, M., and Yap, K.-H. (2018). Tiered
deep similarity search for fashion. In Proceedings
of the European Conference on Computer Vision
(ECCV) Workshops.
Revanur, A., Kumar, V., and Sharma, D. (2021). Semi-
supervised visual representation learning for fashion
compatibility. In Proceedings of the 15th ACM Con-
ference on Recommender Systems, pages 463–472.
Wang, Z. and Bovik, A. C. (2002). A universal image qual-
ity index. IEEE Signal Processing Letters, 9(3):81–84.
Xiao, J., Wu, Y., Xie, K., and Hu, Q. (2019). Managing
the e-commerce disruption with it-based innovations:
Insights from strategic renewal perspectives. Informa-
tion & Management, 56(1):122–139.
Yethindra, D. N. and Deepak, G. (2021). A semantic ap-
proach for fashion recommendation using logistic re-
gression and ontologies. In 2021 International Con-
ference on Innovative Computing, Intelligent Commu-
nication and Smart Electrical Systems (ICSES), pages
1–6. IEEE.
Yin, R., Li, K., Lu, J., and Zhang, G. (2019). Enhancing
fashion recommendation with visual compatibility re-
lationship. In The World Wide Web Conference, pages
3434–3440.
Zou, X., Pang, K., Zhang, W., and Wong, W. (2022). How
good is aesthetic ability of a fashion model? In Pro-
ceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 21200–21209.
Çiçek, Y. and Muzaffar, H. (2021). The impact of COVID-19 pandemic crisis on online shopping. AYBU Business Journal, 1(1):16–25.