Vision-Language Models for E-commerce: Detecting Non-Compliant
Product Images in Online Catalogs
Maciej Niemir, Dominika Grajewska and Bartłomiej Nitoń
Łukasiewicz Research Network, Poznań Institute of Technology, 6 Ewarysta Estkowskiego St., Poznań, Poland
{maciej.niemir, dominika.grajewska, bartlomiej.niton}@pit.lukasiewicz.gov.pl
ORCID: 0000-0002-1054-4285 (M. Niemir), 0009-0000-1234-6728 (D. Grajewska), 0000-0003-3306-7650 (B. Nitoń)
Keywords:
Vision-Language Models, Product Image Validation, Image Quality, Product Data Validation, Automated
Validation, Image Classification, Digital Catalog, Product Data Quality, Data Consistency, Online Retail,
Data Management.
Abstract:
This study explores the use of vision-language models (VLMs) for automated validation of product images in e-commerce, aiming to ensure visual consistency and accuracy without the need for extensive data annotation and specialized training. We evaluated two VLMs, LLaVA and Moondream2, to determine their effectiveness in classifying images based on suitability for online display, focusing on aspects such as visibility and representational clarity. Each model was tested with varying textual prompts to assess the impact of query phrasing on predictive accuracy. Moondream2 outperformed LLaVA in both precision and processing speed, making it a more practical solution for large-scale e-commerce applications. Its high specificity and negative predictive value (NPV) highlight its effectiveness in identifying non-compliant images. Our results suggest that VLMs like Moondream2 provide a viable approach to visual validation in e-commerce, offering benefits in scalability and implementation efficiency, particularly where rapid and reliable assessment of product imagery is critical. This research demonstrates the potential of VLMs as effective alternatives to traditional image validation methods, underscoring their role in enhancing the quality of digital catalogs.
1 INTRODUCTION
In today’s digital era, the global e-commerce market
is experiencing rapid expansion, making the exchange
of digital information an essential component of mod-
ern trade. Scholars and industry professionals alike
recognize that maintaining high quality data is a key
challenge for organizations, and poor data quality can
have potentially significant negative effects on busi-
ness operations (Wang and Strong, 1996; Ballou et al.,
2004; Haug et al., 2011). Quality of product data
refers primarily to attributes such as accuracy, com-
pleteness, timeliness, and consistency of information
in online catalogs (Wang and Strong, 1996). Ensur-
ing data quality has become a critical determinant of
success or failure for many enterprises, directly in-
fluencing the efficiency of business transactions (Cao
and Zhang, 2011; Hole et al., 2018).
Product images play one of the most important
roles in ensuring high-quality product data in e-
commerce. Since consumers cannot physically ex-
amine products, they rely heavily on the images pro-
vided, making these visuals a crucial factor in the
decision-making process. However, image quality
issues, such as the use of logos instead of prod-
uct images or product images placed against inap-
propriate backgrounds, can severely undermine cus-
tomer satisfaction and trust, leading to lost sales and
tarnished reputations (Di et al., 2014; Qalati et al.,
2021). Poor image management can also increase
operational costs (Appelbaum et al., 2017; Biryukov,
2020), while the lack of automated solutions to man-
age this process has become increasingly problematic
(Russom, 2011).
The advent of vision language models (VLMs),
such as Moondream2 and LLaVA, offers promis-
ing new tools to address these issues by automating
the detection of inappropriate product images in e-
Commerce platforms. These models are capable of
interpreting both visual and textual prompts, enabling
them to identify instances where product images do
not meet predefined standards. For example, VLMs
can be trained to detect whether an image contains
a company logo instead of the actual product, or if
the product is depicted against a background that de-
viates from the commonly accepted white or neutral
backdrop.
This paper explores the potential of using VLMs
to enhance the quality of product data by ensur-
ing visual consistency across e-commerce platforms.
Specifically, it investigates the capabilities of Moondream2 (https://www.moondream.ai/) and LLaVA (Liu et al., 2023) to detect and
classify non-compliant images, addressing challenges
related to manual data entry and the maintenance of
image quality across various online marketplaces. Us-
ing AI-driven solutions, this research aims to pro-
pose a scalable approach to improving product data
quality, contributing to the broader goal of optimizing
e-commerce platforms for both businesses and con-
sumers.
We begin by presenting an overview of the current
landscape of e-commerce and the critical role of data
quality in product imagery validation. In the Method-
ology section, we outline our approach to evaluat-
ing two specific VLMs, detailing the model setup,
prompt design, and evaluation metrics used to mea-
sure performance. The Experiments and Results sec-
tion discusses the practical tests conducted with each
model on various prompts, providing a detailed com-
parison based on accuracy, precision, and processing
efficiency. Finally, in the Conclusion, we summarize
our findings, discuss the implications of VLMs in e-
Commerce validation, and propose directions for fu-
ture research aimed at enhancing image compliance
accuracy in large-scale online catalogs.
2 RELATED WORK
Recent research emphasizes the critical role of high-
quality product images in e-Commerce, especially as
visual consistency and precision become essential to
foster consumer trust and engagement. Niemir and
Mrugalska (2022) observe that, unlike physical stores,
e-commerce relies on images to convey product at-
tributes, necessitating standards for clarity, resolution,
and object visibility across all product categories.
Muszyński et al. (2022) highlight the importance of
data quality in high-safety industries such as food and
cosmetics, advocating for the use of artificial intelli-
gence in validating both visual and textual attributes.
Their work underscores the need for automated solu-
tions that not only categorize images but also ensure
visual compliance with established e-commerce stan-
dards, suggesting that automated validation and AI
support can significantly enhance industry standards,
such as the Global Data Model, while also facilitating
large-scale data management.
Michalski (2020) examines consumer perception,
demonstrating how the shapes of digital packaging
influence purchase intent. The study indicates that
ergonomic, standard packaging shapes increase cus-
tomer preference, underscoring the need for visually
appealing and consistent presentations to foster posi-
tive shopping experiences. These findings align with
the need for automated quality checks to effectively
manage visual presentation in extensive product cata-
logs.
Ouni et al. (2022) introduced a method of se-
mantic image quality assessment based on Convolu-
tional Neural Networks (CNN) to analyze product im-
ages within the e-commerce context. Their approach,
based on perceptual models, detects common visual
issues, such as poor lighting, color distortions, and
low sharpness, without the need for a reference im-
age. This method, known as Semantic Image Qual-
ity Assessment (SIQA), enables a detailed analysis
of features such as naturalness, readability, and color
consistency—critical to the visual quality of online
products. SIQA focuses primarily on perceptual im-
age quality rather than on verifying conformity with
product-specific category data.
Szymkowski and Niemir (2024) investigated the
use of CNNs and Visual Transformers (VTs) in au-
tomatic classification of products according to GS1
GPC codes, indirectly assessing the degree to which
an image’s depicted object aligns with the expected
category.
While these studies provide valuable insights, a
universal and comprehensive solution for image qual-
ity assessment in e-commerce—encompassing vari-
ous quality issues and offering guidance on the types
of errors encountered—remains lacking. Vision-language models (VLMs), such as those described by
Zhang et al. (2024), have significant potential to fill
this gap. These models employ contrastive learning
to pair images with textual descriptions, enabling pre-
cise categorization even in novel product contexts.
Although current VLM implementations are primar-
ily focused on categorization rather than on detailed
quality control tailored to specific e-commerce stan-
dards, our research demonstrates their potential ap-
plication as image validators.
3 METHODOLOGY
3.1 Our Approach
In response to the limitations of existing Vision-
Language Models (VLMs) that are primarily opti-
mized for categorization rather than comprehensive
quality control, our research takes a more generalized
approach. Rather than relying on highly specific crite-
ria tied to e-commerce standards, we explore the util-
ity of Moondream2 and LLaVA in a flexible valida-
tion framework, focusing on whether an image could
reasonably represent a product suitable for an online
marketplace.
Moondream2 and LLaVA each contribute unique
strengths to this general approach. Moondream2 ex-
cels in object recognition and classification, making it
well-suited for straightforward validation tasks where
accurate identification of a product in the image is es-
sential. LLaVA, meanwhile, provides contextual de-
scriptions and interpretative feedback, supporting sce-
narios where qualitative judgment is needed to deter-
mine if an image’s composition aligns with typical e-
commerce product photos.
In our experiments, we focused on broad prompts
that ask each model to identify whether the content
in a given image could plausibly be used as a prod-
uct image. It is important to emphasize that the in-
put to the models consisted solely of product images
and predefined textual prompts. No additional meta-
data, such as product names or attributes, were uti-
lized during the evaluation. This approach empha-
sizes generalizability and flexibility, enabling us to
evaluate whether an image depicts a product in a way
that aligns with the expectations for online retail with-
out strictly adhering to platform-specific quality stan-
dards.
For our evaluation, we developed and tested sev-
eral dozen prompts for both models. From this set,
we selected six prompts—three for Moondream2 and
three for LLaVA—designed to assess the models’
ability to provide basic yet meaningful information
about image suitability. This approach allowed us to
analyze their effectiveness in general product image
validation. A detailed description of these prompts
and the corresponding results is presented in Section
5, where we evaluate each model’s potential to facil-
itate a generalized, adaptive approach to image val-
idation in e-Commerce. We acknowledge that the
selected prompts are not identical for both models;
however, they were chosen to best reflect the objec-
tive function. The aim was not to compare the models
based on identical prompts but to evaluate their ef-
fectiveness in the process of image validation. This
approach allowed us to focus on the practical utility
of each model in addressing validation tasks.
3.2 Setup
To simulate the computing power available to a
medium-sized company that does not necessarily spe-
cialize in providing AI solutions, we assumed that
the maximum computing power we could use in our
experiments was two NVIDIA GeForce RTX 3090
graphics cards.
The Moondream2 and LLaVA:34b-v1.6 models were used for inference directly after being downloaded from the HuggingFace (https://huggingface.co/) server. Furthermore, it is important to note that the LLaVA model was used in the quantized Q4 version, balancing performance with precision (Gholami et al., 2022).
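For illustration, the snippet below shows one way to run Moondream2 in this configuration. It is a minimal sketch, assuming the vikhyatk/moondream2 checkpoint and the custom encode_image/answer_question interface documented on its HuggingFace model card; the exact revision used in our experiments may differ, and the image path is hypothetical.

```python
# Minimal sketch of Moondream2 inference, assuming the vikhyatk/moondream2
# checkpoint and its documented custom interface (trust_remote_code=True).
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

MODEL_ID = "vikhyatk/moondream2"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

image = Image.open("product.jpg")    # hypothetical catalog image
encoded = model.encode_image(image)  # encode the image once
answer = model.answer_question(
    encoded,
    "Does the photo show the product? Answer yes or no.",
    tokenizer,
)
print(answer)  # e.g., "Yes" or "No"
```

An analogous wrapper can be written for the quantized LLaVA model; we omit that setup here.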
3.3 Evaluation Method
To evaluate our models, we used standard metrics
such as accuracy, precision, recall, and F1 (Powers,
2011). Accuracy allowed us to measure the overall
correctness of classification, indicating the percent-
age of cases where the model correctly identified both
suitable and unsuitable images. Precision referred to
the proportion of images classified by the model as
suitable that genuinely met quality requirements—the
higher the precision, the fewer cases where the model
incorrectly identified low-quality images as suitable.
Recall expressed the model’s ability to correctly iden-
tify all images that were indeed suitable—a higher
recall indicated that the model rarely missed images
meeting quality criteria. The F1 score combined pre-
cision and recall, enabling us to assess the models in
a more balanced manner, particularly when these two
metrics varied.
To further assess the models’ performance in detecting unsuitable images, we incorporated specificity and Negative Predictive Value (NPV). Specificity (Altman and Bland, 1994a) measured the model’s ability to correctly reject images that were indeed unsuitable; the higher the specificity, the more effectively the model identified low-quality images. NPV (Altman and Bland, 1994b) indicated the percentage of images classified as unsuitable that truly did not meet quality standards; a higher NPV denoted greater confidence that images labeled as unsuitable indeed failed to meet the criteria.
Our primary objective was to achieve the highest possible precision, with acceptable levels of NPV and
specificity. This approach minimized the number of
false positive classifications of low-quality images as
suitable, while simultaneously reducing the risk of in-
correctly rejecting suitable images.
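All six metrics reduce to simple ratios over the four confusion-matrix counts. The sketch below records the exact formulas we report, with "suitable image" taken as the positive class.

```python
def validation_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Metrics used in this study; 'suitable image' is the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)    # share of accepted images that are truly suitable
    recall = tp / (tp + fn)       # share of suitable images that were accepted
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)  # share of unsuitable images correctly rejected
    npv = tn / (tn + fn)          # share of rejected images truly unsuitable
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "specificity": specificity, "npv": npv}
```

With a test set skewed toward suitable images, accuracy and recall can remain high even when many unsuitable images slip through, which is why specificity and NPV are reported separately.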
4 DATASETS
The test dataset was sourced from a product catalog
created by the producers of these items (over 60,000
companies). The catalog covers a wide range of prod-
uct categories available online. The data entry process
in the catalog is not centrally supervised, leading to
various types of errors stemming from lack of knowl-
edge or random mistakes, making it an ideal source
for testing data improvement capabilities. Among
the analyzed categories were food products, clothing,
automotive items, consumer electronics, DIY equip-
ment, household chemicals, medical supplies, sta-
tionery, handicrafts, and furniture. From the avail-
able 1.2 million images, a preliminary filtering pro-
cess was applied to ensure they met essential techni-
cal standards, including appropriate size, background
brightness, and background uniformity. Additional
criteria involved assessing the proportion of back-
ground coverage in the image. This was determined
by analyzing the ratio of the product object to the
overall image area. Images where the background oc-
cupied more than 80% of the total area were excluded.
These measures aimed to eliminate images with ex-
cessive background dominance or insufficient focus
on the product, resulting in a more consistent set of
images that better aligned with the visual standards in
e-commerce. Subsequently, a perceptual hash algorithm was used to randomly select images that differed significantly from one another.
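The listing below sketches this pre-filtering under simplifying assumptions: the 80% background-area threshold follows the procedure described above, whereas the near-white background heuristic, the minimum resolution, and the Hamming-distance threshold for the perceptual hash are illustrative choices, not the production values.

```python
# Sketch of the dataset pre-filtering; thresholds other than the 80%
# background-area cut are illustrative assumptions, and the brightness-based
# segmentation is a crude stand-in for the production pipeline.
import random

import imagehash
import numpy as np
from PIL import Image

def background_ratio(img: Image.Image, bg_threshold: int = 240) -> float:
    """Approximate background share as the fraction of near-white pixels."""
    gray = np.asarray(img.convert("L"))
    return float((gray >= bg_threshold).mean())

def keep_image(path: str) -> bool:
    img = Image.open(path)
    if min(img.size) < 500:               # assumed minimum resolution
        return False
    return background_ratio(img) <= 0.80  # exclude background-dominated shots

def deduplicate(paths: list[str], max_dist: int = 8) -> list[str]:
    """Randomly sample images whose perceptual hashes differ sufficiently."""
    pool = list(paths)
    random.shuffle(pool)                  # random selection order
    kept, hashes = [], []
    for p in pool:
        h = imagehash.phash(Image.open(p))
        if all(h - other > max_dist for other in hashes):
            kept.append(p)
            hashes.append(h)
    return kept
```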
Manual annotation of image accuracy was con-
ducted by a trained annotator with experience in e-
commerce product data validation. The annotator fol-
lowed standardized guidelines, including detailed ex-
amples of correct and incorrect product images, to
ensure consistency and reliability across evaluations.
Ambiguous cases were evaluated through consulta-
tion to minimize potential biases. The criteria used
for annotation included factors such as readability of
information, visibility of the main product object, ap-
propriate presentation form, and background neutral-
ity (Niemir and Mrugalska, 2022).
A total of 1,663 unique images were annotated
and assigned to random product categories. Among
them, 174 images were assessed as incorrect and
1,489 as correct. The assessment process focused on
eliminating images that deviated significantly from
online product presentation standards, such as miss-
ing images, substitute packaging (e.g., box graphics),
company logos instead of actual product images, out-
door photos, product presentation suggestions, or la-
bels. Notably, the analysis excluded verifying the
consistency of the product name and category with
its visualization in the image, as the objective was to
conduct a general evaluation of image quality.
During the analysis, it was necessary to exclude certain product categories whose specific presentation style hindered effective verification based on packaging form. In particular, this applied to
categories where the image depicted the product with
a dominant pattern or texture element, which often led
to incorrect interpretation by the model. Problematic
categories included:
– Books, magazines, CDs, DVDs, vinyl records – images are usually only accompanied by covers or labels, making it difficult to assess the full presentation of the product.
– Wall coverings, carpets – images focused on textures or patterns, preventing the identification of the entire product.
– Decorative magnets, stickers, paintings, posters – products were often presented on surfaces or in contexts that could confuse the models.
– Live plants – photos often taken outdoors did not meet the standard requirements for presentation on a neutral background.
– Services – images related to services, such as logos or graphic elements, did not meet the typical criteria for physical products.
These categories required different visual analysis
strategies to avoid classification errors and improve
the accuracy of evaluating image suitability in the
e-commerce context; therefore, they were excluded
from the study. For apparel products, several presen-
tation methods are commonly accepted in online re-
tail. Clothing can be displayed against a white back-
ground, as is typical for other products, hung on a
hanger, or shown on a person, similar to images in
advertising brochures of marketplaces. This practice
is generally permitted as long as the image focuses
on the product, ensuring that the presentation empha-
sizes the clothing item without distracting elements.
However, some e-commerce platforms enforce addi-
tional restrictions on specific clothing categories, par-
ticularly children’s apparel. These restrictions may
include limitations on the use of human models or
stricter requirements for presentation neutrality. As
a result, including apparel products in the validation
pipeline necessitates additional verification steps to
ensure compliance with platform-specific standards.
This becomes especially relevant when contextually
validating the chosen form of presentation. Neverthe-
less, this study employed a simplified validation ap-
proach, prioritizing general usability criteria without
addressing these more detailed considerations.
5 EXPERIMENTS AND RESULTS
The evaluation focused on the capabilities of two AI
vision-language models (VLMs), LLaVA and Moon-
dream2, in classifying product images for their suit-
ability in e-commerce applications. Each model was
tested with three distinct text prompts to examine how
variations in query phrasing influence model predic-
tions. The list of prompts is as follows:
– LLaVA - prompt 1 (llava 1): Can it be a photo for online sales? The product packaging picture does not have to be detailed. It is important that the product or its packaging is visible. The composition of the product does not have to be visible, the photo does not have to be sharp. Return answer in JSON format: {’answer’: [YES/NO], ’explanation’: string}
– LLaVA - prompt 2 (llava 2): Is it a product on a photo? Return answer in JSON format: {’answer’: [YES/NO], ’explanation’: string}
– LLaVA - prompt 3 (llava 3): Verify whether a buyer will understand what they are purchasing based on the provided product name, considering the following assumptions: 1. The buyer is a native Polish speaker. 2. During the shopping process, the buyer only sees the product name. 3. The buyer shops at a store within a specific industry, so they are familiar with industry-specific terms and phrases. 4. The product name may include brand names and manufacturer codes. 5. The product does not necessarily need to have description on it. Return answer in JSON format: {’answer’: [YES/NO], ’explanation’: string}
– Moondream2 - prompt 1 (moondream 1): Does the photo show the product? Answer yes or no.
– Moondream2 - prompt 2 (moondream 2): Is it a photo that shows a product for online sales? Answer yes or no.
– Moondream2 - prompt 3 (moondream 3): Can provided image be an image for an online auction? Answer yes or no.
For both models, the input consisted exclusively
of a product image and a corresponding predefined
textual query. No metadata, such as product names
or descriptive attributes, were included in the experi-
ments. This approach ensured that the evaluation fo-
cused solely on the models’ ability to process visual
and prompt-based inputs without additional contex-
tual information. The Moondream2 model required
a two-step query process: first, to obtain a ”yes” or
”no” answer, and second, to provide an explanation
if the initial response was ”no”, due to limitations in
handling both response types within a single prompt.
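This query logic is sketched below: a two-step exchange for Moondream2 and JSON extraction for LLaVA. The ask callable stands in for whichever inference backend is used, and the follow-up prompt requesting an explanation is an illustrative reconstruction, as its exact wording is not reproduced here.

```python
import json
import re
from typing import Callable, Optional, Tuple

def query_moondream(ask: Callable[[str], str]) -> Tuple[bool, Optional[str]]:
    """Two-step protocol: bare yes/no first, explanation only on 'no'."""
    verdict = ask("Does the photo show the product? Answer yes or no.")
    suitable = verdict.strip().lower().startswith("yes")
    explanation = None
    if not suitable:
        # Illustrative follow-up; the exact wording used in the study may differ.
        explanation = ask("Why is this image unsuitable as a product photo?")
    return suitable, explanation

def parse_llava(raw: str) -> Tuple[bool, str]:
    """Extract the {'answer': ..., 'explanation': ...} object from a reply."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    # Naive quote normalization for the single-quoted format the prompt requests;
    # it breaks on apostrophes inside the explanation and is for illustration only.
    payload = json.loads(match.group(0).replace("'", '"'))
    return payload["answer"].strip().upper() == "YES", payload["explanation"]
```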
A comprehensive comparison of key metrics such as
accuracy, precision, and recall is presented in Table 1.
Table 1: Performance metrics for different prompts. Own
work.
Model Accuracy Precision Recall F1
llava 1 0.90 0.90 0.99 0.94
llava 2 0.54 0.91 0.53 0.67
llava 3 0.23 0.93 0.15 0.26
moon 1 0.92 0.95 0.97 0.96
moon 2 0.75 0.92 0.79 0.85
moon 3 0.50 0.92 0.49 0.64
The best prompt for each model was selected
based on a combination of accuracy and F1 score, as
these metrics provide a balanced view of the model’s
capability to correctly classify suitable images while
minimizing both false positives and false negatives.
In this table, the results indicate that Moon-
dream2 achieved consistently higher accuracy and F1
scores compared to LLaVA, particularly with prompt
”moon 1”, which yielded an accuracy of 0.92 and an
F1 score of 0.96. This prompt demonstrated Moon-
dream2’s strength in maintaining a high level of pre-
cision (0.95) and recall (0.97), making it the most ef-
fective prompt for this model. For LLaVA, ”llava 1”
was identified as the best prompt, with an accuracy of
0.90 and an F1 score of 0.94, showing strong recall
(0.99) and balanced precision (0.90).
By selecting the prompt with the highest com-
bined accuracy and F1 score for each model, we es-
tablished a basis for more in-depth analysis. The sub-
sequent evaluation, which included NPV and speci-
ficity metrics, further refined our understanding of
each model’s ability to correctly reject non-compliant
images, as detailed in Table 2 and represented in the
confusion matrices in Figures 1 and 2.
The confusion matrices in Figures 1 and 2 illustrate
the performance of the LLaVA and Moondream2
models in classifying product images for e-commerce
suitability, with the best-performing prompt for each
model. In these matrices, the ”True” labels represent
the actual classifications (1 for suitable and 0 for un-
suitable images), while the ”Predicted” labels show
the model’s classification outcomes.
Table 2: Performance metrics for different models. Own
work.
Metric LLaVA MoonDream2
Accuracy 89.84% 93.87%
Precision 90.34% 95.06%
Recall 99.26% 98.25%
NPV 59.26% 79.03%
Specificity 9.20% 56.32%
Figure 1: Confusion matrix for the MoonDream2 model and the best-performing prompt (Prompt 1). Own work.
Figure 2: Confusion matrix for the LLaVA model and the best-performing prompt (Prompt 1). Own work.
In the confusion matrix of the LLaVA model, we observe a relatively high number of false positives, where unsuitable images are classified as suitable. Meanwhile, Moondream2 demonstrates a better balance,
with fewer false positives and a higher true negative
count, reflecting better specificity and Negative Pre-
dictive Value (NPV).
A critical aspect of data validation is the ability to provide a clear explanation of why an image has been flagged as non-compliant. Both models evaluated in this study, Moondream2 and LLaVA, are capable of generating comprehensive assessments of the images, explaining the reasons for their suitability or unsuitability for e-commerce use. Table 3 presents examples of such responses generated by the Moondream2 model. Due to space limitations in this publication, we have included only the responses from Moondream2; LLaVA’s evaluations were substantively similar in content but tended to be more detailed and linguistically refined.
Each entry in Table 3 includes an image and a brief explanation generated by the model, clarifying why the image may not meet e-commerce standards. For example, the first image shows a symbolic graphic rather than an actual product, which Moondream2 notes as lacking the necessary detail to represent a sellable item. Similarly, other examples highlight issues such as inadequate color representation, absence of the product itself, or an overly distracting background. These explanations demonstrate Moondream2’s ability to provide nuanced assessments, focusing on factors such as visibility, color accuracy, and background relevance, which are essential for high-quality product representation in online catalogs.

Table 3: Descriptions generated by the Moondream2 model. Own work.
1. The photo does not showcase the product, but rather a minimalist symbol – a green circle with a white heart in the center, symbolizing love, care, and affection. The absence of additional details allows the focus to be on the symbolic meaning of the heart, often used in marketing materials and visual communication.
2. The image does not effectively depict the product as it is in black and white, making it difficult to convey colors and details accurately. In this case, the tortilla and wheat tortilla are key elements, and a color photograph would better represent their appearance.
3. The photo does not depict the product, as it only shows its label, focusing on information about ingredients, nutritional values, and key characteristics. The lack of an actual image of the product means it is not visible in the photo.
4. The product is not visible in the image because the woman is holding it in front of her face, making her and her actions the main focus of the picture. The background, which shows the ocean, adds aesthetic value to the image, but it is not essential for showcasing the product, which is the KABISA tea.
In addition to examining the models’ capabilities
for data validation, we evaluated the response time
of each model by calculating the average process-
ing time for 1,000 samples. For scalable solutions
managing large product catalogs, processing speed is
critical, especially in a dynamic e-commerce environ-
ment.
Table 4 presents the average response times for both models, with Moondream2 achieving a significantly faster processing time of 0.34 seconds per image compared to LLaVA’s 3.37 seconds. This disparity in response times highlights Moondream2’s efficiency, making it better suited for large-scale e-commerce applications where rapid processing of extensive product catalogs is necessary to maintain operational effectiveness and user experience.

Table 4: Performance time for different models. Own work.
Model Avg time
LLaVA 3.37 s
MoonDream2 0.34 s
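The averages in Table 4 were obtained in the straightforward way sketched below; validate is a placeholder name for a full image-plus-prompt round trip through either model, not part of the actual experiment code.

```python
import time
from typing import Callable, Sequence

def mean_latency(validate: Callable[[str], object],
                 image_paths: Sequence[str], n: int = 1000) -> float:
    """Average wall-clock seconds per image over the first n samples."""
    sample = image_paths[:n]
    start = time.perf_counter()
    for path in sample:
        validate(path)  # full image + prompt round trip
    return (time.perf_counter() - start) / len(sample)
```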
6 CONCLUSIONS
The results of the conducted study demonstrated that
Vision-Language Models (VLMs) can serve as effec-
tive tools for supporting additional validation of prod-
uct images in terms of their suitability. This repre-
sents an innovative approach that has not been achiev-
able with traditional validation methods. Notably,
these models do not require fine-tuning to achieve
satisfactory results, which significantly reduces costs
and implementation time. Another advantage lies
in the universality of image validation, meaning that
building a validator does not necessitate referencing
specific products within particular categories.
Although the primary goal of this study was not to
benchmark the models, the results further indicated
that the smaller Moondream2 model outperformed
LLaVA in both validation accuracy and operational
efficiency. Moondream2 achieved higher scores in
key metrics, such as NPV and specificity, highlighting
its greater effectiveness in identifying non-compliant
product images for this task. Despite LLaVA’s advanced language capabilities and ability to generate
detailed descriptions, it proved less efficient in tasks
focused on visual validation in the e-commerce con-
text. The findings thus indicate that employing large
multimodal models is not necessary to achieve satis-
factory results. Moreover, the analysis conducted on
a dedicated infrastructure showed that Moondream2
processes queries more than ten times faster than its
larger counterpart, making it a more efficient solution
for large product catalogs.
6.1 Future Work
This study did not incorporate metadata, such as prod-
uct names, unit counts, packaging types, and other
attributes, which could enable significantly more de-
tailed validation of image content. In future research,
we plan to integrate such data, allowing for the val-
idation of consistency between attributes and more
precise visual assessment of image compliance with
requirements.
Further efforts will also focus on expanding the
variety of prompts tailored to specific product cate-
gories. A comprehensive query system is planned,
where appropriate prompts will be assigned to in-
dividual nodes or branches of a product category
graph. This approach could significantly enhance
validation effectiveness, particularly for product cat-
egories with specific graphical presentation require-
ments (e.g., clothing, media with covers, wall cov-
erings, magnets, stickers, posters, graphics, pho-
tographs, services, live plants).
Another direction of development involves divid-
ing image validation into separate problem categories.
Individual validation stages could include analyzing
background quality (uniformity, brightness, and the
ratio of the background to the area occupied by the
product), detecting the presence of company logos
only, verifying the number of products in an image,
and assessing proper product presentation, especially
in the clothing category.
Additionally, we are considering incorporating alternative models such as BLIP, CLIP, and GPT to compare their effectiveness through benchmarking. We
also plan to develop a dedicated multimodal model
specialized in product data validation, including eval-
uating the accuracy of product images. This solu-
tion could leverage methods and mechanisms used in
training the LLaVA model, allowing for a tailored ap-
proach to the specific requirements of e-commerce.
6.2 Limitations
The conducted research highlights the promising po-
tential of Vision-Language Models (VLMs) for au-
tomating image validation in e-commerce. However,
certain limitations of this approach should be ac-
knowledged.
One limitation is the accuracy of the validation it-
self, which is not error-free. Consequently, the valida-
tor may work well as a module for suggesting qual-
ity improvements and flagging image defects for cat-
alog administrators, but its use for definitively reject-
ing defective images requires detailed testing before
implementation in a specific catalog. Similarly, any update to the model version in a production environment should be preceded by renewed testing, as results may vary.
Additionally, VLMs have significant computa-
tional requirements, which may pose a barrier for
smaller enterprises. Further extensions of these mod-
els to accommodate industry-specific requirements
could negatively affect their performance unless they
are optimized for computational load and infrastruc-
ture accessibility.
REFERENCES
Altman, D. G. and Bland, J. M. (1994a). Statistics notes:
Diagnostic tests 1: sensitivity and specificity. BMJ,
308(6943):1552.
Altman, D. G. and Bland, J. M. (1994b). Statistics notes:
Diagnostic tests 2: predictive values. BMJ, 309:102.
Appelbaum, D., Kogan, A., Vasarhelyi, M., and Yan,
Z. (2017). Impact of business analytics and enter-
prise systems on managerial accounting. Interna-
tional Journal of Accounting Information Systems,
25:29–44.
Ballou, D. P., Madnick, S. E., and Wang, R. Y. (2004). Special section: Assuring information quality. Journal of Management Information Systems, 20:9–11.
Biryukov, A. (2020). Data quality as a service. Journal Of
Applied Informatics, 15:120–132.
Cao, M. and Zhang, Q. (2011). Supply chain collabo-
ration: Impact on collaborative advantage and firm
performance. Journal of Operations Management,
29(3):163–180.
Di, W., Sundaresan, N., Piramuthu, R., and Bhardwaj, A.
(2014). Is a picture really worth a thousand words?
- on the role of images in e-commerce. In Proceed-
ings of the 7th ACM international conference on Web
search and data mining, WSDM ’14, page 633–642,
New York, NY, USA. Association for Computing Ma-
chinery.
Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M. W.,
and Keutzer, K. (2022). A Survey of Quantiza-
tion Methods for Efficient Neural Network Inference,
pages 291–326. Chapman and Hall/CRC.
Haug, A., Zachariassen, F., and Van Liempd, D. (2011).
The costs of poor data quality. Journal of Industrial
Engineering and Management, page 168–193.
Hole, Y., Pawar, S., and Bhaskar, M. P. (2018). Service
marketing and quality strategies. Periodicals of Engi-
neering and Natural Sciences (PEN), 6(1):182–196.
Liu, H., Li, C., Wu, Q., and Lee, Y. J. (2023). Visual in-
struction tuning. ArXiv, abs/2304.08485.
Michalski, R. (2020). The role of virtual package shapes in
digital product presentation. In Rebelo, F. and Soares,
M., editors, Advances in Ergonomics in Design, page
24–30, Cham. Springer International Publishing.
Muszyński, K., Niemir, M., and Skwarek, S. (2022).
Searching for ai solutions to improve the quality of
master data affecting consumer safety. In Business
Logistics in Modern Management, page 121–140, Os-
ijek, Croatia. Faculty of Economics in Osijek.
Niemir, M. and Mrugalska, B. (2022). Product data quality
in e-commerce: Key success factors and challenges.
In Production Management and Process Control.
Ouni, S., Kamoun, K., and AlAttas, M. (2022). Se-
mantic Image Quality Assessment Using Conventional
Neural Network for E-Commerce Catalogue Manage-
ment, page 89–113. Springer International Publishing,
Cham.
Powers, D. M. W. (2011). Evaluation: from precision, recall
and f-measure to roc, informedness, markedness and
correlation. ArXiv, abs/2010.16061.
Qalati, S. A., Vela, E. G., Li, W., Dakhan, S. A., Hong Thuy,
T. T., and Merani, S. H. (2021). Effects of perceived
service quality, website quality, and reputation on pur-
chase intention: The mediating and moderating roles
of trust and perceived risk in online shopping. Cogent
Business & Management, 8(1):1869363.
Russom, P. (2011). Big data analytics. TDWI best practices
report, fourth quarter, 19(4):1–34.
Szymkowski, M. and Niemir, M. (2024). Convolutional
neural networks and vision transformers in product
gs1 gpc brick code recognition. In Sheng, B., Bi,
L., Kim, J., Magnenat-Thalmann, N., and Thalmann,
D., editors, Advances in Computer Graphics, Lecture
Notes in Computer Science, page 440–450, Cham.
Springer Nature Switzerland.
Wang, R. Y. and Strong, D. (1996). Beyond accuracy: What
data quality means to data consumers. Journal of
Management Information Systems, 12(4):5–33.
Zhang, J., Huang, J., Jin, S., and Lu, S. (2024).
Vision-language models for vision tasks: A survey.
arXiv:2304.00685.