EasyPortrait: Face Parsing and Portrait Segmentation Dataset
Karina Kvanchiani, Elizaveta Petrova, Karen Efremyan, Alexander Sautin and Alexander Kapitanov
SberDevices, Moscow, Russian Federation
Keywords:
Portrait Segmentation, Face Parsing, Portrait Dataset.
Abstract:
Video conferencing apps have recently improved functionality by incorporating computer vision-based features such as real-time background removal and face beautification. The lack of diversity in existing portrait segmentation and face parsing datasets, particularly regarding head poses, ethnicity, scenes, and video conferencing-specific occlusions, motivated us to develop a new dataset, EasyPortrait, designed to address these tasks simultaneously. It contains 40,000 primarily indoor photos simulating video meeting scenarios, featuring 13,705 unique users and fine-grained segmentation masks divided into 9 classes. Since annotation masks from other datasets were unsuitable for our task, we revised the annotation guidelines, enabling EasyPortrait to handle cases like teeth whitening and skin smoothing. This paper also introduces a pipeline for data mining and high-quality mask annotation through crowdsourcing. The ablation study demonstrated the critical role of data quantity and head pose diversity in EasyPortrait. The cross-dataset evaluation experiments confirmed the best domain generalization ability among portrait segmentation datasets. The proposed dataset and trained models are publicly available*.
1 INTRODUCTION
Video conferencing apps have quickly gained popularity in corporations for online meetings (Sander, 2020) and in daily life to keep in touch with distant relatives [1]. The video conferencing market value worldwide is expected to continue growing in the coming decades [2]. As a result, these apps have been enhanced with various beneficial features, including face beautification, background blur, and noise reduction [3]. Such extensions can improve the user experience by enabling background changing and skin smoothing [4].
Our work focused on incorporating the described features into a video conferencing app. The app should ensure a real-time CPU-based experience on the user's device and produce a highly accurate response. Besides, our system must be robust to the amount of context in images, backgrounds, persons in the frame, their poses, attributes (e.g., race and age), and accessories (headphones, hats, etc.). Finally, improving the users' experience with such functions as teeth whitening and accurate background changing in the case of transparent glasses lenses and uneven hair is preferable.

* https://anonymous.4open.science/r/easyportrait-5109/
[1] https://inlnk.ru/Vo8ydV
[2] https://www.statista.com/statistics/1293045/video-conferencing-market-value-worldwide/
[3] https://www.banuba.com/blog/10-video-conferencing-trends-enhanced-by-face-ar-technology
[4] https://www.businessinsider.com/zoom-video-conferencing-filter-touch-up-work-from-home-coronavirus-2020-3

Figure 1: The face parsing and portrait segmentation annotation examples from the EasyPortrait dataset.
All these requirements impose restrictions on the
solution and training dataset. The system must func-
tion in real-time without delays and produce fine-
grained segmentation masks. The suitable data is required to be 1) heterogeneous in subjects, their head turns, subject-to-camera distances, scenes, and videoconferencing-specific accessories such as eyeglasses and headphones; 2) annotated with the main face parsing classes (“skin”, “brows”, “eyes”, and “lips”) and an extra “teeth” class; and 3) accurately annotated for both tasks simultaneously.
This approach enables training a single model to handle all potential use cases, achieving competitive metrics while saving the limited resources available for model inference (see Table 4 in the supplementary material).
Existing datasets are unsuitable for our purpose
due to the limitations described in Section 2, which
motivated us to create the EasyPortrait dataset. Images were annotated manually with 9 classes according to specially designed rules, which allowed us to cover all described cases. We verified that such data characteristics as the number of samples and their diversity in head poses positively impact the model's robustness and effectiveness (Section 4). The generalization
ability of the training data was also assessed through
cross-dataset evaluation experiments (Section 5).
Our contributions can be summarized as follows:

- We present EasyPortrait, a portrait segmentation and face parsing dataset containing 40,000 pairs of images and segmentation masks from 13,705 individuals in domain-suitable scenes with different head poses and various accessories specific to videoconferencing apps. We considered the importance of ethnic diversity in solving problems based on persons and their facial attributes and collected images from users of various countries and continents.
- We developed a pipeline for gathering and labeling images with fine-grained segmentation masks via crowdsourcing platforms. This approach balances annotation cost and quality, efficiently producing detailed and reliable image masks.
- The experiments demonstrate that EasyPortrait exhibits the best generalization capability regarding mIoU metrics across all portrait segmentation test sets.
2 RELATED WORK
This section discusses existing portrait segmentation
and face parsing datasets separately. We will consider
them from two points of view: 1) the applicability
to the videoconferencing domain and 2) data creation
techniques for each segmentation task and their con-
sequences. We focus solely on image datasets, an-
alyzing and benchmarking segmentation models on
static images to maintain a consistent evaluation envi-
ronment while avoiding video data’s added complex-
ity and computational demands.
2.1 Portrait Segmentation and Matting
Datasets
The portrait segmentation task involves labeling ev-
ery pixel in an image as either “person” or “back-
ground”. Since matting annotations can be reduced to
binary ones, image matting datasets will also be con-
sidered. As videoconferencing apps always take por-
traits for input and not a person entirely, only datasets
with waist-deep people are reviewed. Therefore,
other popular image person segmentation and im-
age matting datasets, e.g., P3M-10k (Li et al., 2021),
PPR10K (Liang et al., 2021), PhotoMatte13K/85 (Lin
et al., 2020), Persons Labeled (per, 2020), are not
described in this paper. We selected EG1800 (Shen
et al., 2016), AiSeg (ais, 2019), FVS (Kuang and Tie,
2021), Face Synthetics (Wood et al., 2021), and Hu-
manSeg14K (Chu et al., 2021) as the most widespread
and predominantly containing photos of people from
the waist up. Table 1 provides the numerical analysis
of reviewed datasets.
Content. Chosen datasets can be divided into
three groups regarding image source: 1) downloaded
from websites, 2) collected manually, and 3) gener-
ated. The last two allow the data authors to determine
content on their own directly. Images of EG1800 and
AiSeg were collected from services like Flickr, which
made their scenes multi-domain. The Face Synthetics dataset was entirely generated, entailing primarily blurred backgrounds, and is thus far from in-the-wild imagery.
The manually collected FVS (Kuang and Tie, 2021) and PP-HumanSeg14K (Chu et al., 2021) portrait segmentation datasets are single-domain with a bias towards videoconferencing. The FVS dataset provides composite images from 10 conference-style green-screen videos and virtual backgrounds. As a result, FVS samples suffer from remnants of the green screen around the person in the frame. The PP-HumanSeg14K dataset includes 23 different conference backgrounds and samples of participants performing actions such as waving hands and drinking water. The provided samples contain one or more labeled people with faces blurred for privacy.
Annotation. Since segmentation mask annotation
is one of the most challenging problems in the com-
puter vision field, data authors prefer automatic meth-
ods. All reviewed datasets except PP-HumanSeg14K
were annotated automatically or using Photoshop (see
Fig. 5 in the supplementary material for visual exam-
ples). Such methods frequently produce coarse masks
that prevent accurate high-frequency parts segmenta-
tion (e.g., hair) one of the main hardships of back-
ground removal in video conferencing.
Table 1: Comparison of portrait segmentation and face parsing datasets. Because of the specifics of the tasks, we indicate the task name for the portrait segmentation datasets and the number of classes for the face parsing datasets. Several datasets include images of different sizes, so a general label is given in the resolution column; 96% of images in EasyPortrait are FullHD (see Section 3.2 for details). We also note the annotation method, which can be important regarding label quality; more detailed information can be found in Section 2. Note that the 20 classes of the FaceOcc dataset include the 19 classes from CelebAMask-HQ.
| Dataset | Samples | Task / Classes | Resolution | Annotation Method |
|---|---|---|---|---|
| EG1800, 2016 (Shen et al., 2016) | 1,800 | segmentation | 600 × 800 | Photoshop |
| FVS, 2018 (Kuang and Tie, 2021) | 3,600 | segmentation | 640 × 360 | chroma-key |
| AISeg, 2018 (ais, 2019) | 34,427 | matting | 600 × 800 | automatically |
| PP-HumanSeg14K, 2021 (Chu et al., 2021) | 14,117 | segmentation | 1280 × 720 | manually |
| The Face Synthetics, 2021 (Wood et al., 2021) | 100,000 | segmentation | 512 × 512 | automatically |
| Helen, 2012 (Le et al., 2012) | 2,330 | 11 | 400 × 400 | automatically |
| LFW-PL, 2013 (Kae et al., 2013) | 2,927 | 3 | 250 × 250 | automatically & refined |
| CelebAMask-HQ, 2019 (Lee et al., 2019) | 30,000 | 19 | 512 × 512 | manually & checked & refined |
| LaPa, 2020 (Liu et al., 2020) | 22,176 | 11 | LR | automatically & refined |
| iBugMask, 2021 (Lin et al., 2021) | 22,866 | 11 | HR | manually |
| The Face Synthetics, 2021 (Wood et al., 2021) | 100,000 | 19 | 512 × 512 | automatically |
| FaceOcc, 2022 (Yin and Chen, 2022) | 30,000 | 20 | 512 × 512 | manually |
| EasyPortrait, 2023 | 40,000 | 9 | FullHD | manually & checked |
2.2 Face Parsing Datasets
The face parsing task aims to assign pixel-level se-
mantic labels for facial images. Generally, face pars-
ing refers to classifying image pixels, such as brows,
eyes, nose, lips, mouth, and skin. We considered sev-
eral widespread face parsing datasets for our purposes
(Table 1 and Fig. 6 in the supplementary material).
Content. The main limitation of existing face
parsing datasets is the low diversity in head poses
and the absence of specific occlusions. Helen’s (Le
et al., 2012) authors obtained the data from other
datasets and websites like Flickr by searching for
“portrait” in various languages to avoid cultural bias.
The CelebAMask-HQ dataset (Lee et al., 2019)
mainly contains front-facing images of celebrities
from Celeba (Liu et al., 2015) with centered heads.
Besides, the faces usually occupy a significant part of
the image. Thus, background information is mainly
discarded. The LaPa dataset (Liu et al., 2020) was
designed based on images from the 300W-LP (Zhu
et al., 2017) and Megaface (Taherkhani et al., 2018)
datasets. Received faces were aligned and mostly
cropped, with limited background information pre-
served. This image collection method was also uti-
lized to create the iBugMask dataset (Lin et al., 2021)
containing samples from Helen’s training set. The
iBugMask authors focused on head pose diversity and
augmented images with a synthetic rotation algorithm
from 3DDFA (Guo et al., 2018), which led to massive
artifacts. The Face Synthetics dataset was specifically
diversified during generation by various head poses,
human identities, clothes, and such occlusions as eye-
glasses and face masks.
Annotation. Since the face parsing annotation
process is more challenging than portrait segmenta-
tion, the data is frequently marked manually or au-
tomatically with further refining. Additionally, ex-
isting datasets had annotations created with inappro-
priate rules, limiting their usability for our purposes.
First, almost all reviewed datasets assign the beard to the skin class and the nostrils to the nose class. Second, some label glasses as skin, while others annotate transparent glasses as if they were opaque. Such factors make the skin enhancement task impossible due to the resulting artifacts. Third, none of them contain separate annotations for teeth. Finally, there are other difficulties present:
- LFW-PL (Kae et al., 2013) is limited to only 3 classes (background, face, hair), which is unsuitable for solving our specific problems.
- Helen's (Le et al., 2012) 2,330 facial images were annotated with facial part landmarks using Amazon Mechanical Turk, and the masks were then generated automatically. The LaPa's (Liu et al., 2020) authors pointed out Helen's annotation errors.
- CelebAMask-HQ (Lee et al., 2019) ignored occlusions altogether; however, the authors of (Yin and Chen, 2022) addressed this problem with the dataset extension FaceOcc. It contains images from CelebAMask-HQ annotated with a single occlusion class (eyeglasses, tongue, makeup, and others). Regardless, FaceOcc includes the beard in the skin class. Hereafter, we refer to CelebAMask-HQ combined with the FaceOcc annotations simply as FaceOcc.
- iBugMask (Lin et al., 2021) contains images with many persons but annotated masks for only one of them.
The mentioned datasets are unsuitable for our task
due to the outlined issues and additional challenges,
including the absence of FullHD images, a lim-
ited number of subjects, privacy concerns, and low-
quality annotations. In addition to general shortcom-
ings, other datasets lack video-conferencing domain-
specific characteristics like task-specific occlusions
and situations, various head poses, and domain con-
text scenes. Motivated by the above and the grow-
ing need for a suitable dataset tailored to video con-
ferencing applications, we developed a new dataset,
EasyPortrait, which includes both face parsing and
portrait segmentation annotations. We intentionally diversified EasyPortrait in head poses, subjects, scenes, subject attributes such as ethnicity, and occlusions (glasses, beards, piercings, etc.). It was annotated with all the classes required for our applications, with specific rules for the skin class and occlusions in particular (Table 3 in the supplementary material).
3 EasyPortrait DATASET
This part provides our dataset creation pipeline
overview, the dataset characteristics, and its splitting.
3.1 Image Collection & Labeling
Two crowdsourcing platforms, Toloka (https://toloka.yandex.ru/) and ABC Elementary (https://elementary.activebc.ru/), were chosen for all stages of dataset cre-
ation. Our pipeline consists of two main stages: (1)
the image collection stage, which is followed by val-
idation completely realized on Toloka, and (2) the
mask creation stage, which was accomplished on both
platforms. All crowd workers are ensured fair com-
pensation, reflecting their contributions and efforts.
At each stage, the responses of low-skilled workers
were subjected to our quality control methods. A
more detailed description is provided below.
Image Collection. The crowd workers’ task was
to take a selfie or a photo of themselves in front of the
computer. As we aimed to design a diverse dataset in
terms of occlusions, races, and head turns and make
it suitable to solve teeth whitening problems, one of
the further conditions periodically supplemented the
task:
- Occlusions. The sent photo should contain one of such characteristics as hands in front of the face, glasses, the tongue out, headphones, or hats.
- Head turns. The head on the sent photo should be turned in any direction at various angles.
- Teeth whitening. Random workers were asked to send photos with open mouths.
- Ethnicity. We recognized the importance of having a diverse dataset of facial images and ensured
the inclusion of individuals from various coun-
tries.
Note that all workers have signed a document stating
their consent to the photo publication before starting
the tasks.
Image Collection Quality Check. We foresaw
the possible dishonesty of the crowd workers and
checked all images for duplicates by image hash com-
parison. The image collection quality check also in-
cludes image validation, where images are reviewed
for compliance with conditions such as the distinct-
ness of the face, the head being entirely in the frame,
and the clarity of the frame. The validation stage was
available to crowd workers only after training and examination tasks. Each image was checked by 3 to 5 different users to reach at least 80% confidence.
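The exact hashing method is not specified in the paper; a minimal sketch of such a duplicate check, assuming perceptual hashing with the third-party ImageHash and Pillow packages, could look as follows:

```python
from pathlib import Path

import imagehash  # third-party: pip install ImageHash
from PIL import Image

def find_duplicates(image_dir: str, max_distance: int = 4):
    """Group images whose perceptual hashes are within max_distance bits of each other."""
    seen = {}          # hash -> first path with that hash
    duplicates = []    # (new_path, earlier_path) pairs
    for path in sorted(Path(image_dir).glob("*.jpg")):
        h = imagehash.phash(Image.open(path))
        for seen_hash, seen_path in seen.items():
            if h - seen_hash <= max_distance:  # subtraction gives the Hamming distance
                duplicates.append((path, seen_path))
                break
        else:
            seen[h] = path
    return duplicates
```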
Image Labeling. The annotation of portrait seg-
mentation usually has several ambiguous instances,
such as occlusions in front of the person, hand-held
items, hats, headphones, hair, and others. Face pars-
ing masks are also unclear due to occlusions in front
of the face, including tongue, hair, eyeglasses, beard,
etc. The annotation rules directly affect the final
segmentation masks and the model trained on them.
Rules for annotating each class and processing occlu-
sions for workers are given in Table 3 in the supple-
mentary material.
The labeling stage was divided into parts to sim-
plify the annotation process for the workers. All im-
ages received after the collection stage were gradually
sent to the annotation of individual pairs of classes:
person and background, skin and occlusions (which
include such things as eyeglasses, beard, tongue out,
and others), eyes and brows, lips and teeth. After la-
beling, we separated the overall masks of eyes and
brows into left and right ones using heuristics. Fig. 2
visualizes the mask creation stages.
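The heuristic itself is not described in the paper; one plausible version, sketched below with OpenCV and NumPy, splits the combined mask into connected components and assigns each component to the left or right side by comparing its centroid with the horizontal centroid of the skin mask:

```python
import cv2
import numpy as np

def split_left_right(pair_mask: np.ndarray, skin_mask: np.ndarray):
    """Split a combined eyes (or brows) binary mask into left/right binary masks."""
    m = cv2.moments(skin_mask, binaryImage=True)
    face_cx = m["m10"] / max(m["m00"], 1e-6)  # x-coordinate of the skin-mask centroid
    num, labels, _, centroids = cv2.connectedComponentsWithStats(pair_mask, connectivity=8)
    left = np.zeros_like(pair_mask)
    right = np.zeros_like(pair_mask)
    for idx in range(1, num):  # label 0 is the background
        component = (labels == idx).astype(pair_mask.dtype)
        if centroids[idx][0] < face_cx:   # component lies to the left of the face center
            left |= component
        else:
            right |= component
    return left, right
```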
Image Labeling Quality Check. We required all workers to complete training tasks for each class pair to enhance mask quality. We analyzed the quality of the crowd workers' training by automatically comparing their masks with those produced by professional data annotators.
We requested ABC Elementary’s qualified work-
ers to label each image with the subsequent verifica-
tion by the platform’s experts. Due to a lack of trust in
the platform’s experts, whom we did not manage, and
the shortage of qualified annotators capable of provid-
ing sufficient overlap, we incorporated low-skilled an-
notators into the pipeline with an overlap of 5. Thus,
each image was annotated by 5 crowd workers for subsequent averaging to obtain the best result. Segmentation masks were created from polygons. We aggregated the annotations of each image into one segmentation mask (see the bottom of Fig. 2).
Figure 2: Example of data collection pipeline. Each image was annotated by 5 annotators. The masks are averaged with the
expert-verified one and merged to obtain the final segmentation mask.
We then checked the IoU (Intersection over Union) and compared it to a threshold selected manually for each class pair (the thresholds were chosen by comparing crowd workers' masks with the corresponding experts' masks from training tasks to ensure a qualitative visual result). Then, we averaged the aggregated mask with the expert-verified one, using weights of 0.3 and 0.7, respectively.
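A simplified sketch of this aggregation step is given below, assuming binary per-class masks stored as NumPy arrays. The overlap of 5 and the 0.3/0.7 weighting come from the text; the majority-vote fusion rule, the IoU threshold value, and the handling of rejected masks are assumptions:

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over Union of two binary masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

def aggregate_class_mask(worker_masks, expert_mask, iou_threshold=0.8):
    """Fuse 5 crowd masks, check them against the expert mask, then blend 0.3/0.7.

    worker_masks: list of 5 binary (H, W) arrays from crowd annotators.
    expert_mask:  binary (H, W) array verified by a platform expert.
    iou_threshold: hypothetical value; in the paper it was chosen manually per class pair.
    """
    # Pixel-wise vote share over the 5 crowd annotations, then a simple majority vote.
    crowd_soft = np.stack([m.astype(np.float32) for m in worker_masks]).mean(axis=0)
    crowd_vote = (crowd_soft >= 0.5).astype(np.uint8)

    if iou(crowd_vote, expert_mask) < iou_threshold:
        raise ValueError("crowd result diverges from the expert mask; re-annotate")

    # Weighted average of crowd and expert masks (0.3 / 0.7, as stated in the paper);
    # how the resulting soft mask is binarized is not specified, so it is returned as-is.
    return 0.3 * crowd_soft + 0.7 * expert_mask.astype(np.float32)
```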
Mask Merging. The final masks were obtained by overlaying the individual class masks in the following order: person, face skin, left brow, right brow, left eye, right eye, lips, and teeth.
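A sketch of how such an overlay could be implemented from per-class binary masks; the numeric class indices are hypothetical, only the overlay order is taken from the text:

```python
import numpy as np

# Hypothetical class indices; the dataset defines 9 classes including "background" (0).
CLASS_ORDER = [
    ("person", 1), ("face skin", 2), ("left brow", 3), ("right brow", 4),
    ("left eye", 5), ("right eye", 6), ("lips", 7), ("teeth", 8),
]

def merge_masks(binary_masks: dict) -> np.ndarray:
    """Overlay per-class binary masks in order; later classes overwrite earlier ones."""
    height, width = next(iter(binary_masks.values())).shape
    label_map = np.zeros((height, width), dtype=np.uint8)  # 0 = background
    for name, index in CLASS_ORDER:
        label_map[binary_masks[name] > 0] = index
    return label_map
```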
In addition, the decision to release the dataset to the public, together with ethical considerations, prompted us to add a filtration stage at the end of the pipeline, checking for persons under 18, nudity, religious signs, and watermarks.
3.2 Dataset Characteristics
The mean and standard deviation of images
in EasyPortrait are [0.562, 0.521, 0.497] and
[0.236, 0.236, 0.232], respectively.
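For reference, these statistics can be plugged directly into a standard normalization step, e.g., with torchvision transforms:

```python
from torchvision import transforms

# Per-channel statistics of EasyPortrait (RGB, values scaled to [0, 1]).
EASYPORTRAIT_MEAN = [0.562, 0.521, 0.497]
EASYPORTRAIT_STD = [0.236, 0.236, 0.232]

preprocess = transforms.Compose([
    transforms.ToTensor(),                                   # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(EASYPORTRAIT_MEAN, EASYPORTRAIT_STD),
])
```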
Classes. EasyPortrait is annotated with 9 classes,
including “background”, “person”, “face skin”, “left
brow”, “right brow”, “left eye”, “right eye”, “lips”,
and “teeth”. We extracted all occlusions, such as
glasses, hair, hands, etc., from the skin. The beard
is extracted from the skin if it is clearly defined (refer
to Fig. 7 in the supplementary material for details).
However, such items as headphones, car seat belts, and others are included in the person class to facilitate background removal and streamline the data annotation process.
Diversity. The proposed dataset contains images
of scenes such as an office, living room, kitchen,
bedroom, outdoors, car, etc. Samples in the dataset
display various clothes, hats, headphones, and med-
ical masks). They are also diverse in lighting con-
ditions, subjects’ age, gender, and poses. Most sub-
jects in the dataset are between 20 and 40 years
old, with approximately equal numbers of women
and men (see Fig. 3h,k). Almost all images con-
tain only one person, which is especially common at
video conferences. We also collected images from regions such as Africa, Asia, India, and Europe, giving us approximate regional and ethnic diversity (see
Fig. 3a). Furthermore, some individuals in the pho-
tos display emotions, such as smiling, expressing
anger, being surprised, and others (see Figure 3j). We
examined the distribution of emotions using the fa-
cial emotional recognition model proposed by (Ryu-
mina et al., 2022), which categorizes emotions into 7
classes: “neutral”, “happiness”, “sadness”, “anger”,
“surprise”, “disgust”, and “fear”.
Images Resolution. We specifically focused on
collecting FullHD images to avoid visual artifacts as-
sociated with rescaling. Most images in EasyPortrait, namely 38,353, are FullHD: the longer side is 1,920 pixels, and the shorter side ranges from 835 to 1,920 pixels (Fig. 3e).
The minimal resolution is 720 × 720. Fig. 3g shows
the dataset’s samples’ mask area distribution.
Dataset Quality. Since images were labeled with polygons, we analyzed the number of polygon points per image for each class. On average, each EasyPortrait image is annotated with 655 points, which indicates highly detailed annotation. In comparison, Helen (Le et al., 2012) was annotated with only 194 points per image and LaPa (Liu et al., 2020) with 106.
Figure 3: Dataset statistics. a) subjects’ countries distribution; b), c), d) image distribution by subjects in train, validation, and
test sets, respectively (each bar represents the count of images recorded by a particular subject group); e) image resolution
distribution: samples overlap with equal transparency and density reveals quantity; f) brightness distribution; g) mask area
distribution; h) subjects’ age distribution; i) subjects’ devices: only smartphones, personal computers, and tablets were used
while recording; j) subjects’ emotions; k) subjects’ gender.
3.3 Dataset Splitting
All annotated images were divided into training, val-
idation, and testing sets, with 30,000, 4,000, and
6,000 samples. Training images were received from
4,398 unique users, while validation and testing im-
ages were collected from 3,468 and 6,000 unique
users, respectively (see Fig. 3b-d). The test set con-
tains the most unique users for maximum subject di-
versity. Additionally, the subjects in all three sets do
not overlap, eliminating any possibility of data leak-
age. We also added the anonymized user ID hash,
which can be used for manual dataset splitting.
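A sketch of a custom subject-disjoint split driven by that hash (the `user_hash` field name and the split ratios are assumptions):

```python
import random
from collections import defaultdict

def split_by_subject(samples, ratios=(0.75, 0.10, 0.15), seed=1001):
    """Split samples into train/val/test so that no subject appears in two sets.

    samples: iterable of dicts carrying a 'user_hash' key (assumed field name).
    """
    by_user = defaultdict(list)
    for sample in samples:
        by_user[sample["user_hash"]].append(sample)

    users = sorted(by_user)
    random.Random(seed).shuffle(users)
    n_train = int(len(users) * ratios[0])
    n_val = int(len(users) * ratios[1])
    groups = (users[:n_train], users[n_train:n_train + n_val], users[n_train + n_val:])

    def gather(group):
        return [s for user in group for s in by_user[user]]

    return gather(groups[0]), gather(groups[1]), gather(groups[2])
```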
4 ABLATION STUDY
We conducted an ablation study to assess the dataset’s
primary characteristics. By altering the dataset’s data
volume and head pose variability, we trained models
and compared them with those trained on the orig-
inal EasyPortrait. In the ablation study, we utilized
BiSeNetv2 (Yu et al., 2018), FPN (Lin et al., 2017),
FCN (Long et al., 2014), and Segformer-B0 (Xie
et al., 2021) for portrait segmentation and face pars-
ing tasks. Validation and test sets remain unchanged
in all ablation experiments.
Quantitative Necessity. To evaluate the impact
of data quantity, we trained selected models using
varying training set sizes: 30,000 (original), 20,000,
10,000, and 5,000 images. The training subsets were expanded deterministically, i.e., every image in the n[i] subset is also included in the n[i + 1] subset. The ablation study results are provided in Fig. 4. Both portrait segmentation and face parsing metrics increase as the training set grows; however, the improvement in portrait segmentation is less prominent than in face parsing.
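A sketch of the nested-subset construction, under the assumption that samples are ordered by a stable key such as an image identifier:

```python
def nested_training_subsets(train_samples, sizes=(5_000, 10_000, 20_000, 30_000)):
    """Return deterministic, nested training subsets: subsets[i] is a prefix of subsets[i + 1]."""
    ordered = sorted(train_samples, key=lambda s: s["image_id"])  # hypothetical, reproducible key
    return [ordered[:size] for size in sizes]
```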
Pose-Diversity Necessity. We also assessed the importance of head pose by varying pose diversity within 10,000 training images. We obtained the head pose coefficients (yaw and pitch) for each image in the dataset using the 3DDFA network (Guo et al., 2018). For both coefficients, we chose three windows, from a homogeneous to a heterogeneous pose distribution: [-7.5; 7.5], [-15; 15], and [-180; 180]. Reduc-
ing head pose heterogeneity results in declining face
parsing metrics (Fig. 4b). Variations in head rotations
do not significantly impact portrait segmentation met-
rics; therefore, we did not include a plot for these ex-
periments.
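A sketch of how a pose-restricted subset could be selected, assuming yaw and pitch (in degrees) were precomputed with 3DDFA for every sample:

```python
def filter_by_pose(samples, yaw_window=(-15.0, 15.0), pitch_window=(-15.0, 15.0), limit=10_000):
    """Keep up to `limit` samples whose head pose lies inside the given yaw/pitch window."""
    selected = []
    for sample in samples:
        yaw, pitch = sample["yaw"], sample["pitch"]  # assumed precomputed with 3DDFA
        if yaw_window[0] <= yaw <= yaw_window[1] and pitch_window[0] <= pitch <= pitch_window[1]:
            selected.append(sample)
        if len(selected) == limit:
            break
    return selected
```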
Cross-Dataset Ablation Study. We also con-
ducted an additional set of experiments: we trained
the FPN model on the EasyPortrait dataset with
changes in data diversity. Then, we evaluated the
model on other face parsing and portrait segmenta-
tion test sets, mentioned in Section 5.2. Alterations in
data quantity and head pose variations have minimal
impact on portrait segmentation results. In contrast,
in the face parsing task, an increase in data diversity
positively influences the model’s metrics (Fig. 4a-b).
5 EXPERIMENTS
The main goal of our extensive base experiments is to demonstrate that the dataset is able to train models that achieve competitive results without the need to simulate facial occlusions or pose variations (as in (Liu et al., 2020) and (Lin et al., 2021), respectively).
Figure 4: The impact visualization of such dataset characteristics as a) sample amount, b) head pose diversity for face parsing,
and c) sample amount for portrait segmentation task. Solid lines correspond to models trained and tested on the EasyPortrait
dataset. In contrast, the dotted line is the model pretrained on the EasyPortrait and tested on other datasets (see the legend for
details). We evaluated all the datasets discussed in Section 5.2; however, we did not create visualizations for datasets without
significant metric changes. Note that all the plots have different scales.
For this reason, we chose various models for our base experiments. We evaluate the models' quality via the mean Intersection-over-Union (mIoU) metric (Long et al., 2015).
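For completeness, a per-image NumPy sketch of the mIoU metric; evaluation frameworks typically accumulate per-class intersections and unions over the whole test set, which this simplified version does not do:

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean Intersection-over-Union over classes present in the prediction or ground truth."""
    ious = []
    for cls in range(num_classes):
        pred_cls = pred == cls
        target_cls = target == cls
        union = np.logical_or(pred_cls, target_cls).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        ious.append(np.logical_and(pred_cls, target_cls).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```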
5.1 Base Experiments
Separation on Two Tracks. We split our experiments into two tracks, portrait segmentation and face parsing, to transparently compare EasyPortrait with other datasets on each task separately. This division is also nec-
essary to avoid ambiguity and ensure the obtained
metrics represent both tasks. The portrait segmen-
tation is based on two EasyPortrait classes (“back-
ground” and “person”), whereas the face parsing
masks include eight classes (“background”, “skin”,
“left brow”, “right brow”, “left eye”, “right eye”,
“lips” and “teeth”). For portrait segmentation, we
defined all classes of EasyPortrait except the back-
ground as a person, while for face parsing, we desig-
nated the person class as the background. The model
configuration and training process are identical for
both tasks, except for the number of classes in the de-
coder model’s head.
Models. We prioritized lightweight architectures
for easy integration into videoconferencing apps, en-
abling real-time use. As general segmentation archi-
tectures, we selected BiSeNetv2 (Yu et al., 2018),
DeepLabv3 (Chen et al., 2017), FPN (Lin et al.,
2017), FCN (Long et al., 2014), DANet (Fu et al.,
2019), and Fast SCNN (Poudel et al., 2019) mod-
els. We utilized Segformer-B0 (Xie et al., 2021) to
assess the performance of the transformer model on
the proposed dataset. Besides the aforementioned
widespread segmentation architectures, we experi-
mented with models specifically designed for por-
trait segmentation and face parsing. For this pur-
pose, we chose the SINet (Park et al., 2019a) and Ex-
tremeC3Net (Park et al., 2019b) for the first one and
EHANet (Luo et al., 2020) model for the second.
We trained each of these networks for 100 epochs with a batch size of 32. AdamW (Loshchilov and Hutter, 2017) was used as the optimizer with an initial learning rate (LR) of 0.0002. The LR follows a polynomial LR scheduler with a factor of 1.0 by default. We also note that all random seeds in Python and PyTorch were fixed to a value of 1001 to enhance the reproducibility of the experiments.
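A minimal PyTorch sketch of this configuration with a stand-in model; the interpretation of the scheduler factor as the polynomial power and the total iteration count are assumptions:

```python
import random

import numpy as np
import torch
from torch import nn

def fix_seeds(seed: int = 1001) -> None:
    """Fix Python, NumPy, and PyTorch seeds, as in the experiments."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

def build_optimization(model: nn.Module, steps_per_epoch: int, epochs: int = 100):
    """AdamW with LR 2e-4 and a polynomial LR schedule (power 1.0) over all iterations."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
    scheduler = torch.optim.lr_scheduler.PolynomialLR(
        optimizer, total_iters=epochs * steps_per_epoch, power=1.0)
    return optimizer, scheduler

# Example: a stand-in 9-class segmentation head and one synthetic batch of 32 images.
fix_seeds()
model = nn.Conv2d(3, 9, kernel_size=1)
optimizer, scheduler = build_optimization(model, steps_per_epoch=30_000 // 32)
criterion = nn.CrossEntropyLoss()

images = torch.rand(32, 3, 384, 384)
masks = torch.randint(0, 9, (32, 384, 384))
loss = criterion(model(images), masks)
loss.backward()
optimizer.step()
scheduler.step()  # the scheduler is stepped once per training iteration
```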
Augmentations and Image Resolution. Images and segmentation masks were resized so that the maximum side equals 384, with aspect ratio preservation, and then symmetrically padded to a square. We used bilinear interpolation for image resizing, while nearest-neighbor interpolation was applied to masks to maintain class consistency. Photometric distortion was used with a brightness delta of 16, a contrast in the range [0.5, 1.0], saturation in the range [0.5, 1.0], and a hue delta of 5. Finally, the images were normalized using the precomputed per-channel dataset statistics listed in Section 3.2.
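A sketch of this preprocessing with OpenCV; the padding value, the order of the photometric operations, and treating the hue delta as degrees are assumptions:

```python
import random

import cv2
import numpy as np

def resize_and_pad(image, mask, size=384):
    """Resize so the longer side equals `size` (keeping aspect ratio) and pad to a square."""
    scale = size / max(image.shape[:2])
    new_h, new_w = round(image.shape[0] * scale), round(image.shape[1] * scale)
    image = cv2.resize(image, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
    mask = cv2.resize(mask, (new_w, new_h), interpolation=cv2.INTER_NEAREST)  # keep class ids intact
    pad_h, pad_w = size - new_h, size - new_w
    top, left = pad_h // 2, pad_w // 2  # symmetric padding
    image = cv2.copyMakeBorder(image, top, pad_h - top, left, pad_w - left,
                               cv2.BORDER_CONSTANT, value=0)
    mask = cv2.copyMakeBorder(mask, top, pad_h - top, left, pad_w - left,
                              cv2.BORDER_CONSTANT, value=0)
    return image, mask

def photometric_distortion(image):
    """Brightness delta 16, contrast and saturation in [0.5, 1.0], hue delta 5."""
    image = image.astype(np.float32)
    image += random.uniform(-16, 16)                     # brightness shift
    image *= random.uniform(0.5, 1.0)                    # contrast scaling
    hsv = cv2.cvtColor(np.clip(image, 0, 255).astype(np.uint8), cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] *= random.uniform(0.5, 1.0)              # saturation scaling
    hsv[..., 0] = (hsv[..., 0] + random.uniform(-5, 5)) % 180  # hue shift (OpenCV range 0-179)
    return cv2.cvtColor(np.clip(hsv, 0, 255).astype(np.uint8), cv2.COLOR_HSV2BGR)
```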
The results of our experiments are presented in Table 4 in the supplementary material. All the models trained on our dataset achieve high metrics, with the FPN model outperforming the others in both tasks.
5.2 Cross-Dataset Evaluation
We conduct cross-dataset evaluation to compare our
dataset with existing ones in face parsing and portrait
segmentation domains.
Experiments Configuration. We train the FPN model for the two segmentation tasks on each dataset. All dataset samples were resized to a fixed 384 × 384 shape and passed through the base augmentation pipeline described in Section 5.1. The training process and model configuration are the same as in the base experiments.
Table 2: Cross-dataset evaluation results. Each cell value contains mean IoU (mIoU) metrics for the corresponding training
and testing sets pair. Train (test) average mIoU represents the overall mIoU value on the listed testing (training) sets. The
high train average mIoU metric directly relates to the dataset’s generalization ability. A low test average mIoU metric reflects
dataset complexity, as a model pre-trained on a different set struggles to achieve a high metric. We highlighted the best metric
in each column to emphasize the dataset’s ability to generalize to other distributions. The best metric in all columns except
the last one was chosen, excluding diagonal values.
Portrait Segmentation

| Trained \ Tested | EasyPortrait (ours) | FVS | HumanSeg14K | Face Synthetics | Train avg. mIoU |
|---|---|---|---|---|---|
| EasyPortrait (ours) | 98.64 | 97.86 | 93.18 | 97.76 | 96.86 |
| FVS (Kuang and Tie, 2021) | 79.05 | 96.24 | 90.6 | 80.36 | 86.56 |
| HumanSeg14K (Chu et al., 2021) | 76.01 | 96.23 | 97.53 | 71.66 | 85.35 |
| Face Synthetics (Wood et al., 2021) | 84.99 | 57.14 | 57.87 | 99.44 | 74.86 |
| Test avg. mIoU | 84.67 | 86.87 | 84.8 | 87.31 | |

Face Parsing

| Trained \ Tested | EasyPortrait (ours) | CelebAMask-HQ | iBugMask | Face Synthetics | LaPa | Train avg. mIoU |
|---|---|---|---|---|---|---|
| EasyPortrait (ours) | 81.51 | 76.01 | 39.0 | 51.2 | 61.03 | 61.75 |
| CelebAMask-HQ (Lee et al., 2019) | 66.17 | 83.41 | 54.74 | 46.6 | 60.62 | 62.31 |
| iBugMask (Lin et al., 2021) | 61.58 | 79.1 | 64.59 | 44.42 | 66.3 | 63.19 |
| Face Synthetics (Wood et al., 2021) | 55.55 | 40.67 | 18.84 | 83.12 | 42.63 | 48.16 |
| LaPa (Liu et al., 2020) | 68.56 | 73.92 | 47.66 | 48.05 | 79.02 | 63.44 |
| Test avg. mIoU | 66.67 | 70.62 | 44.97 | 54.68 | 61.92 | |
Portrait Segmentation. Besides our dataset, the
model was trained and tested on HumanSeg14K (Chu
et al., 2021), Face Synthetics (Wood et al., 2021),
and FVS (Kuang and Tie, 2021) portrait segmentation
datasets. We could not include the EG-1800 (Shen
et al., 2016) and the AiSeg (ais, 2019) datasets due
to a lack of images on the public shared sources and
inappropriate samples, respectively.
Some additional preprocessing steps were applied:

- We brought EasyPortrait's “person” class to a consistent definition by relabeling all other classes (except “background”) as “person”.
- FVS (Kuang and Tie, 2021) contains non-binary masks, necessitating binarization. Pixel values clustered near 0 or 255, so we used a threshold of 127 to separate “person” and “background”. The dataset was split into 1,326 training and 935 testing samples, with 200 images randomly selected from the training set for validation.
- The HumanSeg14K (Chu et al., 2021) dataset was divided into training, validation, and test parts with 8,770, 2,431, and 2,482 samples, respectively.
- Similar to EasyPortrait's preprocessing, we converted the Face Synthetics (Wood et al., 2021) annotations to portrait segmentation masks. We randomly picked 75,000 training, 15,000 testing, and 10,000 validation samples.
Face Parsing. Given that the EasyPortrait skin
class was annotated using unique rules and most face
parsing datasets lack annotations for the teeth class,
we selected only six classes for cross-dataset evalua-
tion: “background”, “left brow”, “right brow”, “left
eye”, “right eye”, and “lips”. We adapted the original annotations of the face parsing datasets to this target set:

- The EasyPortrait classes “teeth”, “person”, and “skin” were mapped to “background”.
- Since the lips in the CelebAMask-HQ dataset are divided into two classes, “lower lip” and “upper lip”, we combined them into one “lips” class. The remaining classes are treated as background. All datasets below were preprocessed similarly to CelebAMask-HQ (Lee et al., 2019). We divided the CelebAMask-HQ dataset into 22,500 training, 3,000 validation, and 4,500 test samples.
- The LaPa (Liu et al., 2020) dataset was originally split into 18,167 training, 2,000 validation, and 2,000 test samples.
- Originally, images from iBugMask (Lin et al., 2021) were split into 21,866 training and 1,000 testing examples. The validation set of 1,866 images was randomly sampled from the training set. Note that iBugMask provides images annotated with a bounding box for the target face in the mask. To avoid parsing other faces, we cropped the images as described in the original paper.
- The Face Synthetics dataset was distributed by us into 75,000 training, 10,000 validation, and 15,000 test samples.
Results. The cross-dataset evaluation results in Table 2 demonstrate that EasyPortrait has the best generalization capability regarding mIoU metrics on each portrait segmentation test set. Moreover, the FPN model trained on EasyPortrait for portrait segmentation surpasses the FVS results even on its own test set. Due to the reduced list of
classes, the quantitative assessment provides limited insight into the dataset's applicability to the face parsing task. Besides, the videoconferencing domain is somewhat limited: images in other datasets are more heterogeneous in context, displaying multiple people and different activities, while EasyPortrait consistently shows a single person in front of a computer or phone. Despite this, we achieved competitive results in the face parsing task.
6 DISCUSSION
Ethical Considerations. Creating facial feature
datasets involves significant ethical considerations.
Individuals must provide informed consent for their
facial data to be collected, stored, and used. Instruc-
tions for the crowd tasks were presented in clear lan-
guage. Additionally, ensuring diversity in the dataset
is vital to prevent biases leading to unfair treatment
of certain groups. To address privacy concerns, all
crowd workers signed a consent form allowing us to
process and publish their photos. We adhere to the
Federal Law ”On Personal Data” (27.07.2006 N152)
of Russia, ensuring legal compliance in data handling.
Besides, we used anonymized user ID hashes to pro-
tect crowd workers’ privacy. Despite the limitations
of using the Russian crowdsourcing platform, efforts
were made to minimize biases and make the dataset
as racially diverse as possible. Thus, the three most
frequently identified races – Caucasian, Negroid, and
Mongoloid – are covered.
Possible Misuse. Facial feature datasets can be
misused in various ways, such as enhancing surveil-
lance systems for mass tracking and profiling individ-
uals based on race or other attributes, leading to dis-
crimination. They also pose risks of creating deep-
fakes and identity theft. Note that we employ the col-
lected data exclusively for research purposes and urge
the potential users to follow ethical research prac-
tices, which include proper citation, acknowledgment
of data sources, and sharing their work under the same
license as the original.
Limitations. The proposed dataset was designed for one concrete domain, video conferencing, which entails some limitations in scenes, occlusions, and the number of people in photos. While robustness to multiple subjects at inference time can be achieved with mosaic augmentation, a system based on EasyPortrait can still produce errors on unfamiliar backgrounds and accessories.
Another area for improvement in the current version of EasyPortrait is the lack of some classes, which reduces the number of its applications. However, we plan to significantly expand the class list with “mouth”, “hair”, “headphones”, “glasses”, “earrings”, “nose”, “hat”, “neck”, and “beard”.
The limited choice of crowdsourcing platforms caused a bias towards the Caucasian race. Despite the attempts to diversify the data regarding subjects' countries, the bias toward Russia remained (see Fig. 3a). Similarly, a full balance of subjects' emotions and head turns was not reached (see Fig. 3j and Fig. 8, respectively). Overcoming the
(see Fig. 3i and Fig. 8, respectively). Overcoming the
described limitations presents an opportunity to de-
velop a system that will be more robust to variability
in the environment and people. Expanding dataset di-
versity will be a valuable focus for future work.
7 CONCLUSION
We propose a large-scale image dataset for por-
trait segmentation and face parsing, which contains
40,000 indoor photos of people, each with a high-
quality 9-class semantic mask. EasyPortrait sup-
ports beautification and segmentation tasks like back-
ground removal, skin enhancement, and teeth whiten-
ing, enhancing user experience in video conferenc-
ing apps. We performed extensive model experiments
and cross-dataset comparisons, with an ablation study
highlighting the importance of data quantity and head
pose diversity for robust training. Future work in-
cludes adding occlusions to annotations and expand-
ing the number of classes (see Section 6).
REFERENCES
(2019). AISeg. https://github.com/aisegmentcn/matting_human_datasets.
(2020). Persons Labeled. https://ecosystem.supervise.ly/projects/persons.
Chen, L.-C., Papandreou, G., Schroff, F., and Adam, H.
(2017). Rethinking atrous convolution for semantic
image segmentation.
Chu, L., Liu, Y., Wu, Z., Tang, S., Chen, G., Hao, Y., Peng,
J., Yu, Z., Chen, Z., Lai, B., and Xiong, H. (2021).
Pp-humanseg: Connectivity-aware portrait segmenta-
tion with a large-scale teleconferencing video dataset.
2022 IEEE/CVF Winter Conference on Applications
of Computer Vision Workshops (WACVW), pages 202–
209.
Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., and Lu,
H. (2019). Dual attention network for scene segmen-
tation.
Guo, J., Zhu, X., and Lei, Z. (2018). 3DDFA. https://github.com/cleardusk/3DDFA.
Kae, A., Sohn, K., Lee, H., and Learned-Miller, E. G.
(2013). Augmenting crfs with boltzmann machine
shape priors for image labeling. 2013 IEEE Con-
ference on Computer Vision and Pattern Recognition,
pages 2019–2026.
Kuang, Z. and Tie, X. (2021). Flow-based video seg-
mentation for human head and shoulders. ArXiv,
abs/2104.09752.
Le, V., Brandt, J., Lin, Z. L., Bourdev, L. D., and Huang,
T. S. (2012). Interactive facial feature localization. In
European Conference on Computer Vision.
Lee, C.-H., Liu, Z., Wu, L., and Luo, P. (2019). Maskgan:
Towards diverse and interactive facial image manip-
ulation. 2020 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pages 5548–
5557.
Li, J., Ma, S., Zhang, J., and Tao, D. (2021). Privacy-
preserving portrait matting. Proceedings of the 29th
ACM International Conference on Multimedia.
Liang, J., Zeng, H., Cui, M., Xie, X., and Zhang, L.
(2021). Ppr10k: A large-scale portrait photo retouch-
ing dataset with human-region mask and group-level
consistency. 2021 IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), pages
653–661.
Lin, S., Ryabtsev, A., Sengupta, S., Curless, B., Seitz, S. M.,
and Kemelmacher-Shlizerman, I. (2020). Real-time
high-resolution background matting. 2021 IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion (CVPR), pages 8758–8767.
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017). Feature pyramid networks for object detection.
Lin, Y., Shen, J., Wang, Y., and Pantic, M. (2021). Roi tanh-
polar transformer network for face parsing in the wild.
Image Vis. Comput., 112:104190.
Liu, Y., Shi, H., Shen, H., Si, Y., Wang, X., and Mei, T.
(2020). A new dataset and boundary-attention seman-
tic segmentation for face parsing. In AAAI Conference
on Artificial Intelligence.
Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learn-
ing face attributes in the wild.
Long, J., Shelhamer, E., and Darrell, T. (2014). Fully con-
volutional networks for semantic segmentation. 2015
Proceedings of the IEEE Conference on Computer Vi-
sion and Pattern Recognition (CVPR), 3431-3440.
Long, J., Shelhamer, E., and Darrell, T. (2015). Fully con-
volutional networks for semantic segmentation.
Loshchilov, I. and Hutter, F. (2017). Decoupled
weight decay regularization. 2019 The International
Conference on Learning Representations (ICLR),
abs/1711.05101.
Luo, L., Xue, D., and Feng, X. (2020). Ehanet: An effec-
tive hierarchical aggregation network for face parsing.
Applied Sciences, 10(9):3135.
Park, H., Sjösund, L. L., Monet, N., Yoo, Y., and Kwak, N. (2019a). SINet: Extreme lightweight portrait segmentation networks with spatial squeeze modules and information blocking decoder. arXiv preprint arXiv:1911.09099.
Park, H., Sjösund, L. L., Yoo, Y., and Kwak, N. (2019b). ExtremeC3Net: Extreme lightweight portrait segmentation networks using advanced C3-modules. arXiv preprint arXiv:1908.03093.
Poudel, R. P. K., Liwicki, S., and Cipolla, R. (2019). Fast-
scnn: Fast semantic segmentation network.
Ryumina, E., Dresvyanskiy, D., and Karpov, A. (2022).
In search of a robust facial expressions recognition
model: A large-scale visual cross-corpus study. Neu-
rocomputing.
Sander, E. L. J. (2020). Coronavirus could spark a revo-
lution in working from home: Are we ready? The
conversation.
Shen, X., Hertzmann, A., Jia, J., Paris, S., Price, B. L.,
Shechtman, E., and Sachs, I. (2016). Automatic por-
trait segmentation for image stylization. Computer
Graphics Forum, 35.
Taherkhani, F., Nasrabadi, N. M., and Dawson, J. (2018).
A deep face identification network enhanced by facial
attributes prediction.
Wood, E., Baltruvsaitis, T., Hewitt, C., Dziadzio, S., John-
son, M., Estellers, V., Cashman, T. J., and Shotton, J.
(2021). Fake it till you make it: face analysis in the
wild using synthetic data alone. 2021 IEEE/CVF In-
ternational Conference on Computer Vision (ICCV),
pages 3661–3671.
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Álvarez, J. M., and Luo, P. (2021). SegFormer: Simple and efficient design for semantic segmentation with transformers. In Neural Information Processing Systems.
Yin, X. and Chen, L. L. (2022). Faceocc: A diverse, high-
quality face occlusion dataset for human face extrac-
tion. ArXiv, abs/2201.08425.
Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., and Sang, N.
(2018). Bisenet: Bilateral segmentation network for
real-time semantic segmentation. In European Con-
ference on Computer Vision.
Zhu, X., Liu, X., Lei, Z., and Li, S. Z. (2017). Face align-
ment in full pose range: A 3d total solution. IEEE
transactions on pattern analysis and machine intelli-
gence.
APPENDIX
In the supplementary materials, we include illustra-
tions of selected samples with their corresponding
annotation masks from the EasyPortrait dataset, the
head turns distributions across different datasets, the
guidelines for class annotations, and the evaluation re-
sults on the EasyPortrait dataset.
Figure 5: Visual comparison of existing portrait segmentation datasets. One can notice high-frequency details (e.g., hair)
in segmentation masks in samples from our dataset. The AiSeg (ais, 2019) dataset is not considered since it provides the
extracted foreground images without a corresponding annotation mask.
Figure 6: Visual comparison of existing face parsing datasets. Only Face Synthetics (Wood et al., 2021) and EasyPortrait
datasets can be used to solve background removal and face enhancement problems. None of them except EasyPortrait can be
used for teeth whitening. We do not include LFW-PL (Kae et al., 2013) and FaceOcc (Yin and Chen, 2022) datasets in the
visualization due to the lack of classes and the need for preprocessing, respectively.
Figure 7: Visualization of the beard annotation rules. (up) The beard is included in the skin if it is a separate hair or barely
noticeable. (bottom) The beard is excluded from the skin if it is clear.
Figure 8: Head turns distributions for several face parsing and portrait segmentation datasets, including EasyPortrait. Yaw
and pitch coefficients were obtained via 3DDFA network (Guo et al., 2018).
Table 3: EasyPortrait annotation rules.

Person
- headphones and objects in front of the person are included in the person class
- individual hairs and all empty areas enclosed by a person are not included in the person class

Eyebrows
- stand out along a strict border, excluding individual hairs

Eyes
- distinguished by the whites, excluding eyelids and eyelashes

Skin
- the skin class should cover only skin, without hair, eyes, and other facial attributes
- the boundaries of the skin of the face or person should be outlined logically in overexposed or darkened photos
- sparse bristle is also considered skin
- ears, a second chin, and nostrils are not included in the skin class

Teeth
- teeth and everything else in the open mouth stand out separately, the latter as an occlusion

Occlusions
- makeup and piercings are defined as occlusions
- the part of eyeglasses that covers skin should be annotated as occlusion, including sunglasses and glare on clear lenses
- a beard with a strict border is considered occlusion
- the tongue out of the mouth should be annotated as occlusion
Table 4: Evaluation results on the EasyPortrait. Column “mIoU” is divided into two subcolumns: face parsing and portrait
segmentation tasks. For face parsing, we present mIoU metrics for each class separately, while for portrait segmentation,
we provide only the overall mIoU score. We additionally trained FPN and Segformer-B0 on 224 × 224, 512 × 512, and
1024 × 1024 resolutions to demonstrate the overall increasing tendency amongst both convolutional and transformer models
depending on increasing resolution.
| Model | Input Size | Model Size (MB) | FPS | skin | l-eye | r-eye | l-brow | r-brow | lips | teeth | Face Parsing overall | PS overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BiSeNetv2 (Yu et al., 2018) | 384 | 56.5 | 91.47 | 90.75 | 71.94 | 72.57 | 67.67 | 67.53 | 80.87 | 63.09 | 76.72 | 97.95 |
| SegFormer-B0 (Xie et al., 2021) | 384 | 14.19 | 72.45 | 92.05 | 78.55 | 79.26 | 72.5 | 72.21 | 83.53 | 73.52 | 81.38 | 98.61 |
| FCN + MobileNetv2 (Long et al., 2014) | 384 | 31.17 | 66.07 | 90.49 | 69.95 | 70.63 | 66.29 | 66.09 | 79.23 | 59.84 | 75.23 | 98.19 |
| FPN + ResNet50 (Lin et al., 2017) | 384 | 108.91 | 58.1 | 92.28 | 79.48 | 80.08 | 72.64 | 72.47 | 84.15 | 74.09 | 81.83 | 98.64 |
| DeepLabv3 (Chen et al., 2017) | 384 | 260.02 | 25.65 | 91.77 | 73.78 | 74.63 | 69.61 | 69.74 | 83.42 | 70.53 | 79.11 | 98.63 |
| Fast SCNN (Poudel et al., 2019) | 384 | 6.13 | 93.89 | 88.58 | 58.42 | 58.7 | 58.68 | 58.87 | 73.16 | 44.86 | 67.56 | 97.64 |
| DANet (Fu et al., 2019) | 384 | 190.29 | 42.43 | 91.8 | 74.01 | 74.93 | 70.01 | 69.75 | 83.7 | 70.8 | 79.3 | 98.63 |
| EHANet (Luo et al., 2020) | 384 | 44.81 | 132.78 | 89.68 | 68.87 | 69.26 | 63.6 | 63.82 | 73.98 | 52.05 | 72.56 | - |
| SINet (Park et al., 2019a) | 384 | 0.13 | 134.18 | - | - | - | - | - | - | - | - | 93.32 |
| ExtremeC3Net (Park et al., 2019b) | 384 | 0.15 | 71.75 | - | - | - | - | - | - | - | - | 96.54 |
| SegFormer-B0 | 224 | 14.9 | 74.84 | 90.19 | 68.59 | 70.46 | 65.79 | 65.72 | 77.94 | 60.66 | 74.83 | 98.17 |
| FPN + ResNet50 | 224 | 108.91 | 61.56 | 90.6 | 69.67 | 71.88 | 65.84 | 65.64 | 78.94 | 62.95 | 75.6 | 98.31 |
| SegFormer-B0 | 512 | 14.9 | 65.88 | 92.5 | 81.03 | 81.18 | 74.31 | 74.08 | 84.87 | 78.14 | 83.19 | 98.66 |
| FPN + ResNet50 | 512 | 108.91 | 53.14 | 92.55 | 81.55 | 81.47 | 74.33 | 74.38 | 85.27 | 77.77 | 83.33 | 98.64 |
| SegFormer-B0 | 1024 | 14.9 | 62.9 | 93.13 | 84.2 | 83.97 | 76.41 | 76.12 | 86.88 | 83.2 | 85.42 | 98.74 |
| FPN + ResNet50 | 1024 | 108.91 | 52.34 | 92.94 | 84.55 | 84.24 | 76.11 | 76.11 | 86.93 | 82.62 | 85.37 | 98.54 |