Intrinsic Image Decomposition: Challenges and New Perspectives
Diclehan Ulucan, Oguzhan Ulucan and Marc Ebner
Institut für Mathematik und Informatik, Universität Greifswald,
Walther-Rathenau-Straße 47, 17489 Greifswald, Germany
Keywords:
Intrinsic Image Decomposition, Performance Evaluation, Error Metrics.
Abstract:
In the field of intrinsic image decomposition, alongside developing a robust algorithm for the ill-posed problem,
it is also required to benchmark the method on a comprehensive dataset by using a suitable evaluation metric.
However, there are certain limitations in existing evaluation metrics. In this study, two new evaluation strategies
are proposed to analyze intrinsics according to their characteristics. The ensemble of metrics combines different
perceptual quality metrics in scale-space, while the imperceptible Δ𝐸 score is the modified version of the
classical Δ𝐸 metric. Intrinsic image decomposition studies that extract the reflectance and shading images
are benchmarked on two datasets. Furthermore, an overview of the field of intrinsic image decomposition is
provided and the challenges that have to be overcome are pointed out.
1 INTRODUCTION
We can easily perceive our surroundings by uncon-
sciously making use of our abilities to estimate dis-
tances, discount the illuminant, and differentiate be-
tween colors (Zeki, 1993; Ulucan et al., 2022c). These
abilities are difficult to mimic for machine systems.
One way to enable an artificial system to carry out
such tasks is to make use of intrinsic image decom-
position. An image can be decomposed into a "fam-
ily of intrinsic characteristics" (Barrow et al., 1978).
Each element of this family is a low-level feature of the
input scene and it is called an intrinsic image. Each
intrinsic image allows us to extract distinct character-
istics of a scene more efficiently (Ebner, 2007).
There are several problems in intrinsic image de-
composition and one of the main challenges arises
from its nature (Bonneel et al., 2017). Intrinsic image
decomposition is a severely under-constrained prob-
lem, therefore most of the studies only consider the
reflectance and shading features to simplify the prob-
lem. While the shading 𝑆 can be described as the
component demonstrating the interaction between the
illumination and the surfaces, the reflectance 𝑅 can be
defined as the element providing the ratio between the
total incident and total reflected illumination (Barrow
et al., 1978; Shen et al., 2011). An image 𝐼 at location
(𝑥, 𝑦) can be represented as follows:
𝐼(𝑥,𝑦) = 𝑅(𝑥,𝑦) 𝑆(𝑥,𝑦). (1)
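As a minimal illustration of Eq. (1), the following Python sketch builds a toy image from a hypothetical reflectance and shading pair. It also shows why the problem is under-constrained: scaling the reflectance by a factor and the shading by its inverse yields exactly the same image. The array values and variable names are illustrative assumptions and are not taken from any dataset.

```python
import numpy as np

# Hypothetical 2x2 RGB reflectance and single-channel shading (toy values).
reflectance = np.array([[[0.8, 0.2, 0.2], [0.2, 0.8, 0.2]],
                        [[0.2, 0.2, 0.8], [0.5, 0.5, 0.5]]])
shading = np.array([[1.0, 0.5],
                    [0.25, 0.1]])

# Eq. (1): the image is the per-pixel product of reflectance and shading.
image = reflectance * shading[..., None]

# If one intrinsic is known, the other follows by division
# (the small epsilon guards against division by zero in dark pixels).
eps = 1e-8
recovered_reflectance = image / (shading[..., None] + eps)

# Ambiguity: (k * R) * (S / k) gives the same image for any k > 0.
k = 2.0
same_image = (reflectance * k) * (shading[..., None] / k)
```

Since both factorizations reproduce the input, additional priors or constraints are needed to select one of them.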
Another challenge in intrinsic image decomposi-
tion is the lack of a common evaluation benchmark.
The utilized datasets have a tendency to meet the as-
sumptions made in the proposed algorithms, hence
objectively determining the best performing method
is quite difficult (Bonneel et al., 2017). The fact that
almost all datasets have distinct characteristics makes
it also hard to assess intrinsic image decomposition
methods in a robust manner. For instance, in the MIT
Intrinsic Images dataset (Grosse et al., 2009) one ob-
ject is placed in front of a black background without
any strong shadow or color casts. In the Intrinsic Im-
ages in the Wild Dataset (Bell et al., 2014), the ground
truth information is subjective. There also exist other
large-scale datasets; however, some of them consist of
images containing only a single 3D model rendered
with an environment map, and in these datasets the
object can be easily segmented since it is placed in
the foreground (Shi et al., 2017). It is also worth
mentioning that some datasets are not specifically
designed for intrinsic image decomposition but can be
used in this field, while others
contain very complex scenes (Li et al., 2021; Roberts
et al., 2021). Based on these observations, we recently
introduced a comprehensive intrinsic image decom-
position benchmark called IID-NORD (Ulucan et al.,
2022a), which contains ground truth information for 5
intrinsic images, namely, reflectance, shading, depth
map, surface normals, and light vectors.
Ulucan, D., Ulucan, O. and Ebner, M.
Intrinsic Image Decomposition: Challenges and New Perspectives.
DOI: 10.5220/0011969800003497
In Proceedings of the 3rd International Conference on Image Processing and Vision Engineering (IMPROVE 2023), pages 57-64
ISBN: 978-989-758-642-2; ISSN: 2795-4943
Copyright © 2023 by SCITEPRESS Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)
A further point considered as a challenge in in-
trinsic image decomposition studies is the absence of
quality metrics reflecting the actual performance of
the algorithms (Garces et al., 2022). Since existing
image quality metrics focus on specific features that
are important for the task at hand, it is difficult to find
a metric performing robustly in a field such as intrin-
sic image decomposition, which requires the analysis
of distinct images at once to make an overall rank-
ing. Therefore, a metric, which analyzes distinct in-
trinsics by taking into account the individual features
that each intrinsic holds and outputs a global quality
score, would allow us to benchmark intrinsic image
decomposition algorithms in a more accurate manner.
On the other hand, since intrinsic images can be used
in pipelines of different computer vision tasks, e.g. the
reflectance can be adopted for image segmentation,
metrics that are able to investigate an intrinsic indi-
vidually are also needed.
Consequently, as the utilization of intrinsic image
decomposition contributes to various computer vision
tasks and computer graphics applications, it is essen-
tial to analyze the performances of existing intrinsic
image decomposition algorithms in a robust manner
to point out the shortcomings and strengths of the
methods. This will also pave the way for designing more
efficient intrinsic image decomposition approaches.
Therefore, in this study, two new evaluation strategies
are introduced, namely the "ensemble of metrics" and the
imperceptible Δ𝐸 score, which is a modified version of
the Δ𝐸 metric. In addition, seven existing metrics are
used to demonstrate the performance of intrinsic image
decomposition methods. To the best of our knowledge,
some of them have not yet been considered in this field,
although they can be beneficial for intrinsic image
decomposition studies. Furthermore, a
subset of our recently introduced dataset IID-NORD is
created, which contains only the reflectance and shading
elements.
This paper is organized as follows. Section 2
presents intrinsic image decomposition algorithms.
Section 3 introduces the new evaluation metrics. Sec-
tion 4 discusses the experimental results. Section 5
gives a brief summary of the study.
2 INTRINSIC IMAGE
DECOMPOSITION METHODS
In the computer vision community, intrinsic image
decomposition has been widely studied in the last decades,
but the fundamental observations it is based on
date back more than a thousand years to Alhazen,
a famous scientist who left his substantial observations
in optics as a legacy to the researchers in this
field (Barrow et al., 1978; Barron and Malik, 2014).
The challenges it holds and the benefits it can pro-
vide made intrinsic image decomposition an attrac-
tive research field. Numerous algorithms based on
various approaches and input requirements have been
proposed in the last five decades. The intrinsic im-
age decomposition methods may need multiple im-
ages taken under different lights, an input sequence
where the light source is positioned at diverse loca-
tions in each image, a time-varying image stack, user
scribbles, multiple images with distinct viewing con-
ditions, depth information, different focal distances,
or a single RGB input (Bonneel et al., 2017). An in-
trinsic image decomposition algorithm requiring only
a single input image can be considered as more ad-
vantageous than methods relying on multiple images
and different necessities. This observation relies on
the fact that real-world single images are widely avail-
able and it is laborious to create image sequences in
the appropriate format. Also, for tasks where intrinsic
image decomposition is used as a pre-processing step,
it is unlikely to have an input stack and inefficient to re-
quire user interaction. Based on these observations, in
this study, algorithms relying on a single RGB image
are considered for the experiments, since they reflect
the requisites of many different applications. In this
section, these algorithms are explained briefly.
The Retinex algorithm is one of the oldest intrin-
sic image decomposition studies (Land, 1964). The
method is inspired by biological findings, which are
based on Land’s famous experiments. The algorithm
relies on the observations that adjacent regions of dis-
tinct objects have sharp reflectance changes since the
alteration between the intensities is large, and flat sur-
faces and shadows have smooth intensity differences.
As a result, while the large gradient changes in an im-
age are mostly due to changes in reflectance, small
gradients are associated with the shading component.
Later on, the Retinex algorithm is combined with a
non-local reflectance constraint (Zhao et al., 2012). It
is assumed that whenever two pixels have the same
chromaticity texture vectors they also have the same
reflectance value. Another intrinsic image decom-
position algorithm is based on the assumption that
considerably small patches in natural images should
have similar reflectance values (Shen et al., 2011).
Hence, the intrinsic image decomposition problem is
solved by optimizing an energy function whose
constraints assign larger weights to the local neighboring
pixels. In the SIRFS algorithm, intrinsic images
are extracted from a masked image, which contains a
single object (Barron and Malik, 2014). This multi-scale
optimization-based method uses prior information
to produce the intrinsics. In another intrinsic image
decomposition study, an unsupervised deep learning
method, which trains on image pairs, is designed
to recover the reflectance and shading information of
a scene (Lettry et al., 2018). In the LR3M algorithm,
which is a low-light enhancement method, intrinsics
of an image are used to enhance the visual quality of
the input while estimating a piece-wise smooth illumi-
nation and a noise-suppressed reflectance of the scene
from a Retinex based model (Ren et al., 2020).
Apart from requiring a single input, these algo-
rithms are selected for the evaluations, since they also
have different characteristics, such as being based on
optimization techniques, relying on neural networks,
integrating intrinsic image decomposition into other
image processing applications, and considering lo-
cal spatial information, which increases the variety of
analyses carried out in this study.
3 EVALUATION METRICS
There are numerous intrinsic image decomposition al-
gorithms but there is only one widely used objective
error metric to benchmark the intrinsic image decom-
position methods, namely the local mean squared er-
ror (LMSE), which is a modified version of the mean
squared error (MSE) (Grosse et al., 2009). LMSE has
difficulties in reflecting the actual performance of a
method since the neighboring relationships of pixels
are not considered (Bonneel et al., 2017; Garces et al.,
2022). Therefore, the amount of available information
in intrinsics and the LMSE score have a tendency to
not coincide. Furthermore, when large regions having
constant reflectance are decomposed correctly, usu-
ally very low LMSE is observed regardless of the re-
maining parts of the image (Bonneel et al., 2017).
Apart from LMSE, the peak signal-to-noise ra-
tio (PSNR) and the structural similarity index (SSIM)
are also used for evaluation. PSNR calculates the
peak signal-to-noise ratio in decibels (dB) between
the ground truth and processed images (Gonzalez and
Woods, 2018). Since pixel-wise evaluations are conducted
in PSNR, the neighboring relationships of pixels
are neglected, and the scores do not necessarily represent
the available information. On the other hand,
SSIM takes into account the neighboring relationships
of pixels and is inspired by the top-down assumption
of the human visual system. SSIM investigates
the structural similarity between the reference
and processed images (Wang et al., 2004). The structure,
contrast, and luminance components of images
are regarded during the computation of the perceptual
quality score. The image is evaluated patch-wise in
SSIM, hence local spatial information is taken into account.
SSIM scores typically lie in the range [0, 1], where a
score closer to 1 refers to a better outcome. In order
to avoid problems related to viewing conditions, SSIM
was later extended to the multi-scale SSIM (MS-SSIM)
(Wang et al., 2003). SSIM can
be represented as follows (Wang et al., 2004):
$$\mathrm{SSIM} = \frac{(2\mu_{I_1}\mu_{I_2} + C_1)(2\sigma_{I_1 I_2} + C_2)}{(\mu_{I_1}^2 + \mu_{I_2}^2 + C_1)(\sigma_{I_1}^2 + \sigma_{I_2}^2 + C_2)} \tag{2}$$
where $I_1$ and $I_2$ are the ground truth and output, $\mu$, $\sigma$,
and $\sigma^2$ represent the mean, covariance, and variance,
respectively, while $C_1$ and $C_2$ are small constants.
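The following sketch shows how PSNR and the expression in Eq. (2) can be evaluated on a single grayscale patch with NumPy. It is an illustrative simplification: the full SSIM averages this expression over local windows, which is omitted here. The function names are hypothetical, and the constants $C_1 = (0.01L)^2$ and $C_2 = (0.03L)^2$ for a peak value $L$ are the defaults suggested by Wang et al. (2004), assumed here because the text only states that they are small.

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """PSNR in dB between two images scaled to [0, peak]."""
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

def ssim_patch(i1, i2, peak=1.0, k1=0.01, k2=0.03):
    """Eq. (2) evaluated once on a whole grayscale patch (no sliding window)."""
    c1, c2 = (k1 * peak) ** 2, (k2 * peak) ** 2   # assumed default constants
    mu1, mu2 = i1.mean(), i2.mean()
    var1, var2 = i1.var(), i2.var()
    cov = np.mean((i1 - mu1) * (i2 - mu2))
    numerator = (2 * mu1 * mu2 + c1) * (2 * cov + c2)
    denominator = (mu1 ** 2 + mu2 ** 2 + c1) * (var1 + var2 + c2)
    return numerator / denominator
```

Because PSNR depends only on the pixel-wise squared error, any two estimates with the same error energy receive the same PSNR, whereas the covariance term in `ssim_patch` also responds to how the error is structured.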
As pointed out in several studies (Gao et al.,
2010; Ding, 2018; Zhu et al., 2021), evaluation strategies
correlating with the human visual system
tend to produce more reliable scores, and the
effectiveness of using these strategies for evaluating
intrinsic image decomposition algorithms has already
been demonstrated (Chen and Koltun, 2013; Narihira et al.,
2015). Additionally, it is well-known that carrying out
computations in scale-space helps to avoid problems
arising due to unknown viewing distance and
display resolution. Therefore, in this paper, to analyze
distinct features of different intrinsics in a robust
manner an "ensemble of metrics" is proposed, which
utilizes different quality metrics in scale-space. In
the ensemble of metrics (EM) three different evalua-
tion strategies namely, SSIM, feature similarity index
(FSIM) (Zhang et al., 2011), and visual information
fidelity (VIF) (Sheikh and Bovik, 2006) are combined
to benchmark intrinsic image decomposition studies.
These metrics are selected, since they analyze fea-
tures, which are important to the outcomes of intrinsic
image decomposition algorithms, such as structure,
contrast, luminance, color, and the amount of infor-
mation coinciding between the ground truth and the
estimated intrinsic image.
VIF (Sheikh and Bovik, 2006) aims at measuring
how much of the information that can be extracted
from the reference image can also be derived from
the test image. The computation of VIF is carried
out in the wavelet domain by making use of a Gaussian
scale mixture 𝐶, which is a random field that can be
expressed as the product of two independent random
fields. VIF can be computed as follows:
$$\mathrm{VIF} = \frac{\sum_{k \in w} \mathcal{I}(C^{T,k}; F^{T,k} \mid s^{T,k})}{\sum_{k \in w} \mathcal{I}(C^{T,k}; E^{T,k} \mid s^{T,k})} \tag{3}$$
where $\mathcal{I}$ is the set of spatial locations for the random
field, $w$ represents the subbands of the image,
$C^{T,k}$ denotes $T$ elements of $C_k$, $F^{T,k}$ and $E^{T,k}$ denote
the $T$ elements of the test image and reference images
in one subband, respectively, and $s^{T,k}$ denotes the model
parameters of the associated image.
FSIM (Zhang et al., 2011) considers the local
structures and contrast information of the images.
FSIM is computed for grayscale images, but it has a
straightforward extension for RGB images. FSIM is
computed by using phase congruency (PC), which is
contrast invariant, and gradient magnitude (GM). The
PC component assumes that points having maximal
phase in the frequency domain correspond to perceiv-
able features, which correlates with the behavior of
the human visual system while detecting significant
features in images. Since the contrast information in-
fluences the human visual system during perception,
GM is included during the formation of FSIM to take
the contrast information of a scene into account. FSIM
can be computed as follows,
$$\mathrm{FSIM} = \frac{\sum_{x,y \in N} \left(F_{PC}(x,y)\, F_{GM}(x,y)\right) PC_{max}(x,y)}{\sum_{x,y \in N} PC_{max}(x,y)} \tag{4}$$
where $F_{PC}(x,y)$ and $F_{GM}(x,y)$ are the PC and GM
components of the image, respectively, $PC_{max}(x,y)$
represents the maximum PC value of the input images,
and $N$ is the number of pixels in the image. Like VIF,
FSIM also outputs scores in the range [0, 1], where
scores closer to 1 represent better results.
In order to form the ensemble of metrics, first of
all, the Gaussian and Laplacian pyramids of the input
and estimation are computed. The number of scales is
adaptively determined according to the image resolu-
tion. Both pyramids are utilized since they have dif-
ferent characteristics (Ebner et al., 2007). The Gaus-
sian pyramid contains the low-frequency components
of the image, thus most of the color information in the
input is preserved in each scale, whereas the Lapla-
cian pyramid behaves like a high-pass filter in which
the high-frequency elements, i.e. fine details, of the
images are maintained in every scale. The utilization
of both of these pyramids leads to the consideration
of distinct details at different scales, hence features
of images can be analyzed in a more robust manner.
Furthermore, analyzing the high- and low-frequency
components in an image separately allows us to eval-
uate the outcomes of algorithms with metrics that are
more suited to examine specific features appearing
more explicitly in one pyramid than the other.
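The pyramid construction described above can be sketched as follows for a grayscale image. This is a minimal NumPy sketch under stated assumptions, not the implementation used in this study: the 5-tap binomial filter, the nearest-neighbour upsampling, and the rule in `num_scales` (halve the resolution until the shorter side would fall below a hypothetical `min_size`) are all illustrative choices, since the text only states that the number of scales is adapted to the image resolution.

```python
import numpy as np

KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0  # 5-tap binomial low-pass

def blur(img):
    """Separable low-pass filtering with reflective borders."""
    h, w = img.shape
    padded = np.pad(img, 2, mode="reflect")
    rows = sum(wt * padded[:, k:k + w] for k, wt in enumerate(KERNEL))
    return sum(wt * rows[k:k + h, :] for k, wt in enumerate(KERNEL))

def downsample(img):
    """Smooth, then keep every second pixel."""
    return blur(img)[::2, ::2]

def upsample(img, shape):
    """Nearest-neighbour upsampling to `shape`, followed by smoothing."""
    up = np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)
    return blur(up[:shape[0], :shape[1]])

def num_scales(shape, min_size=8):
    """Assumed rule: halve until the shorter side would drop below min_size."""
    n, side = 1, min(shape)
    while side // 2 >= min_size:
        side //= 2
        n += 1
    return n

def gaussian_pyramid(img, levels):
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid

def laplacian_pyramid(img, levels):
    gauss = gaussian_pyramid(img, levels)
    # Each level stores the detail lost between two consecutive Gaussian levels.
    lap = [g - upsample(g_next, g.shape) for g, g_next in zip(gauss, gauss[1:])]
    lap.append(gauss[-1])  # coarsest level keeps the low-pass residual
    return lap
```

For a constant image, every Laplacian level except the residual is zero, which reflects the high-pass behaviour described above, while each Gaussian level retains the constant value.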
In the ensemble of metrics, SSIM, VIF, and FSIM
are computed at each scale in both pyramids. While
all metrics are utilized to evaluate the shading compo-
nent, only SSIM and the colored FSIM (FSIMc) are
considered for the reflectance element. VIF is dis-
carded for reflectance since it is only computed for
the luminance channel of images, i.e. the evaluation
of the reflectance and shading components results in
the same outcome. In the Gaussian pyramid, SSIM
is calculated by taking into account all of the image
features used in its standard computation, while in the
Laplacian pyramid, only the structure and contrast are
considered since the luminance component is irrele-
vant. On the other hand, all components of FSIM are
regarded in both of the pyramids, since they are sen-
sitive to the information in these pyramids. Lastly,
VIF is computed in both pyramids, since high- and
low-frequency components contain distinct informa-
tion. After the scores at every scale in each pyramid
are obtained, the scores of corresponding levels are
linearly combined for each metric individually as fol-
lows,
$$P^{R}_{M}(i) = \frac{G^{R}_{M}(i) + L^{R}_{M}(i)}{2} \tag{5}$$
$$P^{S}_{M}(i) = \frac{G^{S}_{M}(i) + L^{S}_{M}(i)}{2} \tag{6}$$
where $P$ is the average of the Gaussian ($G$) and
Laplacian ($L$) pyramids, $R$ and $S$ are the reflectance
and shading components, respectively, $i$ represents
the scale, $M \in \{SSIM, FSIMc\}$ for the reflectance, and
$M \in \{SSIM, VIF, FSIM\}$ for the shading.
Each 𝑃 contains evaluation scores at various
scales, hence these results have to be merged into one
overall score for each metric. To fuse the scores, in-
spiration is taken from the experiments of forming the
MS-SSIM metric (Wang et al., 2003). In the exper-
iments of Wang et al., which are based on human
judgments, it is noticed that the human visual sys-
tem gives different importance to the same error at
distinct scales. Even when each image in a pyramid
has the same error, the perceived quality changes in
each scale. From the results of Wang’s experiments,
it can be deduced that the assigned importance is ap-
proximately Gaussian. Therefore, in EM, the scores at
distinct scales are combined using a Gaussian-based
weighting strategy to assign a different weight to each
scale as follows,
$$P^{R}_{M} = \sum_{i} P^{R}_{M}(i)\, e^{-\frac{i^2}{2\sigma^2}} \tag{7}$$
$$P^{S}_{M} = \sum_{i} P^{S}_{M}(i)\, e^{-\frac{i^2}{2\sigma^2}} \tag{8}$$
where $\sigma$ depends on the number of scales and it is
computed as $\sigma = (i - 1)/5$.
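The two fusion steps, i.e. averaging the Gaussian and Laplacian pyramid scores per scale and then weighting the scales with a Gaussian function, can be sketched for a single metric as below. Several details are assumptions made only for this illustration: the scale index is taken to start at 0 for the finest level, the $i$ in the expression for $\sigma$ is read as the number of scales, and the weights are normalized to sum to 1 so that the fused score stays in the range of the input scores. The function name `fuse_scales` is hypothetical.

```python
import numpy as np

def fuse_scales(gauss_scores, lap_scores, normalize=True):
    """Fuse per-scale scores of one metric into a single value."""
    g = np.asarray(gauss_scores, dtype=float)
    l = np.asarray(lap_scores, dtype=float)
    per_scale = (g + l) / 2.0                    # average of both pyramids

    n = per_scale.size
    i = np.arange(n)                             # assumed indexing: 0 = finest scale
    sigma = max((n - 1) / 5.0, 1e-6)             # assumed reading of sigma; guard for n = 1
    weights = np.exp(-i ** 2 / (2.0 * sigma ** 2))
    if normalize:                                # assumption, see lead-in
        weights = weights / weights.sum()
    return float(np.sum(weights * per_scale))
```

Under these assumptions the finest scale receives the largest weight, and the contribution of coarser scales decays quickly as the index grows.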
Then, the scores for each intrinsic image are lin-
early combined as in the following,
$$EM^{R} = \frac{1}{2} \sum_{j} P^{R}_{M(j)} \tag{9}$$
$$EM^{S} = \frac{1}{3} \sum_{j} P^{S}_{M(j)} \tag{10}$$
where subscript $j$ represents the $j^{th}$ element of $M$
for the respective intrinsic, and $EM^{R}$ and $EM^{S}$ are the ensemble of
metrics scores for the reflectance and shading components,
respectively. Lastly, the scores of reflectance
and shading are averaged to obtain a global EM score.
It should be noted here that the last level of the pyramids
is ignored during FSIM computations in EM
since FSIM also uses the scale-space during score
computation, which causes ambiguities in the smallest
level of the pyramids in EM.
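The final combination step can be sketched as follows, assuming that the per-metric scores have already been fused over the scales. The function name and the dictionary-based interface are hypothetical and serve only to make the averaging explicit; the numbers in the usage example are made up.

```python
def ensemble_of_metrics(reflectance_scores, shading_scores):
    """Average fused per-metric scores into EM^R, EM^S and the global EM.

    reflectance_scores: e.g. {"SSIM": ..., "FSIMc": ...}
    shading_scores:     e.g. {"SSIM": ..., "VIF": ..., "FSIM": ...}
    """
    em_r = sum(reflectance_scores.values()) / len(reflectance_scores)
    em_s = sum(shading_scores.values()) / len(shading_scores)
    em_global = (em_r + em_s) / 2.0   # reflectance and shading weighted equally
    return em_r, em_s, em_global

# Usage with made-up scores:
em_r, em_s, em = ensemble_of_metrics(
    {"SSIM": 0.8, "FSIMc": 0.9},
    {"SSIM": 0.6, "VIF": 0.3, "FSIM": 0.9},
)
```

Since both intrinsics contribute equally to the global score, a method cannot obtain a high EM by estimating only one of them well.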
As mentioned in Sec. 1, an evaluation strategy
focusing on a single intrinsic image provides an ef-
fective analysis for tasks, where a particular intrin-
sic is of interest. Therefore, alongside EM, another
metric, namely imperceptible Δ𝐸 score, is introduced
in this study. This metric is considered only for the
evaluation of the reflectance component since Δ𝐸
(CIEDE2000) (Luo et al., 2001; Sharma et al., 2005)
focuses on the color difference of input images. Δ𝐸
is computed in CIELAB color space by analyzing
the lightness, chroma, and hue components. While
Δ𝐸 scores less than 1 are imperceptible, a score
in the range [1,4) may also be unnoticeable to ob-
servers (Ebner, 2007). Based on these findings, the
conventional Δ𝐸 metric is modified. Since color in-
formation is a low-frequency component of images,
the Δ𝐸 score is computed at each scale of the Gaus-
sian pyramid. Then, the number of pixels having a Δ𝐸
score in the range [0, 4) is counted individually for ev-
ery level. Subsequently, at each level, the number of
pixels having an unnoticeable Δ𝐸 score is divided by
the total number of pixels in the corresponding scale.
Afterwards, all these ratios are weighted with a Gaus-
sian function as in Eqn. 8 and summed to obtain the
imperceptible Δ𝐸 score. As a result, a score closer to
1 indicates that the estimated reflectance image is ap-
proximating the ground truth, while scores closer to 0
show that significant observable differences between
the ground truth and estimation are present. While the
imperceptible Δ𝐸 metric is computed in scale-space
due to its aforementioned advantages, it can also be
calculated directly in the original scale of images. It
is worth mentioning here that the imperceptible Δ𝐸
metric can also be useful for color constancy studies,
where the standard Δ𝐸 metric is already being used
as an evaluation strategy (Ebner, 2007; Ulucan et al.,
2022b).
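The counting and weighting steps of the imperceptible Δ𝐸 score can be sketched as below. The sketch assumes that the per-pixel Δ𝐸 (CIEDE2000) maps between the ground truth and the estimated reflectance have already been computed at each level of the Gaussian pyramid by some external routine. As further assumptions, the scale index starts at 0, $\sigma$ is derived from the number of scales, and the Gaussian weights are normalized so that the score lies in [0, 1]. The function name is hypothetical.

```python
import numpy as np

def imperceptible_delta_e(delta_e_maps, threshold=4.0):
    """Weighted fraction of pixels whose Delta E lies in [0, threshold).

    delta_e_maps: list of 2D arrays, one precomputed Delta E map per
                  Gaussian pyramid level (finest level first, by assumption).
    """
    # Ratio of pixels with an unnoticeable colour difference at each level.
    ratios = np.array([np.mean(np.asarray(m) < threshold) for m in delta_e_maps])

    n = ratios.size
    i = np.arange(n)
    sigma = max((n - 1) / 5.0, 1e-6)          # assumed reading of sigma
    weights = np.exp(-i ** 2 / (2.0 * sigma ** 2))
    weights /= weights.sum()                  # assumed normalization to [0, 1]
    return float(np.sum(weights * ratios))
```

Passing a single map reduces the function to the variant computed directly at the original scale of the images.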
4 EXPERIMENTS
In order to benchmark intrinsic image decomposition
algorithms, a subset of our recent
dataset IID-NORD is created in the open-source
3D graphics toolkit called OpenSceneGraph
(www.openscenegraph.com). The same procedure
as for IID-NORD is followed, which is briefly
explained in the following. The subset is called RS-NORD
and it contains 1936 sRGB images along with
their corresponding reflectance and shading ground
truths. The images have a resolution of 1600 × 965
pixels. Each scene in RS-NORD contains a room
with various objects. Unlike in IID-NORD, the
textures are either created synthetically or captured
using a mobile phone camera. The layout of the rooms
and viewing angles are changed during rendering. A
single point light source illuminating the scene with
different lights is placed into the scene, and it is
repositioned for each rendering. Consequently,
dynamic shadows (Wimmer et al., 2004) and distinct
illumination conditions are obtained for the scenes.
Also, the ambient light is turned on to make 20% of
an object's color visible.
In this section the algorithms explained in Sec. 2,
namely Retinex (Land, 1964; Grosse et al., 2009),
Zhao (Zhao et al., 2012), Shen (Shen et al., 2011),
SIRFS (Barron and Malik, 2014), Lettry (Lettry et al.,
2018), and Ren (Ren et al., 2020) are benchmarked
both on the MIT Intrinsic Images and the RS-NORD
datasets by using an Intel i7 CPU @ 3.5 GHz Quad-
Core 16 GB RAM machine. The implementations of
the methods are taken from the official webpages of
the authors. No optimization is carried out on the al-
gorithms. Moreover, all the images are decomposed
by a baseline algorithm (Bonneel et al., 2017), which
is a simple approach that decomposes images with-
out considering any important aspect of the intrinsic
image decomposition problem. Any algorithm developed
specifically for intrinsic image decomposition is
expected to outperform the baseline method. In this
approach, the chromaticity image ($I_c$) is assumed to be
the reflectance, and the square root of the direct average
of channels ($Y$), i.e. grayscale illumination, is
considered as the shading. $I_c$ and $Y$ can be computed
as follows:
$$I_c = \left(\frac{R}{R+G+B}, \frac{G}{R+G+B}, \frac{B}{R+G+B}\right) \tag{11}$$
$$Y = \frac{R+G+B}{3} \tag{12}$$
where $R$, $G$ and $B$ are the color channels of the image.
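A minimal sketch of this baseline is given below. It follows the description in the text, i.e. the chromaticity image is returned as the reflectance and the square root of the channel average as the shading. The function name and the small `eps` added to avoid division by zero in black pixels are assumptions of this sketch.

```python
import numpy as np

def baseline_decomposition(rgb, eps=1e-8):
    """Baseline: chromaticity as reflectance, sqrt of channel mean as shading.

    rgb: array of shape (H, W, 3) with non-negative values.
    """
    rgb = np.asarray(rgb, dtype=float)
    total = rgb.sum(axis=-1, keepdims=True)      # R + G + B per pixel
    reflectance = rgb / (total + eps)            # chromaticity image I_c
    y = total[..., 0] / 3.0                      # direct average of channels Y
    shading = np.sqrt(y)                         # grayscale shading estimate
    return reflectance, shading
```

Each chromaticity pixel sums to approximately 1 across the channels, so the reflectance estimate discards all intensity information by construction.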
In order to evaluate the algorithms, 9 different metrics,
namely LMSE, PSNR, SSIM, MS-SSIM, FSIM,
FSIMc, VIF, EM, and imperceptible Δ𝐸 score ($\Delta E_i$),
are used (Table 1). Note here that SIRFS is not evaluated
on RS-NORD, since as mentioned in Sec. 2 this
algorithm only takes input images with single objects.
Table 1: The statistical outcomes of algorithms. Best scores
for each metric are marked with an asterisk (*). The last column provides
the average execution time in seconds, where the run time
of Ren is not provided since its code is binary and does not
only output the intrinsics.

Method   LMSE     PSNR     SSIM     MS-SSIM  FSIM     FSIMc    VIF      EM       ΔE_i     Avg. time
MIT
Baseline 0.078    10.924   0.718    0.697    0.786    0.859    0.223    0.618    0.556    0.031*
Retinex  0.091    11.128   0.726    0.720    0.795    0.882    0.179    0.631    0.580    3.106
Zhao     0.036*   12.156   0.785    0.780    0.815    0.908    0.363    0.692    0.626    3.671
Shen     0.062    13.753   0.698    0.756    0.725    0.864    0.298    0.709    0.639    38.097
SIRFS    0.042    13.812*  0.803*   0.797*   0.833*   0.911*   0.377*   0.724*   0.714*   171.018
Lettry   0.056    12.275   0.527    0.722    0.706    0.873    0.122    0.639    0.581    13.833
Ren      0.079    9.233    0.703    0.713    0.816    0.869    0.176    0.605    0.528    -
RS-NORD
Baseline 0.093    10.030   0.599    0.636    0.776    0.611    0.391    0.622    0.014    0.242
Retinex  0.101    10.581   0.658*   0.688    0.786    0.677    0.386    0.652*   0.061    53.634
Zhao     0.102    6.257    0.305    0.633    0.772    0.606    0.117    0.441    0.039    40.219
Shen     0.068*   11.457   0.575    0.612    0.708    0.681    0.313    0.612    0.079    390.300
Lettry   0.097    12.093*  0.621    0.710*   0.766    0.702*   0.235    0.643    0.018    220.351
Ren      0.094    9.562    0.505    0.670    0.738    0.653    0.121    0.585    0.024    -
As can be seen from the results on the MIT Intrinsic
Images dataset (Table 1), Zhao has the lowest
LMSE, while SIRFS has the best scores in all the other
metrics. This is also in accordance with the visual
results in Fig. 1. As mentioned above, LMSE neglects
local spatial cues and may not coincide with the perceptually
available information that is taken into account
in metrics such as FSIM and FSIMc, which indicates
that LMSE may not reflect the actual performance
of algorithms. As demonstrated in Fig. 1, Zhao
faces an obvious challenge in eliminating shadows and
specularity from the reflectance image, while Shen is
mostly able to handle these features, which also coincides
with the $\Delta E_i$ and EM scores. As discussed
in Sec. 3, evaluation strategies correlating with the
human visual system generally output more accurate
results. Hence, investigating the visual outcomes of
the intrinsic image decomposition methods can help
to understand what type of metrics provide a more reliable
score. It can be argued that the proposed metrics
are able to output scores that correlate with the actual
available information in the intrinsics.
The intrinsic image decomposition algorithms
face a challenge when benchmarked on a more complex
dataset than the MIT Intrinsic Images dataset.
Generally, for each metric, a different method produces
the best score in RS-NORD. In terms of PSNR,
MS-SSIM, and FSIMc, Lettry outperforms the other
intrinsic image decomposition methods. However, the
visual results demonstrate that Lettry outputs color-distorted
intrinsics, which reduces its $\Delta E_i$ significantly
and affects its EM score. According to the
visual outcomes in the RS-NORD dataset, overall,
Retinex produces the closest intrinsics to the ground
truth information, which can also be observed from
its EM score. On the other hand, LMSE, which is
designed for intrinsic image decomposition studies,
indicates that Shen performs the best decomposition
among others. However, as seen in Fig. 1, while
Shen greatly preserves the color information in the
reflectance element, which coincides with its $\Delta E_i$ score,
it faces difficulty in extracting the shading information,
which is reflected in its EM score. While ambiguities
are present in both intrinsics, the reason LMSE
highlights Shen as the best performing algorithm can
be explained by the fact that the preservation of large
areas of constant reflectance tends to result in low
LMSE scores (Bonneel et al., 2017). As can be deduced
from these observations, during the evaluation
of intrinsic image decomposition algorithms it is important
to consider the different characteristics of each
intrinsic image and weigh the outcomes in a balanced
manner in order to output a reliable statistical score.
Additional results of algorithms together with their
scores are provided in Fig. 2. The proposed metrics
are able to output results coinciding with the available
information in the intrinsics. In cases where the $\Delta E_i$
and EM scores do not coincide, it can be deduced that
while one of the intrinsics is successfully estimated,
there is an issue in the estimation of the other one.
When intrinsic image decomposition methods are
tested on a complex dataset both the statistical and
visual results show that a single approach in solv-
ing the ill-posed intrinsic image decomposition prob-
lem is not sufficient to handle image features such
as strong shadow casts, highlights, and specularities,
since each method is more responsive to different fea-
tures. Therefore, an ensemble of intrinsic image de-
composition methods, which consists of algorithms
that can handle different features might be beneficial.
5 CONCLUSION
Intrinsic image decomposition is an extensively studied
field, which holds many challenges. The under-constrained
nature of intrinsic image decomposition is
the main difficulty in this field. Even though simplifications
are made during intrinsic image computations,
the ill-posed structure of the problem remains. Other
challenges in intrinsic image decomposition are the
lack of a common benchmark and of an evaluation metric
that allows us to analyze the intrinsic image decomposition
algorithms efficiently. In this study, two new
evaluation techniques, namely ensemble of metrics
and imperceptible Δ𝐸, are proposed to present pos-
sible solutions and new perspectives to the field of in-
trinsic image decomposition. Moreover, it is aimed to
provide a guide to future studies by giving an overview
of the field and addressing the problems it holds.
IMPROVE 2023 - 3rd International Conference on Image Processing and Vision Engineering
62
Input and Ground Truths SIRFS Zhao Shen
Input and Ground Truths Retinex Lettry Shen
Figure 1: Visual comparison of algorithms. First two rows present scenes from the MIT Intrinsic Images dataset, and last two
rows contain images from the RS-NORD dataset.
Figure 2: Comparison of the algorithms on both datasets (the first three rows are from RS-NORD, and the last three rows are
from the MIT Intrinsic Images dataset). For each method, first the Δ𝐸_𝑖 score and then the EM score is given in parentheses.
Each row shows the input, the ground truths, and the outputs of two methods:
Row 1: Retinex (0.004, 0.689), Ren (0.268, 0.679)
Row 2: Shen (0.299, 0.666), Lettry (0.004, 0.546)
Row 3: Baseline (0.005, 0.518), Zhao (0.008, 0.435)
Row 4: Ren (0.576, 0.731), Zhao (0.565, 0.706)
Row 5: Retinex (0.370, 0.714), Baseline (0.328, 0.648)
Row 6: SIRFS (0.900, 0.787), Zhao (0.743, 0.709)
REFERENCES
Barron, J. T. and Malik, J. (2014). Shape, illumination, and
reflectance from shading. IEEE Trans. Pattern Anal.
Mach. Intell., 37:1670–1687.
Barrow, H., Tenenbaum, J., Hanson, A., and Riseman, E.
(1978). Recovering intrinsic scene characteristics.
Comput. Vision Syst., 2:2.
Bell, S., Bala, K., and Snavely, N. (2014). Intrinsic images
in the wild. ACM Trans. Graph., 33:1–12.
Bonneel, N., Kovacs, B., Paris, S., and Bala, K. (2017).
Intrinsic decompositions for image editing. Comput.
Graph. Forum, 36:593–609.
Chen, Q. and Koltun, V. (2013). A simple model for intrinsic
image decomposition with depth cues. In ICCV, pages
241–248, Sydney, NSW, Australia. IEEE.
Ding, Y. (2018). Image quality assessment based on human
visual system properties. Vis. Qual. Assessment Natu-
ral Med. Image, 37:63–106.
Ebner, M. (2007). Color Constancy, 1st ed. Wiley Publish-
ing, ISBN: 0470058299.
Ebner, M., Tischler, G., and Albert, J. (2007). Integrating
color constancy into JPEG2000. IEEE Trans. Image
Process., 16:2697–2706.
Gao, X., Lu, W., Tao, D., and Li, X. (2010). Image quality
assessment and human visual system. In Proc. Vis.
Commun. Image Process., pages 316–325, Huang-
shan, China. SPIE.
Garces, E., Rodriguez-Pardo, C., Casas, D., and
Lopez-Moreno, J. (2022). A survey on intrinsic images:
Delving deep into Lambert and beyond. Int. J. Comput.
Vision, 130:836–868.
Gonzalez, R. C. and Woods, R. E. (2018). Digital Image
Processing, 3rd ed. Pearson Prentice Hall.
Grosse, R., Johnson, M. K., Adelson, E. H., and Freeman,
W. T. (2009). Ground truth dataset and baseline eval-
uations for intrinsic image algorithms. In ICCV, pages
2335–2342, Kyoto, Japan. IEEE.
Land, E. H. (1964). The retinex. Amer. Scientist, 52:247–
264.
Lettry, L., Vanhoey, K., and Van Gool, L. (2018). Unsuper-
vised deep single-image intrinsic decomposition us-
ing illumination-varying image sequences. Comput.
Graph. Forum, 37:409–419.
Li, Z., Yu, T.-W., Sang, S., Wang, S., Song, M., Liu, Y., Yeh,
Y.-Y., Zhu, R., Gundavarapu, N., Shi, J., Bi, S., Yu, H.-
X., Xu, Z., Sunkavalli, K., Hasan, M., Ramamoorthi,
R., and Chandraker, M. (2021). OpenRooms: An open
framework for photorealistic indoor scene datasets. In
CVPR, pages 7190–7199, Nashville, TN, USA. IEEE.
Luo, M. R., Cui, G., and Rigg, B. (2001). The devel-
opment of the CIE 2000 colour-difference formula:
CIEDE2000. Color Res. Appl., 26:340–350.
Narihira, T., Maire, M., and Yu, S. X. (2015). Direct in-
trinsics: Learning albedo-shading decomposition by
convolutional regression. In ICCV, pages 2992–3000,
Santiago, Chile. IEEE.
Ren, X., Yang, W., Cheng, W.-H., and Liu, J. (2020). LR3M:
Robust low-light enhancement via low-rank regular-
ized retinex model. IEEE Trans. Image Process.,
29:5862–5876.
Roberts, M., Ramapuram, J., Ranjan, A., Kumar, A.,
Bautista, M. A., Paczan, N., Webb, R., and Susskind,
J. M. (2021). Hypersim: A photorealistic synthetic
dataset for holistic indoor scene understanding. In
ICCV, pages 10912–10922, Montreal, QC, Canada.
IEEE.
Sharma, G., Wu, W., and Dalal, E. N. (2005). The
CIEDE2000 color-difference formula: Implementa-
tion notes, supplementary test data, and mathematical
observations. Color Res. Appl., 30:21–30.
Sheikh, H. R. and Bovik, A. C. (2006). Image informa-
tion and visual quality. IEEE Trans. Image Process.,
15:430–444.
Shen, J., Yang, X., Jia, Y., and Li, X. (2011). Intrinsic im-
ages using optimization. In CVPR, pages 3481–3487,
Colorado Springs, CO, USA. IEEE.
Shi, J., Dong, Y., Su, H., and Yu, S. X. (2017). Learning
non-Lambertian object intrinsics across ShapeNet
categories. In CVPR, pages 1685–1694, Honolulu, HI,
USA. IEEE.
Ulucan, D., Ulucan, O., and Ebner, M. (2022a). IID-
NORD: A comprehensive intrinsic image decompo-
sition dataset. In ICIP, pages 2831–2835, Bordeaux,
France. IEEE.
Ulucan, O., Ulucan, D., and Ebner, M. (2022b). BIO-CC:
Biologically inspired color constancy. In BMVC, Lon-
don, UK. BMVA Press.
Ulucan, O., Ulucan, D., and Ebner, M. (2022c). Color con-
stancy beyond standard illuminants. In ICIP, pages
2826–2830, Bordeaux, France. IEEE.
Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P.
(2004). Image quality assessment: from error visibil-
ity to structural similarity. IEEE Trans. Image Pro-
cess., 13:600–612.
Wang, Z., Simoncelli, E. P., and Bovik, A. C. (2003). Mul-
tiscale structural similarity for image quality assess-
ment. In Proc. Asilomar Conf. Signals Syst. Comput.,
pages 1398–1402, Pacific Grove, CA, USA. IEEE.
Wimmer, M., Scherzer, D., and Purgathofer, W. (2004).
Light space perspective shadow maps. Rendering
Techn., 2004:143–151.
Zeki, S. (1993). A Vision of the Brain. Blackwell Science,
ISBN: 0632030545.
Zhang, L., Zhang, L., Mou, X., and Zhang, D. (2011).
FSIM: A feature similarity index for image quality
assessment. IEEE Trans. Image Process., 20:2378–
2386.
Zhao, Q., Tan, P., Dai, Q., Shen, L., Wu, E., and Lin, S.
(2012). A closed-form solution to retinex with non-
local texture constraints. IEEE Trans. Pattern Anal.
Mach. Intell., 34:1437–1444.
Zhu, W.-H., Sun, W., Min, X.-K., Zhai, G.-T., and Yang, X.-
K. (2021). Structured computational modeling of hu-
man visual system for no-reference image quality as-
sessment. Int. J. Automat. Comput., 18:204–218.