Joint Brightness and Tone Stabilization of Capsule Endoscopy Videos

Sibren van Vliet, André Sobiecki (1,2) and Alexandru C. Telea

Johann Bernoulli Institute for Mathematics and Computer Science, University of Groningen, The Netherlands

ZiuZ Visual Intelligence, Gorredijk, The Netherlands

Keywords:

Capsule Endoscopy, Video Processing, Color Stabilization.

Abstract:

Pill endoscopy cameras generate hours-long videos that need to be manually inspected by medical specialists. Technical limitations of pill cameras often create large and uninformative color variations between neighboring frames, which make exploration more difficult. To increase the exploration efficiency, we propose an automatic method for joint intensity and hue (tone) stabilization that reduces such artifacts. Our method works in real time, has no free parameters, and is simple to implement. We thoroughly tested our method on several real-world videos and quantitatively and qualitatively assessed its results and optimal parameter values by both image quality metrics and user studies. Both types of comparisons strongly support the effectiveness, ease-of-use, and added value claims for our new method.

1 INTRODUCTION

Endoscopy of the gastrointestinal tract has long been used to screen, diagnose, locate, or treat conditions such as gastrointestinal bleeding, inflammatory bowel disease, celiac disease, polyps, and certain cancer types (Classen and Phillip, 1984). This is traditionally done by using a small camera at the end of a thin flexible tube inserted into the mouth and guided through the tract. However, this method does not reach the many tight bends of the intestines.

Figure 1: Sample frames from endoscopy pill camera footage illustrating intensity (top row: frames 3515, 4765, 6900, 8096, 32713) and hue (bottom row: frames 2654, 2659, 2689, 2694, 2696) problems.

A recent disruptive technology is the pill camera, a small capsule holding a camera and lights (Hale et al., 2014). After being swallowed, the camera records 8 to 12 hours of video. While cheaper, less intrusive, and better at covering the full gastrointestinal tract, pill cameras have several issues. Figure 1 shows sample frames from a video recorded by the MiroCam pill camera (Hale et al., 2014) at 3 frames per second at 320 × 320 pixel resolution. Each frame contains a circular picture surrounded by black borders, with the frame number in white. In the top-row frames, areas close to the camera are very bright, and far-away areas are completely dark, due to the distance from the capsule's lights. Consider frame 4765. All tissue here has in reality the same color, but it is not imaged as such. As the capsule moves onwards from frame 4765, the moderately lit area in the center of frame 4765 becomes too bright, as the light approaches it. Also, the too-dark area at the top left of frame 4765 becomes moderately lit due to the camera motion. All in all, the same tissue area is shown at differing intensities over time. Figure 1 (bottom row) shows a second type of problem: all images are of the same tissue type, so they should have the same color tone (hue). Yet, as the camera automatically adjusts its color balance, tone fluctuates over time. For instance, frame 2654 has a pink tone; frame 2659 has a more orange tone; frame 2689 appears pink again; frame 2694 appears orange; and frame 2696 shifts to pink again.

Vliet, S., Sobiecki, A. and Telea, A. Joint Brightness and Tone Stabilization of Capsule Endoscopy Videos. DOI: 10.5220/0006552401010112. In Proceedings of the 13th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2018) - Volume 4: VISAPP, pages 101-112. ISBN: 978-989-758-290-5

Medical practitioners viewing endoscopy videos are distracted by sudden tone and/or intensity fluctuations, which do not contain any information. Color correction (also called stabilization) methods are an effective way to alleviate such problems. However, such methods should not introduce any artifacts which could mislead the physician. From discussions with gastroenterologists, we found two key requirements for a stabilization method: (i) the relative intensity of pixels in the corrected and original image should be the same (if a pixel a is brighter than another pixel b in the input image I, then a should also be brighter than b in the corrected image I′); and (ii) hue changes should be small enough so that a tissue type can be reliably recognized in stabilized images. While many generic color correction algorithms exist (Anbarjafari, 2014; Vig et al., 2016; Purushothaman et al., 2016; Gautam and Tiwari, 2015; González et al., 2016; Moradi et al., 2015), few have been developed with the specific constraints of endoscopy videos: low resolution, poor lighting of large image areas, relatively low frame rate, rapid variation of the light direction, real-time operation, and the avoidance of misleading artifacts in the corrected video. Moreover, such algorithms have various parameters which influence their results. We are not aware of any studies showing how to find optimal parameter values that smooth out intensity and tone changes but do not create significant artifacts.
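Requirement (i) can be checked mechanically for any candidate correction. The following Python sketch is an illustration only: it assumes the original and corrected intensities are given as two flat lists in the same pixel order, and the function name is hypothetical.

```python
def preserves_relative_intensity(original, corrected):
    """Return True if every pixel that is brighter than another pixel in
    `original` is also strictly brighter than it in `corrected`."""
    # Sort pixel pairs by original intensity (ties ordered by corrected value).
    pairs = sorted(zip(original, corrected))
    for (o1, c1), (o2, c2) in zip(pairs, pairs[1:]):
        # Only pairs with strictly increasing original intensity matter;
        # checking neighbors in sorted order covers all pairs by transitivity.
        if o1 < o2 and not c1 < c2:
            return False
    return True
```

Any strictly increasing transfer function applied per pixel passes this check, whereas a correction that swaps the brightness order of two pixels fails it.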

In this paper we attack the problem of joint intensity-and-tone stabilization in endoscopy videos. We analyze a large set of existing intensity-and-tone stabilization techniques against the video endoscopy constraints (Sec. 2). We select the best candidate, which we next enhance to optimally meet all these constraints (Secs. 3, 4). We evaluate our enhanced algorithm quantitatively (by image similarity metrics) and qualitatively (by an extensive user study), on a set of endoscopy videos showing a wide variation of imaged tissues and lighting conditions (Sec. 5). The evaluation shows that our improved algorithm surpasses the best-so-far algorithm we could find, by performing joint intensity and tone stabilization, being parameter free, guaranteeing good image quality, and working at the same speed as the pill camera. Finally, we conclude with directions for future work (Sec. 6).

2 RELATED WORK

Color correction has a long history in image and video processing (Gijsenij et al., 2011). Early methods include greyscale histogram equalization (GHE) (Kim and Yang, 2006) and dynamic histogram equalization (DHE) (Sun et al., 2005). Few methods were designed for, or tested on, endoscopy videos. Hence, besides considering endoscopy-specific methods, it is useful to study if more generic methods can be used, with suitable modifications, for our problem. We discuss below ten methods which (partially) target our intensity and hue stabilization goal, and are either well-known in image processing or else are designed to handle endoscopy videos. We assess these methods by rating them on a Likert scale (5=very good, 4=good, 3=average, 2=poor, 1=very poor) against the following requirements:

• Validation measures how well the claims of a method are defended by results shown in the respective paper. Methods showing stronger validation are more interesting candidates to adapt to our endoscopy use-case.

• Reproducibility measures how easy it is to (re)implement a method and obtain the results described in that paper. This is essential: without reproducibility, we cannot validate and/or extend a given method.

• Complexity measures the computational complexity of a method for a video of n frames of w × h pixels. Ideally, we want a method of (near) linear complexity in video size, so that we can achieve interactive exploration.

• Usability tells how easily a non-technical user can run the method. It is measured by the number and intuitiveness of the exposed parameters. A method with many parameters which are not intuitive or easy to set is not very usable. This is a critical requirement for an application that aims to decrease the workload for a medical specialist.

(Anbarjafari, 2014) proposed an iterative nth-root and nth-power color equalization for single generic images. The intensity channel of an image in HSI space is passed through a non-linear transfer function $f(x) = x^{\ln(0.5)/\ln(\bar{x})}$, where $\bar{x}$ is the image's mean intensity. The operation is repeated until the final image achieves a mean 'goal' intensity equal to γ, typically set to γ = 0.5. The method is good at lighting very dark image areas and darkening too-bright areas. However, it does not address our problem of tone stabilization in videos.
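The iteration described above can be sketched in a few lines of Python. This is a minimal illustration and not Anbarjafari's reference code: it assumes intensities normalized to [0, 1], and it assumes that a goal mean γ other than 0.5 is handled by replacing ln(0.5) with ln(γ) in the exponent. The function name, tolerance, and iteration cap are hypothetical.

```python
import math

def equalize_mean(values, goal=0.5, tol=1e-3, max_iter=50):
    """Repeatedly apply f(x) = x ** (ln(goal) / ln(mean)) to intensities in
    [0, 1] until their mean is within `tol` of `goal`."""
    out = list(values)
    for _ in range(max_iter):
        mean = sum(out) / len(out)
        # Stop when the goal is reached, or when the exponent is undefined
        # (mean of exactly 0 or 1).
        if abs(mean - goal) < tol or not 0.0 < mean < 1.0:
            break
        exponent = math.log(goal) / math.log(mean)  # assumption: ln(goal) generalizes ln(0.5)
        out = [x ** exponent for x in out]
    return out
```

Because the exponent is always positive, each pass is a strictly increasing map on [0, 1], so the relative brightness order of pixels is kept.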

(Vig et al., 2016) equalize colors in single images by increasing the intensity of dark areas while keeping bright areas unchanged, akin to overexposing. Not darkening very bright areas is a limitation in our context. Also, this technique performs contrast enhancement; this can create artifacts in endoscopy images, which typically contain only low-contrast tissue.

(Purushothaman et al., 2016) propose a differential histogram equalization method for color images which increases the contrast of color images so as to make the color information more visible to the human eye. However, as a result, brightly lit areas may become even brighter, losing potentially valuable information in endoscopy imagery.

(Gautam and Tiwari, 2015) propose yet another histogram-equalization-based method for single images which increases contrast in dimly lit areas while not brightening properly lit areas. However, too-bright areas are not darkened, which conflicts with our intensity equalization goal.

VISAPP 2018 - International Conference on Computer Vision Theory and Applications

(González et al., 2016) propose an improvement of the earlier luminance Multi-Scale Retinex method (Funt et al., 1997) that targets hues. The method is very powerful at brightening dark areas and thus revealing rich color information. However, already well-lit areas may become too bright.

(Moradi et al., 2015) propose a method specifically targeted at endoscopy images which increases contrast and removes noise. However, intensity normalization is not specifically addressed. Also, the method does not specifically handle tone stabilization.

(Vazquez-Corral and Bertalmio, 2014) propose a so-called video tone stabilization method which equalizes a set of images taken from several cameras or from a single camera where white balance and/or exposure change over time. The method works by making all input images more similar to a so-called reference image. It works in both the hue and intensity channels, both of which are important for our context. However, an open challenge is how to automatically select a single reference frame.

(Wang et al., 2014) propose yet another video tone stabilization method, based on smoothing differences between neighbor frames, much like a running average through time, applied to the trajectory of the color state in color space. A parameter allows turning the smoothing off to keep large temporal tone differences which can encode important information.

(Farbman and Lischinski, 2011) also propose a tone stabilization method for videos, based on the same reference frame idea as (Vazquez-Corral and Bertalmio, 2014). While the results of this method are impressive, a major drawback is that it appears to be closed-source and patented, which makes its replication and application hard at best.

(Bassiou and Kotropoulos, 2007) present a single-image method based on histogram equalization. The method uses multi-level smoothing to correct images in HSI space, using the probability density functions of the saturation and intensity components while keeping hue unchanged. The method can equalize intensity very well. However, it does not directly address the problem of tone stabilization.

Table 1 summarizes our survey. The method of (Anbarjafari, 2014) (referred to next as 'Anbarjafari') gets the best overall rating, with the methods of (Vazquez-Corral and Bertalmio, 2014) and (Bassiou and Kotropoulos, 2007) coming next. As such, we considered extending these three methods for our goal. However, replicating the algorithms in (Vazquez-Corral and Bertalmio, 2014) and (Bassiou and Kotropoulos, 2007) did not succeed in producing the same results as in the respective papers, as several crucial details were omitted in the papers. As such, we settled on extending the method of (Anbarjafari, 2014) to suit our goals, as described next.

3 PROPOSED METHOD

As explained in Sec. 2, the Anbarjafari method brightens dark areas and darkens bright areas in single images. However, we want to equalize intensity and smooth out hue fluctuations over time. For this, we extend the Anbarjafari method as follows.

We smooth out fluctuations in an image channel over time by detecting large variances between the channel's histograms (computed over all input image pixels) of all frames within a time window, and next changing the pixel values so that the histogram is suitably compressed. By compressing the histogram, differences between pixel values are made smaller. When applied to all frames within a time window, the compression rate should progress gradually, in order to smooth out sudden differences. This technique can be applied to any image channel in any color space, e.g., RGB or HSI. As discussed next in Sec. 5, we will apply our technique on both the intensity and saturation channels of an HSI-space image, and combine it with the original Anbarjafari method, which we will also apply on both these channels. The hue channel is left untouched, as changing it easily yields undesired artifacts.

Figure 2: Five successive video frames (t−2, t−1, t, t+1, t+2) in which tone fluctuation occurs (from orange to pink). Below each frame, a histogram of saturation values is shown (horizontal axis: saturation; vertical axis: number of pixels). Summing these histograms results in a cumulative histogram.

Table 1: Brightness and/or tone stabilization methods reviewed in this work.

| Method | Validation | Reproducibility | Complexity | Usability |
|---|---|---|---|---|
| (Anbarjafari, 2014) | (4) Very good results for two test-sets | (4) MATLAB code provided | (4) O(whnx) with x ≈ | (4) A single intuitive parameter to set (goal mean) |
| (Vig et al., 2016) | (2) Good results for two test-sets, but only for brightening dark areas | (2) No code provided, reproducing is difficult | (4) O(whn) | (2) Four not very intuitive parameters |
| (Purushothaman et al., 2016) | (3) Good results on two test-sets, but mainly for brightening dark areas | (3) No code provided, but implementation clear and easy to reproduce | (2) O((wh) nx) with x ≈ 128 | (3) A single parameter which is easy to understand |
| (Gautam and Tiwari, 2015) | (3) Good results on five test-sets, but dark areas can become undesirably darker | (2) No code provided, reproducing is moderately difficult | (4) O(whn) | (5) No parameters to be set |
| (González et al., 2016) | (3) Good results on six test-sets, but all only show brightening dark areas | (3) No code provided, reproducing is moderately difficult | (4) O(whNn), where N is the constant size of a small neighborhood around each pixel | (3) Three parameters, of which two are not directly intuitive |
| (Moradi et al., 2015) | (4) Good results on four test-sets | (2) No code provided, reproducing is difficult due to vague description | (4) O(whn) | (2) Two parameters which do not have an intuitive meaning |
| (Vazquez-Corral and Bertalmio, 2014) | (5) Very good results on 24 test-sets | (3) No code provided, algorithm explanation leaves out some important details | (4) O(whn) (authors mention that real-time operation is feasible) | (3) Two parameters which do not have an intuitive meaning |
| (Wang et al., 2014) | (5) Good results on seven test-sets | (2) No code provided, reproducing seems difficult | (4) O(whn) | (1) Five parameters which do not have an intuitive meaning |
| (Farbman and Lischinski, 2011) | (4) Good results on five test-sets | (1) No code provided, algorithm patented by authors | (4) O(whn) | (4) A single parameter with clear usage instructions |
| (Bassiou and Kotropoulos, 2007) | (4) Good results on five test-sets | (3) Third party code used in the paper produces undesired results | (4) O(whn) | (4) Parameter(s) of probability smoothing step not explained |

The histogram compression works as follows. Consider the current frame t in the video and a time window of 2k+1 frames centered at t. Figure 2 shows this for a window of 5 frames. Below each frame t, the histogram $H_t$ of its saturation channel is shown (in the following, we use saturation as an example, though our technique also works on the intensity channel, as already stated). In the frames, we observe an undesired tone shift from orange to pink. We also observe a distinct shape change of the saturation histograms. Hence, the shape change can be used as an indicator of the amount of color variation. For this, we need a way to measure the amount of shape change. To do this, we first compute a cumulative histogram $H$ whose bins are given by

$$H(x) = \sum_{i=-k}^{k} H_{t+i}(x)$$

where $H_{t+i}(x)$ is the bin for saturation value x of the histogram for frame t + i. As our pill camera images are 8 bit per channel, we use histograms of 255 bins. We next compute the mean µ and variance σ² of $H$ and use the latter as a measure of the shape change of all histograms within the time window. A small variance indicates a small tone fluctuation, meaning that very little histogram compression is needed. A large variance indicates a large tone fluctuation, meaning that more compression is needed to smooth out the fluctuation.
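The window statistics described above can be sketched as follows. This Python fragment is an illustration under stated assumptions rather than the C++ implementation: each frame is a flat list of integer channel values, the window is clamped at the start and end of the video, the mean and variance are taken over channel values weighted by bin counts, and 256 bins (one per 8-bit value) are used for simplicity. All function names are hypothetical.

```python
def frame_histogram(pixels, bins=256):
    """Count how many pixels take each integer channel value."""
    hist = [0] * bins
    for p in pixels:
        hist[p] += 1
    return hist

def window_statistics(frames, t, k, bins=256):
    """Sum the histograms of frames t-k..t+k (clamped to the video) and
    return (cumulative histogram, mean, variance) of the channel values."""
    first, last = max(0, t - k), min(len(frames) - 1, t + k)
    cumulative = [0] * bins
    for i in range(first, last + 1):
        for x, n in enumerate(frame_histogram(frames[i], bins)):
            cumulative[x] += n
    total = sum(cumulative)
    mean = sum(x * n for x, n in enumerate(cumulative)) / total
    variance = sum(n * (x - mean) ** 2 for x, n in enumerate(cumulative)) / total
    return cumulative, mean, variance
```

For example, with `frames = [[10, 10], [20, 20], [30, 30]]`, calling `window_statistics(frames, t=1, k=1)` pools all six values, giving a mean of 20 and a variance of 200/3.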

We can now proceed with the actual histogram compression (see also Fig. 3). We start with the computed mean µ and variance σ² of the cumulative histogram $H$ (Fig. 3a). Secondly, we eliminate the mean by subtracting µ from the saturations of all pixels (Fig. 3b). Thirdly, we compress the histogram by dividing the saturations by aσ² (Fig. 3c). Here, a ∈ [1/σ², 1] controls the compression amount: for a = 1, all saturations are divided by σ², so that the histogram is compressed by an amount proportional to the variance. For a = 1/σ², no compression occurs. After this step, a part of the histogram will correspond to negative saturation values, which of course make no sense. To fix this, it seems natural to shift the histogram back by the same value µ that we subtracted before. However, we verified that doing so produces unnatural-looking tones: pixel saturations appear higher or lower than desired. To solve this issue, we use a shift value c ∈ [0,1] (Fig. 3d), as follows. If c = 0, the histogram is shifted so that its leftmost bin corresponds to saturation value 0; if c = 1, the histogram is shifted so that its rightmost bin corresponds to saturation value 255. Intermediate values for c produce linearly interpolated shifts between these two extremes.

Figure 3: Histogram compression (each panel plots number of pixels against saturation). a) The histogram's mean µ and variance σ² are computed. b) The histogram is shifted µ bins to the left so that its mean is zero. c) The histogram is compressed by dividing all saturation values by aσ². d) The histogram is shifted right by c bins.
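The compression steps of Fig. 3 can be sketched per frame as below. This is a hedged illustration, not the actual implementation: the function name is hypothetical, the leftmost and rightmost occupied bins are taken from the values passed in, and the divisor is clamped to at least 1 so that values of a below 1/σ² cannot expand the histogram (an assumption that enforces the stated range of a).

```python
def compress_channel(values, mean, variance, a, c, max_value=255.0):
    """Shift by the window mean, divide by a * variance, then shift back
    by an amount interpolated between the two extremes selected by c."""
    divisor = max(1.0, a * variance)  # assumption: never expand the histogram
    scaled = [(v - mean) / divisor for v in values]
    low, high = min(scaled), max(scaled)
    shift_to_zero = 0.0 - low        # c = 0: leftmost value lands on 0
    shift_to_max = max_value - high  # c = 1: rightmost value lands on max_value
    shift = (1.0 - c) * shift_to_zero + c * shift_to_max
    return [s + shift for s in scaled]
```

With `values = [0, 100, 200]`, `mean = 100`, `variance = 50` and `a = 1`, the scaled values are −2, 0, 2; `c = 0` then yields 0, 2, 4 and `c = 1` yields 251, 253, 255. Since every step is an increasing linear map, the order of pixel values is unchanged.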

Several comments are due. The proposed histogram compression extends the relative pixel intensity constraint mentioned in Sec. 1 to pixel saturations. Indeed, the applied transformations are linear, and the shape of the histogram is preserved. Separately, while the histogram compression is computed on the cumulative time-window histogram, the individual pixel intensity or saturation manipulations are done separately on each frame. This ensures that these manipulations will vary smoothly in time, as the cumulative histogram has the effect of a smoothing sliding-window time filter.

4 IMPLEMENTATION

We implemented our method in single-threaded C++ under Linux and Windows. Our tool covers both the original Anbarjafari method and our new method, and allows one to apply them separately, or in sequence, on the saturation and/or intensity channels. The tool loads a pill-camera video in MPEG format, allows changing the parameters k, a, and c of our algorithm and the mean goal γ of Anbarjafari, plays the original and stabilized videos side-by-side, and saves the stabilized video as an MPEG file (Fig. 4).

Figure 4: Software tool for color stabilization and video exploration.

For a time window of 41 frames (k = 20), computing histograms takes about 3 seconds on a 2.3 GHz laptop with 4GB RAM. The video stabilization runs smoothly at 3 frames/second, which is the recording speed of the pill-camera video (Sec. 1). The computational complexity is O(whn) for processing a video of n frames each of w × h pixels, i.e., linear in input size. After computing the histograms, changing all parameters is, however, instantaneous. This allows a physician to focus on an image of interest and explore it to, e.g., brighten or darken its various areas in real time.
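One way to keep the per-frame cost of the cumulative histogram independent of the window size is to update it incrementally as the window slides. The Python sketch below illustrates this idea only; it is not the C++ implementation described above, and the class name and interface are hypothetical.

```python
from collections import deque

class SlidingHistogram:
    """Maintain the bin-wise sum of the most recent 2k+1 frame histograms."""

    def __init__(self, k, bins=256):
        self.size = 2 * k + 1
        self.window = deque()
        self.cumulative = [0] * bins

    def push(self, hist):
        """Add a new frame histogram and drop the oldest one if the window
        is full; returns the updated cumulative histogram."""
        self.window.append(hist)
        for x, n in enumerate(hist):
            self.cumulative[x] += n
        if len(self.window) > self.size:
            oldest = self.window.popleft()
            for x, n in enumerate(oldest):
                self.cumulative[x] -= n
        return self.cumulative
```

Each call touches a fixed number of bins, so a whole video costs one histogram pass per frame plus a constant amount of bookkeeping, which is in line with the linear complexity stated above.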

5 EVALUATION

As already outlined, only very few evaluations of color stabilization for endoscopy videos are present in the literature. Moreover, these take the form of presenting the stabilized images, but come with limited or even no actual evaluation of the quality thereof. We improve upon this by presenting next both a qualitative user-study-based evaluation (Sec. 5.1) and a quantitative metrics-based evaluation (Sec. 5.2).


5.1 Qualitative Evaluation

The nature of color stabilization is quite application-specific and possibly even user-specific. It is not easy to formally measure how much 'better' a given stabilized image is than another one. Also, note that we have no ground truth, in the sense of an 'optimally' stabilized image. As such, it is important to compare different stabilization methods or parameter settings by means of user studies. To this end, we performed a survey in which users were asked to rank images produced by different stabilization methods and parameter values, as described next.

5.1.1 Evaluation Materials

We acquired several endoscopy videos, each 8 hours long, recorded using the MiroCam pill camera (Medivators, 2017), from medical specialists at a major regional hospital in the Netherlands. The videos were pre-screened by the specialists for suitability, that is, containing no major artifacts due to camera malfunction, and containing a wide range of image intensities and tones that would pose difficulties in manual analysis and for which stabilization would be of added value. Since organizing a study where multiple users examine thousands of images such as present in our videos was infeasible, we first manually grouped the available video frames into five representative classes, depending on the color and intensity distribution, as follows:

• Dark area directly bordering a very bright area (Fig. 1, frame 3515);

• Dark area separated from a very bright area by moderate illumination (Fig. 1, frame 4765);

• Dark area directly surrounded by bright areas on all sides (Fig. 1, frame 6900);

• Dark area surrounded by bright areas on all sides, with a moderate illumination transition zone (Fig. 1, frame 8096);

• Dark area bordering a bright area of varied color and structure (Fig. 1, frame 8096).

Next, we randomly selected a few images in each class for the qualitative study. For each image, we ran several combinations of the Anbarjafari method (A) and our proposed method (P) described in Sec. 3, applied on the intensity (I) and saturation (S) channels, as described below. Note that only the first combination (A used solely on I) is covered by existing literature, all other combinations being novel.

1. A → I: A applied to I only;

2. A → (I,S): A applied to both I and S channels;

3. P → I: P applied to I only;

4. P → (I,S): P applied to both I and S channels;

5. (A,P) → I: A applied to I, followed by applying P

to the resulting I;

6. (A,P) → (I,S): A applied to I, followed by applying P to the resulting I; and A applied to S, followed by applying P to the resulting S.
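The six combinations above can be read as ordered lists of methods applied to selected channels. The Python sketch below illustrates this reading only: it assumes an HSI image stored as a dictionary of channel lists, and `method_a` and `method_p` are hypothetical stand-ins, not the actual A and P algorithms.

```python
def run_combination(image, methods, channels):
    """Apply `methods` in order to each channel named in `channels`;
    all other channels (e.g., hue) are passed through unchanged."""
    result = dict(image)
    for name in channels:
        data = result[name]
        for method in methods:
            data = method(data)
        result[name] = data
    return result

def method_a(channel):
    """Hypothetical stand-in for A: brighten, clipped to 1."""
    return [min(1.0, v * 1.5) for v in channel]

def method_p(channel):
    """Hypothetical stand-in for P: halve the spread around the mean."""
    mean = sum(channel) / len(channel)
    return [mean + 0.5 * (v - mean) for v in channel]
```

For instance, `run_combination(image, [method_a, method_p], ["I", "S"])` corresponds to combination 6, while `run_combination(image, [method_a], ["I"])` corresponds to combination 1.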

For each combination, we ran the involved methods for several parameter values. Specifically, we set the mean goal γ in Anbarjafari to values in {0.6, 0.7, 0.8, 0.9, 1}, and the compression a of our method to values in {0.02, 0.04, 0.08, 0.16, 0.32}. The latter set of values is chosen as such since a is used as a denominator (Sec. 3), so it affects a function of hyperbolic type 1/x. For the time window size and correction, we used the fixed values of k = 20 frames and c = 0.4 respectively, which we determined empirically by testing stabilization on several videos.

Figure 5 shows the stabilization results obtained for frame 8096 (Fig. 1) for several method and parameter combinations. Due to space limitations, we cannot show all the tested results, which entail several hundreds of images. The rows in Fig. 5 indicate method combinations; columns indicate parameter-value combinations. Below we discuss the findings we observed ourselves, that is, before using these results in the actual survey, which is described next in Sec. 5.1.2.

A → I: We see that, as the parameter γ increases, dark areas are brightened, and colors and details become more easily visible to the human eye. For all five frames in the top row in Fig. 5, we found that γ = 0.7 yields the greatest intensity increase with acceptable loss of details. When γ > 0.7, images become too noisy. Moreover, in endoscopy images, detail such as edges is mainly defined by intensity and not hue, so too much brightening erases such detail.

A → (I,S): Similar to brightening dark areas, increasing γ now makes the color of low-saturation (gray-like) areas more vivid. Since low-saturation areas match dark areas in the gastrointestinal tract very well, this method additionally boosts dark areas by making them not only brighter, but also more colorful. As for the A → I method, we found an optimal value around γ = 0.7. Larger γ values affect color tones too much, which can create undesirable artifacts, like rendering normal tissue too red, thus suggesting internal bleeding.

P → I: Similar to A → I, this method makes dark areas become brighter as a increases. However, details in dark areas are lost earlier than in the A → I case. We also note that this method yields overall brighter images than A → I (compare rows 1 and 3 in Fig. 5). However, detail shading is slightly less visible. This is expected, since the goal of our method (P) is not to enhance single images, but to smooth sudden changes in video sequences. Since P → I essentially compresses the intensity channel histogram, edges captured by intensity differences may become less visible.

Figure 5: Frame 8096 (shown in Fig. 1) processed with various combinations of algorithms and parameters. Rows: A → I, A → (I,S), P → I, P → (I,S), (A,P) → I, (A,P) → (I,S). Columns: γ=0.6, a=0.02; γ=0.7, a=0.04; γ=0.8, a=0.08; γ=0.9, a=0.16; γ=1, a=0.32.

P → (I,S): In addition to the previously discussed effect on the intensity levels, this method makes colors more saturated as a increases. Interestingly, saturation is not increased as aggressively as in A → (I,S). Again, this is because our algorithm does not try to increase saturation to a certain predefined level γ, but aims to smooth out sudden differences in the saturation histograms of neighboring frames. This is why, as we will discuss later, our method is better for stabilizing saturation in videos rather than single images.

(A,P) → I: We observe that the results of this method are nearly identical to those of P → I. We explain this by the fact that P compresses the histogram after A enhanced the intensity. This largely undoes the enhancements that the A method made. As a result, the output images suffer from the same problems we observed when using P → I, namely loss of details due to the histogram compression.

(A,P) → (I,S): We observe that the results of this method are very similar to those of A → (I,S). However, the saturation is less dramatically increased. We explain this by the fact that, after the A method has made the saturation very high, the P method compresses the saturation histogram, thus making the color vibrance less extreme.

From all the above, we draw the following preliminary qualitative conclusions. The Anbarjafari method (A) with a mean goal value around γ = 0.7 shows itself to be best for intensity stabilization of single images. However, it is not effective in stabilizing tone fluctuations: when applied to saturation (A → S), it may actually enhance tone fluctuations. In contrast, our method (P) is effective in smoothing tone fluctuations, but less effective in stabilizing intensity.

5.1.2 User Survey

We refined the qualitative observations presented above, which are drawn from our own study of the computed results, by conducting an online survey that involved a wide group of people, thereby realizing a more representative qualitative evaluation. The survey material consisted of five pages, one page for an image in each image class defined in Sec. 5.1.1. Each page contained all stabilized images for the respective input image, laid out identically to Fig. 5. We also included an additional column representing the actual input image. However, the column was not marked as such, so the participants could not know which image was the input and which were the outputs of the stabilization. For each image row, the participant was asked to pick the image that they thought was the best in terms of enhancing the information in the brighter and darker areas of the image without introducing too much noise or losing information. This answers the question 'which parameter values are best for a given method combination?'. Next, at the end of each page, participants were asked to review the six images they picked as best for the six rows and pick the best one among these. This answers the question 'which method combination delivers the best results, given that all methods are run with their optimal parameter values?'.

The survey was conducted using Google Forms. Participants were encouraged to look at each row of images for roughly 10 seconds, so that the survey could be finished in about 5 minutes. However, the participants could spend more time if desired, and were also allowed to go back to previous pages to review or change their answers. Note that the participants did not see any annotations on the survey pages such as the method names and parameter values in Fig. 5. Eighteen people participated in the survey. All are specialists in image processing and computer vision, and are well familiar with endoscopy videos and their issues. The participants were aged between 20 and 50, the majority being male.

Table 2 presents the aggregated results of the survey. Rows indicate method combinations, and columns indicate parameter values, just like in Fig. 5. Each cell contains two numbers, separated by a slash. The first number indicates how many times an image generated by the respective method and parameter-value combination was chosen as best in a row of images, that is, best among all tested parameter values. The second number indicates how many times an image was chosen as best for an entire survey page, that is, best among all tested method and parameter-value combinations.

We get several insights from these figures. First, we see that the parameter values γ = 0.6, a = 0.02 and γ = 0.7, a = 0.04 get the most votes, the former being seen as best when the combined method (A,P) is used, and the latter when the individual Anbarjafari (A) method is used. These are thus good values for a wide set of images and a wide set of users. Note that the setting γ = 0.7 matches what we found ourselves in our preliminary qualitative evaluation (Sec. 5.1.1). Hence, we use these values as presets in our tool (Sec. 4). Secondly, we see that very high parameter values are almost never preferred: γ = 1, a = 0.32 receives no votes at all, and γ = 0.9, a = 0.16 only a handful. This matches our own findings that such values remove too many relevant details (Sec. 5.1.1). Thirdly, we see that the Anbarjafari method applied to saturation (A → S) with γ = 0.7, a = 0.04 has the highest number of overall best results. This matches our earlier observations that this method is indeed very good for stabilizing single images. Moreover, this is an interesting novel result, as the Anbarjafari method was originally proposed to work on intensity only. Separately, as explained earlier, this method is not aimed at stabilizing tone fluctuations in video sequences, something that our survey could not capture, as participants were shown only individual frames. Finally, we see that the combination (A,P) → (I,S) with γ = 0.6, a = 0.02 scores the highest number of best-image-in-a-row votes. As such, this method combination is arguably good for video color stabilization, although it scores lower for single-frame stabilization.
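The vote counts discussed above can be cross-checked with a few lines of code. The following is a minimal sketch, assuming the counts are transcribed by hand from Table 2; the names `SETTINGS`, `VOTES`, `row_totals`, `best_cell`, and `preferred_setting` are illustrative and not part of the tool described in Sec. 4.

```python
# Vote counts transcribed by hand from Table 2 (an assumption of this sketch).
# Each cell is (best-in-row votes, best-on-page votes); "g" stands for gamma.
SETTINGS = ["original", "g=0.6,a=0.02", "g=0.7,a=0.04",
            "g=0.8,a=0.08", "g=0.9,a=0.16", "g=1,a=0.32"]
VOTES = {
    "A->I":         [(6, 6), (18, 7), (40, 12), (24, 3), (2, 0), (0, 0)],
    "A->(I,S)":     [(5, 5), (14, 5), (55, 17), (16, 3), (0, 0), (0, 0)],
    "P->I":         [(11, 7), (41, 4), (28, 6), (7, 0), (3, 2), (0, 0)],
    "P->(I,S)":     [(9, 7), (41, 7), (30, 2), (8, 1), (2, 0), (0, 0)],
    "(A,P)->I":     [(8, 7), (50, 6), (21, 3), (11, 1), (0, 0), (0, 0)],
    "(A,P)->(I,S)": [(8, 7), (57, 7), (23, 6), (2, 0), (0, 0), (0, 0)],
}


def row_totals(votes):
    """Total best-in-row votes per method combination."""
    return {method: sum(row for row, _ in cells) for method, cells in votes.items()}


def best_cell(votes, index):
    """Cell with the most votes; index 0 = best-in-row, 1 = best-on-page."""
    candidates = ((method, SETTINGS[j], cell[index])
                  for method, cells in votes.items()
                  for j, cell in enumerate(cells))
    return max(candidates, key=lambda item: item[2])


def preferred_setting(votes):
    """For each method combination, the setting with the most best-in-row votes."""
    return {method: SETTINGS[max(range(len(cells)), key=lambda j: cells[j][0])]
            for method, cells in votes.items()}


if __name__ == "__main__":
    print(row_totals(VOTES))
    print(best_cell(VOTES, 0))
    print(best_cell(VOTES, 1))
    print(preferred_setting(VOTES))
```

With 18 participants each picking one image per row on 5 pages, every row of the table should total 90 best-in-row votes, which gives a quick consistency check on the transcription.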

[Figure 6 image grid: rows are frames 2654, 2659, 2689, 2694, 2696, and 2709; columns are the original frame, P → S, and (A,P) → (I,S).]

Figure 6: Selected frames from a video fragment demonstrating how the combination (A,P) → (I,S) successfully stabilizes both intensity and tone in image sequences.

5.1.3 Video Intensity and Tone Stabilization

Among the studied methods, we found the original Anbarjafari method to be the best for intensity stabilization in single images. Yet, this method does not handle tone stabilization in video sequences. Consider Fig. 6, left column, which shows a selection of frames from a video of a bleeding gastrointestinal tissue. The first five frames are identical to those in Fig. 1, bottom row. As also outlined in Sec. 1, a certain amount of tone fluctuation is visible even in this short sequence.

We next show how the combination of Anbarjafari and our method solves this problem. First, as a baseline, we apply only our method to the saturation channel (P → S), see Fig. 6, middle column, with a time window k = 40, compression a = 0.04, and correction c = 0.4, in line with the optimal values found for our method (P) in the survey. We see how the sudden tone changes have now been smoothed out: all frames in Fig. 6, middle column, have a pinkish tone. The tone stabilization is even more evident when watching the actual video. However, the intensity is not stabilized. To solve this, we apply the combination of Anbarjafari and our method to both the intensity and saturation channels ((A,P) → (I,S)), see Fig. 6, right column. In addition to the previous parameters, we use a mean goal γ = 0.7, shown to be optimal in our survey (Sec. 5.1.2). As visible, especially for frames 2659 and 2709, the intensity is more uniform now; in addition, the tone fluctuations are low, thanks to our method. All in all, we conclude that the combination (A,P) → (I,S) is indeed a good way to stabilize both intensity and tone fluctuations.
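The role of a time window can be illustrated independently of the specific stabilization method. The sketch below is not the histogram-based method P, whose definition lies outside this section; it uses a plain causal moving average over the last k frames as a stand-in, applied to a hypothetical per-frame statistic such as the mean saturation. The function names and the sample values are assumptions made for illustration only.

```python
from collections import deque


def smooth_over_window(values, k):
    """Replace each per-frame value by the mean of itself and up to k-1 preceding frames."""
    if k < 1:
        raise ValueError("window size k must be at least 1")
    window = deque(maxlen=k)
    total = 0.0
    smoothed = []
    for value in values:
        if len(window) == k:
            total -= window[0]  # oldest value is evicted by the append below
        window.append(value)
        total += value
        smoothed.append(total / len(window))
    return smoothed


def fluctuation(values):
    """Mean absolute frame-to-frame change; lower means a steadier sequence."""
    if len(values) < 2:
        return 0.0
    steps = [abs(b - a) for a, b in zip(values, values[1:])]
    return sum(steps) / len(steps)


if __name__ == "__main__":
    # Hypothetical mean-saturation values of six consecutive frames.
    raw = [0.40, 0.62, 0.38, 0.65, 0.41, 0.60]
    steady = smooth_over_window(raw, k=3)
    print(fluctuation(raw), fluctuation(steady))
```

Because the window only looks at past frames, this kind of smoothing can run at capture time; a larger k gives steadier output at the cost of reacting more slowly to genuine scene changes.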

5.2 Quantitative Evaluation

Table 2: Image-quality survey results accumulated for all five tested endoscopy image classes. Each cell lists the best-in-row count / best-on-page count.

Method          original   γ = 0.6, a = 0.02   γ = 0.7, a = 0.04   γ = 0.8, a = 0.08   γ = 0.9, a = 0.16   γ = 1, a = 0.32
A → I           6 / 6      18 / 7              40 / 12             24 / 3              2 / 0               0 / 0
A → (I,S)       5 / 5      14 / 5              55 / 17             16 / 3              0 / 0               0 / 0
P → I           11 / 7     41 / 4              28 / 6              7 / 0               3 / 2               0 / 0
P → (I,S)       9 / 7      41 / 7              30 / 2              8 / 1               2 / 0               0 / 0
(A,P) → I       8 / 7      50 / 6              21 / 3              11 / 1              0 / 0               0 / 0
(A,P) → (I,S)   8 / 7      57 / 7              23 / 6              2 / 0               0 / 0               0 / 0

The qualitative evaluation of the various combinations of methods and parameters in Sec. 5.1 has empirically found good parameter values that yield images perceived by users as stabilized. However, as explained already in Sec. 1, stabilization should not create artifacts which could lead to misinterpretation of the imaged tissue structures. Formally put, stabilization can be thought of as a function $\Phi(\gamma, a, I_{\text{input}}) = I_{\text{stabilized}}$ from images to images which aims to maximize the temporal stability of intensity and tones and, at the same time, to minimize the perceptual difference between the original and stabilized images. The behavior of this function is driven by our method’s free parameters, of which the most important are the goal mean γ (for Anbarjafari) and the compression a (for our histogram-based compression). To study how Φ affects image similarity, we need a way to compare $I_{\text{input}}$ and $I_{\text{stabilized}}$. For this, similarly to (Moradi et al., 2015), we use the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) metrics, well known in image processing. For 8-bit-per-channel images like ours, typical PSNR values for good similarity are between 30 and 50 dB, where higher is better (Huynh-Thu and Ghanbari, 2008). SSIM ranges between -1 and 1, where higher is better (1 denotes identical images) (Wang et al., 2004).
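To make the two similarity metrics concrete, the following is a minimal sketch in plain Python, assuming single-channel 8-bit images stored as lists of rows. PSNR follows its standard definition. The SSIM variant shown here is a simplification: it evaluates the SSIM formula of (Wang et al., 2004) once over whole-image statistics, whereas the original index averages the same formula over local windows. The function names are illustrative.

```python
import math


def _flatten(image):
    """Flatten a 2-D list of pixel values into one list of floats."""
    return [float(p) for row in image for p in row]


def _paired(ref, test):
    x, y = _flatten(ref), _flatten(test)
    if not x or len(x) != len(y):
        raise ValueError("images must be non-empty and of equal size")
    return x, y


def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    x, y = _paired(ref, test)
    mse = sum((a - b) * (a - b) for a, b in zip(x, y)) / len(x)
    return math.inf if mse == 0 else 10.0 * math.log10(peak * peak / mse)


def global_ssim(ref, test, peak=255.0, k1=0.01, k2=0.03):
    """SSIM formula evaluated once on whole-image means, variances, covariance."""
    x, y = _paired(ref, test)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) * (a - mx) for a in x) / n
    vy = sum((b - my) * (b - my) for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1, c2 = (k1 * peak) ** 2, (k2 * peak) ** 2  # stabilizing constants
    numerator = (2 * mx * my + c1) * (2 * cov + c2)
    denominator = (mx * mx + my * my + c1) * (vx + vy + c2)
    return numerator / denominator


if __name__ == "__main__":
    original = [[10, 200], [50, 120]]
    changed = [[12, 198], [51, 119]]
    print(psnr(original, changed), global_ssim(original, changed))
```

An image compared with itself yields an infinite PSNR and an SSIM of 1, while an image compared with its inverted version yields a negative SSIM, in line with the value ranges quoted above.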

Figure 7 shows the plots of the PSNR and SSIM similarity metrics between the original endoscopy images $I_{\text{input}}$ and the stabilized ones $I_{\text{stabilized}}$ as a function of the key parameters γ (for Anbarjafari) and a (for our method), for the set of images used in our qualitative analysis (see Sec. 5.1.1), and for fixed values of k = 20 and c = 0.4. As methods, we considered Anbarjafari applied on intensity (A → I) and separately on saturation (A → S), and our method applied on intensity (P → I) and separately on saturation (P → S). From these plots we make the following observations.

Quality: The A → I method peaks for both PSNR and SSIM at γ very close to 0.5, i.e., the mean intensity of $I_{\text{input}}$. This is expected: if the goal mean equals the original mean, no correction needs to be done, so $I_{\text{stabilized}}$ is identical to $I_{\text{input}}$. In contrast, A → S peaks at values around γ = 0.7. This matches very well the optimal γ values found in our qualitative study (Sec. 5.1). Hence, the γ value found best by users to explore the images is also the one where the fewest changes are made by stabilization. Moreover, the maximal PSNR values (over 50 dB) and SSIM values (close to 1) indicate that our stabilization loses very little of the original image features. Separately, we see that both SSIM and PSNR have very good values for a close to 0.04, which was found earlier in our qualitative studies to yield a very good tone stabilization (Secs. 5.1.2 and 5.1.3). This confirms that our preset a = 0.04 is indeed a good one.
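The observation that similarity peaks where the goal mean equals the input mean can be reproduced on a toy example. The sketch below does not implement the iterative n-th root and n-th power scheme of (Anbarjafari, 2014); as a simplified stand-in, it assumes a single power-law curve v^p on values normalized to [0, 1], with the exponent p found by bisection so that the output mean reaches the goal. All names and the sample pixel values are illustrative.

```python
import math


def mean(values):
    return sum(values) / len(values)


def to_goal_mean(pixels, goal, iterations=60):
    """Apply v**p to pixels in [0, 1], choosing p by bisection so the mean hits goal."""
    bound = math.log(1e3)
    lo, hi = -bound, bound  # search over log(p), i.e. p in [1e-3, 1e3]
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        p = math.exp(mid)
        if mean([v ** p for v in pixels]) > goal:
            lo = mid  # still too bright: a larger exponent darkens more
        else:
            hi = mid
    p = math.exp(0.5 * (lo + hi))
    return [v ** p for v in pixels]


def psnr_unit(ref, test):
    """PSNR in dB for values in [0, 1] (peak = 1); infinite for identical inputs."""
    mse = mean([(a - b) * (a - b) for a, b in zip(ref, test)])
    return math.inf if mse == 0 else -10.0 * math.log10(mse)


def sweep(pixels, goals):
    """PSNR between the input and its adjusted version for each goal mean."""
    return {g: psnr_unit(pixels, to_goal_mean(pixels, g)) for g in goals}


if __name__ == "__main__":
    pixels = [0.2, 0.4, 0.5, 0.6, 0.8]  # hypothetical intensities, mean 0.5
    scores = sweep(pixels, [0.3, 0.4, 0.5, 0.6, 0.7])
    print(max(scores, key=scores.get), scores)
```

In this toy setting the PSNR is highest for the goal closest to the input mean and drops as the goal moves away from it, which mirrors the shape reported for the A → I curves.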

Intensity vs Saturation Stabilization: The plots for A → I and A → S are very similar in shape and magnitude. This matches our earlier qualitative finding that the Anbarjafari method can be used to stabilize both intensity and saturation (Sec. 5.1). In contrast, the plot for P → S is always above that for P → I. This means that our proposed method P is better at stabilizing saturations (tones) than intensities, which again correlates with our qualitative findings.

Parameter Sensitivity: The plots for A → I and A → S have overall quite high derivatives close to the maximum, while the plots for P → I and P → S show a much more stable, and actually monotonic, variation. This tells us that setting the compression a for the P method is less sensitive than setting the mean goal γ for the A method. However, this does not mean that tuning γ is delicate: as explained above, we obtain a very good image quality for values around γ = 0.5 for the method A → I, and for values around γ = 0.7 for the method A → S. All in all, we conclude that parameter setting is not a sensitive process.

Consistency and Smoothness: Across the five frames, plots for the same method are similar in shape, position, and peak location. This is desirable, as it tells us that optimal parameter values are consistent for quite different inputs. Considering the earlier parameter sensitivity analysis, the parameter presets proposed in Sec. 5.1 can indeed be used as default values for entire videos. This makes our method basically parameter-free. Secondly, the plots are smooth, with no jitter, which tells us that small parameter-value changes do not massively affect the image similarity. Hence, our method is robust to parameter changes, should users really need to change the preset values.

6 CONCLUSIONS

We propose a new method for jointly stabilizing intensity and tone (hue) in endoscopy videos. For this, we adapt the intensity channel by brightening dark areas and darkening overly bright areas, and also minimize tone fluctuations between temporally close frames. Our method is simple to implement, works at the frame rate of the pill camera, has no free parameters that users should set, delivers consistent results for a wide variety of endoscopy videos, and alters the input images only minimally, thereby reducing the risk of creating misleading artifacts. Summarizing, our main contributions are:

Survey: To our knowledge, our work is the first in which a large set (10) of imaging methods was studied for suitability for the specific case of endoscopy video stabilization, from a practical perspective including validation, reproducibility, computational complexity, and ease of use.

[Figure 7 plots: one PSNR panel and one SSIM panel for each of frames 3515, 4765, 6900, and 8096. Each panel shows four curves over a horizontal axis running from 0 to 1: γ for A → I, γ for A → S, a for P → I, and a for P → S.]

Figure 7: PSNR and SSIM image-similarity plots for several frames from Fig. 1 processed with Anbarjafari and our method. The horizontal axis denotes either the goal mean γ or the compression factor a depending on the graph type.

Joint Stabilization: While several methods perform intensity stabilization, we show how both intensity and tone can be jointly stabilized. For the former, we use an existing method (Anbarjafari, 2014). For the latter, we propose a simple but efficient method based on histogram compression.

Validation: Compared to existing work, we perform a significantly more thorough validation, including testing several method types applied on intensity and/or saturation; a detailed user study for finding good method combinations and parameter values; and a quantitative evaluation that shows how to find parameter presets which match the values suggested by our qualitative study and also minimally affect image quality. This makes our method fully parameter-free and guarantees its output quality. Our method can be easily and efficiently implemented.

Limitations: Our search of the algorithm-and-parameter space is, of course, not exhaustive. More methods and parameter values exist which could be assessed. It is also fair to say that our current evaluation already surpasses what one typically encounters in endoscopy video stabilization papers. Separately, one can argue that the differences between the original and stabilized images are quite small, so the entire stabilization process is not worthwhile. Yet, when watching the actual stabilized videos, these differences are clearly visible, and show that the stabilized material is easier to follow.

Several future work directions exist. More extensive evaluations can be made to compare with additional color stabilization methods, use more videos, or involve more users. Machine learning techniques could be used to perform a more fine-grained stabilization based on images or image regions labeled by users as requiring brightening.

ACKNOWLEDGEMENTS

We thank Medisch Centrum Leeuwarden for providing us with the capsule endoscopy videos.

REFERENCES

Anbarjafari, G. (2014). HSI based colour image equalization using iterative nth root and nth power. arXiv:1501.00108 [cs.CV].

Bassiou, N. and Kotropoulos, C. (2007). Color image histogram equalization by absolute discounting back-off. CVIU, 107(1):108–122.

Classen, M. and Phillip, J. (1984). Electronic endoscopy of the gastrointestinal tract. Endoscopy, 16(1):16–19.

Farbman, Z. and Lischinski, D. (2011). Tonal stabilization of video. ACM Trans Graph, 30(4):89–101.

Funt, B., Barnard, K., Brockington, M., and Cardei, V. (1997). Luminance-based multi-scale retinex. In Proc. 8th Congress of the International Colour Association.

Gautam, C. and Tiwari, N. (2015). Efficient color image contrast enhancement using range limited bi-histogram equalization with adaptive gamma correction. In Proc. IEEE ICIC.


Gijsenij, A., Gevers, T., and van de Weijer, J. (2011). Computational color constancy: Survey and experiments. IEEE Trans Imag Process, 20(9):2475–2489.

González, D. M., Ponomaryov, V., and Kravchenko, V. (2016). Chromaticity improvement in images with poor lighting using the multiscale-retinex MSR algorithm. In Proc. IEEE MSSW.

Hale, M., Sidhu, R., and McAlindon, M. (2014). Capsule endoscopy: Current practice and future directions. World J Gastroenterol, 20(24):7752–7759.

Huynh-Thu, Q. and Ghanbari, M. (2008). Scope of validity of PSNR in image/video quality assessment. Electron Lett, 44(13).

Kim, T. and Yang, H. (2006). A multidimensional histogram equalization by fitting an isotropic Gaussian mixture to a uniform distribution. In Proc. IEEE ICIP, pages 2865–2868.

Medivators (2017). MiroCam capsule endoscope. www.medivators.com/products/gi-physician-products/mirocam-capsule-endoscope.

Moradi, M., Falahati, A., Shahbahrami, A., and Zare-Hassanpour, R. (2015). Improving visual quality in wireless capsule endoscopy images with contrast-limited adaptive histogram equalization. In Proc. IPRIA. IEEE.

Purushothaman, J., Kamiyama, M., and Taguchi, A. (2016). Color image enhancement based on hue differential histogram equalization. In Proc. ISPACS, pages 322–331. IEEE.

Sun, C. C., Ruan, S. J., Shie, M. C., and T, W. P. (2005). Dynamic contrast enhancement based on histogram specification. IEEE Trans Consum Electr, 51(4):1300–1305.

Vazquez-Corral, J. and Bertalmio, M. (2014). Color stabilization along time and across shots of the same scene, for one or several cameras of unknown specifications. IEEE Trans Imag Process, 23(10).

Vig, N., Budhiraja, S., and Singh, J. (2016). Hue preserving color image enhancement using guided filter based sub image histogram equalization. In Proc. 9th Intl. Conf. on Contemporary Computing (IC3).

Wang, Y., Tao, D., Li, X., Song, M., Bu, J., and Tan, P. (2014). Video tonal stabilization via color states smoothing. IEEE Trans Image Process, 23(11).

Wang, Z., Bovik, A., Sheikh, H., and Simoncelli, E. (2004). Image quality assessment: from error visibility to structural similarity. IEEE Trans Imag Process, 13(4):600–612.
