implementation uses Hessian-Affine keypoint detection and SIFT for keypoint description. The dictionary contains 20,000 visual words and was built using randomly selected keypoints from the same datasets on which the method was tested.
• Xmatch from Zhao and Ngo (2009), which uses scale-rotation invariant pattern entropy based on the SIFT descriptor and exhaustive pair-wise comparisons.
Apart from these methods, results are also reported for using only the image similarity measure, without the proposed clustering steps that follow it (Similarity).
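To make the compared representations more concrete, the following is a minimal sketch of a bag-of-visual-words signature and a similarity-only comparison between two images. The detector (OpenCV's SIFT standing in for Hessian-Affine), the clustering routine, the vocabulary size, and all function names are illustrative assumptions; they do not reproduce the exact pipelines of the compared methods or of the proposed similarity measure.

```python
# Hedged sketch of a BoVW signature and a similarity-only baseline.
# Detector, vocabulary size, and similarity measure are assumptions
# for illustration only, not the exact settings of the methods above.
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans


def sift_descriptors(image_path):
    """Return SIFT descriptors (OpenCV's detector stands in for Hessian-Affine)."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.SIFT_create().detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 128), dtype=np.float32)


def build_vocabulary(sampled_descriptors, n_words=20000):
    """Cluster randomly sampled descriptors into a visual vocabulary."""
    return MiniBatchKMeans(n_clusters=n_words, random_state=0).fit(sampled_descriptors)


def bovw_signature(descriptors, vocabulary):
    """Quantize descriptors into a normalized visual-word histogram."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-9)


def similarity(sig_a, sig_b):
    """Similarity-only baseline: cosine similarity of two signatures."""
    denom = np.linalg.norm(sig_a) * np.linalg.norm(sig_b) + 1e-9
    return float(np.dot(sig_a, sig_b) / denom)
```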
Since existing methods mainly target the domain of copyright detection, their evaluation datasets include images from news channels (Zhao and Ngo, 2009), or synthetic degradations (e.g. cropping, rotation, intensity, resizing, etc.) applied to images from the web (Foo et al., 2007a). These databases, however, are quite different from the personal photolibrary of a typical user, which mostly contains people in family moments, traveling/vacation, or other everyday activities. More importantly, though, these datasets provide only a binary ground truth, which cannot capture the ambiguity of NIND cases (Jinda-Apiraksa et al., 2013). For this reason, two different datasets were used for the evaluation of the methods in this study, featuring images taken from personal photolibraries. The comparison results are reported in the following.
4.1 California-ND Dataset
The California-ND dataset has been specifically de-
signed for ND detection in personal photolibraries
(Jinda-Apiraksa et al., 2013). The advantage of this
dataset is that it comprises 701 images from a real
user’s travel photo collection, the size of which co-
incides with the average number of photos taken per
trip (Loos et al., 2009). Although the total number
of images may not be as high as in other established
datasets in the copyright detection domain, this is the
only existing publicly available dataset including im-
ages directly taken from a personal photo collection,
which has also been annotated for ND cases by a
panel of 10 observers, and as such captures the inher-
ent ambiguity of NIND cases. In order to use these annotations in our evaluation, the 10 ratings were averaged, resulting in a real number in the interval [0, 1], indi-
cating the agreement between subjects that a pair of
images may be ND. These results are stored in matrix
G, which serves as the ground truth.
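A hypothetical sketch of how such a ground-truth matrix could be assembled from the per-observer judgements is given below; the array layout and names are assumptions for illustration and do not reflect the dataset's actual file format.

```python
# Hedged sketch: averaging binary ND annotations from 10 observers into G.
# The (n_observers, n_images, n_images) layout is an assumption for illustration.
import numpy as np


def build_ground_truth(annotations):
    """annotations[k, i, j] = 1 if observer k marked images i and j as ND.
    Returns G with entries in [0, 1]: the fraction of observers in agreement."""
    annotations = np.asarray(annotations, dtype=float)
    G = annotations.mean(axis=0)   # average over the 10 observers
    np.fill_diagonal(G, 1.0)       # each image is trivially a duplicate of itself
    return G
```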
The ND cases include the 3 major categories re-
ported in Jaimes et al. (2002): variations in the scene,
the camera settings, and the image. This includes
changes in the subject/background, zooming, pan-
ning, tilting, brightness/exposure difference, white
balance difference, burst shots, group photos, perfor-
mance/show photos, portrait photos, etc. It should
be noted that zooming, in reality, can be different
from simple cropping, which is extensively used in
other datasets, since by the time the camera lens
has zoomed and focused, the scene may also have
changed. Furthermore, the photos included in the
dataset are captured by two different cameras with
non-synchronized timestamps. This has an impact on
any method that uses timestamps as a feature of image
(dis)similarity, including the proposed one.
Figure 5: Correlation matrices for the ND cases of Fig. 1 (panels: Correlation Matrix R, Ground Truth G, and Similarity Values; images A–E).
Fig. 5 demonstrates the strength of PhotoCluster
using the images of Fig. 1, which are part of the
California-ND dataset. The results of PhotoCluster
(matrix R) are compared to the ground truth (matrix
G) and the similarity values used in the multiple clus-
tering step. The subset of five images (A, B, C, D, and E) contains one obvious ND pair (B, C) and several other ambiguous cases. This is evident from matrix
G, where the average observers’ rating for B and C
is 0.9, whereas it ranges from 0.1 to 0.6 for the other
pairs.
The image similarity values do not follow the
ground truth trend. According to matrix G, only 10%
of the observers agreed that images D and E are ND
with images A, B, and C, whereas the similarity value
for all of them is around 0.44. Once the multiple clustering step of PhotoCluster is applied to these similarity values, R resembles G much more closely; images D and E are assigned zero probability of being ND with A, B, and C. This shows that the results of the proposed method roughly follow the pattern of the ground truth, whereas image similarity alone is not enough to capture the ambiguity of NIND cases.
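One simple way to quantify how closely R or the raw similarity values track G on such a subset is the mean absolute deviation over the distinct image pairs, as in the sketch below; this is only an illustrative comparison, not the evaluation metric reported for the full dataset.

```python
# Hedged sketch: comparing a predicted pairwise matrix against the ground truth G
# by the mean absolute deviation over distinct image pairs (upper triangle).
import numpy as np


def mean_abs_deviation(predicted, G):
    """Average |predicted - G| over all distinct image pairs."""
    iu = np.triu_indices_from(G, k=1)
    return float(np.abs(predicted[iu] - G[iu]).mean())
```

Applied to the matrices of Fig. 5, a lower deviation for R than for the raw similarity values would reflect the qualitative observation above.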
Fig. 6 depicts the performance of the different
methods for the California-ND dataset. It confirms
again that image similarity alone is not adequate for
detecting ND cases; while it exhibits very high recall,
it has very low precision. Consequently, the sF1 score
is very low. When compared to PhotoCluster, the con-
tribution of the multiple clustering approach becomes
apparent.
INDetector and BoVW exhibit very similar results and a behavior opposite to that of Similarity. Their pre-