
reliability. The strict stopping rule in the experiment
also led to a larger number of trials, in which observers
viewed each image many times, which could have caused
annoyance and reduced participant satisfaction. In the
case of the slider-based method, the shorter completion
time and the fact that each image was viewed only once
increased the interest and concentration of the observer,
leading to greater consistency in the answers given.
However, the simplicity of the test also made it more
susceptible to shortcuts, which could introduce bias and
place greater reliance on the honesty of the observer.
5 CONCLUSION
In this study, we conducted a comparison between
3AFC and slider-based methods to determine contrast
preference. We compared the reliability of the data
obtained from both methods for locally and globally
repeated images using statistical analysis and visual
comparisons to assess the differences between the re-
sponses. We found variations in the results when ana-
lyzing individual observers. However, on average,
the differences between the experiments were con-
sistent, with slightly lower mean contrast preferences
observed in the slider-based experiment.
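The kind of comparison described above can be sketched in a few lines of Python. The scores below are hypothetical placeholders (not the study's data), and the helper function is illustrative: it computes the Pearson correlation between the per-image mean preferences of the two methods, along with the mean shift between them.

```python
from statistics import mean, stdev

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# Hypothetical per-image mean contrast preferences (placeholder values):
afc_scores    = [0.62, 0.55, 0.70, 0.48, 0.66]   # 3AFC results
slider_scores = [0.58, 0.50, 0.69, 0.45, 0.60]   # slider-based results

r = pearson(afc_scores, slider_scores)
shift = mean(afc_scores) - mean(slider_scores)   # positive: slider means lower
print(f"r = {r:.3f}, mean shift = {shift:+.3f}")
```

A low correlation combined with a consistent positive shift would reproduce the pattern reported here: the two methods disagree per image, yet the slider-based experiment yields slightly lower means overall.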
The results suggest that neither method can substitute
for the other, as they correlated poorly and their results
differed significantly. However, increasing the number
of repetitions could stabilize the results and improve
precision and reliability. In cases where time is critical
and a large number of samples must be processed, a
slider-based test could be the better option. To balance
time and reliability, images can be repeated at least
twice in the slider-based interface so that mean prefer-
ences can be gathered. However, when dealing with a
smaller number of samples where reliability is crucial,
the 3AFC test can be considered. Ultimately, the choice
of the most suitable method should always be made in
accordance with the specific research objectives. Fac-
tors such as time constraints,
sample size, and desired reliability should be care-
fully considered. Further improvements can be made
to both methods by addressing the bias in the starting
point and ensuring an optimal duration of the experi-
ment to prevent observer fatigue.
VISAPP 2024 - 19th International Conference on Computer Vision Theory and Applications