Table 4: Comparison of the results of selecting questions
randomly vs. according to conditional entropy.

(a) WCP Dataset
Method   Avg. # questions   Accuracy [%]
random   3.01               81.3
entropy  2.75               86.2

(b) Mall Dataset
Method   Avg. # questions   Accuracy [%]
random   3.12               41.3
entropy  2.73               51.3
5.4 Impact of Conditional Entropy
To show the effect of conditional entropy, we compared our results with those achieved using randomly selected questions. As shown in Table 4, using conditional entropy rather than random selection improved accuracy by 4.9 points on the WCP Dataset and by 10.0 points on the Mall Dataset. When selecting questions, we make questions about scene text information, which is comparatively easy to locate, more likely to be selected, while questions involving low-confidence object labels are less likely to be selected. Localization therefore becomes more accurate. In addition, the average number of questions was reduced by about 0.4.
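The selection criterion above can be sketched as follows: choose the question whose answer is expected to reduce the entropy of the location posterior the most. This is a minimal illustration, not the paper's implementation; the per-location answer distributions P(answer | location) are hypothetical placeholders, since the paper does not specify how detection confidences are converted into these distributions.

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a discrete distribution, skipping zero terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_conditional_entropy(prior, answer_probs):
    """Expected entropy of the location posterior after one question.

    prior        : dict location -> P(location)
    answer_probs : dict location -> dict answer -> P(answer | location)
    """
    answers = {a for dist in answer_probs.values() for a in dist}
    total = 0.0
    for a in answers:
        # Joint P(answer=a, location=l), then marginal P(answer=a).
        joint = {l: prior[l] * answer_probs[l].get(a, 0.0) for l in prior}
        p_a = sum(joint.values())
        if p_a == 0.0:
            continue
        posterior = [v / p_a for v in joint.values()]
        total += p_a * entropy(posterior)
    return total

def select_question(prior, questions):
    """Pick the question minimizing expected posterior entropy.

    questions : dict question -> (location -> (answer -> probability))
    """
    return min(questions,
               key=lambda q: expected_conditional_entropy(prior, questions[q]))
```

For example, a scene-text question whose answer perfectly separates two candidate locations has an expected conditional entropy of 0 bits and would be asked before an uninformative question about a low-confidence object label.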
6 CONCLUSIONS
In this paper, in order to reduce the burden on the user while achieving highly accurate indoor localization, we proposed a method that generates questions from reference images and filters the possible locations based on the user's responses to those questions. The results of experiments on two datasets showed that, even when the accuracy of similar image retrieval is extremely low, an average of 2.75 responses, without increasing the number of captured query images, achieved higher accuracy than the conventional method.
As future work, methods such as fine-tuning on indoor datasets will be considered to improve the accuracy of object detection. Furthermore, to generate questions that users are comfortable responding to, and questions that account for differences between views, Visual Question Generation (VQG) is worth investigating, as it can be adapted for localization. Last but not least, the drop in accuracy caused by changes in stores or objects over time should be addressed.
REFERENCES
Bautista, D. and Atienza, R. (2022). Scene Text Recogni-
tion with Permuted Autoregressive Sequence Models.
In ECCV, Cham. Springer International Publishing.
Chiou, M. J. et al. (2020). Zero-Shot Multi-View Indoor
Localization via Graph Location Networks. In ACM
Multimedia, pages 3431–3440.
Dong, J. et al. (2019). ViNav: A Vision-Based Indoor Nav-
igation System for Smartphones. IEEE Trans Mob
Comput, 18(6):1461–1475.
Gao, R. et al. (2016). Sextant: Towards Ubiquitous Indoor
Localization Service by Photo-Taking of the Environ-
ment. IEEE Trans Mob Comput, 15(2):460–474.
He, K. et al. (2017). Mask R-CNN. In ICCV, pages 2961–
2969.
Li, S. and He, W. (2021). VideoLoc: Video-based Indoor
Localization with Text Information. In INFOCOM,
pages 1–10.
Li, X. et al. (2021). Accurate Indoor Localization Using
Multi-View Image Distance. In IEVC.
Liu, C. et al. (2008). SIFT Flow: Dense Correspondence
Across Different Scenes. In ECCV, pages 28–42.
Springer.
Liu, Z. et al. (2017). Multiview and Multimodal Pervasive
Indoor Localization. In ACM Multimedia, pages 109–
117.
Liu, Z. et al. (2021). Swin Transformer: Hierarchical Vision
Transformer using Shifted Windows. arXiv preprint
arXiv:2103.14030.
MarketsandMarkets (2022). Indoor Location Market
by Component (Hardware, Solutions, and Services),
Technology (BLE, UWB, Wi-Fi, RFID), Application
(Emergency Response Management, Remote Monitor-
ing), Organization Size, Vertical and Region - Global
Forecast to 2027. MarketsandMarkets.
Noh, H. et al. (2017). Large-Scale Image Retrieval with
Attentive Deep Local Features. In ICCV, pages 3456–
3465.
Radenović, F. et al. (2018). Fine-Tuning CNN Im-
age Retrieval with No Human Annotation. TPAMI,
41(7):1655–1668.
Sun, X. et al. (2017). A Dataset for Benchmarking Image-
Based Localization. In CVPR, pages 5641–5649.
Taira, H. et al. (2018). InLoc: Indoor Visual Localization
with Dense Matching and View Synthesis. In CVPR,
pages 7199–7209.
Torii, A. et al. (2015). 24/7 Place Recognition by View
Synthesis. In CVPR, pages 1808–1817.
Wang, S. et al. (2015). Lost Shopping! Monocular Local-
ization in Large Indoor Spaces. In ICCV, pages 2695–
2703.
Zhou, X. et al. (2017). EAST: an Efficient and Accurate
Scene Text Detector. In CVPR, pages 5551–5560.