4 CONCLUSION
This study investigates the feasibility of reconstructing a 3D scene of the GI tract from 2D endoscopic videos as an end-to-end process, i.e. from an input 2D video to an output 3D model, without prior knowledge of camera information such as location or intrinsic and extrinsic parameters. The SOTA NeRF approach is applied. Because endoscopic video of the GI surface carries little texture information, and the camera positional information extracted from the videos requires images with varying viewing angles, which in our case are limited, the ground-truth images remaining after pre-processing amount to only 10% of the original input. However, even with only 1000 images per lesion as one training dataset, the 3D model is able to render high-quality images from various viewing angles.
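As a point of reference, a minimal sketch of such an end-to-end run is shown below, assuming the Nerfstudio toolkit (Tancik et al., 2023) with its built-in COLMAP pre-processing; the file names and the choice of the nerfacto model are illustrative assumptions rather than the exact configuration used in this study.

```python
# Hedged sketch of an end-to-end run: 2D video in, trained radiance field out.
# Assumes Nerfstudio (Tancik et al., 2023) is installed; it calls COLMAP internally
# to recover camera poses, so no prior camera information is required.
import subprocess

video = "lesion_clip.mp4"      # hypothetical endoscopic clip
workdir = "data/lesion"

# Extract frames and estimate camera poses (intrinsics and extrinsics) with COLMAP.
subprocess.run(["ns-process-data", "video", "--data", video, "--output-dir", workdir], check=True)

# Train a NeRF-style model on the registered frames.
subprocess.run(["ns-train", "nerfacto", "--data", workdir], check=True)
```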
For the two training datasets, the averaged PSNR, SSIM and LPIPS between the original (ground-truth) and rendered images are 19.46 ± 2.56, 0.70 ± 0.054, and 0.49 ± 0.05 respectively.
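As an illustration, the three measures can be computed per frame along the following lines; the file names and the use of scikit-image and the lpips package are assumptions for this sketch, not necessarily the tools used in this study.

```python
# Hedged sketch: PSNR, SSIM and LPIPS between a ground-truth frame and a rendered frame.
import numpy as np
import torch
import lpips                                   # pip install lpips
from skimage import io
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

gt = io.imread("frame_gt.png").astype(np.float32) / 255.0            # hypothetical file names
rendered = io.imread("frame_rendered.png").astype(np.float32) / 255.0

psnr = peak_signal_noise_ratio(gt, rendered, data_range=1.0)
ssim = structural_similarity(gt, rendered, channel_axis=-1, data_range=1.0)

# LPIPS expects NCHW tensors scaled to [-1, 1].
to_tensor = lambda img: torch.from_numpy(img).permute(2, 0, 1)[None] * 2 - 1
lpips_val = lpips.LPIPS(net="alex")(to_tensor(gt), to_tensor(rendered)).item()

print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.3f}, LPIPS: {lpips_val:.3f}")
```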
In comparison with the original NeRF work, where 31.01, 0.947 and 0.081 are obtained for natural images (Mildenhall et al., 2020), our results are noticeably lower.
However, in (Mildenhall et al., 2020), around 100 views are acquired for each filmed object with known camera information. In our study, this information has to be extracted from the endoscopic videos themselves, with far fewer viewing angles available due to the constrained viewing space in the food passage, so that fewer image frames are employed.
In addition, because of the combination of movements during endoscopic filming, including heartbeat, respiration and camera motion, many images appear blurry to a certain extent. These blurry images are usually discarded when the COLMAP library is applied to track camera locations, because motion tracking relies on an optical-flow assumption, i.e. the same spot should appear at a similar intensity level in subsequent images, which does not hold for blurry frames.
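As an illustration of how such frames could be screened before pose estimation, a minimal sketch is given below; the variance-of-Laplacian sharpness score and the threshold value are assumptions for illustration, not part of the reported pipeline.

```python
# Hedged sketch: flag blurry frames before passing them to COLMAP.
import cv2

def is_sharp(frame_path: str, threshold: float = 100.0) -> bool:
    """Return True if the frame looks sharp enough for feature-based camera tracking."""
    grey = cv2.imread(frame_path, cv2.IMREAD_GRAYSCALE)
    sharpness = cv2.Laplacian(grey, cv2.CV_64F).var()   # low variance suggests motion blur
    return sharpness > threshold

frames = [f"frames/{i:05d}.png" for i in range(1000)]   # hypothetical frame paths
usable = [f for f in frames if is_sharp(f)]
```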
In the future, more datasets will be evaluated. In addition, post-processing will be conducted, including the removal of noise and ghost artefacts as recommended more recently by Warburg et al. (2023). Specifically, to make use of as many video frames as possible, especially for medical applications with limited datasets, a new algorithm will be developed to establish camera information from the available but less clear images through the application of human vision models. While many frames are blurry with regard to motion tracking, human vision can still perceive these motions easily and clearly. In this way, the developed system will also become more transparent.
ACKNOWLEDGEMENTS
This project is funded by the British Council through a Women in STEM Fellowship (2023-2024). Their financial support is gratefully acknowledged.
REFERENCES
Ali S, Zhou F, Braden B, et al (2020), “An objective
comparison of detection and segmentation algorithms
for artefacts in clinical endoscopy,” Nature Scientific
Reports, 10, Article number: 2748.
Ali S, Bailey A, Ash S, et al. (2021a), “A Pilot Study on
Automatic Three-Dimensional Quantification of
Barrett’s Esophagus for Risk Stratification and Therapy
Monitoring”, Gastroenterology, 161: 865-878.
Ali S, Dmitrieva M, Ghatwary N, et al (2021b), “Deep
learning for detection and segmentation of artefact and
disease instances in gastrointestinal endoscopy,”
Medical Image Analysis, 70:102002.
Bae G, Budvytis I, Yeung CK, Cipolla R (2020), “Deep
Multi-view Stereo for Dense 3D Reconstruction from
Monocular Endoscopic Video”, MICCAI 2020. Lecture
Notes in Computer Science, vol 12263. Springer.
Bissoonauth-Daiboo P, Khan MHM, Auzine MM, Baichoo
S, Gao XW, Heetun Z (2023), “Endoscopic Image
classification with Vision Transformers,” in ICAAI
2023, pp. 128-132.
Bolya D, Zhou C, Xiao F, Lee YJ (2019), “YOLACT: Real-time Instance Segmentation,” in Proceedings of the ICCV 2019.
Gao XW, Taylor S, Pang W, Hui R, Lu X, Braden B (2023), “Fusion of colour contrasted images for early detection of oesophageal squamous cell dysplasia from endoscopic videos in real time,” Information Fusion, 92: 64-79.
Kajiya JT, Herzen BPV (1984), “Ray tracing volume
densities,” Computer Graphics, SIGGRAPH.
Lu L, Mullins CS, Schafmayer C, Zeißig S, Linnebacher M
(2021), “A global assessment of recent trends in
gastrointestinal cancer and lifestyle-associated risk
factors,” Cancer Commun (Lond), 41(11): 1137-1151.
Mildenhall B, Srinivasan PP, Tancik M, et al. (2020), “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis,” in ECCV 2020.
Prinzen M, Trost J, Bergen T, Nowack S, Wittenberg T
(2015), “3D Shape Reconstruction of the Esophagus
from Gastroscopic Video,” In: H Handels, et al. (eds)
Image Processing for Medicine, pp. 173-178. Springer
Berlin, Heidelberg.
Schonberger JL, Frahm JM (2016), “Structure-from-motion
revisited,” In: CVPR’2016.
Tancik M, Weber E, Ng E, et al (2023), “Nerfstudio: A Modular Framework for Neural Radiance Field Development,” arXiv:2302.04264.
Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004),
“Image quality assessment: From error visibility to