CodeSCAN: ScreenCast ANalysis for Video Programming Tutorials
Alexander Naumann¹,², Felix Hertlein¹,², Jacqueline Höllig¹,², Lucas Cazzonelli¹,² and Steffen Thoma¹,²
¹FZI Research Center for Information Technology, Karlsruhe, Germany
²Karlsruhe Institute of Technology, Karlsruhe, Germany
Keywords:
Computer Vision, Optical Character Recognition, Object Detection, Image Binarization, Datasets.
Abstract:
Programming tutorials in the form of coding screencasts play a crucial role in programming education, serving
both novices and experienced developers. However, the video format of these tutorials presents a challenge
due to the difficulty of searching for and within videos. Addressing the absence of large-scale and diverse
datasets for screencast analysis, we introduce the CodeSCAN dataset. It comprises 12,000 screenshots captured
from the Visual Studio Code environment during development, featuring 24 programming languages,
25 fonts, and over 90 distinct themes, in addition to diverse layout changes and realistic user interactions.
Moreover, we conduct detailed quantitative and qualitative evaluations to benchmark the performance of
Integrated Development Environment (IDE) element detection, color-to-black-and-white conversion, and Optical
Character Recognition (OCR). We hope that our contributions facilitate more research in coding screencast
analysis, and we make the source code for creating the dataset and the benchmark publicly available at
a-nau.github.io/codescan.
1 INTRODUCTION
In the ever-evolving landscape of programming education and knowledge dissemination, coding screencasts
on platforms such as YouTube have emerged as powerful tools for both novice and experienced developers.
These video tutorials not only provide a visual walkthrough of coding processes but also offer a
unique opportunity for learners to witness real-time problem-solving, coding techniques, and best practices.
As the popularity of coding screencasts continues to soar, the possibility of augmenting traditional
video content with additional information to improve the learner's experience becomes increasingly
interesting and relevant. The source code extracted from a screencast, which replicates the full coding
project up to the given position in the video, is one such element that can empower learners with the
ability to delve deeper into the presented material.
Figure 1: Source code extraction pipeline: Given an image from a coding screencast (1), the IDE elements need to be detected (2). This is followed by an optional binarization step (3) which converts the color image into a black-and-white image, and finally OCR is applied (4) to read out the source code.
Incorporating such code extraction into the realm of coding screencasts enables learners to benefit
from a dual advantage. Firstly, it empowers video search algorithms to utilize the full source code,
enhancing the precision and relevance of search results. By indexing and analyzing the extracted code,
search engines can pinpoint specific programming concepts, language features, or problem-solving techniques,
leading learners to the most pertinent videos that align
with their educational needs. Rather than watching
an entire video in search of a specific code segment,
learners can navigate directly to the relevant sections,
streamlining the learning process and increasing over-
all comprehension. Secondly, having access to the
full code at any given point in a coding screencast also
empowers learners to engage in hands-on experimen-
tation seamlessly. By providing learners with the en-
tire codebase in its current state, irrespective of their
position in the video timeline, the educational expe-
rience is transformed into an interactive playground
for exploration and experimentation, where the code
can be executed and experimented with at any stage
within the video.
However, the task of retrieving the full source
code of a project at any given point in a coding
screencast is challenging: Instructors frequently nav-
igate between different files within a project, intro-
duce small modifications or copy-paste large amounts
of code and additionally, the character distribution of
source code differs strongly from standard text para-
graphs. To cope with this, it is crucial to understand the visible IDE components in addition to being
able to extract the visible source code as text. To this end, we present CodeSCAN, a novel large-scale
and high-quality dataset for coding screencast analysis. It contains more than 12,000 screenshots of
coding projects in 24 different programming languages. These screenshots are independent and cannot
be composed into coherent videos. While full programming tutorials would enable better end-to-end
testing of the pipeline, it is important to note that all current approaches only use single
frames as input. CodeSCAN is the first dataset of its
kind that sets itself apart by its unprecedented size,
diversity and annotation granularity. We use Visual
Studio Code with approximately 100 different themes
and 25 different fonts. Through automated interac-
tion with Visual Studio Code, we achieve large vi-
sual diversity by e.g. editing code, highlighting ar-
eas, performing search, and much more. In addition
to the dataset, we present a detailed literature review
and analyze the performance for text recognition on
integrated development environment (IDE) images in
detail. More specifically, we analyze the influence of
different OCR engines, image binarization and image
quality. An overview of a potential source code ex-
traction pipeline is outlined in Fig. 1.
To summarize, the main contributions of our work are:
- we introduce CodeSCAN, a novel high-quality dataset for screencast analysis of video programming tutorials with 12,000 screenshots containing a multiplicity of annotations and high diversity regarding programming languages and visual appearance,
- we evaluate a baseline CNN on diverse IDE element detection and propose the usage of a so-called coding grid for source code localization, and
- we benchmark five different OCR engines and analyze the influence of image binarization and image quality on performance.
The remainder of this work is organized as fol-
lows. We present an overview of related work in
Sec. 2. Subsequently, we present the details of our
dataset generation approach in Sec. 3. Sec. 4 presents
the evaluation for IDE element detection and OCR.
Finally, Sec. 5 concludes the paper.
2 RELATED WORK
We review the existing literature on approaches for
code extraction, tools using code extraction, and, fi-
nally, available datasets.
2.1 Approaches
A common pipeline to extract the source code from a
programming screencast as text looks as follows: (1)
the video is split up into images and duplicates are
removed, (2) the image is classified as containing or
not containing source code, (3) the source code is lo-
cated by predicting a bounding box, and finally (4)
OCR is applied within this bounding box to extract
the source code as text. This full pipeline is neces-
sary to be able to perform inference on new, unseen
video programming tutorials. However, the literature
frequently focuses only on a subset of these tasks.
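To make this generic pipeline concrete, the following minimal Python sketch wires the four steps together with OpenCV and Tesseract. It is only an illustration: de-duplication, code classification and code localization are reduced to trivial placeholders (sampling every N-th frame, treating the whole frame as the code region, and treating any non-empty OCR output as code), and the sketch does not correspond to any of the surveyed approaches.

```python
import cv2                    # OpenCV for video decoding and image handling
import pytesseract            # Python wrapper around the Tesseract OCR engine


def extract_frames(video_path, frame_stride=30):
    """(1) Sample every `frame_stride`-th frame as a crude stand-in for
    frame extraction with duplicate removal."""
    frames, cap = [], cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % frame_stride == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames


def locate_code(frame):
    """(3) Placeholder localization that returns the full frame as the code
    region; a real system would use a trained detector here."""
    height, width = frame.shape[:2]
    return 0, 0, width, height


def extract_code(video_path):
    """Run the pipeline and return (frame_index, ocr_text) tuples."""
    snippets = []
    for i, frame in enumerate(extract_frames(video_path)):
        x, y, w, h = locate_code(frame)               # (3) bounding box
        crop = frame[y:y + h, x:x + w]
        text = pytesseract.image_to_string(crop)      # (4) OCR inside the box
        if text.strip():                              # (2) crude "contains code" proxy
            snippets.append((i, text))
    return snippets
```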
Ott et al. (2018a) trained a CNN classifier to iden-
tify whether a screenshot contains fully visible code,
partially visible code, handwritten code or no code.
Ott et al. (2018b) use a CNN to locate the code and to
classify its programming language (Java or Python)
purely by using image features. Alahmadi et al.
(2020b) tackle code localization to remove the noise
in OCR results which is introduced by additional IDE
elements such as the menu by comparing five differ-
ent backbone architectures. Malkadi et al. (2020) focus on comparing different OCR engines, thus focusing on step (4). The literature analyzes between one
and seven different programming languages (if these
are specified). Java is most popular (Alahmadi et al.,
2018, 2020b; Bao et al., 2019, 2020b, 2020a; Bergh
et al., 2020; Ott et al., 2018a, 2018b; Ponzanelli et
al., 2016b, 2016a, 2019; Yadid & Yahav, 2016; Zhao
et al., 2019), followed by Python (Alahmadi et al.,
2020b; Bergh et al., 2020; Malkadi et al., 2020; Ott
et al., 2018b; Zhao et al., 2019), C# (Alahmadi et
al., 2020b; Malkadi et al., 2020) and XML (Alahmadi,
2023). Also, different IDEs, such as Eclipse (Bao et
al., 2015, 2017, 2019), Visual Studio (Malkadi et al.,
2020) or a diverse mix of multiple IDEs (Alahmadi et
al., 2018, 2020b, 2020a; Alahmadi, 2023; Malkadi et
al., 2020; Ott et al., 2018a, 2018b; Zhao et al., 2019)
are used. Other related tasks include recognizing user
actions (e.g. enter or select text) in coding screen-
casts (Zhao et al., 2019) and identifying UI screens in
videos (Alahmadi et al., 2020a).
2.2 Tools
Ponzanelli et al. (2016b) present CodeTube, a recom-
mender system that leverages textual, visual and au-
dio information from coding screencasts. The user inputs a query, and CodeTube intends to return a relevant, cohesive and self-contained video. Bao et al.
(2020b) introduce ps2code, a tool which leverages
code extraction to provide a coding screencast search
engine and a screencast watching tool which enables
interaction. PSFinder (Yang et al., 2022) classifies
screenshots into containing or not containing an IDE
to identify live-coding screencasts.
2.3 Datasets
To the best of our knowledge, only two of the pre-
viously mentioned works have their full annotated
datasets publicly available. The dataset presented by
Malkadi et al. (2020) comprises screenshots of only
the visible code with the associated source code files
in Java, Python and C#. The dataset was created
manually, and no pixel-wise correspondence between
source code and images is available. Furthermore,
there is low diversity in the IDE color scheme and
font. Alahmadi et al. (2020a) present a dataset where
UI screenshots are annotated with bounding boxes.
Since they do not target code extraction, no such an-
notations are provided.
2.4 Discussion
To empower learners to engage in hands-on experi-
mentation seamlessly and to improve the search qual-
ity within and across videos, it is crucial to be able
to extract the source code of a complete project. This requires, for example, identifying the file tree and the currently active tab. Moreover, it is important to reliably and incrementally edit existing files. In order to
enable such sophisticated full-project source code ex-
traction, it is essential to have a large, high-quality
dataset with detailed annotations. Since currently
only one relevant annotated dataset with insufficient
annotation granularity is publicly available (Malkadi
et al., 2020), we present the novel dataset CodeS-
CAN. CodeSCAN covers 24 different programming
languages, over 90 Visual Studio Code themes and 25
different fonts. We provide a multiplicity of automat-
ically annotated pixel-accurate annotations that reach
far beyond code localization. Furthermore, we are the
first to benchmark diverse IDE element localization
and to analyze the influence of image binarization and
image quality on OCR.
3 DATASET GENERATION
Our dataset was acquired by scraping data from
https://github.dev. The details on the dataset acqui-
sition will be presented in Sec. 3.1. Subsequently,
we lay out our annotation generation approach in
Sec. 3.2.
3.1 Data Acquisition
For the data acquisition, we exploit the fact that
any Github repository can be opened with a browser
version of Visual Studio Code by changing the
URL from https://github.com/USERNAME/REPONAME
to https://github.dev/USERNAME/REPONAME. Thus, we
randomly select 100 repositories with a permissive
license (MIT, BSD-3 or WTFPL) per programming
language. Since we consider 24 programming lan-
guages, this results in a total of 2,400 repositories.
The considered programming languages are C, C#,
C++, CoffeeScript, CSS, Dart, Elixir, Go, Groovy,
HTML, Java, JavaScript, Kotlin, Objective-C, Perl,
PHP, PowerShell, Python, Ruby, Rust, Scala, Shell,
Swift, and TypeScript. We select five files with
file endings corresponding uniquely to the underly-
ing language (e.g. .py for Python) at random per
repository, leading to a total of 12,000 files. To
achieve highly diverse IDE appearances, we use more
than 90 different Visual Studio Code themes and 25
monospaced fonts. Moreover, we apply the following fully automated, random layout changes:
- different window sizes (desktop: 1920 × 1200, 1920 × 1080, 1200 × 1920, 1440 × 900; tablet: 768 × 1024, 1280 × 800, 800 × 1280),
- layout type (classic or centered layout),
- coding window width,
- degree of zoom for the source code text,
- output panel visibility, size and type (e.g. Terminal, Debugger, etc.),
- sidebar visibility, location, size and type (toggling sidebar options),
- menubar visibility (classic, compact, hidden, visible),
- activity bar visibility,
- status bar visibility,
- breadcrumbs visibility,
- minimap visibility, and
- control characters and white spaces rendering visibility,
and the following realistic user interactions:
- highlighting multiple lines of code,
- right clicking at random positions in the code,
- adding or removing characters from the source code,
- search within the opened file,
- search within the project, and
- opening a random number of other tabs.
The automation of all these appearance changes was implemented using the framework Selenium (see https://www.selenium.dev/documentation/).
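As a rough illustration of how such browser automation could look, the sketch below opens a repository in the browser-based Visual Studio Code via github.dev and applies a random editor zoom with Selenium. The keyboard shortcut, waiting times and example repository are illustrative assumptions rather than the exact CodeSCAN configuration.

```python
import random
import time

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys


def open_scene(username: str, reponame: str) -> webdriver.Chrome:
    """Open a GitHub repository in the browser version of Visual Studio Code."""
    driver = webdriver.Chrome()
    # github.com/USER/REPO -> github.dev/USER/REPO opens VS Code for the Web
    driver.get(f"https://github.dev/{username}/{reponame}")
    time.sleep(15)  # illustrative fixed wait until the editor has loaded
    return driver


def randomize_zoom(driver: webdriver.Chrome, max_steps: int = 4) -> None:
    """Apply a random zoom level via keyboard shortcuts (illustrative: Ctrl+'='/Ctrl+'-')."""
    steps = random.randint(-max_steps, max_steps)
    key = "=" if steps >= 0 else "-"
    for _ in range(abs(steps)):
        ActionChains(driver).key_down(Keys.CONTROL).send_keys(key) \
                            .key_up(Keys.CONTROL).perform()
        time.sleep(0.2)


if __name__ == "__main__":
    driver = open_scene("rrousselGit", "flutter_hooks")  # repository shown in Fig. 2
    randomize_zoom(driver)
```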
We refer to one such final IDE configuration (i.e. the full visual appearance of the IDE with the currently active source code file and its appearance characteristics) as a scene in the following. To persist the current scene, we export the scene webpage using SingleFile (https://github.com/gildas-lormeau/SingleFile). In addition, we save the full source code of
the currently visible file as reference. Thus, at the end
of the dataset acquisition phase, we have a source.txt
and page.html file for each of the 12,000 code files.
Note that selecting random IDE themes is particularly challenging to automate, due to strongly varying theme loading times and because themes frequently have multiple available color schemes; thus, the desired theme was not always applied correctly for each scene. Since each scene is represented in a single HTML file, however, we are able to leverage the CSS specifications (background and font color) to perform theme-wise clustering after the data acquisition for quality assurance.
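A minimal sketch of this quality-assurance step could group scenes by a crude color signature extracted from the saved page.html. The regular expressions and the per-scene directory layout below are illustrative assumptions.

```python
import re
from collections import defaultdict
from pathlib import Path


def theme_signature(html_path: Path) -> tuple:
    """Extract a crude (background, foreground) color signature from a saved scene."""
    html = html_path.read_text(encoding="utf-8", errors="ignore")
    # Illustrative: take the first background-color and color declarations found.
    bg = re.search(r"background-color:\s*(#[0-9a-fA-F]{6}|rgb\([^)]*\))", html)
    fg = re.search(r"[^-]color:\s*(#[0-9a-fA-F]{6}|rgb\([^)]*\))", html)
    return (bg.group(1) if bg else None, fg.group(1) if fg else None)


def cluster_scenes_by_theme(scene_dir: str) -> dict:
    """Group scenes with identical color signatures for manual inspection,
    assuming one subdirectory per scene containing page.html."""
    clusters = defaultdict(list)
    for html_path in Path(scene_dir).glob("*/page.html"):
        clusters[theme_signature(html_path)].append(html_path)
    return clusters
```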
3.2 Annotation Generation
We utilize the HTML and source code file to automat-
ically generate all annotations. The bounding box of
any IDE element can be retrieved by manually iden-
tifying the CSS selector for the relevant element in-
side the HTML and accessing its bounding box in-
formation. Since the naming is consistent across all
HTML files, this process is necessary only once per
element. We annotate numerous IDE elements (e.g. coding area, active tab, status bar, etc.) and refer to Tab. 1 for a full list and to Fig. 2b for a visualization.
Note that arbitrary additional annotations for other
IDE elements can be added any time to our dataset.
By taking an automated screenshot of the scene using
the original resolution, we get the underlying image to
which the annotations can be applied directly without
any offsets.
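As a rough sketch of this procedure, a scene's page.html can be rendered in a headless browser and queried via CSS selectors. The selector strings below are hypothetical placeholders, since the real class names have to be identified once in the VS Code DOM.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Hypothetical selectors: the actual class names must be looked up once in the DOM.
IDE_ELEMENT_SELECTORS = {
    "status_bar": "div.statusbar",
    "active_tab": "div.tab.active",
    "minimap": "div.minimap",
}


def element_bounding_boxes(page_html_path: str) -> dict:
    """Return {element_name: (x, y, width, height)} for the selected IDE elements."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    driver.get(f"file://{page_html_path}")
    boxes = {}
    for name, selector in IDE_ELEMENT_SELECTORS.items():
        elements = driver.find_elements(By.CSS_SELECTOR, selector)
        if elements:
            r = elements[0].rect  # {'x': ..., 'y': ..., 'width': ..., 'height': ...}
            boxes[name] = (r["x"], r["y"], r["width"], r["height"])
    driver.quit()
    return boxes
```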
Moreover, we extract annotations for text detec-
tion and recognition for the source code of the active
file visible in the coding area. This is done on a per-
line and per-word basis. Additionally, we provide an-
notations for the coding grid, i.e. the tight bounding
box around the visible code including the character
height and width. These annotations of only six numbers (four for the bounding box plus character width and height) enable the computation of the coding grid as visualized in Fig. 2c. The coding grid can be used to retrieve line- and character-based text bounding boxes and brings the additional benefit that it is much easier to accurately retrieve indentation, which is crucial for languages such as Python.
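Because the coding grid is fully specified by these six numbers, per-line and per-character boxes can be reconstructed on the fly. The following sketch illustrates this; the field and method names are our own illustrative choices, not the dataset's annotation keys.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


@dataclass
class CodingGrid:
    x: float            # left edge of the tight bounding box around the visible code
    y: float            # top edge
    width: float
    height: float
    char_width: float   # width of one (monospaced) character cell
    char_height: float  # height of one text line

    def line_box(self, row: int) -> Box:
        """Bounding box of the `row`-th visible code line (0-indexed)."""
        return (self.x, self.y + row * self.char_height, self.width, self.char_height)

    def char_box(self, row: int, col: int) -> Box:
        """Bounding box of a single character cell, e.g. to measure indentation."""
        return (self.x + col * self.char_width,
                self.y + row * self.char_height,
                self.char_width, self.char_height)

    def indentation(self, first_char_x: float) -> int:
        """Number of leading character cells before the first glyph of a line."""
        return round((first_char_x - self.x) / self.char_width)
```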
Since we additionally analyze the influence of bi-
narization, i.e. conversion to a clean black-and-white
version of the screenshot, we automatically generate
annotations for this use case. This is done by creating
a binarized HTML version of the scene by identifying
the relevant CSS selectors for source code text. We
enforce black color for all source code text and white
for everything else by adjusting the CSS parameters.
See Fig. 2d for an example.
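A minimal sketch of such a binarized rendering could append an overriding stylesheet to the saved page. The selector .view-line (VS Code's per-line text container) and the simplified rule set are assumptions for illustration, not the exact selectors we adjust.

```python
from pathlib import Path

# Assumption: .view-line spans hold the rendered source code text in VS Code.
BINARIZATION_CSS = """
<style>
  * { background: #ffffff !important; color: #ffffff !important;
      text-shadow: none !important; border-color: #ffffff !important; }
  .view-line, .view-line span { color: #000000 !important; }
</style>
"""


def write_binarized_scene(page_html: str, output_html: str) -> None:
    """Append an overriding stylesheet so only source code text stays black."""
    html = Path(page_html).read_text(encoding="utf-8", errors="ignore")
    Path(output_html).write_text(html + BINARIZATION_CSS, encoding="utf-8")
```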
The dataset is split into a training, validation and
test set of sizes 7,561, 1,921, and 2,518, respectively.
There is no overlap of fonts across the splits, and only
a minor overlap across themes, which was inevitable
due to the previously mentioned issues in automating
the theme selection process.
4 EVALUATION
Coding screencast analysis comprises several parts.
For our evaluations, we assume videos are processed
frame-by-frame and thus, consider only single frames
as input for the respective components. We evaluate
IDE element detection in Sec. 4.1. Next, we evalu-
ate image binarization in Sec. 4.2, which can help to
improve text recognition performance as analyzed in
Sec. 4.3. In addition to the influence of binarization,
we also compare different OCR engines and inves-
tigate how they are affected by changes in the input
image quality.
(a) Color Image of Scene
(b) Bounding Box Annotations
(c) Coding Grid Annotation
(d) Black-and-white Image of Scene
Figure 2: Example of the different available
annotation and data types showing the Github
repository rrousselGit/flutter_hooks with file
use_automatic_keep_alive_test.dart. Note that we
omit line, word and character annotations in (b) for better
readability.
4.1 Object Detection
Since the focus of this work is not to improve object
detection algorithms, we benchmark our dataset on
a well-established baseline. We use a Mask R-CNN
(He et al., 2017) with a ResNet-50-FPN (He et al.,
2016; Lin et al., 2017) backbone that was pre-trained
on MS COCO (Lin et al., 2014) for our experiments.
We freeze the weights at stage four and use stochas-
tic gradient descent with momentum, a batch size of
16 and a cosine learning rate schedule (Loshchilov
& Hutter, 2017). For the learning rate schedule, we
set the initial learning rate to 0.001, the final learning rate to 0 after 10,000 iterations, and use a linear warm-up during the first 1,000 iterations. The results are summarized in Tab. 1. Some elements, such as sidebar,
tabs and actions container, minimap and output panel
achieve a Box AP above 80. Other elements, such as
title bar, notifications, last code line number, suggestion widget, and tabs breadcrumbs are harder to detect, and their Box AP lies below 50. The bounding box of the
coding grid and the ones for source code text lines are
most important for text detection and recognition, and
achieve a Box AP of 69.9 and 71.9, respectively.
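For reference, a configuration along these lines can be expressed with Detectron2 roughly as sketched below. The dataset names codescan_train/codescan_val are hypothetical placeholders for registered splits, and the exact freezing semantics and scheduler flags are assumptions rather than our verbatim training script.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
# Mask R-CNN with a ResNet-50-FPN backbone pre-trained on MS COCO
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")

cfg.DATASETS.TRAIN = ("codescan_train",)   # hypothetical registered split names
cfg.DATASETS.TEST = ("codescan_val",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 19       # IDE element classes listed in Tab. 1

cfg.MODEL.BACKBONE.FREEZE_AT = 4                 # freeze backbone weights up to stage four
cfg.SOLVER.IMS_PER_BATCH = 16                    # batch size of 16
cfg.SOLVER.BASE_LR = 0.001                       # initial learning rate
cfg.SOLVER.LR_SCHEDULER_NAME = "WarmupCosineLR"  # cosine decay of the learning rate
cfg.SOLVER.MAX_ITER = 10_000                     # learning rate reaches 0 after 10,000 iterations
cfg.SOLVER.WARMUP_ITERS = 1_000                  # linear warm-up during the first 1,000 iterations

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```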
Table 1: Quantitative object detection performance per class
using Box AP.
Bounding Box
Class name AP AP50 AP75
sidebar 92.0 99.8 98.7
minimap 84.6 99.0 97.6
tabs_and_actions_container 82.9 98.2 96.4
output_panel 82.4 97.6 93.9
status_bar 79.5 99.0 95.4
activity_bar 76.8 95.6 82.0
editor_container 72.5 92.4 84.7
line 71.9 93.3 87.4
code_container 71.9 90.9 82.9
coding_grid 69.9 96.7 78.6
code_line_number_bar 62.1 88.3 66.9
scrollbar 56.3 80.0 70.1
active_tab 53.1 70.0 60.0
find_widget 53.1 69.6 63.8
last_code_line_number 48.7 80.7 56.2
notifications 48.4 89.2 40.6
suggestion_widget 48.2 68.9 57.4
tabs_breadcrumbs 45.4 71.7 53.0
title_bar 31.6 44.5 40.2
4.2 Binarization
We investigate the performance of image binarization
approaches for converting a screenshot from a cod-
ing screencast to a binarized black-and-white version.
For our experiments, we train Pix2PixHD (Wang et
al., 2018), the successor of the very popular Pix2Pix
(Isola et al., 2017) architecture for image-to-image
translation. We train for 40 epochs and leave the
remaining original training configuration unchanged.
The results are summarized in Tab. 2 and we provide
qualitative samples in Figs. 3 and 4. Overall, bina-
rization works well as indicated by the mean Peak
Signal-to-Noise Ratio (PSNR) of 5.49 in Tab. 2 and
the qualitative example in Fig. 3. The model seems to
have difficulties especially with rare strong contrasts
and very low contrast as indicated by the examples in
Fig. 4.
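The reported metrics can be computed per image with standard formulas; the following is a minimal sketch for pixel accuracy and PSNR between a predicted and a ground-truth black-and-white image (the DRD metric is omitted for brevity).

```python
import numpy as np


def binarization_metrics(pred: np.ndarray, target: np.ndarray) -> dict:
    """Pixel accuracy and PSNR between a predicted and a ground-truth
    black-and-white image, both given as uint8 arrays in [0, 255]."""
    pred = pred.astype(np.float64)
    target = target.astype(np.float64)

    # Pixel accuracy after thresholding both images to {0, 1}
    pred_bin = (pred > 127).astype(np.uint8)
    target_bin = (target > 127).astype(np.uint8)
    accuracy = 100.0 * (pred_bin == target_bin).mean()

    # Peak Signal-to-Noise Ratio with a maximum pixel value of 255
    mse = np.mean((pred - target) ** 2)
    psnr = float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

    return {"accuracy": accuracy, "psnr": psnr}
```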
Figure 3: Qualitative examples of the binarization showing
the input image on the top and the predicted binarization on
the bottom.
Table 2: Quantitative binarization performance.
Accuracy PSNR DRDM
Median 72.07 5.54 67.52
25% Quant. 67.09 4.82 62.74
75% Quant. 76.01 6.20 73.46
Mean 70.81 5.49 165.70
Std. 7.77 1.08 1699.39
4.3 Optical Character Recognition
We benchmark five different approaches for text
recognition in the following. For the evaluation, we
create a subset of our test split by randomly sampling
10 words from each of the 2,518 test images to re-
duce the computational workload while maintaining
statistical significance. To eliminate the influence of
text detection performance, we use the ground truth
bounding boxes. We investigate the influence of bina-
rization (color image vs. ground truth binarization vs.
predicted binarization) and image quality (20%-100%
of original image size as input) on text recognition
performance. The state-of-the-art text recognizers we
benchmark are:
- Tesseract (Smith, 2007) (common baseline),
- SAR (Li et al., 2019) pre-trained mainly on MJSynth (Jaderberg et al., 2014, 2016), SynthText (Gupta et al., 2016) and SynthAdd (Li et al., 2019),
- MASTER (Lu et al., 2021) pre-trained on MJSynth (Jaderberg et al., 2014, 2016), SynthText (Gupta et al., 2016) and SynthAdd (Li et al., 2019),
- ABINet (Fang et al., 2021) pre-trained on MJSynth (Jaderberg et al., 2014, 2016) and SynthText (Gupta et al., 2016), and
- TrOCR (Li et al., 2023) fine-tuned on the SROIE dataset (Huang et al., 2019); see https://rrc.cvc.uab.es/?ch=13.
Note that TrOCR does not distinguish lowercase
and capital letters. Thus, we convert the ground truth
to lowercase for computing the mean Character Error
Rate (CER) in only this case. This obviously sim-
plifies the task, which is why these results cannot be
fairly compared with the other approaches.
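For clarity, the Character Error Rate used here is the edit distance between the predicted and ground-truth word normalized by the ground-truth length; a minimal sketch, including the lowercase comparison applied only to TrOCR, is shown below.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def character_error_rate(prediction: str, ground_truth: str,
                         ignore_case: bool = False) -> float:
    """CER = edit distance / ground-truth length; `ignore_case` mimics the
    lowercase comparison applied for TrOCR, which is case-insensitive."""
    if ignore_case:
        prediction, ground_truth = prediction.lower(), ground_truth.lower()
    if not ground_truth:
        return float(len(prediction) > 0)
    return levenshtein(prediction, ground_truth) / len(ground_truth)
```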
On the full image resolution, we can observe from
Tab. 3 that TrOCR and Tesseract benefit most from
having access to the ground truth or predicted bina-
rized version of the input image. Overall best perfor-
mance is 0.08 mean CER for TrOCR and the ground
truth binarized input images. The performance of
SAR, MASTER and ABINet is more robust on color
images which means those approaches do not bene-
fit from a prior conversion to a black-and-white im-
age using Pix2PixHD. We argue that these differences
stem from the differences in training data. In sce-
narios where mostly binarized training data is avail-
able, using image-to-image translation approaches is
a valid and helpful way to boost performance on the
final task.
All approaches show robust performance when
the image quality is degraded slightly, i.e. when it is
reduced to 60%-70% of the original image size. This
effect is constant across image types as seen in Fig. 5.
Figure 4: Challenging qualitative examples of the binarization showing the input image on the left and the predicted binariza-
tion on the right.
Table 3: Quantitative results at full image resolution using the binarized, converted (from Sec. 4.2) and color image. We report
the mean (standard deviation) for the Character Error Rate (CER). For TrOCR, the CER is computed ignoring capitalization.
CER Binarized (GT) Binarized (Pred.) Color
ABINet (Fang et al., 2021) 0.37 (0.59) 0.43 (0.78) 0.43 (0.94)
Tesseract (Smith, 2007) 0.32 (0.45) 0.41 (0.46) 0.67 (0.51)
SAR (Li et al., 2019) 0.26 (0.73) 0.35 (0.97) 0.28 (0.86)
MASTER (Lu et al., 2021) 0.24 (0.95) 0.35 (0.97) 0.25 (0.99)
TrOCR (Li et al., 2023) 0.08 (0.27) 0.14 (0.33) 0.25 (0.93)
5 CONCLUSION
Coding screencasts have emerged as powerful tools
for programming education of novices as well as ex-
perienced developers. In addition to the video con-
tent, having access to the full source code of the
project at any given time of the programming tuto-
rial brings two major benefits: (1) platform-wide and
tutorial-based search can leverage the full source code
to enhance the precision and relevance of search re-
sults and (2) it empowers learners to engage in hands-
on experimentation with the source code seamlessly.
In this work, we work towards enabling detailed in-
formation retrieval from such coding screencasts. We
present a novel high-quality dataset called CodeS-
CAN, comprising 12,000 fully annotated IDE screen-
shots. CodeSCAN is highly diverse and contains 24
programming languages, over 90 different themes of
Visual Studio Code, and 25 fonts while at the same
time varying the IDE appearance through changing
the visibility, position and size of different IDE ele-
ments (e.g. sidebar, output panel). Our evaluations
show that baseline object detectors are suitable for
text detection and achieve a Box AP for source
code line detection of 71.9. Moreover, we showed
that baseline image-to-image translation architectures
are well suited for coding screencast image binariza-
tion.
Figure 5: Quantitative evaluation of OCR performance for different image qualities: (a) binarized (GT) images, (b) binarized (pred.) images, (c) color images.
Since we used baseline architectures for object detection and image binarization, we expect that performance can be increased significantly by resorting to state-of-the-art models. Finally, we compared dif-
ferent OCR engines and analyzed their dependence on
image quality and image binarization. We found that
binarization can boost performance for some OCR
engines, while slight quality degradations in terms
of image resolution do not significantly affect the
text recognition quality. In future work, we plan to tackle the full pipeline for source code retrieval from programming videos by leveraging the CodeSCAN dataset.
Several future directions are very interesting.
Since we limit ourselves to screenshot analysis, tracking, composing and synchronizing files during a video tutorial remains an open task. In addition, eval-
uation metrics could move from classical text recog-
nition towards evaluating code executability and the
tree distance to the original abstract syntax tree. Fi-
nally, the coding grid parameters could be estimated
using an additional coding grid head. Its availability
is expected to significantly simplify the identification
of the correct indentation.
REFERENCES
Alahmadi, M., Hassel, J., Parajuli, B., Haiduc, S., & Ku-
mar, P. (2018). Accurately Predicting the Location
of Code Fragments in Programming Video Tutorials
Using Deep Learning. Proceedings of the 14th Inter-
national Conference on Predictive Models and Data
Analytics in Software Engineering.
Alahmadi, M., Khormi, A., & Haiduc, S. (2020a). UI
Screens Identification and Extraction from Mobile
Programming Screencasts. 2020 IEEE/ACM 28th In-
ternational Conference on Program Comprehension
(ICPC).
Alahmadi, M., Khormi, A., Parajuli, B., Hassel, J., Haiduc,
S., & Kumar, P. (2020b). Code localization in pro-
gramming screencasts. Empirical Software Engineer-
ing, 25 (2).
Alahmadi, M. D. (2023). VID2XML: Automatic Extrac-
tion of a Complete XML Data From Mobile Program-
ming Screencasts. IEEE Transactions on Software En-
gineering, 49 (4).
Bao, L., Li, J., Xing, Z., Wang, X., Xia, X., & Zhou, B.
(2017). Extracting and analyzing time-series HCI data
from screencaptured task videos. Empirical Software
Engineering, 22 (1).
Bao, L., Li, J., Xing, Z., Wang, X., & Zhou, B. (2015).
Reverse engineering time-series interaction data from
screen-captured videos. 2015 IEEE 22nd Interna-
tional Conference on Software Analysis, Evolution,
and Reengineering (SANER).
Bao, L., Pan, S., Xing, Z., Xia, X., Lo, D., & Yang, X.
(2020a). Enhancing developer interactions with pro-
gramming screencasts through accurate code extrac-
tion. Proceedings of the 28th ACM Joint Meeting on
European Software Engineering Conference and Sym-
posium on the Foundations of Software Engineering.
Bao, L., Xing, Z., Xia, X., & Lo, D. (2019). VT-Revolution:
Interactive Programming Video Tutorial Authoring
and Watching System. IEEE Transactions on Software
Engineering, 45 (8).
Bao, L., Xing, Z., Xia, X., Lo, D., Wu, M., & Yang,
X. (2020b). Psc2code: Denoising Code Extraction
from Programming Screencasts. ACM Transactions
on Software Engineering and Methodology, 29 (3).
Bergh, A., Harnack, P., Atchison, A., Ott, J., Eiroa-Lledo,
E., & Linstead, E. (2020). A Curated Set of Labeled
Code Tutorial Images for Deep Learning. 2020 19th
IEEE International Conference on Machine Learning
and Applications (ICMLA).
Fang, S., Xie, H., Wang, Y., Mao, Z., & Zhang, Y. (2021).
Read Like Humans: Autonomous, Bidirectional and
Iterative Language Modeling for Scene Text Recogni-
tion. 2021 IEEE/CVF Conference on Computer Vision
and Pattern Recognition.
Gupta, A., Vedaldi, A., & Zisserman, A. (2016). Syn-
thetic Data for Text Localisation in Natural Images.
2016 IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR).
He, K., Gkioxari, G., Dollar, P., & Girshick, R. (2017).
Mask R-CNN. IEEE International Conference on
Computer Vision (ICCV).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual
Learning for Image Recognition. IEEE Conference on
Computer Vision and Pattern Recognition (CVPR).
Huang, Z., Chen, K., He, J., Bai, X., Karatzas, D., Lu, S.,
& Jawahar, C. V. (2019). ICDAR2019 Competition
on Scanned Receipt OCR and Information Extraction.
2019 International Conference on Document Analysis
and Recognition (ICDAR).
Isola, P., Zhu, J.-Y., Zhou, T., & Efros, A. A. (2017).
Image-to-Image Translation with Conditional Adver-
sarial Networks. 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR).
Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A.
(2016). Reading Text in the Wild with Convolutional
Neural Networks. International Journal of Computer
Vision, 116 (1).
Jaderberg, M., Vedaldi, A., & Zisserman, A. (2014). Deep Features for Text Spotting. Computer Vision – ECCV 2014.
Li, H., Wang, P., Shen, C., & Zhang, G. (2019). Show,
Attend and Read: A Simple and Strong Baseline for
Irregular Text Recognition. Proceedings of the AAAI
Conference on Artificial Intelligence, 33 (01).
Li, M., Lv, T., Chen, J., Cui, L., Lu, Y., Floren-
cio, D., Zhang, C., Li, Z., & Wei, F. (2023).
TrOCR: Transformer-Based Optical Character Recog-
nition with Pre-trained Models. Proceedings of the
AAAI Conference on Artificial Intelligence, 37 (11).
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B.,
& Belongie, S. (2017). Feature Pyramid Networks
for Object Detection. IEEE Conference on Computer
Vision and Pattern Recognition (CVPR).
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P.,
Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Mi-
crosoft COCO: Common Objects in Context. Com-
puter Vision – ECCV 2014.
Loshchilov, I., & Hutter, F. (2017). SGDR: Stochastic gra-
dient descent with warm restarts. 5th International
Conference on Learning Representations (ICLR).
Lu, N., Yu, W., Qi, X., Chen, Y., Gong, P., Xiao, R., & Bai,
X. (2021). MASTER: Multiaspect non-local network
for scene text recognition. Pattern Recognition, 117.
Malkadi, A., Alahmadi, M., & Haiduc, S. (2020). A Study
on the Accuracy of OCR Engines for Source Code
Transcription from Programming Screencasts. Pro-
ceedings of the 17th International Conference on Min-
ing Software Repositories.
Ott, J., Atchison, A., Harnack, P., Bergh, A., & Linstead,
E. (2018a). A deep learning approach to identify-
ing source code in images and video. Proceedings of
the 15th International Conference on Mining Software
Repositories.
Ott, J., Atchison, A., Harnack, P., Best, N., Anderson, H.,
Firmani, C., & Linstead, E. (2018b). Learning lexical
features of programming languages from imagery us-
ing convolutional neural networks. Proceedings of the
26th Conference on Program Comprehension.
Ponzanelli, L., Bavota, G., Mocci, A., Di Penta, M.,
Oliveto, R., Hasan, M., Russo, B., Haiduc, S., &
Lanza, M. (2016a). Too Long; Didn’t Watch! Extract-
ing Relevant Fragments from Software Development
Video Tutorials. 2016 IEEE/ACM 38th International
Conference on Software Engineering (ICSE).
Ponzanelli, L., Bavota, G., Mocci, A., Di Penta, M.,
Oliveto, R., Russo, B., Haiduc, S., & Lanza, M.
(2016b). CodeTube: Extracting relevant fragments
from software development video tutorials. Proceed-
ings of the 38th International Conference on Software
Engineering Companion.
Ponzanelli, L., Bavota, G., Mocci, A., Oliveto, R., Penta, M.
D., Haiduc, S., Russo, B., & Lanza, M. (2019). Au-
tomatic Identification and Classification of Software
Development Video Tutorial Fragments. IEEE Trans-
actions on Software Engineering, 45 (5).
Smith, R. (2007). An Overview of the Tesseract OCR En-
gine. Ninth International Conference on Document
Analysis and Recognition (ICDAR 2007), 2.
Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., &
Catanzaro, B. (2018). High-Resolution Image Syn-
thesis and Semantic Manipulation with Conditional
GANs. 2018 IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition (CVPR).
Yadid, S., & Yahav, E. (2016). Extracting code from
programming tutorial videos. Proceedings of the
2016 ACM International Symposium on New Ideas,
New Paradigms, and Reflections on Programming and
Software.
Yang, C., Thung, F., & Lo, D. (2022). Efficient Search of
Live-Coding Screencasts from Online Videos. 2022
IEEE International Conference on Software Analysis,
Evolution and Reengineering.
Zhao, D., Xing, Z., Chen, C., Xia, X., & Li, G. (2019). Ac-
tionNet: Vision-Based Workflow Action Recognition
From Programming Screencasts. 2019 IEEE/ACM
41st International Conference on Software Engineer-
ing (ICSE).