Automated Performance Metrics for Objective Surgical Skill Assessment
in Laparoscopic Training

Asaf Arad¹, Julia Leyva I Torres¹, Kristian Nyborg Jespersen¹, Nicolaj Boelt Pedersen¹,
Pablo Rey Valiente¹, Alaa El-Hussuna² and Andreas Møgelmose¹

¹Department of Electronic Systems, Aalborg University, Denmark
²Open Source Research Collaboration, Aalborg, Denmark
Keywords: Automated Performance Metrics, Laparoscopic Surgery, Multi Object Tracking.
Abstract:
The assessment of surgical skill is critical in advancing surgical training and enhancing the performance of
surgeons. Traditional evaluation methods relying on human observation and checklists are often biased and
inefficient, prompting the need for automated and objective systems. This study explores the use of Auto-
mated Performance Metrics (APMs) in laparoscopic surgeries, using video-based data and advanced object
tracking techniques. A pipeline was developed, combining a fine-tuned YOLO11 model for detection with
state-of-the-art multi-object trackers (MOTs) for tracking surgical tools. Metrics such as path length, velocity,
acceleration, jerk, and working area were calculated to assess technical performance. BoT-SORT emerged as
the most effective tracker, achieving the highest HOTA and MOTA, enabling robust tool tracking. The system
successfully extracted APMs to evaluate and compare surgical performance, demonstrating its potential for
objective assessment. This work validates state-of-the-art algorithms for surgical video analysis, contributing
to improved surgical training and performance evaluation. Future efforts should address limitations like pixel-
based measurements and dataset variability to enhance the system’s accuracy and applicability, ultimately
advancing patient safety and reducing training costs.
1 INTRODUCTION
Accurate assessment of surgical performance is a
cornerstone of surgical training, especially as the
field advances toward proficiency-based methodolo-
gies (Jin et al., 2018; Ebina et al., 2022a; Ebina et al.,
2022b; Guerin et al., 2022). Traditional methods of
evaluating trainees, which rely on human observers
and task-specific checklists, are accessible, but suf-
fer from bias and time inefficiencies (D’Angelo et al.,
2015). These limitations highlight the need for auto-
mated and objective assessment systems which pro-
vide consistent and detailed feedback to trainees.
APMs have emerged as a promising solution, offering objective and data-driven evaluations, as suggested by Ebina et al. (2022b). By lever-
aging video-based data and computer vision tech-
niques, APMs can be used to analyze surgical perfor-
mance with higher precision and reproducibility. Un-
like traditional observer-based methods, APMs elim-
inate bias and offer a standardized approach for eval-
uating surgical skills, allowing surgeons to potentially improve surgical training programs (Buckley et al., 2015; Peng et al., 2019; Sallaberry, Tori, and Nunes, 2022).

Figure 1: Simple illustration showcasing the detection of surgical instruments used to calculate popular performance metrics for surgical skill evaluation.
Recent advancements in artificial intelligence,
particularly in object detection and object tracking,
have enabled the development of robust APM sys-
tems. Technologies such as Convolutional Neural
Networks (CNNs) allow automatic tracking and anal-
ysis of surgical instruments, providing a detailed un-
derstanding of tool usage, motion efficiency, and task
completion. These capabilities pave the way for gen-
erating a wide range of performance metrics, includ-
ing time efficiency, path consistency, and tool uti-
lization patterns, offering valuable insights for both
trainees and trainers (Twinanda et al., 2016; Jin et
al., 2018; Rivas-Blanco et al., 2021).
This work focuses on the identification and im-
plementation of common APMs for surgical perfor-
mance evaluation in laparoscopic surgeries: a mini-
mally invasive surgical method using small incisions
and a camera to guide the procedure (Rivas-Blanco et
al., 2021). Specifically, the project aims to develop a
pipeline of an image recognition system, which is ca-
pable of tracking surgical instruments and calculating
performance metrics based on their motion and usage
during procedures (Figure 1). The main contribution
of this work is validation of state-of-the-art trackers’
abilities to extract popular metrics used for evaluat-
ing surgical skills from 2D captured videos of laparo-
scopic surgeries. In future work, these metrics could
serve as a basis for comparative studies with tradi-
tional surgical skill evaluation techniques, providing
insights into their alignment with surgical expertise.
2 RELATED WORKS
Our survey of the most commonly used APMs found the following: metrics such as path length
(the distance traveled by a tool), operative time (to-
tal and sub-task durations), velocity (tool movement
speed), jerk (smoothness of motion), acceleration
(rate of motion change), and tool angle (orientation
of tools) are frequently implemented for their abil-
ity to evaluate motion efficiency and skill precision
in both laparoscopic and robotic surgeries (Buckley
et al., 2015; Ebina et al., 2022b; Guerin et al., 2022).
Certain metrics, such as force measurements, are
specific to robotic surgeries due to the built-in sam-
pling of instrument sensor data, which enables pre-
cise evaluation of applied forces (Sallaberry, Tori, and
Nunes, 2022; Trejos et al., 2014). Other metrics can
also be categorized into procedure-specific metrics,
such as stitching techniques used in bladder sutur-
ing (Chen et al., 2018). Similarly, deep features ex-
tracted via deep learning approaches can be consid-
ered a separate category (Jin et al., 2018; Reiley et al.,
2011; Rivas-Blanco et al., 2021; Moglia et al., 2021).
Figure 2 summarizes the 10 most commonly used APMs, according to an unstructured literature review we conducted, based on 26 relevant papers found through Google Scholar and PubMed using the keywords Automatic Performance Metrics AND (Surgery OR Surgery Training) and Objective Evaluation AND (Surgery OR Surgery Training).
Figure 2: Table showcasing the 10 most popular perfor-
mance metrics found, ordered by most frequently found
(left) to least frequently found (right).
The majority of popular APMs rely on tracking
surgical tools, often through video data, to provide the
positional and motion information needed for perfor-
mance evaluation. However, tool tracking in surgi-
cal environments is challenging due to factors such as
low-texture organ surfaces, visual artifacts from fluids
or reflections, and dynamic environmental conditions
like blood or smoke (Schmidt et al., 2024).
3 METHODOLOGY
The methodology for this project was designed to de-
velop a comprehensive pipeline for assessing surgi-
cal performance. The process involves the detection,
tracking, and analysis of surgical tools within laparo-
scopic videos, as illustrated in Fig. 3.
The workflow begins with video input of surgical
procedures, where the tools used need to be detected
and localized. A YOLO11 object detection model
was fine-tuned on a surgical tool dataset to detect the
tools’ bounding boxes in each frame. The detected
bounding boxes were then passed to state-of-the-
art MOTs to establish temporal consistency between
frames. The selection of the most suitable tracker was
based on MOT metrics such as HOTA, MOTA, and
IDF1. The tracked tools’ motion data, including path
length, velocity, acceleration, jerk, and working area,
were then aggregated into APMs. These metrics were
compiled into an output table, facilitating compara-
tive analysis of surgical videos and providing possible
insights into the performance of these procedures.
Figure 3: Flow chart of the steps for extracting APMs from surgical videos: input video of a surgical procedure → detection and localization of surgical tools (YOLO11 model fine-tuned on a surgical tool dataset) → tracker chosen based on MOT metrics (HOTA, MOTA, IDF1) → extraction of APMs from the tracked tools → output table containing the APMs.
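As a concrete illustration of this pipeline, the sketch below shows how detection and tracking can be wired together using the Ultralytics API, which bundles BoT-SORT (the tracker ultimately selected in Section 4). The weight file and video path are illustrative placeholders, not artifacts of this project.

    from collections import defaultdict
    from ultralytics import YOLO

    # Load the fine-tuned detector (hypothetical weight path).
    model = YOLO("yolo11m_m2cai16.pt")

    # Run detection and BoT-SORT tracking over a laparoscopic video,
    # collecting bounding-box centres per track ID for the APM stage.
    tracks = defaultdict(list)  # track id -> [(frame, cx, cy), ...]
    results = model.track(source="cholec80_video01.mp4",
                          tracker="botsort.yaml", stream=True)
    for frame_idx, result in enumerate(results):
        if result.boxes.id is None:  # no confirmed tracks in this frame
            continue
        for box, tid in zip(result.boxes.xywh, result.boxes.id.int()):
            cx, cy = float(box[0]), float(box[1])  # box centre (pixels)
            tracks[int(tid)].append((frame_idx, cx, cy))

Each per-track list of centres is the input assumed by the APM calculations in Section 3.5.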
3.1 Identifying State-of-the-Art Trackers
Three state-of-the-art MOTs were identified and implemented: BoT-SORT, ByteTrack, and UniTrack. These three trackers were selected based on their strong performance on public benchmarks (MOTChallenge, n.d.; Papers with Code, n.d.).

BoT-SORT enhances the Simple Online and Realtime Tracking (SORT) framework with improved Kalman filtering, camera motion compensation, and a combination of Intersection over Union (IoU) and Re-Identification (ReID) (Aharon, Orfaig, and Bobrovsky, 2022).

ByteTrack leverages low-confidence detections to recover missed objects, using a two-step association process (Zhang et al., 2022).

UniTrack employs a unified model for multiple tracking tasks and uses reconstruction-based similarity for better associations (Wang et al., 2021).
3.2 Tracker Evaluation
To evaluate the trackers, MOT metrics have been cal-
culated, including Higher Order Tracking Accuracy
(HOTA), Multi-Object Tracking Accuracy (MOTA),
and Identity F1 (IDF1) for each tracker. These evalu-
ation metrics quantify how well the trackers perform
on a given dataset. MOTA prioritizes detection (identification and localization) accuracy, IDF1 prioritizes association (maintaining object identities between frames) accuracy, and HOTA was designed to balance detection and association accuracy in a way that aligns with human perception (Luiten et al., 2021).
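To make the definitions concrete, the minimal sketch below computes MOTA and IDF1 from aggregate error counts, following their standard formulations; an actual evaluation would rely on an established toolkit such as TrackEval rather than this simplification.

    def mota(misses: int, false_positives: int,
             id_switches: int, num_gt_boxes: int) -> float:
        # CLEAR-MOT accuracy: 1 minus the rate of detection and
        # association errors over all ground-truth boxes.
        return 1.0 - (misses + false_positives + id_switches) / num_gt_boxes

    def idf1(idtp: int, idfp: int, idfn: int) -> float:
        # Identity F1: harmonic mean of identity precision and recall
        # under the optimal one-to-one matching of tracks.
        return 2 * idtp / (2 * idtp + idfp + idfn)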
3.3 Datasets
The data employed is a combination of publicly available laparoscopic video datasets.
The dataset used for model training is m2cai16-tool-
locations (Jin et al., 2018), which includes 2,532
frames annotated with bounding box coordinates for
seven surgical tools (Grasper, Bipolar, Hook, Scis-
sors, Clipper, Irrigator and Specimen Bag). This
dataset was chosen because of its detailed annotations
and diversity of tool types, which facilitated robust
model training. Examples are shown in Figure 4.
Figure 4: Frame examples of m2cai16-tool-locations
dataset.
For testing the trackers’ performance, we used
the Cholec80-Boxes dataset (Abdulbaki Alshirbaji et
al., 2024), which contains annotations at 1 fps for 5
videos from Cholec80 (Twinanda et al., 2016). As op-
posed to m2cai16-tool-locations, this dataset contains
sequential frames arranged in chronological order.
3.4 Fine Tuning Using YOLO11
As all the identified trackers operate on YOLO detections, a YOLO model was fine-tuned to adapt it to our domain. YOLO11 was chosen as it is the latest and best-performing version at the time of writing (Jocher and Qiu, 2024). The model was fine-tuned
exclusively with the m2cai16-tool-locations dataset
using a dataset split of 70%/15%/15% (Training, val-
idation, test).
For fine-tuning, the medium model, YOLO11m,
was chosen, as it has a good trade-off between com-
plexity and performance. The model was trained for
500 epochs using the default hyperparameters pro-
vided by Ultralytics (Jocher and Qiu, 2024).
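A minimal sketch of this fine-tuning step with the Ultralytics API could look as follows; the dataset YAML path is an assumption, and the unspecified arguments fall back to the library defaults mentioned above.

    from ultralytics import YOLO

    # Start from the pretrained medium checkpoint.
    model = YOLO("yolo11m.pt")

    # Fine-tune on m2cai16-tool-locations; the YAML describing the
    # 70/15/15 split is a hypothetical path, not a project artifact.
    model.train(data="m2cai16_tool_locations.yaml", epochs=500)

    # Evaluate on the held-out test split (mAP@0.5, precision, recall).
    metrics = model.val(split="test")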
Figure 5 illustrates the training and validation loss
across the epochs, stopping at epoch 215 due to early
stopping. The model achieved its best performance at
epoch 115, with a box validation loss of 1.38 and a
CLS validation loss of 1.01. The corresponding pre-
cision and recall scores were 0.48 and 0.91, respec-
tively.
Figure 5: Training/Validation loss (Y-axis) of the fine-tuned
YOLO model for each epoch (X-axis).
Testing the model on the remaining 15% of the
data resulted in a mean average precision (mAP@0.5)
of 0.957 across all classes, with the Hook class
achieving the highest mAP@0.5 (0.992), and the
Grasper class achieving the lowest mAP@0.5 (0.896).
Table 1 shows the confusion matrix of our classification model, where columns represent the true classes and rows the predictions. The lowest detection score was observed for the Grasper, which was misclassified as background in 16% of cases. However, it is worth noting that the Grasper appeared far more frequently in the dataset (223 instances) compared to the other classes: Bipolar (72), Hook (49), Scissors (66), Clipper (48), Irrigator (57), and Specimen Bag (77). Generally, the classification errors primarily stem from missed detections rather than confusion between the classes.
3.5 Acquisition of APMs
To quantitatively assess surgical performance, the following APMs were chosen for calculation: path length (Eq. 1), average velocity (Eq. 2), average acceleration (Eq. 3), and average jerk (Eq. 4). These were selected from the most commonly used APMs identified in the literature search (see Section 2) that we deemed feasible to calculate with the available datasets. In addition, the working area (Eq. 5) can be computed, as it depends only on the positions at which the tool appears. The APMs are acquired using the center of the bounding box from the tracker results. The spatial metrics are obtained by applying a Savitzky-Golay smoothing filter to the tool positions, as suggested by Ebina et al. (2022b).

Table 1: Confusion matrix for the YOLO11 test results, highlighting which classes the model confuses with each other (columns: true class; rows: predicted class; dashes denote zero).

    Predicted \ True   Grasper  Bipolar  Hook  Scissors  Clipper  Irrigator  SpecimenBag  Background
    Grasper            0.84     -        -     0.03      -        -          0.03         0.64
    Bipolar            -        0.94     -     -         -        -          -            -
    Hook               -        -        0.96  -         -        -          -            -
    Scissors           -        -        -     0.92      -        -          -            0.03
    Clipper            -        -        -     0.02      0.92     -          -            0.03
    Irrigator          -        -        -     -         -        0.93       -            0.12
    SpecimenBag        -        -        -     -         -        -          0.87         0.15
    Background         0.16     0.06     0.04  0.03      0.08     0.07       0.10         -
PL = \sum_{i=1}^{n-1} \left\lVert p_{tip}(i+1) - p_{tip}(i) \right\rVert    (1)

\bar{v} = \frac{1}{n} \sum_{i=1}^{n} \left\lVert \frac{d}{dt}\, p_{tip}(i) \right\rVert    (2)

\bar{a} = \frac{1}{n} \sum_{i=1}^{n} \left\lVert \frac{d^{2}}{dt^{2}}\, p_{tip}(i) \right\rVert    (3)

\bar{J} = \frac{1}{n} \sum_{i=1}^{n} \left\lVert \frac{d^{3}}{dt^{3}}\, p_{tip}(i) \right\rVert    (4)

WA = \left(x_{(q=97.5\%)} - x_{(q=2.5\%)}\right) \cdot \left(y_{(q=97.5\%)} - y_{(q=2.5\%)}\right)    (5)

where p_{tip}(i) represents the tool tip position in frame i, and x_{(q=97.5%)} and x_{(q=2.5%)} are the 97.5th and 2.5th percentiles, respectively, of the bounding box center position on the X-axis, and similarly for the Y-axis.
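A sketch of Eqs. 1-5 in code, operating on one tool's bounding-box centres, is given below. The frame rate, Savitzky-Golay window length, and polynomial order are assumptions; the paper does not report the exact values used.

    import numpy as np
    from scipy.signal import savgol_filter

    def compute_apms(centers: np.ndarray, fps: float = 1.0) -> dict:
        """APMs of Eqs. 1-5 from an (n, 2) array of pixel coordinates.
        Window length and polynomial order are assumed values."""
        dt = 1.0 / fps
        p = savgol_filter(centers, window_length=9, polyorder=3, axis=0)

        # Eq. 1: path length, summed Euclidean steps between frames.
        path_length = np.linalg.norm(np.diff(p, axis=0), axis=1).sum()

        # Eqs. 2-4: mean magnitudes of the 1st/2nd/3rd time derivatives.
        v = np.gradient(p, dt, axis=0)
        a = np.gradient(v, dt, axis=0)
        j = np.gradient(a, dt, axis=0)

        # Eq. 5: working area from the central 95% of positions.
        x_lo, x_hi = np.percentile(p[:, 0], [2.5, 97.5])
        y_lo, y_hi = np.percentile(p[:, 1], [2.5, 97.5])

        return {
            "PL": path_length,
            "v": np.linalg.norm(v, axis=1).mean(),
            "a": np.linalg.norm(a, axis=1).mean(),
            "J": np.linalg.norm(j, axis=1).mean(),
            "WA": (x_hi - x_lo) * (y_hi - y_lo),
        }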
4 RESULTS
We evaluated state-of-the-art trackers using standard
metrics (MOTA, HOTA, and IDF1). Among the
evaluated trackers, BoT-SORT achieved the highest
scores and proved to be the most effective for our
dataset. As a result, it was selected for computing
the APMs, which are used to compare the technical
skill and performance across different laparoscopic
surgery videos.
4.1 Trackers
Table 2: MOT metrics from BoT-SORT, ByteTrack, and UniTrack on the videos from the Cholec80-Boxes dataset.

    Tracker     MOTA   HOTA   IDF1
    BoT-SORT    54.46  12.61   9.17
    ByteTrack   51.17  11.44   8.48
    UniTrack    48.77  11.71   9.37
Table 2 shows the results of MOTA, HOTA and
IDF1 metrics obtained from each tracker. BoT-SORT
was selected as the main tracker due to its superior
HOTA score (12.61), reflecting its strong ability to
balance detection accuracy and trajectory association.
Although BoT-SORT was only marginally better on individual metrics, its performance was consistently solid overall: it achieved the highest MOTA (54.46) and performed well in IDF1 (9.17) compared to the others. Byte-
Track, although slightly behind BoT-SORT, showed
strong performance in MOTA (51.17) but had the low-
est HOTA score (11.44). UniTrack was the best in
IDF1 (9.37) but lagged behind in MOTA and HOTA,
indicating a less balanced performance overall.
4.2 APMs
Table 3 shows the APM scores of the videos from the Cholec80-Boxes dataset. The videos have approximately the same length and feature the same surgical instrument types, as mentioned in Section 3.3. The chosen APMs were acquired using the calculations in Section 3.5. The scores for each video were obtained from the BoT-SORT tracking outputs for each tool type; the total scores were then calculated as a weighted mean over all instruments. The weight of each tool type is its temporal appearance ratio in the video, i.e., the ratio between the number of frames in which the tool appeared and the total number of frames. In addition, the path length is reported as total path length per minute, to normalize for video duration.
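The weighted aggregation described above could be implemented roughly as follows; the function and variable names are illustrative.

    import numpy as np

    def video_score(per_tool_score: dict, frames_visible: dict) -> float:
        """Weighted mean of one APM over all tool types. Each tool is
        weighted by its temporal appearance ratio; normalising the
        weights to sum to 1 yields the same weighted mean as dividing
        the frame counts by the total number of frames."""
        tools = list(per_tool_score)
        weights = np.array([frames_visible[t] for t in tools], dtype=float)
        weights /= weights.sum()
        scores = np.array([per_tool_score[t] for t in tools], dtype=float)
        return float(weights @ scores)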
Table 3: Examples of APM calculations from different videos.

    Video     PL [pixel/min]  \bar{v} [pixel/s]  \bar{a} [pixel/s^2]  \bar{J} [pixel/s^3]  WA [pixel^2]
    video01   3213            57.67              30.16                32.24                149250
    video02   3771            80.86              38.38                40.43                122654
    video03   1715            63.15              30.81                32.84                110318
    video04   3258            65.14              36.64                39.06                147245
    video05   1888            56.94              29.49                31.17                 81156

It can be observed that video04 and video02 have the longest path lengths and the largest working areas. They also have the highest average velocity, acceleration, and jerk, i.e., the movements in these videos are less coherent and may exhibit jitter and shakiness. In contrast, video05 has the shortest path length, the smallest working area, and lower average velocity, acceleration, and jerk, suggesting smoother, less shaky, and more coherent and accurate movements during the operation.
Furthermore, in order to analyze and comprehend the APMs acquired from tracking a video, we generate heat maps of the total number of appearances and the locations of the surgical tools in laparoscopic videos. Heat maps illustrate the working area metric and may serve as an indicator of technical skill, as they show how focused the surgeon's movements are during the procedure; they may also support comparisons between different videos (Jin et al., 2018). Figure 6 shows examples of heat maps obtained from two of the tracked videos discussed above: video04 (Fig. 6a) and video05 (Fig. 6b). The heat maps suggest that video05 exhibits more focused and accurate movements than video04, consistent with the comparison of the metrics in Table 3.
Figure 6: Heat maps of bounding box occurrences and locations of tools in two full laparoscopic videos: (a) video04, (b) video05.
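Such heat maps can be produced by accumulating bounding-box centre positions into a 2-D histogram, roughly as sketched below; the frame size and bin count are assumptions.

    import numpy as np
    import matplotlib.pyplot as plt

    def tool_heatmap(centers: np.ndarray, frame_hw=(480, 854), bins=64):
        """Plot a heat map of bounding-box centre occurrences.
        centers: (n, 2) array of (x, y) pixel coordinates;
        frame_hw: assumed (height, width) of the video frames."""
        h, w = frame_hw
        heat, _, _ = np.histogram2d(
            centers[:, 1], centers[:, 0],  # rows follow y, columns x
            bins=bins, range=[[0, h], [0, w]],
        )
        plt.imshow(heat, cmap="hot", interpolation="bilinear")
        plt.title("Total heatmap")
        plt.axis("off")
        plt.show()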
5 DISCUSSION
In this study, we focused on automating the assess-
ment of surgical performance in laparoscopic surg-
eries by tracking surgical tools and computing APMs.
Our findings show that by leveraging video-based
analysis and state-of-the-art object tracking tech-
niques, we were able to generate detailed perfor-
mance metrics such as path length, velocity, accelera-
tion, jerk, and working area. These metrics may offer
an objective and data-driven approach to evaluate sur-
gical performance.
While we can successfully extract the aforemen-
tioned APMs from input videos, we are currently not
able to evaluate how these APMs translate to surgical
performance. Making such an evaluation would re-
quire thorough analysis of the videos by one or more
expert surgeons. This analysis could poten-
tially be correlated with the extracted APMs to de-
termine what constitutes good surgical performance.
Thus, given the APMs of the videos, we can only
compare the scores between different videos. Future
work is required to determine how well the extracted
APMs can predict surgical performance.
Moreover, it is not currently possible for us to
evaluate the accuracy of the APMs we have calcu-
lated. The datasets we have acquired do not include
ground truth data regarding the positioning of the
tools relative to the environment they are in. Be-
cause of this, we cannot draw conclusions about how accurate our APM estimates are; we can only compare surgical performance between videos. One factor that could also potentially in-
troduce error is the spatial component in some APMs,
such as velocity, acceleration, jerk, and path length,
because the calculation requires depth information.
The tools move around in all directions, making it
hard to estimate depth from 2D videos. It might be
possible to estimate this if the camera parameters are
known, but these parameters are not available for the
datasets that we have used.
Additionally, the nature of laparoscopic surgeries
introduces further complexity. These surgeries of-
ten involve a person manually holding the camera,
leading to slight movements that can add noise to
the APM calculations. This camera movement is not
taken into account when calculating the APMs.
Although APMs are still in their infancy, this
emerging field holds significant promise for trans-
forming surgical training and enhancing the perfor-
mance of experienced surgeons. APMs can provide
actionable feedback to surgeons, potentially reducing
the reliance on high surgical volumes for skill acqui-
sition. With the current constraints on surgical care
limiting procedure volumes, the development and re-
finement of APMs offer a viable solution to optimize
training and performance evaluation. Future research
should focus on expanding the repertoire of APMs
and advancing their accuracy and applicability in as-
sessing surgical performance.
6 CONCLUSION
In this project, we developed an automated system
to evaluate surgical performance, providing a foun-
dation for improved assessment methods. The pri-
mary objective was to create a pipeline capable of
processing laparoscopic surgery videos, detecting and
tracking surgical instruments, and calculating APMs
to support objective evaluation.
The fine-tuned YOLO11 model demonstrated
strong performance on the m2cai16-tool-locations
dataset, achieving a mean average precision
(mAP@0.5) of 0.957. However, on the Cholec80-
Boxes dataset, it scored a lower mAP@0.5 of 0.65,
highlighting dataset-dependent variability. Despite
this, the model provided a robust foundation for
tracking.
For tracking, ByteTrack, BoT-SORT, and Uni-
Track were evaluated using established MOT metrics
such as HOTA, MOTA, and IDF1. These MOT met-
rics helped identify BoT-SORT as the most effective
tracker for this application, balancing detection ac-
curacy and identity association across the evaluation
metrics.
The system successfully extracted APMs, such
as path length, velocity, acceleration, jerk, working
area, and usage time distribution, enabling compara-
tive analyses between surgical videos. Future work should address the limitations discussed above to improve APM accuracy.
In conclusion, this work validates state-of-the-art
detection and tracking algorithms’ ability to compute
APMs from 2D laparoscopic surgery videos. This
project lays a foundation for improving the quality and fidelity of surgical training, offering potential for enhancing patient safety and reducing training costs.
ACKNOWLEDGMENT
We would like to thank Aalborg University for pro-
viding the computational resources needed for this
project.
REFERENCES
Abdulbaki Alshirbaji, T. et al. (Aug. 2024). Cholec80-Boxes: Bounding-Box Labels for Surgical Tools in Five Cholecystectomy Videos. Zenodo. DOI: 10.5281/zenodo.13170928. URL: https://doi.org/10.5281/zenodo.13170928.
Aharon, N., R. Orfaig, and B.-Z. Bobrovsky (2022). “BoT-
SORT: Robust Associations Multi-Pedestrian Track-
ing”. In: arXiv preprint arXiv:2206.14651.
Benchmark", " O. T. (n.d.). Accessed on: Dec. 11, 2024.
[Online]. Available: https://www.MOTChallenge.net.
Buckley, C. E. et al. (2015). “Zone calculation as a tool for
assessing performance outcome in laparoscopic sutur-
ing”. In: Surgical Endoscopy 29, pp. 1553–1559.
Chen, J. et al. (2018). “Use of automated performance met-
rics to measure surgeon performance during robotic
vesicourethral anastomosis and methodical develop-
ment of a training tutorial”. In: The Journal of urology
200.4, pp. 895–902.
Code", " with (n.d.). Accessed on: Dec. 11, 2024 [Online].
Available: https://www.paperswithcode.com.
D’Angelo, A.-L. et al. (2015). “Idle time: an underdevel-
oped performance metric for assessing surgical skill.”
In: American journal of surgery 209, pp. 645–651.
Ebina, K. et al. (2022a). “Automatic assessment of laparo-
scopic surgical skill competence based on motion met-
rics”. In: PLoS one 17.11, e0277105.
Ebina, K. et al. (2022b). “Objective evaluation of laparoscopic surgi-
cal skills in wet lab training based on motion analy-
sis and machine learning”. In: Langenbeck’s archives
of surgery 407.5, pp. 2123–2132.
Guerin, S. et al. (2022). “Review of automated performance
metrics to assess surgical technical skills in robot-
assisted laparoscopy”. In: Surgical Endoscopy, pp. 1–
18.
Luiten, J., Ošep, A., Dendorfer, P., et al. (2021). “HOTA: A Higher Order Metric for Evaluating Multi-object Tracking”. In: International Journal of Computer Vision 129, pp. 548–578. URL: https://doi.org/10.1007/s11263-020-01375-2.
Jin, A. et al. (2018). Tool Detection and Operative Skill Assessment in Surgical Videos Using Region-Based Convolutional Neural Networks. arXiv: 1802.08774 [cs.CV]. URL: https://arxiv.org/abs/1802.08774.
Jocher, G. and J. Qiu (2024). Ultralytics YOLO11. Version 11.0.0. URL: https://github.com/ultralytics/ultralytics.
Moglia, A. et al. (2021). “A systematic review on artificial
intelligence in robot-assisted surgery”. In: International
Journal of Surgery 95, p. 106151.
Peng, W. et al. (2019). “An automatic skill evaluation
framework for robotic surgery training”. In: The Inter-
national Journal of Medical Robotics and Computer
Assisted Surgery 15.1, e1964.
Reiley, C. E. et al. (2011). “Review of methods for objec-
tive surgical skill evaluation”. In: Surgical endoscopy
25, pp. 356–366.
Rivas-Blanco, I. et al. (2021). “A review on deep learn-
ing in minimally invasive surgery”. In: IEEE Access 9,
pp. 48658–48678.
Sallaberry, L. H., R. Tori, and F. L. Nunes (2022). “Au-
tomatic performance assessment in three-dimensional
interactive haptic medical simulators: A systematic re-
view”. In: ACM Computing Surveys 55.7, pp. 1–35.
Schmidt, A. et al. (2024). “Tracking and mapping in medical computer vision: A review”. In: Medical Image Analysis 94, p. 103131. ISSN: 1361-8415. DOI: 10.1016/j.media.2024.103131. URL: https://www.sciencedirect.com/science/article/pii/S1361841524000562.
Trejos, A. L. et al. (2014). “Development of force-based
metrics for skills assessment in minimally invasive
surgery”. In: Surgical endoscopy 28, pp. 2106–2119.
Twinanda, A. et al. (Feb. 2016). “EndoNet: A Deep Architecture for Recognition Tasks on Laparoscopic Videos”. In: IEEE Transactions on Medical Imaging 36. DOI: 10.1109/TMI.2016.2593957.
Wang, Z. et al. (2021). “Do different tracking tasks require
different appearance models?” In: Advances in Neural
Information Processing Systems 34, pp. 726–738.
Zhang, Y. et al. (2022). “ByteTrack: Multi-Object Tracking by Associating Every Detection Box”. In: Proceedings of the European Conference on Computer Vision (ECCV).