A Player Position Tracking Method
Based on a Wide-Area Pan-Tilt-Zoom Video
Shunzo Yamagishi¹, Chun Xie²ᵃ, Hidehiko Shishido³ᵇ and Itaru Kitahara²ᶜ
¹Master's and Doctoral Program in Intelligent Mechanical Interaction Systems, University of Tsukuba, Ibaraki, Japan
²Center for Computational Sciences, University of Tsukuba, Ibaraki, Japan
³Faculty of Science and Engineering, Department of Information Systems Engineering, Soka University, Tokyo, Japan
Keywords: Sports Analytics, Player Tracking, Wide-Area, Video Processing, Pan-Tilt-Zoom Camera.
Abstract: This paper proposes a method to estimate the posture of an athlete moving on a vast field in a sporting event using a pan-tilt-zoom camera. In order to estimate the posture of an athlete on a sports field from a dynamic video sequence, our method extracts image features to search for correspondence points among successive frames and computes the homography transformation matrix that compensates for changes in the camera parameters (e.g., angle of view and posture). The effectiveness of this method is qualitatively verified using MLB TV broadcast video. The accuracy and error factors are also quantitatively verified by a CG simulation.
1 INTRODUCTION
The use of video data for player evaluation in sports is advancing. In team sports played on large fields, such as soccer and baseball, both individual performance and team coordination are analyzed. Detailed physical movement data is crucial for evaluating individual skills, such as dribbling or catching. Conversely, data on player positions and the positional relationships among the ball and players are used to evaluate passing and hitting techniques and to analyze tactics such as formations.
To evaluate individual physical movements, local information around the players is needed, typically acquired using telephoto tracking shots. For tactical analysis, capturing wide-area information of the entire field with fixed wide-angle shots is more effective. However, due to the limited resolution of cameras, it is challenging to capture both local and wide-area information with a single camera, while using multiple cameras introduces extra cost and is sometimes infeasible in competitive situations due to restrictions imposed by competition organizations.
a https://orcid.org/0000-0003-4936-7404
b https://orcid.org/0000-0001-8575-0617
c https://orcid.org/0000-0002-5186-789X
A PTZ (pan-tilt-zoom) camera can change its field of view (zoom) and posture (pan and tilt), enabling both telephoto tracking and wide-area fixed shooting depending on the situation. Since the shooting region of a PTZ camera dynamically changes, it is necessary to estimate the captured region of the field in every frame. Typically, standardized field markers set up in accordance with competition regulations, such as lines, are used for alignment. However, during telephoto shooting, only a small portion of the field is captured, making alignment difficult due to the lack of visible landmarks.
This paper aims to solve this issue by estimating changes in the camera's shooting area using natural feature points instead of relying on explicit landmarks. Instead of mapping player positions directly from a camera frame to the field, we transform them sequentially to a reference frame in which landmarks are clearly observed. Our method can map player positions to the field coordinate system using videos taken by a PTZ camera that switches between wide-angle shots and telephoto tracking. In the experiments, the effectiveness of our method and the errors associated with estimating player positions are discussed.
Figure 1: Overview of our proposed method. The geometric relationship $\mathbf{H}$ between each frame of the PTZ camera and the field can be determined by the homography matrix $\mathbf{H}_f$ between the reference frame and the field, and the homography matrix $\mathbf{H}_{ptz}^{i}$ between each frame $i$ and the reference frame.
2 RELATED WORK
Player tracking has been applied to various kinds of sports. In basketball, Lu et al. proposed a method for tracking players using a single pan-tilt-zoom camera and applied it to the analysis of player performance on the court by estimating the homography between video frames and the court (Lu et al., 2013). In football, a method was proposed to analyze the movement trajectories of players on the field by automatically estimating camera parameters using optical flow and detecting players with a Kalman filter from match video (Beetz et al., 2007). In ski racing, research has visualized the movement trajectories of skiers from video captured by a camera fixed to a tripod (Dunnhofer et al., 2023).
Due to the development of deep learning, object detection methods have been proposed that can extract effective features and perform class identification from small regions in low-resolution video or images (Redmon et al., 2016; Ren et al., 2016). As a result, in the field of sports, attempts have been made to detect and track small regions of players and balls in drone and fixed-camera video of the entire field (Katić et al., 2024).
There is also much research on field alignment, which seeks to determine the geometric relationship between the field coordinate system and the camera image coordinate system. For example, a method has been proposed for adjusting the positions of the parallel lines and vanishing points of a field by solving an energy maximization problem on a Markov random field (Homayounfar et al., 2017). Additionally, there are studies that use regression networks to align images of sports fields with template images of fields (Jiang et al., 2020; Shi et al., 2022). Another proposed approach is to create a database of edge images and query it with edges extracted from input images (Chen & Little, 2019). These methods seek to establish a correspondence between frames and the field by directly matching field features for each frame. Therefore, they rely on accurately detecting field features such as straight lines and circles.
Figure 2: Process flow of the proposed method. Player regions in video frames are masked based on player detection results, feature points are detected from the masked frames, and the frame of interest is compared with the previous $n$ frames to obtain the transformation matrix $\mathbf{H}_{ptz}^{i}$ to the reference frame.
Figure 3: Detection of players by an object detection network. The center of the lower edge of the resulting bounding box is used as the position of the player $(x_{ptz}^{i}, y_{ptz}^{i})$ in the PTZ camera frame.
In baseball, the image features available for matching between video frames are too sparse to estimate robust inter-frame transformations. To address this problem, we set a reference frame and calculate the transformation between each frame and the reference frame, as well as the transformation between the reference frame and the field coordinate system. This allows us to determine the position of the player on the field even when reliable image features are not available.
3 METHOD
Figure 1 shows the overview of our proposed method, and Figure 2 details the process flow. Our objective is to detect a target player in a video involving various camera motions and fields of view, and map the player's position to a field coordinate system via a homography transformation. Here, we denote points in three coordinate systems as follows:
 $(x_{ptz}^{i}, y_{ptz}^{i})$: coordinates in the $i$-th PTZ camera frame
 $(x_{ref}, y_{ref})$: coordinates in the specified reference frame of the PTZ camera
 $(x_f, y_f)$: coordinates on the field
3.1 Player Detection and Tracking in a PTZ-Video Frame
An object detection neural network is used to detect and track the player in each frame of the PTZ video. The center of the lower edge of the resulting bounding box, as shown in Figure 3, is used as the position of the player $(x_{ptz}^{i}, y_{ptz}^{i})$. We also apply a Multi-Object Tracking (MOT) technique to ensure a target player can be traced across different frames.
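As a concrete illustration, the following is a minimal sketch of this detection step using the ultralytics YOLOv8 API; the checkpoint name and the `player_points` helper are our own placeholders, not part of the proposed method.

```python
# Sketch: per-frame player detection and bottom-center extraction,
# assuming the ultralytics YOLOv8 API (checkpoint name is a placeholder).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

def player_points(frame):
    """Return the (x_ptz, y_ptz) bottom-center point of each detected person."""
    result = model(frame, classes=[0])[0]       # COCO class 0 = person
    points = []
    for x1, y1, x2, y2 in result.boxes.xyxy.tolist():
        points.append(((x1 + x2) / 2.0, y2))    # center of the lower bbox edge
    return points
```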
Figure 4: Masked feature detection. We use the bounding boxes of detected players to generate a mask image for each frame and remove the feature points in the masked area. The remaining features are consistent in terms of frame-to-frame transformation, making them robust for feature matching.
3.2 Player Position Mapping
Since the PTZ camera remains stationary (e.g., fixed to the environment using a tripod) and the shooting distance in sports scenes is relatively long, it can be assumed that the optical center of the camera is fixed while shooting. Therefore, the image coordinate system of each frame can be transformed using a homography matrix. The relationship between the coordinates in the PTZ camera plane $(x_{ptz}^{i}, y_{ptz}^{i})$ and the coordinates in the reference camera plane $(x_{ref}, y_{ref})$ can be described using the homography matrix $\mathbf{H}_{ptz}^{i}$ as shown in Equation (1).

$$\begin{pmatrix} x_{ref} \\ y_{ref} \\ 1 \end{pmatrix} \sim \mathbf{H}_{ptz}^{i} \begin{pmatrix} x_{ptz}^{i} \\ y_{ptz}^{i} \\ 1 \end{pmatrix} \tag{1}$$
$\mathbf{H}_{ptz}^{i}$ can be estimated via feature matching between a PTZ camera frame and the reference frame. However, the camera field of view differs greatly between frames, and features can be extremely sparse in zoomed-in shots where the observed area is very narrow. As a result, it is usually difficult to find enough robust feature pairs to register a PTZ camera frame directly to the reference frame.

To address this problem, we match the target frame with the previous $n$ frames, instead of the reference frame, to estimate a local transformation using the least squares method and the Direct Linear Transform (Hartley & Zisserman, 2003). The local transformations are then composed back towards the reference frame to obtain the homography $\mathbf{H}_{ptz}^{i}$ between the target frame and the reference frame.
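Concretely, this composition amounts to accumulating 3×3 matrices. The sketch below is our illustration of that bookkeeping, not the authors' code; it assumes each local homography maps frame $i$ onto frame $i-n$ and that the accumulated homography of frame $i-n$ is already stored.

```python
# Sketch: compose local homographies back to the reference frame.
# H_to_ref[j] maps frame j onto the reference frame; H_local maps
# frame i onto frame i - n, so H_to_ref[i] = H_to_ref[i - n] @ H_local.
import numpy as np

H_to_ref = {0: np.eye(3)}  # frame 0 serves as the reference frame

def accumulate(i, n, H_local):
    H = H_to_ref[i - n] @ H_local
    H /= H[2, 2]               # normalize the homography scale
    H_to_ref[i] = H
    return H
```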
To make the estimation of local transformations more robust, we filter out unreliable features, which are mostly detected on the moving players, before feature matching. This is done using a mask image generated from the bounding boxes of players produced by an object detection network. The process is demonstrated in Figure 4.
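A minimal OpenCV sketch of this masked matching step is shown below; note that it substitutes RANSAC-based `cv2.findHomography` for the least-squares/DLT fit described above, and all function and variable names are our own.

```python
# Sketch: mask player bounding boxes out of a frame, detect SIFT features
# on the static background only, then fit a local homography with RANSAC.
import cv2
import numpy as np

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher()

def local_homography(gray_i, gray_prev, boxes_i, boxes_prev):
    def masked_features(gray, boxes):
        mask = np.full(gray.shape, 255, np.uint8)
        for x1, y1, x2, y2 in boxes:        # integer player bboxes
            mask[y1:y2, x1:x2] = 0          # suppress player regions
        return sift.detectAndCompute(gray, mask)

    kp1, des1 = masked_features(gray_i, boxes_i)
    kp2, des2 = masked_features(gray_prev, boxes_prev)
    # Lowe's ratio test keeps only distinctive matches
    good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
            if m.distance < 0.75 * n.distance]
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H_local, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H_local   # maps frame i onto the earlier frame
```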
Given the homography transformation $\mathbf{H}_f$ between the reference camera plane and the field plane, we can map the player coordinates $(x_{ptz}^{i}, y_{ptz}^{i})$ in the $i$-th frame to the field plane coordinates $(x_f, y_f)$ via Equations (2) and (3). Note that $\mathbf{H}_f$ can be easily obtained by specifying the field coordinates of several representative points on the reference camera image.

$$\mathbf{H} = \mathbf{H}_f \, \mathbf{H}_{ptz}^{i} \tag{2}$$

$$\begin{pmatrix} x_f \\ y_f \\ 1 \end{pmatrix} \sim \mathbf{H} \begin{pmatrix} x_{ptz}^{i} \\ y_{ptz}^{i} \\ 1 \end{pmatrix} \tag{3}$$
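For illustration, obtaining $\mathbf{H}_f$ from four hand-specified correspondences and applying Equations (2) and (3) might look as follows in OpenCV; the pixel and field coordinates are made-up values, not measurements from the paper.

```python
# Sketch: H_f from four hand-picked reference-frame/field correspondences
# (values are hypothetical), then the composed mapping of Eqs. (2) and (3).
import cv2
import numpy as np

ref_pts = np.float32([[812, 410], [1530, 455], [1365, 880], [240, 790]])
field_pts = np.float32([[0, 0], [27.4, 0], [27.4, 27.4], [0, 27.4]])  # meters
H_f = cv2.getPerspectiveTransform(ref_pts, field_pts)

def to_field(x_ptz, y_ptz, H_ptz_i):
    H = H_f @ H_ptz_i                        # Equation (2)
    p = H @ np.array([x_ptz, y_ptz, 1.0])    # Equation (3)
    return p[0] / p[2], p[1] / p[2]          # dehomogenize
```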
4 EXPERIMENTS
To demonstrate the application of our method, two types of evaluations are performed. The first evaluation is qualitative, using real-world video of an outfielder catching a fly ball. The second evaluation is quantitative, using a CG simulation. For object detection, we adopt YOLOv8 (Jocher et al., 2023), and ByteTrack (Zhang et al., 2022) is used for multi-object tracking. SIFT (Lowe, 2004) is used for feature extraction in feature matching. In addition, the matching interval $n$ is set to 30 frames.
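Under the ultralytics API, this detector/tracker pairing can be invoked roughly as below; the clip filename is a placeholder, and this is our sketch of the setup rather than the authors' script.

```python
# Sketch: YOLOv8 detection with ByteTrack association, assuming the
# ultralytics tracking API (video filename is hypothetical).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
results = model.track("outfield_clip.mp4", tracker="bytetrack.yaml", classes=[0])
for r in results:
    track_ids = r.boxes.id   # per-detection track IDs (None if untracked)
```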
4.1 Live-Action Video Evaluation
We conducted a qualitative experiment using PTZ videos shot in a real-world baseball game. The video was sourced from MLB FILM ROOM (MLB.com, 2024) and primarily captured using PTZ cameras for TV broadcasts. We clipped scenes where the outfielder catches the ball from the cited video and used these as our input. We manually masked out the chyrons in the video as they contain undesired features.
To visualize player positions, we used aerial photographs from Google Earth (Google LLC, 2024). Figure 5 illustrates the results. The left column of Figure 5 shows a zoomed-in frame, and the middle column shows this frame registered to the reference frame using $\mathbf{H}_{ptz}^{i}$. The right column demonstrates that even when the camera's field of view differed greatly from the reference frame, we can still successfully map the positions of the players on the field.

Figure 5: Results using MLB live video. Left: the frame of interest. Middle: the reference frame, with the frame of interest registered in the area marked by the white rectangle. Right: estimated player positions visualized on the aerial image. The player moves from the white point to the red point.
4.2 CG Environment Evaluation
We use Unity (Unity Technologies, 2023b) to create the CG simulation and incorporated the Baseball Stadiums Pack (Distinctive Developments Ltd, 2017) and Starter Assets - ThirdPerson (Unity Technologies, 2023a) for the player models. To improve object detection accuracy, we changed the color of the player model and fine-tuned the object detection model. Using this environment, we rendered a video that includes PTZ camera work, reproducing an outfielder catching a ball. The video was rendered in Full HD quality at 30 fps.
4.2.1 Decomposition of Error
The error in the player position in the field plane is evaluated by breaking it down into depth and lateral directions as viewed from the camera. Figure 6 illustrates the overview. First, the camera optical axis is obtained from the true values of the camera parameters. Next, the camera optical axis is orthogonally projected onto the field plane. The error component of the player position along the direction of the orthogonal projection of the optical axis is the D-error, and the error component perpendicular to that direction is the L-error. D-error thus represents the error in the depth direction as seen from the camera, and L-error the error in the lateral direction.

Figure 6: An overview of the decomposition of the player position error, where D-error indicates the error in the depth direction as seen from the camera and L-error indicates the error in the lateral direction as seen from the camera.
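This decomposition amounts to projecting the 2D error vector on the field plane onto the projected optical-axis direction and its perpendicular. A small sketch under these definitions follows; all names are ours.

```python
# Sketch: split a field-plane position error into D- and L-components.
# axis_dir is the optical axis orthogonally projected onto the field plane.
import numpy as np

def decompose_error(p_est, p_true, axis_dir):
    axis = np.asarray(axis_dir, float)
    axis /= np.linalg.norm(axis)                 # unit depth direction
    err = np.asarray(p_est, float) - np.asarray(p_true, float)
    d_err = float(err @ axis)                    # depth (D-error) component
    perp = np.array([-axis[1], axis[0]])         # lateral unit direction
    l_err = float(err @ perp)                    # lateral (L-error) component
    return d_err, l_err
```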
4.2.2 Player Position Error on the Field Plane
We analyze the errors in the estimated player positions $(x_f, y_f)$ compared to the ground truth $(\bar{x}_f, \bar{y}_f)$, as shown in the left column of Figure 7. The potential sources of error are:
 the error in the player's position $(x_{ptz}^{i}, y_{ptz}^{i})$ on the PTZ camera plane due to object detection;
 the estimation error in the homography transformation matrix $\mathbf{H}_{ptz}^{i}$ from the PTZ camera plane to the reference camera plane.
To further investigate the effects of each error source, we calculated the overall errors using the following setups:
 use ground truth player positions in PTZ camera frames and map them to the field plane using the estimated homography transformation; the result is shown in Figure 7 (middle);
 use estimated player positions in PTZ camera frames provided by object detection and map them to the field plane using the ground truth homography transformation; the result is shown in Figure 7 (right).
As can be seen from the middle and right columns of Figure 7, the error is much more significant when using estimated player positions.
Figure 7: The player position error decomposed into the camera depth direction (D-error) and the lateral direction as seen from the camera (L-error). Left: object detection yields an estimate of the player's position on the PTZ camera plane and matching estimates the transformation to the reference camera plane. Middle: true values are given for the player's position on the PTZ camera plane from the camera parameters. Right: true values are given for the transformation to the reference camera plane. We see that the error in the player's position in the field coordinate system is greater in the depth direction of the camera. Comparing the columns also shows that most of the error is due to the detected position on the PTZ camera plane.
Figure 8: The relationship between the number of frames elapsed from the reference frame and the accuracy of the $\mathbf{H}_{ptz}^{i}$ estimation (root mean squared error of the corners). Homography errors accumulated as the number of composed transformations increased. However, no differences were found between the different camera works.
4.2.3 Effect of PTZ Camera Motion
We evaluate the accuracy of $\mathbf{H}_{ptz}^{i}$ using the root mean squared error (RMSE) of the four corners of the transformed camera frames on the reference frame. To investigate the effect of each camera operation and of the distance to the reference frame on $\mathbf{H}_{ptz}^{i}$, we use videos with only pan-tilt motion, only zoom, and all PTZ movements. The results, shown in Figure 8, indicate that homography errors accumulate as the number of composed transformations increases.
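The corner-RMSE metric can be computed as sketched below, assuming Full HD frames; `H_est` and `H_true` denote the estimated and ground-truth homographies to the reference frame, and the function name is ours.

```python
# Sketch: RMSE over the four warped frame corners
# (frame size assumes the Full HD rendering).
import numpy as np

def corner_rmse(H_est, H_true, w=1920, h=1080):
    corners = np.array([[0, 0, 1], [w, 0, 1], [w, h, 1], [0, h, 1]], float).T
    def warp(H):
        p = H @ corners
        return (p[:2] / p[2]).T       # 4 x 2 corners on the reference frame
    d2 = ((warp(H_est) - warp(H_true)) ** 2).sum(axis=1)
    return float(np.sqrt(d2.mean())) # root mean squared corner error
```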
4.2.4 Discussion
The experimental results indicate that the position estimation error for players in short videos is primarily due to the object detection result $(x_{ptz}^{i}, y_{ptz}^{i})$ on the PTZ camera plane. However, in longer videos, the errors caused by $\mathbf{H}_{ptz}^{i}$ accumulate as the transformation matrix is composed from the PTZ frame to the reference frame.

Additionally, defining the player's position as the bottom-center of the bounding box is not strictly accurate, since players often jump and land repeatedly. This is especially problematic for shallow tilt angles, which cause significant errors in the depth direction of the camera (D-error). Improving player position estimation accuracy requires developing more precise methods for detecting players on the PTZ camera plane.
5 CONCLUSIONS
This paper proposed a method to determine the positions of players on a large field during sporting events from videos captured using a PTZ camera where the camera pose and FOV change dynamically based on the context. Our approach estimates player positions by finding corresponding points between consecutive frames using image features and calculating a homography transformation matrix to map the player positions from the dynamic camera frame to the static field plane.

We validated the effectiveness of our method with MLB TV broadcast video and further verified its accuracy and error sources using a CG simulation with known true values. This research provides a valuable technique for accurately tracking player positions globally in dynamic sports settings.
REFERENCES
Beetz, M., Gedikli, S., Bandouch, J., Kirchlechner, B., Hoyningen-Huene, N. von, & Perzylo, A. (2007). Visually tracking football games based on TV broadcasts. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). https://mediatum.ub.tum.de/doc/1289990/document.pdf

Chen, J., & Little, J. J. (2019). Sports camera calibration via synthetic data. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. https://openaccess.thecvf.com/content_CVPRW_2019/html/CVSports/Chen_Sports_Camera_Calibration_via_Synthetic_Data_CVPRW_2019_paper.html

Distinctive Developments Ltd. (2017). Baseball Stadiums Pack | 3D Environments | Unity Asset Store. https://assetstore.unity.com/packages/3d/environments/baseball-stadiums-pack-78197

Dunnhofer, M., Sordi, L., & Micheloni, C. (2023). Visualizing skiers' trajectories in monocular videos. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 5187–5197. https://openaccess.thecvf.com/content/CVPR2023W/CVSports/html/Dunnhofer_Visualizing_Skiers_Trajectories_in_Monocular_Videos_CVPRW_2023_paper.html

Google LLC. (2024). Google Earth. https://www.google.com/intl/ja/earth/

Hartley, R., & Zisserman, A. (2003). Multiple view geometry in computer vision. Cambridge University Press.

Homayounfar, N., Fidler, S., & Urtasun, R. (2017). Sports field localization via deep structured models. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4012–4020. https://doi.org/10.1109/CVPR.2017.427

Jiang, W., Higuera, J. C. G., Angles, B., Sun, W., Javan, M., & Yi, K. M. (2020). Optimizing through learned errors for accurate sports field registration. 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), 201–210. https://doi.org/10.1109/WACV45572.2020.9093581

Jocher, G., Chaurasia, A., & Qiu, J. (2023). Ultralytics YOLO (8.0.0) [Python]. https://github.com/ultralytics/ultralytics (Original work published 2022)

Katić, A., Matić, V., & Papić, V. (2024). Detection and player tracking on videos from SoccerTrack dataset. 2024 23rd International Symposium INFOTEH-JAHORINA (INFOTEH), 1–6. https://doi.org/10.1109/INFOTEH60418.2024.10495998

Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110. https://doi.org/10.1023/B:VISI.0000029664.99615.94

Lu, W.-L., Ting, J.-A., Little, J. J., & Murphy, K. P. (2013). Learning to track and identify players from broadcast sports videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7), 1704–1716.

MLB.com. (2024). Major League Baseball Video Search | MLB Film Room. https://www.mlb.com/video

Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 779–788. https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Redmon_You_Only_Look_CVPR_2016_paper.html

Ren, S., He, K., Girshick, R., & Sun, J. (2016). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149.

Shi, F., Marchwica, P., Higuera, J. C. G., Jamieson, M., Javan, M., & Siva, P. (2022). Self-supervised shape alignment for sports field registration. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 287–296. http://openaccess.thecvf.com/content/WACV2022/html/Shi_Self-Supervised_Shape_Alignment_for_Sports_Field_Registration_WACV_2022_paper.html

Unity Technologies. (2023a). Starter Assets - ThirdPerson | Updates in new CharacterController package | Essentials | Unity Asset Store. https://assetstore.unity.com/packages/essentials/starter-assets-thirdperson-updates-in-new-charactercontroller-pa-196526

Unity Technologies. (2023b). Unity (Version 2022.3.10f1). https://unity.com/

Zhang, Y., Sun, P., Jiang, Y., Yu, D., Weng, F., Yuan, Z., Luo, P., Liu, W., & Wang, X. (2022). ByteTrack: Multi-object tracking by associating every detection box. In S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, & T. Hassner (Eds.), Computer Vision – ECCV 2022 (Vol. 13682, pp. 1–21). Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-20047-2_1