D-PLS: Decoupled Semantic Segmentation for
4D-Panoptic-LiDAR-Segmentation
Maik Steinhauser, Laurenz Reichardt, Nikolas Ebert and Oliver Wasenmüller
Mannheim University of Applied Sciences, Germany
{m.steinhauser, l.reichardt, n.ebert, o.wasenmueller}@hs-mannheim.de
Keywords:
Lidar, Point Cloud, Panoptic, Segmentation, 3D, Deep Learning, AI.
Abstract:
This paper introduces a novel approach to 4D Panoptic LiDAR Segmentation that decouples semantic and
instance segmentation, leveraging single-scan semantic predictions as prior information for instance segmen-
tation. Our method D-PLS first performs single-scan semantic segmentation and aggregates the results over
time, using them to guide instance segmentation. The modular design of D-PLS allows for seamless integra-
tion on top of any semantic segmentation architecture, without requiring architectural changes or retraining.
We evaluate our approach on the SemanticKITTI dataset, where it demonstrates significant improvements over
the baseline in both classification and association tasks, as measured by the LiDAR Segmentation and Track-
ing Quality (LSTQ) metric. Furthermore, we show that our decoupled architecture not only enhances instance
prediction but also surpasses the baseline due to advancements in single-scan semantic segmentation.
1 INTRODUCTION
Light Detection and Ranging (LiDAR) sensors have
become integral in providing highly accurate spatial
information, enabling the 3D perception of surround-
ings. These capabilities make LiDAR indispensable
in various applications, including autonomous navi-
gation, mapping, robotics, and numerous industrial
processes. Deep learning has become the de facto standard for processing LiDAR data, further enhancing its utility and allowing for more sophisticated interpretation and application.
However, the task of 4D-Panoptic-LiDAR-
Segmentation (Aygün et al., 2021), which involves
segmenting LiDAR data at an instance level while
maintaining temporal consistency across multiple
frames, remains less explored compared to traditional
LiDAR segmentation or detection methods. As
such, many new network architectures and training
paradigms in this space are developed for single-scan
data only (Thomas et al., 2019; Tang et al., 2020;
Yan et al., 2022; Wu et al., 2024). However, this
task is crucial for applications where understanding
the evolution of objects and environments over
time is essential. In this context, we propose that
semantic classes can serve as prior knowledge to
improve instance segmentation by providing an
initial "coarse" clustering of points.

Figure 1: Qualitative comparison of our method's (D-PLS) results with the ground truth on the SemanticKITTI dataset.

We introduce an approach that decouples the task of 4D-Panoptic-LiDAR-Segmentation, first utilizing single-scan semantic predictions as prior information for
the following step of temporal instance segmenta-
tion. Our Decoupled Semantic Segmentation for
4D-Panoptic-LiDAR-Segmentation (D-PLS) sepa-
rates temporal and instance components, offering
a dedicated module on top of existing segmenta-
tion networks for analyzing dynamic environments
through LiDAR data. To assess the performance of
our novel approach, we employ the SemanticKITTI
(Behley et al., 2019) dataset. This dataset provides
LiDAR point clouds and pose data for training,
validation, and testing, and facilitates comparison
with other state-of-the-art networks.
2 RELATED WORK
The field of 4D-Panoptic-LiDAR-Segmentation has
seen significant advancements, with multiple meth-
ods proposed to address the challenges of segmenting
objects at an instance level with temporal consistency.
The foundational work in this area, introduced by 4D-
PLS (Aygün et al., 2021), established the 4D panop-
tic segmentation task, along with a framework for its
evaluation. Using known poses from the vehicle ego
motion, 4D-PLS fuses consecutive LiDAR scans into
a unified spatio-temporal point cloud, which under-
goes semantic segmentation. Instance clustering is
performed based on predicted centers through a prob-
abilistic framework.
Subsequent methods have built upon and extended
the 4D-PLS framework. 4D-DS-Net (Hong et al.,
2022) and 4D-StOP (Kreuzberg et al., 2022) are no-
table for their focus on refining the process of cluster-
ing instances by incorporating spatio-temporal prox-
imity. 4D-DS-Net enhances the previous DS-Net
(Hong et al., 2021) model by introducing a dynamic
shifting module that progressively adjusts individual
point coordinates to move towards instance centers
within the spatio-temporal context, ensuring a spatial
separation and thus more accurate clustering. Simi-
lar to our method, 4D-DS-Net decouples the semantic
segmentation branch, but masks the foreground (i.e.
things) classes before passing them to the instance
branch. However, a major drawback of masking is
that it eliminates spatial information and may exclude
or incorrectly include misclassified points. In con-
trast, our method preserves all semantic information,
ensuring the full spatial information in the instance
branch.
4D-StOP, in contrast, adopts a different strategy
by replacing the probabilistic approach of 4D-PLS
with an instance-centric voting mechanism. This
method involves a dual-network structure, where a
voting module predicts the center of each object, and
a proposal module identifies the corresponding points
associated with that center. This voting-based ap-
proach provides a more precise and targeted method
for determining instances over time.
Modifying the 4D-StOP method, Eq-4D-StOP
(Zhu et al., 2023) introduces the concept of rotation
equivariance to the model. By predicting equivariant
fields, this approach ensures that the model remains
robust to rotational variations in the data, thereby im-
proving the accuracy of feature learning. Instance
clustering itself remains similar to 4D-StOP.
Lastly, Mask4D (Marcuzzi et al., 2023) and
Mask4Former (Yilmaz et al., 2024) diverge from
traditional clustering methods by utilizing a mask
transformer-based paradigm. In this approach,
learned queries representing potential instances are
processed by attention-based decoders, effectively
merging segmentation and tracking into a single op-
eration. This approach eliminates the need for non-
learned clustering techniques or complex network
structures, instead directly predicting spatio-temporal
instance masks along with their corresponding se-
mantic labels.
While these approaches have shown promising re-
sults, their development is decoupled from the space
of semantic segmentation, with their own specific net-
work architectures. Adapting them to the temporal context of 4D-Panoptic-LiDAR-Segmentation requires architectural changes or, at the very least, retraining. In contrast, our work treats semantic segmentation on a single LiDAR scan as a "pre-clustering"
step, which can provide valuable initial information
for temporal consistency and instance level segmen-
tation.
3 APPROACH
We identify three key insights and limitations in
the current state of the art for 4D-Panoptic-LiDAR-
Segmentation.
Firstly, while semantic segmentation can be seen
as a preliminary step to instance segmentation, where
LiDAR points are initially clustered by class and then
further distinguished as individual instances within
each class, most methods require the network to per-
form both tasks simultaneously, without sharing se-
mantic information that could enhance instance pre-
dictions.
Secondly, the few networks that do use semantic
labels as the first step for instance segmentation typ-
ically only utilize them to mask out non-thing back-
ground classes such as building or vegetation.

Figure 2: Our method D-PLS decomposes the task of 4D-Panoptic-LiDAR-Segmentation into two modular branches: the single-scan semantic segmentation branch, which provides prior information to the temporally aggregated instance segmentation branch.

However, incorrect semantic predictions in these cases
lead to the complete removal of geometric informa-
tion for those points, depriving the instance branch of
potentially valuable data.
Thirdly, developments in 4D-Panoptic-LiDAR-
Segmentation often operate separately from advance-
ments in single-scan Semantic Segmentation, with
each domain focusing on distinct architectures. As
a result, 4D-Panoptic-LiDAR-Segmentation does not
always benefit from improvements in single-scan seg-
mentation.
Our approach addresses these challenges by de-
coupling semantic segmentation from instance seg-
mentation into two distinct steps. First, we perform single-scan semantic segmentation, which acts as a coarse initial clustering of points. In a second step, we pass these temporally aggregated semantic re-
sults as prior information to aid in instance segmenta-
tion, without masking out any points. This ensures
that the full geometric information is available for
instance prediction. Thanks to this decoupling, our
method is modular, allowing any semantic segmenta-
tion network to be used as a source of semantic prior
information without requiring architectural changes
or retraining (given that the desired classes remain the
same).
Our method is illustrated in Figure 2. The follow-
ing subsections describe both steps of our method in
detail.
3.1 Semantic Segmentation
Firstly, a pre-trained single-scan segmentation network $D_{seg}$ predicts semantic labels from point coordinates $x, y, z$ and their features $f$ (usually remission and/or intensity in LiDAR data), and the predicted semantic labels $S = D_{seg}(x, y, z, f)$ are appended to the original point cloud. Then, multiple 3D point clouds are temporally aggregated over $N$ time steps using ego-motion data, resulting in a 4D point cloud $P \in \mathbb{R}^{N \times [x, y, z, f, S]}$.
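As an illustration, this aggregation step can be pictured as follows. This is a minimal NumPy sketch with illustrative names, assuming per-scan arrays and 4x4 ego poses; it is not the actual data pipeline.

```python
import numpy as np

def aggregate_scans(scans, features, semantics, poses):
    """Illustrative sketch: build the aggregated 4D point cloud by
    transforming each scan into a common frame with its ego pose and
    appending per-point features f and semantic priors S."""
    rows = []
    for xyz, f, s, pose in zip(scans, features, semantics, poses):
        # Homogeneous transform of (M, 3) coordinates into the shared frame.
        xyz_h = np.hstack([xyz, np.ones((len(xyz), 1))])
        xyz_world = (xyz_h @ pose.T)[:, :3]
        # Per-point rows [x, y, z, f, S], matching P in the text.
        rows.append(np.hstack([xyz_world, f, s]))
    return np.vstack(rows)
```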
For our experiments, we utilize the SPVCNN
(Tang et al., 2020) network variant proposed in
2DPASS (Yan et al., 2022). This network combines
point-based and sparse voxel-based approaches, al-
lowing efficient 3D data processing while maintaining
high resolution and modeling across various receptive
field sizes. The point-based branch preserves high-
resolution data, while the sparse voxel branch ap-
plies Sparse Convolution. These two branches com-
municate through sparse voxelization and devoxeliza-
tion, with trilinear interpolation used for devoxeliza-
tion. Feature fusion occurs by adding outputs from
both branches, enhancing fine details with minimal
computational cost. The 2DPASS variant features a
lightweight decoder composed of trilinear interpola-
tion and MLPs fusing the hierarchical features of the
encoder.
This architecture is particularly effective as it
maintains a pointwise branch while still leveraging hi-
erarchical features in the voxel branch, thus avoiding
the upsampling errors that occur when voxel predic-
tions are directly upsampled as pointwise predictions.
Nonetheless, any semantic segmentation architecture
is compatible with our method. To preserve modular-
ity and independence from specific architectures, we
deliberately avoid using semantic embeddings, limit-
ing the semantic prior information to one-hot predic-
tions or confidence scores.
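The prior itself can be derived from the per-point logits of any segmentation network. Below is a minimal PyTorch sketch of the two variants we compare in the ablation study; the function name and signature are illustrative, not part of any specific library.

```python
import torch
import torch.nn.functional as F

def semantic_prior(logits: torch.Tensor, mode: str = "one_hot") -> torch.Tensor:
    """Convert per-point logits of shape (M, C) into the semantic prior S."""
    if mode == "one_hot":
        # Hard prior: 1 at the argmax class, 0 elsewhere.
        return F.one_hot(logits.argmax(dim=1), num_classes=logits.shape[1]).float()
    # Soft prior: the full per-class confidence distribution.
    return logits.softmax(dim=1)
```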
3.2 Temporal Instance Segmentation
The temporally aggregated 4D point cloud, enriched with semantic prior information, $P \in \mathbb{R}^{N \times [x, y, z, f, S]}$, is
passed to the instance segmentation branch. Here, the
point cloud is processed by a hierarchical encoder-
decoder architecture based on Kernel Point Convolu-
tion (KP-Conv) (Thomas et al., 2019), which predicts the offset $\hat{p}$ between each point's coordinates $p$ and the corresponding instance's ground-truth center coordinate $c_{gt}$. This allows us to coarsely assign each point to a predicted instance center $c_p = p + \hat{p}$. Un-
like 4D-StOP, the KP-Conv network can focus solely
on instance prediction. Following 4D-StOP, we min-
imize the Huber loss between the predicted center $c_p$ and the instance center $c_{gt}$. Notably, background
points such as buildings and vegetation are masked
out using the semantic class S, but this masking is
only applied to the loss calculation, not within the net-
work itself.
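A minimal PyTorch sketch of this masked center regression, with illustrative names and signatures (the actual implementation follows 4D-StOP):

```python
import torch
import torch.nn.functional as F

def center_offset_loss(points, pred_offsets, gt_centers, semantics, thing_ids):
    """points:       (M, 3) coordinates p
    pred_offsets: (M, 3) predicted offsets from the KP-Conv decoder
    gt_centers:   (M, 3) ground-truth instance centers c_gt
    semantics:    (M,)   semantic class per point (from the prior S)
    thing_ids:    class ids counted as foreground (things)"""
    pred_centers = points + pred_offsets  # c_p = p + offset
    # Stuff points are excluded from the loss only; they still pass
    # through the network, so no geometry is discarded.
    mask = torch.isin(semantics, torch.as_tensor(list(thing_ids)))
    if not mask.any():
        return pred_centers.sum() * 0.0  # no foreground in this sample
    return F.huber_loss(pred_centers[mask], gt_centers[mask])
```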
These coarse predicted centers are further refined
using the instance proposal module from 4D-StOP,
which generates the final instance masks. This mod-
ule first samples and groups the aggregated point
cloud based on the spatial and temporal proximity of
the predicted centers $c_p$, using Farthest Point Sampling (FPS) to select $K$ potential instance centers $c_K \subseteq c_p$. Points $p$ are then grouped with these cen-
ters if they fall within a predefined radius. Follow-
ing this grouping, a PointNet-like module (Qi et al.,
2017) with a multi-layer perceptron (MLP) is used
to extract advanced geometric embeddings from the
grouped points. These embeddings are further pro-
cessed by MLPs to predict refined proposal centers,
object radii, and bounding box sizes, the latter serv-
ing as an auxiliary loss to improve aggregation. This
aggregation-loss function, which incorporates errors
in center position, radius, and bounding box dimen-
sions, guides the refinement process.
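For illustration, the sampling-and-grouping step at the start of this module can be sketched in plain NumPy as follows; names are illustrative, and the actual module additionally operates on temporal proximity and learned features.

```python
import numpy as np

def farthest_point_sampling(centers, k):
    """Select k well-spread proposal indices from the predicted centers c_p."""
    selected = [0]
    dist = np.linalg.norm(centers - centers[0], axis=1)
    for _ in range(k - 1):
        idx = int(dist.argmax())  # point farthest from the selected set
        selected.append(idx)
        dist = np.minimum(dist, np.linalg.norm(centers - centers[idx], axis=1))
    return np.array(selected)

def group_by_radius(centers, proposal_idx, radius):
    """Group points whose predicted center lies within `radius` of a proposal."""
    return [np.where(np.linalg.norm(centers - centers[i], axis=1) < radius)[0]
            for i in proposal_idx]
```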
Finally, DBSCAN (Ester et al., 1996) clustering is
applied to the learned embeddings, aggregating points
within the same cluster to form the final instance
masks. For each instance, we assign the most fre-
quently occurring semantic label to all points within
the instance.
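A minimal sketch of this final step, assuming scikit-learn's DBSCAN and illustrative parameter values:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def final_instances(embeddings, semantics, eps=0.5, min_samples=5):
    """Cluster learned embeddings into instance masks and assign each
    instance its majority semantic label (eps/min_samples illustrative)."""
    instance_ids = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embeddings)
    refined = semantics.copy()
    for inst in np.unique(instance_ids):
        if inst == -1:  # DBSCAN noise: keep original per-point labels
            continue
        members = instance_ids == inst
        classes, counts = np.unique(semantics[members], return_counts=True)
        refined[members] = classes[counts.argmax()]  # most frequent label
    return instance_ids, refined
```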
4 EXPERIMENTS
Our experiments were conducted on the widely used
SemanticKITTI (Behley et al., 2019) dataset, which
is divided into training, validation, and test sets, en-
compassing more than 43K LiDAR scans captured by
a Velodyne HDL-64E laser scanner in diverse urban
settings. Each point within the LiDAR point clouds
is annotated with one of 19 semantic categories (e.g.
car, building, cyclist), and foreground classes also
have a unique instance ID that remains consistent
over time. Additionally, the dataset provides accu-
rate pose estimates for the ego vehicle at each time
step, allowing for the spatial and temporal aggrega-
tion of point clouds, which is crucial for 4D-Panoptic-
LiDAR-Segmentation.
We evaluated our method using the LiDAR
Segmentation and Tracking Quality Metric (LSTQ)
(Aygün et al., 2021), which assesses the performance of 4D panoptic segmentation approaches. The overall LSTQ metric is computed as the geometric mean of the classification score $S_{cls}$ and the association score $S_{assoc}$:

$LSTQ = \sqrt{S_{cls} \times S_{assoc}}$
The classification score $S_{cls}$ evaluates the method's ability to assign correct semantic labels to LiDAR points, calculated as the mean intersection over union (mIoU) across all classes. The association score $S_{assoc}$ measures the accuracy of instance tracking throughout the entire LiDAR sequence,
tracking throughout the entire LiDAR sequence,
independent of the predicted semantic class. The
geometric mean ensures that strong performance in
both classification and association tasks is necessary
to achieve a high overall LSTQ score.
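For intuition, the metric is trivial to compute once both scores are known; the following snippet reproduces our validation LSTQ from the Table 2 scores.

```python
import math

def lstq(s_cls, s_assoc):
    """Geometric mean of classification and association scores (both in [0, 1])."""
    return math.sqrt(s_cls * s_assoc)

# Our validation scores from Table 2, as fractions:
print(round(lstq(0.6636, 0.7487), 4))  # 0.7049, matching the reported LSTQ
```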
We refer to our proposed method as D-PLS (De-
coupled Panoptic LiDAR Segmentation). For the se-
mantic prior information, we use the original check-
point of the 2DPASS network to generate single-scan semantic predictions, which are subsequently incorporated during the temporal aggregation of the point cloud. The coarse instance segmentation stage is trained for 400K iterations before being frozen, followed by an additional 150K iterations to train the refinement stage.

Table 1: Ablation study of the components of our method on the SemanticKITTI validation set using N = 2 scans.

Configuration                                   LSTQ%   S_assoc%   S_cls%   IoU_Th%   IoU_St%
none (Baseline) (Kreuzberg et al., 2022)        58.01   65.50      51.38    45.15     60.86
One-Hot Semantics                               69.50   73.00      66.17    69.93     69.45
Semantic Confidences                            68.71   71.42      66.10    69.75     69.45
One-Hot Semantics + Auxiliary Semantic Loss     69.26   72.53      66.13    69.84     69.45
Semantic Confidences + Auxiliary Semantic Loss  69.34   72.63      66.20    70.00     69.45
4.1 Ablation Study
We conducted an ablation study to assess the contri-
bution of each component in our approach. In this
study, 4D-StOP serves as the baseline, predicting both
semantic and instance labels directly from a single
spatio-temporally aggregated point cloud. In con-
trast, our method first predicts single-scan semantic
labels using a semantic network before aggregating
point clouds across multiple time steps. We evaluate
whether this semantic prior information is best added
as one-hot semantics or as semantic confidences.
We also experiment with supervising the instance stage of our method with an auxiliary semantic segmentation loss, in order to further refine the semantic prior going into the network. This experiment determines whether jointly predicting semantic and instance information is detrimental.
Table 1 presents the results, where N = 2 scans
were aggregated spatio-temporally during training
and evaluation. Note that the baseline results dif-
fer from the original 4D-StOP paper, as we reduced
the maximum points per batch to accommodate our
hardware limitations. As a result, both the baseline
and our method were retrained under identical con-
ditions to ensure a fair comparison. Our findings in-
dicate that introducing one-hot semantics as prior in-
formation for instance segmentation results in a clear
improvement in LSTQ compared to the baseline. In
particular, the classification score $S_{cls}$ saw significant
enhancement, demonstrating that network architec-
tures and training strategies developed for single-scan
data can be highly beneficial in the context of panop-
tic segmentation. This suggests that future advance-
ments in single-scan segmentation could further im-
prove panoptic segmentation.
Moreover, the use of semantic predictions as prior
information has positively impacted the association
score $S_{assoc}$, leading to improved instance predictions.
Our hypothesis that decoupling semantic and instance
predictions benefits panoptic segmentation is further
supported by the observation that retaining the auxil-
iary semantic loss negatively affects the instance seg-
mentation stage. We use the configuration leading to
the best result for comparison with other state-of-the-
art methods in the next section.
5 COMPARISON WITH STATE OF THE ART
For our final evaluation, we compared our approach
with the current state-of-the-art methods. Based on
the results from the ablation study in Section 4.1, we
use one-hot semantics as the semantic prior and retrain the model with N = 4 spatio-temporally aggregated scans. This setting enables a fair comparison with the state of the art. The results, shown in
Table 2, demonstrate the effectiveness of our method,
which outperforms the state of the art, except for
Mask4D. However, unlike our approach, Mask4D is
not modular, meaning that integrating new develop-
ments from single-scan applications requires adapta-
tions and retraining.
Table 2: Comparison of our method using N = 4 scans with others on the SemanticKITTI validation set.

Method                               LSTQ%   S_assoc%   S_cls%   IoU_Th%   IoU_St%
4D-PLS (Aygün et al., 2021)          59.86   58.79      60.95    64.96     63.06
CA-Net (Marcuzzi et al., 2022)       -       72.90      -        -         -
4D-StOP (Kreuzberg et al., 2022)     66.40   71.80      61.40    64.90     64.10
Eq-4D-PLS (Zhu et al., 2023)         65.00   67.70      62.30    64.60     66.40
4D-DS-Net (Hong et al., 2022)        68.00   71.30      64.80    64.50     65.30
Mask4D (Marcuzzi et al., 2023)       71.40   75.40      67.50    65.80     69.90
Eq-4D-StOP (Zhu et al., 2023)        70.10   77.60      63.40    67.10     66.40
Mask4Former (Yilmaz et al., 2024)    70.50   74.30      66.90    67.10     66.60
D-PLS (ours)                         70.49   74.87      66.36    70.40     69.45
Our method shows a substantial improvement
over the baseline 4D-StOP, confirming the viability
of decoupling semantic segmentation from instance
mask prediction. This concept could also be applied
to methods like Mask4D, with further improvements
possible by incorporating recent advancements in se-
mantic segmentation networks such as Point Trans-
former V3 (Wu et al., 2024). We leave this for fu-
ture research. Notably, our method outperforms 4D-DS-Net, which uses semantic results to mask out points prior to instance refinement, indicating that avoiding semantic masking in the instance branch is beneficial.
6 CONCLUSION
This paper introduces a novel two-stage approach
to 4D Panoptic LiDAR Segmentation by decoupling
single-scan semantic and multi-scan instance segmen-
tation tasks. Our D-PLS framework leverages se-
mantic predictions as a coarse form of clustering and
prior information to aid in instance segmentation, sig-
nificantly outperforming the baseline, while offering
a modular and flexible solution. D-PLS capitalizes
on pretrained networks from the well-explored do-
main of single-scan semantic segmentation to sup-
ply these prior predictions, making it adaptable to a
wide range of architectures. Future work will ex-
plore extending our approach to diverse semantic and
panoptic networks. We validated our results using the
SemanticKITTI dataset, and aim to further advance
this research by incorporating label-efficient LiDAR
segmentation techniques (Reichardt et al., 2023), in-
stance specific augmentation (Reichardt et al., 2024)
and motion information (Rishav et al., 2020).
ACKNOWLEDGEMENTS
This work was funded by the German Federal
Ministry for Economic Affairs and Climate Action
(BMWK) under the grant SERiS (KK5335502LB3).
REFERENCES
Aygün, M., Osep, A., Weber, M., Maximov, M., Stachniss, C., Behley, J., and Leal-Taixe, L. (2021). 4D Panoptic LiDAR Segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR).
Behley, J., Garbade, M., Milioto, A., Quenzel, J., Behnke,
S., Stachniss, C., and Gall, J. (2019). SemanticKITTI:
A Dataset for Semantic Scene Understanding of Li-
DAR Sequences. In International Conference on
Computer Vision (ICCV).
Ester, M., Kriegel, H.-P., Sander, J., Xu, X., et al. (1996).
A density-based algorithm for discovering clusters in
large spatial databases with noise. In KDD.
Hong, F., Zhou, H., Zhu, X., Li, H., and Liu, Z. (2021).
Lidar-based panoptic segmentation via dynamic shift-
ing network. In Conference on Computer Vision and
Pattern Recognition (CVPR).
Hong, F., Zhou, H., Zhu, X., Li, H., and Liu, Z. (2022).
Lidar-based 4d panoptic segmentation via dynamic
shifting network. arXiv preprint arXiv:2203.07186.
Kreuzberg, L., Zulfikar, I. E., Mahadevan, S., Engelmann,
F., and Leibe, B. (2022). 4d-stop: Panoptic segmenta-
tion of 4d lidar using spatio-temporal object proposal
generation and aggregation. In European Conference
on Computer Vision (ECCV).
Marcuzzi, R., Nunes, L., Wiesmann, L., Marks, E., Behley,
J., and Stachniss, C. (2023). Mask4d: End-to-end
mask-based 4d panoptic segmentation for lidar se-
quences. Robotics and Automation Letters (RA-L).
Marcuzzi, R., Nunes, L., Wiesmann, L., Vizzo, I., Behley,
J., and Stachniss, C. (2022). Contrastive instance
association for 4d panoptic segmentation using se-
quences of 3d lidar scans. Robotics and Automation
Letters (RA-L).
Qi, C. R., Su, H., Mo, K., and Guibas, L. J. (2017). Pointnet:
Deep learning on point sets for 3d classification and
segmentation. In Conference on Computer Vision and
Pattern Recognition.
Reichardt, L., Ebert, N., and Wasenmüller, O. (2023). 360° from a single camera: A few-shot approach for lidar
segmentation. In International Conference on Com-
puter Vision Workshop (ICCVW).
Reichardt, L., Uhr, L., and Wasenmüller, O. (2024).
Text3daug – prompted instance augmentation for lidar
perception. In International Conference on Intelligent
Robots and Systems (IROS).
Rishav, R., Battrawy, R., Schuster, R., Wasenmüller, O., and
Stricker, D. (2020). Deeplidarflow: A deep learning
architecture for scene flow estimation using monocu-
lar camera and sparse lidar. In IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS).
Tang, H., Liu, Z., Zhao, S., Lin, Y., Lin, J., Wang, H.,
and Han, S. (2020). Searching efficient 3d architec-
tures with sparse point-voxel convolution. In Euro-
pean Conference on Computer Vision (ECCV).
Thomas, H., Qi, C. R., Deschaud, J.-E., Marcotegui, B.,
Goulette, F., and Guibas, L. J. (2019). Kpconv: Flex-
ible and deformable convolution for point clouds. In
International Conference on Computer Vision (ICCV).
Wu, X., Jiang, L., Wang, P.-S., Liu, Z., Liu, X., Qiao, Y.,
Ouyang, W., He, T., and Zhao, H. (2024). Point trans-
former v3: Simpler faster stronger. In Conference on
Computer Vision and Pattern Recognition (CVPR).
Yan, X., Gao, J., Zheng, C., Zheng, C., Zhang, R., Cui, S.,
and Li, Z. (2022). 2dpass: 2d priors assisted seman-
tic segmentation on lidar point clouds. In European
Conference on Computer Vision (ECCV).
Yilmaz, K., Schult, J., Nekrasov, A., and Leibe, B. (2024).
Mask4former: Mask transformer for 4d panoptic seg-
mentation. In International Conference on Robotics
and Automation (ICRA).
Zhu, M., Han, S., Cai, H., Borse, S., Ghaffari, M., and
Porikli, F. (2023). 4d panoptic segmentation as invari-
ant and equivariant field prediction. In International
Conference on Computer Vision (ICCV).