DaDe: Delay-Adaptive Detector for Streaming Perception

Wonwoo Jo^{1,∗}, Kyungshin Lee^{2,†}, Jaewon Baik^{1,∗}, Sangsun Lee^{1,∗}, Dongho Choi^{1,∗} and Hyunkyoo Park^{1,∗}

^{1} C&BIS Co.,Ltd., Republic of Korea

^{2} Independent Researcher, Republic of Korea

Keywords: Streaming Perception, Real-Time Processing, Object Detection.

Abstract:

Recognizing the surrounding environment at low latency is critical in autonomous driving. In a real-time environment, the surroundings have already changed by the time processing finishes. Current detection models cannot deal with changes in the environment that occur during processing. Streaming perception was proposed to assess the latency and accuracy of real-time video perception together. However, additional problems arise in real-world applications due to limited hardware resources, high temperatures, and other factors. In this study, we develop a model that reflects processing delays in real time and produces the most reasonable results. By incorporating the proposed feature queue and feature select module, the system gains the ability to forecast specific time steps without any additional computational cost. Our method is tested on the Argoverse-HD dataset. Under delay, it achieves higher performance than the state-of-the-art methods (as of December 2022) in various environments. The code is available at https://github.com/danjos95/DADE.

1 INTRODUCTION

Recognizing the surrounding environment and reacting with low latency is critical in autonomous driving for a safe and comfortable driving experience. In practice, the gap between the input data and the surrounding environment widens as the processing latency increases. As shown in Figure 1, the surrounding environment changes between the sensor input time t and the time t + n at which the model finishes processing. To address this issue, some detectors (Redmon and Farhadi, 2018)(Ge et al., 2021) focus on lowering the latency so that the entire processing completes before the next sensor input. This appears reasonable, but the processing delay still causes a gap between the results and the changing environment. Furthermore, current image detection metrics such as average precision and mean average precision do not consider a real-time online perception environment, which leads detectors to prioritize accuracy over delay. To evaluate streaming performance, (Li et al., 2020) proposed a new metric, "streaming accuracy", which integrates latency and accuracy into a single metric for real-time online perception. Strong detectors (He et al., 2017)(Cai and Vasconcelos, 2018) showed a significant performance drop in streaming perception. StreamYOLO (Yang et al., 2022) built a model that predicts future frames by combining the previous and current frames.

Figure 1: Visualization of the results of detectors (red boxes). Ground truth (green boxes) changes due to image flow. Our method can detect objects correctly even after multiple time steps pass during processing.

It achieved state-of-the-art streaming AP performance by successfully detecting future frames while keeping real-time performance. In a real-world environment, however, processing time varies while the model is running for various reasons (e.g., temperature, multi-modal multi-task and sequential processing). This requires detectors that can perform multi-time-step predictions. For example, in Figure 1(b), StreamYOLO outperforms the baseline detector in predicting the next time-step frame when the processing time is shorter than the inter-frame time. When the processing time exceeds the inter-frame time, the system begins to collapse and produces poor prediction results, as shown in Result (t + n) in Figure 1. The proposed system functions properly even when the processing time is unexpectedly long and provides the best detection for that time.

In this study, we construct a model that provides insight into the required time-step frame. The feature queue module saves multi-time-step image features, which removes the need for additional feature extraction and its computational cost. The dynamic feature select module constantly monitors the processing delay and chooses the best feature for the estimated delay. We use StreamYOLO's dual flow perception (DFP) module, which fuses the previous and current frames to capture the moving trend of objects. By fusing the features chosen by the dynamic feature select module, the model can capture the moving trend for the target time step. Experiments are conducted on the Argoverse-HD (Chang et al., 2019) dataset under various delay settings. Our method showed significant improvement over the baseline (StreamYOLO), and the difference between the baseline and ours widened as the mean processing delay increased.

Jo, W., Lee, K., Baik, J., Lee, S., Choi, D. and Park, H. DaDe: Delay-Adaptive Detector for Streaming Perception. DOI: 10.5220/0011610700003417. In Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2023) - Volume 5: VISAPP, pages 39-46. ISBN: 978-989-758-634-7; ISSN: 2184-4321. © 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

This research makes three main contributions:

• We introduce the delay-adaptive detector (DaDe), which produces future results tailored to the environment at output time. Processing delay cannot be assumed stable in a real-time environment and must be considered. We improved the baseline so that it can handle unexpected delays, with no computational cost or accuracy trade-off.

• We create a simple feature control module that chooses the best image feature based on the current delay trend. The feature queue module stores previous features to avoid additional computation. The feature select module monitors pipeline delays and chooses the best feature for creating the moving trend. Combined with the DFP module from StreamYOLO, this yields accurate results for the target time.

• Our system outperformed the state-of-the-art model (StreamYOLO) in delay-varying environments. We changed the mean delay time to simulate delayed environments during the evaluation. Our method achieved 0.7 to 1.5% higher sAP than the state-of-the-art method. This work also shows that considerable room for improvement remains in delay-critical systems.

2 RELATED WORK

2.1 Image Object Detection

Both latency and accuracy are critical factors in object detection. Two-stage detectors (Girshick et al., 2013)(Girshick, 2015)(He et al., 2017)(Lin et al., 2016) use a region proposal system and focus on accuracy in offline applications. However, in the real world, latency is more important than accuracy because the environment changes during processing. One-stage detectors (Lin et al., 2017)(Ma et al., 2021) have an advantage in processing time over region-proposal-based methods. The YOLO series (Redmon and Farhadi, 2018)(Ge et al., 2021) is one of the mainstream families of one-stage detectors. Our research is based on the YOLOX (Ge et al., 2021) detector, which adds many advanced detection techniques (anchor-free design, decoupled head, etc.) to YOLOv3 (Redmon and Farhadi, 2018) to achieve strong performance.

2.2 Video Detection and Prediction

There have been attempts to improve detection performance by using previous images from the video stream. Recent methods (Bergmann et al., 2019)(Chen et al., 2020)(Deng et al., 2019)(Zhu et al., 2017) use attention, optical flow, and tracking to aggregate image features and achieve high detection accuracy. Video future frame predictors (Bhattacharyya et al., 2019)(Luc et al., 2017) create unobserved future images from previous images. ConvLSTM-based autoencoders (Chang et al., 2021)(Lee et al., 2021)(Chaabane et al., 2020)(Jin et al., 2020)(Wang et al., 2020) encode previous frames into representations, from which a decoder generates future frame pixels. However, these methods cannot be used in real-time applications because they are designed for an offline environment and do not account for latency.

2.3 Progressive Anytime Algorithm

There are previous studies on planning systems under resource constraints (Boddy and Dean, 1994) and flexible computation. An anytime algorithm can return results at any point in time, and the quality of the results gradually increases with the processing time (Zilberstein, 1996). However, these studies do not consider environmental changes that occur during processing, which leaves a gap between the actual environment and the results. Our efforts are aimed at making the model more robust, even in delayed situations, so that it produces results appropriate to the current time even when the processing time becomes unexpectedly long.

VISAPP 2023 - 18th International Conference on Computer Vision Theory and Applications

Figure 2: The Delay-adaptive Detector for delay-adaptive streaming perception. Image features are extracted by CSPDarknet-53 and PANet and stored in the fixed-length Feature Queue Module. The Feature Select Module estimates the delay from the current preprocessing delay and the most recent inference delay, and chooses appropriate features from the stored features. The selected features are used to perform detection in the StreamYOLO pipeline.

2.4 Streaming Perception

There are two types of detectors in a real-time system: real-time and non-real-time. A real-time detector can complete all processing steps before the next frame arrives. Previously, there was no metric that evaluated latency and accuracy together. As latency becomes more important in real-world applications, (Li et al., 2020) integrated latency and accuracy into a single metric: streaming AP (sAP), which evaluates accuracy while considering time delay. That work also suggests methods for recovering from drops in sAP, such as decision-theoretic scheduling, asynchronous tracking, and future forecasting. StreamYOLO (Yang et al., 2022) reduced streaming perception to the task of building a real-time detector that predicts the next frame. It achieved state-of-the-art performance, but when the real-time constraint is violated, the system can collapse because it can only produce the next time-step result.

2.5 Delay Variance

If an additional delay occurs during processing, the gap between the results and the surrounding environment grows larger. This can lead to bad decisions and, in turn, to serious safety issues. Delays can occur for various reasons while running real-world applications.

2.5.1 Temperature

Temperature is a well-known issue in real-time systems. Performance may be throttled to avoid permanent damage to the processing chips. In our hardware environment, processing speed drops to 70% at 90 °C, with a performance drop of up to 50% at 100 °C. In autonomous driving, temperature control is even harder because the device operates in various outdoor environments.
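As a rough illustration of why throttling matters, the sketch below assumes that latency scales inversely with processing speed (an assumption for illustration, not a measurement from this study) and checks whether a frame still fits the inter-frame budget at 30 FPS. The function names and the 24 ms base latency are hypothetical.

```python
def throttled_latency_ms(base_latency_ms: float, speed_fraction: float) -> float:
    """Latency after throttling, assuming latency is inversely proportional to speed."""
    if not 0.0 < speed_fraction <= 1.0:
        raise ValueError("speed_fraction must be in (0, 1]")
    return base_latency_ms / speed_fraction


def meets_real_time(latency_ms: float, fps: float = 30.0) -> bool:
    """True if processing finishes within one inter-frame interval."""
    return latency_ms < 1000.0 / fps


base = 24.0  # illustrative full-speed latency in milliseconds
for speed in (1.0, 0.7, 0.5):
    latency = throttled_latency_ms(base, speed)
    print(f"speed {speed:.0%}: {latency:.1f} ms, real-time: {meets_real_time(latency)}")
```

Under this assumption, a detector that comfortably meets the frame budget at full speed already misses it once speed drops to 70%.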

2.5.2 Multiple Sensor Configuration

Using multiple sensors is critical for obtaining diverse data or multiple views of the surrounding environment. It is becoming more common to use lidar-camera fusion (Liu et al., 2022)(Chen et al., 2016) or multi-camera systems (Li et al., 2022)(Wang et al., 2021) for surround recognition. Each sensor has its own frequency and time delay. Data from fast sensors must therefore wait until data from all sensors has arrived, which can add to the delay.
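The waiting effect can be sketched with a few lines of Python. This is a minimal illustration, assuming that fusion starts only once the slowest sensor has delivered its sample; the sensor names and arrival times are hypothetical.

```python
from typing import Dict


def sync_wait_ms(arrival_ms: Dict[str, float]) -> Dict[str, float]:
    """How long each sensor's sample waits until the slowest sensor has arrived."""
    ready_at = max(arrival_ms.values())
    return {name: ready_at - t for name, t in arrival_ms.items()}


# Hypothetical arrival times, in milliseconds after a common trigger.
arrivals = {"camera": 5.0, "lidar": 42.0}
print(sync_wait_ms(arrivals))  # {'camera': 37.0, 'lidar': 0.0}
```

In this example the camera sample is already 37 ms old when fusion begins, before any preprocessing or inference time is added.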

2.5.3 Multiple Model Deploy

To build an autonomous driving system, multiple tasks (for example, lane segmentation, traffic light detection, 3D object detection, and multi-object tracking) must be performed in parallel and dynamically. (Lobanov and Sholomov, 2021) proposes a multiple detection model with a shared backbone to reduce computational costs. In any case, running multiple models can cause unexpected delays in the system's sequential tasks. If the processing delay exceeds the inter-frame spacing, the real-time system fails and the detector results become out of date. We changed the mean delay time to simulate delayed environments during the evaluation.

2.6 Preprocessing and Inference Delay

A sensor processing system has a preprocessing stage and an inference stage. During the preprocessing stage, the main processing unit receives raw sensor data and refines it into the shape the model expects. After preprocessing, the input data flows into a neural processing unit, which produces results through neural networks. Preprocessing is fast, but because it runs sequentially, its delay is easily changed by a processing bottleneck. As the task becomes more demanding, the number of sequential operations in the main processing unit increases, which can directly affect the delay.
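One way to observe the two delays separately is to time each stage with a monotonic clock. The sketch below is a minimal illustration; `preprocess` and `infer` are hypothetical stand-ins for the real stages, and on a GPU an explicit device synchronization would be needed before reading the clock.

```python
import time
from typing import Any, Callable, Tuple


def timed(stage: Callable[[Any], Any], data: Any) -> Tuple[Any, float]:
    """Run one pipeline stage and return (output, elapsed milliseconds)."""
    start = time.perf_counter()
    out = stage(data)
    return out, (time.perf_counter() - start) * 1000.0


def preprocess(raw):
    # Hypothetical stand-in: scale raw 8-bit values to [0, 1].
    return [x / 255.0 for x in raw]


def infer(tensor):
    # Hypothetical stand-in for the neural network.
    return sum(tensor)


x, pre_ms = timed(preprocess, [0, 128, 255])
y, inf_ms = timed(infer, x)
print(f"preprocessing: {pre_ms:.3f} ms, inference: {inf_ms:.3f} ms")
```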

3 METHODOLOGY

In this study, we present a simple and effective method for obtaining results for the desired future time. As shown in Figure 2, the Delay-adaptive Detector (DaDe) can return results for the desired time step by combining the feature queue module, the feature select module, and the Dual-Flow Perception (DFP) module. We avoid adding extra computation to the processing pipeline to maintain real-time performance.

3.1 Dual Flow Perception Module

StreamYOLO (Yang et al., 2022) proposed the DFP module for tracking moving trends. DFP uses the previous frame's feature map. The dynamic flow in DFP fuses the features of two adjacent frames to create an object moving trend, whereas the static flow is created by standard feature extraction from the current frame only. For the dynamic flow, a 1x1 convolution layer followed by batch normalization and SiLU (Ramachandran et al., 2017) halves the number of channels of the previous and current features. DFP then concatenates these two reduced features to generate the dynamic features. By combining the dynamic flow and the static flow, the detector can sense the object's moving trends and detect future frames without additional latency. To advance this method, we use a multi-time-step feature fusing system to create a delay-adaptive detector. With the feature queue and feature select modules, the model can make multi-time-step predictions.

Figure 3: Pipeline of DaDe. The feature queue module stores previous image features to avoid additional computation. The feature select module selects features from the queue based on the current delay time. The dual flow perception module generates object moving trends based on the input features.

3.2 Feature Queue Module

The feature queue module stores extracted image features in a feature queue of fixed length. Features are denoted $F_t$, where t is the corresponding time step. The computational cost of saving old image features is the same as in StreamYOLO. The feature queue module keeps the features of the four previous frames. Storing previous features avoids additional processing while occupying little additional memory (100 MB for each frame feature). This enables the model to use features from older time steps.
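A fixed-length queue of this kind can be sketched with `collections.deque`. This is a minimal illustration rather than the authors' implementation; the class name `FeatureQueue` and its methods are hypothetical, and features are treated as opaque objects.

```python
from collections import deque
from typing import Any, Deque, Optional


class FeatureQueue:
    """Fixed-length store of the most recent frame features, newest first."""

    def __init__(self, max_history: int = 4) -> None:
        # Holds the current feature plus `max_history` previous ones.
        self._items: Deque[Any] = deque(maxlen=max_history + 1)

    def push(self, feature: Any) -> None:
        """Add the newest feature; the oldest one is dropped once the queue is full."""
        self._items.appendleft(feature)

    def get(self, steps_back: int) -> Optional[Any]:
        """Feature from `steps_back` frames ago, or None if it is not stored."""
        if 0 <= steps_back < len(self._items):
            return self._items[steps_back]
        return None

    def __len__(self) -> int:
        return len(self._items)


queue = FeatureQueue()
for t in range(7):
    queue.push(f"F{t}")  # strings stand in for feature maps
print(queue.get(0), queue.get(4), queue.get(5))  # F6 F2 None
```

Because only references to already computed features are kept, pushing and reading cost no extra feature extraction.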


3.3 Delay Analysis

The preprocessing and inference times at time step t are denoted $P_t$ and $I_t$, respectively. The inter-frame time between input frames is denoted $T$. The delay trend is calculated by adding the most recent inference delay to the current preprocessing delay. The final delay trend $D_t$ is as follows:

$$D_t = P_t + I_{t-1} \qquad (1)$$
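The delay trend (current preprocessing delay plus the most recent inference delay) can be tracked with a small helper. This is a sketch under the assumption that delays are measured in milliseconds; the class name `DelayMonitor` is hypothetical.

```python
from typing import Optional


class DelayMonitor:
    """Tracks the most recent inference delay and reports the delay trend."""

    def __init__(self) -> None:
        self.last_inference_ms: Optional[float] = None

    def record_inference(self, inference_ms: float) -> None:
        """Store the inference delay of the frame that just finished."""
        self.last_inference_ms = inference_ms

    def trend(self, preprocessing_ms: float) -> float:
        """Current preprocessing delay plus the previous inference delay.

        Before any inference has finished, only the preprocessing delay is known.
        """
        previous = 0.0 if self.last_inference_ms is None else self.last_inference_ms
        return preprocessing_ms + previous


monitor = DelayMonitor()
print(monitor.trend(3.0))        # 3.0 (no inference recorded yet)
monitor.record_inference(21.0)
print(monitor.trend(3.0))        # 24.0
```

Using the previous inference delay is what makes the estimate available before the current inference starts.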

3.4 Feature Select Module

The feature select module chooses the two time-step features that best match the current delay trend. When the delay trend is passed to the feature select module, the target time step n is calculated using Equation 2. The expected time step is the quotient of the delay trend over the inter-frame time, and the target time step n is the time step following the division result.

$$n = \left\lfloor \frac{D_t}{T} \right\rfloor + 1 \qquad (2)$$

In the feature select module, the features for the required time step are extracted as in the equation below.

$$\mathrm{FSM}([F_t : F_{t-4}], D_t) = (F_t, F_{t-n}) \qquad (3)$$

If there is no previous inference delay, the result of FSM should be $(F_t, F_{t-1})$. If no suitable feature exists for n, the nearest F should be chosen. In summary, the proposed model calculates the delay trend by combining the current preprocessing delay and the last inference delay. After determining the desired time step using Equation 2, the module chooses the stored features using Equation 3. The DFP then creates the object moving trend for that dynamic time step from the selected features. The entire feature flow is illustrated in Figure 3.
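The selection logic described in this section can be sketched as follows. This is an illustration under stated assumptions, not the released code: the target step is taken as the integer quotient of delay over inter-frame time plus one, the inter-frame time is 1000/30 ms, the stored features are ordered newest first, and the names `target_step` and `select_features` are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple, TypeVar

F = TypeVar("F")
INTER_FRAME_MS = 1000.0 / 30.0  # assumed 30 FPS input stream


def target_step(delay_ms: Optional[float], inter_frame_ms: float = INTER_FRAME_MS) -> int:
    """Integer quotient of delay over inter-frame time, plus one.

    Falls back to the next frame (n = 1) when no delay is known yet.
    """
    if delay_ms is None:
        return 1
    return math.floor(delay_ms / inter_frame_ms) + 1


def select_features(history: Sequence[F], delay_ms: Optional[float]) -> Tuple[F, F]:
    """Return the current feature and the one n steps back.

    `history[k]` holds the feature from k frames ago. If the feature for n is
    not stored, the nearest (oldest available) feature is used instead.
    """
    if not history:
        raise ValueError("history is empty")
    n = target_step(delay_ms)
    nearest = min(n, len(history) - 1)
    return history[0], history[nearest]


history = ["F_t", "F_t-1", "F_t-2", "F_t-3", "F_t-4"]
print(select_features(history, 24.0))   # ('F_t', 'F_t-1')
print(select_features(history, 40.0))   # ('F_t', 'F_t-2')
print(select_features(history, 180.0))  # ('F_t', 'F_t-4'), clamped to the oldest entry
```

The selection is a table lookup on features that are already stored, so it adds no measurable computation to the pipeline.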

4 EXPERIMENTS

4.1 Environment Settings

4.1.1 Dataset

Argoverse-HD (Chang et al., 2019) provides high-resolution (1920x1080), high-frame-rate (30 FPS) driving video data. The Argoverse-HD validation set contains 24 videos of 15 to 30 seconds each, totaling 15k image frames. Streaming perception (Li et al., 2020) adds dense 2D annotations in MS COCO (Lin et al., 2014) format.

Table 1: Delay analysis in various delay environments. All delays are measured in milliseconds.

Environment   Method        Mean delay   Standard deviation   Minimum delay   Maximum delay
Low           StreamYOLO    23.5         3.2                  21.8            69.1
Low           DADE(Ours)    24.1         3.66                 21.9            66
Medium        StreamYOLO    40.1         9.35                 22.3            86.8
Medium        DADE(Ours)    39.3         9.22                 22.3            88
High          StreamYOLO    63           12.5                 41.7            121
High          DADE(Ours)    63.1         12.7                 41.3            124

4.1.2 Streaming AP

Streaming AP (Li et al., 2020) is used to evaluate the experiments. During evaluation, the model updates the output buffer with its latest prediction of the current state of the world. At the same time, the benchmark constantly queries the output buffer for estimates of the world state, so each output is evaluated against the ground-truth frame of the world state at query time. Similar to the AP metric from COCO (Lin et al., 2014), sAP evaluates the average mAP over intersection-over-union (IoU) thresholds at 0.5 and 0.75. $sAP_s$, $sAP_m$, and $sAP_l$ denote sAP for small, medium, and large objects.
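The pairing between ground-truth frames and buffered outputs can be sketched as below. This is a simplified illustration of the idea, not the official benchmark code: each frame is matched with the latest output that had finished by the frame's timestamp. The function name and the timestamps are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence


def match_outputs_to_frames(
    frame_times_ms: Sequence[float],
    output_finish_times_ms: Sequence[float],
) -> List[Optional[int]]:
    """For each frame time, index of the latest output finished by then.

    `output_finish_times_ms` must be sorted in ascending order. A frame with no
    finished output yet is matched with None.
    """
    matches: List[Optional[int]] = []
    for t in frame_times_ms:
        k = bisect_right(output_finish_times_ms, t) - 1  # last finish time <= t
        matches.append(k if k >= 0 else None)
    return matches


frames = [0.0, 33.0, 66.0, 99.0]   # ground-truth frame timestamps
finished = [24.0, 64.0, 104.0]     # times at which outputs reached the buffer
print(match_outputs_to_frames(frames, finished))  # [None, 0, 1, 1]
```

In the example, the frame at 99 ms is still scored against the output that finished at 64 ms, which is why a slow detector is penalized even when each individual detection is accurate.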

4.1.3 Platforms

Each experiment in this study was run on an RTX 3070Ti GPU (8 GB VRAM, TDP 80 W), an Intel 12700H CPU, and 16 GB RAM. The software environment was PyTorch 1.7.1 with CUDA 11.4.

4.1.4 Delay

To change the mean delay time, we deployed multiple DaDe models in a single environment. Table 1 shows the delay environments (Low, Medium, and High) obtained by running up to 3 models in parallel. The inter-frame spacing is 33 ms at 30 FPS, so the processing time should be less than 33 ms to maintain real-time performance. In the low delay environment, the average delay is 24 ms, so StreamYOLO and DaDe can process each frame before the next frame arrives. In the medium delay environment, the average delay is 40 ms, so some frames fail to meet the real-time constraint. As a result, their evaluation is based on frame(t + 2) rather than frame(t + 1). The mean delay time in the high delay environment is more than 60 ms, so the difference between the assumed future frame and the real environment increases in the baseline. Figure 4 shows the processing delay distribution in each delay environment. The vertical dotted lines show the inter-frame spacings at 30 frames per second (FPS). Frames in each inter-frame interval should produce the results for the corresponding time step.

Figure 4: Visualization of the delay distribution for various delay environments. The distribution gradually spreads as the mean delay lengthens.
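As a worked check of the paragraph above, the mean delays of DaDe in Table 1 can be converted into target time steps. The calculation assumes that the target step is the integer quotient of delay over inter-frame time plus one, as described in Section 3.4, and takes $T \approx 33.3$ ms for 30 FPS. It is illustrative arithmetic only, not an additional measurement.

```latex
\begin{align*}
T &= \tfrac{1000}{30}\,\text{ms} \approx 33.3\,\text{ms} \\
\text{Low:}\quad n &= \lfloor 24.1 / 33.3 \rfloor + 1 = 0 + 1 = 1 \\
\text{Medium:}\quad n &= \lfloor 39.3 / 33.3 \rfloor + 1 = 1 + 1 = 2 \\
\text{High:}\quad n &= \lfloor 63.1 / 33.3 \rfloor + 1 = 1 + 1 = 2 \\
\text{High (maximum, 124 ms):}\quad n &= \lfloor 124 / 33.3 \rfloor + 1 = 3 + 1 = 4
\end{align*}
```

At the mean delay, the medium and high environments both target frame(t + 2), while the slowest frames in the high environment reach four steps ahead, which is the limit of a queue that keeps four previous features.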

4.2 Results

Table 2 shows the results of the streaming perception methods. In the low-delay environment, the baseline (StreamYOLO) and ours (DaDe) perform the same. Standard YOLOX suffers a significant performance drop because it lacks future prediction. As the mean delay increases and frames begin to fail to finish processing before the next frame arrives, the baseline fails to follow the trend of future frames. In high-delay conditions, all frames fail to be detected in real time, which widens the gap between the baseline and the proposed method. As shown in Table 2, a 1.5% improvement in sAP over the baseline is achieved in the high delay environment.

Table 2: Performance of each method on the Argoverse-HD dataset. YOLOX is the standard real-time detector.

Method        sAP    sAP variants (IoU thresholds and object sizes)

Low (mean delay: 24 ± 1ms)
YOLOX         31.7   52.3   29.7   12.5   54.8   30.1
StreamYOLO    36.7   63.8   37.0   14.6   57.9   37.2
DADE(Ours)    36.7   63.9   36.9   14.6   57.9   37.3

Medium (mean delay: 39 ± 1ms)
YOLOX         23.0   40.0   21.3    6.1   45.5   21.1
StreamYOLO    29.1   48.1   27.4    9.5   51.6   27.7
DADE(Ours)    29.8   50.1   28.3   10.8   52.5   28.4

High (mean delay: 63 ± 1ms)
YOLOX         19.2   31.2   15.5    5.1   38.8   16.3
StreamYOLO    25.7   40.2   23.9    8.6   47.3   23.7
DADE(Ours)    27.2   41.5   26.1    9.5   48.2   25.9

Figure 5 shows a visualization of each method in the high delay environment. Ours outperforms the others in various conditions and scenarios.
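The reported gains can be re-derived from the sAP column of Table 2 with a few lines of Python. The snippet only repeats values already listed in the table; the dictionary layout and the function name are illustrative.

```python
# sAP values copied from Table 2 (first metric column).
sap = {
    "Low": {"StreamYOLO": 36.7, "DaDe": 36.7},
    "Medium": {"StreamYOLO": 29.1, "DaDe": 29.8},
    "High": {"StreamYOLO": 25.7, "DaDe": 27.2},
}


def sap_gain(environment: str) -> float:
    """Absolute sAP difference between DaDe and the StreamYOLO baseline."""
    row = sap[environment]
    return round(row["DaDe"] - row["StreamYOLO"], 1)


gains = {env: sap_gain(env) for env in sap}
print(gains)  # {'Low': 0.0, 'Medium': 0.7, 'High': 1.5}
```

The difference grows with the mean delay, matching the 0.7 to 1.5 range stated in the contributions.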

5 CONCLUSIONS

This study proposes a simple but effective method for detecting objects in real time. Multi-time-step prediction can be performed using the feature queue module and feature select module without any additional computation. The method achieved state-of-the-art performance in delayed environments. Furthermore, the real-time feature operation in DaDe can significantly improve the performance of other real-world object detection tasks, since predicting the appropriate time step can significantly improve their localization accuracy. As for future research, our method currently considers only one time-step feature to generate the object's moving trend. Accuracy could be improved further if every time-step feature were considered within a reasonable computation cost.

REFERENCES

Bergmann, P., Meinhardt, T., and Leal-Taixé, L. (2019). Tracking without bells and whistles. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 941–951. IEEE.

Bhattacharyya, A., Fritz, M., and Schiele, B. (2019). Bayesian prediction of future street scenes using synthetic likelihoods. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Boddy, M. S. and Dean, T. L. (1994). Deliberation scheduling for problem solving in time-constrained environments. Artif. Intell., 67(2):245–285.


Figure 5: Qualitative analysis of results. Detection is conducted in a streaming environment with a high delay. DaDe can successfully predict the future environment by forming object-moving trends. It achieves better results not only for moving objects (top row) but also for detecting objects during ego-vehicle turns (bottom row).

Cai, Z. and Vasconcelos, N. (2018). Cascade R-CNN: delving into high quality object detection. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 6154–6162. Computer Vision Foundation / IEEE Computer Society.

Chaabane, M., Trabelsi, A., Blanchard, N., and Beveridge, J. R. (2020). Looking ahead: Anticipating pedestrians crossing with future frames prediction. In IEEE Winter Conference on Applications of Computer Vision, WACV 2020, Snowmass Village, CO, USA, March 1-5, 2020, pages 2286–2295. IEEE.

Chang, M., Lambert, J., Sangkloy, P., Singh, J., Bak, S., Hartnett, A., Wang, D., Carr, P., Lucey, S., Ramanan, D., and Hays, J. (2019). Argoverse: 3d tracking and forecasting with rich maps. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 8748–8757. Computer Vision Foundation / IEEE.

Chang, Z., Zhang, X., Wang, S., Ma, S., Ye, Y., Xinguang, X., and Gao, W. (2021). MAU: A motion-aware unit for video prediction and beyond. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W., editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 26950–26962.

Chen, X., Ma, H., Wan, J., Li, B., and Xia, T. (2016). Multi-view 3d object detection network for autonomous driving.

Chen, Y., Cao, Y., Hu, H., and Wang, L. (2020). Memory enhanced global-local aggregation for video object detection. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 10334–10343. Computer Vision Foundation / IEEE.

Deng, J., Pan, Y., Yao, T., Zhou, W., Li, H., and Mei, T. (2019). Relation distillation networks for video object detection. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 7022–7031. IEEE.

Ge, Z., Liu, S., Wang, F., Li, Z., and Sun, J. (2021). YOLOX: exceeding YOLO series in 2021. CoRR, abs/2107.08430.

Girshick, R. B. (2015). Fast R-CNN. CoRR, abs/1504.08083.

Girshick, R. B., Donahue, J., Darrell, T., and Malik, J. (2013). Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524.

He, K., Gkioxari, G., Dollár, P., and Girshick, R. B. (2017). Mask R-CNN. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2980–2988. IEEE Computer Society.

Jin, B., Hu, Y., Tang, Q., Niu, J., Shi, Z., Han, Y., and Li, X. (2020). Exploring spatial-temporal multi-frequency analysis for high-fidelity and temporal-consistency video prediction. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 4553–4562. Computer Vision Foundation / IEEE.

Lee, S., Kim, H. G., Choi, D. H., Kim, H., and Ro, Y. M. (2021). Video prediction recalling long-term motion context via memory alignment learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 3054–3063. Computer Vision Foundation / IEEE.

Li, M., Wang, Y., and Ramanan, D. (2020). Towards streaming perception. In Vedaldi, A., Bischof, H., Brox, T., and Frahm, J., editors, Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part II, volume 12347 of Lecture Notes in Computer Science, pages 473–488. Springer.

Li, Z., Wang, W., Li, H., Xie, E., Sima, C., Lu, T., Qiao, Y., and Dai, J. (2022). Bevformer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. arXiv preprint arXiv:2203.17270.

Lin, T., Dollár, P., Girshick, R. B., He, K., Hariharan, B., and Belongie, S. J. (2016). Feature pyramid networks for object detection. CoRR, abs/1612.03144.

Lin, T., Goyal, P., Girshick, R. B., He, K., and Dollár, P. (2017). Focal loss for dense object detection. CoRR, abs/1708.02002.

Lin, T., Maire, M., Belongie, S. J., Bourdev, L. D., Girshick, R. B., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: common objects in context. CoRR, abs/1405.0312.

Liu, Z., Tang, H., Amini, A., Yang, X., Mao, H., Rus, D., and Han, S. (2022). Bevfusion: Multi-task multi-sensor fusion with unified bird's-eye view representation. arXiv.

Lobanov, M. and Sholomov, D. (2021). Application of shared backbone dnns in adas perception systems. Proceedings of SPIE - The International Society for Optical Engineering, 11605.

Luc, P., Neverova, N., Couprie, C., Verbeek, J., and LeCun, Y. (2017). Predicting deeper into the future of semantic segmentation. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 648–657. IEEE Computer Society.

Ma, Y., Liu, S., Li, Z., and Sun, J. (2021). Iqdet: Instance-wise quality distribution sampling for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 1717–1725. Computer Vision Foundation / IEEE.

Ramachandran, P., Zoph, B., and Le, Q. V. (2017). Searching for activation functions. CoRR, abs/1710.05941.

Redmon, J. and Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv.

Wang, Y., Guizilini, V., Zhang, T., Wang, Y., Zhao, H., and Solomon, J. (2021). DETR3D: 3d object detection from multi-view images via 3d-to-2d queries. In Faust, A., Hsu, D., and Neumann, G., editors, Conference on Robot Learning, 8-11 November 2021, London, UK, volume 164 of Proceedings of Machine Learning Research, pages 180–191. PMLR.

Wang, Y., Wu, J., Long, M., and Tenenbaum, J. B. (2020). Probabilistic video prediction from noisy data with a posterior confidence. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 10827–10836. Computer Vision Foundation / IEEE.

Yang, J., Liu, S., Li, Z., Li, X., and Sun, J. (2022). Real-time object detection for streaming perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5385–5395.

Zhu, X., Wang, Y., Dai, J., Yuan, L., and Wei, Y. (2017). Flow-guided feature aggregation for video object detection. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 408–417. IEEE Computer Society.

Zilberstein, S. (1996). Using anytime algorithms in intelligent systems. AI Mag., 17(3):73–83.
