Low Latency Pedestrian Detection Based on Dynamic Vision Sensor and
RGB Camera Fusion
Bingyu Huang, Gianni Allebosch, Peter Veelaert, Tim Willems, Wilfried Philips and Jan Aelterman
TELIN-IPI, Ghent University, Sint-Pietersnieuwstraat, Ghent, Belgium
{bingyu.huang, gianni.allebosch, peter.veelaert, tim.willems, wilfried.philips, jan.aelterman}@ugent.be
ORCID (in author order): https://orcid.org/0000-0003-0258-2684, https://orcid.org/0000-0003-2502-3746, https://orcid.org/0000-0003-4746-9087, https://orcid.org/0000-0002-5264-919X, https://orcid.org/0000-0003-4456-4353, https://orcid.org/0000-0002-5543-2631
Keywords: Dynamic Vision Sensor, Motion Segmentation, Sensor Fusion, Autonomous Driving.
Abstract: Advanced driver assistance systems currently adopt RGB cameras as visual perception sensors, which rely primarily on static features and are limited in capturing dynamic changes due to fixed frame rates and motion blur. A very promising sensor alternative is the dynamic vision sensor (DVS), which has microsecond temporal resolution and records an asynchronous stream of per-pixel brightness changes, also known as an event stream. However, in autonomous driving scenarios, it is challenging to distinguish between events caused by the vehicle's motion and events caused by actual moving objects in the environment. To address this, we design a motion segmentation algorithm based on epipolar geometry and apply it to DVS data, effectively removing static background events and focusing solely on dynamic objects. Furthermore, we propose a system that fuses the dynamic information captured by event cameras with the rich appearance details from RGB cameras. Experiments show that our proposed method effectively improves detection performance while showing great potential in reducing decision latency.
1 INTRODUCTION
Autonomous driving systems rely on robust and reli-
able perception mechanisms to ensure safety and ef-
ficiency in dynamic environments. Motion segmen-
tation, the process of identifying and isolating mov-
ing objects within a scene, is a critical component
of visual perception (Kulchandani and Dangarwala,
2015). Segmenting a moving foreground against a static background is a relatively easy task for event cameras thanks to their dynamic response characteristics and edge-like output (Schraml et al., 2010). Detecting moving objects in autonomous driving scenarios is more challenging because the background itself changes and produces interfering signals. To counteract the interference of changing background pixels, approaches such as optical flow analysis (Kim and Kwon, 2015), motion clustering (Kim et al., 2010), and neural networks (Mane and Mangale, 2018) have been introduced to characterize the
a
https://orcid.org/0000-0003-0258-2684
b
https://orcid.org/0000-0003-2502-3746
c
https://orcid.org/0000-0003-4746-9087
d
https://orcid.org/0000-0002-5264-919X
e
https://orcid.org/0000-0003-4456-4353
f
https://orcid.org/0000-0002-5543-2631
difference between foreground and background. These methods all develop high-dimensional feature representations for backgrounds or targets and are robust in general scenarios. The disadvantage is that such condensed feature representations rely on prior dataset training and lose accuracy under occlusion or under-/over-exposure. Another limitation of traditional cameras is their inherent frame-based nature, which can lead to latency issues (Narasimhan and Nayar, 2003; Zhu et al., 2020; Chang et al., 2021), particularly in scenarios requiring rapid decision-making.
In contrast, Dynamic Vision Sensors (DVS), or event cameras, capture per-pixel brightness changes asynchronously with microsecond temporal resolution (Brandli et al., 2014; Taverni et al., 2018; Gallego et al., 2020). This technology inherently provides a high temporal resolution that can significantly reduce latency and enhance responsiveness in time-critical scenarios. Event cameras are particularly advantageous in situations with low lighting conditions or rapid motion.
However, in autonomous driving scenarios, it is
challenging to distinguish between events caused by
the vehicle’s ego-motion and events caused by actual
moving objects in the environment, such as pedestri-
ans, cyclists, or other vehicles. This difficulty arises
because the vehicle’s motion induces a large num-
ber of events from static objects, like buildings, road
signs, and trees, making it harder to identify and iso-
late meaningful events generated by truly dynamic
elements in the scene. Properly filtering out these
motion-induced events while retaining the relevant
ones is crucial for ensuring reliable and accurate per-
ception in fast-changing driving environments.
To address these challenges, we propose an epipo-
lar geometry-based method to remove events trig-
gered by static background objects. We further make
use of the data from DVS to complement dynamic
information for RGB detections and achieve a faster
detection response. In summary, the contributions of
this paper are:
- We introduce a novel motion segmentation algorithm for DVS employing the epipolar geometry principle.
- We propose a fusion method that makes use of motion information from DVS and appearance characteristics from RGB to accomplish low-latency detection.
2 RELATED WORK
2.1 Motion Segmentation on DVS
Event-based vision, particularly utilizing Dynamic
Vision Sensors, has emerged as a transformative ap-
proach in computer vision, offering high temporal
resolution and low latency, which are crucial for
time-critical scenarios, such as autonomous driving,
robotics, and industrial automation. Research in
this domain can be broadly divided into two main
methodologies: ego-motion compensation and neu-
ral network-based approaches. Ego-motion compen-
sation focuses on deriving motion trajectories, opti-
cal flow, and other geometric properties directly from
the event streams. By estimating the ego-motion, it’s
possible to predict the expected change in brightness
at each pixel due to the vehicle’s motion and sub-
tract this from the DVS data (Stoffregen et al., 2019;
Zhou et al., 2021; Parameshwara et al., 2020; Mishra
et al., 2017). In contrast, neural network-based ap-
proaches harness the capabilities of deep learning
to process complex, sparse, and asynchronous data
from DVS. These techniques enable the extraction
and segmentation of motion directly from raw event
streams using various types of neural networks, in-
cluding spiking neural networks (Parameshwara et al.,
2021), graph neural networks (Mitrokhin et al., 2020;
Alkendi et al., 2024), and recurrent neural networks
(Zhang et al., 2023).
2.2 Fusion of RGB and DVS for Object
Detection
Fusion of RGB images with Dynamic Vision Sensor
(DVS) data has become a prominent approach for en-
hancing object detection, particularly in challenging
environments such as low-light conditions (Liu et al.,
2023) or scenes with rapid motion (Gehrig and Scara-
muzza, 2024). Since event data and RGB data are
fundamentally different in nature, most of the meth-
ods adopt deep learning models to learn features from
multi-modal data. A common approach involves us-
ing separate convolutional neural networks (CNNs)
to extract features and combine them in deeper lay-
ers (Zhou et al., 2023; Tomy et al., 2022). Some
researchers introduce novel architectures to improve
performance, such as the attention mechanism (Cao
et al., 2021; Cho and Yoon, 2022), temporal and re-
current networks (Wan et al., 2023; Hamaguchi et al.,
2023), and spiking neural networks (SNN) (Cao et al.,
2024).
3 METHODOLOGY
Our method is inspired by a geometric-based tech-
nique (Allebosch et al., 2023) designed to distinguish
static backgrounds from true motion, using a linear
array of RGB cameras on a moving vehicle. We gen-
eralize the technique to asynchronous streams from
DVS, enabling lower-latency detection. In the follow-
ing sections, we outline the fundamental principles of
DVS, discuss the motion segmentation challenges we
aim to address and introduce our proposed solution.
3.1 Background Removal on DVS
3.1.1 Working Principle of DVS
DVS operates by asynchronously detecting changes
in brightness at individual pixels. Let I(t, x) denote
the intensity of light at time t, where x = (x, y) is the
pixel location. DVS responds to changes in the loga-
rithm of the intensity,
L(t, x) = log I(t, x). (1)
Each pixel sensor continuously monitors the
change in L(t, x) over time. An event is triggered
when the change in logarithmic intensity exceeds a
Figure 1: Camera configuration of our method. A pair of cameras observes a static object located at $(X, Y, Z)$. The reference camera at the current timestamp and the side camera at the past timestamp are at positions $(0, 0, 0)$ and $(X_S, Y_S, Z_S)$, respectively. $f$ is the focal length.
predefined threshold $C$. Specifically, an event is generated at pixel $x$ and time $t_k$ if
$$|L(t_k, x) - L(t_k - \Delta t_k, x)| \geq C. \tag{2}$$
When an event is triggered at pixel location $x_k = (x_k, y_k)$, it is represented as
$$e_k = (x_k, t_k, p_k), \tag{3}$$
where $t_k$ is the timestamp of the event, and $p_k$ is the polarity of the event, indicating whether the brightness increased or decreased,
$$p_k = \begin{cases} +1 & \text{if } L(t_k, x_k) - L(t_k - \Delta t_k, x_k) \geq C \\ -1 & \text{if } L(t_k, x_k) - L(t_k - \Delta t_k, x_k) \leq -C. \end{cases} \tag{4}$$
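To make the event generation rule concrete, the following Python sketch applies Eqs. (2)-(4) to a sequence of log-intensity frames. The function name, the frame-based input, and the default threshold are illustrative assumptions of ours; a real DVS evaluates this rule asynchronously and independently per pixel rather than frame by frame.

```python
import numpy as np

def generate_events(log_frames, timestamps, C=0.2):
    """Illustrative per-pixel event generation following Eqs. (2)-(4).

    log_frames: array of shape (T, H, W) holding log intensities L(t, x).
    timestamps: array of shape (T,) with the corresponding frame times.
    C: contrast threshold.
    Returns a list of events (x, y, t, polarity).
    """
    events = []
    ref = log_frames[0].copy()            # log intensity at the last emitted event (or start), per pixel
    for L, t in zip(log_frames[1:], timestamps[1:]):
        diff = L - ref
        pos = diff >= C                   # brightness increase crossed the threshold -> polarity +1
        neg = diff <= -C                  # brightness decrease crossed the threshold -> polarity -1
        for mask, polarity in ((pos, +1), (neg, -1)):
            ys, xs = np.nonzero(mask)
            events.extend((int(x), int(y), float(t), polarity) for x, y in zip(xs, ys))
        ref[pos | neg] = L[pos | neg]     # reset the reference where an event was emitted
    return events
```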
Hence a static DVS is inherently sensitive to mo-
tion in the scene. It ignores static objects or back-
grounds because they do not cause any change in
brightness. This property is particularly useful in ap-
plications where detecting motion is the primary goal,
such as surveillance (Bolten et al., 2019), object track-
ing (Borer et al., 2017), or motion analysis (Xu et al.,
2020), as it reduces the amount of data that needs to
be processed by focusing only on areas with motion.
When a DVS is mounted on a moving vehicle, the
situation becomes more complex. The ego-motion of
the vehicle will cause the entire field of view to shift,
and this will generate a large number of events across
the sensor. A key challenge in this scenario is to dif-
ferentiate between events caused by the motion of the
vehicle (global motion) and those caused by indepen-
dent motion within the scene (e.g., a pedestrian cross-
ing the street).
3.1.2 Epipolar Geometry Principle
We will now show how DVS events that are triggered by the static background can be identified through epipolar geometry principles. Under the assumption that the vehicle is driving in a straight line at a constant speed during the considered period, with no vertical displacement of the vehicle, an overview of the geometric calculation model is shown in Figure 1.
The projections $(x, y)$ and $(x', y')$ on the image planes of the reference camera and the side camera are denoted as follows,
$$(x, y) = \frac{f}{Z}(X, Y), \tag{5}$$
$$(x', y') = \frac{f}{Z - Z_S}(X - X_S, Y - Y_S). \tag{6}$$
For a static object located at $(X, Y, Z)$, the disparity between the projection $x$ on the reference camera and the projection $x'$ on the side camera is
$$(\Delta_{\mathrm{static},x}, \Delta_{\mathrm{static},y}) = \frac{f}{Z - Z_S}\left(X_S - \frac{X Z_S}{Z},\; Y_S - \frac{Y Z_S}{Z}\right). \tag{7}$$
Except for the depth $Z$, the variables in the disparity vector $(\Delta_{\mathrm{static},x}, \Delta_{\mathrm{static},y})$ can be obtained through the camera configuration. $X_S$ and $Y_S$ are the horizontal and vertical setup distances between the two cameras. Note that $Y_S = 0$ because the cameras are set on the same level. $Z_S$ is the driving distance. We assume the camera pair is mounted on a moving vehicle driving at speed $v$, and the driving distance is
$$Z_S = v \Delta t, \tag{8}$$
where $\Delta t$ is the time interval between the reference camera at the current timestamp and the side camera at the past timestamp. We note that the driving speed doesn't have to be constant; only the vehicle displacement needs to be known.
Object depth can be obtained by several potential approaches, such as inferring it from DVS data with a neural network model (Hidalgo-Carrió et al., 2020), or estimating it from stereo vision (Ghosh and Gallego, 2022). In this paper, we use separate depth sensors, which are fast and suitable for real-time processing. It is shown in (Allebosch et al., 2023) that even sparse depth information can already provide tight uncertainty bounds for disparity analysis. We also refer to this work for an in-depth description of the relation between depth accuracy and disparity.
When we revisit expression (7) and substitute $v \Delta t$ for $Z_S$ (Eq. 8), the only value that is still unknown on the right-hand side of the equation is $\Delta t$. Therefore, we can determine the specific time interval for which the disparity along the x-direction is 0. Setting $\Delta_{\mathrm{static},x} = 0$, we obtain the desired time interval,
$$\Delta t = \frac{f X_S}{x v}. \tag{9}$$
The projections of the two cameras that satisfy the above condition are located at the intersection of the common line of sight and the image plane. We denote these special locations as epipoles $x_e$ in Figure 1.
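As a worked illustration of Eqs. (7)-(9), the sketch below predicts the static disparity for a known depth and computes the time interval $\Delta t$ at which the x-disparity of a static point vanishes. The function names, the pixel convention (x measured from the principal point, in the same units as $f$), and the numeric values in the example are our own assumptions, not the configuration used in the experiments.

```python
def static_disparity(x, y, f, Z, X_S, Y_S, Z_S):
    """Predicted disparity of a static point at depth Z between the two views, Eq. (7)."""
    X, Y = x * Z / f, y * Z / f                     # back-project the reference pixel with Eq. (5)
    scale = f / (Z - Z_S)
    return scale * (X_S - X * Z_S / Z), scale * (Y_S - Y * Z_S / Z)

def zero_disparity_interval(x, f, X_S, v):
    """Time interval after which the x-disparity of a static point vanishes, Eq. (9).
    Pixel columns near x = 0 (towards the epipole) map to arbitrarily long intervals."""
    return f * X_S / (x * v)

# Example with made-up values: focal length 1400 px, baseline X_S = 0.5 m,
# driving speed v = 2.8 m/s, pixel column x = 200 px.
dt = zero_disparity_interval(200.0, 1400.0, 0.5, 2.8)   # -> 1.25 s
```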
The epipolar geometry principle reveals that a static object observed by the current reference camera was also observed by the side camera $\Delta t$ earlier. If this is not the case, the visual information is triggered by moving objects. We can therefore distinguish moving pedestrians from the static background in the event stream.
3.1.3 Event Frame Representation
In this paper, we employ the event frame (Perot et al., 2020; Gehrig et al., 2019) as the event representation instead of single events. In theory, when we receive an event from the reference camera, we can search for a matching point in the side camera using the epipolar geometry principle and judge whether it is triggered by a moving object. The challenge is that with single events, slight deviations in time and location can cause significant errors in the computation of matching epipoles, leading to mismatches in feature correspondence, since the DVS is highly sensitive to small changes in lighting, noise, and other environmental factors. Another disturbance is that even if there is no change in light intensity, events are still produced due to thermal noise and junction leakage current (Feng et al., 2020).
Another consideration is real-time performance.
In real-time applications like autonomous driving,
drones, or augmented reality, processing individual
events can lead to significant computational overhead
and delay. By using event frames, we can balance
the need for temporal resolution and real-time perfor-
mance because event frames can be processed more
efficiently than streams of single events.
As shown in Figure 2a, we visually compare dif-
ferent temporal widths that can be used to create the
event frame representation. We zoom in on the left
bottom area (left leg of the pedestrian) and compare
the distribution of the events in each temporal win-
dow. When the window size is 1 ms, the event distri-
butions on the image plane are not consistent in two
consecutive temporal windows. Since epipolar geom-
etry relies on matching points across views, incon-
sistency of triggered events and transient noise could
mislead the estimation of corresponding points be-
tween cameras. When we expand the window width to 2 ms, the event distribution on the image plane is more stable and better suited for epipolar geometry analysis. Figure 2b shows more visualizations of different temporal window sizes. As the window width increases, the contour of the pedestrian becomes clearer. Smaller windows provide faster response times but may capture fewer details, while larger windows enhance feature clarity by accumulating more events, though at the cost of slower responses. On the other hand, overly wide windows may cause overcrowding of events and overlap between edges. In the 12 ms temporal window visualization, the pedestrian's ankle begins to blur due to its faster motion compared to other body areas, causing a higher event density in that region. The temporal width is a hyperparameter that can be tuned according to the driving speed. In this paper, we set the temporal window size to 8 ms.
To build an event frame, we aggregate the events at each pixel in the time interval centered around $t$, with a window width $t_w$. The set of events on the image plane in the time window is $E = \{e_k\}_{k=1}^{N_e}$, where $N_e$ is the number of events in the time window. We define the event frame as a 2D grid that stores the accumulation of the events at each pixel $x$ (with coordinates $(x, y)$) in the time window. Formally, the event accumulation at each pixel in the event frame can be expressed as
$$F(x, t) = \sum_{\{k \,\mid\, x_k = x,\; t_k \in [t - \frac{t_w}{2},\, t + \frac{t_w}{2}]\}} p_k. \tag{10}$$
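A minimal sketch of the accumulation in Eq. (10), assuming the events of one camera are available as NumPy arrays of pixel coordinates, timestamps, and polarities; the array layout and the function name are illustrative choices of ours.

```python
import numpy as np

def build_event_frame(xs, ys, ts, ps, t_center, t_w, height, width):
    """Accumulate signed event polarities per pixel in a window of width t_w around t_center (Eq. 10)."""
    frame = np.zeros((height, width), dtype=np.int32)
    in_window = np.abs(ts - t_center) <= t_w / 2.0                    # events inside [t - t_w/2, t + t_w/2]
    np.add.at(frame, (ys[in_window], xs[in_window]), ps[in_window])   # signed accumulation per pixel
    return frame
```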
3.1.4 Background Removal
Once the event frame is constructed, we apply epipolar geometry to match corresponding points between the different camera views. For each pixel location $x = (x, y)$ where there is event accumulation $F_{\mathrm{ref}}(t, x)$ in the current event frame of the reference camera, we retrieve the corresponding matching point $F_{\mathrm{side}}(t + \Delta t, x)$ from the side camera at time $t + \Delta t$. The difference $D(t, x)$ is
$$D(t, x) = F_{\mathrm{ref}}(t, x) - F_{\mathrm{side}}(t + \Delta t, x). \tag{11}$$
To classify the event as either being caused by the background or by a moving object, we compare this difference against a threshold $\tau$. In this paper, we define pixels where preserved events accumulate as active pixels; the indicator function is
$$A(t, x) = \begin{cases} 1, & \text{if } |D(t, x)| \geq \tau \\ 0, & \text{if } |D(t, x)| < \tau. \end{cases} \tag{12}$$
The difference threshold $\tau$ can be tuned according to the event density. In this paper, we set $\tau = 1$.
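The per-pixel comparison in Eqs. (11)-(12) then reduces to a difference of two aligned event frames followed by thresholding, as in the sketch below. It assumes the two event frames have already been built (e.g., with the hypothetical build_event_frame above) and that the side-camera frame has been resampled to the reference view according to the epipolar matching; variable names are illustrative.

```python
import numpy as np

def active_pixel_map(frame_ref, frame_side_matched, tau=1):
    """Active pixel indicator A(t, x) of Eqs. (11)-(12).

    frame_ref: event frame of the reference camera at time t.
    frame_side_matched: side-camera event frame, already warped to the reference view.
    """
    D = frame_ref - frame_side_matched                  # per-pixel difference, Eq. (11)
    return (np.abs(D) >= tau).astype(np.uint8)          # 1 = likely moving object, 0 = static background
```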
Figure 3 shows the background removal results for example event frames. The parts with limited disparity are removed, and the moving pedestrians are
preserved. Due to camera jitter and measurement er-
rors, there is still a small amount of residual back-
ground. In the next section, we will introduce an intu-
itive and effective fusion method to further make use
of the amount of filtered versus non-filtered events.
Figure 2: Visualization of event data in different temporal widths. (a) Visualization of events in the temporal window. (b) Comparison of different temporal window sizes (2 ms, 4 ms, 8 ms, 12 ms). The green and red pixels represent positive and negative polarity, respectively.
Figure 3: The first row is the image frame from the RGB
camera; the second row is the corresponding event frame.
The bottom row is the result of background removal by our
proposed method.
3.2 Detection Fusion of RGB Frames
and DVS Data
In the previous section, we explained how we remove the static background from event frames and keep the dynamic information, which is complementary to the information extracted from RGB cameras. Detectors based on RGB image frames rely on features like texture and visible contrast structure, which are inherently static. Hence, an object looks roughly the same whether it is moving or not. Common object detection methods are divided into two major categories: traditional methods that combine feature extraction with classifiers (Viola and Jones, 2001; Dalal and Triggs, 2005; Felzenszwalb et al., 2009), and deep learning methods that use neural network models (Girshick et al., 2014; Redmon, 2016; Lin, 2017; Zhao et al., 2024). In this paper, we consider an example of a neural network model as the RGB detector, which generates bounding boxes with confidence scores as detection output. We define a bounding box $B$ as a rectangular area bounded by a set of pixel coordinates, with an associated scalar confidence score $C_b$.
A common approach for selecting target bounding boxes is to set a confidence threshold and keep the boxes that meet or exceed this threshold. As shown in Figure 4, if a pedestrian appears suddenly from behind cars, a decision based on a single confidence threshold would miss early detections due to the low confidence scores in these scenarios. In this case, the background removal result from the DVS can provide complementary information for the bounding boxes, distinguishing early-appearing pedestrians from misdetections among bounding boxes with low confidence scores $C_b$.
Specifically, we define the active pixel percentage for each bounding box $B$ as
$$P_a(B) = \frac{\sum_{x \in B} A(t, x)}{\sum_{x \in B} \mathcal{F}(t, x)}, \tag{13}$$
where $\mathcal{F}(t, x)$ is an indicator function that judges if events accumulate at the pixel $x$,
$$\mathcal{F}(t, x) = \begin{cases} 1, & \text{if } |F(t, x)| > 0 \\ 0, & \text{if } |F(t, x)| = 0. \end{cases} \tag{14}$$
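Eq. (13) can then be evaluated per candidate box as in the following sketch, which reuses the hypothetical active_pixel_map output together with the reference event frame; the bounding-box representation (pixel corners x0, y0, x1, y1) is an assumption for illustration.

```python
import numpy as np

def active_pixel_percentage(active_map, event_frame, box):
    """Active pixel percentage P_a(B) of a bounding box, Eq. (13)."""
    x0, y0, x1, y1 = box                                  # assumed pixel-corner box representation
    active = active_map[y0:y1, x0:x1]                     # A(t, x) restricted to the box
    occupied = np.abs(event_frame[y0:y1, x0:x1]) > 0      # indicator of Eq. (14): pixels carrying events
    n_occupied = int(occupied.sum())
    return float(active.sum()) / n_occupied if n_occupied else 0.0
```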
By calculating the active pixel percentage of a
bounding box, we can further determine whether the
Figure 4: Illustration of the proposed fusion method. Our algorithm for removing the static background from the event stream is introduced in Section 3.1.
bounding box comes from a static or a dynamic object. We propose a straightforward and efficient fusion strategy. We set a low confidence threshold $T_{c,l}$ and a high confidence threshold $T_{c,h}$ for the confidence score $C_b$. There are three possibilities for the RGB detection results:
1) $C_b \geq T_{c,h}$: This signifies that the RGB detector assigns a conclusive confidence score, suggesting a high probability of a person being present. We retain this candidate in the final set of detections, with minimal risk of false positives.
2) $C_b < T_{c,l}$: This indicates a very low detection score. We assume a negligible likelihood of a person being present and exclude the candidate from the final detections.
3) $T_{c,l} \leq C_b < T_{c,h}$: These bounding boxes have a medium detection score, signaling no clear preference to either keep or discard them as detections. Therefore, this range benefits the most from our epipolar geometry based detection on the event frame. If the active pixel percentage $P_a(B)$ in the bounding box is higher than a threshold $T_a$, it is more likely that there is a crossing person and we keep the candidate.
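The three cases can be written compactly as a single decision function, sketched below; the default threshold values are placeholders within the ranges explored in Section 3.3, not the settings used for the reported results.

```python
def keep_detection(conf, active_ratio, T_c_low=0.2, T_c_high=0.8, T_a=0.4):
    """Fusion decision for one candidate bounding box.

    conf: confidence score C_b from the RGB detector.
    active_ratio: active pixel percentage P_a(B) from the filtered event frame.
    """
    if conf >= T_c_high:           # case 1: conclusive RGB confidence, keep the detection
        return True
    if conf < T_c_low:             # case 2: very low confidence, discard the candidate
        return False
    return active_ratio > T_a      # case 3: medium confidence, rely on the event-based motion cue
```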
By making full use of the candidate bounding boxes, our proposed fusion method is able to boost early detection of moving pedestrians. In the next section, we evaluate both detection
performance and latency on a custom dataset.
3.3 Experimental Evaluation
When a pedestrian first appears from behind an oc-
clusion, only a small portion of their body is visible,
making it difficult for detectors to recognize them.
RGB detectors based on neural networks often assign
a low confidence score in such situations. We set up
an experiment to demonstrate that our fusion method
can differentiate between bounding boxes coming
from early-emerging pedestrians and those caused by
background misdetections. Hence, our method can
detect pedestrians earlier and provide more reaction
time for the drivers to avoid collisions.
3.3.1 Custom Dataset
We designed a platform with three GoPro Hero 7 cameras mounted on an electric cargo bicycle. The side cameras are set up beside the center camera, perpendicular to the driving direction and at equal distances. In addition, a radar is mounted to record the driving speed and to obtain object depth. The vehicle was driven on a straight road in the city center. Two experimenters crossed the road in front of the vehicle, individually or together, and the obtained videos were divided into 21 sequences. The recorded RGB sequences are converted into event data using the DVS simulator ESIM (Rebecq et al., 2018) with default simulation settings.
To measure the reaction time (latency) between the first visible instance and the first detection, we manually initialize the first visible annotation of each pedestrian, followed by semi-automatic interval tracking using SSD (Liu et al., 2016) and DSST (Danelljan et al., 2016), with manual corrections where necessary. Our evaluation comprises 18,982 annotation boxes and 33,123 frames. We choose Yolo v4 (Wang et al., 2021) as the detector for the RGB cameras and use its original structure and pre-trained weights. In this paper, we focus only on detecting objects of the 'person' class.
In Section 3.2, we introduced the fusion method, which uses three parameters: $T_{c,h}$, $T_{c,l}$ and $T_a$. We define a range of values for each parameter and evaluate all possible combinations: $T_{c,h} \in \{0.5, 0.7, 0.8, 0.9\}$, $T_{c,l} \in \{0.1, 0.2, 0.3, 0.5\}$, $T_a \in \{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9\}$. For each combination, we evaluate the performance by three metrics: precision, recall, and latency. We select the best points from the Pareto front (Ngatchou et al.,
2005), optimized using the F1 score. The Precision-Recall curve and the Precision-Latency curve used to compare the detection performance and latency are shown in Figure 5.
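The parameter sweep can be sketched as follows, where evaluate_configuration is a hypothetical callback that runs the fusion pipeline on the dataset for one threshold triple and returns precision, recall, and mean latency; for brevity the sketch simply ranks all combinations by F1 score instead of extracting the full Pareto front.

```python
from itertools import product

T_CH = [0.5, 0.7, 0.8, 0.9]                           # high confidence thresholds
T_CL = [0.1, 0.2, 0.3, 0.5]                           # low confidence thresholds
T_A = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]   # active pixel percentage thresholds

def sweep(evaluate_configuration):
    """Evaluate every threshold combination and rank the results by F1 score."""
    results = []
    for t_ch, t_cl, t_a in product(T_CH, T_CL, T_A):
        precision, recall, latency = evaluate_configuration(t_ch, t_cl, t_a)
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results.append((f1, precision, recall, latency, (t_ch, t_cl, t_a)))
    return sorted(results, reverse=True)              # best F1 first
```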
3.3.2 Result
Table 1: Performance of the Yolo v4 detector and the fusion method.

Method     F1 (max)   mAP     Average latency (ms)
Yolo v4    88.35      90.49   375.65
Fusion     89.28      91.59   328.09
Figure 5: Comparative analysis of the Yolo v4 detector and our proposed fusion method. (a) Precision-Recall curve. (b) Precision-Latency curve.
Latency. We define the detection latency as the re-
action time between the first annotation and the first
detection. Figure 6 shows how the fusion method re-
duces latency by leveraging motion information from
event data. By analyzing the active pixel percentage
in bounding boxes with low confidence scores, we
can distinguish the early appearance of moving pedes-
trians from false detections. The Precision-Latency curve (Figure 5b) demonstrates that the fusion method maintains high precision while reducing detection latency by an average of 76.72 milliseconds. Figure 7 shows several examples of the gain in reaction time and distance achieved by our proposed fusion method.
Detection Performance. The Precision-Recall curve (Figure 5a) illustrates that the fusion method achieves an average increase of 2.05% in recall at equivalent
precision levels. It means that, for the same number of
false positive detections, our fusion method can detect
actual pedestrians more often. Setting the precision
threshold at 95%, the fusion method shows a 1.05%
higher recall and an average latency gain of 58.23 ms.
These results suggest that the fusion method is more
efficient, providing accurate detections with reduced
response times. Table 1 shows that our proposed fusion method improves mAP by 1.10% while the response is, on average, 47.56 ms faster than the Yolo v4 detector.
3.4 Discussion
The results demonstrate that our proposed fusion ap-
proach of using DVS data, filtered by epipolar ge-
ometry, combined with RGB-based detections signif-
icantly improves pedestrian detection in autonomous
driving scenarios. By eliminating static background
events, our method enables more accurate and fo-
cused detection of dynamic objects, such as pedestri-
ans, even in challenging environments with complex
backgrounds.
One key advantage of our approach lies in the
use of epipolar geometry to filter out irrelevant events
generated by static objects, allowing the system to fo-
cus on truly dynamic elements in the scene. The fu-
sion with RGB data further complements the system
by leveraging the rich spatial and appearance informa-
tion from the RGB frames, ensuring that both static
and dynamic visual cues are effectively utilized.
Compared to previous methods that rely solely on
RGB or DVS data, our approach offers significant im-
provements in terms of detection latency and accu-
racy. The ability to detect pedestrians earlier, even
when partially occluded or moving quickly, makes
our method particularly suitable for real-time applica-
tions in autonomous driving. Nevertheless, real-world
testing and further refinement are needed to fully val-
idate the system’s robustness across different driving
conditions.
4 CONCLUSION AND FUTURE
WORK
In this paper, we presented a novel fusion algo-
rithm that integrates DVS data with RGB camera
detections for improved pedestrian detection in au-
tonomous driving scenarios. By applying epipolar ge-
ometry to remove static background events from the
DVS data, we significantly enabled the system to fo-
cus on dynamic objects. The proposed fusion of the
high temporal resolution of DVS with the rich spa-
tial detail of the RGB cameras brings faster and more
accurate detection, particularly in low-light and fast-
moving environments.
Our method demonstrates strong potential for re-
ducing detection latency and improving performance
in real-time perception systems, especially in safety-
critical applications such as autonomous vehicles. By
combining the complementary strengths of both sen-
sor modalities, our system addresses the limitations
faced by traditional camera-based detection methods.
Looking forward, several areas require further ex-
ploration and refinement. First, we plan to extend our
Figure 6: Detection latency of the proposed fusion method and Yolo v4. Traditional RGB-based detectors tend to give a low confidence score to a suddenly appearing pedestrian in the early moments, until the pedestrian is sufficiently visible. The fusion method is able to detect a pedestrian earlier, while Yolo takes much more time to recognize an occluded pedestrian. In the shown scenario, the vehicle is driving at a speed of 2.80 m/s and our proposed fusion method detects the pedestrian 258.59 ms earlier than Yolo v4, corresponding to a gain in distance of 0.72 m for the driver to avoid the collision.
Figure 7: Detection results of the low-latency fusion. The first column is the first detection of the fusion method and the last column is the first detection of the YOLO v4 detector. The second and third columns are the raw event frame and the background removal result of our proposed method. The confidence score is shown above the bounding boxes. The confidence score threshold for Yolo v4 to accept a detection is 0.8.
research by testing the system on real-world event
data, addressing potential challenges such as sen-
sor noise and environmental variability. Our future
work will also focus on optimizing the algorithm for
edge processing devices, enabling it to be deployed in
resource-constrained environments such as embedded
systems in autonomous vehicles.
By continuing to refine and validate the proposed
method in real-world conditions, we aim to develop
a robust, efficient, and reliable solution for pedestrian
detection and other object detection tasks in dynamic
environments. In future work, we will study more di-
verse and complex traffic scenarios.
REFERENCES
Alkendi, Y., Azzam, R., Javed, S., Seneviratne, L., and
Zweiri, Y. (2024). Neuromorphic vision-based motion
segmentation with graph transformer neural network.
arXiv preprint arXiv:2404.10940.
Allebosch, G., Van Hamme, D., Veelaert, P., and Philips,
W. (2023). Efficient detection of crossing pedestrians
from a moving vehicle with an array of cameras. Op-
tical Engineering, 62(3):031210–031210.
Bolten, T., Pohle-Fröhlich, R., and Tönnies, K. D. (2019).
Application of hierarchical clustering for object track-
ing with a dynamic vision sensor. In Computational
Science–ICCS 2019: 19th International Conference,
Faro, Portugal, June 12–14, 2019, Proceedings, Part
V 19, pages 164–176. Springer.
Borer, D., Delbruck, T., and Rösgen, T. (2017). Three-
dimensional particle tracking velocimetry using dy-
namic vision sensors. Experiments in Fluids, 58:1–7.
Brandli, C., Berner, R., Yang, M., Liu, S.-C., and Delbruck,
T. (2014). A 240 × 180 130 dB 3 µs latency global
shutter spatiotemporal vision sensor. IEEE Journal of
Solid-State Circuits, 49(10):2333–2341.
Cao, H., Chen, G., Xia, J., Zhuang, G., and Knoll, A.
(2021). Fusion-based feature attention gate compo-
nent for vehicle detection based on event camera.
IEEE Sensors Journal, 21(21):24540–24548.
Cao, J., Zheng, X., Lyu, Y., Wang, J., Xu, R., and Wang, L.
(2024). Chasing day and night: Towards robust and
efficient all-day object detection guided by an event
camera. In 2024 IEEE International Conference on
Robotics and Automation (ICRA), pages 9026–9032.
IEEE.
Chang, M., Feng, H., Xu, Z., and Li, Q. (2021). Low-light
image restoration with short-and long-exposure raw
pairs. IEEE Transactions on Multimedia, 24:702–714.
Cho, H. and Yoon, K.-J. (2022). Event-image fusion stereo
using cross-modality feature propagation. In Proceed-
ings of the AAAI Conference on Artificial Intelligence,
volume 36, pages 454–462.
Dalal, N. and Triggs, B. (2005). Histograms of oriented
gradients for human detection. In 2005 IEEE com-
puter society conference on computer vision and pat-
tern recognition (CVPR’05), volume 1, pages 886–
893. Ieee.
Danelljan, M., Häger, G., Khan, F. S., and Felsberg, M.
(2016). Discriminative scale space tracking. IEEE
transactions on pattern analysis and machine intelli-
gence, 39(8):1561–1575.
Felzenszwalb, P. F., Girshick, R. B., McAllester, D., and
Ramanan, D. (2009). Object detection with discrim-
inatively trained part-based models. IEEE transac-
tions on pattern analysis and machine intelligence,
32(9):1627–1645.
Feng, Y., Lv, H., Liu, H., Zhang, Y., Xiao, Y., and Han,
C. (2020). Event density based denoising method for
dynamic vision sensor. Applied Sciences, 10(6):2024.
Gallego, G., Delbrück, T., Orchard, G., Bartolozzi, C.,
Taba, B., Censi, A., Leutenegger, S., Davison, A. J.,
Conradt, J., Daniilidis, K., et al. (2020). Event-based
vision: A survey. IEEE transactions on pattern anal-
ysis and machine intelligence, 44(1):154–180.
Gehrig, D., Loquercio, A., Derpanis, K. G., and Scara-
muzza, D. (2019). End-to-end learning of represen-
tations for asynchronous event-based data. In Pro-
ceedings of the IEEE/CVF International Conference
on Computer Vision, pages 5633–5643.
Gehrig, D. and Scaramuzza, D. (2024). Low-latency
automotive vision with event cameras. Nature,
629(8014):1034–1040.
Ghosh, S. and Gallego, G. (2022). Multi-event-camera
depth estimation and outlier rejection by refo-
cused events fusion. Advanced Intelligent Systems,
4(12):2200221.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014).
Rich feature hierarchies for accurate object detec-
tion and semantic segmentation. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 580–587.
Hamaguchi, R., Furukawa, Y., Onishi, M., and Sakurada,
K. (2023). Hierarchical neural memory network for
low latency event processing. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 22867–22876.
Hidalgo-Carrió, J., Gehrig, D., and Scaramuzza, D. (2020).
Learning monocular dense depth from events. In 2020
International Conference on 3D Vision (3DV), pages
534–542. IEEE.
Kim, D.-S. and Kwon, J. (2015). Moving object detec-
tion on a vehicle mounted back-up camera. Sensors,
16(1):23.
Kim, J., Ye, G., and Kim, D. (2010). Moving object de-
tection under free-moving camera. In 2010 IEEE In-
ternational Conference on Image Processing, pages
4669–4672. IEEE.
Kulchandani, J. S. and Dangarwala, K. J. (2015). Moving
object detection: Review of recent research trends. In
2015 International conference on pervasive comput-
ing (ICPC), pages 1–5. IEEE.
Lin, T. (2017). Focal loss for dense object detection. arXiv
preprint arXiv:1708.02002.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.,
Fu, C.-Y., and Berg, A. C. (2016). Ssd: Single shot
multibox detector. In Computer Vision–ECCV 2016:
14th European Conference, Amsterdam, The Nether-
lands, October 11–14, 2016, Proceedings, Part I 14,
pages 21–37. Springer.
Liu, Z., Yang, N., Wang, Y., Li, Y., Zhao, X., and Wang,
F.-Y. (2023). Enhancing traffic object detection in
variable illumination with rgb-event fusion. arXiv
preprint arXiv:2311.00436.
Mane, S. and Mangale, S. (2018). Moving object detec-
tion and tracking using convolutional neural networks.
In 2018 second international conference on intelli-
gent computing and control systems (ICICCS), pages
1809–1813. IEEE.
Mishra, A., Ghosh, R., Principe, J. C., Thakor, N. V., and
Kukreja, S. L. (2017). A saccade based framework
for real-time motion segmentation using event based
vision sensors. Frontiers in neuroscience, 11:83.
Mitrokhin, A., Hua, Z., Fermuller, C., and Aloimonos, Y.
(2020). Learning visual motion segmentation using
event surfaces. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition,
pages 14414–14423.
Narasimhan, S. G. and Nayar, S. K. (2003). Contrast
restoration of weather degraded images. IEEE trans-
actions on pattern analysis and machine intelligence,
25(6):713–724.
Ngatchou, P., Zarei, A., and El-Sharkawi, A. (2005). Pareto
multi objective optimization. In Proceedings of the
13th international conference on, intelligent systems
application to power systems, pages 84–91. IEEE.
Parameshwara, C. M., Li, S., Fermüller, C., Sanket, N. J.,
Evanusa, M. S., and Aloimonos, Y. (2021). Spikems:
Deep spiking neural network for motion segmenta-
tion. In 2021 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), pages 3414–
3420. IEEE.
Parameshwara, C. M., Sanket, N. J., Gupta, A., Fermuller,
C., and Aloimonos, Y. (2020). Moms with events:
Multi-object motion segmentation with monocular
event cameras. arXiv preprint arXiv:2006.06158,
2(3):5.
Perot, E., De Tournemire, P., Nitti, D., Masci, J., and
Sironi, A. (2020). Learning to detect objects with a
1 megapixel event camera. Advances in Neural Infor-
mation Processing Systems, 33:16639–16652.
Rebecq, H., Gehrig, D., and Scaramuzza, D. (2018). Esim:
an open event camera simulator. In Conference on
robot learning, pages 969–982. PMLR.
Redmon, J. (2016). You only look once: Unified, real-time
object detection. In Proceedings of the IEEE confer-
ence on computer vision and pattern recognition.
Schraml, S., Belbachir, A. N., Milosevic, N., and Schön,
P. (2010). Dynamic stereo vision system for real-
time tracking. In Proceedings of 2010 IEEE Inter-
national Symposium on Circuits and Systems, pages
1409–1412. IEEE.
Stoffregen, T., Gallego, G., Drummond, T., Kleeman, L.,
and Scaramuzza, D. (2019). Event-based motion seg-
mentation by motion compensation. In Proceedings of
the IEEE/CVF International Conference on Computer
Vision, pages 7244–7253.
Taverni, G., Moeys, D. P., Li, C., Cavaco, C., Motsnyi, V.,
Bello, D. S. S., and Delbruck, T. (2018). Front and
back illuminated dynamic and active pixel vision sen-
sors comparison. IEEE Transactions on Circuits and
Systems II: Express Briefs, 65(5):677–681.
Tomy, A., Paigwar, A., Mann, K. S., Renzaglia, A., and
Laugier, C. (2022). Fusing event-based and rgb cam-
era for robust object detection in adverse conditions.
In 2022 International Conference on Robotics and Au-
tomation (ICRA), pages 933–939. IEEE.
Viola, P. and Jones, M. (2001). Rapid object detection us-
ing a boosted cascade of simple features. In Proceed-
ings of the 2001 IEEE computer society conference on
computer vision and pattern recognition. CVPR 2001,
volume 1, pages I–I. Ieee.
Wan, Z., Mao, Y., Zhang, J., and Dai, Y. (2023). Rpe-
flow: Multimodal fusion of rgb-pointcloud-event for
joint optical flow and scene flow estimation. In Pro-
ceedings of the IEEE/CVF International Conference
on Computer Vision, pages 10030–10040.
Wang, C.-Y., Bochkovskiy, A., and Liao, H.-Y. M. (2021).
Scaled-YOLOv4: Scaling cross stage partial network.
In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), pages
13029–13038.
Xu, L., Xu, W., Golyanik, V., Habermann, M., Fang, L.,
and Theobalt, C. (2020). Eventcap: Monocular 3d
capture of high-speed human motions using an event
camera. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR).
Zhang, S., Sun, L., and Wang, K. (2023). A multi-scale
recurrent framework for motion segmentation with
event camera. IEEE Access.
Zhao, Y., Lv, W., Xu, S., Wei, J., Wang, G., Dang, Q., Liu,
Y., and Chen, J. (2024). Detrs beat yolos on real-time
object detection. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recog-
nition, pages 16965–16974.
Zhou, Y., Gallego, G., Lu, X., Liu, S., and Shen, S.
(2021). Event-based motion segmentation with spatio-
temporal graph cuts. IEEE transactions on neural net-
works and learning systems, 34(8):4868–4880.
Zhou, Z., Wu, Z., Boutteau, R., Yang, F., Demonceaux, C.,
and Ginhac, D. (2023). Rgb-event fusion for moving
object detection in autonomous driving. In 2023 IEEE
International Conference on Robotics and Automation
(ICRA), pages 7808–7815. IEEE.
Zhu, Z., Wei, H., Hu, G., Li, Y., Qi, G., and Mazur, N.
(2020). A novel fast single image dehazing algorithm
based on artificial multiexposure image fusion. IEEE
Transactions on Instrumentation and Measurement,
70:1–23.