Multi-view Data Capture using Edge-synchronised Mobiles

Matteo Bortolon, Paul Chippendale, Stefano Messelodi and Fabio Poiesi

Fondazione Bruno Kessler, Trento, Italy

Keywords:

Synchronisation, Free-viewpoint Video, Edge Computing, Augmented Reality, ARCloud.

Abstract:

Multi-view data capture permits free-viewpoint video (FVV) content creation. To this end, several users

must capture video streams, calibrated in both time and pose, framing the same object/scene, from different

viewpoints. New-generation network architectures (e.g. 5G) promise lower latency and larger bandwidth

connections supported by powerful edge computing, properties that seem ideal for reliable FVV capture. We

have explored this possibility, aiming to remove the need for bespoke synchronisation hardware when capturing

a scene from multiple viewpoints, making it possible through off-the-shelf mobiles. We propose a novel and

scalable data capture architecture that exploits edge resources to synchronise and harvest frame captures. We

have designed an edge computing unit that supervises the relaying of timing triggers to and from multiple

mobiles, in addition to synchronising frame harvesting. We empirically show the beneﬁts of our edge computing

unit by analysing latencies and show the quality of 3D reconstruction outputs against an alternative and popular

centralised solution based on Unity3D.

1 INTRODUCTION

Immersive computing represents the next step in hu-

man interactions, with digital content that looks and

feels as if it is physically in the same room as you.

The potential for immersive digital content impacts

upon entertainment, advertising, gaming, mobile tele-

presence and tourism (Shi et al., 2015; Jiang and Liu,

2017; Elbamby et al., 2018; Rematas et al., 2018;

Park et al., 2018; Qiao et al., 2019). But for immer-

sive computing to become mainstream, a vast amount

of easy-to-create free-viewpoint video (FVV) content

will be needed. The production of 3D digital objects

inside a real-world space is almost considered a solved

problem (Schonberger and Frahm, 2016), but doing

the same for FVV is still a challenge (Richardt et al.,

2016). This mainly because FVVs require the tempo-

ral capturing of dynamic objects from different, and

calibrated viewpoints (Guillemaut and Hilton, 2011;

Mustafa and Hilton, 2017). Synchronisation across

viewpoints dramatically reduces reconstruction arte-

facts in FVVs (Vo et al., 2016). For controlled se-

tups, frame-level synchronisation can be achieved us-

ing shutter-synchronised cameras (Mustafa and Hilton,

2017), however this is impractical in uncontrolled en-

vironments with conventional mobiles. Moreover, mo-

bile pose estimation (i.e. position and orientation) with

respect to a global coordinate system is necessary for

content integration. This can be achieved using frame-

by-frame Structure from Motion (SfM) (Schonberger

and Frahm, 2016) or Simultaneous Localisation And

Mapping (SLAM) (Mur-Artal et al., 2015). The latter

has shown to be effective for collaborative Augmented

Reality though the ARCloud (ARCore Anchors, 2019).

ARCore Anchors

are targeted at AR multiplayer gam-

ing, but we experimentally tested that they are not

yet ready for FVV production as is, i.e. relative pose

estimation is not sufﬁciently accurate for 3D recon-

struction. As per our knowledge a ﬂexible and scalable

solution for synchronised calibrated data capture de-

ployable on conventional mobiles does yet not exist.

As FVVs require huge data transmissions, throughput

is another challenge for immersive computing. Fig. 1

shows four people recording a person with their mo-

biles. The measured throughput in this setup was about

52Mbps with frames captured at 10Hz, with a resolu-

tion of 640

480 pixels and encoded in JPEG. Note

that only a portion of the recorded person was covered.

For a more complete, 360-degree coverage, several

more mobiles would be needed. With the introduc-

tion of new wireless architectures that target high-data

throughput and low latency, such as Multi-access Edge

Computing (MEC), device-to-device communications

and network slicing in 5G networks, scalable com-

munication mechanisms that are appropriate for the

The AR Anchor is a rigid transformation from local to

a global coordinate system.

730

Bortolon, M., Chippendale, P., Messelodi, S. and Poiesi, F.

Multi-view Data Capture using Edge-synchronised Mobiles.

DOI: 10.5220/0008971807300740

In Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2020) - Volume 5: VISAPP, pages

730-740

ISBN: 978-989-758-402-2; ISSN: 2184-4321

frame

stream

pose estimation

frame capture

edge processing

recording moving person

sync

data

relaying

recon

cam 3

cam 2

cam 1

cam 4

meshing

texturing

ARCloud

Figure 1: Multi-view data captures for free-viewpoint video creation generates large data throughput, and requires synchronised

and calibrated cameras. Our solution ofﬂoads computations from mobiles and the cloud to the edge, handling synchronisation

and image processing more efﬁciently. Moving processing closer to the user improves performance and fosters scalability.

deployment of immersive content on consumer de-

vices are key (Shi et al., 2016; Qiao et al., 2019). Hy-

brid cloud/fog/edge solutions will ensure that users get

low-lag feedback as well as the possibility to ofﬂoad

computationally intensive tasks.

In this paper, we present a system that harvests data

generated from multiple-handheld mobiles at the edge

instead of harvesting it in the cloud. This promotes

scalability, lower latency and facilitates synchronisa-

tion. Our implementation consists of a server, namely

Edge Relay, that handles communications across mo-

biles and a Data Manager that harvests the content

captured by the mobiles. We handle synchronisation

through a relay server because, as opposed to a peer-

to-peer one, relay servers can effectively reduce band-

width usage and improve connection quality (Hu et al.,

2016). Although relay servers may lead to increases

of network latency, peer-to-peer connection may be

ineffective because when mobiles are within differ-

ent networks (e.g. different operators in 5G networks),

they cannot retrieve their respective IP address due

to Network Address Translation (NAT). Moreover, to

mitigate the latency problem, we designed a latency

compensation strategy, that we empirically tested to

be effective when the network conditions are fairly

stable. We developed an app that each mobile uses

to capture frames and estimate pose with respect to a

global coordinate system through ARCore (ARCore,

2019). As the relative poses from ARCore are not

accurate enough for FVV, we reﬁne them using SfM

(Schonberger and Frahm, 2016). To summarise, the

key contributions of our system are (i) the protocols we

have designed to allow users under different networks

(or operators) to join the same data capture session, (ii)

the integration of a latency compensation mechanism

to mitigate the communication delay among devices,

and (iii) the integration of these modules with a SLAM

framework to estimate mobile pose in real time and to

globally localise mobiles in an environment through

the ARCloud. As per our knowledge, this is the ﬁrst

proof-of-concept, decentralised system for synchro-

nised multi-view data capture usable for FVV content

creation. We have carried out a thorough experimental

analysis by jointly assessing latency and temporal 3D

reconstruction. We have compared results against an

alternative and popular centralised solution based on

Unity3D (Unity3D Multiplayer Service, 2019).

2 RELATED WORK

Low Latency Immersive Computing.

To foster im-

mersive interactions between multiple users in aug-

mented spaces, low-latency computing and commu-

nications must be supported (Yahyavi and Kemme,

2013). Although mobiles are the ideal medium to de-

liver immersive experiences, they have ﬁnite resources

for complex visual scene understanding, reasoning

and graphical tasks, hence computational ofﬂoading

is preferred for demanding and low-latency interac-

tions. Fog and edge computing, soon to be mainstream

thanks to 5G, will be one of the key enablers. Thank-

fully, not all immersive computing tasks (e.g. scene

understanding, gesture recognition, volumetric recon-

struction, illumination estimation, occlusion reasoning,

rendering, collaborative interaction sharing) have the

same time-critical nature. (Chen et al., 2016; Zhang

et al., 2018) showed that scene understanding via ob-

ject classiﬁcation could be performed at a rate of sev-

eral times per minute by outsourcing computations

to the cloud. (Zhang et al., 2018) showed how an

optimised image retrieval pipeline for a mobile AR

application can be created by exploiting fog comput-

ing, reducing data transfer latency up to ﬁve times

compared to cloud computing. (Sukhmani et al., 2019)

analysed the concept of dynamic content caching for

mobiles, i.e. what to cache and where, and they illus-

trated that a dramatic performance increase could be

obtained by devising appropriate task ofﬂoading strate-

gies. (Bastug et al., 2017) showed how a pro-active

content request strategy could effectively be used to

predict content before it was actually requested, thus

reducing immersive experience latency, at the cost of

increased data overhead. In the cases of FVV, which is

very sensitive to synchronisation issues, communica-

Multi-view Data Capture using Edge-synchronised Mobiles

731

tions must be executed as close to the user as possible

to reduce lag. Solutions to address the problem can be

hardware or software based. Hardware-based solutions

include timecode synchronisation with or without gen-

lock (Kim et al., 2012; Wu et al., 2008), and Wireless

Precision Time Protocol (Garg et al., 2018). Hardware

based solutions is not our target as they require impor-

tant modiﬁcations to the communication infrastructure.

Software-based solutions are often based on the Net-

work Time Protocol (NTP), instructing devices in a

session to acquire frames at prearranged time intervals

and then attempt to compensate/anticipate for delays

(Latimer et al., 2015). Cameras can share timers that

are updated by a host camera (Wang et al., 2015). Al-

ternatively, errors in temporal frame alignment have

been addressed using spatio-temporal bundle adjust-

ment, in an ofﬂine post-processing phase (Vo et al.,

2016). However, this type of post-alignment also in-

curs a high computational overhead, as well as adding

more latency to the creation and consumption of FVV

reconstructions. Although NTP approaches are simple

to implement, they are unaware of situational-context.

Hence, the way in which clients are instructed to cap-

ture images in a session is totally disconnected from

scene activity, hence they are unable to optimise ac-

quisition rates either locally or globally, prohibiting

optimisation techniques such as (Poiesi et al., 2017),

that aim to save bandwidth and maximise output qual-

ity. Our solution operates online and is aimed at de-

centralising synchronisation supervision, thus is more

appropriate for resource-efﬁcient, dynamic-scene cap-

ture.

Free-viewpoint Video Production.

Free-viewpoint

(volumetric or 4D) videos can be created either through

the synchronised capturing of objects from different

viewpoints (Guillemaut and Hilton, 2011; Mustafa

and Hilton, 2017) or with Convolutional Neural Net-

works (CNN) (Rematas et al., 2018) that estimate

unseen content. The former strategy needs camera

poses to be estimated/known for each frame, using

approaches like SLAM (Zou and Tan, 2013) or by hav-

ing hardware calibrated camera networks (Mustafa and

Hilton, 2017). Typically, estimated poses lead to less-

accurate reconstructions (Richardt et al., 2016), when

compared to calibrated setups (Mustafa and Hilton,

2017). Converserly, CNN-based strategies do not

need camera poses, but instead need synthetic training

data of 3D objects extracted, for example, from video

games or Youtube videos (Rematas et al., 2018) as they

need to estimate unobserved data. Traditional FVV

(i.e. non-CNN) approaches can be based on shape-

from-silhouette (SFS) (Guillemaut and Hilton, 2011),

shape-from-photoconsistency (SFP) (Slabaugh et al.,

2001), multi-view stereo (MVS) (Richardt et al., 2016)

Cloud        

Google

Cloud Platform

Unity3D

Match Making

Router Switch

Router

Edge

Edge Relay

Server

Data

Manager

Local Network

Figure 2: Block diagram of our edge-based architecture.

Mobiles are connected to the same local network (e.g. WiFi

or 5G). When they perform a FVV capture, data relaying and

processing is performed on the edge, ensuring low-latency

and synchronised frame capture.

or deformable models (DM) (Huang et al., 2014). SFS

methods aim to create 3D volumes (or visual hulls)

from the intersections of visual cones formed by 2D

outlines (silhouettes) of objects visible from multiple

views. SFP methods create volumes by assigning in-

tensity values to voxels (or volumetric pixels) based

on pixel-colour consistencies across images. MVS

methods create dense point clouds by merging the re-

sults of multiple depth-maps computed from multiple

views. DM methods try to ﬁt known reference 3D-

models to visual observations, e.g. 2D silhouettes or

3D point clouds. All these methods need frame-level

synchronised cameras. (Vo et al., 2016) proposed a

spatio-temporal bundle adjustment algorithm to jointly

calibrate and synchronise cameras. Because it is a

computationally costly algorithm, it is desirable to ini-

tialise it with “good” initial camera poses and synchro-

nised frames. Amongst these methods, MVS produces

reconstructions that are geometrically more accurate

than the other alternatives, albeit at a higher compu-

tational cost. Approaches like SFS and SFP are more

suitable for online applications as they are fast, but

outputs have less deﬁnition.

3 DATA CAPTURING OVERVIEW

Our system carries out synchronisation and data cap-

ture at the edge, and uses cloud services for the session

initialisation. The edge hosts two applications: an

Edge Relay and a Data Manager. Services used in

the cloud are a Unity3D Match Making server and

the Google Cloud Platform. Fig. 2 shows the block

diagram of our system. Fig. 3 details the session setup.

When a group of users want to initiate a multi-view

capture, they must connect their mobiles to a wireless

network via WiFi or 5G, and then open our frame-

capture app. One user, designated the session host,

creates a new session in the app. This sends a request

VISAPP 2020 - 15th International Conference on Computer Vision Theory and Applications

732

Google Cloud

Platform

Unity3D

Match Making

Edge Relay

Server

Client

Host

Client

(a)

(b)

(c)

(d)

(e)

(f)

Figure 3: Session setup procedure. Description of the steps

can be found in text.

to the Match Making server (Unity3D MatchMaking,

2019) to create an acquisition session (a). This request

includes the IP address of the host as seen from the

Edge Relay. The Match Making server adds this ses-

sion to a list of active sessions, associating it with a

unique session ID. The Match Making server publi-

cises this session ID for others to join as clients. The

host device informs the Edge Relay that it is ready to

accept client devices (b). Client devices see the list

of active sessions from the Match Making server and

they choose one to join (c). When a client chooses

a session, they retrieve the session ID, the host’s IP

address from the Match Making server, and it uses

this information to connect to the Edge Relay (d). The

Edge Relay validates client connections by verifying

that the information provided is correct. When all

clients have joined a session, the Match Making server

is not used again.

Before starting the frame capture, (i) all users must

map their local surroundings in 3D using the app’s

built-in ARCore functionality, then (ii) the host mea-

sures the communication latency between itself and

the clients to inform the clients of the compensation

needed to handle network delays. Latency compen-

sation is explained in Sec. 4. 3D mapping involves

capturing sparse geometric features of the environment

(Cadena et al., 2016). Once mapping is complete, the

host user places an AR Anchor in the mapped scene to

deﬁne a common coordinate system for all devices (e).

This AR Anchor is uploaded by the host to the Google

Cloud Platform and then automatically downloaded

by clients through HTTPS (Belshe et al., 2015), or

the QUIC protocol by devices that support it (QUIC,

2019) (f). Finally, the capture session starts when the

host client presses ‘start’ on their app.

Snapshots, or captured frames, can be taken either

periodically (Knapitsch et al., 2017), or dynamically

based on scene content (Resch et al., 2015; Poiesi

et al., 2017). Frame captures based on scene content is

desirable because one can avoid excessive data trafﬁc

when a scene is still and then capture fast dynamics

by increasing the rate when high activity is observed.

However, the latter is more challenging than the former

as it requires mobiles to perform on-board processing

and a decentralised mechanism to reliably relay syn-

chronisation signals. We designed our system to be

suitable for dynamic frame captures, hence we have

chosen to let the host mobile drive synchronisation,

rather than ﬁxing the rate beforehand on an edge or

cloud server. To trigger the other mobiles to capture a

frame, the host sends a snapshot (or synchronisation)

trigger to the Edge Relay. Snapshot triggers instruct

mobiles in a session to capture frames. The Edge Re-

lay forwards triggers received, from the host, to all

clients, instructing them to capture frames and gen-

erate associated meta-data (e.g. mobile pose, camera

parameters). Captured frames and meta-data are mo-

mentarily buffered on the devices and then transmitted

to the Data Manager asynchronously. Without loss of

generality, the host uses an internal timer to take snap-

shots that expire every

C = 1/F

. A snapshot counter

is incremented each time the countdown expires and it

is used as unique identiﬁer for each snapshot taken.

4 EDGE RELAY

Traditional architectures for creating multi-user expe-

riences are based on authoritative servers (Yahyavi

and Kemme, 2013), typically exploiting relay servers

(Unity3D Multiplayer Service, 2019; Photon, 2019).

An authoritative-server based system allows one of the

participants to be both a client and the host at the same

time, thus having the authority to manage the session

(Unity3D HLAPI, 2019). Our Edge Relay routes ses-

sion control messages from the host to the clients via

UDP, to avoid delays caused by ﬂow-control systems

(e.g. TCP). The Edge Relay handles four different

types of messages: Start Relaying, Connect-to-Server,

Data and Disconnect Client.

The host makes a Start Relaying request to the

Edge Relay to begin a session. This request carries

the connection conﬁguration (e.g. client disconnection

deadline, maximum packet size, host’s IP address and

port number), which is used as a veriﬁcation mecha-

nism for all the clients to connect to the Edge Relay

(MLAPI Conﬁguration, 2019). The veriﬁcation is per-

formed through a cyclic redundancy check (CRC). If

the host is behind NAT, clients will not be able to re-

trieve its IP address nor the port number, hence they

will not be able to include them in the connection con-

ﬁguration, and hence they will not pass the veriﬁcation

stage. In order to mitigate this NAT-related issue, we

required the Edge Relay to communicate to the host

the IP address and port number with which the Edge

Multi-view Data Capture using Edge-synchronised Mobiles

733

Relay sees the host. This IP address can be the actual

IP of the host if the Edge Relay and host are within

the same local network, or the IP address of the router

(NAT) if host and Edge Relay are on different net-

works. After the host receives this information from

the Edge Relay, the host communicates host’s IP ad-

dress and port number to the Match Making server.

In this way, when the clients discover the session ID

from the Match Making server, they can retrieve the

host’s IP address and port number, and use them for

veriﬁcation to connect to the Edge Relay.

When a client decides to join a session, the client

sends a Connect-to-Server request message to the Edge

Relay. This message contains the IP and port address

of the host, which the client retrieved from the Match

Making server. The Edge Relay checks to see if the

requested session associated to this IP address and port

is already hosted by a mobile. If it is, then the Edge

Relay adds this client to the list of session participants.

Data messages carry information from one device

to another in a session. When a data packet is re-

ceived by the Edge Relay it explores the header to

understand where the packet must be forwarded to:

either to speciﬁc devices or broadcast to all. We use

a data messaging system that involves two types of

messages: State Update packages or Remote Proce-

dure Calls (RPCs) (MLAPI Messaging System, 2019).

State Update packages are used to update elements

in the session and to propagate the information to all

participants, i.e. the AR Anchor, while RPCs are used

for control commands, i.e. synchronisation triggers

and the latency estimation mechanism. The RPC of

the synchronisation trigger carries the information of

the snapshot counter (Sec. 3).

A Disconnect Client message is exchanged when

a user exits the session. This message can be sent

by the client or by the Edge Relay. The Edge Relay

detects the exit of a client if a Keep-alive packet is

not received within a timeout. We set the Keep-alive

time at 100ms and disconnect timeout as 1.6s. Upon

disconnection of a client, the Edge Relay informs all

the other participants and removes this device from the

list of participants of the session.

5 FRAME CAPTURE APP

OPTIMISATIONS

To deal with latency variation and large throughput,

we have implemented two optimisation strategies.

The latency between client/host and the Edge Re-

lay can vary due to the distance between devices and

the antenna, network trafﬁc, or interference with other

networks (Soret et al., 2014). A high latency can

negatively affect the geometric quality of the recon-

structed object, so it must be understood and com-

pensated for (Vo et al., 2016). To cope with network

latency issues, we have implemented a latency com-

pensation mechanism that uses Round Trip Time mea-

surements on the communication link between host

and clients. We model the latency measured between

devices to delay the capture of a frame upon the re-

ception of a synchronisation trigger for each device

independently. This enables the devices to capture

frames (nearly) synchronously. Speciﬁcally, during

the initialisation phase, the host builds a

N × M

ma-

trix

, where the element

p(i, j)

is the

-th measured

Round Trip Time (i.e. ping) between the host and

the

-th client.

is the number of clients and

the total number of ping measurements. The host

computes the average ping for each client, such as

¯p(i) =

∑

j=1

p(i, j)

, and extracts the maximum ping

ˆp = max({ ¯p(1),..., ¯p(N)})

. Then the

-th client cap-

tures the snapshot with a delay of

∆t

· ( ˆp − ¯p(i))

ms, whereas the host captures the snapshot with a

delay of ˆp ms.

Each time a client receives a synchronisation trig-

ger, it captures a frame, and the associated meta-data,

i.e. pose (with respect to the global coordinate system),

camera intrinsic parameters (i.e. focal length, principal

point), device identiﬁer and snapshot counter. To ef-

fectively handle frame captures when synchronisation

triggers are received, we use two threads on the mo-

biles. Triggers are managed by the main thread, which

uses a scheduler to guarantee that a frame is captured

when the calculated delay

∆t

expires. Each captured

frame is passed to a second thread that encodes it into

a chosen format (e.g. JPEG) and enqueues it for trans-

mission to the Data Manager. If there is bandwidth

available over the communication channel, frames are

transmitted immediately, otherwise they are buffered.

In the next section we explain how the Data Manager

processes the received frames.

6 DATA MANAGER

The Data Manager is an application that resides on

the edge unit and functions independently from the

Edge Relay (Fig. 2). The communication between the

Data Manager and the participants is based on HTTP

requests (Berners-Lee et al., 1996). Fig. 4 shows the

architecture of our Data Manager. The Data Manager’s

operation consists of three phases: Stream Initialisa-

tion, Frame Transmission and Stream End.

During the Stream Initialisation phase, the host

sends an initial HTTP request to the Data Manager

containing information about the number of devices

VISAPP 2020 - 15th International Conference on Computer Vision Theory and Applications

734

request handling thread

decoding

frames and

meta-data

frame

meta-data

...

HTTP

data manager

synchronised

queue

cpu core

frame

meta-data

frame

meta-data

frame

meta-data

post

requests

post

requests

post

requests

post

requests

request handling thread

cpu core

merging

cpu core

Figure 4: Data manager architecture.

that it should expect to receive data from within a ses-

sion. When this request is received, the Data Manager

creates a unique Stream ID for the session, which is

sent to the host and clients. Then, the Data Manager

initialises a new thread to perform decoding and merg-

ing operations (explained later).

Frame Transmission occurs when the criteria for

taking a snapshot has been met (Sec. 5). In particular,

a client, after it receives the RPC and after it com-

pensates for the latency, sends the requested frame

along with the meta-data to the Data Manager through

a HTTP request. We measured that the Data Manager

can process a HTTP request in about 200 to 300 ms. To

optimise the HTTP-request ingestion rate on the Data

Manager, each mobile creates multiple and simultane-

ous HTTP requests that will be processed in parallel

by Request Handling Threads. We create as many Re-

quest Handling Threads as the number of CPU cores

available. Then, we measured that a mobile can pro-

cess up to 12 simultaneous requests with negligible

computational time and that the Data Manager can han-

dle up to 100 requests. Therefore we create a policy

where, if

is the number of mobiles connected, each

mobile can create up to

r = min{bN/100c,12}

HTTP

requests, where

b·c

is the rounding to lower integer

operation. Each Request Handling Thread processes

each HTTP request and pushes it into a synchronised

queue, which in turn feeds a decoder in charge of con-

verting frames into a single format (e.g. in JPG, PNG).

We measured that this operation can handle up to four

mobiles transmitting at 20 fps in real-time. Lastly, we

use a merging operation to re-organise data based on

their snapshot counter. If the merging operation de-

tects that, for a given snapshot trigger, the number of

frames received is not the same as the number of mo-

biles connected, the received frames will be labelled

as partial when stored in the database, so the FVV

reconstruction algorithm can handle them accordingly.

A Stream End occurs when the host ends a capture

session. In addition to stopping frame acquisition, the

host also sends a request to terminate the session to

the Data Manager, which in turn waits until the last

acquired snapshots have been received before termi-

nating of the opened Threads.

7 RESULTS

Motivation.

Evaluating multi-view data capture quan-

titatively is challenging because both pose estimation

and synchronisation should be assessed. A possibility

could be to create a FVV using the captured frames

and assess the output quality. However, FVV ground

truth is difﬁcult to obtain, especially when an object

being reconstructed is non-rigid. (Mustafa and Hilton,

2017; Richardt et al., 2016) mainly evaluated their

FVV outputs qualitatively, and selected sub-modules

for the quantitative assessment. Based on a similar

idea, we have performed a qualitative analysis consist-

ing of a live recording using handheld mobiles. We

used four mobiles (two Huawei P20Pro, one OnePlus

Five and one Samsung S9) simultaneously observing a

moving person and we reconstructed this person using

a popular SfM technique, i.e. COLMAP (Schonberger

and Frahm, 2016). Then, we quantiﬁed the perfor-

mance of our system under controlled conditions by

evaluating the reconstruction accuracy (3D triangula-

tion error) of a rotating texture-friendly object. Lastly,

to explicitly determine the end-to-end time difference

of the acquired frames we performed a two-view frame

capture of a stopwatch displayed on an iPad screen.

Implementation.

Our Edge Relay is based on MLAPI

(MLAPI, 2019) and is developed in C#. The Data

Manager is developed in Python. Both applications are

run in a Docker container to facilitate deployment. Our

Edge Relay is a laptop with CPU i7 and 16GB RAM.

The application running on mobiles is developed in

Unity3D using ARCore 1.7 and OpenCV 4.0 libraries.

Captured frames have a size of 640×480 pixels.

7.1 Experimental Setup

3D Reconstruction Assessment.

We placed a refer-

ence object on an adjustable angular-velocity turntable,

and rotated it 270 degrees clockwise and then 270 de-

grees anticlockwise. We 3D-reconstructed the refer-

ence object over time from images captured from two

Huawei P20Pro positioned on tripods: vertically at

the same height, horizontally at 20cm far from each

other, and 40cm far from the rotating object. We used

tripod mounts to reduce pose-estimation errors, as we

wanted to quantify reconstruction errors brought about

by network lag and synchronisation effects. Pairs of

frames with the same snapshot counter are fed into

Multi-view Data Capture using Edge-synchronised Mobiles

735

(a) (b)

Figure 5: Experimental setup: (a) Left-hand camera: red

box shows region of interest for our analysis. (b) Right-hand

camera: green points show keypoints extracted and blue

points show keypoints that have been 3D triangulated.

COLMAP. The 3D-reconstruction algorithm processed

all the pairs captured in the experiment. Fig. 5a shows

the object from the left-hand camera; the red bounding

box highlights the region of interest we have used for

our analysis of keypoints/3D points. Fig. 5b shows the

view from the right-hand camera with the keypoints,

highlighted in green, and the keypoints that have been

3D triangulated, shown in blue.

Latencies have been compared to those obtained us-

ing the Unity3D Relay (Unity3D Multiplayer Service,

2019). We also assessed the quality of 3D reconstruc-

tions over time by comparing the volumetric models of

the reference object under different network latencies

with respect to a ground truth, which was created by

reconstructing the reference object in 3D, frame-by-

frame with one degree of separation between frames,

for a total of 270 degrees. This created two sequences

of 270 aligned frames, one sequence for each cam-

era. By picking one frame from the ﬁrst sequence and

then another frame from the second sequence captured

with a different pose, we could simulate different an-

gular velocities and different frame-capture rates. For

example, say we wanted to simulate the reference ob-

ject rotating at

50deg/s

, captured at

F = 10Hz

. This

corresponds to a 100ms interval between frames, cor-

responding to an object rotation of

5deg

. We can pick

frame

from the ﬁrst sequence and frame

t + 5

from

the second sequence to simulate this condition. In

our experiments we simulated angular velocities be-

tween

50deg/s

and

100deg/s

with steps of

10deg/s

and modelled snapshots that were captured with a fre-

quency of

F = 10Hz

. We then modelled the latency

between the mobile and the Edge Relay by adding nor-

mally distributed delays, i.e.

N (µ, σ)

, where

is the

mean and

is the standard deviation that we measured

on our experimental WiFi network. In order to use real-

istic latency estimates, we recreated latency variation

conditions, ranging from

ms to

150

ms with a step

ms, by injecting delays into the network using

NetEm (NetEm, 2019). The mobiles were connected

via a WiFi 2.4GHz network. We used an off-the-shelf

Table 1: Round Trip Time between mobile and Edge Relay

measured on our WiFi network. ‘Set’ are the latencies set

with NetEm (NetEm, 2019), and ‘Meas’ are mean

standard

deviation calculated over 40 Round Trip time measurements.

Set (ms) 0 30 60 90 120 150

Meas (ms) 14 ± 11 50 ± 14 80 ± 31 108 ± 26 141 ±26 167 ±15

WiFi access point (Thomson TG587nV2). Due to in-

terference caused by neighbouring networks (a typical

scenario nowadays), we observed typical Round Trip

Times (RTT) shown in Tab. 1.

We used the data in Tab. 1 to quantify the triangu-

lation error between ground-truth 3D points and 3D

points triangulated under simulated delays. We cal-

culated the triangulation error using a grid composed

of 16

16-pixel cells deﬁned within the bounding box

shown in Fig. 5a. Within each cell we select the key-

points that have been triangulated in 3D and calculated

the centre of mass of their 3D projection. We then cal-

culated the Euclidean distance between the 3D centres

of mass of the ground truth and the 3D centres of mass

of the points triangulated with different latencies.

End-to-End Delay Assessment.

We used two mo-

biles conﬁgured as the 3D reconstruction case to cap-

ture the time ticked by a stopwatch (up to the mil-

lisecond precision). We extracted the time information

from each frame pair of frames using OCR (Amazon

Textract, 2019) and computed their time difference.

We performed this experiment using the delay com-

pensation activated. The conﬁgurations tested are with

our Edge Relay operating locally (i.e. at the edge) and

in the cloud. For the latter we deployed our Edge Re-

lay on Amazon Web Services Cloud (AWS), and con-

nected the mobiles to the cloud through a high-quality

optical-ﬁbre-based internet connection and though 4G,

to resemble a real capture scenario.

7.2 Experiments

Relay Latency.

We assessed synchronisation trigger

delays by measuring the latency between the host and

clients in the case of our Edge Relay and the Unity3D

Relay. We performed these measurements with a dedi-

cated feature integrated into the app that produces 250

RTT measurements. We then calculated the average

measured RTTs. Unity3D Relay is cloud based and

can be located anywhere around the globe, based on a

user’s location, the closest is usually queried (Unity3D

Network Manager, 2019). In the case of our Edge Re-

lay, we measured the host-client RTT and obtained an

average of

66 ± 37

ms. Then we measured the mobile-

Edge Relay RTT and obtained an average of

14±11

ms.

This means that

66 − 14 · 2 = 30

ms is consumed by

the network devices (e.g. router, access point) and by

the Edge Relay for processing. In the case of Unity3D

VISAPP 2020 - 15th International Conference on Computer Vision Theory and Applications

736

Relay (Unity3D Multiplayer Service, 2019), we mea-

sured the host-client RTT and obtained an average of

89 ± 12

ms. Then we measured the mobile-Unity3D

Relay RTT and obtained an average of

28 ± 13

ms.

This means that

89 − 28 · 2 = 33

ms is consumed by

network devices to reach the cloud and by the Unity3D

Relay for processing. The Unity3D Relay’s processing

time is comparable to that of our Edge Relay, but more

stable, as the standard deviation is smaller. We believe

that we can increase the efﬁciency of our Edge Relay

by re-implementing some core modules of MLAPI

(MLAPI, 2019) in C++. Furthermore, the RTT of

ms in the case of Unity3D Relay has been measured

inside a research centre with a high-quality optical-

ﬁbre-based internet connection. To have an idea of

how other types of internet connections could affect

latency, we performed the same RTT measurement

towards Unity3D Relay but using a traditional 6Mbs

home broadband and a 4G connection, which turned

out to be 255 ± 66ms and 80 ± 7ms, respectively.

To illustrate the direct impact of synchronisation

delays on the 3D reconstruction, we designed another

experiment where our reference object is reconstructed

while rotating at an angular velocity of

ω = 80deg/s

This analysis is performed without using ground-truth

information. We performed various reconstructions of

the object by varying the latency between the mobiles.

In one experiment, the object performed two spins:

the ﬁrst spin was 720 degrees counterclockwise, the

second spin was 720 degrees clockwise. We conducted

six experiments in total, where we injected 30ms of

communication delay into the WiFi network (using

NetEm (NetEm, 2019)) between the two mobiles for

each experiment. Note that if synchronisation trig-

gers are delivered with a delay, the object will appear

with a different pose in the two camera frames. This

affects the computation of the 3D points; matched

keypoints will correspond to different 3D locations,

and there will be parts of an object that might also

be occluded. In this ﬁrst experiment, we could not

accurately measure the triangulation error through the

ground truth because we did not have direct access to

the angular state of the turntable during a spin. There-

fore, to make 3D reconstructions for each trigger and

for each experiment comparable, we quantiﬁed the

output as being the ratio between the number of 3D

points reconstructed and the keypoints visible within

the region of interest. We compared cases with latency

compensation disabled and then enabled. Fig. 6 shows

the variation of the 3D point ratio over time in these

two case. We refer to the left-hand mobile as A and

the right-hand as B. The ﬁrst 100 frames correspond

to 720 degrees of counterclockwise spin. The eight

peaks correlate to a face of the box pointing towards

Figure 6: Instability of the 3D reconstruction when the la-

tency between two mobiles increases up to 150 ms. The 3D

point ratio is deﬁned as the ratio between the number of 3D

points and the number of keypoints counted inside the region

of interest.

both mobiles. When latency compensation is disabled,

mobile A does not delay the frame capture by

ˆp

(Sec. 5) when it sends a synchronisation trigger. As

the induced latency increases, mobile B receives its

trigger later and later. During this time, the object will

have rotated a few degrees in the same the direction

as mobile B, resulting in a more favourable viewpoint

for keypoint matching and 3D triangulation (i.e. it is

almost seeing an identical view as mobile A). Hence,

the 3D point ratio is seen to increase as the induced

latency increases when the rotation is counterclock-

wise. However, the computed 3D points will not be

calculated correctly and they will also have inaccurate

coordinates (we show quantitative evidence of this in

the next session). Vice-versa, when the rotation is

clockwise, mobile A is more likely to capture frames

of object regions that are occluded from mobile B’s

viewpoint as they will have already rotated out of view.

When our latency compensation algorithm is enabled,

3D reconstructions become symmetric in both spin-

ning directions, illustrating its effectiveness.

3D Reconstruction Analysis.

We quantify the recon-

struction accuracy using simulated latency (e.g. ground

truth). Fig. 7 and 8 show the 3D triangulation error in

cases of counterclockwise and clockwise spin, respec-

tively. From these graphs we see that the triangulation

error increases as simulated latency increases. This

occurs because after keypoints are matched across the

two image planes, and, after the keypoints are trian-

gulated in 3D (with an initially-guessed projection

matrix), the Bundle Adjustment algorithm in the SfM

pipeline tries to optimise the parameters of the projec-

tion matrix by minimising re-projection (3D to 2D)

error (Schonberger and Frahm, 2016). Hence, an er-

roneous object’s pose is captured, thus affecting the

estimation of the extrinsic and intrinsic parameters

(e.g. providing different focal-length estimates), and,

on the estimation of 3D points (i.e. highly likely to

be estimated somewhere in between the real 3D posi-

Multi-view Data Capture using Edge-synchronised Mobiles

737

Figure 7: 3D triangulation error in the case of a

counterclockwise-spinning object. The error is computed

relative to the distance between the centre of mass of the two

cameras and the object, which is 40cm. The camera baseline

is ﬁxed at 20cm.

Figure 8: 3D triangulation error in the case of a clockwise-

spinning object. The error is computed relative to the dis-

tance between the centre of mass of the two cameras and

the object, which is 40cm. The camera baseline is ﬁxed at

20cm.

tion of the keypoints of the two frames). Error-rates

in the two cases differ due to the same phenomena

illustrated in Fig. 6. Because synchronisation triggers

are generated from the left-hand mobile, which takes

snapshots upon generation, the right-hand mobile only

takes snapshots when triggers are received. When the

object spins clockwise, the larger the delay the more

often the object appears with self-occluded parts in the

two cameras as it will have rotated more.

End-to-End Delay:

Tab. 2 shows the end-to-end de-

lay between mobiles, measured as the difference be-

tween the times captured from two mobiles through

OCR under different communication conﬁgurations

with the Edge Relay. As expected, the experiments

show evidence that when the Edge Relay is deployed

at the edge we can capture frames with the lowest la-

tency. When the Edge Relay is deployed in the cloud,

even through a highly reliable optical-ﬁbre connection,

we can see that there is a worsening in the performance

due to the extra communication link to AWS.

Qualitative Analysis:

We qualitatively analyse the

performance of our approach by comparing the 3D

reconstruction over time of a moving person using the

standard Unity3D Relay (Unity3D Multiplayer Ser-

vice, 2019) and our Edge Relay. We used the AR

Anchor to estimate the scale of the point cloud in met-

ric units. Fig. 9 shows the dense point clouds in two

Table 2: End-to-end delay between mobiles, automatically

measured by capturing the time ticked by a stopwatch. Case

studies when Edge Relay was deployed in the cloud and

at the edge (i.e. locally). The connection to the cloud was

carried out through 4G and a high-quality optical ﬁbre.

Cloud

Edge

4G Optical ﬁbre

36.36 ± 25.46ms 25.40 ± 18.68ms 20.46 ± 18.95ms

instants of time where (1a,2a) are the outputs with

the Unity3D Relay and (1b,2b) with our Edge Re-

lay. We can see that when the object is still (Case

1) the results (i.e. density of triangulated 3D points)

using Unity3D Relay and Edge Relay are comparable.

Whereas, when the object moves (Case 2), synchro-

nisation is key to achieve accurate 3D triangulation,

and using the Unity3D Relay leads to sparser recon-

structions. We quantiﬁed the 3D triangulation accu-

racy by calculating the average reprojection error after

Bundle Adjustment. We measured

0.22 ± 0.04

pixels

using the Unity3D Relay and

0.20 ± 0.02

pixels us-

ing our Edge Relay. This result shows that we could

achieve a more accurate reconstruction using the Edge

Relay. During these experiments, we also monitored

the percentage of frames successfully received by the

Data Manager from all the mobiles. Given a snapshot

trigger, an instant of time that is captured only by a

subset of mobiles leads to a partial capture, as only

the frames of those mobiles that received the trigger

and performed the capture will be transmitted to the

Data Manager. We name these frames “partial frames”

and, ideally, we would like to achieve zero percent

of partial frames. The percentage of partial frames

we measured using Unity3D Relay was

41%

, whereas

with our Edge Relay it was

24%

. Frame loss is be-

cause snapshot triggers are transmitted through UDP,

that does not acknowledge if packets are received and

does not re-send packets in the case of failed recep-

tion (unlike with TCP). However UDP is necessary to

guarantee the timely delivery of packets. The video

of our qualitative analysis can be found at this link

https://youtu.be/znoJmovdCgs. The video illustrates

the effect of losing 3D points and frames with the

Unity3D Relay.

8 CONCLUSIONS

We proposed a system to move data-relaying from

the cloud to the edge, showing that this is key to

make frame capture synchronisation more reliable than

cloud-based solutions and to enable number-of-users

scalability. Our implementation consists of an Edge

Relay to handle snapshot triggers used for the captur-

ing of images for FVV production, and a Data Manager

VISAPP 2020 - 15th International Conference on Computer Vision Theory and Applications

738

(1a) Unity3D Relay

(1b) Edge Relay

(2a) Unity3D Relay

(2b) Edge Relay

Figure 9: Examples of a moving person reconstructed using

(1a,2a) the Unity3D Relay (Unity3D Multiplayer Service,

2019) and (1b,2b) our Edge Relay. Case 1: When the object

is still we can see that results (i.e. density of triangulated 3D

points) using Unity3D Relay and Edge Relay are comparable.

Case 2: When the object moves, synchronisation is key to

achieve accurate 3D triangulation, and using the Unity3D

Relay leads to sparser reconstructions.

to receive capture frames via HTTP requests. Synchro-

nisation triggers are generated by a host, rather than

by a system timer, to enable a motion-based, adaptive

sampling-rate, fostering reduced data throughput. Al-

though the creation of high-quality FVVs was not the

scope of this work, we succeeded to show the beneﬁt of

our decentralised data capturing system using a state-

of-the-art 3D reconstruction algorithm (i.e. COLMAP)

and by implementing the assessment of end-to-end

capture delays though OCR.

Future research directions include the integration

of a volumetric 4D reconstruction algorithm that can

be executed in real-time on the edge to providing tele-

presence functionality together with the integration

of temporal ﬁltering of 3D reconstructed points to

provide more stable volumetric videos. We also aim

to improve reconstruction accuracy by postprocessing

ARCore’s pose estimates. By the end of this year we

will deploy our system on a 5G network and carry out

the ﬁrst FVV production in uncontrolled environments

using off-the-shelf mobiles.

REFERENCES

Amazon Textract (Accessed: Dec 2019). https://aws.amazon.

com/textract/.

ARCore (Accessed: Dec 2019). https://developers.google.

com/ar.

ARCore Anchors (Accessed: Dec 2019). https://developers.

google.com/ar/develop/developer-guides/anchors.

Bastug, E., Bennis, M., Medard, M., and Debbah, M. (2017).

Toward Interconnected Virtual Reality: Opportunities,

Challenges, and Enablers. IEEE Communications Mag-

azine, 55(6):110–117.

Belshe, M., Peon, R., and Thomson, M. (2015). Hypertext

Transfer Protocol Version 2. RFC 7540.

Berners-Lee, T., Fielding, R., and Nielsen, H. (1996). Hy-

pertext Transfer Protocol Version 1. RFC 1945.

Cadena, C., Carlone, L., Carrillo, H., Latif, Y., Scaramuzza,

D., Neira, J., Reid, I., and Leonard, J. (2016). Past,

Present, and Future of Simultaneous Localization And

Mapping: Towards the Robust-Perception Age. IEEE

Trans. on Robotics, 32(6):1309–1332.

Chen, Y.-H., Balakrishnan, H., Ravindranath, L., and Bahl,

P. (2016). GLIMPSE: Continuous, Real-Time Object

Recognition on Mobile Devices. GetMobile: Mobile

Computing and Communications, 20(1):26–29.

Elbamby, M., Perfecto, C., Bennis, M., and Doppler, K.

(2018). Toward Low-Latency and Ultra-Reliable Vir-

tual Reality. IEEE Network, 32(2):78–84.

Garg, A., Yadav, A., Sikora, A., and Sairam, A. (2018). Wire-

less Precision Time Protocol. IEEE Communication

Letters, 22(4):812–815.

Guillemaut, J.-Y. and Hilton, A. (2011). Joint Multi-Layer

Segmentation and Reconstruction for Free-Viewpoint

Video Applications. International Journal on Com-

puter Vision, 93(1):73–100.

Hu, Y., Niu, D., and Li, Z. (2016). A Geometric Approach

to Server Selection for Interactive Video Streaming.

IEEE Trans. on Multimedia, 18(5):840–851.

Huang, C.-H., Boyer, E., Navab, N., and Ilic, S. (2014). Hu-

man Shape and Pose Tracking Using Keyframes. In

Proc. of IEEE Computer Vision and Pattern Recogni-

tion, Columbus, US.

Jiang, D. and Liu, G. (2017). An Overview of 5G Require-

ments. In Xiang, W., Zheng, K., and Shen, X., editors,

5G Mobile Communications. Springer.

Kim, H., Guillemaut, J.-Y., Takai, T., Sarim, M., and Hilton,

A. (2012). Outdoor Dynamic 3D Scene Reconstruc-

tion. IEEE Trans. on Circuits and Systems for Video

Technology, 22(11):1611–1622.

Knapitsch, A., Park, J., Zhou, Q.-Y., and Koltun, V. (2017).

Tanks and Temples: Benchmarking Large-Scale Scene

Reconstruction. ACM Transactions on Graphics,

36(4):1–13.

Latimer, R., Holloway, J., Veeraraghavan, A., and Sabharwal,

A. (2015). SocialSync: Sub-Frame Synchronization in

a Smartphone Camera Network. In Proc. of European

Conference on Computer Vision Workshops, Zurich,

CH.

MLAPI (Accessed: Dec 2019). https://midlevel.github.io/

MLAPI.

MLAPI Conﬁguration (Accessed: Dec 2019). https://github.

com/MidLevel/MLAPI.Relay.

MLAPI Messaging System (Accessed: Dec 2019). https:

//mlapi.network/wiki/ways-to-syncronize/.

Mur-Artal, R., Montiel, J., and Tard

os, J. (2015). ORB-

SLAM: a versatile and accurate monocular SLAM sys-

tem. IEEE Trans. on Robotics, 31(5):1147–1163.

Mustafa, A. and Hilton, A. (2017). Semantically Coher-

ent Co-Segmentation and Reconstruction of Dynamic

Scenes. In Proc. of IEEE Computer Vision and Pattern

Recognition, Honolulu, US.

Multi-view Data Capture using Edge-synchronised Mobiles

739

NetEm (Accessed: Dec 2019). http://man7.org/linux/

man-pages/man8/tc-netem.8.html.

Park, J., Chou, P., and Hwang, J.-N. (2018). Volumetric

media streaming for augmented reality. In Proc. of

IEEE Global Communications Conference, Abu Dhabi,

United Arab Emirates.

Photon (Accessed: Dec 2019). https://www.photonengine.

com.

Poiesi, F., Locher, A., Chippendale, P., Nocerino, E., Re-

mondino, F., and Gool, L. V. (2017). Cloud-based

Collaborative 3D Reconstruction Using Smartphones.

In Proc. of European Conference on Visual Media Pro-

duction.

Qiao, X., Ren, P., Dustdar, S., Liu, L., Ma, H., and Chen,

J. (2019). Web AR: A Promising Future for Mobile

Augmented Reality–State of the Art, Challenges, and

Insights. Proceedings of the IEEE, 107(4):651–666.

QUIC (Accessed: Dec 2019). https://www.chromium.org/

quic.

Rematas, K., Kemelmacher-Shlizerman, I., Curless, B., and

Seitz, S. (2018). Soccer on your tabletop. In Proc. of

IEEE Computer Vision and Pattern Recognition, Salt

Lake City, US.

Resch, B., Lensch, H. P. A., Wang, O., Pollefeys, M., and

Sorkine-Hornung, A. (2015). Scalable structure from

motion for densely sampled videos. In Proc. of IEEE

Computer Vision and Pattern Recognition, Boston, US.

Richardt, C., Kim, H., Valgaerts, L., and Theobalt, C. (2016).

Dense Wide-Baseline Scene Flow From Two Handheld

Video Cameras. In Proc. of 3D Vision, Stanford, US.

Schonberger, J. and Frahm, J.-M. (2016). Structure-from-

motion revisited. In Proc. of IEEE Computer Vision

and Pattern Recognition, Las Vegas, USA.

Shi, W., Cao, J., Zhang, Q., Li, Y., and Xu, L. (2016). Edge

Computing: Vision and Challenges. IEEE Internet of

Things Journal, 5(3):637–646.

Shi, Z., Wang, H., Wei, W., Zheng, X., Zhao, M., and Zhao,

J. (2015). A novel individual location recommendation

system based on mobile augmented reality. In Proc.

of IEEE Identiﬁcation, Information, and Knowledge in

the Internet of Things, Beijing, CN.

Slabaugh, G., Culbertson, B., Malzbender, T., and Schafer,

R. (2001). A Survey of Methods for Volumetric Scene

Reconstruction from Photographs. In International

Workshop on Volume Graphics, New York, US.

Soret, B., Mogensen, P., Pedersen, K., and Aguayo-Torres,

M. (2014). Fundamental tradeoffs among reliability,

latency and throughput in cellular networks. In Proc.

of IEEE Globecom Workshops, Austin, US.

Sukhmani, S., Sadeghi, M., Erol-Kantarci, M., and Saddik,

A. E. (2019). Edge Caching and Computing in 5G for

Mobile AR/VR and Tactile Internet. IEEE Multimedia,

26(1):21–30.

Unity3D HLAPI (Accessed: Dec 2019). https://docs.unity3d.

com/Manual/UNetUsingHLAPI.html.

Unity3D MatchMaking (Accessed: Dec 2019).

https://docs.unity3d.com/520/Documentation/

Manual/UNetMatchMaker.html.

Unity3D Multiplayer Service (Accessed: Dec 2019). https:

//unity3d.com/unity/features/multiplayer.

Unity3D Network Manager (Accessed: Dec 2019).

https://docs.unity3d.com/ScriptReference/

Networking.NetworkManager-matchHost.html.

Vo, M., Narasimhan, S., and Sheikh, Y. (2016). Spatiotem-

poral Bundle Adjustment for Dynamic 3D Reconstruc-

tion. In Proc. of IEEE Computer Vision and Pattern

Recognition, Las Vegas, US.

Wang, Y., Wang, J., and Chang, S.-F. (2015). CamSwarm:

Instantaneous Smartphone Camera Arrays for Collabo-

rative Photography. arXiv:1507.01148.

Wu, W., Yang, Z., Jin, D., and Nahrstedt, K. (2008). Im-

plementing a distributed tele-immersive system. In

Proc. of IEEE International Symposium on Multime-

dia, Berkeley, US.

Yahyavi, A. and Kemme, B. (2013). Peer-to-peer architec-

tures for massively multiplayer online games: A survey.

ACM Comput. Surv., 46(1):1–51.

Zhang, W., Han, B., and Hui, P. (2018). Jaguar: Low Latency

Mobile Augmented Reality with Flexible Tracking. In

Proc. of ACM International Conference on Multimedia,

Seoul, KR.

Zou, D. and Tan, P. (2013). COSLAM: Collaborative visual

slam in dynamic environments. IEEE Trans. on Pattern

Analysis and Machine Intelligence, 35(2):354–366.

VISAPP 2020 - 15th International Conference on Computer Vision Theory and Applications

740