Self-Supervised Partial Cycle-Consistency for Multi-View Matching
Fedor Taggenbrock¹,², Gertjan Burghouts¹ and Ronald Poppe²
¹ Utrecht University, Utrecht, Netherlands
² TNO, The Hague, Netherlands
ORCIDs: https://orcid.org/0009-0002-6166-0865 (Taggenbrock), https://orcid.org/0000-0001-6265-7276 (Burghouts), https://orcid.org/0000-0002-0843-7878 (Poppe)
Keywords:
Self-Supervision, Multi-Camera, Feature Learning, Cycle-Consistency, Cross-View Multi-Object Tracking.
Abstract:
Matching objects across partially overlapping camera views is crucial in multi-camera systems and requires
a view-invariant feature extraction network. Training such a network with cycle-consistency circumvents the
need for labor-intensive labeling. In this paper, we extend the mathematical formulation of cycle-consistency
to handle partial overlap. We then introduce a pseudo-mask which directs the training loss to take partial
overlap into account. We additionally present several new cycle variants that complement each other and
present a time-divergent scene sampling scheme that improves the data input for this self-supervised setting.
Cross-camera matching experiments on the challenging DIVOTrack dataset show the merits of our approach.
Compared to the self-supervised state-of-the-art, we achieve a 4.3 percentage point higher F1 score with our
combined contributions. Our improvements are robust to reduced overlap in the training data, with substantial
gains in challenging scenes where few matches must be made between many people. Self-supervised
feature networks trained with our method are effective at matching objects in a range of multi-camera settings,
providing opportunities for complex tasks like large-scale multi-camera scene understanding.
1 INTRODUCTION
Matching people and objects across cameras is essen-
tial for multi-camera understanding (Hao et al., 2023;
Loy et al., 2010; Zhao et al., 2020). Matches are
commonly obtained by solving a multi-view match-
ing problem. One crucial factor that determines the
quality of the matching is the feature extractors’ gen-
eralization to varying appearances as a result of ex-
pressiveness and view angle (Ristani and Tomasi,
2018). Feature extractors can be trained in a super-
vised setting, which requires labor-intensive data la-
beling (Hao et al., 2023). The lack or scarcity of la-
beled data for novel domains is a limiting factor. Self-
supervised techniques thus offer an attractive alterna-
tive because they can be trained directly on object and
person bounding boxes without labels.
Effective, view-invariant feature networks have
been learned with self-supervision through cycle-
consistency, for use in multi-view matching, cross-
view multi-object tracking, and re-identification (Re-
ID) (Gan et al., 2021; Wang et al., 2020). Training
these networks only requires sets of objects with sufficient overlap between the views. For multi-person matching
and tracking, sets are typically detections of people
from multiple camera views (Gan et al., 2021; Hao
et al., 2023). When the overlapping field of view
between cameras decreases in the training data, self-
supervised cycle-consistency methods have a diluted
learning signal.
In this work, we address this situation and ex-
tend the theory of cycle-consistency for partial over-
lap with a new mathematical formulation. We then
implement this theory to effectively handle partial
overlap in the training data through a pseudo-mask,
and introduce trainable cycle variations to obtain a
richer learning signal (see Figure 1). Consequently, we get more out of the training data, providing a stronger cycle-consistency learning signal. Our
method is shown to be robust in more challenging set-
tings, with less overlap between cameras and fewer
matches in the training data. It is especially effective
for challenging scenes where few matches need to be
found between many people. The additional informa-
tion from partial cycle-consistency thus leads to sub-
stantial improvements, as shown in the experimental
Figure 1: Overview of our self-supervised cycle-consistency training method. Trainable cycle variations (left bottom) are
constructed from sampled batches (left top). Cycle matrices represent chains of matches starting and ending in the same view.
To handle partial overlap, we construct a pseudo-mask of the identity matrix (top right) to determine which specific cycles
should be trained. This pseudo-mask is then used to provide a weighted loss signal with more emphasis
on the positive predicted cycles (right bottom).
section. The code is also made open source¹.
Our contributions are as follows:
1. We extend the mathematical formulation of
cycle-consistency to handle partial overlap, lead-
ing to a new formulation for partial cycle-
consistency.
2. We use pseudo-masks to implement partial cycle-
consistency and introduce several cycle variants,
motivating how these translate to a richer self-
supervision learning signal.
3. We experiment with cross-camera matching on
the challenging DIVOTrack dataset, and obtain
systematic improvements. Our experiments high-
light the merits of using a range of cycle variants,
and indicate that our approach is especially effec-
tive in more challenging scenarios.
Section 2 covers related work on self-supervised
feature learning. Section 3 summarizes our
mathematical formulation and derivation of cycle-
consistency with partial overlap. Section 4 details our
self-supervised method. We discuss the experimental
validation in Section 5 and conclude in Section 6.
¹ For the open source code and theoretical analysis, see the Supplementary Materials available at GitHub.
2 RELATED WORK
We first address the general multi-view matching
problem, and highlight its application areas. Sec-
tion 2.2 summarizes supervised feature learning,
whereas Section 2.3 details self-supervised alterna-
tives.
2.1 Multi-View Matching
Many problems in computer vision can be framed
as a multi-view matching problem. Examples in-
clude keypoint matching (Sarlin et al., 2020), video
correspondence over time (Jabri et al., 2020), shape
matching (Huang and Guibas, 2013), 3D human pose
estimation (Dong et al., 2019), multi-object track-
ing (MOT) (Sun et al., 2019), re-identification (Re-
ID) (Ye et al., 2021), and cross-camera matching
(CCM) (Han et al., 2022). Cross-view multi-object
tracking (CVMOT) combines CCM with a tracking
algorithm (Gan et al., 2021; Hao et al., 2023). The
underlying problem is that there are more than two
views of the same set of objects, and we want to find
matches between the sets. For MOT, detections be-
tween two subsequent time frames are matched (Wo-
jke et al., 2017). Instead, in CCM, detections from
different camera views should be matched. One par-
ticular challenge is that the observations have signif-
icantly different viewing angles. Such variations should be handled by a view-invariant feature extraction network. These networks can be trained us-
ing identity label supervision but obtaining consistent
labels across cameras is labor-intensive (Hao et al.,
2023), highlighting the need for good self-supervised
alternatives.
2.2 Supervised Feature Learning
Supervised Re-ID methods (Wieczorek et al., 2021;
Ye et al., 2021) work well for CCM. With labels, feature representations of the same instance are pulled closer together, while representations of different instances are pushed apart. Other approaches
such as joint detection and Re-ID learning (Hao et al.,
2023), or training specific matching networks (Han
et al., 2022) have been explored. Supervised meth-
ods for CCM typically degrade in performance when
applied to unseen scenes, indicating issues with over-
fitting. Self-supervised cycle-consistency (Gan et al.,
2021) has been shown to generalize better (Hao et al.,
2023).
2.3 Self-Supervised Feature Learning
Self-supervised feature learning algorithms do not
exploit labels. Rather, common large-scale self-
supervised contrastive learning techniques (Chen
et al., 2020) rely on data augmentation. We argue that
the significant variations in object appearance across
views cannot be adequately modeled through data
augmentations, meaning that such approaches can-
not achieve view-invariance. Clustering-based self-
supervised techniques (Fan et al., 2018) are also not designed to deal with significant variation across views. An-
other alternative is to learn self-supervised features
through forcing dissimilarity between tracklets within
cameras while encouraging association with track-
lets across cameras (Li et al., 2019). Early work on
self-supervised cycle-consistency has shown that this
framework significantly outperforms clustering and
tracklet based self-supervision methods (Wang et al.,
2020). Self-supervision with cycle-consistency is es-
pecially suitable for multi-camera systems because
it enables learning to associate consistently between
the object representations from different cameras and
at different timesteps. Trainable cycles can be con-
structed as series of matchings that start and end at
the same object. Each object should be matched back
to itself as long as the object is visible in all views. If
an object is matched back to a different one, a cycle-
inconsistency has been found which then serves as a
learning signal (Jabri et al., 2020; Wang et al., 2020).
Given the feature representations of detections in
two different views, a symmetric cycle between these
two views can be constructed by combining two soft-
maxed similarity matrices, matching back and forth.
The feature network can then be trained by forc-
ing this cycle to resemble the identity matrix with
a loss (Wang et al., 2020). This approach can be
extended to transitive cycles between three views,
which is sufficient to cover cycle-consistency between
any number of views (Gan et al., 2021; Huang and
Guibas, 2013). With little partial overlap in the train-
ing data, forcing cycles to resemble the full identity
matrix (Gan et al., 2021; Wang et al., 2020) provides
a diluted learning signal that trains many non-existent
cycles without putting proper emphasis on the actual
cycles that should be trained. To effectively handle
partial overlap, it is therefore important to differen-
tiate between possibly existing and absent cycles in
each batch. To this end, we implement a strategy that
makes this differentiation. A work that was developed
in parallel to ours (Feng et al., 2024) has also found
improvements with a related partial masking strategy.
Our work confirms their observations that consider-
ing partial overlap improves matching performance.
In addition, we provide a rigorous mathematical under-
pinning for our method, introduce more cycle varia-
tions, and trace back improvements to characteristics
of the scene including the amount of overlap between
views.
Learning with cycle-consistency is not exclu-
sive to CCM. Cycles between detections at different
timesteps can be employed to train a self-supervised
feature extractor for MOT (Bastani et al., 2021), and
cycles between image patches or video frames can
serve to learn correspondence features at the image
level (Dwibedi et al., 2019; Jabri et al., 2020; Wang
et al., 2019). This broad applicability highlights the importance of a rigorous mathematical derivation of partial cycle-consistency in a self-supervised loss.
3 PARTIAL CYCLE-CONSISTENCY
We summarize the main contributions from our theoretical extension of partial cycle-consistency, which appears in full in the supplementary materials¹. Given are pairwise similarities $S_{ij} \in \mathbb{R}^{n_i \times n_j}$ $\forall i, j$ between the views $V_i, V_j$ that contain $n_i, n_j$ bounding boxes. Partial multi-view matching aims to obtain the optimal partial matching matrices $P_{ij} \in \{0, 1\}^{n_i \times n_j}$ $\forall i, j$, given
Figure 2: Partial cycle-consistency and an interpretation of Equation 5. $I_{ijki}[a, a] = 1$ because $a$ is matched to $b$, matched to $c$, which is then matched back to $a$. The same does not hold for $a'$, so this cycle is absent.
the $S_{ij}$ that are partially cycle-consistent with each other. See also Figure 2. Partial cycle-consistency implies that, among others, matching from view $V_i$ to view $V_j$ and then to view $V_k$ should be a subset of the direct matching between $V_i$ and $V_k$. We make this subset relation explicit, pinpointing which matches get lost through view $V_j$ by inspecting the pairwise matches, proving equivalence to the original definition. We then prove that partial cycle-consistency in general implies the most usable form of cycle-consistency for self-supervision, where matches are combined into full cycles that start and end in the same view and should thus be a subset of the identity matrix. We explicitly define this usable form of partial cycle-consistency in Proposition 1. Based on this insight, in Section 4, we construct subsets of the identity matrix during training to serve as pseudo-masks, improving the training process with partially overlapping views. Our explicit cycle-consistency proposition is as follows:
Proposition 1 (Explicit partial cycle-consistency). If a multi-view matching $\{P_{ij}\}_{i,j}$ is partially cycle-consistent, it holds that:

$P_{ii} = I_{n_i \times n_i} \quad \forall i \in \{1, \ldots, N\},$ (1)

$P_{ij} P_{ji} = I_{iji} \quad \forall i, j \in \{1, \ldots, N\},$ (2)

$P_{ij} P_{jk} P_{ki} = I_{ijki} \quad \forall i, j, k \in \{1, \ldots, N\},$ (3)

where $I_{iji} \subseteq I_{n_i \times n_i}$ is the identity map from view $i$ back to itself, filtering out matches that are not seen in view $V_j$:

$I_{iji}[a, c] = \begin{cases} 1 & \text{if } a = c \;\text{and}\; \exists b \text{ s.t. } P_{ij}[a, b] = 1, \\ 0 & \text{else,} \end{cases}$ (4)

and where $I_{ijki} \subseteq I_{n_i \times n_i}$ is the identity mapping from view $i$ back to itself, filtering out all matches that are not seen in views $V_j$ and $V_k$:

$I_{ijki}[a, d] = \begin{cases} 1 & \text{if } a = d \;\text{and}\; \exists b, c \text{ s.t. } P_{ij}[a, b] = P_{jk}[b, c] = P_{ki}[c, d] = 1, \\ 0 & \text{else.} \end{cases}$ (5)
The notation X[·,·] is used for indexing a matrix X.
The intuition behind Equation 5 is best understood through the visualization in Figure 2. Here, $I_{ijki}[a', a'] = 0$ because the detection of $a'$ is absent in view $V_k$, while $I_{ijki}[a, a] = 1$ because a full cycle is formed from the corresponding pairwise matches. The proofs with detailed explanations are given in the supplementary materials¹.
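To make the proposition concrete, the following is a minimal NumPy sketch (our own illustrative code with hypothetical function names, not from the paper) that builds the filtered identities of Equations 4 and 5 from given binary partial matching matrices.

```python
import numpy as np

def identity_filtered_pair(P_ij):
    """I_iji from Eq. 4: diagonal entry [a, a] is 1 iff object a in view i
    has a match in view j (row a of P_ij is non-empty)."""
    has_match = P_ij.sum(axis=1) >= 1
    return np.diag(has_match.astype(int))

def identity_filtered_triple(P_ij, P_jk, P_ki):
    """I_ijki from Eq. 5: diagonal entry [a, a] is 1 iff pairwise matches chain
    from a in view i through views j and k back to a itself."""
    chain = P_ij @ P_jk @ P_ki          # counts match chains from view i back to view i
    n_i = P_ij.shape[0]
    return ((chain * np.eye(n_i)) >= 1).astype(int)

# Toy example: two people in view i, but only person 0 has matches in all views,
# so only its cycle survives in I_ijki.
P_ij = np.array([[1, 0], [0, 0]])
P_jk = np.array([[1, 0], [0, 0]])
P_ki = np.array([[1, 0], [0, 0]])
print(identity_filtered_triple(P_ij, P_jk, P_ki))   # diag(1, 0)
```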
4 SELF-SUPERVISION WITH PARTIAL CYCLE-CONSISTENCY

The theory of cycle-consistency and its relation to partial overlap can be translated into a self-supervised feature network training strategy. The main challenges are to determine which cycles to train, which loss to use, and how to implement the findings from Proposition 1 to handle partial overlap. Section 4.1 explores which cycles to train and how to construct them. Section 4.2 explores how to obtain partial overlap masks for the cycles that approximate the $I_{iji}, I_{ijki} \subseteq I_{n_i \times n_i}$ from Proposition 1, and how these masks can be incorporated in a loss to deal with partial overlap during training.
4.1 Trainable Cycle Variations
Given are the pairwise similarities $S_{ij}$ between all view pairs, obtained from the feature extractor $\varphi$ that we wish to train. The idea is to combine softmax matchings of the $S_{ij}$ into cycles, similar to Equations 2 and 3. For this we use the temperature-adaptive row-wise softmax $f_\tau$ (Wang et al., 2020) on a similarity matrix $S$ to perform a soft row-wise partial matching. This function has the differentiability needed to train a feature network and the flexibility to make non-matches for low similarity values. We get:

$f_\tau(S[a, b]) = \frac{\exp(\tau S[a, b])}{\sum_{b'} \exp(\tau S[a, b'])},$ (6)

where the notation $S[\cdot, \cdot]$ is used for matrix indexing. The temperature $\tau$ depends on the size of $S$ as in (Wang et al., 2020).
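A minimal PyTorch sketch of this soft row-wise matching is given below; the exact size-dependent choice of $\tau$ follows (Wang et al., 2020) and is therefore left as an argument here.

```python
import torch

def soft_row_match(S: torch.Tensor, tau: float) -> torch.Tensor:
    """Eq. 6: row-wise softmax with temperature tau. Each row of S becomes a
    soft (partial) matching from the objects of one view to those of another."""
    return torch.softmax(tau * S, dim=1)
```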
4.1.1 Pairwise Cycles
The pairwise cycles need to be constructed from just $S_{ij} = S_{ji}^T$. To this end, we take:

$A_{ij} = f_\tau(S_{ij}), \quad A_{ji} = f_\tau(S_{ij}^T), \quad A_{iji} = A_{ij} A_{ji}.$ (7)

The pairwise cycle $A_{iji}$, originally proposed in (Wang et al., 2020), represents a trainable variant of the pairwise cycle $P_{ij} P_{ji}$ from Equation 2, and so a learning signal is obtained by forcing it to resemble $I_{iji}$. Note that $A_{ij}$ and $A_{ji}$ differ because they match the rows and columns of $S_{ij}$, respectively. This is important because the loss then forces these different soft matchings to be consistent with each other, modelling the partial cycle-consistency constraint in Equation 2. The loss will be the same for $A_{iji}$ and $A_{iji}^T = A_{ji}^T A_{ij}^T$, so just Equation 7 suffices.
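Using the soft matching sketched above, the pairwise cycle of Equation 7 can be assembled as follows (illustrative code, not the reference implementation):

```python
def pairwise_cycle(S_ij: torch.Tensor, tau: float) -> torch.Tensor:
    """Eq. 7: match view i to view j over the rows of S_ij, match back over the
    columns, and chain both into the (n_i x n_i) cycle matrix A_iji."""
    A_ij = soft_row_match(S_ij, tau)      # view i -> view j
    A_ji = soft_row_match(S_ij.T, tau)    # view j -> view i
    return A_ij @ A_ji
```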
4.1.2 Triplewise Cycles
The triplewise cycles are constructed from $S_{ij}$, $S_{jk}$ and $S_{ki}$, and should resemble the $P_{ij} P_{jk} P_{ki}$ from Equation 3. The authors in (Feng et al., 2024) propose to use:

$A^0_{ijki} = A_{ij} A_{jk} A_{ki},$ (8)

while in (Gan et al., 2021), the similarities are combined first so that:

$S_{ijk} = S_{ij} S_{jk}, \quad A_{ijk} = f_\tau(S_{ijk}),$ (9)

with which their triplewise cycle is created as:

$A^1_{ijki} = A_{ijk} A_{kji}.$ (10)

We discovered that using multiple triplewise cycle constructions in the training improves the results. Each of the constructed cycles exposes a different inconsistency in the extracted features, so that a combination of cycles provides a robust training signal. We propose to use the following four triplewise cycles:

$A^0_{ijki} = A_{ij} A_{jk} A_{ki},$ (11)

$A^1_{ijki} = A_{ijk} A_{kji},$ (12)

$A^2_{ijki} = A_{ijk} A_{ki},$ (13)

$A^3_{ijki} = A_{ijk} A_{kij} A_{jki}.$ (14)

The cycles from Equations 12-14 are also visualized in Figure 1 as the three blue swirls. In the following, $A_{ijki}$ can be used to refer to any of the four triplewise cycles in Equations 11-14, and additionally $A_{iji}$ when assuming $j = k$. The symmetric property of the loss makes transposed versions of Equations 11-14 redundant.
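The four triplewise cycles can be sketched in the same style; all shapes reduce to $(n_i \times n_i)$. This is our reading of Equations 8-14, not the authors' reference implementation, and it reuses soft_row_match from the earlier sketch.

```python
def triplewise_cycles(S_ij, S_jk, S_ki, tau):
    """Eqs. 11-14: four trainable (n_i x n_i) cycle matrices built from the
    pairwise similarities S_ij (n_i x n_j), S_jk (n_j x n_k), S_ki (n_k x n_i)."""
    A_ij = soft_row_match(S_ij, tau)
    A_jk = soft_row_match(S_jk, tau)
    A_ki = soft_row_match(S_ki, tau)
    A_ijk = soft_row_match(S_ij @ S_jk, tau)        # Eq. 9: combined similarities
    A_kji = soft_row_match((S_ij @ S_jk).T, tau)    # view k back to view i via j
    A_kij = soft_row_match(S_ki @ S_ij, tau)
    A_jki = soft_row_match(S_jk @ S_ki, tau)
    A0 = A_ij @ A_jk @ A_ki            # Eq. 11 (Feng et al., 2024)
    A1 = A_ijk @ A_kji                 # Eq. 12 (Gan et al., 2021)
    A2 = A_ijk @ A_ki                  # Eq. 13 (ours)
    A3 = A_ijk @ A_kij @ A_jki         # Eq. 14 (ours)
    return A0, A1, A2, A3
```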
4.2 Masked Partial Cycle-Consistency Loss

The $A_{ijki}$ can be directly trained to resemble the identity matrix $I_{ii}$, by training each diagonal element in $A_{ijki}$ to be a margin $m$ greater than their corresponding maximum row and column values, similar to the triplet loss (Wang et al., 2020; Gan et al., 2021). This is achieved through:

$\mathcal{L}^{\mathrm{row}}_{m}(A_{ijki}) = \sum_{a=1}^{n_i} \mathrm{relu}\left(\max_{b \neq a}(A_{ijki}[a, b]) - A_{ijki}[a, a] + m\right).$ (15)

The following loss enforces this margin over both the rows and columns:

$\mathcal{L}_{m}(A_{ijki}) = \frac{1}{2}\left(\mathcal{L}^{\mathrm{row}}_{m}(A_{ijki}) + \mathcal{L}^{\mathrm{row}}_{m}(A_{ijki}^T)\right).$ (16)

This loss, however, does not distinguish between absent and existing cycles that occur with partial overlap. Note that the ground truth $I_{ijki}$ are masks (or subsets) of the $I_{ii}$ that exactly filter out such absent cycles, while keeping the existing cycles, according to Equation 5 and visualized in Figure 2. In this figure, detections of the blue person form an absent cycle because the pairwise matches are not connected. The $I_{ijki}$ are constructed based on the ground truth matches $P_{ij}$. We therefore propose to construct pseudo-masks $\widetilde{I}_{ijki}$ from pseudo-matches $\widetilde{P}_{ij}$ that are available during self-supervised training. For this we use:
$\widetilde{P}_{ij} = \begin{cases} \left[\, f_\tau(S_{ij}) > 0.5 \,\right] & \text{if } |V_i| < |V_j|, \\ \left[\, f_\tau(S_{ij}^T)^T > 0.5 \,\right] & \text{if } |V_j| \leq |V_i|, \end{cases}$ (17)

where the Iverson bracket $[\mathrm{Predicate}(X)]$ binarizes matrix $X$, with elements equal to 1 for which the predicate is true, and 0 otherwise. In $\widetilde{P}_{ij}$, each element in a view with fewer elements can be matched to at most one element in the other view, as desired for a partial matching. We construct the pseudo-masks as:
$\widetilde{I}_{ijki}[a, a] = \begin{cases} 1 & \text{if } \exists b, c \text{ s.t. } \widetilde{P}_{ij}[a, b] = \widetilde{P}_{jk}[b, c] = \widetilde{P}_{ki}[c, a] = 1, \\ 0 & \text{else.} \end{cases}$ (18)

$\widetilde{I}_{ijki}$ is invariant to the order in the $i, j, k$ sequence, and independent of the cycle variant for which it is used as a mask. Equation 18 can be vectorized as:

$\widetilde{I}_{ijki} = \left[\, \widetilde{P}_{ij} \widetilde{P}_{jk} \widetilde{P}_{ki} \odot I_{ii} \geq 1 \,\right].$ (19)
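A sketch of the pseudo-match and pseudo-mask construction of Equations 17-19 is given below, again reusing soft_row_match. The tie-break when both views contain the same number of detections, and the choice to compute the mask without gradients, are our assumptions.

```python
def pseudo_matches(S_ij: torch.Tensor, tau: float) -> torch.Tensor:
    """Eq. 17: binarize the soft matching computed over the smaller view, so each
    of its detections is matched to at most one detection in the other view."""
    with torch.no_grad():                           # mask is not trained (assumption)
        if S_ij.shape[0] <= S_ij.shape[1]:          # |V_i| <= |V_j|: match rows of S_ij
            return (soft_row_match(S_ij, tau) > 0.5).float()
        return (soft_row_match(S_ij.T, tau) > 0.5).float().T

def pseudo_mask(Pt_ij, Pt_jk, Pt_ki):
    """Eqs. 18/19: diagonal is 1 where the pseudo-matches chain into a full
    cycle from view i back to itself."""
    chain = Pt_ij @ Pt_jk @ Pt_ki
    eye = torch.eye(chain.shape[0], device=chain.device)
    return ((chain * eye) >= 1).float()
```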
Our masked partial cycle-consistency loss extends the loss from Equation 16 with the pseudo-masks $\widetilde{I}_{ijki}$, for which only the diagonal elements of predicted existing cycles are 1. The absent cycles have diagonal elements of 0. The loss uses two different margins $m_+ > m_{\emptyset} > 0$, where $m_+$ is used for cycles that are predicted to exist with $\widetilde{I}_{ijki}$, and $m_{\emptyset}$ is used for the cycles predicted to be absent:

$\mathcal{L}_{\mathrm{explicit}} = \frac{\mathcal{L}_{m_+}(\widetilde{I}_{ijki} A_{ijki}) + \mathcal{L}_{m_{\emptyset}}((I_{ii} - \widetilde{I}_{ijki}) A_{ijki})}{2}.$ (20)
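The margin losses and the masked loss can be sketched as follows. We read $\widetilde{I}_{ijki} A_{ijki}$ in Equation 20 as a matrix product that keeps only the rows of the predicted existing cycles; the default margins follow the implementation details in Section 5. This is an illustrative sketch, not the authors' implementation.

```python
import torch

def margin_loss_rows(A: torch.Tensor, m: float) -> torch.Tensor:
    """Eq. 15: every diagonal element should exceed the largest off-diagonal
    element in its row by at least the margin m."""
    n = A.shape[0]
    diag_mask = torch.eye(n, dtype=torch.bool, device=A.device)
    off_max = A.masked_fill(diag_mask, float("-inf")).max(dim=1).values
    return torch.relu(off_max - A.diagonal() + m).sum()

def margin_loss(A: torch.Tensor, m: float) -> torch.Tensor:
    """Eq. 16: enforce the margin over both rows and columns."""
    return 0.5 * (margin_loss_rows(A, m) + margin_loss_rows(A.T, m))

def masked_partial_cycle_loss(A, I_mask, m_pos=0.7, m_abs=0.3):
    """Eq. 20: predicted existing cycles (diagonal of I_mask = 1) are trained
    with the larger margin m_pos, predicted absent cycles with m_abs."""
    eye = torch.eye(A.shape[0], device=A.device, dtype=A.dtype)
    return 0.5 * (margin_loss(I_mask @ A, m_pos)
                  + margin_loss((eye - I_mask) @ A, m_abs))
```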
5 RESULTS AND EXPERIMENTS
We demonstrate the merits of a stronger self-
supervised training signal from the addition of our
cycle variations and partial cycle-consistency mask.
We introduce the training setting, before detailing our
quantitative and qualitative results.
Dataset and Metrics. DIVOTrack (Hao et al., 2023) is a large and varied dataset of time-aligned overlapping videos with consistently labeled people across cameras. The train and test set are disjoint, each with 9k frames from three overlapping cameras. Three time-aligned overlapping frames form one matching instance. Frames from 10 different scenes are used, equally distributed over the train and test set. Our self-supervised feature network trains on the 9k matching instances of the train set without its labels. We report the average cross-camera matching precision, recall and F1 score (Han et al., 2022) over the 9k matching instances of the test set, averaged over five training runs with standard deviation. The average number of people per matching instance is around 19, but this varies per scene¹.
Implementation Details. Our contributions extend the state-of-the-art in self-supervised cycle-consistency (Gan et al., 2021). Our cycle variations from Equations 11-14 are used instead of theirs, providing a diverse set of cycles to capture different cycle-inconsistencies. Previous methods without masking (Wang et al., 2020; Gan et al., 2021) use the loss from Equation 16. Our partial masking strategy instead constructs pseudo-masks with Equation 18 and uses these in our explicit partial masking loss from Equation 20, with $m_+ = 0.7$ and $m_{\emptyset} = 0.3$. We use the same training setup as (Gan et al., 2021) for a fair comparison. Specifically, we use annotated bounding boxes without identity labels to extract features and train a ResNet-50 (He et al., 2016) for 10 epochs with an Adam optimizer with learning rate 1e-5. Matching inference uses the Hungarian algorithm between all view pairs, with an optimized partial overlap parameter to handle non-matches.
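As an illustration of this inference step, a view pair could be matched as sketched below; the cosine similarity and the rejection threshold stand in for the optimized partial overlap parameter, whose exact form is not spelled out here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_view_pair(feats_i, feats_j, overlap_threshold=0.5):
    """Hungarian matching between two views with rejection of weak pairs.
    feats_*: L2-normalized feature arrays of shape (n, d)."""
    S = feats_i @ feats_j.T                     # cosine similarities
    rows, cols = linear_sum_assignment(-S)      # maximize total similarity
    return [(r, c) for r, c in zip(rows, cols) if S[r, c] >= overlap_threshold]
```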
Time-Divergent Scene Sampling. Detections from multiple cameras at two timesteps are used in a batch, such that cycles are constructed and trained between the pairs and triples of the resulting 2C views of the same scene, with C the number of cameras (Feng et al., 2024; Gan et al., 2021). Time-divergent scene sampling gradually increases the interval t between the two timesteps during training, with t equal to the current epoch number. It also uses fractional sampling to obtain a balanced batch order, such that the local distribution of scenes resembles the average global distribution of scenes.
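A rough sketch of how such a sampler could be organized is given below; it is our own construction, and the fractional sampling is simplified to a round-robin over shuffled per-scene frame lists.

```python
import random

def time_divergent_batches(frames_per_scene, epoch):
    """Yield (scene, t, t + dt) batch specifications with dt equal to the current
    epoch, interleaving scenes so the local batch order stays close to the
    global scene distribution."""
    dt = max(1, epoch)
    pools = {s: list(range(n - dt)) for s, n in frames_per_scene.items() if n > dt}
    for pool in pools.values():
        random.shuffle(pool)
    while any(pools.values()):
        for scene, pool in pools.items():       # round-robin keeps scenes balanced
            if pool:
                t = pool.pop()
                yield scene, t, t + dt
```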
5.1 Main Results
We show the overall effectiveness of our cycle varia-
tions and partial masking as additions to the existing
SOTA within the framework of self-supervised cycle-
consistency (Gan et al., 2021) in Table 1.
The first paper in this framework (Wang et al.,
2020) used a simple baseline approach with just
pairwise cycles, and showed its effectiveness com-
pared to multiple other self-supervised feature learn-
ing methods using clustering (Fan et al., 2018) and
tracklet based techniques (Li et al., 2019) among oth-
ers. The authors in (Gan et al., 2021) and (Feng
et al., 2024) expanded upon this framework, where
(Feng et al., 2024) is not open source. We report
the results in our paper both with and without time-
divergent scene sampling, as this simply makes the
data input richer, improving performance regardless
of which cycle-consistency method is used. We find
that combining cycle variations, partial masking and
time-divergent scene sampling boosts the F1 matching score of the previous SOTA by 4.3 percentage
Table 1: Cycle variations and partial masking together improve the overall matching performance by 2.5 and 2.1 percentage points (standard and time-divergent sampling, respectively). Every method benefits from time-divergent scene sampling, and combining everything boosts the previous SOTA by 4.3 percentage points, also improving stability.

Model | Standard (Precision / Recall / F1) | Time-Div. Scene Sampling (Precision / Recall / F1)
MvMHAT (Gan et al., 2021) | 66.3 / 60.1 / 63.1±1.7 | 68.0 / 62.8 / 65.3±1.3
Cycle variations (CV) | 68.8 / 61.1 / 64.7±1.9 | 70.4 / 62.3 / 66.1±1.4
CV + Partial masking | 71.0 / 61.0 / 65.6±1.1 | 71.7 / 63.6 / 67.4±0.9
Table 2: Results per scene. Our methods improve the average F1 score on every scene. Crowded challenging test scenes like
Ground, Side and Shop benefit most, with improvements of 9.1, 5.6 and 4.7 percentage points respectively.
Methods Gate2 Square Moving Circle Gate1
MvMHAT (Gan et al., 2021) 88.1 73.3 73.1 67.4 67.2
Ours w/o Masking 88.3(+0.2) 74.9(+1.6) 74.9(+1.8) 68.7(+1.3) 69.6(+2.4)
Ours 88.3(+0.2) 74.9(+1.6) 76.2(+3.1) 69.9(+2.5) 70.4(+3.2)
Methods Floor Park Ground Side Shop
MvMHAT (Gan et al., 2021) 64.7 58.2 56.9 56.0 42.1
Ours w/o Masking 65.2(+0.5) 58.4(+0.2) 64.5(+7.6) 58.9(+2.9) 45.5(+3.4)
Ours 66.8(+2.1) 60.4(+2.2) 66.0(+9.1) 61.6(+5.6) 46.8(+4.7)
points, and that this combination is also the most con-
sistent of all approaches. To put the results of Table
1 in perspective, we report that the F1 matching score
of a Resnet pretrained on ImageNet is 16.8, while a
supervised SOTA Re-ID model (Ye et al., 2021) with
an optimized network architecture and hard negative
mining is able to obtain a matching score of 82.28.
This illustrates the strength of self-supervised cycle-
consistency in general, showcasing its ability to sig-
nificantly improve the feature quality of a pretrained
ResNet. It also shows that our unoptimized self-
supervised method is not too far from an optimized su-
pervised baseline.
5.1.1 Results per Scene
The 10 scenes in the train and test data provide dif-
ferent challenges. During training, scenes with lit-
tle overlap provide a worse learning signal for the
overall model. During testing, scenes that require
few matchings between many people are significantly
more challenging. Insights into the overlap and number of people per scene are provided in the supplementary materials¹. The scenes Ground, Side and Shop
contain the highest number of people, around 24-32
per frame on average. The scenes Side and Shop
also have little overlap, so that few matches need
to be correctly found from many possible ones. These
scenes can thus be considered as the most challenging
test set scenes. Table 2 reports the matching results
per scene. Our methods outperform (Gan et al., 2021)
on every test set, with the largest (relative) gains on
Ground, Side and Shop, with 9.1, 5.6 and 4.7 percent-
age points, respectively, highlighting the improved
expressiveness of our feature network.
5.1.2 Partial Overlap Experiments
We experiment with artificially reducing the field of
view in the training data by 20-40%. We implement
this by reducing the actual width of each camera view
starting from the right, throwing away the bounding
boxes outside this reduced field of view. We train on
these reduced-overlap datasets and measure each method's robustness, because self-supervision
through cycle-consistency learns from overlap. An
overlap analysis for the original and reduced datasets
is provided in Table 3, and the evaluation results when
training with the reduced data are shown in Table 4.
We observe that our method is robust and contributes
to the performance even in these harder training sce-
narios.
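The reduction can be sketched as a simple filter on the annotations (illustrative code; the [x1, y1, x2, y2] pixel box format and the requirement that a box lies fully inside the reduced view are assumptions):

```python
def reduce_fov(boxes, image_width, keep_fraction=0.8):
    """Keep only bounding boxes that still fall inside the reduced view, which is
    cropped from the right so that keep_fraction of the original width remains."""
    max_x = keep_fraction * image_width
    return [box for box in boxes if box[2] <= max_x]   # box = [x1, y1, x2, y2]
```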
5.1.3 Cycle Variations Ablation
Our cycle variations use Equations 11-14 to construct multiple trainable cycles to obtain a richer learning signal. We perform an ablation study on the effectiveness of each cycle, with and without masking, in Table 5. We find that our new $A_{ijk} A_{ki}$ and $A_{ijk} A_{kij} A_{jki}$ cycles from Equations 13 and 14 perform well on their own and even better when combined with the cycles from Equations 11 and 12. We observe that multiple cycle variations work especially well in the pres-
Figure 3: Qualitative example during training. Each of the blue swirls, representing Equations 12-14, constructs a cycle matrix
with various cycle-inconsistencies. Partial overlap requires that only some of the diagonal elements are trained as cycles. The
pseudo-mask correctly finds the existing cycles, except for a heavily occluded one. A strong learning signal is obtained from
one of the diagonals of the dark blue cycle.
Table 3: The original train dataset has an average of 40% IoU between any two cameras, 26% people visible in all three
cameras, and 18.4 unique people per frame. We reduce the FOV to simulate harder train data with less overlap.
Jaccard Index Full Train|Test 80% Train Overlap 60% Train Overlap
Two Cameras 0.40|0.38 0.37 0.29
Three Cameras 0.26|0.23 0.24 0.15
Num People 18.4|19.4 16.5 14.0
Table 4: Our methods consistently improve performance, even with sparser training data that is reduced in partial overlap.

Methods (test set F1 score) | Full Train | 80% Train Overlap | 60% Train Overlap
MvMHAT (Gan et al., 2021) | 63.1±1.7 | 60.6±1.6 | 55.0±2.3
Ours w/o Masking | 66.1±1.4 | 63.0±1.9 | 56.5±2.3
Ours | 67.4±0.9 | 63.8±1.2 | 57.9±1.5
ence of masking, showing that these methods partly
complement each other.
5.2 Qualitative Results
Figure 3 illustrates the contribution of the various cy-
cles and pseudo-mask during training. In this specific
example, it can be seen how the varied cycle construc-
tions are cycle-inconsistent in different ways. Con-
sequently, a robust learning signal is obtained from
combining these cycle variants. The figure also shows the pseudo-mask $\widetilde{I}_{ijki}$ that is constructed for this batch, where the existing cycles are correctly found with the exception of a severely occluded one in the top left, which we would not want to train anyway. The low
value of 0.36 on the diagonal of the dark blue cycle
means that a strong self-supervised learning signal is
obtained from the masking, forcing the model to out-
put more similar features for the different views of the
person in pink.
Figure 4 provides insight into the test set match-
ing performance of our model compared to (Gan
et al., 2021). It shows how our model effectively
finds the pairwise matches at test time in a crowded
Table 5: Ablation of the cycle variations, also linking the equations with illustrations. Our new $A_{ijk} A_{ki}$ and $A_{ijk} A_{kij} A_{jki}$ cycles work well individually, and even better in combination with the other cycle variations. Partial masking is also most effective when combined with multiple cycle variations. Our final method uses the setup from the bottom row with masking.

Cycle variants (columns): $A_{ijk} A_{kji}$, Eq. 12 (Gan et al., 2021) | $A_{ijk} A_{ki}$, Eq. 13 [Ours] | $A_{ijk} A_{kij} A_{jki}$, Eq. 14 [Ours] | $A_{ij} A_{jk} A_{ki}$, Eq. 11 (Feng et al., 2024). F1 scores per cycle combination, w/o masking | with masking:
65.1±0.9 | 66.7±0.6
66.4±1.0 | 66.2±1.5
65.6±1.8 | 66.4±1.2
57.7±1.5 | 55.9±1.2
65.6±1.5 | 66.7±0.9
66.2±1.2 | 66.9±0.7
66.3±1.0 | 67.2±1.1
Figure 4: Qualitative example during matching inference for a difficult frame in the test set. Our model is able to match with significantly fewer false positives. The matches found with our method are based on subtle clothing details, and have been correctly found in the presence of significant view angle differences and occlusion, significantly improving over the previous SOTA.
scene. Note the difficulty of the matching problem,
and how our method has significantly fewer false pos-
itive matches. The figure also demonstrates that our
method is able to match significantly different repre-
sentations of the same person across cameras, while
differentiating between very similar looking people
based on subtle clothing details.
6 CONCLUSIONS
We have extended the mathematical formulation of
cycle-consistency to partial overlaps between views.
We have leveraged these insights to develop a self-
supervised training setting that employs multiple
new cycle variants and a pseudo-masking approach
to steer the loss function. The cycle variants ex-
pose different cycle-inconsistencies, ensuring that the
self-supervised learning signal is more diverse and
therefore stronger. We also presented a time-divergent batch sampling approach for self-supervised
cycle-consistency. Our methods combined improve
the cross-camera matching performance of the cur-
rent self-supervised state-of-the-art on the challeng-
ing DIVOTrack benchmark by 4.3 percentage points
overall, and by 4.7-9.1 percentage points for the most
challenging scenes.
Our method is effective in other multi-camera
downstream tasks such as Re-ID and cross-view
multi-object tracking. One limitation of self-
supervision with cycle-consistency is its dependence
on bounding boxes in the training data. Detections
from an untrained detector could be used to train with
instead, but this would likely degrade performance.
Another area for improvement is to take location and
relative distances into account both during training
and testing, as this provides additional identity information.
Self-supervision through cycle-consistency is ap-
plicable to many more settings than just learning
view-invariant object features. We believe the tech-
niques introduced in this paper also benefit works that
use cycle-consistency to learn image, patch, or key-
point features from videos or overlapping views.
REFERENCES
Bastani, F., He, S., and Madden, S. (2021). Self-supervised
multi-object tracking with cross-input consistency.
Advances in Neural Information Processing Systems,
34:13695–13706.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020).
A simple framework for contrastive learning of visual
representations. In International conference on ma-
chine learning, pages 1597–1607. PMLR.
Dong, J., Jiang, W., Huang, Q., Bao, H., and Zhou, X.
(2019). Fast and robust multi-person 3d pose esti-
mation from multiple views. In Proceedings of the
IEEE/CVF conference on computer vision and pattern
recognition, pages 7792–7801.
Dwibedi, D., Aytar, Y., Tompson, J., Sermanet, P., and Zis-
serman, A. (2019). Temporal cycle-consistency learn-
ing. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pages 1801–
1810.
Fan, H., Zheng, L., Yan, C., and Yang, Y. (2018). Un-
supervised person re-identification: Clustering and
fine-tuning. ACM Transactions on Multimedia Com-
puting, Communications, and Applications (TOMM),
14(4):1–18.
Feng, W., Wang, F., Han, R., Qian, Z., and Wang, S. (2024).
Unveiling the power of self-supervision for multi-
view multi-human association and tracking. arXiv
preprint arXiv:2401.17617.
Gan, Y., Han, R., Yin, L., Feng, W., and Wang, S. (2021).
Self-supervised multi-view multi-human association
and tracking. In Proceedings of the 29th ACM Inter-
national Conference on Multimedia, pages 282–290.
Han, R., Wang, Y., Yan, H., Feng, W., and Wang, S. (2022).
Multi-view multi-human association with deep as-
signment network. IEEE Transactions on Image Pro-
cessing, 31:1830–1840.
Hao, S., Liu, P., Zhan, Y., Jin, K., Liu, Z., Song, M., Hwang,
J.-N., and Wang, G. (2023). Divotrack: A novel
dataset and baseline method for cross-view multi-
object tracking in diverse open scenes. International
Journal of Computer Vision, pages 1–16.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Huang, Q.-X. and Guibas, L. (2013). Consistent shape maps
via semidefinite programming. Computer graphics fo-
rum, 32(5):177–186.
Jabri, A., Owens, A., and Efros, A. (2020). Space-time cor-
respondence as a contrastive random walk. Advances
in neural information processing systems, 33:19545–
19560.
Li, M., Zhu, X., and Gong, S. (2019). Unsupervised tracklet
person re-identification. IEEE transactions on pattern
analysis and machine intelligence, 42(7):1770–1782.
Loy, C. C., Xiang, T., and Gong, S. (2010). Time-delayed
correlation analysis for multi-camera activity under-
standing. International Journal of Computer Vision,
90:106–129.
Ristani, E. and Tomasi, C. (2018). Features for multi-target
multi-camera tracking and re-identification. In Pro-
ceedings of the IEEE conference on computer vision
and pattern recognition, pages 6036–6046.
Sarlin, P.-E., DeTone, D., Malisiewicz, T., and Rabinovich,
A. (2020). Superglue: Learning feature matching
with graph neural networks. In Proceedings of the
IEEE/CVF conference on computer vision and pattern
recognition, pages 4938–4947.
Sun, S., Akhtar, N., Song, H., Mian, A., and Shah, M.
(2019). Deep affinity network for multiple object
tracking. IEEE transactions on pattern analysis and
machine intelligence, 43(1):104–119.
Wang, X., Jabri, A., and Efros, A. A. (2019). Learning cor-
respondence from the cycle-consistency of time. In
Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 2566–
2576.
Wang, Z., Zhang, J., Zheng, L., Liu, Y., Sun, Y., Li, Y., and
Wang, S. (2020). Cycas: Self-supervised cycle as-
sociation for learning re-identifiable descriptions. In
Computer Vision–ECCV 2020: 16th European Con-
ference, Glasgow, UK, August 23–28, 2020, Proceed-
ings, Part XI 16, pages 72–88. Springer.
Wieczorek, M., Rychalska, B., and Dąbrowski, J. (2021).
On the unreasonable effectiveness of centroids in im-
age retrieval. In Neural Information Processing: 28th
International Conference, ICONIP 2021, Sanur, Bali,
Indonesia, December 8–12, 2021, Proceedings, Part
IV 28, pages 212–223. Springer.
Wojke, N., Bewley, A., and Paulus, D. (2017). Simple on-
line and realtime tracking with a deep association met-
ric. In 2017 IEEE international conference on image
processing (ICIP), pages 3645–3649. IEEE.
Ye, M., Shen, J., Lin, G., Xiang, T., Shao, L., and Hoi, S. C.
(2021). Deep learning for person re-identification: A
survey and outlook. IEEE transactions on pattern
analysis and machine intelligence, 44(6):2872–2893.
Zhao, J., Han, R., Gan, Y., Wan, L., Feng, W., and Wang,
S. (2020). Human identification and interaction detec-
tion in cross-view multi-person videos with wearable
cameras. In Proceedings of the 28th ACM Interna-
tional Conference on Multimedia, pages 2608–2616.