Tracking Verification Algorithm Based on Channel Reliability

JunChang Zhang 1,2, ChenYang Xia and JinJin Wan

Institute of Electronics and Information, Northwestern Polytechnical University, Xi’an 710072, China

Key Laboratory of Photoelectric Control Technology, Luoyang, Henan 471000 China

Keywords: Correlation filter, Channel reliability, Tracker, Validator, Siamese convolutional neural network

Abstract: Most tracking algorithms suffer performance degradation under fast motion, illumination changes, target deformation, occlusion, out-of-plane rotation, low-resolution images, and similar conditions. This paper proposes a tracking verification algorithm based on channel reliability. The tracker part of the algorithm is a correlation filter based on channel reliability: a reliability weight is calculated for each feature channel of the correlation filter input, and each weight is multiplied by the response of the corresponding channel to obtain the final response, which makes target localization more accurate. The validator part uses a Siamese dual-input convolutional neural network. Every few frames, the validator verifies the result of the tracker. If the result is verified as reliable, the tracking result is not modified. Otherwise, the validator re-detects the target location and verifies its reliability through the Siamese dual-input network, and the tracker takes this location as the new target location and continues tracking from it, which makes target tracking more durable and robust. Experimental evaluation on the OTB13 video sequences shows that the proposed algorithm adapts well to fast target motion, illumination change, target deformation, occlusion, and out-of-plane rotation, and is robust.

1 INTRODUCTION

As one of the basic technologies of computer vision, target tracking is widely used in video surveillance, human-computer interaction, robotics (Smeulders and Chu, 2014) and other fields. Although target tracking has achieved a series of results in recent years, many difficulties and challenges remain, such as occlusion, rotation, illumination changes, and posture changes.

Existing model-free visual tracking algorithms are often classified as discriminative or generative. Discriminative algorithms can be learned by multi-instance learning (MIL), compressed sensing, P-N learning, structured output SVM (Hare, Golodetz, Saffari, Vineet, Cheng, Hicks, and Torr, 2016), online boosting, and the like. In contrast, generative trackers typically treat tracking as a search for the region most similar to the target. To this end, various object appearance modeling methods have been proposed, such as incremental subspace learning and sparse representation (Fan and Xiang, 2017). Currently, one of the new trends in improving tracking accuracy is the use of deep learning tracking methods (Fan and Ling, 2017, Ma, Huang and Yang, 2015, Nam and Han, 2016) because of their strong discriminative power, as shown in (Nam and Han, 2016). However, deep learning-based tracking algorithms are computationally intensive and therefore struggle to run in real time.

Since the MOSSE algorithm was proposed, the correlation filter (CF) has been considered a robust and efficient method for visual tracking problems (Bolme, Beveridge, Draper and Lui, 2010). Improvements proposed on top of the MOSSE algorithm include kernels and HOG features, color name features or color histograms (Bertinetto, Valmadre, Golodetz, Miksik, and Torr, 2016), sparse tracking (Zhang, Bibi and Ghanem, 2016), adaptive scale estimation, mitigation of boundary effects (Danelljan, Hager, Shahbaz Khan, and Felsberg, 2015), the Context-Aware correlation filter (Mueller, Smith, Ghanem, 2017), and the fusion of deep convolutional network features (Ma, Huang and Yang, 2015).

Zhang, J., Xia, C. and Wan, J.

Tracking Verification Algorithm Based on Channel Reliability.

DOI: 10.5220/0008095900590064

In Proceedings of the International Conference on Advances in Computer Technology, Information Science and Communications (CTISC 2019), pages 59-64

ISBN: 978-989-758-357-5

Although the tracking algorithms mentioned above have improved speed or accuracy, real-time, high-quality tracking algorithms are still rare, so seeking a trade-off between speed and accuracy is a trend for future tracking (Mueller, Smith, Ghanem, 2017, Ma, Yang, Zhang and M.H. Yang, 2015). Context-Aware Correlation Filter Tracking (Mueller, Smith, Ghanem, 2017) proposes a new correlation filter framework that incorporates global background information into the learned filter. Adding this background information to the Staple algorithm improves robustness to large scale changes, background clutter and partial occlusion, with relatively little impact on speed. However, the algorithm is relatively less robust to in-plane and out-of-plane rotation, dramatic illumination changes, and fast motion. To track the target more accurately, a tracking algorithm is needed that balances real-time performance against robustness. Therefore, this paper proposes a video target verification tracking algorithm based on channel reliability.

The algorithm in this paper consists of two parts: a tracker and a validator. The validator is implemented by a Siamese convolutional neural network. The two parts are independent of each other and work together. The algorithm has two advantages. (1) The channel reliability method makes target localization more accurate: the response of each feature channel is multiplied by a corresponding weight, and the weighted responses are summed. (2) The result of the tracker is verified every few frames. When the validator finds that the tracking result of a frame is incorrect, it re-detects the target to find its position and returns the new position to the tracker as the target position for the erroneous frame, so that the tracker continues tracking from this position.

2 THE TARGET VERIFICATION TRACKING ALGORITHM BASED ON CHANNEL RELIABILITY

The algorithm in this paper consists of two parts: tracker T and verifier V. The tracker is implemented using a Context-Aware correlation filter to ensure real-time localization of the target. The tracker sends a verification request to the validator every fixed number of frames and responds to feedback from the validator by adjusting the tracking result or updating the model. The validator is implemented using a Siamese convolutional neural network. After receiving a request from the tracker, the validator first verifies whether the tracking result is correct and then provides feedback to the tracker. The overall block diagram is shown in Figure 1.

Figure 1: Overall block diagram of video target

verification tracking algorithm based on channel reliability.

2.1 Channel Reliability Estimation

Channel reliability is calculated from the properties of the constrained least-squares solution obtained during filter design. The channel reliability score is used as the weight of each channel's filter response during localization, as shown in Figure 2.

Figure 2: Channel reliability weights calculated in the

constraint optimization step of correlation filter learning

reduce the noise of the weighted average filter response.

The feature channel reliability used in the target localization phase is obtained by multiplying the channel learning reliability measure and the channel detection reliability measure. Assume that $N_d$ is the total number of channels of the HOG feature for a given correlation filter. The corresponding set of mutually independent channel features is $f = \{f_d\}_{d=1:N_d}$, where each $f_d$ is a real-valued feature channel. A discriminative feature channel $f_d$ produces a filter $w_d$ whose output $f_d * w_d$ is almost identical to the ideal response $g$.

CTISC 2019 - International Conference on Advances in Computer Technology, Information Science and Communications

On the other hand, the output response is noisy on feature channels with low discriminative power, and the global error reduction of the least-squares solution significantly reduces the peak of the maximum response. Therefore, the channel learning reliability is taken as the maximum response of the learned filter.

In the channel detection reliability measurement phase, the expressiveness of the major mode in each channel response indicates the detection reliability of that channel. Bolme et al. also proposed a similar measure to detect target loss. Our measure is based on the ratio between the second and first major modes in the response map, i.e. $\rho_{max2}/\rho_{max1}$. Note that this ratio penalizes cases in which multiple similar objects appear in the target vicinity, since these result in multiple equally expressed modes even though the major mode accurately depicts the target position. To prevent such penalization, the ratio is clamped at 0.5. Therefore, the per-channel detection reliability is estimated as:

$$\tilde{w}_d^{(\mathrm{det})} = 1 - \min\left(\rho_{max2}/\rho_{max1},\ \tfrac{1}{2}\right) \qquad (1)$$
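The detection reliability of Equation (1) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names `major_modes` and `detection_reliability` are hypothetical, a mode is assumed to be a cell strictly greater than its 8-connected neighbours, and a channel without a positive peak is assumed to have zero reliability.

```python
def major_modes(response):
    """Return the local-maximum values of a 2-D response map, largest first.

    Assumption: a cell is a mode if it is strictly greater than all of its
    existing 8-connected neighbours.
    """
    rows, cols = len(response), len(response[0])
    modes = []
    for r in range(rows):
        for c in range(cols):
            value = response[r][c]
            neighbours = [
                response[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbours):
                modes.append(value)
    return sorted(modes, reverse=True)


def detection_reliability(response, clamp=0.5):
    """Per-channel detection reliability: 1 - min(rho_max2 / rho_max1, clamp)."""
    modes = major_modes(response)
    if not modes or modes[0] <= 0:
        return 0.0  # assumption: no positive peak means an unreliable channel
    rho_max1 = modes[0]
    # A missing or negative second mode is treated as zero (assumption).
    rho_max2 = max(modes[1], 0.0) if len(modes) > 1 else 0.0
    return 1.0 - min(rho_max2 / rho_max1, clamp)
```

With the clamp at 0.5, the reliability always lies between 0.5 and 1 for a channel that has a positive peak, so a second object of similar appearance lowers the weight of a channel but never suppresses it completely.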

2.2 Algorithm for Correlation Filter of Context-Aware Based on Channel Reliability

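The traditional correlation filter tracking algorithm uses ridge regression to learn the filter $w$. $A_0$ is a circulant matrix formed by all cyclic shifts of the target image block, $y$ is the ideal response, and $\lambda_1$ is the regularization parameter:

$$\min_{w} \|A_0 w - y\|_2^2 + \lambda_1 \|w\|_2^2 \qquad (2)$$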

Unlike traditional correlation filter frameworks, the Context-Aware Correlation Filter framework adds more background information. In each frame, we sample $k$ context image blocks $a_i$ around the target $a_0$ according to a uniform sampling strategy ($k=4$). The corresponding circulant matrices are $A_i$ and $A_0$. These context image blocks contain global background information with various distractors and different background forms, and can be regarded as true negative samples. Intuitively, we need to learn a filter $w$ that has a high response to the target and a response close to zero to the context image blocks. This is achieved by adding the context image blocks as a regularization term to the standard formula (2). As a result, the response of the target image block is regressed to the ideal response $y$, and the context image blocks are regressed to zero, controlled by the parameter $\lambda_2$:

$$\min_{w} \|A_0 w - y\|_2^2 + \lambda_1 \|w\|_2^2 + \lambda_2 \sum_{i=1}^{k} \|A_i w\|_2^2 \qquad (3)$$

where $A_i$ is the circulant matrix formed by all cyclic shifts of the context image block $a_i$ sampled around the target. The filter is learned over all $N_d$ feature channels of the correlation filter.
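For reference, the context-aware objective of formula (3) admits a closed-form solution. The derivation sketched below follows the single-channel case of Mueller, Smith and Ghanem (2017). The stacked matrix $B$, the stacked target $\bar{y}$, and the notation $\hat{\cdot}$ (discrete Fourier transform), ${}^{*}$ (complex conjugate) and $\odot$ (element-wise product) are introduced here for illustration and are not taken from this paper.

```latex
% Stack the target block and the k context blocks into one ridge regression.
B = \begin{bmatrix} A_0 \\ \sqrt{\lambda_2}\,A_1 \\ \vdots \\ \sqrt{\lambda_2}\,A_k \end{bmatrix},
\qquad
\bar{y} = \begin{bmatrix} y \\ 0 \\ \vdots \\ 0 \end{bmatrix},
\qquad
\|B w - \bar{y}\|_2^2 = \|A_0 w - y\|_2^2 + \lambda_2 \sum_{i=1}^{k} \|A_i w\|_2^2

% Setting the gradient of \|B w - \bar{y}\|_2^2 + \lambda_1 \|w\|_2^2 to zero:
w = \left( B^{\top} B + \lambda_1 I \right)^{-1} B^{\top} \bar{y}

% Circulant matrices are diagonalized by the DFT, so the solution is element-wise:
\hat{w} = \frac{\hat{a}_0^{*} \odot \hat{y}}
               {\hat{a}_0^{*} \odot \hat{a}_0 + \lambda_1
                + \lambda_2 \sum_{i=1}^{k} \hat{a}_i^{*} \odot \hat{a}_i}
```

Because the division is element-wise in the Fourier domain, adding the $k$ context blocks only adds $k$ extra terms to the denominator, which is consistent with the small speed impact reported for the context-aware framework.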

Therefore, the final response of the algorithm is the product of the maximum response value obtained from formula (3) and the detection reliability estimate $\tilde{w}_d^{(\mathrm{det})}$ of each feature channel, so that the target position can be located more accurately.
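The channel weighting can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the authors' code: the function names `fuse_channels` and `locate_peak` are hypothetical, the per-channel learning and detection reliabilities are passed in as precomputed lists, and the weights are left unnormalized because normalization does not change the position of the peak.

```python
def fuse_channels(responses, learn_reliability, det_reliability):
    """Weight each channel response map by its reliability and sum the maps.

    responses         : list of 2-D response maps, one per feature channel
    learn_reliability : per-channel learning reliability (max training response)
    det_reliability   : per-channel detection reliability, as in Equation (1)
    Returns the fused map and the list of channel weights.
    """
    rows, cols = len(responses[0]), len(responses[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    weights = []
    for response, learn, det in zip(responses, learn_reliability, det_reliability):
        weight = learn * det  # channel reliability = learning x detection
        weights.append(weight)
        for r in range(rows):
            for c in range(cols):
                fused[r][c] += weight * response[r][c]
    return fused, weights


def locate_peak(fused):
    """Return the (row, col) position of the maximum of the fused response."""
    cells = (
        (fused[r][c], r, c)
        for r in range(len(fused))
        for c in range(len(fused[0]))
    )
    _, row, col = max(cells)
    return row, col
```

In this sketch a noisy channel with a low weight contributes little to the fused map, so its spurious peak cannot pull the estimated position away from the peak of the reliable channels.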

2.3 Siamese Verification Network

This paper uses the Siamese network (Comaniciu, Ramesh and Meer, 2000) to design the verifier V. The network consists of two convolutional neural network (CNN) branches that process the two inputs separately. The architecture of the CNN branches is borrowed from VGGNet (Perronnin, Sanchez and Mensink, 2010), and an additional region pooling layer is added. During detection, V needs to process multiple regions in the image and select the candidate most similar to the target as the output. The region pooling layer can process a group of regions in each frame simultaneously, thereby significantly reducing the amount of computation.

When the tracking result from T is input to the Siamese network, if its verification score is lower than the verification threshold, V considers that tracking has failed in that frame. In this case, V again uses the Siamese network to re-detect the target. Unlike the verification phase, detection needs to score multiple image patches in a local area and find the candidate with the highest score.

The detection area is a square region centered on the position of the tracking result in the verification frame. Its size is determined by w, h and β, where w and h are the width and height of the tracking target, and β is the target size factor.




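The target candidate set generated by the sliding window is denoted $\{c_i\}_{i=1}^{N}$, and the detection result $\hat{c}$ is obtained by:

$$\hat{c} = \arg\max_{c_i} v(x_{obj}, c_i), \quad i = 1, 2, \ldots, N \qquad (4)$$

where $v(x_{obj}, c_i)$ represents the verification score between the tracking target $x_{obj}$ and the candidate target $c_i$.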
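The candidate selection described above amounts to an arg-max over verification scores. The following Python sketch illustrates this under stated assumptions: the Siamese network is replaced by a stand-in cosine similarity on feature vectors (`cosine_score`), and both function names are hypothetical. It is meant only to show the selection step, not the authors' scoring function.

```python
import math


def cosine_score(a, b):
    """Stand-in for the verification score v(x_obj, c_i).

    Assumption: the real score comes from the Siamese network; cosine
    similarity on feature vectors is used here only for illustration.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def detect(target_feature, candidate_features, score_fn=cosine_score):
    """Return (index, score) of the candidate with the highest score.

    Requires at least one candidate.
    """
    scores = [score_fn(target_feature, c) for c in candidate_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

The returned score is what the validator would then compare against the detection threshold before replacing the tracking result.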

After obtaining the detection result, we decide whether to use it in place of the tracking result based on its verification score, as shown in Figure 3.

If the detection result is unreliable (its verification score is less than the detection threshold), we do not change the tracking result of the tracker. Instead, the algorithm reduces the verification interval V and enlarges the local search area, and repeats the above process until a reliable detection result is obtained. The verification interval and the size of the search area are then restored, the result from the validator is returned to the tracker T, and tracking continues from the revised target location. To keep the computation time low, the algorithm verifies every ten frames.

In this paper, the verification interval V is initially set to 10, and the verification and detection thresholds are set to 1.0 and 1.6, respectively. The parameter β is initialized to 1.5 and can be adaptively adjusted based on the detection result.
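The interplay between tracker and validator can be summarized as a small control routine. The Python sketch below is an assumption-laden illustration rather than the authors' implementation: `VerifierState`, `verify_step`, `verify_fn` and `detect_fn` are hypothetical names, the two callbacks stand in for the Siamese verification and local re-detection, and the reduced interval (1 frame) and the growth factor for β (1.5) are assumed values because the text does not state them. Only the interval of 10, the thresholds 1.0 and 1.6, and the initial β of 1.5 come from the text.

```python
class VerifierState:
    """Mutable verification settings; defaults are the values given in the text."""

    def __init__(self, interval=10, beta=1.5, tau_verify=1.0, tau_detect=1.6):
        self.default_interval = interval
        self.default_beta = beta
        self.interval = interval      # verification interval V, in frames
        self.beta = beta              # target size factor of the search area
        self.tau_verify = tau_verify  # verification threshold
        self.tau_detect = tau_detect  # detection threshold


def verify_step(state, track_box, verify_fn, detect_fn,
                reduced_interval=1, beta_growth=1.5):
    """Handle one verification request and return the box the tracker keeps.

    verify_fn(box) -> score               hypothetical Siamese verification
    detect_fn(box, beta) -> (box, score)  hypothetical local re-detection
    reduced_interval and beta_growth are assumed values, not from the paper.
    """
    if verify_fn(track_box) >= state.tau_verify:
        return track_box  # tracking result verified, nothing to change

    candidate, score = detect_fn(track_box, state.beta)
    if score >= state.tau_detect:
        # Reliable detection: restore the settings and correct the tracker.
        state.interval = state.default_interval
        state.beta = state.default_beta
        return candidate

    # Unreliable detection: keep the tracker result, verify sooner, search wider.
    state.interval = reduced_interval
    state.beta *= beta_growth
    return track_box
```

A caller would invoke `verify_step` every `state.interval` frames and re-initialize the correlation filter tracker whenever the returned box differs from the box it passed in.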

Figure 3: Tracking-Verification.

3 EXPERIMENTAL VERIFICATION AND RESULTS ANALYSIS

3.1 Experimental Configuration

To evaluate the tracking performance and efficiency of the proposed algorithm, the experiments in this paper were run on the OTB13 dataset using Matlab R2016a on a Core i7 3.6 GHz CPU under Windows 10. The test dataset covers attributes such as illumination changes, occlusion, fast motion, scale changes, motion blur, and in-plane rotation. In the experiments, 10 algorithms are compared (DAT, DCF_CA, DSST, SAMF, MEEM, KCF, LCT, Staple, STAPLE_CA and Our), where "Our" denotes the proposed algorithm.

3.2 Quantitative Analysis

Quantitative analysis is a commonly used standard for measuring tracking results. This section uses the average center location error (CLE) and overlap rate (OR) to evaluate the performance of the algorithm. CLE is the Euclidean distance between the target's true center position and the center position estimated by the tracking algorithm. The overlap rate is the ratio of the intersection of the tracked bounding box and the real bounding box to their union:

$$\mathrm{Score} = \frac{\mathrm{Area}(B_T \cap B_G)}{\mathrm{Area}(B_T \cup B_G)}$$

where $B_T$ represents the tracked bounding box of each frame, and $B_G$ represents the real bounding box of the corresponding frame.
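Both metrics are straightforward to compute per frame. The following Python sketch assumes axis-aligned boxes given as (x, y, width, height) tuples with (x, y) the top-left corner. This box format and the function names are assumptions made for illustration.

```python
import math


def overlap_rate(box_t, box_g):
    """Area(B_T intersect B_G) / Area(B_T union B_G) for (x, y, w, h) boxes."""
    xt, yt, wt, ht = box_t
    xg, yg, wg, hg = box_g
    inter_w = max(0.0, min(xt + wt, xg + wg) - max(xt, xg))
    inter_h = max(0.0, min(yt + ht, yg + hg) - max(yt, yg))
    inter = inter_w * inter_h
    union = wt * ht + wg * hg - inter
    return inter / union if union > 0 else 0.0


def center_location_error(box_t, box_g):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    cx_t, cy_t = box_t[0] + box_t[2] / 2.0, box_t[1] + box_t[3] / 2.0
    cx_g, cy_g = box_g[0] + box_g[2] / 2.0, box_g[1] + box_g[3] / 2.0
    return math.hypot(cx_t - cx_g, cy_t - cy_g)
```

For example, two 10 by 10 boxes shifted horizontally by 5 pixels intersect in an area of 50 and cover a union of 150, giving an overlap rate of 1/3 and a center location error of 5 pixels.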

Table 1 and Table 2 show the comparison results

of the center position average error and the average

value of the tracking bounding box overlap rate of

the tracking results of different algorithms in each

video sequence, respectively.

Table 1: Center position average error.


Table 2: Average value of bounding box overlap.

Note: The best and second best results are marked in

bold and black italics, respectively

In general, the smaller the average error and the larger the overlap rate, the more accurate the tracking result. According to the average center position error in Table 1 and the average bounding box overlap rate in Table 2, the proposed algorithm outperforms the benchmark algorithm STAPLE_CA on both measures. Robustness is improved especially under in-plane rotation, partial occlusion of the target, and background clutter.

3.3 Qualitative Analysis

This paper uses the OTB13 evaluation benchmark to perform three experiments on 51 video sequences: One-pass Evaluation (OPE), Temporal Robustness Evaluation (TRE), and Spatial Robustness Evaluation (SRE). All these evaluations represent the performance of the tracker in the form of a precision plot and a success plot, which show the percentage of frames in the video that the tracker tracks successfully at different thresholds.
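The two plots can be computed from per-frame overlap rates and center location errors. The Python sketch below is illustrative only: the function names are hypothetical, and the threshold grids (overlap thresholds from 0 to 1 in steps of 0.05, pixel thresholds from 0 to 50) and the use of the mean success rate as the AUC score are assumed conventions rather than details stated in this paper.

```python
def success_curve(overlaps, thresholds=None):
    """Fraction of frames whose overlap rate exceeds each threshold.

    Requires a non-empty list of per-frame overlap rates.
    """
    if thresholds is None:
        thresholds = [i / 20 for i in range(21)]  # 0, 0.05, ..., 1 (assumed grid)
    n = len(overlaps)
    return [sum(o > t for o in overlaps) / n for t in thresholds]


def precision_curve(errors, thresholds=None):
    """Fraction of frames whose center error is within each pixel threshold.

    Requires a non-empty list of per-frame center location errors.
    """
    if thresholds is None:
        thresholds = list(range(51))  # 0 to 50 pixels (assumed grid)
    n = len(errors)
    return [sum(e <= t for e in errors) / n for t in thresholds]


def area_under_curve(curve):
    """Approximate the AUC score as the mean of the sampled curve values."""
    return sum(curve) / len(curve)
```

Ranking trackers by the area under the success curve, rather than by the success rate at a single threshold, avoids favouring a tracker that happens to do well at one particular overlap level.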

The precision plot (a) and success plot (b) of the SRE obtained by testing on the 51 video sequences are shown in Fig. 4. The legend gives the ranking score of each tracker, and our algorithm ranks first. The legend shows that the proposed algorithm improves on the other nine algorithms. Compared with the benchmark algorithm STAPLE_CA, although the tracking speed is about half that of the benchmark, the average precision score and the average AUC score are improved by more than 10%.

(a) accuracy score map (b) success score map

Figure 4: OTB13 video sequence algorithm evaluation results. The legend illustrates the ranking scores for each tracker, and our algorithm ranks first (in the SRE experiments).

The 51 video sequences provided in OTB13 are annotated with 11 attributes, including illumination changes, occlusion, fast motion, scale changes, motion blur, and in-plane rotation. Fig. 5(a)-(e) show the success plots for some of these attributes.

The success plots in Fig. 5 show that the proposed algorithm, which computes a reliability weight for each feature channel of the correlation filter input and adds a dual-input Siamese verification network to the correlation filter, ranks first under fast motion, deformation, illumination variation, occlusion, and out-of-plane rotation. Compared with the other algorithms, the proposed algorithm therefore has a clear advantage under these conditions.

(a) fast motion (b) illumination variation

(c) deformation (d) occlusion

(e) out of plane rotation

Figure 5: OTB13 video sequence success rate evaluation.

The success plots of ten challenging attributes. The legend illustrates the ranking scores for each tracker. Our algorithm ranks first under fast motion, deformation, illumination variation, occlusion, and out-of-plane rotation.

4 CONCLUSIONS

In this paper, the channel reliability method is used to calculate the reliability weight of each feature channel, and the channel responses are weighted accordingly to make target localization more accurate. A deep-learning dual-input Siamese network is used to verify the results of the correlation filter and to re-detect the target when verification fails.

The benchmark evaluation on the OTB video sequences shows that the algorithm improves tracking performance under fast motion, illumination change, target deformation, occlusion, and out-of-plane rotation.

ACKNOWLEDGEMENTS

This research and the algorithm implementation in this paper are supported by the 2018 Aviation Funds of China.

REFERENCES

A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah, 2014. Visual tracking: An experimental survey. IEEE TPAMI.

S. Hare, S. Golodetz, A. Saffari, V. Vineet, M.-M. Cheng, S. L. Hicks, and P. H. Torr, 2016. Struck: Structured output tracking with kernels. IEEE TPAMI.

H. Fan and J. Xiang, 2017. Robust visual tracking with multitask joint dictionary learning. IEEE TCSVT.

H. Fan and H. Ling, 2017. SANet: Structure-aware network for visual tracking. CVPRW.

C. Ma, J. B. Huang, X. Yang, and M.H. Yang, 2015. Hierarchical convolutional features for visual tracking. ICCV.

H. Nam and B. Han, 2016. Learning multi-domain convolutional neural networks for visual tracking. CVPR.

D. S. Bolme, J. R. Beveridge, B. Draper, and Y. M. Lui, 2010. Visual object tracking using adaptive correlation filters. IEEE Conference on Computer Vision and Pattern Recognition, CVPR.

L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. S. Torr, 2016. Staple: Complementary learners for real-time tracking. The IEEE Conference on Computer Vision and Pattern Recognition.

T. Zhang, A. Bibi, and B. Ghanem, 2016. In defense of sparse tracking: Circulant sparse tracker. CVPR.

M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg, 2015. Learning spatially regularized correlation filters for visual tracking. IEEE International Conference on Computer Vision.

M. Mueller, N. Smith, and B. Ghanem, 2017. Context-Aware Correlation Filter Tracking. CVPR.

C. Ma, X. Yang, C. Zhang, and M.H. Yang, 2015. Long-term correlation tracking. CVPR.

D. Comaniciu, V. Ramesh, and P. Meer, 2000. Real-time tracking of non-rigid objects using mean shift. CVPR.

F. Perronnin, J. Sanchez, and T. Mensink, 2010. Improving the fisher kernel for large-scale image classification. ECCV.
