The Tracking Algorithm is beneficial both with RGB-D information and with RGB information alone. When using the GPU, the impact of the Tracking Algorithm is smaller, but it is beneficial in both situations.
It is important to emphasize that, when using the Tracking Algorithm together with the GPU, the depth information has no impact on the leather segmentation speed. The FPS results were calculated with the models performing a prediction on every frame; when a higher tracking speed is required, the prediction does not need to be performed on every frame, which further increases the FPS.
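As an illustration, the following Python sketch shows how predictions can be skipped on intermediate frames to raise the FPS. The model and tracker interfaces (predict, init, update) are hypothetical placeholders, not the exact implementation used in this work.

    PREDICT_EVERY = 3  # run the Deep Learning Model on one out of every three frames

    def track_stream(frames, model, tracker):
        # Run the expensive segmentation prediction only every N-th frame and
        # propagate the mask with the lightweight tracker in between.
        mask = None
        for i, frame in enumerate(frames):
            if i % PREDICT_EVERY == 0:
                mask = model.predict(frame)   # full segmentation prediction
                tracker.init(frame, mask)     # re-anchor the tracker on the new mask
            else:
                mask = tracker.update(frame)  # cheap propagation step
            yield mask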
4  CONCLUSIONS 
Using a U-Net architecture (Ronneberger et al., 2015) for the Deep Learning Model helped us to obtain better results than creating an architecture from scratch: besides not having to train for many epochs, the results are also satisfactory.
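For reference, the sketch below shows a small U-Net-style encoder-decoder in Keras with a four-channel RGB-D input. This is an illustrative reconstruction under assumed input sizes and filter counts, not the exact model trained in this work.

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    def conv_block(x, filters):
        # Two 3x3 convolutions, as in the original U-Net blocks
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

    def build_unet(input_shape=(128, 128, 4)):  # 4 channels: RGB + depth (assumed)
        inputs = layers.Input(shape=input_shape)
        # Encoder: downsampling path, keeping feature maps for skip connections
        c1 = conv_block(inputs, 16)
        p1 = layers.MaxPooling2D()(c1)
        c2 = conv_block(p1, 32)
        p2 = layers.MaxPooling2D()(c2)
        b = conv_block(p2, 64)  # bottleneck
        # Decoder: upsampling path, concatenating the encoder skip connections
        u2 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(b)
        c3 = conv_block(layers.Concatenate()([u2, c2]), 32)
        u1 = layers.Conv2DTranspose(16, 2, strides=2, padding="same")(c3)
        c4 = conv_block(layers.Concatenate()([u1, c1]), 16)
        # Per-pixel binary mask: leather vs. background
        outputs = layers.Conv2D(1, 1, activation="sigmoid")(c4)
        return Model(inputs, outputs)

    model = build_unet()
    model.compile(optimizer="adam", loss="binary_crossentropy")  # Adam (Kingma & Ba, 2017)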
In general, the depth information is important for the Deep Learning Model when the goal is to segment a deformable object, but it is necessary to pay attention to the dataset size. With a larger dataset the depth information has more impact; if the dataset is small or the images have low resolution, the depth information is not as useful for the Deep Learning Model.
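A common way to feed depth to such a model is to stack it as a fourth input channel; the sketch below assumes this fusion strategy, which is one option among others rather than a confirmed detail of this work.

    import numpy as np

    def make_rgbd(rgb, depth):
        """Stack depth as a fourth channel: rgb (H, W, 3) uint8, depth (H, W) float."""
        rgb = rgb.astype(np.float32) / 255.0
        d = depth.astype(np.float32)
        d = (d - d.min()) / (d.max() - d.min() + 1e-8)  # normalize depth to [0, 1]
        return np.concatenate([rgb, d[..., None]], axis=-1)  # shape (H, W, 4)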
The Tracking Algorithm proved useful for increasing the system's processing speed: it raises the FPS while still tracking the leather correctly. However, it is necessary to pay attention to situations in which the Bounding Box Model loses the object. To prevent this situation, it is possible to create an architecture that, in some situations, falls back on the Full Image Model to guarantee the correct location of the leather, as sketched below.
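One possible fallback policy is sketched here; the confidence signal, threshold, and both model interfaces are hypothetical placeholders for the idea described above.

    LOST_THRESHOLD = 0.3  # assumed confidence below which the object counts as lost

    def segment_frame(frame, bbox_model, full_image_model, prev_bbox):
        # Fast path: predict inside the previous bounding box region
        mask, confidence, bbox = bbox_model.predict(frame, prev_bbox)
        if confidence < LOST_THRESHOLD:
            # Recovery path: run the slower Full Image Model on the whole frame
            mask, bbox = full_image_model.predict(frame)
        return mask, bbox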
For future work, it is important to increase the dataset size and include different types of leather; this way, the depth information will have a greater impact on the models. In addition, more data processing steps can be added, but it is necessary to pay attention to the impact of each technique on the system, as it will have to work in real time.
ACKNOWLEDGEMENTS 
This work is supported by: European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project nº 42778; Funding Reference: POCI-01-0247-FEDER-042778].
REFERENCES 
Hu, Z., Han, T., Sun, P., Pan, J., & Manocha, D. (2019). 3-D Deformable Object Manipulation Using Deep Neural Networks. IEEE Robotics and Automation Letters, 4(4), 4255–4261. https://doi.org/10.1109/LRA.2019.2930476

Kingma, D. P., & Ba, J. (2017). Adam: A Method for Stochastic Optimization. ArXiv:1412.6980 [Cs]. http://arxiv.org/abs/1412.6980

Lai, K., Bo, L., Ren, X., & Fox, D. (2011). A large-scale hierarchical multi-view RGB-D object dataset. 2011 IEEE International Conference on Robotics and Automation, 1817–1824. https://doi.org/10.1109/ICRA.2011.5980382

Liu, Z., Shi, S., Duan, Q., Zhang, W., & Zhao, P. (2019). Salient object detection for RGB-D image by single stream recurrent convolution neural network. Neurocomputing, 363, 46–57. https://doi.org/10.1016/j.neucom.2019.07.012

Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., & Sorkine-Hornung, A. (2016). A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 724–732. https://doi.org/10.1109/CVPR.2016.85

Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., & Van Gool, L. (2018). The 2017 DAVIS Challenge on Video Object Segmentation. ArXiv:1704.00675 [Cs]. http://arxiv.org/abs/1704.00675

Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. ArXiv:1505.04597 [Cs]. http://arxiv.org/abs/1505.04597

Russell, B. C., Torralba, A., Murphy, K. P., & Freeman, W. T. (2008). LabelMe: A Database and Web-Based Tool for Image Annotation. International Journal of Computer Vision, 77(1–3), 157–173. https://doi.org/10.1007/s11263-007-0090-8

Song, S., & Xiao, J. (2013). Tracking Revisited Using RGBD Camera: Unified Benchmark and Baselines. 2013 IEEE International Conference on Computer Vision, 233–240. https://doi.org/10.1109/ICCV.2013.36

Voigtlaender, P., Krause, M., Osep, A., Luiten, J., Sekar, B. B. G., Geiger, A., & Leibe, B. (2019). MOTS: Multi-Object Tracking and Segmentation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7934–7943. https://doi.org/10.1109/CVPR.2019.00813