Figure 4. Structure comparison of FPN, PANet, and BiFPN.
Compared with PANet (Path Aggregation Network), BiFPN (weighted bidirectional feature pyramid network) uses a different node connection pattern. Its optimizations of the cross-scale connections include:
(1) Deleting the nodes in PANet that have only one input edge. Because a node with a single input performs no feature fusion, the intermediate nodes at P3 and P6 are eliminated, which yields a smaller, simplified bidirectional network.
(2) At the same scale, a skip connection is added between the input node and the output node, so that more features can be fused on the same feature layer at limited computational cost.
(3) Unlike PANet, which has only one top-down path and one bottom-up path, BiFPN (weighted bidirectional feature pyramid network) treats each bidirectional (top-down and bottom-up) path as one feature network layer and repeats this layer several times, thereby achieving higher-level feature fusion.
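The weighted fusion at a single BiFPN node can be sketched as follows. This is a minimal illustration, assuming the "fast normalized fusion" rule commonly associated with BiFPN (each input is scaled by a non-negative learnable weight divided by the sum of all weights); the text above does not state which fusion rule is used. The function name, the toy feature vectors, and the weight values are hypothetical.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-length feature vectors with non-negative, normalized weights.

    Assumed rule: out = sum_i (w_i / (eps + sum_j w_j)) * feature_i,
    where each w_i is first clipped to be >= 0 (ReLU).
    """
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per input feature")
    # Clip negative weights to zero so the normalization stays well defined.
    clipped = [max(0.0, w) for w in weights]
    total = sum(clipped) + eps
    length = len(features[0])
    fused = [0.0] * length
    for feat, w in zip(features, clipped):
        if len(feat) != length:
            raise ValueError("all features must have the same length")
        for i, value in enumerate(feat):
            fused[i] += (w / total) * value
    return fused


# Output node at one scale (toy values): it fuses the original input at this
# scale (the skip connection of optimization (2)), the intermediate top-down
# feature, and the feature arriving from the neighbouring level.
p_in = [1.0, 2.0, 3.0]
p_td = [2.0, 2.0, 2.0]
p_neighbour = [0.0, 4.0, 8.0]
p_out = fast_normalized_fusion([p_in, p_td, p_neighbour], [1.0, 1.0, 2.0])
print(p_out)
```

In a real network the inputs would be feature maps resized to a common resolution and the weights would be learned; plain lists are used here only to keep the sketch self-contained.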
The prediction head is improved on the basis of the Swin Transformer encoder. Swin Transformer replaces the sliding window with a shifted window: it performs self-attention within non-overlapping local windows of the feature layer and aggregates neighboring features through cross-window connections between consecutive layers.
In object detection, a Transformer has to process high-resolution images, and the complexity of its global self-attention grows with the square of the image size. To address this, a hierarchical representation based on multi-scale features is used. Swin Transformer merges adjacent smaller image patches in deeper layers to build hierarchical feature maps. Because the number of image patches in each window is fixed, the computational complexity is linear in the image size.
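The merging of adjacent patches can be illustrated with a short sketch. It assumes the usual 2 × 2 patch merging, in which the features of four neighboring patches are concatenated so that the resolution halves and the channel count becomes 4C; the linear projection that normally follows the concatenation is omitted. The function name and the concatenation order are illustrative choices, not taken from the text.

```python
def merge_patches(grid):
    """Concatenate the features of each 2 x 2 group of neighbouring patches.

    grid: list of rows; every entry is a list of C channel values.
    Returns a grid with half the height, half the width and 4*C channels.
    """
    h, w = len(grid), len(grid[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must both be even")
    merged = []
    for i in range(0, h, 2):
        row = []
        for j in range(0, w, 2):
            # Order (illustrative): top-left, bottom-left, top-right, bottom-right.
            row.append(grid[i][j] + grid[i + 1][j]
                       + grid[i][j + 1] + grid[i + 1][j + 1])
        merged.append(row)
    return merged


# Toy 4 x 4 grid of patches with a single channel each.
patches = [[[r * 4 + c] for c in range(4)] for r in range(4)]
coarse = merge_patches(patches)
print(len(coarse), len(coarse[0]), len(coarse[0][0]))
```

Applying the function repeatedly gives the coarser levels of the hierarchy, each with a quarter of the patches of the level before it.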
This method borrows the hierarchical construction common in convolutional neural networks and the notion of local image regions, and computes self-attention within non-overlapping image windows. In a convolutional neural network (CNN), a convolution is applied to each window to obtain that window's features, whereas Swin Transformer performs a self-attention computation on each window to obtain an updated window. The updated windows are then merged, and the merged windows are processed and merged again at the next level.
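The window mechanism described above can be sketched in a few lines. The example assumes a grid whose height and width are divisible by the window size m, and it models the shifted window as a cyclic roll of the grid before partitioning, which is one common way to implement it; both function names are hypothetical, and the attention computation inside each window is left out.

```python
def window_partition(grid, m):
    """Split an h x w grid into non-overlapping m x m windows (row-major order)."""
    h, w = len(grid), len(grid[0])
    if h % m or w % m:
        raise ValueError("height and width must be divisible by the window size")
    windows = []
    for top in range(0, h, m):
        for left in range(0, w, m):
            windows.append([row[left:left + m] for row in grid[top:top + m]])
    return windows


def cyclic_shift(grid, s):
    """Roll the grid s positions up and s positions to the left (wrapping around)."""
    rolled_rows = grid[s:] + grid[:s]
    return [row[s:] + row[:s] for row in rolled_rows]


# Toy 4 x 4 grid of patch indices, window size 2.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
regular = window_partition(grid, 2)                    # windows of one layer
shifted = window_partition(cyclic_shift(grid, 1), 2)   # windows of the next layer
print(regular[0], shifted[0])
```

Alternating the regular and the shifted partition lets patches that sit in different windows in one layer share a window in the next, which is how information crosses window borders.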
In this model, the standard multi-head self-attention (MSA) module is replaced by a shifted-window-based module. A Swin Transformer block consists of a shifted-window-based MSA module followed in series by a two-layer multilayer perceptron (MLP).
Unlike the Swin Transformer framework, the conventional Transformer framework performs global self-attention over the whole image, which consumes a large amount of computing resources. Swin Transformer instead divides the image into non-overlapping windows of M × M patches. For an image of h × w patches with C channels, the computational complexities of global MSA and window-based W-MSA are:
$$\Omega_{\mathrm{MSA}} = 4hwC^{2} + 2(hw)^{2}C \tag{4}$$
$$\Omega_{\mathrm{W\text{-}MSA}} = 4hwC^{2} + 2M^{2}hwC \tag{5}$$
From Eqs. (4) and (5), the computational complexity of MSA is quadratic in the number of image patches hw, whereas the complexity of window-based W-MSA is linear in the number of image patches when the window size M is fixed.
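The difference between the two growth rates can be checked numerically. The sketch below assumes the standard forms of the two expressions, 4hwC² + 2(hw)²C for global MSA and 4hwC² + 2M²hwC for W-MSA; the function names and the values C = 96 and M = 7 are illustrative choices, not settings reported in this paper.

```python
def msa_complexity(h, w, c):
    """Global MSA cost: 4*h*w*C^2 + 2*(h*w)^2*C."""
    n = h * w  # number of image patches
    return 4 * n * c**2 + 2 * n**2 * c


def wmsa_complexity(h, w, c, m):
    """Window-based MSA cost: 4*h*w*C^2 + 2*M^2*h*w*C."""
    n = h * w  # number of image patches
    return 4 * n * c**2 + 2 * m**2 * n * c


# Illustrative settings: C = 96 channels, windows of 7 x 7 patches.
for side in (56, 112, 224):
    global_cost = msa_complexity(side, side, 96)
    window_cost = wmsa_complexity(side, side, 96, 7)
    print(f"{side} x {side} patches: MSA = {global_cost:.3e}, "
          f"W-MSA = {window_cost:.3e}, ratio = {global_cost / window_cost:.1f}")
```

Doubling the side length multiplies the W-MSA cost by exactly four, in step with the number of patches, while the attention term of global MSA grows sixteenfold.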
3  SUMMARY 
With the wide application of deep learning and machine vision, transmission line inspection is shifting from traditional manual inspection to intelligent inspection. This paper studies target detection and fault identification in transmission line inspection, with a focus on small targets. On this basis, the detection model is improved using the Transformer, the Swin Transformer, the weighted bidirectional feature pyramid network (BiFPN), and a convolutional attention module. In addition, defect samples are augmented using saliency maps, and an enhanced feature pyramid and deep semantic embedding are adopted.
ACKNOWLEDGEMENTS 
This  work  was  supported  by  the  National  Key 
Research and Development Program of China under 
Grant 2020AAA0107500.