Table 5: Accuracy versus complexity of the proposed network
on the COCO validation subset.

                                AP, %    GFLOPs
Dilated MobileNet v1            n/a      3.7
conv4_3                         n/a      0.3
conv4_4                         n/a      0.3
Initial stage                   35       1.3
Refinement stage 1              41.4     3.4
2-stage network, retrained
with all refinement stages      42.8     9
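
As a sanity check on Table 5, the per-component complexities can be
summed and compared with the 9 GFLOPs reported for the full 2-stage
network. The sketch below is illustrative only. It assumes that the
2-stage network consists of exactly the five components listed in the
table, and it reads the accuracy versus complexity ratio as AP divided
by GFLOPs, which is an interpretation rather than a definition given in
the text.

```python
# Per-component complexity in GFLOPs, copied from Table 5.
component_gflops = {
    "Dilated MobileNet v1": 3.7,
    "conv4_3": 0.3,
    "conv4_4": 0.3,
    "Initial stage": 1.3,
    "Refinement stage 1": 3.4,
}

# Assumption: the 2-stage network is the sum of exactly these parts.
total_gflops = sum(component_gflops.values())

# AP of the retrained 2-stage network, copied from Table 5.
two_stage_ap = 42.8

# Assumption: "accuracy versus complexity" is read as AP per GFLOP.
ap_per_gflop = two_stage_ap / total_gflops

print(f"Total complexity: {total_gflops:.1f} GFLOPs")
print(f"AP per GFLOP:     {ap_per_gflop:.2f}")
```

Under these assumptions the components add up to the reported total,
which suggests that the last row of the table is the sum of the rows
above it.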
Table 6: Final inference fps for a video with more than 20
estimated poses. Numbers in parentheses are network inference
and post-processing fps.

            NUC                 CPU
Baseline    1.17 (3.92/1.66)    0.95 (2.47/1.54)
Proposed    28 (33/160)         26 (33/125)
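
The relation between the three numbers in each cell of Table 6 can be
illustrated with a short calculation. The sketch below assumes that
network inference and post-processing run one after another on every
frame, so their per-frame times add up. This sequential model is an
assumption made for illustration; the helper name combined_fps is
hypothetical.

```python
def combined_fps(network_fps: float, postprocess_fps: float) -> float:
    """Throughput of two stages that run sequentially on each frame."""
    seconds_per_frame = 1.0 / network_fps + 1.0 / postprocess_fps
    return 1.0 / seconds_per_frame


# (network fps, post-processing fps) pairs copied from Table 6.
measurements = {
    ("Baseline", "NUC"): (3.92, 1.66),
    ("Baseline", "CPU"): (2.47, 1.54),
    ("Proposed", "NUC"): (33.0, 160.0),
    ("Proposed", "CPU"): (33.0, 125.0),
}

for (model, device), (net, post) in measurements.items():
    print(f"{model:8s} {device}: {combined_fps(net, post):.2f} fps")
```

Under this assumption the baseline rows reproduce the reported 1.17 and
0.95 fps. For the proposed network the model gives roughly 27.4 and
26.1 fps, close to the reported 28 and 26; the small gap may come from
rounding of the published per-stage numbers.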
4.2 Fast Post-processing
We profiled the code, removed extra memory allocations,
and parallelized keypoint extraction with OpenCV’s
routine. This made the code significantly faster, and
the last remaining bottleneck was resizing the feature
maps to the input image size.
We tried skipping the resize step and performing
grouping directly on the network output, but accuracy
dropped significantly. Thus the feature map upsampling
step cannot be avoided, but it is not necessary to
upsample to the input image size. Our experiments showed
that with an upsample factor of 8 the accuracy is the
same as when resizing to the input image size. We used
an upsample factor of 4 for demo purposes.
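
The effect of the upsample factor on feature map size can be sketched
as follows. The example assumes a network output stride of 8, inferred
from the statement that an upsample factor of 8 is equivalent to
resizing to the input image size, and uses the 456x256 input size from
Section 4.3. The function names are hypothetical and the code only
computes sizes and coordinates; it does not perform the resize itself.

```python
NETWORK_STRIDE = 8  # assumption: output maps are 1/8 of the input size


def upsampled_size(input_w, input_h, stride, upsample_factor):
    """Size of the feature maps after upsampling the network output."""
    out_w, out_h = input_w // stride, input_h // stride
    return out_w * upsample_factor, out_h * upsample_factor


def to_input_coords(x, y, stride, upsample_factor):
    """Map a keypoint found on the upsampled maps back to input pixels."""
    scale = stride / upsample_factor
    return x * scale, y * scale


for factor in (1, 4, 8):
    w, h = upsampled_size(456, 256, NETWORK_STRIDE, factor)
    print(f"upsample factor {factor}: {w}x{h} ({w * h} values per map)")

# A keypoint at (100, 50) on the factor-4 maps, in input image pixels.
print(to_input_coords(100, 50, NETWORK_STRIDE, upsample_factor=4))
```

With a factor of 4 each map has a quarter of the values of a
full-resolution map, which is why the smaller factor reduces resize
cost, at the price of coarser keypoint localization.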
4.3 Inference
For network inference we use the Intel® OpenVINO™
Toolkit R4 (Intel, 2018), which provides optimized
inference across different hardware, such as CPU, GPU,
and FPGA. Final performance numbers are shown in
Table 6; they were measured on a challenging video with
more than 20 estimated poses.
We used two devices: an Intel NUC6i7KYB, which
performed inference on its integrated Iris Pro
Graphics P580 GPU in half-precision floating-point
format (FP16), and a 6-core Core i7-6850K CPU, which
performed inference in single-precision floating-point
format (FP32). The network input size was set to
456x256, which is similar to 368x368 but has a 16:9
aspect ratio, suitable for processing video streams.
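
One way to arrive at a 456x256 input is to fix the height at 256 and
round the 16:9 width up to a multiple of 8. The sketch below shows that
arithmetic. The rounding rule and the helper name input_width_for are
assumptions made for illustration; the text states only the resulting
size.

```python
import math


def input_width_for(height, aspect_w=16, aspect_h=9, multiple=8):
    """Width for the given height and aspect ratio, rounded up to a multiple."""
    exact_width = height * aspect_w / aspect_h
    return math.ceil(exact_width / multiple) * multiple


width = input_width_for(256)
print(f"Input size: {width}x256")

# Compare pixel counts of the 16:9 input and the square 368x368 input.
wide_pixels = width * 256
square_pixels = 368 * 368
print(f"{wide_pixels} vs {square_pixels} pixels "
      f"(ratio {wide_pixels / square_pixels:.2f})")
```

The exact 16:9 width for a height of 256 is about 455.1, so rounding up
to the next multiple of 8 gives 456.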
5 CONCLUSION
In this work, we addressed the problem of designing a
human pose estimation network suitable for real-time
performance on edge devices. We proposed a solution
based on the OpenPose method, with a heavily optimized
network design and post-processing code. The ratio of
accuracy to network complexity was increased by more
than 6.5 times through the use of a dilated MobileNet v1
feature extractor with depthwise separable convolutions
and a lightweight refinement stage design with residual
connections. The network can be downloaded as part of
the OpenVINO Toolkit under the name
human-pose-estimation-0001. The network description is
available in the Open Model Zoo repository.
The full solution runs in real time on an ordinary CPU
as well as on a NUC mini PC, and closely matches the
accuracy of the baseline 2-stage network. Techniques
such as quantization, pruning, and knowledge
distillation may further improve performance and
accuracy. We leave them for future research.
REFERENCES
Bradski, G. (2000). The OpenCV Library. Dr. Dobb’s
Journal of Software Tools.
Cao, Z., Simon, T., Wei, S., and Sheikh, Y. (2017).
Real-time multi-person 2d pose estimation using part
affinity fields. In CVPR.
Fang, H.-S., Xie, S., Tai, Y.-W., and Lu, C. (2017). RMPE:
Regional multi-person pose estimation. In ICCV.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017).
Mask R-CNN. In ICCV.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep
Residual Learning for Image Recognition. In CVPR.
Hong, S., Roh, B., Kim, K.-H., Cheon, Y., and Park, M.
(2016). PVANet: Lightweight Deep Neural Networks
for Real-time Object Detection. In arXiv preprint
arXiv:1611.08588.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D.,
Wang, W., Weyand, T., Andreetto, M., and Adam,
H. (2017). Mobilenets: Efficient convolutional
neural networks for mobile vision applications. In arXiv
preprint arXiv:1704.04861.
Intel (2018). OpenVINO Toolkit. In
https://software.intel.com/en-us/openvino-toolkit.
Jindal, A. et al. (2018). Enabling full body ar with
mask r-cnn2go. In
https://research.fb.com/enabling-full-body-ar-with-mask-r-cnn2go/.
Kim, I. (2018). tf-pose-estimation.
https://github.com/ildoonet/tf-pose-estimation.
Kocabas, M., Karagoz, S., and Akbas, E. (2018).
MultiPoseNet: Fast multi-person pose estimation using
pose residual network. In ECCV.
Real-time 2D Multi-Person Pose Estimation on CPU: Lightweight OpenPose