Figure 1: An example of a complex technical drawing with
several visual objects and text patterns.
ings, since only those objects and text patterns carry
the information of concern and they could be used as
search indexes for retrieving the drawings later. An
example of the data is shown in Figure 1. Given that
requirement, the outputs of the system are the detected
objects and text patterns, i.e. their locations in the
drawings and their types/classes. For the text patterns,
the system not only locates them but also transcribes
them into text, which is a typical OCR
process. Our models are trained and evaluated using a
real dataset containing nearly five thousand technical
drawings that were labeled manually by human op-
erators. On this dataset, the system shows promising
results and the capability to reduce the human label-
ing effort to a great extent. We believe the system’s
performance is scalable to a much larger dataset, and
hence, it can be applied directly in the process of dig-
itizing scanned documents.
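The system output described above (a location and a type/class for every detected item, plus a transcription for text patterns) can be pictured as a small record type. The following is only a minimal sketch: the `Detection` class, its field names, the `search_keys` helper and the example labels are hypothetical and are not taken from the system itself.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Detection:
    """One detected item in a drawing (all field names are illustrative)."""
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    label: str                      # object or text-pattern class
    text: Optional[str] = None      # transcription; None for non-text objects


def search_keys(detections: List[Detection]) -> List[str]:
    """Collect index keys: the transcription for text patterns, else the class label."""
    keys = []
    for d in detections:
        keys.append(d.text if d.text is not None else d.label)
    return keys


# Made-up example: one visual object and one transcribed text pattern.
detections = [
    Detection(box=(120, 40, 180, 95), label="valve"),
    Detection(box=(300, 510, 420, 540), label="drawing_number", text="A-1024"),
]
print(search_keys(detections))
```

Keeping the transcription optional lets visual objects and text patterns share one record type, which is one possible way to feed both into a single search index.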
2 RELATED WORK
Commercial OCR products first became available in the
United States in the 1950s, and OCR systems have been
researched and developed ever since (Fujisawa, 2007).
In the 1960s, IBM introduced several models of optical
readers for businesses, one of which could read 200
printed fonts. In the 1970s, commercial OCR products
flourished in Japan, most notably through a national
project that included Kanji handwriting recognition.
The first product able to recognize handwriting with
touching characters was introduced in 1983. By the
1990s, with the development of hardware, operating
systems and programming languages, OCR products running
on computers had become very popular in the market.
Nowadays, a smart mobile device alone is enough to
perform OCR on documents with high accuracy. For
large-scale applications, many cloud services such as
Google Vision, AWS Textract and Azure OCR also offer
text detection as one of their computer vision
capabilities.
Tesseract OCR, the best-known open-source system, was
developed by HP between 1984 and 1994. It first
appeared in the “UNLV Annual Test of OCR Accuracy”
contest in 1995 (Rice et al., 1995), where it surpassed
all other commercial OCR systems at the time. Since
2006, its development has been sponsored by Google
(Smith, 2007). Because it is open-source, its
architecture is public, and developers can use
Tesseract OCR as an engine to build their own
recognition systems. The accuracy of Tesseract OCR
ranges from 90% to 99% depending on the language being
recognized. However, it performs well only on clean
input images with pre-defined fonts; noisy images and
custom fonts or layouts can render the system unusable.
Beyond commercial off-the-shelf systems, OCR remains
an active field of research, especially for
non-traditional problems (text extraction from images
with complex backgrounds and unstructured layouts),
with numerous novel methods proposed every year
(Zhu et al., 2016; Long et al., 2018). Currently, the prominent
trend for solving non-traditional OCR is to combine
a text detection module with a text recognition module
(Jaderberg et al., 2016; Liu et al., 2018; Borisyuk
et al., 2018; Zhan et al., 2019). The system proposed
in (Jaderberg et al., 2016) is based on a region
proposal mechanism for detection and deep convolutional
neural networks for recognition. However, their
recognition model is word-based rather than
character-based like ours. Liu et al. (2018) introduced
Fast Oriented Text Spotting (FOTS), a unified
end-to-end trainable network that combines detection
and recognition modules and shares computation and
visual information between the two complementary
tasks. The Rosetta system (Borisyuk et al., 2018) is
another deployed and scalable OCR system, designed to
process the images uploaded daily at Facebook scale. It
is also a two-stage process, in which the Faster-RCNN
model (Ren et al., 2015) is used for text detection and
a sequence-to-sequence model with CTC loss (Graves
et al., 2006) is used for text recognition.
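To make the CTC-based recognition stage mentioned above more concrete: a recognizer trained with CTC loss emits, for every frame of the input, scores over the characters plus a special blank symbol, and a transcription can be read off by taking the best label per frame, merging consecutive repeats and dropping blanks. The sketch below shows this greedy (best-path) decoding in plain Python. It is not the Rosetta implementation; the function name, the blank index, the alphabet and the scores are all assumptions made for illustration.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], alphabet: str) -> str:
    """Best-path decoding: argmax per frame, merge repeats, drop blanks."""
    # Pick the highest-scoring label index for every frame.
    best_path = [
        max(range(len(scores)), key=lambda i: scores[i]) for scores in frame_scores
    ]
    chars: List[str] = []
    previous = BLANK
    for index in best_path:
        # Emit a character only when the label is not blank and differs
        # from the label of the previous frame.
        if index != BLANK and index != previous:
            chars.append(alphabet[index - 1])  # index 0 is reserved for the blank
        previous = index
    return "".join(chars)


# Made-up scores for five frames over the labels (blank, 'a', 'b').
frame_scores = [
    [0.1, 0.8, 0.1],    # 'a'
    [0.2, 0.7, 0.1],    # 'a' again (merged with the previous frame)
    [0.9, 0.05, 0.05],  # blank (separates the two 'a's)
    [0.1, 0.8, 0.1],    # 'a'
    [0.1, 0.2, 0.7],    # 'b'
]
print(ctc_greedy_decode(frame_scores, alphabet="ab"))  # prints "aab"
```

The blank frame is what allows a doubled character to survive the merging step; without it, the two runs of 'a' would collapse into one.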
3 PROPOSED METHOD
As mentioned previously, a typical OCR system con-
sists of an object detection module and a text recog-
nition module. Object detection is the process of lo-