To be clear, this is not an ideal validation. Although we
have used cross-validation throughout, a sizeable held-out
test set would be more convincing; obtaining one is
future work.
The pipeline was run on the 116 work orders, comprising
1106 images, of which 117 are FAA documents (one
work order contained a duplicate FAA form). All 117
FAA documents were correctly oriented. The document
type classifier predicted 121 documents to be FAA, of
which four were false positives (non-FAA images
classified as FAA). The text classifier predicted three of
these four as other and one as tested (i.e., it mitigated
three of the four false positives). The text classifier
correctly predicted the status of 116 of the 117 FAA
documents; the one error was a document predicted as
tested whose actual status was repaired.
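The rates implied by these counts can be worked out directly. The following is a minimal sketch of that arithmetic using only the figures reported above; the variable names are ours, and the assumption that every true FAA document is among the 121 predicted is inferred from 121 - 4 = 117 rather than stated explicitly.

```python
# Counts reported for the end-to-end run.
faa_actual = 117        # FAA documents among the 1106 images
faa_predicted = 121     # images the document type classifier labelled FAA
false_positives = 4     # non-FAA images labelled FAA

# Assumption: the remaining predictions are all true FAA documents.
true_positives = faa_predicted - false_positives
false_negatives = faa_actual - true_positives

precision = true_positives / faa_predicted
recall = true_positives / faa_actual

# The text classifier labelled 3 of the 4 false positives as "other".
residual_false_positives = false_positives - 3

# Status prediction: 116 of the 117 FAA documents were correct.
status_accuracy = 116 / faa_actual

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"status accuracy={status_accuracy:.3f}")
print(f"false positives left after text classifier={residual_false_positives}")
```

Under that assumption, the document type classifier misses no FAA documents, and a single false positive survives the text classification step.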
9 CONCLUSIONS
In this paper we proposed and demonstrated an approach
to document analysis that combines supervised machine
learning models for orientation classification and
document type classification. The form-style documents
were first partitioned to produce symbols, from which
features were generated. These features were then used
to train the machine learning algorithms. Once an image
has been oriented and the document identified, the
document is sent to an OCR engine to produce a text file,
against which a simple match is made to determine the
desired form’s status. We also applied a feature selection
approach to the document type classifier to produce a
parsimonious model and showed that it was as accurate as
the full model. Finally, end-to-end results were presented
to demonstrate the effectiveness of our approach.
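The final matching step can be illustrated with a short sketch. This is not the matching rule used in the pipeline, which is not spelled out in this section; the keyword lists and the function name `match_status` are hypothetical, and only the status labels (tested, repaired, other) come from the text.

```python
import re

# Hypothetical keyword lists; the actual wording on the forms is not given here.
STATUS_KEYWORDS = {
    "repaired": ("repaired",),
    "tested": ("tested",),
}


def match_status(ocr_text: str) -> str:
    """Return the first status whose keyword occurs as a whole word, else 'other'."""
    normalized = ocr_text.lower()
    for status, keywords in STATUS_KEYWORDS.items():
        for keyword in keywords:
            # Whole-word match so that, e.g., "untested" does not count as "tested".
            if re.search(rf"\b{re.escape(keyword)}\b", normalized):
                return status
    return "other"


if __name__ == "__main__":
    print(match_status("Status: REPAIRED and returned to service"))
    print(match_status("Unit was tested on the bench"))
    print(match_status("Invoice 123"))
```

Falling back to "other" when no keyword is found is what would let such a step absorb non-FAA images that an upstream classifier passed through by mistake.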
ACKNOWLEDGEMENTS
This work was supported in part by the University
of Montevallo Contract #19-0501-001. The authors
greatly appreciate the support of the airline company
employees involved in the project. Without their
efforts this research could not have been conducted.
ICPRAM 2020 - 9th International Conference on Pattern Recognition Applications and Methods