Table 4: Test Set Evaluation Results.
Levels    True Pos.  False Pos.  True Neg.  False Neg.  Precision  Recall  F-Measure
Level 1          28           0        282           0          1       1          1
Level 2          53           6        385           0       0.89       1       0.94
Level 3          26           8         82           0       0.76       1       0.86
Overall         107          14        749           0       0.88       1       0.93
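
The precision, recall, and F-measure columns of Table 4 follow from the four counts through the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F-measure = harmonic mean of the two). The following is a minimal Python sketch that recomputes them from the counts in the table. The function and variable names are illustrative assumptions and are not part of the presented system. The published figures appear to correspond to the computed values truncated, rather than rounded, to two decimals, so the sketch includes a small truncation helper for comparison.

```python
from math import floor

# Confusion-matrix counts copied from Table 4: (TP, FP, TN, FN).
COUNTS = {
    "Level 1": (28, 0, 282, 0),
    "Level 2": (53, 6, 385, 0),
    "Level 3": (26, 8, 82, 0),
    "Overall": (107, 14, 749, 0),
}


def precision(tp, fp):
    """Fraction of extracted items that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of relevant items that were extracted."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def truncate(x, digits=2):
    """Cut x off after `digits` decimals without rounding up."""
    scale = 10 ** digits
    return floor(x * scale) / scale


def evaluate(counts):
    """Map each level to its (precision, recall, F-measure) triple."""
    result = {}
    for level, (tp, fp, _tn, fn) in counts.items():
        p = precision(tp, fp)
        r = recall(tp, fn)
        result[level] = (p, r, f_measure(p, r))
    return result


METRICS = evaluate(COUNTS)

if __name__ == "__main__":
    for level, (p, r, f) in METRICS.items():
        print(f"{level}: precision={p:.4f} recall={r:.4f} f_measure={f:.4f}")
```

Since no false negatives were recorded at any level, recall is 1 throughout and the F-measure is driven entirely by the false positives at Levels 2 and 3.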
porated with the system, it enables the system to be
generic enough to handle documents from any other
domain. The presented system is fully autonomous
and processes documents without any human
feedback. It produces output efficiently irrespective
of the size of the document. It is also robust, as it
can process documents from many different brands
with no standardization of terminology or layout.
The reliability of the output is reported alongside the
output files: the generated report gives each table a
separate confidence score with reasoning.
The presented system is not tied to any specific
use case; it can be applied to documents from any
other domain that pose a similar data table extraction
problem. It could be tested on documents from
another domain by simply replacing the current
ontology with the ontology of the desired domain.
ICAART 2018 - 10th International Conference on Agents and Artificial Intelligence