cognitive computing pipeline uses text and
knowledge features to drive a process based on semi-
supervised learning to produce material synthesis
recipes.
This paper presents a rule-based algorithm that
extracts section titles beyond the header information,
groups lines of text into their corresponding
paragraphs, and places those paragraphs in their
correct sequential order. To measure the effectiveness of our
algorithm in section classification and ordering, we
developed a user interface to manually extract section
titles and their content from 300 documents to create
our ground truth. To create a tool that transfers
across different domain topics, we compared
effectiveness measures between the domain topic used
to develop MATESC, material synthesis, and other
random domains. Half of the documents were judged
relevant to material synthesis by field professionals,
and the other half were randomly crawled from the
web using the open-source web-crawling framework
Scrapy (Myers et al., 2015). The length of the longest
common subsequence (LLCS) (Paterson et al., 1994)
and the length of the longest common substring
(LLCSTR) (Crochemore et al., 2015) were measured
to determine the similarity, precision, recall, and
accuracy between the manually extracted ground truth
and the sections extracted by MATESC. To measure
section ordering, we use different values of k,
comparing sections only if they are k indices apart.
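For illustration, the following is a minimal sketch of how LLCS and LLCSTR can be computed with standard dynamic programming; the token-level granularity and the example strings are assumptions made for this sketch, not necessarily the normalization used in MATESC or in our evaluation.

# Minimal sketch: LLCS and LLCSTR over token sequences via dynamic programming.
def llcs(a, b):
    # Length of the longest common subsequence of sequences a and b.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def llcstr(a, b):
    # Length of the longest common contiguous substring of a and b.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best

ground_truth = "the precursor was dried at 80 C for 12 h".split()
extracted = "precursor dried at 80 C for 12 h in vacuum".split()
print(llcs(ground_truth, extracted), llcstr(ground_truth, extracted))  # 8 7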
1.1 Background
QA tasks rely heavily on the amount of information
publicly available on the World Wide Web (Jurafsky et
al., 2009). With the tremendous growth in the number
of publicly available scientific documents, the format
disparity across publishers and domain topics
increases. Although scientific papers tend to follow a
general structural guideline, the many format
differences make it challenging to handle this
disparity in a generalized tool. In some documents,
section subtitles are not included, making it difficult
for natural language processing tools to parse header
data.
To address format disparity challenges, metadata
extraction tools have been developed to extract
specific entities, particularly header fields (e.g., title,
authors, keywords, abstract) and bibliographic data.
Apache PDFBox (Apache, 2018), PDFLib TET
(PDFLib, 2018), and Poppler (Noonburg, 2018)
extract text and attributes from PDF documents. Open-
source header and bibliographic data parsers include
GROBID (Lopez, 2009), ParsCit (Prasad et al., 2018),
and SVMHeaderParse (Han et al., 2003). For table and
figure extraction from general academic publications,
PDFFigures (Clark et al., 2016) and Tabula (Aristaran
et al., 2013) have been developed. PDFMEF (Wu et
al., 2015) encapsulates these various open-source tools
into a single customizable and scalable framework that
combines the best capabilities of each. Extraction
of first-page header information is useful for
clustering documents and identifying duplicates,
where the combination of authors and title is assumed
to be unique to each document. For structured recipe
extraction, sections beyond the first page and
bibliographic data are necessary to extract step-like
recipe entities. GROBID has been shown to have
advantages over other methods in first-page and
bibliographic sections (Lipinski et al., 2013). Other
sections, e.g., materials, methodology, results and
discussion, are not fully extracted or classified by the
mentioned tools and are often in the wrong order. For
recipe extraction, sequential order is essential for the
accurate extraction of synthesis steps. In this paper,
we compare the accuracy, precision, and recall (based
on edit distance) of three products of information
extraction: (1) manually extracted ground truth (text
selected and ordered by manual annotation); (2) the
section output of GROBID (Lopez, 2009); and (3) the
output of MATESC.
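As a self-contained illustration of how a similarity measure of this kind can be turned into precision and recall over tokens, the sketch below uses Python's difflib.SequenceMatcher as a stand-in for the LLCS/edit-distance computation; the exact definitions used in our evaluation may differ.

# Hedged sketch: token-level precision and recall from matched blocks.
from difflib import SequenceMatcher

def token_precision_recall(ground_truth, extracted):
    gt, ex = ground_truth.split(), extracted.split()
    matched = sum(block.size
                  for block in SequenceMatcher(None, gt, ex).get_matching_blocks())
    precision = matched / len(ex) if ex else 0.0
    recall = matched / len(gt) if gt else 0.0
    return precision, recall

p, r = token_precision_recall(
    "the precursor was dried at 80 C for 12 h",
    "precursor dried at 80 C for 12 h in vacuum",
)
print(round(p, 2), round(r, 2))  # 0.8 0.8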
1.2 Applications
MATESC is the metadata-aware payload extraction
component of a broader project whose long-term goal
is to acquire a corpus of scientific and technical
documents that are restricted to a specific domain and
extract free-text recipes consisting of procedural steps
and entities organized in a sequential form. For our
specific application domain of nanomaterials
synthesis, the documents of interest are academic
papers collected from open-access web sites using a
custom crawler and scraper ensemble. The initial
seeds for the document crawl were provided by the
subject matter expert. The papers to be analyzed by
MATESC are PDF files, from which structured
information such as titles, author lists, keyword lists,
sets of figures with captions, and specific named
sections such as the introduction, background and
related work, experimental method, result data, and
summary and conclusions, is captured. The next
stage of analysis is to extract recipes, which are
sequences of steps that specify materials needed and
methods utilized to produce a nanomaterial. These are
similar in structure and length to cooking recipes.
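As a hypothetical sketch, a recipe of this kind might be represented downstream as an ordered list of steps, each naming an action, the materials involved, and the processing conditions; the class and field names below are illustrative assumptions, not structures defined in this paper.

# Hypothetical recipe representation (illustrative names, not from the paper).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    action: str                                  # e.g. "dissolve", "heat", "stir"
    materials: List[str] = field(default_factory=list)
    conditions: str = ""                         # e.g. "80 C, 2 h"

@dataclass
class Recipe:
    target_material: str
    steps: List[Step] = field(default_factory=list)

recipe = Recipe(
    target_material="ZnO nanoparticles",
    steps=[
        Step("dissolve", materials=["zinc acetate", "ethanol"]),
        Step("heat", conditions="80 C, 2 h"),
    ],
)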
Steps of a recipe may consist of basic unit operations