Figure 1: System overview. – implies future components.
chapters. Then, these chapters, which are in PDF
format, were converted to text. Finally, the
metadata were generated from these texts.
2.1.1 Partition of Books into Chapters
Our documents are course books in PDF format, and
the number of chapters varies from book to book.
To enable metadata extraction, we divided the
course books into smaller PDFs, one per chapter.
About 230 course books had been published in PDF
format. Some of them were scans of the original
hard copies, so we had to exclude them. We
partitioned the remaining 205 course books into
smaller PDFs and obtained 2654 PDFs, one for each
chapter of the 205 course books.
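The partitioning step can be illustrated with a minimal sketch. It assumes that the first page of each chapter is already known (the text does not say how chapter boundaries were located) and only computes the page range of each chapter. The function name and the 1-based page convention are illustrative assumptions, not part of the described system.

```python
def chapter_page_ranges(start_pages, total_pages):
    """Return (first_page, last_page) for each chapter, 1-based and inclusive.

    start_pages: first pages of the chapters (assumed to be known).
    total_pages: number of pages in the whole course book.
    """
    if not start_pages:
        return []
    if list(start_pages) != sorted(set(start_pages)):
        raise ValueError("start pages must be strictly increasing")
    if start_pages[0] < 1 or start_pages[-1] > total_pages:
        raise ValueError("start pages must lie inside the book")
    # Each chapter ends on the page before the next chapter starts;
    # the last chapter runs to the end of the book.
    ends = [p - 1 for p in start_pages[1:]] + [total_pages]
    return list(zip(start_pages, ends))


# Hypothetical book with 120 pages and chapters starting on pages 5, 40 and 83.
ranges = chapter_page_ranges([5, 40, 83], 120)
```

Each resulting range could then be handed to a PDF library to write one smaller PDF per chapter.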
2.1.2 PDF to Text Conversion
In document processing, we converted the PDF
documents to text to make them more suitable for
processing. For this step we used PDFBox
(http://www.pdfbox.org), an open source Java
library for working with PDF documents.
While converting the PDFs into text, we faced
some problems. Most of the PDF documents were
legacy files. Also, since they are written in
Turkish, the Turkish-specific characters ‘ı’, ‘İ’,
‘ğ’, ‘Ğ’, ‘ş’, and ‘Ş’ were corrupted during
conversion. Except for ‘ş’ and ‘Ş’, the corrupted
characters were corrected by replacement, as each
of them was represented by a single non-Turkish
character. However, we could not correct ‘ş’ and
‘Ş’ by replacement. In the converted text, ‘ş’
becomes ‘fl’ and ‘Ş’ becomes ‘fi’, and both ‘fl’
and ‘fi’ can occur in valid Turkish words. To
overcome this issue, we needed a spell checker
that decides whether a word is spelled correctly
and replaces a misspelled word with the correct
one. As the spell checker we used Zemberek, an
open source Turkish NLP library
(http://code.google.com/p/zemberek). Zemberek
provides basic NLP operations such as spell
checking, morphological parsing, stemming, word
construction, word suggestion, converting words
written using only ASCII characters (the so-called
'deasciifier'), and extracting syllables.
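The two repair strategies described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the single-character mapping is an assumed example (the text does not list the actual corrupted characters), and `is_valid` is a stand-in for a real spell checker such as Zemberek, whose API is not reproduced here.

```python
import re
from itertools import product

# Assumed example mapping for the characters that corrupt to a single
# non-Turkish character; the source does not list the actual characters.
SINGLE_CHAR_FIXES = {"ý": "ı", "Ý": "İ", "ð": "ğ", "Ð": "Ğ"}

# Ambiguous sequences reported in the text: 'ş' -> 'fl', 'Ş' -> 'fi'.
AMBIGUOUS = {"fl": "ş", "fi": "Ş"}
_AMBIGUOUS_RE = re.compile("fl|fi")


def fix_single_chars(text):
    """Undo the unambiguous one-to-one corruptions by direct replacement."""
    return text.translate(str.maketrans(SINGLE_CHAR_FIXES))


def candidates(word):
    """All spellings obtained by keeping or restoring each 'fl'/'fi'."""
    hits = _AMBIGUOUS_RE.findall(word)
    parts = _AMBIGUOUS_RE.split(word)
    results = []
    for choice in product(*[(h, AMBIGUOUS[h]) for h in hits]):
        pieces = [parts[0]]
        for replacement, tail in zip(choice, parts[1:]):
            pieces += [replacement, tail]
        results.append("".join(pieces))
    return results


def repair_word(word, is_valid):
    """Return the first candidate accepted by the spell checker.

    The unchanged word is tried first, so valid words that contain a
    genuine 'fl' or 'fi' are left alone.
    """
    for candidate in candidates(word):
        if is_valid(candidate):
            return candidate
    return word  # nothing validated; leave the word for manual review


# Tiny stand-in dictionary, for illustration only.
VALID = {"başka", "kişi", "iflas", "Şubat"}
repaired = [repair_word(w, VALID.__contains__)
            for w in ["baflka", "kifli", "iflas", "fiubat"]]
```

Trying the unchanged word first reflects the ambiguity noted in the text: a word such as "iflas" is valid as it stands, so only words that fail the check are rewritten.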
Although Zemberek overcame many problems and
proved useful, in some cases it failed to make the
right correction because of proper names and
misspelled words. To handle these cases, we
created a correction map file, which contains a
list of correct spellings of proper names and
common words.
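A correction map of this kind could be applied as in the sketch below. The file layout (one tab-separated "wrong, correct" pair per line, with `#` comments) and the sample entries are assumptions made for illustration; the text does not describe the actual format of the file.

```python
def load_correction_map(lines):
    """Parse 'wrong<TAB>correct' lines into a dict (assumed file format)."""
    corrections = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        wrong, _, correct = line.partition("\t")
        if correct:
            corrections[wrong.strip()] = correct.strip()
    return corrections


def apply_corrections(words, corrections):
    """Replace every word found in the map; leave all other words as-is."""
    return [corrections.get(w, w) for w in words]


# Hypothetical entries; in practice these would be read from the map file.
sample_file = [
    "# wrong\tcorrect",
    "Eskiflehir\tEskişehir",
    "",
    "kiflisel\tkişisel",
]
cmap = load_correction_map(sample_file)
fixed = apply_corrections(["Eskiflehir", "ve", "kiflisel"], cmap)
```

An explicit lookup table is a reasonable fallback for proper names, which a general-purpose spell checker may not recognize.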
2.1.3 Metadata Extraction and Discourse Analysis
At this stage, the full texts of the chapters
obtained in the PDF to text conversion step are
converted into XML representations.
Instead of writing the full text under a single
tag, we first extract metadata such as author,
summary, keywords, and learning objectives so that
we can display this information to the user in the
result set. We followed similar prior research
(Yilmazel, Finneran, & Liddy, 2004), as creating
metadata for digital content manually takes a great
amount of time and effort.
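As an illustration of storing metadata in separate elements rather than under one full-text tag, the sketch below builds an XML record with Python's standard `xml.etree.ElementTree` module. The element names and sample values are hypothetical; the text does not give the schema that was actually used.

```python
import xml.etree.ElementTree as ET


def chapter_to_xml(metadata, body_text):
    """Build an XML record with one element per metadata field.

    Element names are illustrative; the source does not give its schema.
    """
    chapter = ET.Element("chapter")
    for name, value in metadata.items():
        if isinstance(value, (list, tuple)):
            # Multi-valued fields (e.g. keywords) get one child per value.
            parent = ET.SubElement(chapter, name)
            for item in value:
                ET.SubElement(parent, "item").text = item
        else:
            ET.SubElement(chapter, name).text = value
    ET.SubElement(chapter, "content").text = body_text
    return chapter


# Hypothetical sample values.
record = chapter_to_xml(
    {"chapter_title": "Giriş", "chapter_author": "A. Yazar",
     "keywords": ["uzaktan eğitim", "üstveri"]},
    "Bölüm metni ...",
)
xml_text = ET.tostring(record, encoding="unicode")
```

Keeping each field in its own element lets a result page show the author or keywords without parsing the chapter body again.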
After the discourse analysis of the chapters, we
found that chapters were organized as chapter
number, chapter title, introduction, text body,
abstract, evaluation tests, and references. However,
some of the chapters do not contain evaluation
tests and references. The introduction parts of the
chapters also differ: some of them contain
information such as chapter author, aim, keywords,
and suggestions. We implemented a rule-based
extraction system to extract the metadata of the
chapter texts automatically.
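One simple form of rule-based extraction is to split the introduction at known heading lines. The sketch below shows that idea only; the heading labels, field names, and sample text are assumptions, since the actual rules of the system are not given in the text.

```python
import re

# Assumed heading labels mapped to metadata field names; the real rules
# and labels used by the authors are not given in the text.
HEADINGS = {
    "Yazar": "chapter_author",
    "Amaçlar": "learning_objective",
    "Anahtar Kavramlar": "keywords",
    "Öneriler": "suggestions",
}
_HEADING_RE = re.compile(
    r"^(%s)[ \t]*:?[ \t]*$" % "|".join(re.escape(h) for h in HEADINGS),
    re.MULTILINE,
)


def extract_fields(text):
    """Split text at known heading lines and return {field: section text}."""
    fields = {}
    matches = list(_HEADING_RE.finditer(text))
    for i, match in enumerate(matches):
        # A section runs from its heading to the next heading (or the end).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        fields[HEADINGS[match.group(1)]] = text[match.end():end].strip()
    return fields


# Hypothetical introduction part of a chapter.
sample = (
    "Yazar\n"
    "A. Yazar\n"
    "Amaçlar\n"
    "Bu bölümü okuduktan sonra temel kavramları açıklayabileceksiniz.\n"
    "Anahtar Kavramlar\n"
    "üstveri, uzaktan eğitim\n"
)
fields = extract_fields(sample)
```

Headings that are absent simply produce no entry, which matches the observation that not every chapter contains every part.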
We observed that our document collection could
be separated into six categories according to the
differences between the chapter full texts. We
therefore designed a chapter parser that determines
the category of a full text. When a document is
sent to this parser, it decides the category and
extracts the metadata of the document.
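The two-step behaviour of such a parser (decide the category, then extract with category-specific rules) might look like the sketch below. The three categories, their markers, and the extractor are hypothetical placeholders: the text reports six categories but does not describe them.

```python
# Hypothetical categories, each defined by the markers that must be present.
# The source reports six categories but does not describe them.
CATEGORY_RULES = [
    ("with_author_and_keywords", ["Yazar", "Anahtar Kavramlar"]),
    ("with_keywords", ["Anahtar Kavramlar"]),
    ("plain", []),
]


def detect_category(text):
    """Return the first category whose required markers all occur in text."""
    for name, markers in CATEGORY_RULES:
        if all(marker in text for marker in markers):
            return name
    raise ValueError("no category matched")


def parse_chapter(text, extractors):
    """Pick the category, then run the extractor registered for it."""
    category = detect_category(text)
    metadata = extractors[category](text)
    metadata["category"] = category
    return metadata


def title_only(text):
    """Simplest extractor: treat the first non-empty line as the title."""
    first = next(line for line in text.splitlines() if line.strip())
    return {"chapter_title": first.strip()}


# For brevity every category uses the same extractor here; a real system
# would register a different rule set per category.
extractors = {name: title_only for name, _ in CATEGORY_RULES}
result = parse_chapter("Giriş\nAnahtar Kavramlar\nüstveri\n", extractors)
```

Ordering the rules from most to least specific means a chapter is assigned to the richest category it qualifies for, with the marker-free category acting as a fallback.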
Finally, we obtained the following metadata
elements: Course No, Book Name, Book Author,
Book ISBN, Chapter No, Chapter Title, Chapter
Author, Chapter Begin, Foreword, Learning
Objective, Keywords, Content, Suggestions,
TURKISH QUESTION ANSWERING - Question Answering for Distance Education Students