rect_reference.height + threshold]. We increased the
right limit of the lane in order to include those cells
which span multiple rows.
Given the example presented in Figure 7, let’s say
we want to select all cells on the 4th row (i.e. all cells
denoted by green marked letters). We will use the first
cell of the row as reference.
The right limit of the desired lane should be large
enough to cover all spanned rows (e.g. the cells denoted
by the letters L and M) and narrow enough not to cover
cells from the next row (e.g. the cell denoted by the
letter T). The table borders might not be straight lines,
as shown in this example. If we set the limit to
rect_reference.y + threshold, as we did in the previous
subsection, then the algorithm might not be able to
identify all cells from the required row. But if we set
the limit to rect_reference.y + rect_reference.height, just
to cover all cells of the row, we might get cells from the
next row. To solve both problems, we set the limit to
rect_reference.y + rect_reference.height + threshold.
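The row lane can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Rect type, the function name cells_on_same_row and the sample coordinates are hypothetical, the lower limit rect_reference.y − threshold is assumed (the text above only gives the right limit), and the test that a cell must lie entirely inside the lane is one possible reading of the selection rule.

```python
from typing import List, NamedTuple


class Rect(NamedTuple):
    """Bounding box of a cell (hypothetical helper type)."""
    x: int
    y: int
    width: int
    height: int


def cells_on_same_row(reference: Rect, cells: List[Rect], threshold: int) -> List[Rect]:
    """Keep the cells whose vertical extent lies inside the reference lane."""
    lower = reference.y - threshold                      # assumed lower limit
    upper = reference.y + reference.height + threshold   # right limit from the text
    return [c for c in cells if lower <= c.y and c.y + c.height <= upper]


# Made-up coordinates: the reference cell spans two sub-rows.
reference = Rect(x=0, y=100, width=50, height=40)
cell_l = Rect(x=60, y=120, width=50, height=20)        # lower sub-row of the same row
cell_crooked = Rect(x=120, y=98, width=50, height=44)  # same row, border not straight
cell_t = Rect(x=60, y=141, width=50, height=20)        # belongs to the next row

row = cells_on_same_row(reference, [cell_l, cell_crooked, cell_t], threshold=5)
```

With these values the lane is [95, 145], so cell_l and cell_crooked are kept while cell_t, which ends at y = 161, is rejected.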
3.4.8 Removing Cell Borders
Due to table cropping, the cell surfaces have dark
borders around them. These can be erroneously
picked up as extra characters, especially if they vary
in shape and gradation. In order to get good quality
output from OCR software, we can remove those
borders as follows. First, we create a mask with the
same size as the input cell. Then we apply Otsu’s
thresholding method followed by a morphological
dilation to make small borders more visible. Next, we
find line segments in this binary mask image using the
probabilistic Hough transform, which returns a vector
of lines. Finally, we iterate through this vector of
lines and remove from the cell image (i.e. by changing
the line color to white) only those lines which are
placed on the cell borders within a threshold.
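The final filtering step can be illustrated with a small Python sketch. It covers only the decision of which detected segments count as border lines; the thresholding, dilation and Hough transform themselves are left out. The segment format (x1, y1, x2, y2) matches what OpenCV's HoughLinesP returns, but the function names and the rule used here (both endpoints within threshold pixels of the same edge of the cell) are assumptions, since the text does not define "on the cell borders" more precisely.

```python
from typing import List, Tuple

Segment = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


def is_border_line(line: Segment, width: int, height: int, threshold: int) -> bool:
    """Assumed rule: both endpoints lie within `threshold` px of one cell edge."""
    x1, y1, x2, y2 = line
    near_left = x1 <= threshold and x2 <= threshold
    near_right = x1 >= width - 1 - threshold and x2 >= width - 1 - threshold
    near_top = y1 <= threshold and y2 <= threshold
    near_bottom = y1 >= height - 1 - threshold and y2 >= height - 1 - threshold
    return near_left or near_right or near_top or near_bottom


def border_lines(lines: List[Segment], width: int, height: int, threshold: int) -> List[Segment]:
    """Select the segments that would be painted white in the cell image."""
    return [ln for ln in lines if is_border_line(ln, width, height, threshold)]


# Made-up segments for a 100 x 40 pixel cell.
segments = [
    (0, 0, 99, 1),     # top border, slightly slanted
    (2, 0, 1, 39),     # left border
    (20, 10, 20, 30),  # vertical stroke of a character, must be kept
    (98, 5, 97, 35),   # right border
]
to_erase = border_lines(segments, width=100, height=40, threshold=3)
```

Only the selected segments would then be drawn in white over the cell image, so character strokes away from the edges stay untouched.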
3.4.9 Arranging Cells in Rows
An ordinary ascending sort of the array containing the
row cells does not lead to the expected result in all
cases. Our algorithm focuses on those cells which do
not span multiple rows and is divided into three
distinct steps: a) sort the array containing the row
cells in ascending order by each cell's x-axis value,
b) iterate through the array and group those cells
whose top-left corners have (almost) the same x-axis
value; we use a threshold value since the borders are
not perfectly aligned, c) sort the elements of each
group in ascending order by each cell's y-axis value
and add the group to the result array.
It is worth mentioning that every cell which spans
multiple rows will be the single element of its group.
After the algorithm has finished arranging the cells,
we get an array of cell groups whose length equals the
number of table columns. Each group consists of only
those cells belonging to the same column in the
context of the current row.
In Figure 8, cells 1, 2, 7 and 8 each span two rows,
namely the rows occupied by cells 3 and 4 and by
cells 5 and 6, respectively. After the first step of
the algorithm, the array will contain the cells in the
following order: [1, 2, 3, 4, 5, 6, 7, 8].
The second step groups these cells as follows: as long
as the x-axis coordinate of the next cell's top-left
corner lies between cell_current.x − threshold and
cell_current.x + threshold, add it to the current group
and repeat this step with the next cell; otherwise
create a new group and move to the next cell. After this
step, we have the following structure: [[1], [2],
[3, 4], [5, 6], [7], [8]], where each inner array
denotes a group of cells. As we can see, the third and
fourth groups have two elements. The third step sorts
the elements of each group, but in this case
they are already in the desired order.
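The three steps can be sketched in Python as follows. The Cell type, the function name arrange_row_cells and the coordinates are hypothetical; the coordinates only mimic the layout of Figure 8 and are slightly perturbed so that step c) has something to reorder. Each cell is compared with the last cell added to the current group, which is one reading of cell_current.

```python
from typing import List, NamedTuple


class Cell(NamedTuple):
    """Top-left corner of a cell plus a label (hypothetical helper type)."""
    label: int
    x: int
    y: int


def arrange_row_cells(cells: List[Cell], threshold: int) -> List[List[Cell]]:
    # a) sort all cells of the row by the x coordinate of the top-left corner
    ordered = sorted(cells, key=lambda c: c.x)
    # b) group consecutive cells whose x coordinates differ by at most `threshold`
    groups: List[List[Cell]] = []
    for cell in ordered:
        if groups and abs(cell.x - groups[-1][-1].x) <= threshold:
            groups[-1].append(cell)
        else:
            groups.append([cell])
    # c) sort each group by the y coordinate, i.e. top to bottom
    return [sorted(group, key=lambda c: c.y) for group in groups]


# Made-up coordinates: cells 1, 2, 7, 8 span two rows; 3/4 and 5/6 are stacked.
cells = [
    Cell(1, 0, 0), Cell(2, 50, 0),
    Cell(3, 100, 0), Cell(4, 102, 20),
    Cell(5, 150, 0), Cell(6, 149, 20),
    Cell(7, 200, 0), Cell(8, 250, 0),
]
groups = arrange_row_cells(cells, threshold=5)
labels = [[c.label for c in group] for group in groups]
```

Here cell 6 lies one pixel to the left of cell 5, so after steps a) and b) the fourth group is [6, 5]; step c) restores the top-to-bottom order and labels becomes [[1], [2], [3, 4], [5, 6], [7], [8]].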
We have previously introduced the idea of reading
tables just like reading out the elements of a matrix,
row by row from a starting entry. In the following, we
make use of the five functions described in the
previous subsections in order to explain how the table
structure recovery algorithm works.
As shown in Figure 9, the process of recovering the
table structure begins with getting the reference cell.
Then we find all cells on the same column as the
reference cell using the function described in the
second subsection and start looping through the
resulting array. At each iteration, we use the function
from the third subsection to obtain all cells on the
same row as the current one and arrange them using the
function from the fourth subsection. From this point
on, we assume that the cells of the current row fit the
table structure, and we iterate over the array and
output the extracted data as follows. First, we find
the maximum number of rows spanned by a cell, which is
equal to the maximum number of elements of a group
(i.e. rowspan). Then we let an iterator variable idx go
from zero to rowspan and, at each step, we also iterate
through each group: if a group has an element with
index idx, the algorithm outputs a field separator and
extracts the data from that cell after removing its
borders using the fifth function; otherwise it just
outputs a field separator.
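The output loop for one arranged row can be sketched in Python as follows. The function name emit_row, the extract_text callback (standing in for border removal followed by OCR), the ";" separator and the choice to return one string per value of idx are all assumptions made for illustration; idx is also assumed to run from zero up to, but not including, rowspan.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def emit_row(groups: Sequence[Sequence[T]],
             extract_text: Callable[[T], str],
             separator: str = ";") -> List[str]:
    """Return one output line per sub-row of the current table row."""
    # rowspan = largest number of stacked cells found in any column group
    rowspan = max((len(group) for group in groups), default=0)
    lines: List[str] = []
    for idx in range(rowspan):
        parts: List[str] = []
        for group in groups:
            parts.append(separator)          # a field separator is always written
            if idx < len(group):             # the group has a cell on this sub-row
                parts.append(extract_text(group[idx]))
        lines.append("".join(parts))
    return lines


# Strings stand in for cell images; `str` plays the role of the OCR step.
example = emit_row([["a"], ["b", "c"]], extract_text=str)
```

For this input the first column has a single cell spanning both sub-rows, so example is [";a;b", ";;c"]: the second line contains an empty field where the spanning cell has no further content.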