Excel documents may contain multiple financial documents on different sheets; therefore, annotators also labeled each sheet of the Excel documents. At the end of the labeling stage, we obtain around 5300 financial documents, of which almost half (47%) are t-balances.
Feature Engineering. In order to apply machine learning techniques, we extract features that differentiate t-balances from other financial documents.
We observed that the word “mizan” (t-balance in Turkish) may be present in the sheet name of Excel documents. (i) We use the existence of this word as a boolean feature. As indicated before, balance sheets and income statements are structured documents, meaning their shape is standardized. Moreover, the positions and the total number of numeric columns are also fixed for these financial documents. T-balances, on the other hand, are semi-structured, and both shape and positions differ from t-balance to t-balance. Therefore, (ii, iii) the shape of the documents, (iv) the position of the first numeric column, and (v) the total number of numeric columns are used as features. Focusing only on the structure of the documents is inadequate, because t-balances and other financial documents are table-formatted and share many structural similarities. Furthermore, companies may also provide documents that contain only a specific part of a t-balance, such as cheques, and these documents are even more confusing than other financial documents for machine learning algorithms. To our knowledge, t-balances must contain at least the majority of the following account codes in order to show assets, liabilities, and shareholders’ equity: 100-cash, 101/103-checks, 102-bank accounts, 108-other liquid assets, 120-accounts receivable, 121/321-bonds, 320-accounts payable, and 500-shareholders. (vi) The number of these account codes present in a document is used as a feature.
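As a rough illustration, the six features above could be computed as follows. This is a sketch only: the `sheet_name` and `cells` inputs and all helper names are hypothetical, and the actual cell parsing used by the system is not described in this paper.

```python
# Illustrative sketch of features (i)-(vi) described above.
# `sheet_name` is the Excel sheet name; `cells` is a list of rows of cell values.

# Account codes listed in the text as characteristic of t-balances.
T_BALANCE_CODES = {"100", "101", "102", "103", "108",
                   "120", "121", "320", "321", "500"}

def is_numeric(value):
    """Treat ints, floats, and numeric-looking strings as numeric cells."""
    try:
        float(str(value).replace(",", "."))  # tolerate comma decimal separators
        return True
    except ValueError:
        return False

def extract_features(sheet_name, cells):
    n_cols = max(len(row) for row in cells)
    numeric_cols = [
        j for j in range(n_cols)
        if any(j < len(row) and is_numeric(row[j]) for row in cells)
    ]
    codes_found = {str(v).strip() for row in cells for v in row} & T_BALANCE_CODES
    return [
        int("mizan" in sheet_name.lower()),       # (i) "mizan" in sheet name
        len(cells),                               # (ii) shape: number of rows
        n_cols,                                   # (iii) shape: number of columns
        numeric_cols[0] if numeric_cols else -1,  # (iv) first numeric column position
        len(numeric_cols),                        # (v) total number of numeric columns
        len(codes_found),                         # (vi) count of t-balance account codes
    ]

features = extract_features("Mizan 2020",
                            [["mizan", "x"], ["100", "5,0"], ["120", "10"]])
```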
At the end of data collection and feature extraction, we obtain a [5300, 7]-sized corpus, where the seventh column indicates the label. We divide our data into two parts: two-thirds for training and one-third for testing.
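A minimal sketch of this split, assuming the corpus is held as a plain list of 7-element rows (six features plus the label); the random stand-in data is illustrative only:

```python
# Sketch: two-thirds train / one-third test split of a 5300 x 7 corpus,
# with the seventh column holding the label. Stand-in data, not the real corpus.
import random

random.seed(0)
corpus = [[random.random() for _ in range(6)] + [random.randint(0, 1)]
          for _ in range(5300)]

random.shuffle(corpus)
cut = (2 * len(corpus)) // 3
train, test = corpus[:cut], corpus[cut:]

X_train = [row[:6] for row in train]  # feature columns
y_train = [row[6] for row in train]   # label column
```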
3.1.2 Corpora for Dividing Company Titles
In order to divide company titles into proper name,
sector and type parts, we collect 2500 random Turkish
human and company names. Human annotators labeled the data for the named entity recognition models.
An example is shown in Figure 2.
Figure 2: Human and company name data examples. Red
examples are companies and black examples are human
names.
3.1.3 Corpora for Testing Customer Matching
We collect six datasets, organized into two collections, in order to test customer matching.
The first collection consists of five sets, A, B, C, D, and E, which include existing bank customers. Datasets A, B, and D were created from RMs, while datasets C and E were generated from the data sets.
The second collection consists of 2000 labeled queries; the CIF number is used as the label for bank customers (positive samples), and -1 is used for all others (negative samples).
3.1.4 Dictionaries for Information Mapping
We use dictionary-based algorithms in order to extract valuable information. For this purpose, we have four different dictionaries: bank names, company indicators, annotations, and abbreviations of sectors and company types. These dictionaries contain synonyms, abbreviations, and common misspellings of each word.
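A hedged sketch of how such a dictionary lookup might work; the entries shown are invented examples for illustration, not the system's actual dictionaries.

```python
# Illustrative dictionary-based normalization: map synonyms, abbreviations,
# and misspellings to a canonical form. Entries are invented examples.
COMPANY_INDICATORS = {
    "a.s.": "anonim sirketi",      # joint-stock company abbreviation (example)
    "as": "anonim sirketi",
    "ltd. sti.": "limited sirketi",  # limited company abbreviation (example)
    "ltd": "limited sirketi",
}

def normalize_token(token, dictionary):
    """Return the canonical form of a token, or the token itself if unknown."""
    return dictionary.get(token.lower().strip(), token)

canonical = normalize_token("A.S.", COMPANY_INDICATORS)
```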
3.2 Classification of Financial
Documents
In order to detect t-balances among the financial documents, we need to classify them. For this purpose, we consider document structure and layout analysis, document/text classification, and binary classification algorithms. Document structure and layout analysis algorithms mostly work on image-based datasets. Text classification algorithms classify text by analyzing words and their frequencies. Our data neither consist of images nor contain long texts. Hence, we focus on feature engineering to represent our data numerically and then apply binary classification methods.
We apply the following state-of-the-art binary classification algorithms: Multi-Layer Perceptron, Support Vector Machines, Random Forest, k-Nearest Neighbors (KNN), and Decision Trees (Shawe-Taylor et al., 2004; Natarajan, 2014).
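These five classifiers could be compared, for example, with scikit-learn defaults on synthetic stand-in data; the paper's actual hyperparameters and corpus are not reproduced here.

```python
# Sketch: the five classifiers named above, trained with default settings
# on synthetic stand-in data (not the paper's corpus or configuration).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with the same shape as the feature set: six features per document.
X, y = make_classification(n_samples=600, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)

models = {
    "MLP": MLPClassifier(max_iter=500, random_state=0),
    "SVM": SVC(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
# Accuracy of each model on the held-out third of the data.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```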
FOCA: A System for Classification, Digitalization and Information Retrieval of Trial Balance Documents