Excel documents may contain multiple financial documents on different sheets; therefore, annotators also labeled each sheet of the Excel documents. At the end of the labeling stage, we obtain around 5300 financial documents, of which almost half (47%) are t-balances.
Feature Engineering. In order to apply machine learning techniques, we extract features that differentiate t-balances from other financial documents.
We observed that the word “mizan” (t-balance in Turkish) may be present in the sheet name of Excel documents. (i) We use the existence of this word as a boolean feature. As indicated before, balance sheets and income statements are structured documents, meaning their shape is standardized. Moreover, the positions and the total number of numeric columns are also fixed for these financial documents. T-balances, on the other hand, are semi-structured, and both shape and positions differ from t-balance to t-balance. Therefore, (ii, iii) the shape of the documents, (iv) the position of the first numeric column, and (v) the total number of numeric columns are used as features. Focusing only on the structure of the documents is inadequate, because t-balances and other financial documents are table-formatted and share many structural similarities. Furthermore, companies may also provide documents that contain only a specific part of a t-balance, such as cheques, and these documents are even more confusing than other financial documents for machine learning algorithms. To our knowledge, t-balances must contain at least the majority of the following account codes in order to show assets, liabilities, and shareholders’ equity: 100-cash, 101/103-checks, 102-bank accounts, 108-other liquid assets, 120-accounts receivable, 121/321-bonds, 320-accounts payable, and 500-shareholders. (vi) The number of these account codes present in a document is used as a feature.
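As a rough illustration, the six features above could be computed as follows. This is a sketch only: the `sheet_name` and `cells` inputs and all helper names are hypothetical, and the actual cell parsing used by the system is not described in this paper.

```python
# Illustrative sketch of features (i)-(vi) described above.
# `sheet_name` is the Excel sheet name; `cells` is a list of rows of cell values.

# Account codes listed in the text as characteristic of t-balances.
T_BALANCE_CODES = {"100", "101", "102", "103", "108",
                   "120", "121", "320", "321", "500"}

def is_numeric(value):
    """Treat ints, floats, and numeric-looking strings as numeric cells."""
    try:
        float(str(value).replace(",", "."))  # tolerate comma decimal separators
        return True
    except ValueError:
        return False

def extract_features(sheet_name, cells):
    n_cols = max(len(row) for row in cells)
    numeric_cols = [
        j for j in range(n_cols)
        if any(j < len(row) and is_numeric(row[j]) for row in cells)
    ]
    codes_found = {str(v).strip() for row in cells for v in row} & T_BALANCE_CODES
    return [
        int("mizan" in sheet_name.lower()),       # (i) "mizan" in sheet name
        len(cells),                               # (ii) shape: number of rows
        n_cols,                                   # (iii) shape: number of columns
        numeric_cols[0] if numeric_cols else -1,  # (iv) first numeric column position
        len(numeric_cols),                        # (v) total number of numeric columns
        len(codes_found),                         # (vi) count of t-balance account codes
    ]

features = extract_features("Mizan 2020",
                            [["mizan", "x"], ["100", "5,0"], ["120", "10"]])
```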
At the end of data collection and feature extraction, we obtain a [5300, 7]-sized corpus, where the seventh column indicates the label. We divide our data into two parts: two-thirds for training and one-third for testing.
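A minimal sketch of this split, assuming the corpus is held as a plain list of 7-element rows (six features plus the label); the random stand-in data is illustrative only:

```python
# Sketch: two-thirds train / one-third test split of a 5300 x 7 corpus,
# with the seventh column holding the label. Stand-in data, not the real corpus.
import random

random.seed(0)
corpus = [[random.random() for _ in range(6)] + [random.randint(0, 1)]
          for _ in range(5300)]

random.shuffle(corpus)
cut = (2 * len(corpus)) // 3
train, test = corpus[:cut], corpus[cut:]

X_train = [row[:6] for row in train]  # feature columns
y_train = [row[6] for row in train]   # label column
```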
3.1.2 Corpora for Dividing Company Titles
In order to divide company titles into proper name,
sector and type parts, we collect 2500 random Turkish
human and company names. Human annotators labeled the data for the named entity recognition models.
An example is shown in Figure 2.
Figure 2: Human and company name data examples. Red
examples are companies and black examples are human
names.
3.1.3 Corpora for Testing Customer Matching
We collect six datasets, organized into two collections, in order to test customer matching.
The first collection consists of five sets, A, B, C, D, and E, which include existing bank customers. Datasets A, B, and D were created from RMs, while datasets C and E were generated from the data sets.
The second collection consists of 2000 labeled queries; the CIF number is used as the label for bank customers (positive samples), and -1 is used for all others (negative samples).
3.1.4 Dictionaries for Information Mapping
We use dictionary-based algorithms in order to extract valuable information. For this purpose, we have four different dictionaries: bank names, company indicators, annotations, and abbreviations of sectors and company types. These dictionaries contain synonyms, abbreviations, and common misspellings of each word.
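A hedged sketch of how such a dictionary lookup might work; the entries shown are invented examples for illustration, not the system's actual dictionaries.

```python
# Illustrative dictionary-based normalization: map synonyms, abbreviations,
# and misspellings to a canonical form. Entries are invented examples.
COMPANY_INDICATORS = {
    "a.s.": "anonim sirketi",      # joint-stock company abbreviation (example)
    "as": "anonim sirketi",
    "ltd. sti.": "limited sirketi",  # limited company abbreviation (example)
    "ltd": "limited sirketi",
}

def normalize_token(token, dictionary):
    """Return the canonical form of a token, or the token itself if unknown."""
    return dictionary.get(token.lower().strip(), token)

canonical = normalize_token("A.S.", COMPANY_INDICATORS)
```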
3.2 Classification of Financial
Documents
In order to detect t-balances among the financial documents, we need to classify them. For this purpose, we consider document structure and layout analysis, document/text classification, and binary classification algorithms. Document structure and layout analysis algorithms mostly work on image-based datasets. Text classification algorithms classify text by analyzing words and their frequencies. Our data neither consist of images nor contain long texts. Hence, we focus on feature engineering to represent our data numerically and then apply binary classification methods.
We apply the following state-of-the-art binary classification algorithms: Multi-Layer Perceptron, Support Vector Machines, Random Forest, k-Nearest Neighbors (KNN), and Decision Trees (Shawe-Taylor et al., 2004; Natarajan, 2014).
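These five classifiers could be compared, for example, with scikit-learn defaults on synthetic stand-in data; the paper's actual hyperparameters and corpus are not reproduced here.

```python
# Sketch: the five classifiers named above, trained with default settings
# on synthetic stand-in data (not the paper's corpus or configuration).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with the same shape as the feature set: six features per document.
X, y = make_classification(n_samples=600, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)

models = {
    "MLP": MLPClassifier(max_iter=500, random_state=0),
    "SVM": SVC(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
# Accuracy of each model on the held-out third of the data.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```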
FOCA: A System for Classification, Digitalization and Information Retrieval of Trial Balance Documents