ESG Data Collection with Adaptive AI

Francesco Visalli (1, a), Antonio Patrizio, Antonio Lanza, Prospero Papaleo, Anupam Nautiyal, Mariella Pupo, Umberto Scilinguo, Ermelinda Oro (2, b) and Massimo Ruffolo (1, c)

(1) altilia.ai, Piazza Vermicelli, c/o Technest, University of Calabria, Rende (CS), 87036, Italy

(2) High Performance Computing and Networking Institute of the National Research Council (ICAR-CNR), Via Pietro Bucci 8/9C, Rende (CS), 87036, Italy

Keywords:

Intelligent Document Processing, Environmental Social and Governance (ESG), Sustainable Investment, Socially Responsible Investment (SRI), Artificial Intelligence, Natural Language Processing, Large Language Models, Computer Vision, Information Retrieval, Deep Learning, Knowledge Graph, Workflow, Hyperautomation.

Abstract:

The European Commission defines sustainable finance as the process of taking Environmental, Social and Governance (ESG) considerations into account when making investment decisions, leading to more long-term investments in sustainable economic activities and projects. Banks and other financial institutions are increasingly incorporating data about ESG performance, with particular reference to the risks posed by climate change, into the methods they use to evaluate their credit and investment portfolios. However, collecting data on the ESG performance of corporations and businesses is still a difficult task. There exists no single source from which all the data can be extracted. Furthermore, the most important ESG data is in unstructured format, hence collecting it poses many technological and methodological challenges. In this paper we propose an AI-based method that addresses the ESG data collection problem. We also present the implementation of the proposed method and discuss some experiments carried out on real-world documents.

1 INTRODUCTION

Environmental, Social, and Governance (ESG) pil-

lars describe areas that characterize a sustainable, re-

sponsible, or ethical investment. ESG investing has

evolved in recent years to meet the demands of in-

vestors and public authorities that wish to better in-

corporate long-term ﬁnancial risks and opportunities

into their investment decision-making processes.

There is currently a clear challenge with the quality and consistency of ESG data. It is important to have standardized data, but only fragmented information is available from multiple sources. Data sources can be divided into two main sub-groups: primary and secondary data sources. By primary data sources, we mean self-reported ESG data (company websites, annual and sustainability reports, etc.), third-party ESG data (NGO/government websites and reports), and real-time ESG signals (news, social media, company reviews, and so on). Secondary data sources are ESG data vendors (or ESG data providers), whose job consists in manually collecting, systematizing, and analyzing ESG attributes obtained from primary data sources. Primary data sources lock ESG data within strongly unstructured texts and documents, while secondary data sources are slow and provide limited subsets of manually built ESG data for subsets of businesses.

https://orcid.org/0000-0002-6768-3921

https://orcid.org/0000-0002-5529-1007

https://orcid.org/0000-0002-4094-4810

There’s also a lack of a standard taxonomy. ESG

factors continue to evolve, and the dictionary changes

as investors move through sectors and industries.

ESG factors are difﬁcult to reproduce over time and

across geographies, partly because of differences in

data across regions and in how the recording of data

has evolved. Moreover, there may be discrepancies

between what certain factors are expected to do and

what they end up doing.

Traditionally, data was collected by human analysts who were also responsible for pre-processing and analyzing it. This process requires a lot of human capital and is very time-consuming, and the chances that human analysts make mistakes while performing these tasks are very high. Advances in AI have made it easier to automate complex tasks at high speed and volume, changing how companies work with data. Artificial Intelligence (AI) and Intelligent Document Processing (IDP) can extract, filter and structure, at scale, the crucial data used by rating agencies, business analysts, investors, etc. IDP uses AI technologies such as computer vision and language models rooted in deep learning to classify, categorize, extract, and validate the relevant information found in a variety of document formats.

Visalli, F., Patrizio, A., Lanza, A., Papaleo, P., Nautiyal, A., Pupo, M., Scilinguo, U., Oro, E. and Ruffolo, M.

ESG Data Collection with Adaptive AI.

DOI: 10.5220/0011844500003467

In Proceedings of the 25th International Conference on Enterprise Information Systems (ICEIS 2023) - Volume 1, pages 468-475

ISBN: 978-989-758-648-4; ISSN: 2184-4992

© 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

In this paper, we propose an ESG data collection method grounded on Altilia Intelligent Automation (AIA), an AI-based Intelligent Document Processing (IDP) platform. The proposed method allows gathering various documents and contents from disparate sources, including annual reports, sustainability reports, notes to financial statements, news, NGO reports, company websites obtained by web scraping, etc. The AIA platform is grounded on a hybrid and adaptive AI paradigm that makes use of deep learning algorithms for Computer Vision (CV) and Natural Language Processing (NLP). In particular, we combine large language models, Human-In-The-Loop AI techniques, continuous learning, and knowledge representation methods to implement machine reading comprehension techniques that turn unstructured documents into structured data and provide answers to many different ESG-related questions. The main contributions of this work are the following:

• We describe a new approach to automatically gather ESG-related documents and contents, analyze them, and extract ESG data that can be used to create structured company profiles.

• We report the results of experiments, carried out on real-world documents, that show how banks, wealth management agencies, rating agencies, investors, and business analysts can use the proposed IDP approach to perform ESG data collection and extraction for every company, regardless of its size and industry.

The rest of this paper is organized as follows: sec-

tion 2 presents the related work in the area of AI pow-

ered approaches for ESG investing and ESG data col-

lection; section 3 describes ESG data collection prob-

lems we address in the paper; section 4 describes the

Altilia Intelligent Automation platform and the pro-

posed ESG data collection method implemented by

the platform; section 5 presents some experiments

performed over a dataset we have built to show the

depth and breadth of the proposed approach; ﬁnally,

section 6 concludes the paper.

2 RELATED WORK

In this section we briefly summarize papers on the application of AI to ESG.

ESG data collection has really gone mainstream

because of the growing relevance that ESG rating is

gaining in the investment community. There are a

growing number of ESG rating agencies and reporting

frameworks, all of which have evolved to improve the

transparency and the consistency of the ESG informa-

tion that ﬁrms are reporting publicly.

(Hughes et al., 2021) describes how ESG ratings have traditionally been developed by human research analysts who follow proprietary methodologies to analyze company disclosures, articles, and industry research, among other sources, to identify the ESG credentials of a company. The authors discuss how the process underpinning analyst-driven ESG research is imbued with subjectivity during data analysis and rating generation, and how, within the last few years, developments in Artificial Intelligence and Machine Learning have led to the creation of a new type of ESG rating provider: one that analyzes the ESG risks and opportunities of companies by collecting (or "scraping") and analyzing unstructured data from internet sources using AI.

(Macpherson et al., 2021) discusses how Artificial Intelligence and FinTech-powered ESG screening and analysis solutions have become "strategic enablers" that can address some of the inherent ESG information biases, and potentially even the ESG rating divergences, arising from corporate self-reporting and from annualised, backward-looking reporting of information. The study discusses the implications of regulatory and industry expectations around the management of ESG data and frameworks, as well as AI-backed solutions to better manage and align ESG information sources, e.g. for issuer and controversies screening.

(Lee et al., 2022) describes how to analyze ESG data through ML methods including regression, classification, and anomaly detection. The main tasks addressed are classifying whether investors made excellent or bad investments, detecting anomalous data to prevent adversarial attacks, predicting revenue based on ESG funds, and classifying sentences, which suggests a straightforward method for predicting ESG scores.

In (de Franco et al., 2020), an ML algorithm was developed to identify patterns between the ESG profiles and the financial performance of companies. The ML algorithm maps regions of the high-dimensional space of ESG features. The aggregated predictions are converted into scores, which are used to screen investments for stocks with positive scores. This Machine Learning algorithm links ESG features with financial performance in a nonlinear way. It is an efficient stock screening tool that outperforms classic strategies, which screen stocks based on their ESG ratings.

(Gupta et al., 2021) provides a framework for conducting statistical analysis and leveraging ML techniques to gauge the importance of ESG parameters for investment decisions and how they affect the financial performance of firms. For companies with the best ESG ratings, "return on equity" was found to be greater than for the rest of the companies. With linear and random forest regression models, the prediction accuracy of the growth variables "profit margin" and "return on assets" increased when ESG data was used as input along with financial data. The companies with the highest "profit margins" were the ones with the best ESG ratings.

(Schultz and Tropmann-Frick, 2020) have de-

veloped a method for detecting unusual journal en-

tries within individual ﬁnancial accounts using auto-

encoder neural networks. A manually tagged list of

entries is compared with identiﬁed journal entries.

In the comparison, all analyzed ﬁnancial accounts

showed high F-scores and high recall.

(Rony et al., 2022) propose Climate Bot, a machine reading comprehension system for question answering over documents that provides answers related to climate change. Climate Bot offers an interface through which users can ask questions in natural language and get answers from reliable data sources.

All the papers discussed above describe general AI methods adopted in ESG investment practice. To the best of our knowledge, no work describes the application of advanced adaptive AI techniques to the extraction of information from ESG sources such as sustainability reports, annual reports, websites, and so on. This paper describes one of the components of our platform, which lets users ask ESG-related questions and lets the AI-based platform extract structured, pre-processed information before presenting it to the end user.

3 ESG DATA COLLECTION PROBLEMS

In this section we discuss how ESG information can

be spread across different sources of information and

the importance of mixing information that comes

from heterogeneous sources.

3.1 ESG Data Sources

ESG information can be disclosed through different types of documents, such as non-financial and sustainability reports, as well as financial statements and annual and management reports. One of the main issues is the scarce availability of ESG data for small-cap and mid-cap companies. Large-cap companies, which have more resources and for which ESG practices are becoming mandatory, provide ESG data through annual reports, sustainability reports, media reports, social media, news, etc. For small-cap and mid-cap companies, instead, limited resources and the lack of standardization in ESG data disclosure make it hard to find ESG-related data.

There are two types of data sources: primary and secondary. Primary data sources include the company website, annual reports, proxy reports, sustainability reports, corporate social responsibility (CSR) reports, news, social media, and company reviews. This kind of data is typically available for free, and can in turn be classified into self-reported ESG data (all the data that the company itself discloses), third-party ESG data (such as NGO or government websites and reports) and real-time ESG signals (such as news, social media, company reviews, etc.). The problem with self-reported ESG data is that it can be biased, as companies can choose what type of information to disclose. On the other hand, if we rely only on real-time signals it could happen that, for example, a competitor spreads misleading news about a company. It is therefore important to mix self-reported data and real-time signals. A major challenge is the format of the available data, because every company has its own way of presenting ESG information, and companies in different countries follow different disclosure standards. Due to this lack of standards, the data comes in different structures: for example, it can appear as text, complex tables, or multifaceted data points. Another big challenge is that there is no single place where all the ESG-related information for a particular company can be found: the data is fragmented and available in multiple places.

Secondary data sources are ESG data vendors or ESG data providers, whose job consists in collecting, systematizing, and analyzing environmental, social, and governance attributes obtained from primary data sources. The final products of this type of ESG information provider are reports that contain an analysis of a given company or sector, equipped with some ESG scores.

The problem is that, as ESG is becoming a hot topic, financial institutions and firms are investing heavily in obtaining ESG data and reports for rating companies. However, primary data sources lock ESG data within strongly unstructured texts and documents, while secondary data sources are slow, provide limited subsets of manually built ESG data for subsets of businesses and, sometimes, cannot be trusted due to a lack of transparency.

Figure 1: Point-and-click document annotation.

4 ALTILIA INTELLIGENT AUTOMATION PLATFORM

Altilia Intelligent Automation (AIA) is a platform designed with the goal of democratizing the use of AI for Intelligent Document Processing (IDP). The platform enables the automation of business processes that require the understanding of complex documents and unstructured data sources. It makes use of AI techniques that adapt its AI models to real-world changes by exploiting human feedback. The AIA platform gives enterprises a no-code/low-code interface, in a cloud-based environment, that allows business domain experts to transfer their knowledge into algorithms by training AI models (Figure 1), combining models to create AI skills, and using AI skills within workflows to automate the extraction of relevant ESG data. The platform can process and understand complex and visually rich documents with variable and non-standardized layouts. This is crucial for automating processes that require the extraction and/or comparison of data buried in long, in-depth reports (e.g. financial and annual reports).

4.1 ESG Data Collection by Altilia Intelligent Automation Platform

To make the ESG data collection workflow built with the AIA platform easier to understand, we consider a running example based on the collection of datapoints from notes to financial statements, sustainability reports, and websites. The datapoints considered in the running example are the following: (i) Company Description, a string describing what the company concretely does; (ii) Ateco, an entity representing the Italian code for the industry to which the company belongs; (iii) Green Product, a string that contains the name and/or the description of a product obtained through organic procedures or with low environmental impact; (iv) Non-Renewable Energy, a string describing how much non-renewable energy the company uses for production; (v) Efficiency Initiatives, a string that describes company initiatives to reduce the environmental impact of production; (vi) Environmental Certifications, an entity representing the name of the environmental certifications obtained by the company (for example ISO 14001).

Figure 2: The visual inspection and review of a datapoint recognized within a document.
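The six datapoints listed above can be pictured as fields of a structured company profile. The following is only an illustrative sketch: the class `CompanyProfile`, its field names, and the sample values are assumptions made for this example and are not part of the AIA platform.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class CompanyProfile:
    """Hypothetical container for the six datapoints of the running example."""
    company_description: Optional[str] = None          # free-text string
    ateco: Optional[str] = None                        # entity: Italian industry code
    green_products: List[str] = field(default_factory=list)
    non_renewable_energy: Optional[str] = None         # free-text string
    efficiency_initiatives: List[str] = field(default_factory=list)
    environmental_certifications: List[str] = field(default_factory=list)

    def missing(self) -> List[str]:
        """Names of the datapoints that have not been collected yet."""
        return [name for name, value in asdict(self).items() if not value]


# Made-up sample values, used only to show how the structure fills up.
profile = CompanyProfile(ateco="62.01.00",
                         environmental_certifications=["ISO 14001"])
print(profile.missing())
```

A structure of this kind makes it easy to see which datapoints still have to be extracted for a given company.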

In the following we explain, step by step, how we leveraged the AIA platform to build the end-to-end workflow that collects the ESG datapoints defined above. The general structure of the workflow is depicted in Figure 3.

Gathering. This step gathers the documents to use for the ESG data collection. In our running example we used three different types of documents. Annual and sustainability reports were gathered manually, while company websites were gathered automatically by executing a data source connector capable of applying web crawling and web scraping techniques. The connector works in two phases. In the first phase, it crawls the Web on the basis of a list of companies, where each company is equipped with its set of firmographic data, such as company name, VAT number, headquarters address, etc. The goal of the web crawling phase is to output the websites of the companies in the input list. In the second phase, the connector applies web scraping techniques to gather the contents of the web pages of each website. The output of this phase is a set of documents, one for each web page, ready to be ingested into the platform.
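The second phase, turning each web page into one document, can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions and not the platform's connector: the names `VisibleTextParser` and `page_to_document`, the output dictionary layout, and the sample page are all hypothetical, and fetching pages over the network is left out.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects visible text, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def page_to_document(url, html):
    """Turn one already-downloaded web page into one text document."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return {"url": url, "text": " ".join(parser.chunks)}


sample_html = ("<html><head><style>p{color:red}</style></head><body>"
               "<h1>Acme</h1><p>We make solar panels.</p>"
               "<script>var x=1;</script></body></html>")
doc = page_to_document("https://example.com/about", sample_html)
print(doc["text"])
```

A production connector would also need crawling, politeness rules, and boilerplate removal, none of which is shown here.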

Ingestion. In this step the AIA platform applies document analysis and indexing methods. AIA takes documents in PDF format as input and produces as output the Altilia Spatial Document Object Model (ASDOM), a spatial document format (patented by Altilia). In more detail, document ingestion consists of two main phases: document analysis and document indexing.

Figure 3: The ESG data collection process with Altilia Intelligent Automation platform.

In the first phase the platform applies: (i) intelligent optical character recognition (iOCR) algorithms to documents in image format; (ii) language detection and page orientation detection algorithms to enable better recognition of layout and text within the documents; (iii) document layout analysis and recognition algorithms to extract the main layout elements. Furthermore, the OCR error correction technique defined in (Nguyen et al., 2021) is applied. The output of this first phase is the document content turned into the ASDOM format. ASDOM turns document contents into a machine-readable format that jointly represents textual contents and document layout elements such as: text paragraphs; tables and their sub-elements, like cells, rows, columns, row headers, and column headers; text columns; charts; images; page headers and footers. ASDOM plays a twofold role: it supports document searching, filtering and visualization, and it simplifies Machine Learning based document processing workflows. In particular, ASDOM enables further document processing by the machine learning models that perform the actual ESG data collection.

In the second phase the platform stores and indexes the ASDOM within the Altilia Knowledge Base, so that document contents and layout elements can be queried and retrieved jointly. In more detail, the platform applies text embedding algorithms like word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), BERT (Devlin et al., 2018), etc. to produce dense vector representations of the text contained in the documents. The dense vectors produced in this phase are linked with the layout elements in the ASDOM in order to create the final document representation that is stored, indexed, and managed by the Altilia Knowledge Base.
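The idea of linking dense vectors to layout elements can be illustrated with a toy index. This is a sketch under loud assumptions: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model such as word2vec, GloVe or BERT, and `IndexedElement`, `index_elements` and `search` are hypothetical names that do not reflect the ASDOM format or the Altilia Knowledge Base.

```python
import hashlib
import math
from dataclasses import dataclass
from typing import List

DIM = 64  # toy dimensionality, chosen arbitrarily


def toy_embed(text: str) -> List[float]:
    """Stand-in for a real embedding model: L2-normalised hashed bag of words."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class IndexedElement:
    doc_id: str
    kind: str            # layout element type, e.g. "paragraph" or "table_cell"
    page: int
    text: str
    vector: List[float]  # dense vector linked to this layout element


def index_elements(doc_id, elements):
    """elements: iterable of (kind, page, text) tuples."""
    return [IndexedElement(doc_id, kind, page, text, toy_embed(text))
            for kind, page, text in elements]


def search(index, query, top_k=1):
    """Rank layout elements by cosine similarity (vectors are unit length)."""
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, e.vector)), e) for e in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


index = index_elements("report-1", [
    ("paragraph", 3, "The plant is certified ISO 14001"),
    ("table_cell", 7, "Diesel consumption 1200 MWh"),
])
hits = search(index, "The plant is certified ISO 14001", top_k=1)
```

Because each vector stays attached to its layout element, a hit carries both the matched text and where it sits in the document (element type and page).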

Recognition and Extraction. This step takes as input documents stored in ASDOM format within the knowledge base and applies deep learning algorithms to recognize and extract the datapoints on which the algorithms have been trained. In more detail, the datapoint recognition and extraction step works in two main phases: text retrieval and text reading.

The text retrieval phase is needed because we deal with documents that can run to hundreds of pages; hence, to accurately extract a specific datapoint, we first have to identify the layout elements that are candidates to contain it. We adopt two different methods to execute the retrieval phase. The first method is based on BM25-like algorithms (Kim and Gil, 2019; Amati, ), while the second makes use of dense passage retrieval like the one discussed in (Karpukhin et al., 2020). BM25-like algorithms are based on queries composed of keywords and do not take into account the semantics of the datapoints. Dense passage retrieval, instead, retrieves text passages through queries composed of pieces of text that express the semantics of the datapoints.
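To make the keyword-based option concrete, the following is a minimal sketch of Okapi BM25 scoring over a handful of passages. It is not the platform's retriever: the whitespace tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the sample passages are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption of this sketch)."""
    return text.lower().split()


def bm25_scores(query, passages, k1=1.5, b=0.75):
    """Return one BM25 score per passage for a keyword query."""
    docs = [tokenize(p) for p in passages]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of passages containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Non-negative IDF variant, log(1 + (N - df + 0.5) / (df + 0.5)).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


passages = [
    "The company holds ISO 14001 certification for its plants",
    "Revenue grew by ten percent during the year",
    "Energy consumption from non-renewable sources decreased",
]
scores = bm25_scores("iso 14001 certification", passages)
best = max(range(len(passages)), key=lambda i: scores[i])
```

Only passages that share keywords with the query receive a non-zero score, which is why this kind of retriever cannot capture a datapoint expressed with different wording.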

The text reading phase depends on the complexity of the input documents and on the difficulty of the datapoints. For very easy datapoints we use a purely syntactic approach based on the Altilia Spatial Information Extraction Language (ASIEL), a rule-based spatial language with the expressiveness of a context-free grammar enriched by a spatial algebra, which exploits the ASDOM to recognize and extract datapoints and objects from documents. For more complex and semantic datapoints, which need sense disambiguation, we use different types of LLMs (Large Language Models) and machine reading comprehension algorithms to perform NLP tasks such as token sequence classification, text classification, entity extraction, and question answering. The extraction performance obtained with both approaches is reported in Section 5. For example, for the Ateco datapoint we use ASIEL because extraction at token level is preferred, while the Company Description datapoint requires extraction at sentence level and, because company descriptions can differ greatly from one another, we need the powerful text classification capabilities provided by LLMs.
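The flavour of token-level, rule-based extraction can be conveyed with a plain regular expression. This sketch is not ASIEL, whose syntax is not shown in this paper, and it ignores layout entirely. The assumed code shape (two digits followed by one or two dot-separated groups of one or two digits), the 40-character window after the keyword, and the sample sentence are illustrative assumptions.

```python
import re
from typing import Optional

# Assumed shape of an Ateco code, e.g. "62.01" or "62.01.00",
# appearing shortly after the keyword "ATECO".
ATECO_RULE = re.compile(
    r"\bateco\b.{0,40}?\b(\d{2}(?:\.\d{1,2}){1,2})\b",
    re.IGNORECASE,
)


def extract_ateco(text: str) -> Optional[str]:
    """Return the first code found after the keyword, or None."""
    match = ATECO_RULE.search(text)
    return match.group(1) if match else None


sample = "The company operates under ATECO code 62.01.00 and employs 40 people."
code = extract_ateco(sample)
print(code)
```

A rule of this kind works because the datapoint has a rigid surface form, whereas a free-text datapoint such as a company description has no pattern a rule could match.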

The ground truth dataset is used to develop both the text retriever and the text reader, as discussed in (Ni et al., 2018). The fine-tuning of the deep learning algorithms is performed with Altilia Models, a platform tool that assists in fine-tuning LLMs. After fine-tuning the text retriever and the text reader, we combine them into an AI skill by means of the platform module named Altilia Skills. The AI skill is then used to configure a workflow with the platform tool Altilia Workflow, making everything ready for the final datapoint collection process.

Validation. In this step, when the accuracy values of extracted data fall below given thresholds, the platform sends a warning to the user, who can inspect the data with the Altilia Reviews tool. In this way users can check, validate and/or correct every single datapoint.
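The threshold check described above can be sketched as a simple routing function. The names `Extraction` and `split_for_review`, the per-datapoint thresholds, and the use of a model confidence score in [0, 1] as the value being compared are assumptions of this example, not a description of the Altilia Reviews tool.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Extraction:
    datapoint: str
    value: str
    confidence: float  # assumed model score in [0, 1]


def split_for_review(
    extractions: List[Extraction],
    thresholds: Dict[str, float],
    default_threshold: float = 0.8,
) -> Tuple[List[Extraction], List[Extraction]]:
    """Accept extractions at or above their threshold, flag the rest for review."""
    accepted, to_review = [], []
    for item in extractions:
        limit = thresholds.get(item.datapoint, default_threshold)
        (accepted if item.confidence >= limit else to_review).append(item)
    return accepted, to_review


# Made-up sample values for illustration.
extractions = [
    Extraction("Ateco", "62.01.00", 0.99),
    Extraction("Green Product", "organic cotton line", 0.55),
]
accepted, to_review = split_for_review(extractions, thresholds={"Ateco": 0.95})
```

Items in `to_review` are the ones a domain expert would be warned about; their corrected values can then flow back into the ground truth dataset.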

The review step (Figure 2) implements a Human-In-The-Loop AI approach (Wu et al., 2021): after each validation action performed by domain experts, the validated data becomes feedback stored in the ground truth dataset, which can be used for the "continuous retraining/fine-tuning" of the AI models embedded in a workflow.

The continuous retraining takes place on demand or can be scheduled to be executed automatically by the platform. In this way, models improve as the platform is used. It is worth noting that human feedback continuously enriches the ground truth dataset with new examples coming from workflow executions; hence the platform can automatically keep AI models up to date, ensuring that they do not degrade over time and avoiding the data drift (model drift) phenomenon (Ackerman et al., 2020).

The validation step and, more generally, the Altilia Reviews tool help in visualizing the collected data for data quality checks (Sheth and Thirunarayan, 2021) and explainable AI (Došilović et al., 2018) purposes. In particular, because we face long and complex documents such as annual and sustainability reports, a tool that allows users to explore the results of deep learning models and provide feedback is critical.

Our experience has taught us that such documents are difficult to label and that many annotations are missed. In the context of ESG data we performed at least two rounds of Human-In-The-Loop interaction for each datapoint, and each time the model brought up a large number of "false" false positives (i.e. correct examples marked as "false positives" because they escaped the human eye during the annotation phase). Hence, we extended the ground truth dataset, significantly improving the final accuracy of the model.

Use. This last step of the workflow exports the extracted data to third-party databases, applications, and tools through Connectors, software artifacts that provide platform interoperability. Connectors and platform APIs give access to the data stored in the Altilia Knowledge Base and export it in different formats such as XML, RDF, JSON, etc. Output connectors can directly feed external systems and applications or interact with other RPA, CRM, CMS, and ERP systems, etc.
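As a small illustration of exporting the same extracted datapoints in two of the formats mentioned above, the sketch below serializes a profile to JSON and to XML with the Python standard library. The function names, the element and attribute names, and the sample company are hypothetical and do not reflect the platform's connectors or APIs.

```python
import json
import xml.etree.ElementTree as ET


def to_json(company, datapoints):
    """Serialize the datapoints of one company as a JSON string."""
    payload = {"company": company, "datapoints": datapoints}
    return json.dumps(payload, ensure_ascii=False, sort_keys=True)


def to_xml(company, datapoints):
    """Serialize the same datapoints as a small XML document."""
    root = ET.Element("company", attrib={"name": company})
    for name, value in datapoints.items():
        node = ET.SubElement(root, "datapoint", attrib={"name": name})
        node.text = value
    return ET.tostring(root, encoding="unicode")


# Made-up sample values for illustration.
datapoints = {"Ateco": "62.01.00", "Environmental Certifications": "ISO 14001"}
as_json = to_json("Acme S.p.A.", datapoints)
as_xml = to_xml("Acme S.p.A.", datapoints)
```

Keeping the in-memory representation independent of the output format means that a further serializer can be added without touching the extraction workflow.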

5 EXPERIMENTS

In this section we describe experiments carried out in

order to train AI Skills for the recognition of ESG

datapoints. First we present the dataset and the an-

notations within it. Then, we describe in depth the

experiments and the obtained results.

5.1 Dataset

We focused on three types of unstructured data sources: annual reports, sustainability reports and company websites. In particular, the experiments take into account 322 annual reports, 185 sustainability reports and 2495 web pages. The experiments involve around 2000 companies spread over various industries, with revenues ranging from 0-2 million to >1 billion.

Table 1 shows the number of annotations for each datapoint (rows) in each document type (columns). The ⟨datapoint, document⟩ pairs shown in bold in the table are the ones with a sufficient number of annotations to continue with the experimental phase.

Table 1: Number of annotations for each pair ⟨datapoint, document⟩.

Datapoint                      Sustainability Report   Annual Report   Web pages
Company Description            242                     592             645
Ateco                          2                       577             -
Green Product                  371                     15              146
Non-Renewable Energy           192                     1               -
Efficiency Initiatives         274                     21              19
Environmental Certifications   457                     68              173

5.2 Experimental Settings

In order to extract the six datapoints of interest, we created eight AI skills. Each skill listed here is capable of extracting a datapoint from a specific document layout:

(i) Company Description in text, composed of a syntactic retriever and a neural reader;
(ii) Company Description in table, composed of a syntactic retriever and a neural reader;
(iii) Ateco, composed of a syntactic retriever and a syntactic reader;
(iv) Green Product, composed of a neural retriever and a neural reader;
(v) Non-Renewable Energy in text, composed of a neural retriever;
(vi) Non-Renewable Energy in table, composed of a syntactic retriever;
(vii) Efficiency Initiatives, composed of a neural retriever and a neural reader;
(viii) Environmental Certifications, composed of a syntactic retriever and a syntactic reader.

For the Company Description and Non-Renewable Energy datapoints, we created two different AI skills to handle data found in tables and data found in text paragraphs. By syntactic retriever we mean BM25-like algorithms (Kim and Gil, 2019; Amati, ), while by syntactic reader we mean rules written in ASIEL. Finally, by neural retriever and neural reader we always mean LLMs and Transformer-based algorithms (Vaswani et al., 2017).

Table 2: Results of applying AI skills to documents of interest.

Datapoint                      Document Type           Precision Lenient (%)   Recall Lenient (%)   F1 Lenient (%)
Company Description - text     Annual Report           61.00                   73.61                63.32
Company Description - table    Annual Report           97.00                   99.00                97.99
Ateco                          Annual Report           99.97                   100.00               99.98
Green Product                  Sustainability Report   77.50                   96.50                84.30
Green Product                  Web Page                35.71                   100.00               52.63
Non-Renewable Energy - text    Sustainability Report   55.56                   73.61                63.62
Non-Renewable Energy - table   Sustainability Report   71.00                   89.00                78.99
Efficiency Initiatives         Sustainability Report   50.00                   60.00                54.00
Environmental Certifications   Annual Report           40.02                   98.31                57.89
Environmental Certifications   Sustainability Report   26.26                   99.56                41.56
Environmental Certifications   Web Page                33.12                   92.31                48.75

All results presented in the next section have been obtained by 5-fold cross validation. Since some of the datapoints are long text paragraphs, we decided to use a lenient F1 as the evaluation metric, where a prediction counts as a true positive when the predicted span intersects the annotated span (span prediction ∩ span annotation) AND the two have at least 30% of their words in common.
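The lenient matching rule can be sketched as follows. This is one possible reading of the definition and not the authors' evaluation code: spans are assumed to be character offsets with an exclusive end, the 30% word overlap is assumed to be measured against the words of the annotation, each annotation is matched at most once, and the sample spans are invented.

```python
def spans_overlap(a, b):
    """a and b are (start, end) character offsets, end exclusive."""
    return a[0] < b[1] and b[0] < a[1]


def word_overlap(pred_text, gold_text):
    """Share of annotation words that also occur in the prediction."""
    pred_words = set(pred_text.lower().split())
    gold_words = set(gold_text.lower().split())
    if not gold_words:
        return 0.0
    return len(pred_words & gold_words) / len(gold_words)


def lenient_prf(predictions, annotations, min_overlap=0.30):
    """Each item is (start, end, text). Returns (precision, recall, f1)."""
    matched_gold = set()
    tp = 0
    for p_start, p_end, p_text in predictions:
        for i, (g_start, g_end, g_text) in enumerate(annotations):
            if i in matched_gold:
                continue
            if (spans_overlap((p_start, p_end), (g_start, g_end))
                    and word_overlap(p_text, g_text) >= min_overlap):
                matched_gold.add(i)
                tp += 1
                break
    precision = tp / len(predictions) if predictions else 0.0
    recall = tp / len(annotations) if annotations else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


annotations = [
    (0, 40, "we reduced energy use in all our plants"),
    (100, 130, "solar panels on every roof"),
]
predictions = [
    (10, 60, "energy use in all our plants fell sharply"),
    (200, 220, "unrelated sentence here"),
]
precision, recall, f1 = lenient_prf(predictions, annotations)
```

In this toy case the first prediction overlaps the first annotation and shares six of its eight words, so it counts as a true positive, while the second prediction matches nothing.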

5.3 Results and Discussion

Table 2 shows the obtained results.

In general, we obtained good recall on all datapoints, which means that both syntactic and neural retrieval perform well. The worst, but still acceptable, datapoint in terms of recall is Efficiency Initiatives (60.00%): capturing this datapoint is very complex because of its semantics. The precision of some datapoints is instead below an acceptable threshold and will be the subject of future work.

When analyzing the results of these experiments, we have to take into consideration that we aimed at AI skills that generalize across all document categories. This means, for example, that the same model used for extracting Green Product from sustainability reports was also used for extracting the same datapoint from web pages. This explains why we have a precision of 77.50% in the first case and just 35.71% in the second. These models are still far from generalizing across different document categories.

In contrast, the syntactic retriever performs very well on pipelines that involve extracting data from tables (Company Description and Non-Renewable Energy: 97.00% and 71.00% precision, with 99.00% and 89.00% recall, respectively). This is due to the fact that tables, by their nature, report information in the form of keywords.

6 CONCLUSIONS AND FUTURE WORK

In this paper, we discussed how to build automatic
workflows that collect environmental, social, and
governance (ESG) data from different types of
unstructured data sources, and the challenges associated
with ESG data collection. We discussed extensively how
the Altilia Intelligent Automation platform and its
modules are used to create AI skills and workflows that
automate ESG data collection from documents with
different formats and layouts, such as PDF documents,
image documents, flat text, table structures, and web
pages. We presented experiments carried out to extract
five datapoints leveraging eight AI skills that use
different data extraction techniques based on a
retriever-reader pipeline. Information of interest was
extracted from different elements of the layout, such as
text paragraphs and tables.

The ESG data extraction capabilities of the AIA
platform can be further improved by enhancing its
document layout analysis and recognition algorithms. As
future work, we are extending the AIA platform to
enable complete machine reading comprehension
methods that provide full question answering features,
allowing data to be extracted from documents in a
conversational manner.

Finally, we are working on extending the initial
set of datapoints in order to cover the ESG taxonomy
approved by the European Parliament, so that we can
meet the ESG data collection needs of banks and
other players in the European financial services arena.

ICEIS 2023 - 25th International Conference on Enterprise Information Systems

ACKNOWLEDGEMENTS

This paper has been supported by the following

projects:

• “ESG - Alternative data in credit management”,
realized in the context of the first call for projects of
the Fintech Milano Hub;

• “Validated Question Answering” n.
F/190114/01/X44 - CUP: B28I20000040005
PON “I&C” 2014-2020 FESR - And for sustainable
growth - Sustainable manufacturing DM
05.03.2018 - DD 20/11/2018, art. 38, 47 and 48
D.P.R. n. 445 of 28/12/2000.

REFERENCES

Ackerman, S., Farchi, E., Raz, O., Zalmanovici, M., and

Dube, P. (2020). Detection of data drift and outliers

affecting machine learning model performance over

time.

Amati, G.

de Franco, C., Geissler, C., Margot, V., and Monnier, B.
(2020). ESG investments: Filtering versus machine
learning approaches.

Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2018).

BERT: pre-training of deep bidirectional transformers

for language understanding. CoRR, abs/1810.04805.

Došilović, F. K., Brčić, M., and Hlupić, N. (2018).
Explainable artificial intelligence: A survey. In 2018 41st
International Convention on Information and
Communication Technology, Electronics and Microelectronics
(MIPRO), pages 0210–0215.

Gupta, A., Sharma, U., and Gupta, S. K. (2021). The role of
ESG in sustainable development: An analysis through
the lens of machine learning. In 2021 IEEE International
Humanitarian Technology Conference (IHTC),
pages 1–5.

Hughes, A., Urban, M. A., and Wójcik, D. (2021).
Alternative ESG ratings: How technological innovation is
reshaping sustainable investment. Sustainability, 13(6).

Karpukhin, V., Oguz, B., Min, S., Wu, L., Edunov, S.,

Chen, D., and Yih, W. (2020). Dense passage re-

trieval for open-domain question answering. CoRR,

abs/2004.04906.

Kim, S.-W. and Gil, J.-M. (2019). Research paper
classification systems based on TF-IDF and LDA schemes.
Human-centric Computing and Information Sciences,
9(1):30.

https://www.bancaditalia.it/focus/milano-hub/

Lee, O., Joo, H., Choi, H., and Cheon, M. (2022).
Proposing an integrated approach to analyzing ESG data via
machine learning and deep learning algorithms.
Sustainability, 14(14).

Macpherson, M., Gasperini, A., and Bosco, M. (2021).
Artificial intelligence and fintech technologies for ESG
data and analysis (February 15, 2021).

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).

Efﬁcient estimation of word representations in vector

space.

Nguyen, Q.-D., Le, D.-A., Phan, N.-M., and Zelinka, I.
(2021). OCR error correction using correction patterns
and self-organizing migrating algorithm. Pattern
Analysis and Applications, 24(2):701–721.

Ni, J., Zhu, C., Chen, W., and McAuley, J. (2018). Learning

to attend on essential terms: An enhanced retriever-

reader model for open-domain question answering.

Pennington, J., Socher, R., and Manning, C. (2014). GloVe:

Global vectors for word representation. In Proceed-

ings of the 2014 Conference on Empirical Methods in

Natural Language Processing (EMNLP), pages 1532–

1543, Doha, Qatar. Association for Computational

Linguistics.

Rony, M. R. A. H., Zuo, Y., Kovriguina, L., Teucher, R., and

Lehmann, J. (2022). Climate bot: A machine read-

ing comprehension system for climate change ques-

tion answering. IJCAI.

Schultz, M. and Tropmann-Frick, M. (2020). Autoencoder

neural networks versus external auditors: Detecting

unusual journal entries in ﬁnancial statement audits.

Sheth, A. P. and Thirunarayan, K. (2021). The in-

escapable duality of data and knowledge. CoRR,

abs/2103.13520.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J.,

Jones, L., Gomez, A. N., Kaiser, L., and Polo-

sukhin, I. (2017). Attention is all you need. CoRR,

abs/1706.03762.

Wu, X., Xiao, L., Sun, Y., Zhang, J., Ma, T., and He, L.

(2021). A survey of human-in-the-loop for machine

learning. CoRR, abs/2108.00941.
