InfoGenie: A Chatbot that Enhances Information Extraction Using
Modern Natural Language Processing Techniques
Yerram Deekshith Kumar^a, Manash Pratim Lahkar^b, Aditya Kumar Singh^c, Biki Dey^d and Utpal Sharma^e
Department of Computer Science and Engineering, Tezpur University, Napaam, Assam, India
^a https://orcid.org/0009-0001-1647-8522, ^b https://orcid.org/0009-0008-7244-1653, ^c https://orcid.org/0009-0009-3588-1366,
^d https://orcid.org/0009-0000-4342-854X, ^e https://orcid.org/0000-0002-9210-7168
Keywords: Question Answering, Sentence Embeddings, Answer Generation, Information Extraction, PDF Processing.
Abstract: Information extraction and question-answering systems face challenges in efficiently extracting information
from large repositories, particularly when dealing with PDF files. In response, this paper presents an innovative
application of natural language processing (NLP) techniques: we address these challenges by developing an intelligent
chatbot tailored for streamlined information extraction. Leveraging established language models and embeddings,
including the Hugging Face Transformers library and Sentence Transformer models, our solution integrates with the
Chroma vector store. We outline a robust data ingestion process encompassing Portable Document Format (PDF)
document parsing, text segmentation, and document embedding creation. These embeddings serve as the foundation for
a resilient vector store, enhancing information extraction efficiency. The chatbot's underlying model is fine-tuned for
sequence-to-sequence learning, enabling it to generate coherent responses to user queries. Through a user-friendly web
interface powered by Streamlit, users can interact with the chatbot, upload PDF documents, and ask queries about those
documents. Evaluation on a crowdsourced dataset collected by us demonstrates an average cosine similarity of up to
0.95 between generated and ground-truth answers. This research advances NLP-based information extraction systems,
offering practical solutions and insights for future enhancements.
1 INTRODUCTION
The volume and diversity of textual material in the
digital world has risen considerably in recent years,
and an increasing number of papers in various forms
are being generated and exchanged. This flow of
data presents possibilities and problems, particularly
in the fields of information extraction and question-
and-answer systems. Despite the abundance of data
that might give insightful knowledge, users may find
it challenging to locate and extract the specific infor-
mation they seek owing to the volume and variety of
documents.
Traditional keyword-based search engines and in-
formation extraction systems, although occasionally
successful, usually struggle to understand the nu-
ances of human language and accurately interpret
user requests. The proliferation of unstructured ma-
terial, such as PDF documents, exacerbates the prob-
lem, since standard search algorithms struggle to sort
through its complexities.
In this research project, we developed an intelli-
gent chatbot capable of understanding user requests
and responding in natural language. The project's purpose is to bridge the gap between users and the massive amounts of information contained in digital archives. The chatbot's goal is to improve users' access to relevant information by utilizing sophisticated language understanding skills and innovative document processing techniques, allowing them to make more informed decisions and gain deeper insights from textual data.
2 APPLICATIONS
Our study has broad applicability in numerous sec-
tors. Some of the main applications are:
2.1 Information Extraction
The key use case for our chatbot is information
extraction, which allows users to quickly search
through a library of papers for relevant informa-
tion.
This tool is beneficial in academic contexts, cor-
porate settings, research, and any other circum-
stance in which quick access to certain informa-
tion is required.
2.2 Document Summarization
Our chatbot can simplify and summarize long-form documents, allowing users to rapidly grasp the essential arguments and conclusions without reading them in full.
This capability is beneficial for professionals who need to quickly extract critical information from long reports or research papers.
2.3 Customer Support and FAQs
Businesses may provide 24-hour customer service
by incorporating our chatbot into their customer
care systems.
The chatbot can respond to customer queries regarding goods, services, rules, and troubleshooting in
a timely and accurate manner, relieving the bur-
den on human support representatives.
2.4 Educational Tools
In the field of education, our chatbot could function as a virtual study assistant or tutor, assisting students in identifying relevant literature, responding to questions, and offering clarification on a range of topics.
Furthermore, it may assist teachers in creating in-
teractive lectures and determining how effectively
their pupils understand the subject taught in class.
2.5 Legal and Compliance
Law firms and legal departments can use the chatbot to search statutes, legal documents, and case law for pertinent precedents, regulations, and legal interpretations.
Although this kind of automated chatbot shows promise, several issues need to be resolved. One of the main challenges is creating precise algorithms to examine the intricate structure of digital documents, especially PDF files. Maintaining real-time responsiveness while managing massive amounts of data effectively and at scale is another major technical issue.
3 LITERATURE REVIEW
This section covers research articles from areas such as NLP, information extraction, and information retrieval:
3.1 Research on Document Similarity, Summarization, and Retrieval
Our chatbot system's primary tasks are retrieval, summarization, and document similarity. Here, we examine a number of studies that investigate methods for these tasks:
Ascione and Sterzi conduct a comparative analysis of embedding models for measuring patent similarity (Ascione and Sterzi, 2023). Embedding models capture semantic relationships between documents, and this analysis can inform the selection of an appropriate model for our chatbot to evaluate document relevance.
Nallapati et al. propose an abstractive text summarization approach based on sequence-to-sequence recurrent neural networks (RNNs) (Nallapati et al., 2016). Abstractive summarization is the process of creating fresh, succinct summaries that highlight the key ideas in a document. Although our first focus is on retrieval, abstractive summarization could be a useful addition to our chatbot in the future.
For retrieval tasks, Askari et al. provide an effective transformer-based re-ranker (Askari et al., 2023). Re-ranking is the process of selecting the best documents from a set of candidates retrieved by an earlier stage. According to their research, transformer-based models may significantly increase the information extraction accuracy of our chatbot (Askari et al., 2023).
Jiang et al. offer a technique for ranking long documents using query-directed sparse transformers (Jiang et al., 2023). Ranking refers to the process of sorting returned documents based on their relevance to
the user’s query. Jiang’s work on sparse transform-
ers provides excellent ways for handling long doc-
uments when our chatbot works with huge datasets
(Jiang et al., 2023).
Together, these studies provide innovative ap-
proaches for increasing the efficiency and quality of
information extraction, which is critical for our chat-
bot’s ability to identify relevant material for users.
3.2 Integration of Language Models and Artificial Intelligence (AI) in Educational and Enterprise Settings
Several studies explore integrating large language
models and AI services into educational and en-
terprise applications, demonstrating the potential of
these technologies in real-world settings:
Hsain and El Housni (Hsain and El Housni, 2023)
investigate the use of large language model-powered
chatbots to support students in higher education.
Their work suggests that large language models can
be beneficial for educational chatbots, providing a
foundation for our chatbot’s ability to interact with
users and answer their questions.
Jeong (Jeong, 2023) explores the implementation
of generative AI services in enterprise applications.
Generative AI models can be used for various tasks,
including text generation and chatbot development.
Jeong’s work highlights the potential for generative
AI to enhance the capabilities of enterprise chatbots,
providing insights for us to consider as we develop
our own chatbot.
Taipalus (Taipalus, 2023) discusses fundamen-
tal concepts and challenges associated with vector
database management systems, which are essential
for storing and retrieving high-dimensional data like
document embeddings. Efficient storage and retrieval
of document embeddings are crucial for our chatbot’s
performance. Taipalus’s work highlights the impor-
tance of considering appropriate data storage solu-
tions for our chatbot.
Shen et al. (Shen et al., 2023) propose a frame-
work for memory augmentation using language mod-
els, offering insights for enhancing the chatbot’s
knowledge retention and retrieval capabilities. Mem-
ory augmentation techniques can improve a chatbot’s
ability to access and process information, potentially
benefiting our chatbot’s ability to answer follow-up
questions and engage in multi-turn conversations.
Van de Cruys et al. (Van de Cruys et al., 2022) in-
vestigate question-answering techniques for technical
documents. While our initial focus might be on Infor-
mation extraction, incorporating question-answering
capabilities could be a valuable future extension for
our chatbot. Van de Cruys et al.’s work provides in-
sights into techniques for enabling our chatbot to an-
swer user questions directly within retrieved docu-
ments.
Adiba et al. (Adiba et al., 2023) propose meth-
ods for unsupervised domain adaptation in question-
answering systems. Unsupervised domain adaptation
allows a model to be trained on data from one domain
(e.g., general knowledge) and then applied to a dif-
ferent domain (e.g., legal documents) where labeled
data is scarce. Incorporating question-answering capabilities would be especially valuable for our chatbot when dealing with domain-specific documents. Adiba et al.'s work suggests that unsupervised
domain adaptation techniques could help enable our
chatbot to answer questions about these specialized
documents even if limited training data is available in
that specific domain (Adiba et al., 2023).
Cohan et al. (Cohan et al., 2023) explore the use of
pre-trained language models for sequential sentence
classification tasks. Sentence classification involves
categorizing sentences based on their meaning. While
our initial focus might be on retrieval, incorporating
functionalities like sentiment analysis or topic classi-
fication could be valuable extensions for our chatbot.
Cohan et al.’s work suggests that pre-trained language
models can be effective for these tasks, providing a
foundation for us to explore adding such functionali-
ties in the future.
Kamma (Kamma, 2023) discusses language mod-
eling for intelligent Information extraction systems.
Kamma’s work emphasizes the role of language mod-
els in understanding the semantics of documents and
queries, which is essential for effective retrieval.
Their insights can inform our selection and applica-
tion of language models within our chatbot’s retrieval
system.
Lappalainen and Narayanan (Lappalainen and
Narayanan, 2023) describe Aisha, a custom AI library
chatbot built using the ChatGPT API. While this work
directly utilizes an existing API, it showcases the po-
tential for building custom chatbots with capabilities
similar to our envisioned intelligent Information ex-
traction chatbot.
Trust et al. (Trust et al., 2024) explore techniques
for augmenting large language models to enhance in-
teraction with government data repositories. While
their focus is on a specific domain (government data),
their work highlights the potential for ongoing re-
search and development in large language models,
[Figure 1: System Architecture. PDF files are split into text chunks, which are embedded to build a semantic index (the knowledge base); the user's question is embedded as a query, matched against the index via semantic search, and the ranked results are passed to the LLM generative AI to produce the response.]
which can inform future advancements in our own
chatbot's capabilities.
Vaswani et al. (Vaswani et al., 2017) introduce the
Transformer architecture, a neural network architec-
ture that has become foundational for many state-of-
the-art NLP models, including some of the works dis-
cussed previously (e.g., Askari et al. 2023). Under-
standing the core principles of the Transformer archi-
tecture can provide valuable background knowledge
for us as we develop our chatbot.
Wang et al. (Wang et al., 2019) discuss language
models with transformers, providing a more in-depth
exploration of this architecture and its applications in
various NLP tasks. Similar to Vaswani et al. (2017),
this work can provide a deeper understanding of the
technical foundations underlying some of the recent
advancements in NLP relevant to our chatbot devel-
opment.
In conclusion, this review has explored a wide
range of advancements in NLP, Information extrac-
tion, and chatbot systems. The referenced studies of-
fer valuable insights into methodologies that can be
harnessed to enrich the capabilities of our intelligent
Information extraction chatbot. Key takeaways in-
clude:
Building an effective chatbot requires mastery of essential information extraction methodologies. Embedding models help capture the semantics of PDF content, and abstractive summarization can condense large documents into concise, understandable summaries. Transformer-based models improve retrieval, providing rapid access to important information. Integrating large language models with AI services such as educational chatbots and generative AI enhances capabilities for question answering and sophisticated information processing. Keeping up with NLP breakthroughs enables constant progress, and understanding the Transformer architecture supports informed decisions about how to deploy these technologies, such as creating a robust, user-friendly chatbot that successfully answers questions and offers important information.
4 SYSTEM ARCHITECTURE
Our system’s functionalities include processing user
queries quickly, retrieving pertinent data from a
knowledge base, and producing context-aware re-
sponses. Figure 1 shows the information flow from
user queries to the creation of ranked results, repre-
senting the system architecture. Important parts of the
architecture include the generative model, knowledge
base, text chunking, and embeddings generation.
Algorithm 1 describes the system’s workflow.
Algorithm 1: NLP-based Query Chatbot.

Require: User query q, PDF documents D = {d_1, d_2, ..., d_n}
Ensure: Context-aware response r
function CHATBOT(q, D)
    Extract text chunks: T ← {}
    for d_i ∈ D do
        T ← T ∪ EXTRACTCHUNKS(d_i)
    end for
    Embed chunks: E ← {}
    for t_j ∈ T do
        E ← E ∪ EMBEDDING(t_j)
    end for
    Create semantic index I from E
    Transform query: q_embed ← EMBEDDING(q)
    Retrieve relevant embeddings: rel_emb ← SEMANTICSEARCH(I, q_embed)
    Rank results: ranked_res ← RANK(rel_emb)
    Generate response: r ← GENERATERESPONSE(ranked_res)
    return r
end function
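For illustration, the following is a toy, runnable rendering of Algorithm 1 in Python. To keep it self-contained it replaces the PDF documents with plain strings, uses a Sentence-Transformers model (an illustrative choice, not necessarily the paper's exact configuration) for the EMBEDDING step, uses cosine similarity for SEMANTICSEARCH and RANK, and reduces GENERATERESPONSE to returning the top-ranked chunk; the full pipeline is described in Section 5.

```python
# Toy rendering of Algorithm 1 (assumed simplifications noted above).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # illustrative embedding model

def extract_chunks(document, size=200):
    # Stand-in for ExtractChunks: fixed-size character chunks.
    return [document[i:i + size] for i in range(0, len(document), size)]

def chatbot(query, documents):
    # T <- union of ExtractChunks(d_i) over all documents d_i in D
    chunks = [c for d in documents for c in extract_chunks(d)]
    # E <- Embedding(t_j) for every chunk t_j in T; E plays the role of the semantic index I
    index = model.encode(chunks, normalize_embeddings=True)
    # q_embed <- Embedding(q)
    q_embed = model.encode([query], normalize_embeddings=True)[0]
    # SemanticSearch + Rank: order chunks by cosine similarity to the query embedding
    scores = index @ q_embed
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    # GenerateResponse: reduced here to returning the most relevant chunk
    return ranked[0][0]
```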
4.1 System Components
The system is made up of a number of interrelated
components, each of which plays an important part in
obtaining chatbot functionality. Table 1 presents an
overview of these components and their interconnec-
tions.
4.2 Components of the Language Model Pipeline
The Language Model Pipeline consists of the following key components:
MiniLM Model: HuggingFace Transformers’
MiniLM model, a transformer-based language
model, serves as the pipeline’s primary compo-
nent. MiniLM was chosen because it excels in un-
derstanding the context of user requests and pro-
viding well-reasoned responses. The model un-
derstands complicated linguistic patterns since it
has been pre-trained on a large corpus.
Table 1: System Components Overview.

Component             Description
User                  Initiates queries and interacts with the chatbot.
Question              User's input in the form of a query.
Text Chunks           Segments of text extracted from PDF files.
Embeddings            Numerical representations of text chunks.
PDF Files             Source documents containing relevant information.
Knowledge Base        Repository of text chunks and embeddings.
LLM Generative Model  Language model for context-aware response generation.
Semantic Search       Process of retrieving information based on semantic similarity.
Ranked Results        Ordered list of relevant results.

Tokenizer: Tokenization is the process of transforming input text into units that the MiniLM model understands. This pipeline uses HuggingFace's AutoTokenizer, making it straightforward to load the MiniLM model's pre-trained tokenizer.

Pipeline Parameters: A variety of settings may be configured to optimize the behavior and performance of the Language Model Pipeline. Table 2 summarizes the parameters utilized in the pipeline.

Table 2: Language Model Pipeline Parameters.

Parameter     Value
Model         MiniLM
Max Length    256
Sampling      True
Temperature   0.3

5 METHODOLOGY

The procedures for text preparation, indexing, query processing, and result presentation are as follows:

5.1 Text Preprocessing and Representation

PDF files are processed using PyPDFLoader and PDFMinerLoader, and their textual content is extracted. The extracted text is then divided into manageable chunks.

Let D be the collection of PDF documents, and let T_i be the set of text chunks extracted from each document d_i in D. Each document d_i in D is iterated over and analysed individually to extract its associated text chunks T_i:

    T_i = ExtractChunks(d_i)                                   (1)

Once the text chunks T_i have been collected for each document d_i, sentences are extracted from them. Let S_i be the collection of sentences taken from the text chunks T_i; each text chunk t_j in T_i is divided into separate sentences s_k:

    S_i = ⋃_{t_j ∈ T_i} { s_k | s_k is a sentence in t_j }     (2)

After obtaining the collection of sentences S_i for each document d_i, embeddings are created for each sentence to support semantic analysis. Let E_i be the collection of embeddings corresponding to the sentences in S_i. During the embedding process, each sentence s_k in S_i is iterated over and the corresponding embedding e_k is computed:

    E_i = { e_k | s_k ∈ S_i }                                  (3)

The embeddings E_i of each document d_i are then used to build a semantic index I for efficient information extraction. Finally, the text chunks, sentences, and embeddings of each document are ready to be used in responding to user queries.
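As a concrete illustration of this ingestion step (Equations 1 to 3), the sketch below combines the PyPDFLoader mentioned above with a text splitter and a Sentence-Transformers model. The import paths assume a recent LangChain release, and the chunk size, overlap, and all-MiniLM-L6-v2 embedding checkpoint are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch of PDF ingestion: extract text (Eq. 1), split into chunks/sentences (Eq. 2),
# and embed each piece of text (Eq. 3). Settings are illustrative.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

def ingest(pdf_paths):
    # Extract the text of each PDF d_i and split it into manageable chunks T_i.
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = []
    for path in pdf_paths:
        pages = PyPDFLoader(path).load()
        chunks.extend(splitter.split_documents(pages))

    # Compute an embedding e_k for every chunk, giving the collection E_i.
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    texts = [chunk.page_content for chunk in chunks]
    embeddings = model.encode(texts)
    return texts, embeddings
```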
5.2 Indexing and Storage
We store the sentence embeddings in a semantic index (e.g., Chroma). Integrating with Chroma improves both the accuracy and the speed of information extraction. The vector store is essential to similarity searches, enabling the system to retrieve data efficiently and correctly based on the semantic knowledge embedded in the vectors.
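For illustration, a minimal sketch of storing the embeddings in a Chroma collection, assuming the chromadb client API; the collection name and the embedding model are placeholders, and an in-memory client is used where the real system could use a persistent one.

```python
# Sketch of building the semantic index in Chroma (assumptions noted above).
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # illustrative choice
client = chromadb.Client()  # in-memory; chromadb.PersistentClient(path=...) would store on disk
collection = client.get_or_create_collection("infogenie_chunks")

def index_chunks(texts):
    # Store each text chunk together with its sentence embedding in the vector store.
    embeddings = model.encode(texts).tolist()
    ids = [f"chunk-{i}" for i in range(len(texts))]
    collection.add(ids=ids, documents=texts, embeddings=embeddings)
```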
5.3 Query Processing

When a user query q is received, it is tokenized to prepare it for processing. Let q_embedding denote the query's embedding, obtained using the same procedure as for the text chunks. To obtain the relevant results, a semantic search for the most relevant embeddings is run on the stored embeddings in Chroma.
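A minimal sketch of this query step, assuming the collection built in the indexing sketch above; converting Chroma's distances into similarity-style scores is an illustrative choice rather than the paper's exact scoring.

```python
# Sketch of query processing: embed the query with the same model used for the
# chunks, then run a similarity search against the Chroma collection.
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # same model as for the chunks
collection = chromadb.Client().get_or_create_collection("infogenie_chunks")

def semantic_search(query, top_k=5):
    q_embedding = model.encode([query]).tolist()
    result = collection.query(query_embeddings=q_embedding, n_results=top_k)
    # Chroma returns parallel lists of documents and distances for each query.
    docs = result["documents"][0]
    distances = result["distances"][0]
    # Turn distances into similarity-style scores so that higher means more relevant.
    return [(doc, 1.0 - dist) for doc, dist in zip(docs, distances)]
```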
5.4 Result Presentation
Based on how well the returned results match the
query semantically, they are ordered. The ordered
list of pertinent results from the semantic search
may be represented as follows: ranked_results = {r_1, r_2, ..., r_k}, where r_i denotes the i-th result. Each result r_i is a tuple (text_i, score_i), with text_i denoting the result's text content and score_i denoting its semantic similarity score with the query.

The final response is produced using the Language Model (LM) generative AI. Let A(q, ranked_results) denote the function that produces the response based on the query q and the ranked results. The answer a is obtained as:

    a = Restructure( argmax_{r_i ∈ ranked_results} Score(r_i) )    (4)

In this notation:
a represents the final answer.
argmax selects the result r_i with the highest score from the ranked results.
Score(r_i) denotes the score assigned to each result r_i.
Restructure is the function that restructures the selected answer before presenting it as the final response.
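For illustration, a minimal sketch of this generation step using a Hugging Face text2text-generation pipeline configured with the parameters from Table 2. The paper identifies its model only as MiniLM, so the checkpoint below (google/flan-t5-base) is a stand-in, and the prompt format is an assumption.

```python
# Sketch of answer generation (Eq. 4): take the top-ranked chunks as context and
# let the generative model restructure them into the final answer.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")  # stand-in checkpoint

def generate_answer(query, ranked_results, top_k=3):
    # ranked_results is a list of (text_i, score_i) tuples sorted by score.
    context = "\n".join(text for text, _score in ranked_results[:top_k])
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    out = generator(prompt, max_length=256, do_sample=True, temperature=0.3)  # Table 2 parameters
    return out[0]["generated_text"]
```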
5.5 Dataset

We decided to use Google Forms for our assessment dataset annotations, employing a crowdsourcing approach. Participants received the PDF documents for the evaluation task and were asked to come up with questions and answers based on their content; the goal was to collect these annotations directly through the form. To guarantee that the annotations were of high quality and pertinent, we specified standards including clarity, correctness, and alignment with the PDF text.

Six PDF files make up our dataset, which has 2,295 tokens and 982 unique words in total. Each paragraph has around 382.5 tokens on average, dispersed over 14 sentences, or roughly 27.32 tokens per sentence. We collected 264 submissions in all, with 44 questions on average per PDF file and responses that were around 18.14 words long. We carefully examined and verified the annotations during the data gathering procedure to ensure their validity. The evaluation dataset¹ was enhanced with a multitude of viewpoints and insights by our crowdsourcing technique, which was made possible by the wide range of questions and answers from human annotators.

[Figure 2: Count of questions for each document. Bar chart of the number of questions collected for each of the six PDF files.]

Recognizing its limitations, we intend to increase the scope of our crowdsourced dataset by including a greater variety of document formats and query types. This will entail gathering information from a variety of sources and making sure that our dataset accurately reflects a range of real-world circumstances. Our goal is to increase the assessment dataset's diversity in order to enhance the chatbot's resilience and generalizability when processing various kinds of documents and questions.

¹ https://drive.google.com/drive/folders/1BHUw0u2jb9fCIS5RWJcrweD9s4KSAgt?usp=sharing
6 EXPERIMENTS AND RESULTS

We conduct a comprehensive investigation of the chatbot's performance in the Colab environment using scores from the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and a range of word embedding models to determine how effectively the bot understands user queries and generates relevant responses.

Table 3 shows that the average cosine similarity scores for the various embeddings (GloVe, Word2Vec, FastText, and BERT) vary from 0.82 to 0.95. These scores indicate the semantic closeness of the generated responses to the ground truth answers. We use four different embeddings, GloVe (glove-wiki-gigaword-300), Word2Vec (word2vec-google-news-300), FastText (fasttext-wiki-subwords-300), and BERT (bert-base-uncased), to turn the ground truth answer and the predicted answer into vectors.

Table 3: Average Cosine Similarity Scores for Different Embeddings.

Doc No.    glove-wiki-    word2vec-         fasttext-wiki-   bert-base-
           gigaword-300   google-news-300   subwords-300     uncased
1.pdf      0.84           0.73              0.91             0.84
2.pdf      0.92           0.78              0.94             0.90
3.pdf      0.95           0.89              0.97             0.97
4.pdf      0.94           0.84              0.96             0.94
5.pdf      0.92           0.84              0.96             0.95
6.pdf      0.97           0.92              0.98             0.97
Average    0.92           0.82              0.95             0.92

Following conversion, the similarity between the ground truth answer's and the predicted answer's vectors was measured using cosine similarity. Figure 3 displays the document-wise cosine similarity of the predicted responses with the ground truth.

Additionally, we evaluated the quality of the chatbot's answers using ROUGE scores, a widely used natural language processing metric. The ROUGE scores obtained were:

ROUGE-1: Recall (r) = 0.54, Precision (p) = 0.44, F1-score (f) = 0.46
ROUGE-2: Recall (r) = 0.31, Precision (p) = 0.28, F1-score (f) = 0.28
ROUGE-L: Recall (r) = 0.48, Precision (p) = 0.40, F1-score (f) = 0.42
The reported ROUGE scores show that the chatbot's capacity to produce replies with significant lexical and semantic overlap with the ground truth answers still needs improvement, and further work is required to increase the relevancy and accuracy of its answers. By combining the ROUGE score evaluation with the analysis of average cosine similarity scores, we gain a comprehensive understanding of the chatbot's ability to respond to user inquiries in a relevant and accurate manner. These measures are useful for judging the quality and usefulness of the chatbot's replies in real-world circumstances.
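For illustration, a minimal sketch of this evaluation: the generated and ground-truth answers are turned into vectors and compared with cosine similarity, and ROUGE scores are computed on the same pair. It assumes gensim's downloadable word-vector models and the rouge-score package, and uses simple word-vector averaging to obtain an answer vector; the BERT variant would instead use contextual embeddings.

```python
# Sketch of the evaluation: cosine similarity between answer vectors plus ROUGE.
import numpy as np
import gensim.downloader as api
from rouge_score import rouge_scorer

word_vectors = api.load("glove-wiki-gigaword-300")  # or the word2vec / fasttext models from Table 3

def answer_vector(text):
    # Average the word vectors of the in-vocabulary tokens of an answer.
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(word_vectors.vector_size)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def evaluate(generated, reference):
    cos = cosine_similarity(answer_vector(generated), answer_vector(reference))
    rouge = scorer.score(reference, generated)  # precision, recall, F1 per ROUGE variant
    return cos, rouge
```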
[Figure 3: Document-wise average cosine similarity of the answers. Grouped bar chart of cosine similarity (0.0 to 1.0) for each PDF document (1.pdf to 6.pdf) under the glove-wiki-gigaword-300, word2vec-google-news-300, fasttext-wiki-news-subwords-300, and bert-base-uncased embeddings.]

7 WEB-BASED USER INTERFACE FOR ENHANCED INTERACTION

This section introduces our chatbot application's web-based user interface, which is intended for interactive information retrieval and question-answering. Streamlit and simple HTML templates were used in the interface's development to ensure that users could easily query information from documents.

Figure 4: User Interface of Chatbot Web Application.

7.1 Article Display

The article about the Bhopal gas tragedy is displayed on the left side of the interface. The causes of the incident, the number of casualties, and other relevant details are all covered in detail in this article. Users can consult the original content while interacting with the chatbot because the article is presented in an easily readable format.

7.2 Chatbot Interaction Panel

Users can interact with the chatbot through the chat interface on the right side. Users are intended to ask questions about the article's content using this panel. Utilising sophisticated natural language processing methods, the chatbot retrieves pertinent data from the document in response to user inquiries.

7.3 Illustrative Interaction

A typical user interface interaction is shown in Figure 4. In this instance, the user poses particular queries regarding the Bhopal gas tragedy, and the chatbot provides precise answers taken directly from the article. The exchange shows off the chatbot's ability to:
Accurately identify and retrieve key details such as the number of casualties.
Explain the reasons behind the tragedy, including safety failures and procedural neglect.
Specify the type of gas involved in the incident.
The seamless user experience is made possible by the interface's aesthetic appeal and functionality, which are ensured by the integration of HTML templates and Streamlit. We carried out formal user testing sessions to make sure the interface satisfies user expectations and demands. A wide range of individuals engaged with the chatbot throughout these sessions, offering qualitative input on the usability, navigation, and general performance of the interface. The input received was crucial in directing further improvements. Our system serves as an example of how contemporary chatbots can greatly improve information extraction and retrieval, enabling users to more effectively extract specific details from lengthy texts.
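For illustration, a minimal Streamlit sketch of the interaction flow described here: upload a PDF, ingest it, and answer questions about it. The functions ingest, index_chunks, semantic_search, and generate_answer refer to the earlier sketches and stand in for the actual application code; the HTML-template styling used in the real interface is omitted.

```python
# Sketch of the web interface wiring with Streamlit (see assumptions above).
import streamlit as st

st.title("InfoGenie")

uploaded = st.file_uploader("Upload a PDF document", type="pdf")
if uploaded is not None:
    # Persist the upload so the PDF loader can read it from disk.
    with open("uploaded.pdf", "wb") as f:
        f.write(uploaded.getvalue())
    texts, _embeddings = ingest(["uploaded.pdf"])  # extract, chunk, and embed the document
    index_chunks(texts)                            # store the chunks in the vector store

question = st.text_input("Ask a question about the document")
if question:
    ranked = semantic_search(question)
    st.write(generate_answer(question, ranked))
```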
7.4 Proof of Concept
It is important to note that this interface is intended
as a proof of concept. While it demonstrates the po-
tential capabilities and effectiveness of the chatbot,
further development and refinement would be neces-
sary for practical deployment and broader application
across different domains and types of documents.
8 LEGAL AND ETHICAL ISSUES
The use of automated information extraction and
question-answering systems in sensitive sectors
presents significant legal and ethical concerns. These
systems must adhere to privacy standards, ensuring
that consumers are aware of data usage and provide
explicit consent. Data anonymization and encryption
are critical for ensuring user privacy. Furthermore,
NLP models might perpetuate biases found in their training data, resulting in biased outputs. To limit this risk, diversified datasets and fairness-aware algorithms should be employed, and frequent bias checks conducted. Transparency helps foster trust by enabling users to understand how decisions are made, and developers must
accept responsibility for the system’s consequences,
including methods for fixing failures. Automated
technologies should support rather than replace hu-
man decision-making, especially in vital sectors such
as healthcare and finance. Ethical norms must be set
to avoid misuse and promote beneficial societal con-
sequences. Addressing these legal and ethical con-
cerns is critical for the proper development and de-
ployment of NLP-based information extraction sys-
tems, which will increase their value and adoption in
real-world applications.
9 CONCLUSION AND FUTURE WORK
In conclusion, our research demonstrates a significant advancement in natural language processing and
Information extraction through the development of an
effective chatbot. Employing various word embed-
ding models and evaluating the chatbot’s performance
using average cosine similarity scores and ROUGE
scores, we have shown its ability to comprehend user
queries and provide accurate responses from a docu-
ment corpus. Despite constraints on directly applying
traditional metrics like precision and recall, our focus
remains on optimizing user experience and facilitat-
ing meaningful interactions. The analysis of average
cosine similarity scores across different embeddings
provides insights into the semantic similarity between
generated responses and ground truth answers. Fur-
thermore, ROUGE scores offer valuable indications
of lexical and semantic overlap, contributing to our
understanding of the chatbot’s performance. These
findings underscore the chatbot’s efficacy in generat-
ing relevant and accurate responses, laying a strong
foundation for future advancements in intelligent doc-
ument querying systems.
Looking ahead, future work aims to enhance the system's power and adaptability. Integrat-
ing advanced language models, exploring diverse vec-
tor store options, and extending features to support
multimedia content will contribute to the continuous
evolution of the chatbot.
A significant avenue for future development in-
volves customizing the system to be organization-
specific. Tailoring the chatbot to address the unique
queries and requirements of different organizations
aims to provide a valuable tool for enhancing Infor-
mation extraction in professional settings.
In summary, the chatbot not only achieves its pri-
mary goal of efficient Information extraction but also
lays the groundwork for further advancements in nat-
ural language processing. This project serves as a
foundation for ongoing research and application de-
velopment, contributing to the continuous evolution
of intelligent systems for document querying.
9.1 Future Directions
We will execute domain-specific assessments in the
financial, legal, and healthcare sectors to give a thor-
ough study of the chatbot’s performance across sev-
eral domains. This will evaluate how well the chat-
bot can handle language unique to its domain, adapt,
and deliver trustworthy information. We will also as-
sess the system’s performance in managing massive
amounts of data and real-time processing, addressing
scalability and efficiency issues. We will investigate
optimisation strategies including model compression
and parallel processing. Formal user testing sessions
will be carried out to improve the user experience by
collecting qualitative input on the usability and in-
terface of the chatbot, which will direct future im-
provements. Finally, by applying sophisticated nat-
ural language understanding algorithms and context
management strategies, we will handle technical is-
sues including managing complicated queries, docu-
ment parsing failures, and keeping context throughout
multi-turn interactions. With these improvements, In-
foGenie should be a more reliable and adaptable tool
for automated information extraction, increasing its
wider application and practicality.
REFERENCES
Adiba, A. I., Homma, T., and Sogawa, Y. (2023). Unsuper-
vised domain adaptation on question-answering sys-
tem with conversation data.
Ascione, G. S. and Sterzi, V. (2023). A comparative analysis
of embedding models for patent similarity.
Askari, A., Verberne, S., Abolghasemi, A., Kraaij, W., and
Pasi, G. (2023). Retrieval for extremely long queries
and documents with rprs: a highly efficient and effec-
tive transformer-based re-ranker.
Cohan, A., Beltagy, I., King, D., Dalvi, B., and Weld, D. S.
(2023). Pretrained language models for sequential
sentence classification.
Hsain, A. and El Housni, H. (2023). Large language model-
powered chatbots for internationalizing student sup-
port in higher education.
Jeong, C. (2023). A study on the implementation of gen-
erative ai services using an enterprise data-based llm
application architecture.
Jiang, J.-Y., Xiong, C., Lee, C.-J., and Wang, W. (2023).
Long document ranking with query-directed sparse
transformer.
Kamma, A. (2023). An approach to language modelling for
intelligent document retrieval system.
Lappalainen, Y. and Narayanan, N. (2023). Aisha: A cus-
tom ai library chatbot using the chatgpt api.
Nallapati, R., Zhou, B., dos Santos, C., Gülçehre, Ç., and Xiang, B. (2016). Abstractive text summarization using sequence-to-sequence rnns and beyond. IBM Watson.
Shen, J., Dudley, J. J., and Kristensson, P. O. (2023).
Encode-store-retrieve: Enhancing memory augmenta-
tion through language-encoded egocentric perception.
Taipalus, T. (2023). Vector database management systems:
Fundamental concepts, use-cases, and current chal-
lenges.
Trust, P., Omala, K., and Minghim, R. (2024). Augment-
ing large language models for enhanced interaction
with government data repositories. University College
Cork, Cork, Ireland.
Van de Cruys, T., Vanroy, B., and Peirsman, Y. (2022).
Question answering on technical documents.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Google Research.
Wang, C., Li, M., and Smola, A. J. (2019). Language mod-
els with transformers. Amazon Web Services.