Natural Language Processing Applied in the Context of Economic
Defense: A Case Study in a Brazilian Federal Public Administration
Agency
Vanessa Coelho Ribeiro [1,a], Jeanne Louize Emygdio [1,b], Guilherme Pereira Paiva [1,c], Bruno Justino Garcia Praciano [1,d], Valério Aymoré Martins [1,e], Edna Dias Canedo [1,2,f], Fábio Lúcio Lopes Mendonça [1,g], Rafael Timóteo de Sousa Júnior [1,h] and Ricardo Staciarini Puttini [1,i]
[1] National Science and Technology Institute on Cyber Security, Electrical Engineering Department, University of Brasília (UnB), P.O. Box 4466, Brasília DF, Brazil
[2] Department of Computer Science, University of Brasília (UnB), P.O. Box 4466, Brasília DF, Brazil
Keywords:
Natural Language Processing, Artificial Intelligence, Public Administration Agency, Jurisprudence, Antitrust.
Abstract:
Natural Language Processing (NLP) and Machine Learning (ML) resources can be used in Jurisprudence to deal more accurately with the large volume of documents and data in this context, speeding up the execution of processes and bringing greater accuracy to judicial decisions. This article presents applied research with a qualitative approach and exploratory objective, technically characterized as a case study. The research was conducted in a Brazilian federal public administration agency to verify the existence of antitrust practices in the pharmaceutical field and the monitoring of such practices by the institution. To this end, a methodological path was established based on three stages: building the corpus, running the NLP pipeline, and consulting the results in the Jurisprudence Search System (BJ System). In compliance with the objective of the case study, it was possible to identify the performance of the agency around the domain elicited, as well as indications of the existence of antitrust practices, since the 276 documents retrieved from the BJ system relate directly to routine processes executed by the agency, whether in the sense of investigation, trial, or analysis of business practices.
1 INTRODUCTION
Natural Language Processing (NLP) (ISO and IEC, 2022; ISO and IEC, 2021) comprises a branch of studies that originates from the articulation of theories, methods, and technologies fundamentally derived from Computer Science, Artificial Intelligence, and Linguistics to establish effective communication between humans and machines employing natural language.
[a] https://orcid.org/0000-0003-1070-9403
[b] https://orcid.org/0000-0002-7329-4447
[c] https://orcid.org/0000-0001-8978-139X
[d] https://orcid.org/0000-0002-7423-6695
[e] https://orcid.org/0000-0003-1070-9403
[f] https://orcid.org/0000-0002-2159-339X
[g] https://orcid.org/0000-0001-7100-7304
[h] https://orcid.org/0000-0001-7100-7304
[i] https://orcid.org/0000-0001-6433-1587
In the last few decades, various initiatives have
been implemented or are under development in Ju-
risprudence Search, using techniques from text min-
ing, machine learning, natural language processing,
and neural networks (Loutsaris and Charalabidis,
2020).
In the legal area, massive collections of documents demand case-by-case reading for decision-making, which means that a long time is needed to resolve each case and to correlate it with similar cases.
The application of NLP techniques combined with ML models tends to speed up the processing of this corpus and, consequently, to optimize the identification of supporting material for the handling of judicial processes (Dias Canedo et al., 2021; Alrumayyan and Al-Yahya, 2022). Examples include: identification of similar opinions that may guide and link similar cases (Loutsaris and Charalabidis, 2020); summarization of legal texts of great complexity and high volume (Finegan-Dollak and Radev, 2016); sorting, reading, understanding arguments in summaries, evaluating evidence, applying laws, identifying relevant cases, and drafting decisions for analysis by legal experts (Park and Ko, 2020); and extracting effect sentences from legal cases to optimize jurisprudence search in document collections (Mandal et al., 2021).
Ribeiro, V., Emygdio, J., Paiva, G., Praciano, B., Martins, V., Canedo, E., Mendonça, F., Sousa Júnior, R. and Puttini, R.
Natural Language Processing Applied in the Context of Economic Defense: A Case Study in a Brazilian Federal Public Administration Agency.
DOI: 10.5220/0011991900003467
In Proceedings of the 25th International Conference on Enterprise Information Systems (ICEIS 2023) - Volume 1, pages 630-637
ISBN: 978-989-758-648-4; ISSN: 2184-4992
Copyright © 2023 by SCITEPRESS Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)
In this article, these NLP techniques are applied within a jurisprudence search system.
Jurisprudence search systems have access to doc-
umentary collections and search for similarities in
sources of judicial decisions. The result is a set of
similar legal situations that can serve as a basis for
various legal activities. Summarization of cases is,
therefore, essential.
The objective of this paper is to present a case
study of jurisprudence search by a Brazilian federal
public administration agency obtained from the adop-
tion of NLP methods. It reviews and analyzes the con-
cepts and progress of NLP, analyzes the NLP pipeline
created for the system in focus, and presents the stage
of development in the Jurisprudence Search system.
2 NATURAL LANGUAGE
PROCESSING
NLP comprises an interdisciplinary study area involving Computer Science, Linguistics, Statistics, Logic, and Philosophy, among many others. Its studies date back to the late 1940s and 1950s, specifically to Warren Weaver's research (Weaver, 1949) during the conception of a project for the automatic translation of documents. This project, inspired by the work of Alan Turing (Turing, 1950), focused on developing similar methods for translating documents between different languages (Somers, 2012).
Weaver's findings have stimulated the evolution of research in NLP, basically under two types of approaches: i) rule-based: applied to the development of robust systems that require extensive individual effort from linguists in their construction, although they are simple to maintain through modifications in the translation rules; and ii) statistics-based: applied to the development of systems in a shorter period after the collection and cleaning of bilingual data, but complex to adjust after the start of operation. The latter is the dominant approach among researchers in the area (Somers, 2012).
Nowadays, the application of NLP techniques re-
lies heavily on textual interpretation due to the wide
availability of digital information in this format. Tex-
tual interpretation systems help retrieve, categorize,
filter, and extract information from texts and are typ-
ified as information retrieval systems, textual catego-
rization systems, and data extraction systems (Russell
et al., 2010; Loutsaris and Charalabidis, 2020).
2.1 Classical Approaches to NLP
Classical approaches to NLP comprise a set of stages
in which the language analysis process is decom-
posed according to the theoretical linguistic distinc-
tions drawn between syntax, semantics, and pragmat-
ics (Dale, 2010).
These approaches seek, through analysis and practical actions, to make a computer able to perform six stages of understanding communication: i) phonology: the study of the sounds that make up words; ii) morphological analysis (tokenization): fragmentation of an input text to determine its components, that is, the words, punctuation marks, numbers, and signs (Palmer, 2010); iii) lexical analysis (lemmatization): relation of morphological variants to their lemmas, the canonical forms in which they are found in dictionaries, and their meanings (Hippisley, 2010); iv) syntactic analysis: evaluation of the grammar of the language used and representation of the analyzed sentence (parsing) in the form of a grammar; v) semantic analysis: extraction of the meaning of a statement and its representation in a semantic network; and vi) pragmatic analysis: discourse processing for intentionality analysis (Murphy, 2003; Dale, 2010).
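As an illustration of stages ii) and iii), the following minimal sketch tokenizes a sentence with a regular expression and maps each token to a lemma. The `LEMMAS` table and both function names are hypothetical and exist only for this example; a real system would rely on a full lexicon or a trained lemmatizer.

```python
import re

# Hypothetical toy lemma table (assumption); real systems use a full lexicon.
LEMMAS = {"mergers": "merger", "were": "be", "analyzed": "analyze"}

# Numbers (with internal separators), words, or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]")


def tokenize(text):
    """Split a text into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def lemmatize(tokens):
    """Lowercase each token and replace it with its lemma when one is known."""
    return [LEMMAS.get(tok.lower(), tok.lower()) for tok in tokens]


tokens = tokenize("The mergers were analyzed in 2021.")
lemmas = lemmatize(tokens)
print(tokens)
print(lemmas)
```

The dictionary lookup stands in for the lexical analysis stage; unknown words simply fall back to their lowercase form.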
2.2 Empirical and Statistical
Approaches to NLP
In the scope of empirical and statistical approaches,
NLP is used to decide the meaning of a word, its
category, its syntactic structure, and the semantic
scope around it. Thus, various models and techniques
are adopted for this purpose. Statistical models are
heavily used for building machine learning systems.
Among the main ones are: i) artificial neural networks (ANN): computational systems inspired by biological neural networks able to learn to perform tasks from examples; ii) decision trees: predictive models run over a vector of input values to return a single output value; iii) support-vector machines (SVM): a framework of methods for supervised learning, used for classification and regression; and iv) Bayesian networks: a graphical model of the probability distribution related to a set of variables within the universe of a problem (Mitchell, 1997; Russell et al., 2010).
The techniques generally adopted are: i) word sense disambiguation (WSD): the computational identification of the meaning of words in context (Navigli, 2009); ii) corpora creation: collections of texts used for learning linguistic models (Xiao, 2008); iii) part-of-speech (POS) tagging: the process of tagging each word in a given sentence with its correct part of speech (Güngör, 2010); iv) treebank annotation: corpora that present tree-structured annotations (graph theory) representing syntactic, semantic, and intersentential relationships (Hajičová et al., 2010); and v) alignment: automatic parallel text alignment for translation validation purposes (Wu, 2010).
2.3 Related Work
From the linguistic perspective, (Wang, 2019; Jiang and Lu, 2020) state that language comprises the following linguistic levels: phonetics, lexicon, grammar, semantics, discourse, and pragmatics. For the language studies cited, NLP applications can be subdivided into these areas: machine translation, sound recognition, sound synthesis, automatic information retrieval, term database, optical character recognition, human-machine dialogue, and others.
(Loutsaris and Charalabidis, 2020) lists, among the capabilities of NLP, assigning predefined category labels to new documents, understanding the meaning of natural language, and labeling each word in a sentence or phrase with its appropriate part of speech.
(Kumar et al., 2022) emphasizes the importance of natural language understanding (NLU) in understanding human communication because, in textual documents, the annotations used for machine learning are isolated. In real-world communication, because of interactions, the frequency of annotations is significant, as in marking parts of speech, generating sentences, or answering questions. The authors used frequency-enriched datasets to compare IC-NER performance and proposed two changes in domain generalization approaches: domain masks for generalization (DMG) and optimal transport (OT).
The applications of statistical techniques and machine learning are quite diverse. In their research on the topic of neuroscience, (Sarmashghi et al., 2022) present a study of neural coding using existing Machine Learning (ML) approaches, particularly deep network architectures, and the methods to integrate them with statistical models. For both the simulation and real data analyses, 70% of the data were devoted to training, 10% to validation, and 20% to testing, with mini-batch gradient descent (GD) as the learning algorithm to update the model parameters. The research demonstrates that classical statistical methods and supervised machine learning algorithms have complementary strengths and can be used together to address the limitations of each method on its own.
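The 70%/10%/20% partition and the batched iteration over the training portion can be sketched as follows. This is only an illustration of the data handling described above, not the cited authors' code; the function names, the fixed seed, and the batch size of 16 are assumptions.

```python
import random


def split_dataset(items, train=0.7, val=0.1, seed=0):
    """Shuffle the items and split them into train, validation, and test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])  # the remainder is the test part


def mini_batches(items, batch_size):
    """Yield consecutive batches, as consumed by mini-batch gradient descent."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


train_set, val_set, test_set = split_dataset(range(100))
batches = list(mini_batches(train_set, 16))
print(len(train_set), len(val_set), len(test_set), len(batches))
```

Each batch would drive one parameter update, while the validation part is held out for model selection and the test part for the final evaluation.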
(Finegan-Dollak and Radev, 2016) presents the use of sentence simplification, compression, and disaggregation for summarization, applied to creating summaries of sophisticated documents in the legal and medical fields. The proposal is to produce shorter sentences from the original document, reducing its size by about 20%. Due to the texts' complexity, the results were not satisfactory, demonstrating that the techniques applied need to be improved for the areas in question.
(Park and Ko, 2020) presents the use of machine learning (ML) in the legal and economic area in the Chinese context, based on regression modeling for testing legal models. The authors apply three ML techniques (train-test cycle, regularization, and cross-validation) to the Logit model. They state that, although NLP is reliably applied in the legal field for classification, reading and understanding arguments in briefs, evaluating evidence, applying relevant laws and cases to a factual situation, and drafting a decision, it has not yet reached a maturity that allows it to replace lawyers' cognitive power and legal reasoning skills.
Reading a summary of legal cases speeds up the attorney's work in searching for jurisprudence. (Mandal et al., 2021) presents a neural sequence tagging model for extracting catchphrases from legal cases of the Supreme Court of India. A cross-validation approach was used to train and evaluate all supervised methods. Catchphrases were identified by scoring candidate sentences, modeling the task as a sequence labeling task, and using document context information with sequence markers. As a result, the authors identified that generic extraction methods do not work well in extracting from legal documents, that including the document context improves the performance of the extraction model, and that the variation using noun phrases outperforms the two variations using n-grams.
3 METHODOLOGY
The present research is characterized, from the point of view of its nature, as applied research; from the point of view of the way of approaching the problem, as qualitative research; from the point of view of the objectives, as exploratory research; and, from the point of view of the technical procedures, as a case study, considering that, for exemplification, the search for jurisprudence in processes of economic defense in the pharmaceutical sector will be presented (Gil, 1989).
ICEIS 2023 - 25th International Conference on Enterprise Information Systems
The objective of the case study is to verify the existence of antitrust practices in the elicited domain and their relation to the activities performed by a Brazilian federal public administration agency, focused on the investigation, judgment, and analysis of such practices. To this end, a methodological path was established based on three stages:
i. Building the corpus;
ii. Running the NLP pipeline;
iii. Consultation of the results in the Jurisprudence
Search System (BJ System).
Figure 1 illustrates the architecture of the system.
Figure 1: Jurisprudence search system architecture.
A description of these steps is presented in the following subsections.
3.1 Construction of the Documentary
Corpus
This step foresees the identification and organization
of relevant documents for the construction of a cor-
pus to be submitted to NLP and ML methods to meet
the objectives of the case study. Relevant documents
are those produced and maintained by the Brazilian
federal public administration agency to register deci-
sions, technical notes, opinions, and others related to
the institution’s performance in the prevention, judg-
ment, and analysis of antitrust practices. The docu-
ments must be available for consultation in electronic
format.
3.2 NLP Pipeline
Figure 2 shows the three steps of the NLP pipeline: cleaning, pre-processing, and modeling.
Figure 2: NLP Pipeline.
The cleaning step applies text cleaning and normalization techniques: replacing abbreviations with their full forms, normalizing numeric formats, and removing inappropriate characters originating from HTML texts.
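A minimal sketch of such a cleaning step is given below. It is not the BJ System's implementation: the `ABBREVIATIONS` table, the function names, and the choice of converting Brazilian-style numbers (for example 1.234,56) to a dot-decimal form are all assumptions made for illustration.

```python
import html
import re

# Hypothetical abbreviation table (assumption); the real list is not published.
ABBREVIATIONS = {"art.": "artigo", "n.": "número"}

# Brazilian-style numbers: thousands separated by dots and/or a decimal comma.
NUMBER_PATTERN = re.compile(r"\d{1,3}(?:\.\d{3})+(?:,\d+)?|\d+,\d+")


def _normalize_number(match):
    """Turn '1.234,56' into '1234.56'."""
    return match.group(0).replace(".", "").replace(",", ".")


def clean_text(raw):
    """Strip HTML residue, expand abbreviations, and normalize numbers."""
    text = re.sub(r"<[^>]+>", " ", raw)   # drop HTML tags
    text = html.unescape(text)            # decode entities such as &ndash;
    for abbr, full in ABBREVIATIONS.items():
        # (?<!\w) avoids matching the tail of a longer word
        text = re.sub(r"(?<!\w)" + re.escape(abbr), full, text,
                      flags=re.IGNORECASE)
    text = NUMBER_PATTERN.sub(_normalize_number, text)
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


cleaned = clean_text("<p>Art. 36 &ndash; multa de R$ 1.234,56</p>")
print(cleaned)
```

Note that a pattern this simple would also rewrite dotted identifiers such as statute numbers, so a production pipeline would need to protect legal citations before normalizing numbers.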
Text pre-processing is a fundamental step for NLP, in which statistics are applied to ensure correct data loading. It comprises the following sub-steps:
Stopwords Removal: stopwords are words that help to understand the meaning of a sentence but carry no significance in themselves. Words like “a”, “que”, and “em” are present in Portuguese stopword lists and are removed in pre-processing because they occur frequently and do not add to the meaning of the text.
Stemming and Lemmatization: these techniques are applied to normalize words by removing their inflections (Hippisley, 2010).
POS-Tagging: Part-of-Speech (POS) tagging is an NLP process that categorizes words by their grammatical class (Güngör, 2010).
Tokenization: in the legal context, specificities of the text, such as abbreviations, Roman numerals, and article and legislation citations, affect the performance of conventional tokenizers. Therefore, the construction of specific tokenizers is planned, taking into account the characteristics of the text of the agency's documents.
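The sketch below illustrates why a legal-aware tokenizer helps: it keeps article and statute citations as single tokens before stopwords are removed. The regular expression, the tiny stopword set, and the function names are assumptions for illustration only; the planned tokenizers of the BJ System are not described in enough detail to reproduce them.

```python
import re

# Small illustrative subset of Portuguese stopwords (assumption).
STOPWORDS = {"a", "o", "que", "em", "de", "da", "do"}

LEGAL_TOKEN = re.compile(
    r"(?i:lei)\s+(?:[nN][ºo°.]\s*)?\d+(?:\.\d+)*(?:/\d{4})?"  # statute citation
    r"|(?i:art)\.\s*\d+[º°]?"                                 # article citation
    r"|\w+(?:-\w+)*"                                          # ordinary word
    r"|[^\w\s]"                                               # punctuation mark
)


def legal_tokenize(text):
    """Tokenize while keeping legal citations together as single tokens."""
    return LEGAL_TOKEN.findall(text)


def remove_stopwords(tokens):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in STOPWORDS]


sentence = "Nos termos do art. 36 da Lei nº 12.529/2011, a conduta foi analisada."
tokens = legal_tokenize(sentence)
content = remove_stopwords(tokens)
print(tokens)
print(content)
```

A conventional word tokenizer would split "art. 36" and "Lei nº 12.529/2011" into several fragments, which is the degradation the text refers to.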
The modeling step consists of statistical analysis of the text and application of the text summarization, Named Entity Recognition, and WordCloud models (Alshammari and Alanazi, 2020).
4 RESULTS AND DISCUSSIONS
Text summarization is a process that generates a document summary by identifying its most important sentences. It was implemented as an ensemble of summarization techniques (Luhn, 1958; Haghighi and Vanderwende, 2009), in which the outputs of these models are combined to choose the most relevant sentences from the document. Ensemble learning, unlike other methods, selects a set of hypotheses from the hypothesis space, combines their predictions, and reduces the correlation between possible errors in hypothesis classification (Russell et al., 2010).
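The idea of combining several sentence scorers can be sketched as below. The two scorers used here, a word-frequency scorer loosely inspired by Luhn (1958) and a simple position scorer, as well as the averaging rule, are stand-ins chosen for brevity; they are assumptions and not the models actually combined in the BJ System.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    return re.findall(r"\w+", sentence.lower())


def frequency_scores(sentences):
    """Mean document frequency of the words in each sentence."""
    freq = Counter(w for s in sentences for w in words(s))
    return [sum(freq[w] for w in words(s)) / max(len(words(s)), 1)
            for s in sentences]


def position_scores(sentences):
    """Earlier sentences score higher (a common lead-bias heuristic)."""
    n = len(sentences)
    return [(n - i) / n for i in range(n)]


def normalize(scores):
    top = max(scores) or 1.0
    return [s / top for s in scores]


def ensemble_summary(text, n_sentences=2):
    """Average the normalized scores of all scorers and keep the top sentences."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    scorers = (frequency_scores, position_scores)
    combined = [0.0] * len(sentences)
    for scorer in scorers:
        for i, score in enumerate(normalize(scorer(sentences))):
            combined[i] += score / len(scorers)
    ranked = sorted(range(len(sentences)), key=lambda i: combined[i], reverse=True)
    chosen = sorted(ranked[:n_sentences])  # restore document order
    return [sentences[i] for i in chosen]


text = ("The agency reviewed the merger. "
        "The merger involved two pharmaceutical companies. "
        "Weather was mild. "
        "The agency approved the merger with remedies.")
summary = ensemble_summary(text, n_sentences=2)
print(summary)
```

Averaging normalized scores is only one possible combination rule; voting over each scorer's top-ranked sentences would be an equally plausible reading of the ensemble described above.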
Most studies employ one of four summariza-
tion architectures: Sentence Extraction and Sum-
marization; Feature Extraction and Classification
or Classification-based Sentence Selection; Abstract
Sentence Compression and Compression; and Lan-
guage Modeling (Rezazadegan et al., 2022).
Named Entity Recognition is an NLP task that identifies and categorizes real-world entities present in texts (Grishman and Sundheim, 1995). The names of people, organization names, places, citations to laws, and other documents are identified. A corpus was built for training, and a Deep Learning model specific to the documents of the agency was trained.
Word Cloud is a visual representation of text in which keywords are highlighted according to their frequency in the corpus. In the system, this visualization is implemented for individual documents only; for the context of this paper, visualization across multiple documents was implemented using the Word Cloud software (Mueller et al., 2018).
Statistical methods are widely applied in NLP. In the specific case of BJ, the feature extraction step for summarization uses the calculation of distributions (relative and absolute frequency together with IDF) (Navigli, 2009), cut-off selection on the frequency distribution (FreqDistrib) using elbow techniques (Shi et al., 2021), and TF prioritization techniques (pre-calculation of IDF marks) (Rahmah et al., 2019).
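The paper does not give the exact formulas, so the sketch below uses one common formulation as an assumption: relative term frequency multiplied by $\log(N/\mathit{df})$ for the weights, and, for the elbow, the point of a descending frequency curve that lies farthest from the straight line joining its first and last points. Function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    idf = {term: math.log(n_docs / df[term]) for term in df}
    weights = []
    for doc in docs:
        counts = Counter(doc)  # absolute frequency
        total = len(doc)
        # relative frequency times IDF
        weights.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return weights


def elbow_index(values):
    """Index of the point farthest from the chord of a descending curve."""
    n = len(values)
    if n < 3:
        return max(n - 1, 0)
    y1, y2, x2 = values[0], values[-1], n - 1

    def distance(i):
        # Unnormalized distance from (i, values[i]) to the line through
        # (0, y1) and (x2, y2); the constant denominator is omitted.
        return abs((y2 - y1) * i - x2 * values[i] + x2 * y1)

    return max(range(n), key=distance)


docs = [["merger", "drug", "drug"], ["merger", "patent"]]
weights = tf_idf(docs)

frequencies = [50, 20, 8, 5, 4, 3, 2, 1]  # terms sorted by descending frequency
cut = elbow_index(frequencies)
print(weights[0], cut)
```

Terms up to and including the elbow index would be kept as features, while a term that occurs in every document, such as "merger" here, receives zero weight.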
To conduct this case study, two versions of the Jurisprudence Search (BJ) System were used to achieve specific results, as follows:
Version 1.0 (production environment):
Construction of the documentary corpus of ju-
risprudence;
Retrieval of the corpus necessary for the intended
scope of this research.
Version 1.2 (development environment):
Pipeline execution.
The results obtained in each step of the proposed methodology are shown in the following subsections.
4.1 Construction of the Documentary
Corpus
The corpus was built using BJ v1 (Dias Canedo
et al., 2021), which can provide advanced search fil-
ters, with the option of conditionals, search with spe-
cific characters/terms, by proximity or Boolean op-
erators, search by relevance, phonetic search with
spell checker and autosuggestion. As a search result,
the system presents resources for word highlighting,
paging and sorting, controlled vocabulary synonyms,
term-stopwords definition, and document standard-
ization. In addition, the system allows the indexing
of various file extensions, such as PDF with OCR.
For the case study, we restricted the scope to the pharmaceutical industry. The keywords “medicaments”, “pharmaceutical”, and “medicines”, together with the logical operators available in the system, were used in the filter resource of the BJ system search. The addition of the term “drugs” was also evaluated, but the term was discarded because it had no relevant impact on the search results: the additional results referred to the context of illicit drugs.
With the chosen keywords, the search also returns documents related to veterinary medicines. To adjust for these cases, the logical operator “NOT” was used to exclude the terms “animal” and “veterinarian” from the search, resulting in the query (pharmaceutical* OR medicine* OR medicament*) NOT (veterinarian* OR animal*).
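The semantics of this query can be sketched in a few lines, treating each starred term as a prefix match. This is an illustration of the Boolean logic only, under the assumption that `*` means "any word starting with the term"; it is not how the BJ System's search engine evaluates queries.

```python
import re

# Prefixes taken from the query above; '*' is read as a prefix wildcard.
INCLUDE = ("pharmaceutical", "medicine", "medicament")
EXCLUDE = ("veterinarian", "animal")


def has_prefix(tokens, prefixes):
    """True if any token starts with any of the given prefixes."""
    return any(tok.startswith(prefixes) for tok in tokens)


def matches(text):
    """(pharmaceutical* OR medicine* OR medicament*) NOT (veterinarian* OR animal*)"""
    tokens = re.findall(r"\w+", text.lower())
    return has_prefix(tokens, INCLUDE) and not has_prefix(tokens, EXCLUDE)


print(matches("Merger between two pharmaceutical companies"))
print(matches("Distribution of medicines for animals"))
```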
To demonstrate the pipeline results, document number SEI 1090146 was taken as a base; the results of the Summarization and NER models, available in the development environment, are shown in Figures 3 and 4, respectively.
4.2 Running the NLP Pipeline and
Querying the Results in the BJ
System
Following all the steps described in Section 3.2 of the methodology, the pipeline runs transparently within the BJ System. The results are stored in databases and made available for query by the BJ system through an API.
4.2.1 Results after Data Cleaning
For the data cleaning step provided in the pipeline, the
BJ System removes stopwords and punctuation.
The BJ system also implements in the cleaning
step the removal of abbreviations by replacing them
with their corresponding fully spelled ones.
The removal of plurals is implemented in a spe-
cific way in the BJ system for handling exceptions not
handled by commonly adopted Python libraries.
4.2.2 Results after Pre-Processing
For the pre-processing step foreseen in the pipeline, the BJ System implements the lemmatization process in a step called “morphosyntactic tagging”.
Another implementation, also performed in this
step, refers to the segmentation of representative sen-
tences of the semantic set to define propositions.
4.2.3 Results after Modeling
The query about the pipeline results started on 06/01/2022 and returned 535 documents from the jurisprudence of the Federal Government’s Departments. Of these, 276 were issued in the last five years, distributed as follows: 62 documents issued in 2018, 61 in 2019, 80 in 2020, 67 in 2021, and 6 in 2022.
The distribution by process category is as follows: three are characterized as merger reviews, 70 as ordinary mergers, 140 as summary mergers, one as a consultation, 21 as administrative inquiries, six as preparatory proceedings, 32 as administrative proceedings, two as voluntary appeals, and one as a cease-and-desist application.
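The two breakdowns reported above can be cross-checked with a short calculation: both sum to the 276 documents, and the three merger categories together account for 213 of them (about 77%). The variable names below are illustrative only.

```python
# Counts as reported in the text.
by_year = {2018: 62, 2019: 61, 2020: 80, 2021: 67, 2022: 6}
by_category = {
    "merger review": 3,
    "ordinary merger": 70,
    "summary merger": 140,
    "consultation": 1,
    "administrative inquiry": 21,
    "preparatory proceeding": 6,
    "administrative proceeding": 32,
    "voluntary appeal": 2,
    "cease-and-desist application": 1,
}
TOTAL = 276

year_total = sum(by_year.values())
category_total = sum(by_category.values())

# Percentage share of each category, rounded to one decimal place.
shares = {name: round(100 * count / TOTAL, 1)
          for name, count in by_category.items()}
merger_count = sum(count for name, count in by_category.items()
                   if "merger" in name)

print(year_total, category_total, merger_count)
print(shares)
```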
The summarization results are customizable according to the number of sentences or the percentage of the text specified by the system user. This parameter is sent to the API, which selects that quantity of the most relevant fragments. Figure 3 illustrates the results of selecting the most relevant sentences.
Figure 3: Summarization output from the BJ System.
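How a sentence count or a percentage might be turned into the number of fragments to select can be sketched as follows. The function name, its parameters, and the rounding and clamping rules are assumptions; the actual API contract of the BJ System is not described in the text.

```python
import math


def resolve_sentence_count(total_sentences, n_sentences=None, percentage=None):
    """Translate the user's parameter into a number of sentences to keep."""
    if (n_sentences is None) == (percentage is None):
        raise ValueError("provide exactly one of n_sentences or percentage")
    if total_sentences <= 0:
        return 0
    if percentage is not None:
        if not 0 < percentage <= 100:
            raise ValueError("percentage must be in the interval (0, 100]")
        # Round up so that a small percentage still yields one sentence.
        n_sentences = math.ceil(total_sentences * percentage / 100)
    # Never return more sentences than the document has, and at least one.
    return max(1, min(n_sentences, total_sentences))


print(resolve_sentence_count(40, percentage=10))
print(resolve_sentence_count(40, n_sentences=50))
```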
Figure 4 contains the results related to Named En-
tity Recognition (NER), where the entities recognized
in the texts are highlighted in different colors, us-
ing: red for locations, green for organizations, gray
for values, and yellow for jurisprudence regulation
documents. The entities identified are stored in the
database by the model during the execution of the
pipeline. At the moment of the user’s request in the
BJ system, they are retrieved and presented visually
on the screen.
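The retrieval-and-display step can be illustrated by rendering stored entity spans as colored HTML marks. The label names, the `(start, end, label)` span format, and the sample sentence are hypothetical; only the color assignment follows the description above.

```python
import html

# Colors follow the text; the label names themselves are assumptions.
COLORS = {
    "LOCATION": "red",
    "ORGANIZATION": "green",
    "VALUE": "gray",
    "LEGAL_DOCUMENT": "yellow",
}


def span(text, fragment, label):
    """Helper: build a (start, end, label) span for the first match of fragment."""
    start = text.index(fragment)
    return (start, start + len(fragment), label)


def highlight(text, entities):
    """Wrap each non-overlapping entity span in a colored <mark> element."""
    parts, cursor = [], 0
    for start, end, label in sorted(entities):
        parts.append(html.escape(text[cursor:start]))
        parts.append('<mark style="background:{}">{}</mark>'.format(
            COLORS[label], html.escape(text[start:end])))
        cursor = end
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)


sample = "Acme Pharma was fined in Brasília."
entities = [span(sample, "Acme Pharma", "ORGANIZATION"),
            span(sample, "Brasília", "LOCATION")]
rendered = highlight(sample, entities)
print(rendered)
```

Because the spans are stored at pipeline time, the display step needs no model inference, which matches the retrieve-and-present flow described above.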
Figure 4: Named Entities recognized by the BJ System.
In the context of economic defense, NER is a tool with great potential since it is relevant to the retrieval of entities and the discovery of knowledge, with emphasis on organizations mentioned in legal documents. Despite the great value in information retrieval provided by NER, the names of organizations by themselves do not add enough value to the legal context. Therefore, by integrating the retrieval of entity names with the knowledge discovery made possible by the APIs of knowledge bases such as WikiData and the Agency's database, it is possible to extract, besides the names of the entities, information such as the Brazilian National Registry of Legal Entities number (CNPJ), Corporate Name, Organizational Structure, and the National Classification of Economic Activities (CNAE), among other types of data relevant to the Agency's target audience.
Figure 5: Word Cloud generated by the system.
Figure 5 shows that, besides the keywords referring to the business areas of the Brazilian federal public administration agency, keywords referring to the scope proposed in this article were also found, notably: health insurance, hospital, medicine, medicine distribution, and patent.
5 FINAL REMARKS
The objective of this paper was to verify the existence of antitrust practices in the pharmaceutical field and their relation with the activities performed by the Brazilian federal public administration agency. To this end, a mixed methodology was conceived and executed, encompassing NLP techniques and contemporary machine learning models, articulated in a technological architecture to support a jurisprudence search system under development (Dias Canedo et al., 2021).
Section 2 presents background on NLP, covering the classical, empirical, and statistical approaches, methods, and techniques found in the literature, accompanied by practical examples of their application in related research identified in the last five years. Section 3 presents the three-stage methodology designed for the research, covering the construction of the document corpus, the execution of the NLP pipeline, and the query of results in the BJ system. Section 4 presents the results obtained during the execution of the methodology and preliminary discussions about them.
In compliance with the objective of the case study, it was possible to identify the performance of the agency around the domain elicited, as well as indications of the existence of antitrust practices, since the 276 documents retrieved from the BJ system relate directly to routine processes executed by the agency, whether in the sense of investigation, trial, or analysis of business practices. Details about these processes, known as merger review, ordinary merger, summary merger, consultation, administrative inquiry, preparatory proceeding, administrative proceeding, voluntary appeal, and cease-and-desist application, can be found in (BRASIL, Ministério da Justiça e Segurança Pública. Conselho Administrativo de Defesa Econômica, 2021; CADE, 2021; Brasil, 2011).
Given the exploratory nature of the research described in this paper, the content analysis of the retrieved documents is the subject of a future publication. That analysis is intended to provide a better understanding of the flow of processes in progress in the agency and of the relationships that the documents establish among themselves, since they can characterize progressive outputs of the processes and sub-processes performed by the organization.
One of the distinguishing features of this project is the construction of a domain ontology to support the disambiguation of terms and, consequently, to optimize ML processing in the BJ System.
ACKNOWLEDGMENTS
This work is supported in part by CNPq - Brazilian National Research Council (Grant 310941/2022-9 PQ-1D), in part by FAPDF - Brazilian Federal District Research Support Foundation (Grant 625/2022 SISTeR City), in part by the University of Brasília (Grant 7129 UnB COPEI), in part by the Attorney General of the Union (Grant AGU 697.935/2019), in part by the Administrative Council for Economic Defense (Grant CADE 08700.000047/2019-14), and in part by the Office of the Attorney General for the National Treasury (Grant PGFN 23106.148934/2019-67).
REFERENCES
Alrumayyan, N. and Al-Yahya, M. (2022). Neural embed-
dings for the elicitation of jurisprudence principles:
The case of arabic legal texts. Applied Sciences, 12(9).
Alshammari, N. and Alanazi, S. (2020). An arabic dataset
for disease named entity recognition with multi-
annotation schemes. Data, 5(3).
Brasil (2011). Lei 12.529, de 30 de novembro de 2011. Diário Oficial da República Federativa do Brasil.
BRASIL, Ministério da Justiça e Segurança Pública. Conselho Administrativo de Defesa Econômica (2021). CADE MECUM: Coletânea de normativos brasileiros de defesa da concorrência. Conselho Administrativo de Defesa Econômica, Brasília: CADE. CDD 341.3787.
CADE, Conselho Administrativo de Defesa Econômica (2021). Regimento interno CADE. CADE, Brasília, 5th edition.
Dale, R. (2010). Classical approaches to natural language
processing. In Handbook of Natural Language Pro-
cessing, Second Edition, pages 3–8. CRC Press - Tay-
lor and Francis Group, New York, NY, USA.
Dias Canedo, E., Aymoré Martins, V., Coelho Ribeiro, V., dos Reis, V. E., Carvalho Chaves, L. A., Machado Gravina, R., Alberto Moreira Dias, F., Lopes de Mendonça, F. L., Orozco, A. L. S., Balaniuk, R., and de Sousa, R. T. (2021). Development and evaluation of an intelligence and learning system in jurisprudence text mining in the field of competition defense. Applied Sciences, 11(23).
Finegan-Dollak, C. and Radev, D. R. (2016). Sentence sim-
plification, compression, and disaggregation for sum-
marization of sophisticated documents. Journal of the
Association for Information Science and Technology,
67(10):2437–2453.
Gil, A. C. (1989). Métodos e Técnicas de Pesquisa Social. Atlas, São Paulo, 2nd edition.
Grishman, R. and Sundheim, B. (1995). Design of the
MUC-6 evaluation. In Sixth Message Understanding
Conference (MUC-6): Proceedings of a Conference
Held in Columbia, Maryland, November 6-8, 1995.
Güngör, T. (2010). Part-of-speech tagging. In Handbook of Natural Language Processing, Second Edition, pages 205–236. CRC Press - Taylor and Francis Group, New York, NY, USA.
Haghighi, A. and Vanderwende, L. (2009). Exploring content models for multi-document summarization. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 362–370, Boulder, Colorado. Association for Computational Linguistics.
Hajičová, E., Abeillé, A., Hajič, J., Mírovský, J., and Uresová, Z. (2010). Treebank annotation. In Handbook of Natural Language Processing, Second Edition, pages 167–188. CRC Press - Taylor and Francis Group, New York, NY, USA.
Hippisley, A. (2010). Lexical Analysis. In Handbook of
Natural Language Processing, pages 31–58. Nitin Indurkhya
and Fred J. Damerau, New York: Chapman
& Hall/CRC Press, 2nd edition.
International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC) (2021). ISO/IEC TR
24030:2021(en), Information technology — Artificial
intelligence (AI) — Use cases.
International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC) (2022). ISO/IEC
22989:2022(en), Information technology — Artificial
intelligence — Artificial intelligence concepts and ter-
minology.
Jiang, K. and Lu, X. (2020). Natural Language Process-
ing and Its Applications in Machine Translation: A
Diachronic Review. In 2020 IEEE 3rd International
Conference of Safe Production and Informatization
(IICSPI), pages 210–214, Chongqing City, China.
IEEE.
Kumar, M., Rumshisky, A., and Gupta, R. (2022). Chas-
ing the tail with domain generalization: A case study
on frequency-enriched datasets. In Proceedings of the
2nd Conference of the Asia-Pacific Chapter of the As-
sociation for Computational Linguistics and the 12th
International Joint Conference on Natural Language
Processing (Volume 1: Long Papers), pages 1–11, On-
line only. Association for Computational Linguistics.
Loutsaris, M. A. and Charalabidis, Y. (2020). Legal infor-
matics from the aspect of interoperability: a review of
systems, tools and ontologies. In Proceedings of the
13th International Conference on Theory and Practice
of Electronic Governance, pages 731–737, Athens,
Greece. ACM.
Luhn, H. P. (1958). The automatic creation of literature
abstracts. IBM Journal of Research and Development,
2(2):159–165.
Mandal, A., Ghosh, K., Ghosh, S., and Mandal, S. (2021).
A sequence labeling model for catchphrase identifica-
tion from legal case documents. Artificial Intelligence
and Law, 30.
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill
series in computer science. McGraw-Hill, New York.
Mueller, A., Fillion-Robin, J.-C., Boidol, R., Tian, F.,
Nechifor, P., yoonsubKim, Peter, Rampin, R., Corvel-
lec, M., Medina, J., Dai, Y., Petrushev, B., Langner,
K. M., Hong, Alessio, Ozsvald, I., vkolmakov, Jones,
T., Bailey, E., Rho, V., IgorAPM, Roy, D., May, C.,
foobuzz, Piyush, Seong, L. K., Goey, J. V., Smith,
J. S., Gus, and Mai, F. (2018). amueller/word_cloud:
Wordcloud 1.5.0.
Murphy, M. L. (2003). Semantic Relations and the Lexicon:
Antonymy, Synonymy and other Paradigms. Cam-
bridge University Press.
Navigli, R. (2009). Word sense disambiguation: a survey.
ACM Computing Surveys, 41(2):1–69.
Palmer, D. (2010). Text preprocessing. In Handbook of Nat-
ural Language Processing, page 22. Nitin Indurkhya
and Fred J. Damerau, New York, NY, USA, 2nd. edi-
tion.
Park, S. and Ko, H. (2020). Machine learning and law and
economics: A preliminary overview. Asian Journal of
Law and Economics, 11(2):20200034.
Rahmah, A., Santoso, H. B., and Hasibuan, Z. A. (2019).
Exploring technology-enhanced learning key terms
using tf-idf weighting. In 2019 Fourth International
Conference on Informatics and Computing (ICIC),
pages 1–4.
Rezazadegan, D., Berkovsky, S., Quiroz, J. C., Kocaballi,
A. B., Wang, Y., Laranjo, L., and Coiera, E. W. (2022).
Symbolic and statistical learning approaches to speech
summarization: A scoping review. Comput. Speech
Lang., 72:101305.
Russell, S. J., Norvig, P., and Davis, E. (2010). Artificial
intelligence: a modern approach. Prentice Hall series
in artificial intelligence. Prentice Hall, Upper Saddle
River, 3rd edition.
Sarmashghi, M., Jadhav, S. P., and Eden, U. T. (2022). In-
tegrating statistical and machine learning approaches
for neural classification. IEEE Access, 10:119106–
119118.
Shi, C., Wei, B., Wei, S., Wang, W., Liu, H., and Liu, J.
(2021). A quantitative discriminant method of elbow
point for the optimal number of clusters in cluster-
ing algorithm. EURASIP J. Wirel. Commun. Netw.,
2021(1):31.
Somers, H. (2012). Machine Translation: History, Devel-
opment, and Limitations. In The Oxford Handbook
of Translation Studies, pages 1–9. Oxford University
Press Inc., New York, NY, USA. Edited by Kirsten
Malmkjær and Kevin Windle.
Turing, A. M. (1950). Computing machinery and intelligence.
Mind, LIX(236):433–460.
Wang, Y. (2019). Natural language processing and appli-
cations in machine learning. Modern Chinese, 5:187–
191.
Wu, D. (2010). Alignment. In Handbook of Natural Lan-
guage Processing, Second Edition, pages 367–408.
CRC Press - Taylor and Francis Group, New York,
NY, USA.
Xiao, R. (2008). Well-known and influential corpora. In
Corpus Linguistics: An International Handbook, vol-
ume 1, pages 383–457. Mouton de Gruyter, Berlin,
Germany. Edited by A. Lüdeling and M. Kyto.