Using Formal Concept Analysis for Corpus Visualisation and Relevance Analysis
Fabrice Boissier¹, Irina Rychkova² and Bénédicte Le Grand²
¹LRE, EPITA, 14-16 Rue Voltaire, Le Kremlin-Bicêtre, France
²CRI, Université Paris 1 Panthéon-Sorbonne, 90 Rue de Tolbiac, Paris, France
fabrice.boissier@epita.fr, {irina.rychkova, benedicte.le-grand}@univ-paris1.fr
Keywords:
Corpus Visualization, Relevance Visualization, Topic Modeling, Formal Concept Analysis, Text Mining,
Natural Language Processing.
Abstract:
Corpus analysis is a common task in digital humanities that benefits from advances in topic modeling and
visualization coming from the computer science and information systems fields. Topic modeling is often done using
methods from the Latent Dirichlet Allocation (LDA) family, and visualizations usually propose views based
on the input documents and topics found. In this paper, we first explore the use of Formal Concept Analysis
(FCA) as a replacement for LDA in order to visualize the most important keywords and then the relevance of
multiple documents concerning close topics. FCA offers another method for analyzing texts that is not based
on probabilities but on the analysis of a lattice and its formal concepts. The main processing pipeline is as
follows: first, documents are cleaned using TreeTagger and BabelFy; next, a lattice is built. Following this, the
mutual impact is calculated as part of the FCA process. Finally, a force-based graph is generated. The output
map is composed of a graph displaying keywords as rings of importance, and documents positioned based on
their relevance. Three experiments are presented to evaluate the displayed keywords and how relevance
evolves on the output map.
1 INTRODUCTION
In the age of data, knowledge is an essential fac-
tor that increases the capacity to make the best deci-
sions (North and Kumta, 2018). Data visualization re-
mains a difficult task for knowledge extraction as nu-
merous visualizations are available and each business
and/or application faces its own challenges (Andrienko et al., 2020)(Padilla et al., 2018)(Engebretsen
and Kennedy, 2020). It requires a perfect understanding of the goal to achieve and of the manipulated data.
Visualizing text corpora is a cross-domain appli-
cation of interest for information systems and digital
humanities (Jänicke et al., 2015). Improvements in
visualization lead to better analysis of multiple types
of texts like historical newspapers (Menhour et al.,
2023) or even poetry (De Sisto et al., 2024). However, little work has been done on visualizing the relevance between documents based on their topics. Topic
modeling also contributes to texts’ analysis as a re-
search orientation of the information retrieval field
by uncovering latent text topics by modeling word
associations (Hambarde and Proenca, 2023). A topic is simply a list of words sharing a similar theme: either each word strongly and directly expresses the theme, or the collection of words illustrates an abstract theme through their semantic links (alone, they would not make sense). Multiple topic modeling
methods exist (Alghamdi and Alfalqi, 2015)(Kherwa
and Bansal, 2019). They usually consider documents
as a bag-of-words where the order of words is not im-
portant; only their occurrence in each document is im-
portant. The main trend in topic modeling consists of
probabilistic methods where topics are built based on
the probability of each term being part of a specific
topic. Recent advances in natural language process-
ing (NLP) also introduced neural networks combined
with traditional methods, allowing the capture of the
context of words within documents and reusing it to
analyze newer documents.
In this paper, we explore the use of Formal Con-
cept Analysis (FCA) (Ganter and Wille, 2012) instead
of more traditional topic modeling methods, and we
propose a visualization of the main keywords of a cor-
pus and documents’ relevance on a force-based graph.
The strength of FCA resides in the fact that it an-
alyzes the relationships within data and produces a
lattice that can be used for calculating useful mea-
sures like similarities. Instead of trying to find back
memberships of an object between multiple distribu-
tions or the best matching weight of a parameter in a
neural network, FCA highlights the existing relation-
ships between objects based on their attributes. FCA
is known as a viable text mining method (Carpineto
and Romano, 2004) and is a good candidate for mul-
tiple applications in the knowledge field (Poelmans
et al., 2013). FCA has been used in conjunction with
a topic modeling method (Akhtar et al., 2019) but not
instead of it.
The paper is organized as follows: First, we ex-
plain the main methods currently used in topic model-
ing as well as topic and relevance visualizations. Sec-
ond, we present our processing pipeline from the in-
put documents to the output map. Then, we present
three experiments, the text materials used, their top-
ics, and their relevance. Finally, we discuss the results
and conclude.
2 RELATED WORK
2.1 Topic Modeling
Analyzing documents implies creating statistics on
the used terms. Term Frequency - Inverse Document
Frequency (TF-IDF) (Salton, 1983) weights each term by its frequency within a document combined with the inverse of the number of documents containing it. Documents are therefore seen as
a ratio of words independent of their ordering, like a
bag-of-words. TF-IDF is not exactly a topic modeling
method, but it shows the importance and uniqueness
of terms within the corpus.
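As a purely illustrative sketch (not part of our pipeline), TF-IDF weights can be computed with scikit-learn on a few toy documents as follows; the document contents and the library choice are ours.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "php queries a mysql database",          # hypothetical toy documents
    "java is an object oriented language",
    "a php session stores user data",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)       # shape: (n_documents, n_terms)

# Highest-weighted terms of the first document
terms = vectorizer.get_feature_names_out()
weights = tfidf[0].toarray().ravel()
print(sorted(zip(terms, weights), key=lambda p: p[1], reverse=True)[:3])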
Latent Semantic Analysis (LSA) (Deerwester
et al., 1990) transforms documents into a latent se-
mantic space from which multiple outputs can be an-
alyzed. The main method behind LSA is the singular
value decomposition, which produces three matrices
based on a parameter K given by the user: a matrix
of terms per K features, a matrix of K features per K
features, and a matrix of K features per documents.
Multiple analyses can be done on those matrices, but
in our case, the terms per K features one is the most
important as it allows us to know how well each term
is linked to each feature (that in fact represents la-
tent topics). One problem induced by LSA is that it cannot handle polysemy: each term is treated as the same entity in every document. Homonyms, as well as synonyms and the various forms of the same word, can produce inconsistencies because the context of each document is missing. Standardizing the input, for example through stemming or lemmatization, can partially alleviate this problem.
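The following toy sketch (illustrative only, with a made-up term-document matrix) shows the three matrices produced by a truncated SVD, mirroring the decomposition described above:

import numpy as np

# Hypothetical term-document count matrix: rows = terms, columns = documents.
X = np.array([
    [2., 0., 1., 0.],
    [1., 1., 0., 0.],
    [0., 3., 1., 2.],
    [0., 0., 2., 1.],
])

K = 2                                # number of latent features chosen by the user
U, s, Vt = np.linalg.svd(X, full_matrices=False)

terms_per_feature = U[:, :K]         # terms x K: how strongly each term loads on each latent feature
feature_weights = np.diag(s[:K])     # K x K: importance of each latent feature
features_per_doc = Vt[:K, :]         # K x documents: presence of each feature in each document

print(terms_per_feature.round(2))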
Probabilistic Latent Semantic Analysis
(PLSA) (Hofmann, 1999) is an upgrade of LSA
that introduces a probabilistic point of view by
building a generative model for each text corpus.
Because topics are scattered within documents,
probabilities help find the terms that compose them.
PLSA relies on an aspect model (the probabilities
between terms, documents, and the latent topics) and
a mixture decomposition that obtains results similar to the singular value decomposition, also allowing PLSA to build the three matrices of LSA.
Latent Dirichlet Allocation (LDA) (Blei et al.,
2003) is also a generative probabilistic model that
aims to model text corpora. It aims to improve the
PLSA mixture decomposition by using a hierarchical
Bayesian model. LDA not only allows finding topics of words within documents in the case of text corpora, but it can also be used with more confidence than PLSA as a probabilistic generative model in multiple domains.
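For illustration only, LDA can be fitted on a small bag-of-words matrix with scikit-learn; the toy documents and the number of topics below are arbitrary and not taken from our experiments:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "php script queries the mysql database",   # hypothetical toy documents
    "html form posts data to a php page",
    "java classes and objects for beginners",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)             # bag-of-words counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)              # documents x topics mixture

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):  # topic-word weights
    top = [terms[i] for i in weights.argsort()[::-1][:4]]
    print("topic", k, ":", top)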
LDA can be combined with Bidirectional Encoder
Representations from Transformers (BERT) (Devlin
et al., 2018) in order to increase the quality of its
results (Peinelt et al., 2020)(George and Sumathy,
2023). BERTs are a collection of pre-trained neu-
ral networks designed to help researchers in NLP by
considering words and their neighbors before and af-
ter them. By their nature, BERTs do not produce a
list of topics but can be used to generate a summary.
Similarly to BERTs, it can be noted that Generative Pre-trained Transformers (GPT) (Brown et al., 2020) handle numerous NLP tasks. Still, they also share the drawback that the weights used within the neural network to build their answers cannot be explained.
Correlated Topic Model (CTM) (Blei and Laf-
ferty, 2006) is derived from LDA and uses another
distribution in order to better capture topics and their
relations within documents. Indeed, when a docu-
ment concerns a theme, it usually talks about some
neighbor topics: a text about travel might talk about
tourism, beaches, and airplanes, but probably not
about fighter jets. Topics are uncorrelated in LDA be-
cause of the Dirichlet distribution, whereas in CTM,
thanks to the logistic normal distribution, topics are
correlated and present links to one another. The pres-
ence of a topic triggers the possibility of finding one
or multiple other topics. CTM also proposes a visual-
ization of topics: each topic is represented in a bubble
of words, and each bubble is linked with the other cor-
related topics.
2.2 Visualization of Topics and
Relevance
The presentation of topics is important as reading a
list might confuse a human: a list forces an order of
reading, which is not always the best one depending
on the context. Visualization of topics is usually made
of tag clouds (Singh et al., 2017)(Lee et al., 2010),
which is convenient but shows terms as a whole block.
Some visualizations like (Singh et al., 2017) or (Gre-
tarsson et al., 2012) show the distinction between top-
ics and the terms composing them. However, if a
document used in the corpus is irrelevant or contains
too many irrelevant parts, the results might be altered
without the user being informed. In addition, each
document must be deeply reviewed to find the dis-
crepancy. Little work has been done to make explicit the relevance of documents to one another based on their topics.
(Assa et al., 1997) presents a relevance map be-
tween topics and keywords in the case of web search
queries. Their proposal uses dimension reduction to
create a 2D map and gravitational forces for placing
nodes. Similarly, VIBE (Olsen et al., 1992)(Olsen
et al., 1993) and its variants (Ahn and Brusilovsky,
2009) create Points of Interest (POI) that act precisely
like topics around which documents are placed based
on relevance. The main drawback is that it does not
highlight the most important common words and top-
ics but only places resources based on their relevance
to all the topics found.
(Fortuna et al., 2005) proposed a map of terms
and documents based on MultiDimensional Scaling
(MDS) methods. Terms and documents are placed
on a two-dimensional map, and the background color
varies based on the density. The main drawback of
this contribution is the difficulty of getting a clear
overview of the documents and terms.
In (Newman et al., 2010), the authors proposed
a topic map after studying topic modeling and 2D
projection. First, they compare three topic model-
ing methods and end by using LDA. Next, they com-
pare three methods of projection, namely Principal
Component Analysis (PCA), Local Linear Embedding
(LLE), and Force Directed Layout (FDL), the latter performing best. The topic map presents documents as nodes col-
ored by their most important topic. Their position de-
pends on their relevance to one another. However, the
authors concluded that the evaluation of visualization
is complex and could be made only by human judg-
ment; in addition, they also stated that maps with a
dozen documents are probably the most accurate and
valuable in understandability and navigation for a hu-
man.
TopicViz (Eisenstein et al., 2012) proposes a vi-
sualization of documents and topics by nodes with
a force-directed layout and, more importantly, inter-
action with the user. The topic modeling method
used is LDA. The user can pin topic nodes in the
workspace, making the documents float based on their
relevance to each topic. Such a map allows one to dis-
tinguish which document is more relevant to one or more topics based on its position. The user can also
pin document nodes, making the topics float between
them. This visualization is particularly interesting be-
cause document and topic nodes can be pinned, allow-
ing it to show relevance. However, the users must pin the nodes themselves in order to see the relevance. Depending on the number of detected topics, deciding where to pin each topic in order to better see the documents' relevance might be difficult.
PaperViz (di Sciascio et al., 2017) is a dedicated
tool for researchers during the paper-gathering step.
It offers multiple views for multiple contexts: tree
hierarchy for search queries, a tag cloud of the 20
most frequent terms, the strength of the relationship
between documents and a search query or a collec-
tion, and references management. The main strength
of PaperViz is its completeness in the user interface.
3 VISUALIZATION PIPELINE
WITH FORMAL CONCEPT
ANALYSIS
The visualization pipeline comprises two main phases
(see Figure 1): semantic pre-processing, where docu-
ments are analyzed in order to produce an occurrence
matrix of terms per documents, and structural anal-
ysis, where the matrix is analyzed in order to create
a graphical representation of the relevance. Glob-
ally, the pipeline relies on natural language processing
(NLP) methods in the first phase and formal concept
analysis (FCA) methods in the second phase.
3.1 Semantic Pre-Processing
The semantic pre-processing phase aims to extract the
most important terms and concepts from each doc-
ument and gather them within a matrix representing
the whole text corpus. This phase is composed of 5
steps as follows:
(PI.0) Selection of Documents by the User: the
user selects the documents for analysis and compari-
son. Two requirements must be fulfilled for the best
results: each document must have enough content,
and the content must be mainly textual.
Figure 1: The main steps of the pipeline.
(PI.1) Extraction of Texts: each document is
transformed into a regular flat text file. This step re-
lies on optical character recognition (OCR) methods.
In our experiments, we used PDFtoText (https://www.xpdfreader.com/pdftotext-man.html) as the OCR tool.
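A minimal sketch of this step, assuming the pdftotext command-line tool is installed and using hypothetical file names, could look as follows:

import subprocess
from pathlib import Path

def extract_text(pdf_path: Path, txt_path: Path) -> None:
    # "pdftotext in.pdf out.txt" writes the flat text of the PDF into out.txt
    subprocess.run(["pdftotext", str(pdf_path), str(txt_path)], check=True)

for pdf in Path("corpus").glob("*.pdf"):       # hypothetical input folder
    extract_text(pdf, pdf.with_suffix(".txt"))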
(PI.2) Cleaning of Extracted Texts: each text is
cleaned in order to increase its quality and reduce its
size, typically by removing the useless spaces and ar-
tifacts that the OCR would have created and, eventu-
ally, some of the stop words. In our experiments, we
used TreeTagger (Schmid, 1994)(Schmid, 1995) with
a custom list of words to keep.
Figure 2: The formal concept analysis (PII.1) sub-steps.
(PI.3) Disambiguation: each cleaned text is
transformed into a list of named entities by resolv-
ing polysemy and synonymy problems. Advanced
NLP methods are required for this task. In our ex-
periment, we used BabelFy (Moro et al., 2014) as it
understands multiple languages and calculates three
scores for each recognized named entity. The named
entities are also transformed into unique references
from BabelNet (Navigli and Ponzetto, 2012), allow-
ing us to manipulate the exact same named entities in
all documents, whatever the input languages are.
(PI.4) Filtering of Terms: the most irrelevant
named entities are removed based on the coherence
score attributed by BabelFy in the previous step. In
our experiments, we require a coherence score of at
least 0.05 to keep a named entity. This threshold was chosen empirically because it removes far more irrelevant named entities than relevant ones.
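The following sketch illustrates steps PI.3 and PI.4 together: it queries the public BabelFy HTTP endpoint and keeps only the named entities whose coherence score reaches 0.05. The endpoint, parameters and response fields reflect the BabelFy API as we understand it and should be checked against its current documentation; the API key is a placeholder.

import requests

API_KEY = "YOUR_BABELFY_KEY"          # placeholder, obtained from babelfy.io
COHERENCE_THRESHOLD = 0.05            # empirical threshold used in step PI.4

def disambiguate(text, lang="FR"):
    resp = requests.get(
        "https://babelfy.io/v1/disambiguate",
        params={"text": text, "lang": lang, "key": API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    annotations = resp.json()
    # Keep named entities whose coherence score passes the threshold and
    # represent each one by its language-independent BabelNet synset id.
    return [a["babelSynsetID"]
            for a in annotations
            if a.get("coherenceScore", 0.0) >= COHERENCE_THRESHOLD]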
3.2 Structural Analysis
The structural analysis phase comprises two steps
that calculate metrics in order to produce the mutual
impact graph showing the relevance of documents.
(PII.1) The formal concept analysis is the first
step, divided into five sub-steps (see Figure 2). Its
objective is to produce the mutual impact matrix be-
tween terms and documents from the occurrence ma-
trix in order to evaluate the relevance of documents.
Normalisation: occurrences of terms per docu-
ments are transformed into proportions in order
to reduce the length disproportions between doc-
uments. Absolute values are converted into per-
centages per line. Thus, all documents are treated
equally.
Transposition: the matrix is transposed in order to
change the point of view from documents charac-
terized by occurrences of terms into terms charac-
terized by their appearances within documents.
Binarisation Strategy: building the lattice re-
quires a formal context, or, in other words, a
binary matrix. Multiple strategies of binariza-
tion exist (also called refinement strategies) (Jaffal
et al., 2015). In our case, we use the most simple
one: the direct strategy that transforms all values
> 0 as 1 and keeps 0 as 0.
Lattice Construction: the formal context repre-
senting terms within documents is used for build-
ing a lattice and its formal concepts (Belohlavek,
2008). A formal concept is a node containing ob-
jects and attributes (at least one of them). In our
case, the objects are terms, and the attributes are
the documents.
Metric Calculus: the lattice is analyzed, and the
mutual impact (Jaffal et al., 2015) metric is cal-
culated by comparing the appearances of pairs of terms and documents within each formal con-
cept. The mutual impact shows how strong the
relationship is between each term and each docu-
ment. This bond is calculated for each term and
each document with the formula:
MI(O_i, A_j) = \frac{|\text{formal concepts containing } O_i \text{ and } A_j|}{|\text{formal concepts containing } O_i \text{ or } A_j|}
where O_i represents a term and A_j represents a document. The output is a mutual impact matrix with a value representing the bond between each term and each document; a toy sketch of these sub-steps is given below.
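The toy sketch below (illustrative Python, not the code of our prototype) chains the sub-steps on a hypothetical occurrence matrix: normalisation, transposition, direct binarisation, a naive enumeration of the formal concepts, and the mutual impact metric.

import numpy as np

documents = ["C1", "C2", "C3"]
terms = ["php", "mysql", "java"]
# Hypothetical occurrences: rows = documents, columns = terms.
occ = np.array([
    [4., 2., 0.],
    [3., 0., 0.],
    [0., 1., 5.],
])

# Normalisation: percentages per line, so every document weighs the same.
norm = occ / occ.sum(axis=1, keepdims=True) * 100

# Transposition + direct binarisation: terms (objects) described by the
# documents (attributes) in which they appear.
ctx = (norm.T > 0)

# Intent of each object (term) = set of documents containing it.
obj_intent = {t: frozenset(d for j, d in enumerate(documents) if ctx[i, j])
              for i, t in enumerate(terms)}

# Every concept intent is an intersection of object intents (plus the full set).
intents = {frozenset(documents)}
for g in obj_intent.values():
    intents |= {g & b for b in intents}

# Each intent determines its extent: the terms whose intent contains it.
concepts = [({t for t, g in obj_intent.items() if b <= g}, b) for b in intents]

def mutual_impact(term, doc):
    both = sum(1 for ext, inte in concepts if term in ext and doc in inte)
    either = sum(1 for ext, inte in concepts if term in ext or doc in inte)
    return both / either if either else 0.0

print(mutual_impact("php", "C1"))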
(PII.2) The construction of the mutual impact
graph is the final step. Its objective is to visualize
on a map the terms and their importance within the
corpus, as well as the documents and their relevance.
The visualization uses the mutual impact matrix as an
adjacency matrix and produces a graph of terms and
documents. We used Gephi (Bastian et al., 2009) with
the ForceAtlas (first version) spatialisation algorithm.
Nodes are moved until they find their optimal position
thanks to attractive and repulsive forces. Nodes are
repulsing each other, and edges attract nodes based
on the values of the edges. Because of the input for-
mat, the visualization is a bipartite graph: a set of
nodes represents documents, and another set repre-
sents terms. As presented in Figure 3, the document nodes are colored in grey and are linked to numerous term nodes. These term nodes, in contrast, are colored based on their number of neighbors (the warmer the color, the more document nodes the term node is linked to). When focusing on the term nodes, we find terms from every document in the center of the map and terms from fewer documents scattered around (terms present in only one document form specific groups for each document, far from the central set). The central set of terms is, in fact, a global view of the text corpus with its keywords. When focusing on the document nodes, we can visually assess the relevance of documents within the corpus by checking how close they are to the central set of terms.
Figure 3: The output map of terms and documents.
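A minimal sketch of step PII.2 (ours, not the prototype's code) assembles the bipartite graph from the mutual impact values with networkx and exports it as a GEXF file that Gephi can open to apply ForceAtlas; mutual_impact, documents and terms are assumed to come from the previous sketch.

import networkx as nx

G = nx.Graph()
for d in documents:
    G.add_node(d, kind="document")          # grey nodes in the output map
for t in terms:
    G.add_node(t, kind="term")              # coloured by number of neighbours

for t in terms:
    for d in documents:
        w = mutual_impact(t, d)
        if w > 0:                           # edge weight = strength of the bond
            G.add_edge(t, d, weight=w)

nx.write_gexf(G, "mutual_impact_graph.gexf")  # open in Gephi, apply ForceAtlas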
4 EXPERIMENT
4.1 Scenarios of Evaluation
A proof of concept (Elliott, 2021) has been realized
with a specific scenario to check the pipeline’s valid-
ity and properties. A prototype has been developed (https://github.com/metalbobinou/CREA-phd) and used in three proof-of-concept demonstrations. A first case aims to visualize the content of 9 PHP courses. The mutual impact graph is visualized to
check the corpus’ main keywords and the documents’
relevance. A second case introduces a Java course in
the corpus. The validity of the mutual impact graph
is checked by comparing the first and second cases:
the Java course should be the most irrelevant because
it is not specialized in PHP. A third case is presented
in order to check how correcting a document impacts
the results.
In the regular case, 9 PHP courses in French are
processed through the pipeline. These courses present
web development with HTML, PHP, and MySQL for
beginners. They are denoted as C1-C9 in the follow-
ing figures. 6 of these courses are made of slides (C1,
C2, C4, C7, C8, C9), and 3 are made of regular texts
(C3, C5, C6). A Java course (CJA), also in French and in text format, is later introduced to check how the mutual impact graph behaves when some errors appear. The third ex-
periment only works on courses in text format. There-
fore, we introduced 4 more PHP courses in text for-
mat (C11, C15, C17, C19) in order to avoid an imbalance and keep the number of input documents close to that of the other cases.
4.2 General View of the Corpus’
Content
The documents are processed through phase I, and the
corpus is transformed into a matrix of occurrences.
The mutual impact graph is generated using the direct
strategy in step PII.1.
Figure 4 shows terms as nodes colored following a
cold-warm schema. Red nodes are terms occurring in
all the documents, orange nodes are terms occurring
in all the documents minus one, etc. In the regular
case, the terms in red nodes are: post, nombre (number), méthode post (post method), fichier (file), code, donnée (data), langage (language), site, php, navigateur (browser). These terms are typical of a course on web development with PHP. They are extended with the terms in orange nodes like base de donnée (database) or even foreach, which are also typical for
a website in PHP that uses a database. Terms that
are present in fewer documents but still in more than
half of the corpus (the yellow and green nodes) are
also typical of web development for nearly all of them
(session, mysql, utilisateur (user), ...).
Figure 4: Central set terms of the mutual impact graph in
regular case.
Figure 5: Central set terms of the mutual impact graph in
Java case.
In Figure 5, the Java course is added to the corpus.
The terms in red nodes are the same as in Figure 4,
except php which becomes an orange node. This be-
havior is expected as the Java course does not dis-
cuss PHP; therefore, one document does not include
it. The terms in other colors are still relevant as they
mainly concern client-server programming, OOP vo-
cabulary, and similar topics. It can be noted that an-
other color is introduced in order to show that an ad-
ditional document is present. Nearly all of the terms
in green in the first case are now light green. In the
entire graph, some of the terms in green are repulsed
into dark green nodes (meaning they are missing from
one more document). However, a majority of terms
from the first case are still present in the second case
with the same number of document edges.
4.3 Relevance of Documents
Figure 6: Relevance of documents in regular case.
Figure 7: Relevance of documents in Java case.
Figures 6 and 7 represent the relevance of documents
in two cases: the regular case with only PHP courses,
and the Java case with the additional Java course. The
relevance’s visualization is also produced from the
mutual impact graph, except the focus is mainly on
the grey nodes representing documents.
In the regular case of PHP courses (Figure 6), the
documents in the slide format are closer to the central
set than the text ones. In the Java case (Figure 7), the Java document (CJA) is the most distant from the central set. It can be noted that C6 is as distant from the cen-
tral set as CJA. This discrepancy is explained by the
fact that the C6 document contains not only a PHP
course but also reports of students’ projects in more
than half of the document. These reports discuss var-
ious business problems that require a website (online
shoe store, online music store, etc). Therefore, the
document is not exactly a pure PHP course like the
others.
In order to test how the mutual impact graph re-
acts when a document is corrected, we compared mul-
tiple cases of courses while correcting one of them.
Document C6 is written in three parts with nearly the
same amount of pages: the regular PHP course, the re-
ports of the students’ projects, and an advanced PHP
course. For the experiment, we used PHP courses in
text format only (in order to avoid the effect of mix-
ing slide and text formats), and we corrected docu-
ment C6 by removing the students’ projects first and,
later, the advanced chapters. The experiment was also
done twice, with and without the Java course, in order
to have a better view of the effect of correction on a
corpus with and without an irrelevant course.
In the regular case without any modification (Figure 8), C6 is the clearest outlier in both the regular and Java cases. In the Java case, CJA is the sole document nearly as far away as C6. After removing the students' projects from C6 (Figure 9), it becomes one of the closest documents to the central set in both cases, indicating that it became much more relevant to the corpus than previously. In the Java case, CJA becomes the clearest outlier, confirming its irrelevance. Finally, when the advanced chapters are also removed from C6 (Figure 10), it becomes the closest to the central set in the Java case, and the second closest in the regular case. C11 and C15 become the furthest outliers in the regular case. In the Java case, CJA and C11 are the furthest outliers.
In each figure from 8 to 10, document C6 moves closer to the central set thanks to the corrections. It can be noted that CJA stays the furthest outlier in every case, and the other documents move slightly away. Therefore, the corrections show improvements in the positioning of C6 while preserving the irrelevance of CJA.
Figure 8: Relevance of text documents (C6 intact).
Figure 9: Relevance of text documents (no student
projects).
Figure 10: Relevance of text documents in the regular case
(no student projects, nor advanced chapters).
5 DISCUSSION
In the mutual impact graph of the regular case (Fig-
ure 4), the terms present in all of the documents, shown as red nodes at the center of the graph, are relevant to the content of the text corpus, and they also produce a clear summary of the main keywords.
ring of colors around the central set adds more terms
relevant to the corpus. When the Java course is in-
troduced (Figure 5), the terms in all the documents
are nearly the same: only php is a bit repulsed. As
the Java course also mainly talks about programming
and partially about web development with dedicated
frameworks, the results are nearly unchanged, which
is the expected behavior. The mutual impact graph
with the terms lets a user quickly get an idea of the
subject and keywords composing the corpus. It can
be used to quickly discover a new academic field and
find the keywords that best describe it. Another usage
for this graph would be to help build a book’s index:
the keywords are highlighted, and the author selects
which words to keep or remove.
Concerning the relevance of documents in Fig-
ure 7, the Java course is, together with C6 (which contains a lot of unwanted content), the most distant one in the first experiment. This behavior is perfect for the cur-
rent case: the teacher who would like to use exist-
ing documents is warned that C6 and CJA should be
checked more precisely in order to detect if their con-
tent is relevant. Even if the text documents are distant
from the slide documents, the most irrelevant ones
are far away, allowing the user to measure the rele-
vance visually. In the third experiment, the relevance
of document C6 is greatly improved because of the
corrections applied while keeping the irrelevant doc-
ument CJA far away. It must be noted that the shape
of the graph changes because it reflects the relevance
of each document relative to the general relevance of
the whole corpus. The mutual impact graph is a pic-
ture of the corpus as a whole: it is not a graph about
the relevance of one or some documents against one
or some other documents. Multiple use cases can be
derived from the mutual impact graph. The graph
would help users select the best documents about one
or more topics or remove the most irrelevant ones.
Another usage would be for a teacher to compare their
own course with the existing ones or even with re-
search articles to check how close it is to the state of
the art.
The global results show that FCA, with the mu-
tual impact measure, can highlight a corpus’ main
terms and even show its documents’ relevance. It
does not create a list of topics nor calculate the prob-
ability of each term being included in a topic like
LDA, but it does reflect the importance of each term
for the whole corpus. This behavior is expected by
the nature of FCA: it differs from statistical data
analysis in that the emphasis is on recognizing and
generalizing structural similarities, such as set inclu-
sion relation from the data description, and not on
mathematical manipulations of probability distribu-
tions (Carpineto and Romano, 2004). However, we
acknowledge that BabelFy does participate actively
in the process of topic modeling by recognizing the
named entities and evaluating their relevance to each
portion of the text. It must be noted that the con-
struction of the formal context also filters some terms.
FCA (formal context’s construction + metrics calcu-
lus) and BabelFy must be used together, or at least,
FCA with an entity linking tool. Another limitation
concerns the fact that the documents used in the ex-
periments were written in French and with the goal
of teaching. In our opinion, language shouldn’t be a
problem as BabelFy is well trained for recognizing a
lot of languages, but mixing different documents writ-
ten with distinct goals or distinct formats might affect
the quality of the resulting map, similarly to what can be seen in Figure 6, where the documents in slide format are closer to the central set than the text ones.
6 CONCLUSION
In this paper, we proposed a visualization pipeline for
textual corpora analysis based on FCA instead of the
usual LDA for the topic modeling step. Mutual im-
pact was used within FCA in order to produce a ma-
trix for the force-based graph algorithm. The pipeline
produces a map that can be used in two ways:
- The main keywords are placed by order of importance, allowing the reader to quickly get an idea of the topics contained in the corpus.
- Documents are placed based on their relevance to the keywords found, allowing the reader to spot a possible discrepancy in the chosen texts.
The map presents a visualization of the corpus as
a whole. Removing a document impacts the visual-
ization because of the absence of a node and because
the topic modeling step does not work on the same
texts. To evaluate these claims, we presented a case
study on multiple PHP courses and an intruder Java
course (also about programming). First, a map dis-
played the most important keywords and the varia-
tions with and without the Java course. Then, one of
the PHP courses containing more than half of its text
about out-of-scope topics was corrected, showing a significant improvement in the output.
We consider FCA an exciting method for topic
modeling and expect to try other metrics on the lat-
tice in order to find more possible usages. Multiple
usages and combinations have already been proposed
in (Poelmans et al., 2013), but we expect to use the
conceptual similarity metric (Jaffal et al., 2015) for
an even more precise combination of terms. Also, a
deeper comparison with LDA and other newer meth-
ods like neural networks might be interesting as the
construction of the results does not rely on probabili-
ties and is perfectly transparent thanks to the set the-
ory behind FCA. Numerous applications in digital hu-
manities can be considered with FCA because trans-
parency, and therefore explainability, are already part
of its design, which highlights the relationships between objects.
REFERENCES
Ahn, J.-w. and Brusilovsky, P. (2009). Adaptive visualiza-
tion of search results: Bringing user models to visual
analytics. Information Visualization, 8(3):167–179.
Akhtar, N., Javed, H., and Ahmad, T. (2019). Hierarchical
summarization of text documents using topic model-
ing and formal concept analysis. In Data Manage-
ment, Analytics and Innovation: Proceedings of ICD-
MAI 2018, Volume 2, pages 21–33. Springer.
Alghamdi, R. and Alfalqi, K. (2015). A survey of topic
modeling in text mining. Int. J. Adv. Comput. Sci.
Appl.(IJACSA), 6(1).
Andrienko, G., Andrienko, N., Drucker, S. M., Fekete, J.-
D., Fisher, D., Idreos, S., Kraska, T., Li, G., Ma, K.-
L., Mackinlay, J. D., et al. (2020). Big data visual-
ization and analytics: Future research challenges and
emerging applications. BigVis 2020: Big data visual
exploration and analytics.
Assa, J., Cohen-Or, D., and Milo, T. (1997). Displaying
data in multidimensional relevance space with 2d vi-
sualization maps. In Proceedings. Visualization’97
(Cat. No. 97CB36155), pages 127–134. IEEE.
Bastian, M., Heymann, S., and Jacomy, M. (2009). Gephi:
an open source software for exploring and manipulat-
ing networks. In Third international AAAI conference
on weblogs and social media.
Belohlavek, R. (2008). Introduction to formal concept anal-
ysis. Palacky University, Department of Computer
Science, Olomouc, 47.
Blei, D. and Lafferty, J. (2006). Correlated topic models.
Advances in neural information processing systems,
18:147.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent
dirichlet allocation. Journal of machine Learning re-
search, 3(Jan):993–1022.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D.,
Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
Askell, A., et al. (2020). Language models are few-
shot learners. Advances in neural information pro-
cessing systems, 33:1877–1901.
Carpineto, C. and Romano, G. (2004). Concept data anal-
ysis: Theory and applications. John Wiley & Sons.
De Sisto, M., Hernández-Lorenzo, L., De la Rosa, J., Ros, S., and González-Blanco, E. (2024). Understanding poetry using natural language processing tools: a survey. Digital Scholarship in the Humanities, page fqae001.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer,
T. K., and Harshman, R. (1990). Indexing by latent
semantic analysis. Journal of the American society
for information science, 41(6):391–407.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2018). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. arXiv preprint
arXiv:1810.04805.
di Sciascio, C., Mayr, L., and Veas, E. (2017). Exploring and summarizing document collections with multiple
coordinated views. In Proceedings of the 2017 ACM
Workshop on Exploratory Search and Interactive Data
Analytics, pages 41–48.
Eisenstein, J., Chau, D. H., Kittur, A., and Xing, E. (2012).
Topicviz: Interactive topic exploration in document
collections. In CHI’12 Extended Abstracts on Human
Factors in Computing Systems, pages 2177–2182.
Elliott, S. (2021). Proof of concept research. Philosophy of
Science, 88(2):258–280.
Engebretsen, M. and Kennedy, H. (2020). Data visualiza-
tion in society. Amsterdam university press.
Fortuna, B., Grobelnik, M., and Mladenic, D. (2005). Visu-
alization of text document corpus. Informatica, 29(4).
Ganter, B. and Wille, R. (2012). Formal concept analysis:
mathematical foundations. Springer Science & Busi-
ness Media.
George, L. and Sumathy, P. (2023). An integrated clus-
tering and bert framework for improved topic model-
ing. International Journal of Information Technology,
15(4):2187–2195.
Gretarsson, B., O'Donovan, J., Bostandjiev, S., Höllerer, T.,
Asuncion, A., Newman, D., and Smyth, P. (2012).
Topicnets: Visual analysis of large text corpora with
topic modeling. ACM Transactions on Intelligent Sys-
tems and Technology (TIST), 3(2):1–26.
Hambarde, K. A. and Proenca, H. (2023). Information re-
trieval: recent advances and beyond. IEEE Access.
Hofmann, T. (1999). Probabilistic latent semantic index-
ing. In Proceedings of the 22nd annual international
ACM SIGIR conference on Research and development
in information retrieval, pages 50–57.
Jaffal, A., Le Grand, B., and Kirsch-Pinheiro, M. (2015).
Refinement strategies for correlating context and user
behavior in pervasive information systems. Procedia
Computer Science, 52:1040–1046.
Jänicke, S., Franzini, G., Cheema, M. F., and Scheuermann,
G. (2015). On close and distant reading in digital hu-
manities: A survey and future challenges. EuroVis
(STARs), 2015:83–103.
Kherwa, P. and Bansal, P. (2019). Topic modeling: a
comprehensive review. EAI Endorsed transactions on
scalable information systems, 7(24).
Lee, B., Riche, N. H., Karlson, A. K., and Carpendale,
S. (2010). Sparkclouds: Visualizing trends in tag
clouds. IEEE transactions on visualization and com-
puter graphics, 16(6):1182–1189.
Menhour, H., Şahin, H. B., Sarıkaya, R. N., Aktaş, M., Sağlam, R., Ekinci, E., and Eken, S. (2023).
Searchable turkish ocred historical newspaper collec-
tion 1928–1942. Journal of Information Science,
49(2):335–347.
Moro, A., Raganato, A., and Navigli, R. (2014). Entity
linking meets word sense disambiguation: a unified
approach. Transactions of the Association for Com-
putational Linguistics, 2:231–244.
Navigli, R. and Ponzetto, S. P. (2012). Babelnet: The au-
tomatic construction, evaluation and application of a
wide-coverage multilingual semantic network. Artifi-
cial Intelligence, 193:217–250.
Newman, D., Baldwin, T., Cavedon, L., Huang, E., Karimi,
S., Martinez, D., Scholer, F., and Zobel, J. (2010). Vi-
sualizing search results and document collections us-
ing topic maps. Web Semantics: Science, Services and
Agents on the World Wide Web, 8(2-3):169–175.
North, K. and Kumta, G. (2018). Knowledge manage-
ment: Value creation through organizational learning.
Springer.
Olsen, K. A., Korfhage, R. R., Sochats, K. M., Spring,
M. B., and Williams, J. G. (1993). Visualization of
a document collection: The vibe system. Information
Processing & Management, 29(1):69–81.
Olsen, K. A., Williams, J. G., Sochats, K. M., and Hirtle,
S. C. (1992). Ideation through visualization: the vibe
system. Multimedia Review, 3:48–48.
Padilla, L. M., Creem-Regehr, S. H., Hegarty, M., and Ste-
fanucci, J. K. (2018). Decision making with visualiza-
tions: a cognitive framework across disciplines. Cog-
nitive research: principles and implications, 3:1–25.
Peinelt, N., Nguyen, D., and Liakata, M. (2020). tbert:
Topic models and bert joining forces for semantic sim-
ilarity detection. In Proceedings of the 58th annual
meeting of the association for computational linguis-
tics, pages 7047–7055.
Poelmans, J., Kuznetsov, S. O., Ignatov, D. I., and Dedene,
G. (2013). Formal concept analysis in knowledge pro-
cessing: A survey on models and techniques. Expert
systems with applications, 40(16):6601–6623.
Salton, G. (1983). Introduction to modern information re-
trieval. McGraw-Hill.
Schmid, H. (1994). Probabilistic part-of-speech tagging us-
ing decision trees. In New methods in language pro-
cessing, page 154.
Schmid, H. (1995). Improvements in part-of-speech tagging
with an application to german. In In Proceedings of
the ACL SIGDAT-Workshop, pages 47–50.
Singh, J., Zerr, S., and Siersdorfer, S. (2017). Structure-
aware visualization of text corpora. In Proceedings of
the 2017 Conference on Conference Human Informa-
tion Interaction and Retrieval, pages 107–116.