Autonomous Semantic Structuring of Lecture Topics

Synthesis of Knowledge Models

Robin Nicolay, Nikolaj Troels Graf von Malotky, Tanja Auge and Alke Martens

University of Rostock, Institute of Computer Science, Chair of Practical Informatics, 18051 Rostock, Germany

robin.nicolay@uni-rostock.de, nikolaj.graf von malotky@uni-rostock.de

Keywords:

Latent Dirichlet Allocation, Topic Models, Mental Models, Knowledge Management, Force-directed Algorithms.

Abstract:

Students attending university lectures often have a weak structural awareness of the lecture content. According to learning theories, structural awareness is a relevant factor for the association and comprehension of new learning inputs. We synthesize semantic structures from non-annotated lecture slides, using topic modeling algorithms to identify relevant terms and relating them in force-directed graphs. The synthesized graphs provide a structural overview of the topic distribution and relations of non-annotated sequential lecture slides.

1 INTRODUCTION

University teachers, also known as lecturers, use lectures to teach facts and concepts to students. From the perspective of the teacher, in the preparation phase the relevant information of a lecture is analyzed, reduced and sequenced into learning units, and then taught in a way that is comparable to leading students along a path through a constrained area of knowledge. During a course of study, a student learns from many different lectures or knowledge units. Some of these units reference each other; some partially overlap or describe common topics from different perspectives. Ideally, students would identify and understand the key concepts together with their relations and their hierarchy or topology. In reality, the situation is often different: even if the teacher usually prepares a lecture by sequencing the important facts of an overall topic in a logical order, in quite a lot of cases lectures have grown evolutionarily over time. As a result, the key concepts of a lecture are somewhat hidden in the text. From the perspective of the teacher, this is not so bad, as the overall picture is not affected. However, in exams we observed that students often miss or misinterpret important items. The situation grows even worse when students are asked to detect relations between different topics, i.e. different lectures in different semesters. This seems to be a cognitive step which is not directly supported in current lecture formats. On the teacher's side, the topics or areas of study are seldom interconnected. As a result, students usually miss the overall picture.

To develop a solution, i.e. mechanisms and tools that support students' knowledge construction in lectures, we have to take a look at learning psychology. Learning theories such as cognitivism describe the inner effects of processing lecture information inputs by using cognitive models (lernpsychologie.net, 2016). To understand how processes of knowledge construction could potentially take place, we conducted several investigations together with our students.

Our work started a while ago, when we developed a tool for extracting knowledge from lectures with the goal of supporting students' annotation of lectures and their learning (Nicolay et al., 2015). We claim that understanding the main concepts and their interrelations is the key to understanding the lecture as a first step and to understanding the overall picture of a field of study as a second step. To support this claim, we investigated students' intuitive methods of knowledge construction.

Our main study took place last semester, when we conducted a structured investigation in a first-year master's course with 20 students from different departments. They were organized in teams of two to four (mixed male and female). They had a free choice of material (either digital or not) and a free choice of the most important topics. Their task was to identify areas of knowledge and relations in self-designed knowledge

Nicolay, R., Malotky, N., Auge, T. and Martens, A.

Autonomous Semantic Structuring of Lecture Topics - Synthesis of Knowledge Models.

DOI: 10.5220/0006367903490355

In Proceedings of the 9th International Conference on Computer Supported Education (CSEDU 2017) - Volume 2, pages 349-355

ISBN: 978-989-758-240-0


management structures. We asked students how they identify keywords, how they relate and organize them, and how they identify clusters or areas of topics. They had to show and explain the results, e.g. in the form of a poster or slides. In addition to keywords, relations and clusters, they had to show us in a next step which system they use (e.g. mindmaps, tables, other graphs or trees). More information about this part of our study and the related insights is given in section 2. After we gained insight into the students' ideas of organizing lecture knowledge, we applied it to our tool described in sections 3.1 and 3.2. We added a mathematical algorithm to visualize the results using force-directed graphs, described in section 4, which allows the further inference of insights described in section 5.

2 STUDENTS' INTUITIVE MANAGEMENT OF KNOWLEDGE

We asked students how they process or recap their past study courses. The participants were 20 students from different fields of study, such as mathematics, shipbuilding, teacher training and computer science. All students had recently finished their bachelor's degree and attended some first master's lectures. The students were between 24 and 31 years old. We asked the students to invent a concept with which they can manage and organize their experience of past courses. They developed several different semantic knowledge structures consisting of keywords and relations. We did not find significant differences between male and female students. The following paragraphs summarize the outcomes in particular areas.

Students identified relevant topics and keywords in two ways: first, by recalling the most relevant terms from memory; second, by identifying keywords using past lecture material. The results of both techniques were almost equal. While the memorized terms showed a tendency towards specialization on aspects that were interesting and relevant for the individual student, the topics identified from lecture material showed a broader coverage of the study course and reflected the characteristics of the curriculum. All groups defined different hierarchies between terms. We identified two main levels and call them thematic area terms, for higher, more general terms, and technical terms, for lower, more specific terms. Students used thematic area terms to group the more specific technical terms into domains.

While students put a lot of work into the specification and definition of keywords, they neglected the specification of the relations. Students intuitively defined relations to denote inheritance dependencies between area keywords and technical terms. Procedural descriptions between keywords, as used in Concept Maps defined in (Cañas et al., 2004), were not used. Students relate terms by declarative Associations. To map learning resources, they link learning materials as Occurrences to Topics. This intuitive approach is very similar to the concept of Topic Maps defined in (Marius et al., 2008).

For a cognitive description of relations between important keywords, students needed to handle many thematic areas and cross references between technical terms. We identified two main approaches to laying out the identified keywords.

The first approach, shown in figure 1, defined several thematic areas. These area keywords were then surrounded by the technical terms mapped to these areas. At first glance, the first approach was similar to Mindmaps. However, students noted that the hierarchy of a Mindmap lacks decentralization and a clean cross referencing of dependencies between more than one thematic area and a single technical term.

Figure 1: Example of student’s analog Mindmap approach

to organize learned knowledge (S. Brossmann, T. Auge).

The second approach, shown in figure 2, provided an interesting way to deal with the cross referencing of keywords. Students used dark colored stickers to denote thematic area keywords. In the middle is the general thematic area. Around it, more specialized thematic areas matching this general area are placed at equal spacing. All technical terms were then placed with a rule in mind: the level of dependency of a technical term on a thematic area is described by its distance. That means technical terms in the middle of the circle did not have a specific affiliation, while specialized technical terms are placed close to the corresponding thematic area. As a result, words with an affiliation to a thematic area built a lobe from the center of the circle to the thematic area keyword.

CSEDU 2017 - 9th International Conference on Computer Supported Education


Figure 2: Example of a student's analog approach to organize learned knowledge (J. Abel, S. Arth, T. Baudis, K. Klohs).

Students claimed that the structure of a lecture is not visible to a listener during lectures. Difficulties arise in the identification of relevant topics, the inference of relations between topics, relations between topics and lecture material, as well as relations between lectures themselves. Information in lectures is not presented in a structured way but as a sequence of sparsely annotated lecture slides. An option would be to provide a structured, semantic overview of lecture topics on a high abstraction level. This overview, given by the lecturer, could be used by the students to follow the structure and relations of topics during a lecture while individually adding received stimuli and associations to the provided skeleton.

In the next sections we describe our approach to analyze sequential lecture material and synthesize the identified structures shown in figures 1 and 2.

3 AUTOMATIC EXAMINATION OF LECTURE'S TOPIC AREAS AND SEMANTIC STRUCTURES

This section describes our system, which consists of two parts: first, a system based on a former approach to extract semantic topics from non-annotated lecture slides, described in detail in (Nicolay et al., 2016); second, the identification of relations and semantic structures between topics, keywords etc.

3.1 Inference of Topics from Lecture Slides

To synthesize a model that approximates the knowledge structure a lecturer intends students to have after attending a lecture, we analyze the text on the presented lecture slides. We found in (Nicolay et al., 2016) that the text on lecture slides allows us to identify semantic topics or thematic areas during a lecture. Further steps will include a synchronized transcription of the lecturer's verbal explanations during a lecture. The following paragraphs provide a short overview of the algorithms used.

Preparation of Text and Vocabulary. To prepare and clean up the text on lecture slides, we implemented a set of filters described below. To identify semantic topics from unstructured text, we used the statistical topic modeling algorithm Latent Dirichlet Allocation (LDA) introduced by Blei (Blei et al., 2003), implemented using the Markov chain Monte Carlo (MCMC) Gibbs sampling algorithm (Griffiths and Steyvers, 2004).

First, we describe the implemented set of filters. The system extracts all text without special characters from the slides of a lecture and splits the text into single words. Then the system removes grammatical variations that have a minor impact on semantic meaning by stemming (Yoshiki Shibukawa, 2015) and converts all words to lowercase. Additionally, it removes stopwords (Porter et al., 2015), short words with fewer than 3 letters, numbers, and words with a high occurrence on many different slides (e.g. repeating footers). Then all words are added to a word-slide adjacency matrix. Using the filters, the average number of different words (i.e. the vocabulary dimension of the matrix) is reduced by 25 percent.
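The filter chain described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the stopword set `STOPWORDS` is a tiny hypothetical stand-in for the NLTK stopwords corpus, the `stem` argument defaults to the identity function where a Snowball stemmer would be plugged in, and the cut-off `max_slide_ratio` for words occurring on too many slides is an assumed parameter whose value is not given in the text.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (stand-in for a full corpus).
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "from"}


def tokenize(slide_text):
    """Lowercase the text and keep only runs of the letters a-z.

    This drops special characters and numbers; note that it also drops
    non-ASCII letters, which a real pipeline would have to handle.
    """
    return re.findall(r"[a-z]+", slide_text.lower())


def filter_slides(slide_texts, max_slide_ratio=0.5, stem=lambda word: word):
    """Apply the filters to every slide and return one word list per slide."""
    tokenized = [
        [stem(w) for w in tokenize(text) if len(w) >= 3 and w not in STOPWORDS]
        for text in slide_texts
    ]
    # Count on how many different slides each word occurs.
    slide_frequency = Counter(w for words in tokenized for w in set(words))
    limit = max_slide_ratio * len(slide_texts)
    # Remove words that occur on too many slides (e.g. repeating footers).
    return [[w for w in words if slide_frequency[w] <= limit] for words in tokenized]


def word_slide_matrix(filtered_slides):
    """Return the sorted vocabulary and a matrix with one row per word and
    one column per slide, holding occurrence counts."""
    vocabulary = sorted({w for words in filtered_slides for w in words})
    counts = [Counter(words) for words in filtered_slides]
    matrix = [[c[w] for c in counts] for w in vocabulary]
    return vocabulary, matrix
```

The footer filter is expressed as a ratio of slides so that the same setting can be reused for lectures of different length.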

Common lecture slides contain only a small amount of text. As shown in (Wallach et al., 2009), the use of Dirichlet priors introduced by LDA improves the quality and convergence of inference on texts with


a lower number of words considerably. Nevertheless, we looked into ways of interpreting layout information and structure on lecture slides to improve the inference with additional meta information. First, yet to be published results show that outline slides containing meaningful headings of a lecture indicate the number and order of thematic areas. Further, correlating the size of words with their importance, added as a weight function to the Gibbs implementation, improves the separation between topics and their word distributions.

On the other hand, the arrangement of text on lecture slides is not easy for automatic algorithms to interpret. Factors such as layout templates, images and aesthetics influence the position of words on a slide. Here LDA relies on a "bag of words" assumption and does not infer information from the position of words within a text. As stated in (Blei, 2012, p. 82), this would be a disadvantage for synthesizing natural speech, but is appropriate for semantic analysis.

Identifying Topics with Latent Dirichlet Allocation. In our approach LDA infers a fixed number of topics $T = \{t_1, t_2, \ldots, t_{|T|}\}$ from a set of slides S. As mentioned above, we keep the separation of lecture information into separate slides. The filtered, relevant subset of occurring words is V. LDA provides a discrete distribution θ(s) denoting the proportion of every topic t ∈ T on slide s ∈ S. Every topic t is defined by a discrete distribution β(t) over the probability of occurrence of every word w ∈ V. The values of both θ(s) and β(t) sum to one.
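To make the roles of θ(s) and β(t) concrete, the following is a minimal sketch of a collapsed Gibbs sampler for LDA in plain Python. It illustrates the general technique only and is not the implementation used in the study: the function name `lda_gibbs`, the symmetric prior values `alpha` and `eta`, and the number of iterations are assumptions chosen for illustration, and the word-size weighting mentioned earlier is not included.

```python
import random


def lda_gibbs(slides, num_topics, alpha=0.1, eta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    slides is a list of word lists (one list per slide). Returns the sorted
    vocabulary, theta (one topic distribution per slide) and beta (one word
    distribution per topic).
    """
    rng = random.Random(seed)
    vocab = sorted({w for slide in slides for w in slide})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    slide_topic = [[0] * num_topics for _ in slides]            # words of slide s in topic t
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # count of word w in topic t
    topic_total = [0] * num_topics                              # all words in topic t

    # Random initial topic assignment for every word occurrence.
    assignments = []
    for s, slide in enumerate(slides):
        slide_assignment = []
        for w in slide:
            t = rng.randrange(num_topics)
            slide_assignment.append(t)
            slide_topic[s][t] += 1
            topic_word[t][word_id[w]] += 1
            topic_total[t] += 1
        assignments.append(slide_assignment)

    # Repeatedly resample the topic of each word given all other assignments.
    for _ in range(iterations):
        for s, slide in enumerate(slides):
            for i, w in enumerate(slide):
                wid = word_id[w]
                old = assignments[s][i]
                slide_topic[s][old] -= 1
                topic_word[old][wid] -= 1
                topic_total[old] -= 1
                weights = [
                    (slide_topic[s][t] + alpha)
                    * (topic_word[t][wid] + eta)
                    / (topic_total[t] + vocab_size * eta)
                    for t in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[s][i] = new
                slide_topic[s][new] += 1
                topic_word[new][wid] += 1
                topic_total[new] += 1

    # Smoothed estimates; every row of theta and beta sums to one.
    theta = [
        [(slide_topic[s][t] + alpha) / (len(slide) + num_topics * alpha)
         for t in range(num_topics)]
        for s, slide in enumerate(slides)
    ]
    beta = [
        [(topic_word[t][w] + eta) / (topic_total[t] + vocab_size * eta)
         for w in range(vocab_size)]
        for t in range(num_topics)
    ]
    return vocab, theta, beta
```

Because the sampler is randomized, the topic indices carry no meaning of their own; only the resulting distributions do.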

Figure 3: Intensity of 6 different topics t ∈ T to the slides

s ∈ S used during a lecture.

To give an impression of the result, figure 3 shows the inferred distribution of six topics over a set of 43 lecture slides on Learning Theories at the University of Rostock. The stacked bars in this figure show the proportion θ(s) of every topic on a slide. The topics Topic0 to Topic5 are named automatically by the sampling algorithm and are defined only by the probability of occurrence β(t) of semantically relevant words, shown as an adjacency matrix in figure 4. An MCMC sampling algorithm uses a high number of randomly generated samples as the initial distribution of topics on every lecture slide and corrects the topics over a high number of iterative fitting steps. Notably, the algorithm does not include knowledge about lecture slides being next to each other. Nevertheless, figure 3 shows a meaningful distribution of topics during a lecture. Nearby slides tend to share the same topic. Furthermore, topics show a trend to fade in and fade out during a lecture. In conclusion, results on real data indicate a well-known pedagogical distribution of topics over a sequence of lecture slides.
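The observation that nearby slides tend to share the same topic can be checked numerically. The sketch below is one possible measure and is an assumption of this illustration rather than a metric used in the study: it takes the topic with the highest proportion in θ(s) as the dominant topic of a slide and reports the fraction of consecutive slide pairs whose dominant topics agree. The function names are hypothetical.

```python
def dominant_topics(theta):
    """Index of the topic with the highest proportion on each slide."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]


def neighbour_agreement(theta):
    """Fraction of consecutive slide pairs that share their dominant topic."""
    dominant = dominant_topics(theta)
    pairs = list(zip(dominant, dominant[1:]))
    if not pairs:
        return 0.0  # fewer than two slides: nothing to compare
    return sum(a == b for a, b in pairs) / len(pairs)
```

A value close to one would indicate long runs of slides on the same topic, even though the sampler itself knows nothing about slide order.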

Figure 4: An excerpt of β(t): the probability of occurrence of vocabulary words w ∈ V in lecture topics t ∈ T.

3.2 Relating Slides by Topics

In section 3.1, we explained our approach to identify single topics in lecture slides. In (Nicolay et al., 2016), we described a first approach to connect slides that share the same topics with each other. The connections identified on the basis of common topics are meaningful. Figure 5 shows that nearby slides have a higher proportion of shared topics. Only a few cross references bridge the chord diagram. Accordingly, only a few slides lying far apart in a lecture have high proportions of a common topic.


Figure 5: Relations between slides of a talk based on joined

topic assignments. Slides are arranged clockwise at the

outer ring of the graph. Lines indicate a shared topic of

over 65% for connected slides. (Nicolay et al., 2016).
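A connection rule of this kind can be sketched as follows. The exact criterion is not spelled out in the text, so this sketch assumes that two slides are linked when the same topic exceeds the threshold on both of them; the function name `shared_topic_links` and this reading of the 65% threshold are assumptions.

```python
from itertools import combinations


def shared_topic_links(theta, threshold=0.65):
    """Return (slide_a, slide_b, topic) triples for all slide pairs in which
    the same topic exceeds `threshold` on both slides.

    theta[s][t] is the proportion of topic t on slide s.
    """
    links = []
    for (a, row_a), (b, row_b) in combinations(enumerate(theta), 2):
        for topic, (p_a, p_b) in enumerate(zip(row_a, row_b)):
            if p_a > threshold and p_b > threshold:
                links.append((a, b, topic))
    return links
```

With a threshold above 0.5 at most one topic per slide can qualify, so every pair of slides yields at most one link.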

4 MAPPING OF TOPICS AND KEYWORDS

To fulfill the requirements indicated by the student models described in section 2, we need to assign technical terms to thematic areas by relations based on affiliations. Therefore, we decided to interpret the affiliations of technical keywords to thematic areas as forces and examined two force-directed approaches: first, the circular approach, in which we place thematic area keywords around technical terms, as seen in figure 2; second, an extension of the students' Mindmap approach by the proximity concepts of the circular approach.

To graphically depict the results of LDA, we used force-directed graph models. The identified abstract topics T were interpreted as abstract thematic clusters. Their centers are shown as white points. The elements of the filtered vocabulary V form the set of technical terms, which is not filtered any further, indicated by smaller gray points. The distribution β(t) defines the proportion of every technical keyword w ∈ V to every topic t ∈ T.

Now that the sets of technical keywords and thematic areas are defined, we need to convert β(t) (the knowledge of the affiliation of technical terms to thematic areas) to positions indicating proximity to thematic areas. Therefore we built a force-directed graph containing vertices K = T ∪ V and edges E = {(w, t) | w ∈ V, t ∈ T}, one for every entry of β(t). The edges are weighted by the level of proportion defined in β(t). Forces on the graph are given by the weight of edges as an attraction strength factor, a repulsion strength between two vertices and a gravity towards the center. The learning resources (not shown in the following graphs) can be related to thematic areas by θ(s), outlined in figure 3. Figures 6 and 7 both represent the network as a graph drawing, with differing success.
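The construction of the weighted word-topic graph can be sketched as follows. The helper names, the `min_weight` cut-off and the topic labels of the form `Topic0` are assumptions made for this illustration. The CSV layout with the columns Source, Target and Weight is, to our understanding, what the spreadsheet import of Gephi expects for edge tables, but this should be checked against the Gephi version in use.

```python
import csv
import io


def build_edges(vocabulary, beta, min_weight=0.0):
    """Return (word, topic_label, weight) edges of the word-topic graph.

    beta[t][i] is the probability of vocabulary[i] under topic t.
    Edges whose weight does not exceed `min_weight` are skipped.
    """
    edges = []
    for t, distribution in enumerate(beta):
        for word, weight in zip(vocabulary, distribution):
            if weight > min_weight:
                edges.append((word, f"Topic{t}", weight))
    return edges


def edges_to_csv(edges):
    """Serialize the edges as CSV text with a Source,Target,Weight header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buffer.getvalue()
```

Dropping very weak edges before the layout is a stricter choice than only reducing their visibility in the drawing, since removed edges no longer exert any attraction.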

Figure 6: The first result of a force-directed graph of the LDA results. The 6 topics (white points) are placed on the outer circle, with words (dark points) oriented as lobes towards their affiliated topics. Forces (edges) are determined by the proportions defined in β(t). We reduced the visibility of weak connections.

Figure 7: The second result of a force-directed graph of the LDA results. The words (dark points) are organized around their most affiliated topics. Forces are determined by the proportions β(t). We reduced the visibility of weak connections.

The different arrangements result from different force


models. Figure 6 shows the ForceAtlas algorithm, figure 7 shows ForceAtlas 2. Both algorithms (described in (Jacomy et al., 2014)) are implemented for the open-source tool Gephi (v. 0.9.1) (Bastian et al., 2009), a software to visualize and manipulate networks. ForceAtlas 2 was developed by combining existing techniques as an improvement and extension of the ForceAtlas algorithm. It simulates a physical system with mutual repulsion of the nodes and attraction along their incident edges. The underlying energy model is inspired by real life: the forces between nodes depend on the distance between the interacting entities.

The position of a node cannot be interpreted on its own; it has to be compared to the other nodes in the graph drawing. In ForceAtlas 2, the distance, edge weight and degree play an important role in positioning the nodes. In the classical case, the attraction depends linearly on the distance:

\[ F(n_1, n_2) = d(n_1, n_2) \]

The edge weight influences the attraction multiplicatively:

\[ F(n_1, n_2) = w(e)^{\delta} \cdot d(n_1, n_2) \]

where w(e) is the weight of the edge e and δ is the edge weight influence parameter (δ = 0: weights are ignored; δ = 1: attraction is proportional to the weight; δ ≥ 2: weights have a strong influence). The node degree matters in the Dissuade Hubs mode:

\[ F(n_1, n_2) = \frac{d(n_1, n_2)}{\deg(n_1) + 1} \]

This grants authorities (nodes with a high in-degree) a more central position than hubs (nodes with a high out-degree). In ForceAtlas, hubs are pushed to the periphery and authorities are placed more centrally.
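The three attraction variants can be written as one small function. This is an illustrative sketch only: the function name and its parameters are assumptions, and combining the edge weight term with the degree division in a single expression is a choice made here for compactness, whereas the formulas above state the two modifications separately.

```python
def attraction_force(distance, weight=1.0, delta=1.0, source_degree=None):
    """Attraction between two connected nodes n1 and n2.

    distance      -- d(n1, n2)
    weight        -- edge weight w(e)
    delta         -- edge weight influence (0 ignores the weight)
    source_degree -- deg(n1); if given, the force is divided by deg(n1) + 1
                     as in the Dissuade Hubs variant
    """
    force = (weight ** delta) * distance
    if source_degree is not None:
        force /= source_degree + 1
    return force
```

For example, with `delta=0` the result equals the plain distance regardless of the weight, which matches the classical linear case.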

5 INTERPRETATION OF SYNTHESIZED FORCE-DIRECTED SEMANTIC NETWORKS

Looking into the interpretation of the results, the circular arrangement approximates the students' model from figure 2. The force graph builds a good visualization of lobes based on pure LDA data. In figure 6, most of the words lie in a topic's lobe. The distance to a topic node indicates the attraction to the corresponding topic. The bigger the radius, the stronger the affinity to other topics. At first glance, the interpretation of keywords and their level of affiliation to specific topics seems straightforward. However, figure 6 shows that words belonging to two non-adjacent topics (A and C) are likely to be pulled into the wrong topic's lobe (B). Furthermore, words oriented towards the center of the graph either have no strong affiliation to any of the topics or have a strong affiliation to oppositely oriented topics. These issues lead to misleading interpretations of affiliation.

In the case of the representation in figure 7, which combines aspects of Mindmaps and proximity, the thematic area nodes are positioned so as to avoid foreign keywords. Each word is assigned unambiguously to the topic with the strongest affiliation. Words in the inner circle have a high attraction to their topic; the larger the radius, the weaker the affinity. Words that fit almost equally into multiple thematic areas, as shown in figure 7 at A and B, are highlighted by their corresponding strong edges. The more equal the affiliation to both topics is, the more both topics share a keyword by pulling it out of the corresponding circles. This representation supports a clearer identification of thematic areas and corresponding terminal keywords.
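The assignment of each word to its most affiliated topic, together with the detection of words shared between topics, can be sketched as follows. The function name and the relative `tolerance` that decides when two affiliations count as almost equal are assumptions of this illustration; the text does not state a numeric criterion.

```python
def assign_words(vocabulary, beta, tolerance=0.2):
    """Map each word to (strongest_topic, shared_topics).

    beta[t][i] is the probability of vocabulary[i] under topic t.
    shared_topics lists all other topics whose value for the word is at
    least (1 - tolerance) times the value of the strongest topic.
    """
    assignment = {}
    for i, word in enumerate(vocabulary):
        scores = [distribution[i] for distribution in beta]
        best = max(range(len(scores)), key=scores.__getitem__)
        shared = [
            t for t, score in enumerate(scores)
            if t != best and score >= (1 - tolerance) * scores[best]
        ]
        assignment[word] = (best, shared)
    return assignment
```

Words with a non-empty list of shared topics correspond to the keywords that are pulled out of the topic circles in the drawing.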

6 CONCLUSION AND FURTHER WORK

In this paper we have synthesized a student's intuitive model from a non-annotated sequence of lecture slides. We visualized semantic structures using force-directed graph drawing algorithms.

The possible applications for teaching and learning are manifold. While the structure itself provides an overview of the semantic structure of a curriculum, it also serves as a semantic index on learning material. The proportional assignment of learning resources to thematic areas supports a semantic selection of appropriate learning material across different lectures.

As one aspect of further work, we will improve the visualization of these results. To limit complexity, we need to filter the view to show only currently relevant information and resources. Based on a student's currently marked content, the structure needs to be dynamically reorganized to accentuate relevant technical terms and inter-topic relations. Our goal is to automatically synthesize an assistive map showing the completed path of learning content as well as relevant relations between current and past learning resources.

Another aspect is described in (Wittrock, 1989): learning is a process of associating input stimuli to structures of internal knowledge. The structure synthesized in this work provides a step towards visualizing and organizing an associative grid out of non-annotated, sequential learning material. On the one


hand, it improves the structural awareness of students; on the other hand, the structure can be enhanced by students during the processing of stimuli perceived in the lecture. The observation of students' annotation of lecture content, described as coding in (Lee et al., 2008), enables an observation of students' processing of knowledge. Adding a semantic meta structure to non-annotated lecture material enables an observation of students' coding in connection with lecture topics and relations.

Our next operative goal is to reduce the number of words in our representation. We need to identify relevant parameters, such as average importance and inter-topic relations, and layout information, such as the size of text, to weight the identified technical terms by relevance. Further, we look into the time dynamics of these networks. Students and lecture structures develop over time. New keywords and relations appear during a curriculum. David Blei introduces dynamic topic models in (Blei and Lafferty, 2006). These concepts allow an observation of newly appearing topics and of the distribution of words over time. These changes lead to movements in the force graph over time, providing insights into a lecture's teaching process and, later, into students' learning process and progress.

Acknowledgement: We thank all the students of our course "Individual Knowledgemanagement" in the winter term 2016/2017 for their enthusiasm, their interest and their work in our course, and for allowing us to use their insights and graphs for future research.

REFERENCES

Bastian, M., Heymann, S., Jacomy, M., et al. (2009). Gephi: an open source software for exploring and manipulating networks. ICWSM, 8:361–362.

Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4):77.

Blei, D. M. and Lafferty, J. D. (2006). Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning, pages 113–120.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Cañas, A. J., Novak, J. D., and González, F. (2004). Concept maps: Theory, methodology, technology.

Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Supplement 1):5228–5235.

Jacomy, M., Venturini, T., Heymann, S., Bastian, M., and Muldoon, M. R. (2014). Forceatlas2, a continuous graph layout algorithm for handy network visualization designed for the gephi software. PLoS ONE, 9(6):e98679.

Lee, H. W., Lim, K. Y., and Grabowski, B. L. (2008). Generative learning: Principles and implications for making meaning. Handbook of research on educational communications and technology, 3:111–124.

lernpsychologie.net (2016). Lernpsychologie - Lernen und Gedächtnis einfach erklärt: https://perma.cc/uc9w-uzwp.

Marius, L., Moore, G., and Moore, G. (2008). ISO 13250-2: Topic Maps — Data Model: http://perma.cc/MPT4-MVLM.

Nicolay, R., Schwennigcke, B., Sahl, S., and Martens, A. (2016). Visualisierung konzeptuellen lernens durch semantische vernetzung sequenzieller lehrinhalte. In Mayr, H. C., Pinzger, M., and Gesellschaft für Informatik e. V. Bonn, Gesellschaft für Informatik e. V, editors, GI Edition Proceedings Band 259 INFORMATIK 2016, volume 259 of GI-Edition. Proceedings, pages 1009–1016.

Nicolay, R., Schwennigcke, B., Vetterick, J., Sucharowski, W., and H. Cap, C. (2015). Interlect - lecture content interface. In 7th International Conference on Computer Supported Education, pages 269–276.

Porter et al. (2015). Natural language toolkit - nltk 3.0 documentation: Stopwords corpus: https://perma.cc/f8l8-kc96.

Wallach, H. M., Mimno, D. M., and McCallum, A. (2009). Rethinking lda: Why priors matter. In Advances in neural information processing systems, pages 1973–1981.

Wittrock, M. C. (1989). Generative processes of comprehension. Educational Psychologist, 24(4):345–376.

Yoshiki Shibukawa (2015). Snowballstemmer 1.2.1: https://perma.cc/xpj6-jnmf.
