
Table 7: Top 7 words of topics by SenClu and TopClus for first 15 of 50 topics.
Topic | 20Newsgroups Dataset | New York Times Dataset
Method: SenClu
0 | hepatitis, biopsy, cph, chronic, hypoglycemia, pituitary, persistent | banquette, sauce, rum, cucumber, entree, menu, patronize
1 | infringe, participle, amendment, verb, indulge, infringing, constitution | tyson, boxing, heavyweight, bout, evander, knockout, holyfield
2 | pirating, protection, copy, cracked, pirated, cracker, disassemble | emission, dioxide, carbon, environmentalist, environmental, logging, landfill
3 | scsi, ide, drive, controller, bus, modem, mhz | japan, japanese, tokyo, nippon, mitsubishi, nomura, takeshita
4 | gld | prosecutor, trial, jury, defendant, judge, lawyer, juror
5 | doctor, medication, pain, hernia, diet, migraine, crohn | drug, patient, cancer, doctor, disease, health, dr
6 | satan, angel, heaven, enoch, god, eternal, poem | detective, police, arrested, stabbed, murder, arrest, graner
7 | wheelie, bike, aerobraking, landing, ride, bdi, riding | mir, astronaut, shuttle, nasa, module, atlantis, spacecraft
8 | window, graphic, microsoft, cica, adobe, rendering, shading | germany, german, deutsche, ackermann, frankfurt, dresdner, daimler
9 | solvent, bakelite, phenolic, wax, drying, adhesive, soldering | rate, economist, index, nikkei, bond, inflation, economy
10 | ei, ax, mq, pl, max, lj, gk | bedroom, apartment, bath, building, square, developer, ft
11 | xterm, motif, widget, server, mit, sunos, window | kerry, bush, mccain, clinton, presidential, president, poll
12 | israel, israeli, arab, palestinian, lebanese, palestine, gaza | cloning, gene, chromosome, genetic, cloned
13 | antenna, frequency, transmitter, radio, receiver, detector, khz | ounce, bullion, dollar, cent, mercantile, settled, crude
14 | airmail, mcwilliams, mcelwaine, dublin, expiration, dftsrv, albert | editor, circulation, magazine, reader, tabloid, publishing, journalism
Method: TopClus
0 | please, thanks, thank, appreciate, sorry, appreciated, gladly | student, educator, grader, pupil, teenager, adolescent, school
1 | saint, biblical, messiah, missionary, apostle, church, evangelist | surname, mustache, syllable, corps, sob, nickname, forehead
2 | iranian, korean, hut, child, algeria, vegetable, lebanese | participation, involvement, effectiveness, supremacy, prowess, responsibility
3 | considerable, tremendous, immense, plenty, countless, immensely, various | garage, dwelling, viaduct, hotel, residence, bungalow, building
4 | expression, phrase, symbol, terminology, prefix, meaning, coordinate | clit, lough, bros, kunst, mcc, quay, lund
5 | memoir, publication, hardcover, encyclopedia, bibliography, paperback | moth, taxa, una, imp, null, def, une
6 | anyone, somebody, anybody, someone, anything, everybody, something | many, everybody, anything, everyone, several, much, dozen
7 | individual, people, populace, human, being, inhabitant, peer | mister, iraqi, hussein, iraq, iranian, iran, kurdish
8 | disturbance, difficulty, complication, danger, annoyance, susceptible, problem | iraqi, iraq, baghdad, saddam, hussein, kuwait, iran
9 | beforehand, time, sooner, moment, waist, farther, halfway | dilemma, uncertainty, agitation, reality, dissatisfaction, implication, disagre.
10 | upgrade, availability, replacement, sale, modification, repository, compatibility | nominate, terminate, establish, stimulate, locate, replace, protect
11 | buy, get, install, spend, sell, keep, build | withstand, hesitate, imagine, explain, apologize, happen, translate
12 | appropriated, reverted, wore, abolished, rescued, exercised, poured | forefront, accordance, extent, instance, way, precedence, behalf
13 | government, diplomat, fbi, ceo, parliament, officer, parliamentary | privy, continual, outstretched, purposely, systematically, unused, unfinished
14 | graduation, university, rural, upstairs, overseas, basement, undergraduate | cautious, goofy, arrogant, painful, cocky, hasty, risky
5.2 Qualitative Evaluation
Table 7 shows the top words for the first 15 topics, comparing SenClu
against TopClus, the method that performed best according to our
evaluation and prior work (Meng et al., 2022). We did not attempt to
label topics, since during actual topic modeling users face exactly
such outputs. However, for the 20Newsgroups dataset we list the ground
truth classes (see Table 2) to support understanding of the topics and
the dataset. Our assignment shows that TopClus suffers to some extent
from the same issue as LDA: common words may form topics that carry
little meaning and must be eliminated. For example, topics 0, 3, and 6
of the 20Newsgroups dataset consist of frequent non-topical words,
while topics 4, 8, and 9, for instance, are hard to assign to any
subject. This is a common issue (also observed for LDA) and is rooted
in the bag-of-words assumption.
Other topics are well interpretable, e.g., topic 1 can be easily
associated with the ground truth label ‘religion’ (see Table 2 for
ground truth labels) and topic 11 with ‘forsale’. For SenClu, most
topics are easy to interpret, e.g., Topics 0 and 5 discuss medical
subjects and Topic 3 hardware. However, SenClu also yields a few topics
that make limited sense. For example, Topic 4 consists of just one
token, and Topic 10 is also hard to interpret.
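The two failure modes noted above (topics built from frequent non-topical words, and topics with almost no tokens) can be screened for automatically. The following is a minimal sketch of one possible heuristic, not a procedure used in this paper: the function names, the document-frequency criterion, and both thresholds (`df_threshold`, `min_words`) are illustrative assumptions.

```python
from collections import Counter


def document_frequency(docs):
    """Fraction of documents in which each word occurs at least once."""
    n = len(docs)
    if n == 0:
        return {}
    counts = Counter()
    for doc in docs:
        counts.update(set(doc))  # count each word once per document
    return {word: c / n for word, c in counts.items()}


def flag_uninformative_topics(topics, docs, df_threshold=0.5, min_words=2):
    """Return indices of topics that look non-topical under a simple heuristic.

    A topic is flagged if it has fewer than `min_words` top words, or if the
    mean document frequency of its top words is at least `df_threshold`.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    df = document_frequency(docs)
    flagged = []
    for idx, words in enumerate(topics):
        if len(words) < min_words:
            flagged.append(idx)
            continue
        mean_df = sum(df.get(w, 0.0) for w in words) / len(words)
        if mean_df >= df_threshold:
            flagged.append(idx)
    return flagged


# Toy corpus (tokenized documents) and toy topics, loosely modeled on Table 7.
docs = [
    ["please", "thanks", "scsi", "drive"],
    ["thanks", "anyone", "please", "doctor"],
    ["please", "anyone", "thanks", "modem"],
    ["anyone", "please", "pain", "thanks"],
]
topics = [
    ["please", "thanks", "anyone"],  # frequent, non-topical words
    ["scsi", "drive", "modem"],      # hardware
    ["gld"],                         # single-token topic
]
print(flag_uninformative_topics(topics, docs))  # [0, 2]
```

On real data the thresholds would need tuning per corpus, and a flagged topic should still be inspected manually before it is discarded.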
5.3 Overall Evaluation
Table 8 summarizes the comparison of all methods, covering both the
quantitative evaluation and the functionality offered by the evaluated
methods. Though LDA is very fast and conceptually elegant, it suffers
in terms of topic quality, which is the most important aspect of a
topic model. Therefore, it is not the method of choice compared to
methods relying on pretrained contextual embeddings. This finding is
aligned with prior work (Meng et al., 2022; Grootendorst, 2022).
BERTopic is very fast, but its topic quality is often not among the
best, and it treats each document as having just one topic. This
contradicts the key idea of topic models that documents can have
multiple topics. It is especially problematic for long, diverse texts,
where this is almost certainly the case. TopClus yields high-quality
topics but suffers from chal-
ICAART 2024 - 16th International Conference on Agents and Artificial Intelligence