Generating word clouds from social streams is a difficult
task because users often discuss the same entity using
multiple aliases. This leads to a direct degradation in
the utility of word clouds for accessing this complex
source of data. We proposed a technique that groups
aliases of the same entity and represents them with
a canonical term. The method improves the coverage
of word clouds and access to the relevant content.
Due to the imperfect nature of state-of-the-art named
entity recognition methods, redundancy of terms in
word clouds is often increased. Therefore, it is necessary
to apply a method for diversifying terms. In
this work, we found that the proposed technique not
only significantly decreased redundancy but also attained
significantly higher coverage than the baseline
word cloud generation method, leading to better word
clouds and therefore improved information access.
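As an illustration of the grouping idea summarized above, the following minimal sketch merges the frequencies of aliases under one canonical term before the cloud is built. The alias map, the function name `group_aliases`, and the example terms are hypothetical and chosen only for illustration; this is not the paper's implementation, which derives the groupings from named entity recognition output.

```python
from collections import Counter


def group_aliases(term_counts, alias_to_canonical):
    """Sum term frequencies under a canonical term.

    Terms without an entry in the (hypothetical) alias map are kept
    as their own lowercased form.
    """
    grouped = Counter()
    for term, count in term_counts.items():
        key = term.lower()
        canonical = alias_to_canonical.get(key, key)
        grouped[canonical] += count
    return grouped


# Illustrative input: three aliases of one entity plus an unrelated term.
term_counts = {"Obama": 5, "Barack Obama": 3, "POTUS": 2, "election": 4}
alias_map = {"obama": "barack obama", "potus": "barack obama"}

cloud = group_aliases(term_counts, alias_map)
```

With the aliases merged, one canonical term occupies a single slot in the cloud, which frees the remaining slots for other terms.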
An extrinsic user evaluation supported our hypothesis
that word clouds with grouped named entities
are significantly more relevant and diverse than
word clouds with no entity grouping. Further, word
clouds with grouped named entities that attain higher
levels of MAP are more likely to be rated as relevant
by users.
Finally, it was shown that the previously proposed
MAP metric for automatic cloud evaluation predicts
extrinsic human evaluations of cloud quality. Thus,
when designing word clouds, the MAP metric should
be used as a quality predictor of the cloud generation
technique, enabling automatic assessment of word
cloud quality without a human in the loop.
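For reference, the sketch below computes mean average precision in its standard information retrieval form over ranked lists of binary relevance judgments. It assumes that each cloud term yields one ranked result list and that average precision is normalized by the number of relevant items retrieved; the exact formulation of the cloud-specific MAP metric may differ, and the function names are illustrative.

```python
def average_precision(ranked_relevance):
    """Average precision of one ranked list of 0/1 relevance labels.

    Assumption: normalized by the number of relevant items that appear
    in the list, not by the total number of relevant items overall.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the average precision values over all ranked lists."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Illustrative input: one ranked result list per cloud term.
rankings = [[1, 0, 1], [0, 1]]
map_score = mean_average_precision(rankings)
```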
This work was partially supported by the European
Union under grant agreement No. 611233 PHEME.
