passing the Word and Topic information into the Topic Indexer. For our purposes, we assume the topic is known.
2. Query Processing Workflow. The query processor sends an auto-completion query to the Topic Indexer, and attaches the additional Topic information. Again, we assume that the topic is known.
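A minimal sketch of the two workflows is shown below, assuming a hypothetical TopicIndexer interface (the method names index and complete are our assumption, not the paper's):

class TopicIndexer:
    # Hypothetical interface; names are illustrative, not the paper's API.
    def __init__(self):
        self.words_by_topic = {}   # topic -> set of indexed words

    def index(self, word, doc_id, topic):
        # 1. Building workflow: the Builder passes Word and Topic
        #    information (doc_id is omitted from this toy storage).
        self.words_by_topic.setdefault(topic, set()).add(word.lower())

    def complete(self, prefix, topic):
        # 2. Query workflow: the query carries the (known) Topic.
        return sorted(w for w in self.words_by_topic.get(topic, ())
                      if w.startswith(prefix.lower()))

indexer = TopicIndexer()
indexer.index("facade", doc_id=1, topic="Architecture")
print(indexer.complete("fac", topic="Architecture"))  # ['facade']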
The main constraint of such an approach is that the system needs enough input data to identify the topic before it can make predictions. Therefore, we consider as input data long documents (e.g., e-mails, essays, reports), not short in-line messages. We also assume that the topics are already identified for each document, so the problem of topic identification is not covered in this paper. The main focus remains on presenting solutions for the Topic Model and Indexer, and on how the Builder and Query Processor communicate with them. The part of the Topic Model Overview (Diagram 1) that we cover in this paper is the Topic Autocompletion Workflow.
The issues we identified with single-model solutions are (1) topic interference, (2) large models, and (3) slow query processing. To address these issues, we present the following methods of storing topic information:
• have one model for all topics, and include a list of topics for each gram (one-for-all). Although this method does not directly address problems 2 and 3, it aims to reduce topic interference without completely separating the model, since words from different topics are largely shared. The obvious disadvantage is that there is still a single model that stores all words, regardless of their topic. The advantage is that each word is stored only once, even if it appears in multiple topics.
• have a separate model for each topic (one-for-each). This method provides a more radical solution to all of the issues above. It splits the model into completely separate sub-models, and answers queries using only the sub-model corresponding to the topic at hand. The disadvantage arises when words are shared between multiple topics: if most of the words in the data set are shared, this method stores every shared word in each sub-model, resulting in extra copies of the same word that ultimately increase the overall size required to store all of the sub-models. Both layouts are sketched below.
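A minimal sketch of the two storage layouts, using Python dicts as stand-ins for the actual index structures (the word design, the document ids, and the positions are illustrative, not taken from the paper):

# one-for-all: a single model; each gram carries its topic list.
one_for_all = {
    "design": {
        "postings": {1: [7], 2: [5]},            # doc id -> positions
        "topics": ["architecture", "software"],  # topics the gram appears in
    },
}

# one-for-each: one sub-model per topic; a shared word such as
# "design" is stored once per topic it appears in.
one_for_each = {
    "architecture": {"design": {"postings": {1: [7]}}},
    "software":     {"design": {"postings": {2: [5]}}},
}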
To avoid confusion, in the following sections we refer to the overall data model as the super-model, and to the separate topic-specific models as sub-models. Moreover, throughout this paper we consider that each document has one topic; when a document has multiple topics, we consider only the most relevant one.
3.1 One-For-All Topic Model
The first idea that comes to mind is to create a super-model that stores topic information for each gram. The question that remains is how to store this information inside the super-model. One option is to insert, for each gram, a list of the topics it appears in. For example, in the case of an inverted index data model, a list of topics in which the word appears is stored together with the postings list. An entry from the User Oriented index containing this extra information looks like this: market: {postings: {1: [1], 3: [1], 101: [2]}, topics: {food, automobile, shopping}}.
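As a sketch of how the Query Processor could exploit this layout (our own illustration, not code from the paper), suggestions can be filtered by checking the topic list before consulting the postings:

index = {
    "market": {"postings": {1: [1], 3: [1], 101: [2]},
               "topics": {"food", "automobile", "shopping"}},
    "margin": {"postings": {7: [4]}, "topics": {"finance"}},
}

def complete(prefix, topic):
    # Return grams that match the prefix AND are tagged with the topic.
    return [gram for gram, entry in index.items()
            if gram.startswith(prefix) and topic in entry["topics"]]

print(complete("mar", "food"))     # ['market']
print(complete("mar", "finance"))  # ['margin']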
The obvious issue with this solution is size: the super-model grows considerably due to the topic information. Consider a word like how. Since it most probably appears in all topics, it has a large topic list that affects the running performance of the system. To address this problem, we propose to generalize the idea concerning user document ids presented in (Prisca '15) by applying it to topics. This results in a new Topic Model: the Topic Oriented Index.
To achieve this, we define a mapping between each topic identified in the data set and a range of document ids. For this mapping, any common data structure can be used, but it is important for it to be a singleton: once topics are mapped to a certain id range, the same mapping is used throughout the whole system. Moreover, since we cannot always have all topics identified, there needs to be an id range dedicated to uncategorized documents. We refer to this as the Uncategorized id range, and propose that it takes values at the end of the id domain (i.e., [lastKnownTopicId, ∞), where lastKnownTopicId is the last id assigned to a known topic). In case new topics appear in the data set, one only has to shrink the Uncategorized id range and allocate some of its ids to the new topic.
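A minimal sketch of such a singleton mapping, assuming fixed-size ranges (the range size of 1000 is our assumption; the paper does not prescribe one):

RANGE_SIZE = 1000

class TopicIdRanges:
    # Singleton: the same topic -> id-range mapping is shared system-wide.
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.ranges = {}      # topic -> (start, end)
            cls._instance.next_start = 0   # start of the Uncategorized range
        return cls._instance

    def range_for(self, topic):
        # New topics are carved out of the front of the Uncategorized range.
        if topic not in self.ranges:
            start = self.next_start
            self.ranges[topic] = (start, start + RANGE_SIZE)
            self.next_start += RANGE_SIZE
        return self.ranges[topic]

    def uncategorized_start(self):
        # Ids in [lastKnownTopicId, infinity) belong to Uncategorized documents.
        return self.next_start

mapping = TopicIdRanges()
print(mapping.range_for("architecture"))  # (0, 1000)
print(mapping.range_for("software"))      # (1000, 2000)
print(mapping.uncategorized_start())      # 2000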
As an example, let us consider that we have a data set consisting of the following documents, split by topic:
• doc1, Topic:Architecture: This building has an old Gothic facade design, made by a famous architect.
• doc2, Topic:Software: Our system architect chose the Facade design for this particular problem.
• doc3, Topic:Mathematics: The problem can be represented in an equation system.
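Continuing this example, a hypothetical assignment of id ranges to the three topics could look as follows (the boundaries are our assumption, chosen only for illustration):

topic_ranges = {
    "Architecture": (0, 1000),     # doc1 receives an id in [0, 1000)
    "Software":     (1000, 2000),  # doc2 receives an id in [1000, 2000)
    "Mathematics":  (2000, 3000),  # doc3 receives an id in [2000, 3000)
}
# Ids from 3000 upward remain reserved for Uncategorized documents.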