passing the Word and Topic information into the Topic Indexer. For our purposes, we assume the topic is known.
2. Query Processing Workflow. The query processor sends an auto-completion query to the Topic Indexer, and attaches the additional Topic information. Again, we assume that the topic is known.
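A minimal sketch of the two workflows is shown below, assuming a hypothetical TopicIndexer interface (the method names index and complete are our assumption, not the paper's):

class TopicIndexer:
    # Hypothetical interface; names are illustrative, not the paper's API.
    def __init__(self):
        self.words_by_topic = {}   # topic -> set of indexed words

    def index(self, word, doc_id, topic):
        # 1. Building workflow: the Builder passes Word and Topic
        #    information (doc_id is omitted from this toy storage).
        self.words_by_topic.setdefault(topic, set()).add(word.lower())

    def complete(self, prefix, topic):
        # 2. Query workflow: the query carries the (known) Topic.
        return sorted(w for w in self.words_by_topic.get(topic, ())
                      if w.startswith(prefix.lower()))

indexer = TopicIndexer()
indexer.index("facade", doc_id=1, topic="Architecture")
print(indexer.complete("fac", topic="Architecture"))  # ['facade']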
The main constraint of such an approach is that the system needs enough input data to identify the topic before it can make predictions. Therefore, we consider as input data long documents (e.g., e-mails, essays, reports), not short in-line messages. We also assume that the topics are already identified for each document, so the problem of topic identification is not covered in this paper. The main focus remains on presenting solutions for the Topic Model and Indexer, and on how the Builder and Query Processor communicate with them. The part of the Topic Model Overview (Diagram 1) that we cover in this paper is the Topic Autocompletion Workflow.
The issues we identified with single-model solutions are (1) topic interference, (2) large models, and (3) slow query processing. To address these issues, we present the following methods of storing topic information:
• have one model for all topics, and include a list of topics for each gram (one-for-all). Although this method does not directly address problems 2 and 3, it aims to reduce topic interference without completely separating the model, since words from different topics are largely shared. The obvious disadvantage is that there is still a single model that stores all words, regardless of their topic. The advantage is that each word is stored only once, even if it appears in multiple topics.
• have a separate model for each topic (one-for-each). This method provides a more radical solution to all of the issues above. It splits the model into completely separate sub-models, and answers queries using only the sub-model corresponding to the topic at hand. The disadvantage arises when words are shared between multiple topics: if most of the words in the data set are shared, this method stores every shared word in each sub-model, resulting in extra copies of the same word that ultimately increase the overall size required to store all of the sub-models. Both layouts are sketched below.
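A minimal sketch of the two storage layouts, using Python dicts as stand-ins for the actual index structures (the word design, the document ids, and the positions are illustrative, not taken from the paper):

# one-for-all: a single model; each gram carries its topic list.
one_for_all = {
    "design": {
        "postings": {1: [7], 2: [5]},            # doc id -> positions
        "topics": ["architecture", "software"],  # topics the gram appears in
    },
}

# one-for-each: one sub-model per topic; a shared word such as
# "design" is stored once per topic it appears in.
one_for_each = {
    "architecture": {"design": {"postings": {1: [7]}}},
    "software":     {"design": {"postings": {2: [5]}}},
}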
To avoid confusion, in the following sections we refer to the overall data model as the super-model, and to the separate topic-specific models as sub-models. Moreover, throughout this paper we consider that each document has one topic; when a document has multiple topics, we consider only the most relevant one.
3.1 One-For-All Topic Model
The first idea that comes to mind is to create a super-model that stores topic information for each gram. The question that remains is how to store this information inside the super-model. One option is to insert, for each gram, a list of the topics it appears in. For example, in the case of an inverted index data model, a list of topics in which the word appears is stored together with the postings list. An entry from the User Oriented index containing this extra information looks like this: market: {postings: {1: [1], 3: [1], 101: [2]}, topics: {food, automobile, shopping}}.
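As a sketch of how the Query Processor could exploit this layout (our own illustration, not code from the paper), suggestions can be filtered by checking the topic list before consulting the postings:

index = {
    "market": {"postings": {1: [1], 3: [1], 101: [2]},
               "topics": {"food", "automobile", "shopping"}},
    "margin": {"postings": {7: [4]}, "topics": {"finance"}},
}

def complete(prefix, topic):
    # Return grams that match the prefix AND are tagged with the topic.
    return [gram for gram, entry in index.items()
            if gram.startswith(prefix) and topic in entry["topics"]]

print(complete("mar", "food"))     # ['market']
print(complete("mar", "finance"))  # ['margin']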
The obvious issue with this solution is size: the super-model grows considerably due to the topic information. Consider a word like how. Since it most probably appears in all topics, it has a large topic list that affects the running performance of the system. To address this problem, we propose to generalize the idea concerning user document ids presented in (Prisca '15) by applying it to topics. This results in a new Topic Model: the Topic Oriented Index.
To achieve this, we define a mapping between each topic identified in the data set and a range of document ids. For this mapping, any common data structure can be used, but it is important for it to be a singleton: once topics are mapped to a certain id range, the same mapping is used throughout the whole system. Moreover, since we cannot always have all topics identified, there needs to be an id range dedicated to uncategorized documents. We refer to this as the Uncategorized id range, and propose that it takes values at the end of the id domain (i.e., [lastKnownTopicId, ∞), where lastKnownTopicId is the last id assigned to a known topic). In case new topics appear in the data set, one only has to shrink the Uncategorized id range and allocate some of its ids to the new topic.
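A minimal sketch of such a singleton mapping, assuming fixed-size ranges (the range size of 1000 is our assumption; the paper does not prescribe one):

RANGE_SIZE = 1000

class TopicIdRanges:
    # Singleton: the same topic -> id-range mapping is shared system-wide.
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.ranges = {}      # topic -> (start, end)
            cls._instance.next_start = 0   # start of the Uncategorized range
        return cls._instance

    def range_for(self, topic):
        # New topics are carved out of the front of the Uncategorized range.
        if topic not in self.ranges:
            start = self.next_start
            self.ranges[topic] = (start, start + RANGE_SIZE)
            self.next_start += RANGE_SIZE
        return self.ranges[topic]

    def uncategorized_start(self):
        # Ids in [lastKnownTopicId, infinity) belong to Uncategorized documents.
        return self.next_start

mapping = TopicIdRanges()
print(mapping.range_for("architecture"))  # (0, 1000)
print(mapping.range_for("software"))      # (1000, 2000)
print(mapping.uncategorized_start())      # 2000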
As an example, let us consider that we have a data set consisting of the following documents, split by topic:
• doc1, Topic:Architecture: This building has an old Gothic facade design, made by a famous architect.
• doc2, Topic:Software: Our system architect chose the Facade design for this particular problem.
• doc3, Topic:Mathematics: The problem can be represented in an equation system.
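Continuing this example, a hypothetical assignment of id ranges to the three topics could look as follows (the boundaries are our assumption, chosen only for illustration):

topic_ranges = {
    "Architecture": (0, 1000),     # doc1 receives an id in [0, 1000)
    "Software":     (1000, 2000),  # doc2 receives an id in [1000, 2000)
    "Mathematics":  (2000, 3000),  # doc3 receives an id in [2000, 3000)
}
# Ids from 3000 upward remain reserved for Uncategorized documents.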