3. Wait for the user to implement one or several suggestions
Algorithm 2
Input :
• a system of categories
• a document base indexed with those categories
• a set of quality measures (balance, cost, redundancy)
associated with the indexed document base
Output : a ranked list of modifications
User : person in charge of indexing the document base
Method :
1. If the balance is low, add
• splitting suggestions for the biggest categories
• merging suggestions for some small categories
2. If the browsing cost is high (i.e. the granularity of the
categories should be refined), add
• splitting suggestions for the biggest categories
• a complex restructuring suggestion (e.g. one may
decide to add new categories that are orthogonal to
the existing ones)
3. If the category access cost is high, add a complex suggestion
(one option is to manually turn the system into
a multi-categorial or hierarchical one)
4. Perform the redundancy analysis:
• if there are generic categories that include many others,
add suppressing suggestions for them
• if there are pairs of overlapping categories, add
merging suggestions for them
• if there are categories included in others, add
– suppressing suggestions for the smaller ones
– a complex restructuring suggestion (e.g. partition
the big categories into smaller ones so as to form a
hierarchy)
5. Rank the suggestion list according to
• the internal complexity of the modification type (e.g.
splitting one category is much simpler than restructuring
the whole document base)
• the expected impact of the modification in terms of
balance, cost and redundancy
• the restructuring effort needed, computed as the
number of documents that must be reclassified
• the internal priority of the categories to be modified
(this depends on their age and activity rate⁹).
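The method above can be sketched in code. The following is a minimal illustration, not the implementation used in this work: the `Suggestion` record, the threshold names in `thr`, the complexity ranks and the lexicographic ranking order are all assumptions, and step 4 is reduced to inclusion and overlap tests on document sets.

```python
from dataclasses import dataclass
from itertools import combinations

# Assumed complexity ranks (lower = simpler for the indexer).
COMPLEXITY = {"split": 1, "merge": 1, "suppress": 1, "restructure": 3}


@dataclass(frozen=True)
class Suggestion:
    kind: str               # key of COMPLEXITY
    targets: tuple = ()     # categories concerned, or a free-text hint
    effort: int = 0         # documents to reclassify
    impact: float = 0.0     # expected gain (assumed scale)
    priority: float = 0.0   # from category age and activity (assumed scale)


def suggest(docs, measures, thr):
    """docs maps each category name to the set of document ids it indexes."""
    sizes = {c: len(d) for c, d in docs.items()}
    cap = thr["size_factor"] * sum(sizes.values()) / len(sizes)
    splits = [Suggestion("split", (c,), effort=int(sizes[c] - cap))
              for c in sizes if sizes[c] > cap]
    small = tuple(c for c in sizes if sizes[c] < thr["small_size"])
    out = []
    if measures["balance"] < thr["balance"]:              # step 1
        out += splits
        if len(small) > 1:
            out.append(Suggestion("merge", small,
                                  effort=sum(sizes[c] for c in small)))
    if measures["browsing_cost"] > thr["browsing_cost"]:  # step 2
        out += splits
        out.append(Suggestion("restructure", ("add orthogonal categories",)))
    if measures["access_cost"] > thr["access_cost"]:      # step 3
        out.append(Suggestion("restructure",
                              ("multi-categorial or hierarchical",)))
    for a, b in combinations(docs, 2):                    # step 4 (partial)
        if sizes[a] > sizes[b]:
            a, b = b, a                                   # a is now the smaller one
        if not docs[a]:
            continue                                      # ignore empty categories
        shared = len(docs[a] & docs[b])
        if shared == sizes[a]:                            # a is included in b
            out.append(Suggestion("suppress", (a,)))
        elif shared / len(docs[a] | docs[b]) > thr["overlap"]:
            out.append(Suggestion("merge", (a, b), effort=sizes[a]))
    return list(dict.fromkeys(out))                       # drop duplicates, keep order


def rank(suggestions):
    """Step 5, assuming a lexicographic combination of the four criteria."""
    return sorted(suggestions,
                  key=lambda s: (COMPLEXITY[s.kind], -s.impact, s.effort, -s.priority))
```

How the four ranking criteria are actually weighted is not specified here; the lexicographic order is only one simple way to combine them.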
Two types of suggestions are made. The simple
operations are supported by the system, which proposes
the categories to merge or split, even if the indexers
still have to decide how to reclassify their documents.
The complex operations are left to the indexers: the
system simply points out the need for restructuring.
⁹ Large categories deserve attention if they are still very
active and keep growing, whereas small categories that no
longer seem active are good candidates for merging.
The algorithm has several thresholds, which have
been fixed on an intuitive basis so far. As future work,
we plan to start with arbitrarily high thresholds (to
maximize the number of suggestions and limit false
negatives) and tune them incrementally as the indexer
uses the system and accepts or rejects the suggestions.
5.2 Index Analysis and Restructuring
This section shows the impact of the restructuring
algorithm on some of the blogs of our corpus.
Correcting the Balance. jeuxvideo3 (Table 2) is
a typical dynamic mono-categorial blog. The number
of categories is multiplied by 7 and the cost by 7.5, but
the cost remains close to the optimum (reference cost).
Improving the balance takes priority as it declines
significantly, from 0.86 to 0.76. The mean size
of the 91 categories is 60 posts, but five categories exceed 4
times this size. The system first suggests splitting some
of these big categories, starting with the one that
contains 15% of the posts. Formally, to limit all categories
to 4 times the average size, the indexer has to
re-categorize 1240 posts and create 5 new categories.
The balance should improve to 0.832. The total cost
is expected to increase slightly (up to 153).
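The effort behind such a splitting suggestion can be estimated with a short computation. The sketch below is illustrative only: `split_effort` is a hypothetical helper, it assumes the size cap is computed from the mean before any split, and it assumes each oversized category is split independently. With a mean of 60 posts, the cap of 4 times the average is 240 posts.

```python
import math


def split_effort(sizes, factor=4):
    """Return (posts to re-categorize, new categories to create) so that
    no category exceeds `factor` times the current mean size."""
    cap = factor * sum(sizes) / len(sizes)
    moved = 0
    created = 0
    for n in sizes:
        if n > cap:
            excess = n - cap
            moved += math.ceil(excess)          # posts leaving the big category
            created += math.ceil(excess / cap)  # new categories to hold them
    return moved, created


# Hypothetical blog: one dominant category among ten.
print(split_effort([500, 100] + [50] * 8))
```

On this toy input the mean is 100 posts, so the cap is 400 and only the first category needs to be split.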
A second suggestion is to merge the 31
categories with very few posts into one “miscellaneous”
category. This requires reclassifying 118 posts
but reduces the number of categories to 66, increases
the balance to 0.890 and reduces the cost to 149.12.
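The category count of 66 is consistent with applying the merge after the split: 91 + 5 = 96 categories, minus the 31 merged ones, plus the new one. The bookkeeping of such a merge can be sketched as follows; `merge_small` and its `max_posts` cut-off are hypothetical, since the exact criterion for "very few posts" is not given here.

```python
def merge_small(sizes, max_posts, name="miscellaneous"):
    """Merge every category with at most `max_posts` posts into `name`.

    Returns the new size table and the number of posts to reclassify."""
    small = {c: n for c, n in sizes.items() if n <= max_posts}
    if len(small) < 2:
        return dict(sizes), 0                    # nothing worth merging
    merged = {c: n for c, n in sizes.items() if c not in small}
    moved = sum(small.values())
    merged[name] = merged.get(name, 0) + moved   # reuse `name` if it already exists
    return merged, moved


# Hypothetical size table with two tiny categories.
table, moved = merge_small({"a": 100, "b": 2, "c": 3, "d": 50}, max_posts=5)
print(len(table), moved)
```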
Note that the indexer may reach the same result
in a different way. However, if local improvements
of the balance become too intricate, the only possible
escape is to restructure the indexing system as a
multi-categorial one. In that case, the restructuring cost is
difficult to estimate.
Reducing the Access Cost. technologie2 is a
small multi-categorial blog, reaching 243 posts after
eight years of activity (Table 2).
The balance is good, constantly near or above
0.9, but the access cost is always more than twice the
reference cost, with the querying cost accounting for
almost 90% of the total cost. Not only is the number
of basic categories high, but there are twice as many
compound categories and the multi-annotation is not
uniform (in 2015, 45% of the posts are associated with
a single category, whereas 25% have 3 to 5).
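The figures quoted in this diagnosis can be derived from the raw annotations. The sketch below is a minimal illustration: `annotation_profile` is a hypothetical helper, and it assumes that a compound category is a distinct combination of two or more basic categories attached to the same post.

```python
from collections import Counter


def annotation_profile(posts):
    """posts: one set of basic categories per post.

    Returns (share of posts per number of categories,
             number of basic categories,
             number of compound categories)."""
    per_post = Counter(len(cats) for cats in posts)
    shares = {k: v / len(posts) for k, v in sorted(per_post.items())}
    basic = set().union(*posts)
    compound = {frozenset(cats) for cats in posts if len(cats) > 1}
    return shares, len(basic), len(compound)


# Hypothetical annotations for four posts.
print(annotation_profile([{"a"}, {"a", "b"}, {"b", "a"}, {"c"}]))
```

A large number of compound categories relative to basic ones, combined with a skewed share distribution, is the kind of pattern described in the paragraph above.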
The algorithm first suggests reducing the number
of categories and multi-categories. One simple proposal
would be to delete the domotique category, which
is uninformative (it is the biggest category with 1/4th
Controlling the Drift of Semantic Indexing Systems