involved in finding frequent itemsets for the database of Figure 1 using the joinless
apriori algorithm. Figure 4 presents the joinless apriori algorithm.
3 Data and Simulation
Our presentation of both the classical and joinless apriori algorithms so far has
assumed that the objective is to find association rules that describe the correlations
among all items in the transaction database. But there are situations where the mining
objective is to find the relationships between items in a transaction and only one or a
few other items in the transaction. In our research, for example, we are interested in
finding the relationship between user navigation behaviors in hypermedia (as
evidenced by the set of navigation pages that users visit, which form the antecedents
of mined association rules) and the Web pages they are presumed to be interested in
(i.e., content pages, which form the rule consequents). This information gives us the
basis for building models for making Web page recommendations to users. For the
rest of this paper, we refer to a database with well-defined consequents (or content
pages) as c-annotated (i.e., consequent-annotated). The classification of Web pages
into navigation and content pages is discussed in [8-12].
The procedure to obtain frequent itemsets for a c-annotated transactions database
is identical to the case for a general transactions database, except that: (1) one of the
items in each transaction is annotated as the consequent of the association rules for
that transaction, and (2) candidate and frequent itemsets are generated only for the
non-annotated items of the transactions. For example, Figure 5 shows a c-annotated
transactions database, and Figure 6 shows the process of obtaining frequent itemsets
for that database.
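As a rough illustration of the two differences listed above, the following Python sketch counts itemsets over the non-annotated items only, while the annotated consequent is carried alongside each transaction and never enters an itemset. It uses brute-force subset enumeration rather than either apriori variant, and the function name, parameter names, and sample data are hypothetical; they are not taken from Figure 5 or Figure 6.

```python
from collections import Counter
from itertools import combinations


def frequent_antecedent_itemsets(transactions, min_support_count):
    """Count itemsets drawn only from the non-annotated items.

    `transactions` is a list of (items, consequent) pairs, where `items`
    holds the non-annotated items and `consequent` is the annotated item.
    Brute-force enumeration for illustration; this is NOT the joinless
    apriori algorithm.
    """
    counts = Counter()
    for items, _consequent in transactions:
        unique_items = sorted(set(items))
        # Enumerate every non-empty subset of the non-annotated items.
        for size in range(1, len(unique_items) + 1):
            for itemset in combinations(unique_items, size):
                counts[itemset] += 1
    # Keep only itemsets that meet the minimum support count.
    return {s: c for s, c in counts.items() if c >= min_support_count}


# Hypothetical c-annotated database: navigation pages -> content page.
sample_db = [
    (["A", "B"], "X"),
    (["A", "B", "C"], "X"),
    (["A", "C"], "Y"),
]
frequent = frequent_antecedent_itemsets(sample_db, min_support_count=2)
```

With this sample data, the consequents X and Y never appear in any counted itemset, which is the effect of condition (2).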
For the purpose of analyzing the performance of the joinless apriori algorithm, we
used web server logs, but it should be emphasized that the source of the data is
immaterial to the performance of the algorithm. The data comprised the user access
log for the web site of the School of Information Sciences, University of Pittsburgh
for the months of June to August 2004. The raw data were collected in the Common
Log Format [13] and totaled about 500 MB.
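For readers unfamiliar with the log format, the sketch below parses a single Common Log Format line into its fields. The regular expression, the function name `parse_clf_line`, and the example line are our own illustrative assumptions; this is not the preprocessing code used in the study.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    # A size of "-" means no bytes were returned.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


# Hypothetical log line, for illustration only.
example = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
           '"GET /apache_pb.gif HTTP/1.0" 200 2326')
record = parse_clf_line(example)
```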
We followed the heuristics presented in [8, 9, 12, 14, 15] to extract user sessions
from the server logs, and the maximal forward reference (MFR) heuristic [15] to
extract transactions within user sessions, with a transaction comprising one or more
navigation pages and terminating with a content page. In order to reduce the lengths
of very long transactions, we applied a new heuristic that we are currently
investigating. This heuristic, PR × ILW (page rank × inverse links-to-word-count
ratio), combines the page rank algorithm [16, 17] and the links:word count ratio of
web pages to classify them as navigation or content pages. The distribution of
transaction lengths obtained is shown in Table 1.
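The MFR step can be sketched as follows, under simplifying assumptions: a session is given as an ordered list of page identifiers, and a revisit to a page already on the current path is treated as a backward reference. Each time a forward run ends, the current path is emitted as one transaction, whose last page plays the role of the presumed content page. The function name and the sample session are hypothetical, and the sketch covers neither session extraction nor the PR × ILW heuristic.

```python
def maximal_forward_references(session):
    """Split one session (ordered list of pages) into maximal forward paths."""
    path = []               # current forward path from the session start
    moving_forward = False  # True if the previous step extended the path
    results = []
    for page in session:
        if page in path:
            # Backward reference: emit the path if a forward run just ended,
            # then backtrack so the revisited page is the last one on the path.
            if moving_forward:
                results.append(list(path))
            del path[path.index(page) + 1:]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    # The session may end in the middle of a forward run.
    if moving_forward and path:
        results.append(list(path))
    return results


# Hypothetical session, one letter per page.
session = list("ABCDCBEGHGWAOUOV")
transactions = ["".join(p) for p in maximal_forward_references(session)]
```

For this session the sketch yields the paths ABCD, ABEGH, ABEGW, AOU, and AOV.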
Finally, we ran both the classical and joinless apriori algorithms on the
transactions database for the following values of minimum support count: 1, 2, 3, 12,
24, 59, 118, and 235 (corresponding to the following fractions of the transaction
database size: 0.000001, 0.000005, 0.00001, 0.00005, 0.0001, 0.00025, 0.0005,
0.001), and for different average association rule lengths. Average rule lengths were
controlled by varying the maximum acceptable transaction length L_max between 5 and 238