X ⊂ I. An itemset of cardinality (or size) k is called
a k-itemset. The database D = {t_1, t_2, ..., t_n} is a
set of transactions, identified by an identifier; each
transaction contains a subset of the items described
in I.
To evaluate an itemset, the support sup(X,D) is
the number of transactions in the database D that
contain the given itemset X. If sup(X,D) ≥ minsup, the
itemset X is frequent. If X ⊂ Y , Y is a superset of X ;
if Y is frequent, X is also frequent. A frequent itemset
X is called maximal if it has no frequent supersets. A
frequent set X is closed if it has no frequent superset
with the same support.
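
The definitions above can be illustrated with a minimal Python sketch. The toy database, the item names and the threshold are assumptions made for illustration only and are not the paper's data; the enumeration is brute force, which is practical only for a handful of items.

```python
from itertools import combinations

# Toy transaction database (hypothetical items, not the paper's data).
D = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "b"},
    {"b", "c"},
    {"c"},
]

def sup(X, D):
    """Absolute support: number of transactions that contain every item of X."""
    return sum(1 for t in D if X <= t)

minsup = 2  # assumed absolute threshold
items = sorted(set().union(*D))

# Brute-force enumeration of all non-empty itemsets (toy sizes only).
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        X = frozenset(combo)
        s = sup(X, D)
        if s >= minsup:
            frequent[X] = s

# Maximal: no frequent proper superset.
maximal = {X for X in frequent if not any(X < Y for Y in frequent)}

# Closed: no frequent proper superset with the same support.
closed = {
    X for X in frequent
    if not any(X < Y and frequent[Y] == frequent[X] for Y in frequent)
}
```

In this toy case {a} is frequent but not closed, because its superset {a, b} has the same support, while {a, b} and {b, c} are both closed and maximal.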
From the frequent itemsets, association rules are
obtained by comparing the frequencies of their items. A rule
constructed from several items has the form
X = (x_1, x_2, ..., x_j) → Y = (y_1, y_2, ..., y_k)
where X and Y are disjoint itemsets in I, that is,
X ∩ Y = ∅; X and Y are called the antecedent
(left-hand side or LHS) and the consequent
(right-hand side or RHS) of the rule, respectively
(Hahsler et al., 2005). Each rule is then evaluated
using measures such as support and confidence.
The support of a rule is defined by the joint
probability P(X ∩ Y), the fraction of transactions
that contain both X and Y. The confidence of a
rule is defined as the conditional probability P(Y|X),
which measures how often the items of Y appear in
transactions that contain X:
confidence(X → Y) = P(Y|X) = support(X ∪ Y, D) / support(X, D)
Given a set of transactions D, the goal of association
rule mining is to find all rules with support of
at least minsup (a threshold) and confidence of at least
minconf (another threshold). The lift of a rule
X → Y, which indicates a positive correlation between
X and Y when it exceeds 1, is defined as
lift(X → Y) = P(Y|X) / P(Y) = P(X ∩ Y) / (P(X) P(Y))
Further details about all these concepts and operations
can be found in (Zaki and Wagner Meira, 2014).
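
As a worked illustration of the support, confidence and lift of a rule, the following Python sketch evaluates the rule {x} → {y} on a toy database. The transactions and function names are assumptions for illustration and do not reproduce the arules implementation.

```python
# Toy transactions (hypothetical items, not the paper's data).
D = [
    {"x", "y"},
    {"x", "y"},
    {"x"},
    {"y"},
    {"z"},
]

def support(X, D):
    """Relative support: fraction of transactions containing every item of X."""
    return sum(1 for t in D if X <= t) / len(D)

def confidence(X, Y, D):
    """confidence(X -> Y) = support(X u Y) / support(X)."""
    return support(X | Y, D) / support(X, D)

def lift(X, Y, D):
    """lift(X -> Y) = support(X u Y) / (support(X) * support(Y))."""
    return support(X | Y, D) / (support(X, D) * support(Y, D))

X, Y = {"x"}, {"y"}
rule_support = support(X | Y, D)      # 2 of 5 transactions contain both
rule_confidence = confidence(X, Y, D) # 2 of the 3 transactions with x also have y
rule_lift = lift(X, Y, D)             # above 1 here, so x and y co-occur more than expected
```

Here the rule has support 0.4, confidence 2/3 and a lift slightly above 1, so it would be retained with, for example, minsup = 0.3 and minconf = 0.6.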
To obtain the different itemsets (frequent, closed
and maximal) and to generate and explore rules, the
input data can be either an n × m table, with n species
in rows and m attributes in columns, or a list
of transactions. In both cases, a sparse matrix
is produced. The analysis was performed with R
(R Core Team, 2014) and the R libraries arules
(Hahsler et al., 2005) and
arulesViz (Hahsler and Chelluboina, 2011).
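
The conversion from a table to transactions and then to a sparse incidence matrix can be sketched as follows. This is a simplified Python illustration under assumed column names and values; it is not the internal representation used by arules.

```python
# Hypothetical table: one row per species, one column per attribute.
table = [
    {"species": "sp1", "genus": "Pinus", "parks": "yes"},
    {"species": "sp2", "genus": "Quercus", "parks": "no"},
    {"species": "sp3", "genus": "Pinus", "parks": "yes"},
]

# Each row becomes a transaction of "attribute=value" items.
transactions = [{f"{k}={v}" for k, v in row.items()} for row in table]

# Assign a column index to every distinct item.
items = sorted(set().union(*transactions))
col = {item: j for j, item in enumerate(items)}

# Sparse incidence matrix: store only the (row, column) positions that hold a 1.
nonzero = {(i, col[item]) for i, t in enumerate(transactions) for item in t}
```

Only 9 of the 3 × 7 cells are stored in this example, which is why a sparse format is preferable once the number of distinct items grows.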
4 RESULTS
4.1 Items and k-Itemset Focus
We carried out a first exploration considering the
attributes retained from Groups 1, 2, 3, 4 and 5. From
the transaction records we obtain the transaction
set as an itemMatrix in sparse format with 134 rows
(transactions) and 266 columns (items): the 134 species,
the 65 genera, the 27 origins, the 10 different
levels for pollution and 30 attributes (yes and no
values) for the other variables. As many entries are
empty, the density of the matrix is 0.0789, meaning that
only 2814 entries out of 35644 contain a value.
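
The reported dimensions and density can be checked with a few lines of arithmetic. The Python sketch below uses only the figures stated above; the variable names are illustrative.

```python
rows = 134  # transactions (species)

# Number of items contributed by each group of variables, as reported above.
item_groups = {
    "species": 134,
    "genera": 65,
    "origins": 27,
    "pollution levels": 10,
    "yes/no attributes": 30,
}

cols = sum(item_groups.values())  # total number of items (columns)
cells = rows * cols               # total number of matrix entries
filled = 2814                     # non-empty entries, as reported

density = filled / cells
items_per_transaction = filled / rows
```

The item counts add up to 266 columns and 35644 cells, the density rounds to 0.0789, and the non-empty entries correspond to an average of 21 items per transaction.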
The absolute frequency distribution of items
related to recommended and not recommended (yes
and no values) planting sites is: streets and middle-
roads (71, 63), urban recreational parks (130, 4),
parking lots (105, 29), beneath electric lines (73, 61),
cemeteries (116, 18) and sport fields (120, 14).
With the Apriori algorithm, k-itemsets can be
generated starting from 1-itemsets. However, a
minimum support must be fixed. If, for example,
the minimum support is set to 0.01, 10525099
itemsets are generated. Such a large number of
itemsets is not directly exploitable and calls for
other strategies, such as raising the minimum support
threshold or extracting subsets that contain
a given item of interest. With a threshold of 0.1
for minimum support, we obtain 272958 frequent
itemsets (with a minimum of 476 itemsets with 2
items and a maximum of 22647 itemsets with 10
items), 38294 closed frequent itemsets and 23771
maximal itemsets (with a minimum of 2 itemsets
with 2 items and a maximum of 22647 itemsets with
10 items).
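
The level-wise generation of k-itemsets from 1-itemsets can be sketched in Python as below. This is a minimal, unoptimized illustration of the Apriori idea on a toy database (hypothetical items and threshold), not the arules implementation used for the results reported here.

```python
from collections import Counter
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent itemset search; minsup is a relative threshold."""
    n = len(transactions)

    # Level 1: count single items and keep the frequent ones.
    counts = Counter(frozenset([item]) for t in transactions for item in t)
    level = {X: c for X, c in counts.items() if c / n >= minsup}
    frequent = dict(level)

    k = 2
    while level:
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a, b in combinations(list(level), 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        # Count the surviving candidates against the database.
        level = {}
        for c in candidates:
            count = sum(1 for t in transactions if c <= t)
            if count / n >= minsup:
                level[c] = count
        frequent.update(level)
        k += 1
    return frequent

# Toy database (hypothetical items, not the paper's data).
D = [{"a", "b", "c"}, {"a", "b"}, {"a", "b"}, {"b", "c"}, {"c"}]
frequent = apriori(D, minsup=0.4)

# Number of frequent itemsets per size k.
sizes = Counter(len(X) for X in frequent)
```

Lowering minsup enlarges every level of the search, which is consistent with the rapid growth in the number of itemsets observed when the threshold drops from 0.1 to 0.01.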
Considering that our aim is to predict planting
sites (variables from Group 4), it is important
to evaluate the distribution of itemsets containing
planting sites and the most suitable species or genera
(variables from Group 1). The distribution of itemsets
containing at least one planting site is presented in
Figure 1.
Table 1 presents, for minimum supports of 0.1
and 0.15, the distribution of the number of itemsets
(frequent, closed and maximal) and the number of
itemsets that include the genera Pinus and Quercus.
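
Counting the itemsets that contain a given item of interest at different support thresholds amounts to a simple filter. The Python sketch below shows the idea; the item labels and support values are hypothetical and do not correspond to the entries of Table 1.

```python
# Hypothetical frequent itemsets with their relative support (not the paper's results).
frequent = {
    frozenset({"parks=yes"}): 0.90,
    frozenset({"genus=Pinus"}): 0.10,
    frozenset({"genus=Pinus", "parks=yes"}): 0.10,
    frozenset({"genus=Quercus", "parks=yes"}): 0.12,
    frozenset({"origin=native"}): 0.40,
}

def containing(itemsets, targets):
    """Keep the itemsets that share at least one item with `targets`."""
    targets = set(targets)
    return {X: s for X, s in itemsets.items() if X & targets}

def count_at_threshold(itemsets, targets, minsup):
    """Number of itemsets containing a target item with support >= minsup."""
    return sum(1 for s in containing(itemsets, targets).values() if s >= minsup)

genera = {"genus=Pinus", "genus=Quercus"}
n_at_010 = count_at_threshold(frequent, genera, 0.10)
n_at_015 = count_at_threshold(frequent, genera, 0.15)
```

In this toy case the itemsets containing either genus disappear when the threshold is raised from 0.10 to 0.15, the kind of effect that low-frequency items produce.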
This exploration of itemsets can continue by
increasing or decreasing the minimum support or by
eliminating items with lower frequencies. Reducing
the minimum support may not be enough
to reveal attributes from Group 1 (Table 2), given their
low frequencies. Another option to consider is to
reduce the number of items, eliminating those with
Exploring Urban Tree Site Planting Selection in Mexico City through Association Rules