a good trade-off between the runtime and the number
of extracted FCLs relative to the predefined
threshold. Indeed, the number of FCLs depends
on the threshold set at the beginning of the algorithm.
A higher threshold allows extracting the FCLs in a
short time, but the result is not significant enough,
since most of the links will be rejected and significant
latent information may be omitted in this case, while
a lower threshold will extract more frequent links at
the expense of performance.
FLMin (Stattner, 2012d) uses a bottom-up search
and the Apriori principle (Samatova, 2014),
browsing only those itemsets whose
subitemsets are all frequent. Using the same principle
as well as the frequency property (Stattner, 2012b),
MFCLMin (Stattner, 2012c) looks for the maximal
frequent conceptual links, i.e., those that are not
included in any other frequent conceptual link.
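The pruning step implied by the Apriori principle can be sketched as follows. This is a minimal illustration on plain itemsets, not the FLMin implementation: the function name `apriori_candidates` and the single-letter items are assumptions made for the example, whereas in the FCL setting each item would be an (attribute, value) pair.

```python
from itertools import combinations

def apriori_candidates(frequent_k, k):
    """Build (k+1)-itemset candidates whose k-item subsets are all frequent."""
    frequent_k = set(frequent_k)
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) != k + 1:
                continue
            # Downward closure: keep the candidate only if every one of its
            # k-item subsets is already known to be frequent.
            if all(frozenset(s) in frequent_k for s in combinations(union, k)):
                candidates.add(union)
    return candidates

# Hypothetical frequent 2-itemsets over abstract items a, b, c, d.
frequent_2 = {frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("bd")}
candidates_3 = apriori_candidates(frequent_2, 2)
print(candidates_3)
```

Here only {a, b, c} survives: {a, b, d} and {b, c, d} are discarded without any database scan because {a, d} and {c, d} are not frequent.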
Subsequently, the authors of (Stattner, 2013) and
(Tabatabaee, 2017) proposed the algorithms
H-MFCLMin and D-MFCLMin, respectively, which
implement the concepts of filtering threshold and
itemset dependency to reduce the search space, thus
significantly improving the performance of the search
process at the cost of losing some of the searched patterns.
Comparing with the results of the complete search
process, the authors have shown that the loss is
acceptable from a certain support threshold. Finally,
PALM (Stattner, 2017) is a parallel implementation
that tries to improve performance of the extraction
process by simultaneously exploring several parts of
the search space.
To the best of our knowledge, this is an
exhaustive list of the works addressing the FCL
extraction problem. While the last one constitutes a
parallel implementation, the others are sequential
and adopt an Apriori-based approach, i.e.,
scanning the database for each FCL candidate and
computing its relative support. Furthermore, each
of the sequential implementations improves on the
performance of the previous one by exploiting more
properties of the network. At this stage, we should
note that although the solution space given by the
MFCLMin and D-MFCLMin algorithms is
smaller than that obtained by FLMin, this does not
cause any loss of solutions because, as in the
itemset mining problem, all the FCLs in the network
can be reached from the maximal FCLs.
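The itemset analogy invoked here can be made concrete with a short sketch. It is written for plain itemsets only, and the helper name `expand_maximal` is hypothetical; we assume the same reasoning carries over to conceptual links by expanding both itemsets of each maximal link. Note that the expansion recovers the frequent itemsets themselves, not their support values.

```python
from itertools import combinations

def expand_maximal(maximal_itemsets):
    """Return every non-empty subset of the given maximal itemsets."""
    recovered = set()
    for itemset in maximal_itemsets:
        items = sorted(itemset)
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                recovered.add(frozenset(subset))
    return recovered

# Hypothetical maximal frequent itemsets over abstract items a, b, c, d.
maximal = [frozenset("abc"), frozenset("bd")]
all_frequent = expand_maximal(maximal)
print(len(all_frequent))
```

By downward closure every subset of a frequent itemset is frequent, so the two maximal itemsets above are enough to describe all nine frequent itemsets.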
In contrast, H-MFCLMin sacrifices some
solutions for a performance gain. Finally, MFCLMin
and D-MFCLMin remain the only sequential
implementations that list all the maximal FCLs for a
given network. Despite all the properties exploited by
these two algorithms (the frequency property, the
downward-closure property and the dependency
property) to maximize the performance of the
extraction process, their main problem, shared by any
Apriori-based algorithm, is the repeated scanning of the
database. Indeed, MFCLMin and D-MFCLMin
proceed in a breadth-first manner: they generate all
candidate conceptual links of size k, scan the
network for each candidate and eliminate all but the
frequent ones before moving on to larger candidate
conceptual links. This may induce a heavy load on
the process for large networks.
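To illustrate the cost of repeated scanning, the following sketch runs a generic level-wise (breadth-first) search and counts one full pass over the data per candidate. It is a simplified stand-in operating on a list of transactions rather than on a network, it omits the subset-based pruning for brevity, and the names `levelwise_mine` and `scans` are assumptions introduced for the example; it is not the MFCLMin or D-MFCLMin procedure.

```python
from itertools import combinations

def levelwise_mine(transactions, min_support):
    """Breadth-first mining; returns (frequent itemsets with support, scan count)."""
    n = len(transactions)
    scans = 0
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while level:
        survivors = []
        for candidate in level:
            # One full pass over the data for every candidate: this is the
            # repeated-scan cost of the level-wise approach.
            scans += 1
            support = sum(1 for t in transactions if candidate <= t) / n
            if support >= min_support:
                frequent[candidate] = support
                survivors.append(candidate)
        # Join the frequent k-itemsets into (k+1)-item candidates.
        k += 1
        level = list({a | b for a, b in combinations(survivors, 2)
                      if len(a | b) == k})
    return frequent, scans

# Hypothetical toy data set.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}]
frequent, scans = levelwise_mine(transactions, 0.5)
print(len(frequent), scans)
```

Even on this four-transaction example, eight passes over the data are needed to find five frequent itemsets; the number of passes grows with the number of candidates, which is what becomes costly on large networks.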
In this paper, we present a new solution for the
maximal FCL extraction problem, namely
Bin-MFCLMin. It is a sequential implementation
that looks for all the maximal FCLs within a social
network and uses a compressed binary representation
of the social network to reduce the time needed to
extract frequent conceptual links. As we will see
throughout this paper, the compressed representation
transforms the input network data into an integer
matrix that is more than 60 times smaller
than the original network, which allows a gain in run
time of up to 91%. The paper is organized as follows:
Section 2 gives details about the problem modelling,
Section 3 explains the proposed solution and Section
4 presents and discusses the obtained results. Finally, we
conclude and present our perspectives in the last
section.
2 PROBLEM MODELLING
In order to model the FCL extraction problem, we
consider a social network represented by a graph G =
(V, E), where V is the set of nodes and E is the set of
relations between the nodes.
We use a set of attributes $A = (a_1, \dots, a_m)$ and a set
of attribute values $(a_{11}, \dots, a_{1j_1}, a_{21}, \dots, a_{2j_2}, \dots, a_{m1}, \dots, a_{mj_m})$,
where $j_k$ is the number of values that the attribute $a_k$ can take.
Each node is described by a set of (attribute,
value) pairs. Each attribute = value pair is called an item,
and the set of (attribute, value) pairs describing one or more nodes
constitutes an itemset.
An itemset that contains one (attribute,
value) pair is called a 1-itemset, while an itemset containing
t (attribute, value) pairs is called a t-itemset.
If m1 and m2 are two itemsets, then the set of ties
linking the nodes satisfying the itemset m1 and the
nodes satisfying the itemset m2 constitutes a
conceptual link denoted (m1, m2):
$$(m_1, m_2) = \{\, e \in E \mid e = (a, b),\ a, b \in V,\ a \text{ satisfies } m_1 \text{ and } b \text{ satisfies } m_2 \,\} \tag{1}$$