(de Souza Gazolli and Salles, 2012), because it is fast,
simple, and has no parameters to calibrate. Furthermore,
its results are similar to those of other approaches.
The approach used in this work, called Contextual
Mean Census Transform (CMCT), represents each
image as a vector. This approach works on a grayscale
image. A small 3 × 3 mask is centered on each pixel x
of the image and operates on the pixels neighboring x.
First, the average intensity $I_x$ of the image pixels
under the mask is calculated. Next, the intensities of
the pixels neighboring x (the set $N_x$) are compared
to $I_x$.
If the intensity $I_y$ of a pixel y is greater than or
equal to $I_x$, a bit 1 is generated in the position of
the 3 × 3 mask that lies over pixel y. Otherwise, a 0 is
generated in that position. Equation 3 shows this
calculation. After going through all the neighboring
pixels of x, the 3 × 3 mask holds binary values in all
positions except the central one. These values form a
binary word of 8 bits, which is converted into an integer
between 0 and 255. This initial operation is called the
Modified Census Transform (MCT), or MCT8, because it
forms words of 8 bits.
$$T_x = \bigotimes_{y \in N_x} \zeta(I_y, I_x), \qquad
\zeta(m, n) = \begin{cases} 1, & m \geq n \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$
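As an illustration of Equation 3, the following minimal Python sketch computes MCT8 for an image given as a list of rows of integer intensities. The function name `mct8`, the skipping of border pixels, and the bit order (row by row, with the top-left neighbor as the most significant bit) are assumptions made here; the text does not fix them.

```python
def mct8(image):
    """Apply the 8-bit Modified Census Transform to a grayscale image.

    `image` is a list of rows of integer intensities. Assumptions not
    fixed by the text: border pixels are skipped, so the result has two
    fewer rows and columns, and neighbors are read row by row with the
    top-left neighbor as the most significant bit.
    """
    height, width = len(image), len(image[0])
    result = []
    for r in range(1, height - 1):
        row = []
        for c in range(1, width - 1):
            # The nine pixels under the 3 x 3 mask centered on (r, c).
            window = [image[r + dr][c + dc]
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            total = sum(window)
            word = 0
            for k, value in enumerate(window):
                if k == 4:  # the central pixel contributes no bit
                    continue
                # zeta(I_y, I_x): I_y >= mean, written as 9 * I_y >= sum
                # so the comparison stays exact for integer intensities.
                word = (word << 1) | (1 if 9 * value >= total else 0)
            row.append(word)
        result.append(row)
    return result
```

On a uniform 3 × 3 patch every neighbor equals the mean, so all eight bits are set and the single output value is 255.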
The number obtained in Equation 3 is stored in a pixel
of a new image, at the position corresponding to the
location of pixel x. After the mask has passed over all
pixels of the original image, a new image is obtained.
A histogram is computed from this new image, that is,
the number of times each value occurs in it. This
histogram has $2^8 = 256$ elements.
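The histogram step can be sketched as follows, assuming the transformed image is available as a list of rows of integers between 0 and 255. The name `histogram256` is illustrative only.

```python
def histogram256(mct_image):
    """Count how often each value 0..255 occurs in an MCT image."""
    counts = [0] * 256
    for row in mct_image:
        for value in row:
            counts[value] += 1
    return counts
```

The returned list always has 256 entries, and their sum equals the number of pixels in the transformed image.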
After this step, the MCT is passed over this new
image and the histogram of the resulting image is ob-
tained, again with 256 elements. The histograms of
both images are concatenated to form a vector h of
512 elements. To avoid very large differences between
the values of the elements, and to achieve better
classification performance, a logarithm is applied to
the non-zero values of the vector h, as shown in
Equation 4. Finally, the new vector $\tilde{h}$ is
normalized by Equation 5.
$$\tilde{h}_i = 1 + \log(h_i) \quad \forall\, i \in \{1, \ldots, 512\} \mid h_i > 0. \tag{4}$$
$$\hat{h}_i = \frac{\tilde{h}_i}{\sum_{i=1}^{512} \tilde{h}_i} \quad \forall\, i \in \{1, \ldots, 512\}. \tag{5}$$
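Equations 4 and 5 can be sketched in a few lines of Python. The natural logarithm and the function name `log_normalize` are assumptions, since the base of the logarithm is not stated; bins equal to zero are left at zero.

```python
import math


def log_normalize(h):
    """Apply Equation 4 to the non-zero bins of h, then Equation 5."""
    # Equation 4: 1 + log(h_i) where h_i > 0; zero bins stay at 0.
    # The natural logarithm is an assumption of this sketch.
    h_tilde = [1.0 + math.log(v) if v > 0 else 0.0 for v in h]
    total = sum(h_tilde)
    if total == 0:  # all bins empty: nothing to normalize
        return h_tilde
    # Equation 5: divide by the sum so that the entries add up to 1.
    return [v / total for v in h_tilde]
```

Here `h` would be the 512-element concatenation of the two histograms, although the function accepts a vector of any length.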
The previous procedure is applied to each image
of the dataset. As a result, a numerical matrix F with
dimensions W × 512 represents the images, and a vector
N with dimensions W × 1 holds the categories associated
with the images, where W is the number of images in the
dataset.
3 DATASETS
For the experiments, posts were collected from six
Brazilian groups on the Facebook social network,
covering the period from 04/2011 to 07/2015, using the
Facebook4j API (facebook4j.org). The groups cover
different subjects, such as nature, religion, and
politics. Each post contains a set of fields, such as
the message, posting date, post ID, group name, and the
link to the image posted on Facebook, among others.
The second step was to collect the images. In this
step, only images hosted on Facebook were collected,
because links to external pages could be broken.
Moreover, we only collected images of at least
15 × 15 pixels and at least 15 kB.
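The size criteria can be expressed as a simple predicate. This sketch assumes that both sides of the image must measure at least 15 pixels and that 1 kB is 1000 bytes; neither detail is stated above, and `keep_image` is a hypothetical name.

```python
MIN_SIDE_PX = 15            # minimum width and height, from the text
MIN_FILE_BYTES = 15 * 1000  # 15 kB; reading kB as 1000 bytes is an assumption


def keep_image(width, height, size_bytes):
    """Return True if an image meets both minimum-size criteria."""
    return (width >= MIN_SIDE_PX and height >= MIN_SIDE_PX
            and size_bytes >= MIN_FILE_BYTES)
```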
Then, we applied a filtering procedure to the collected
data. This procedure removes every post whose message
field or image link field is empty. We also treated the
image link as empty when a link was present but the
image could not be collected. After these steps, the
dataset was ready to be used in our studies.
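A minimal sketch of this filter is given below. It assumes that each post is a dictionary with the hypothetical keys `id`, `message`, and `image_link`, and that the IDs of the posts whose image was successfully downloaded are kept in a set; none of these names comes from the Facebook4j API.

```python
def filter_posts(posts, collected_image_ids):
    """Keep posts with a non-empty message and a collected image.

    `posts` is a list of dicts with the hypothetical keys "id",
    "message" and "image_link"; `collected_image_ids` is the set of
    post IDs whose image was actually downloaded.
    """
    kept = []
    for post in posts:
        has_message = bool((post.get("message") or "").strip())
        has_link = bool((post.get("image_link") or "").strip())
        # A link whose image was not collected counts as an empty link.
        has_image = has_link and post["id"] in collected_image_ids
        if has_message and has_image:
            kept.append(post)
    return kept
```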
To accomplish the task of recommending groups to users
based on their post profiles, the following procedure
was adopted. Initially, the 20 users with the largest
number of posts were identified. All posts from these
20 users were separated from the dataset; they are used
to obtain information about user profiles and to
recommend groups to the users. These data are called
the user dataset.
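The separation of the most active users can be sketched as follows, assuming each post is a dictionary with a hypothetical `user_id` key. The text does not say how ties between users with the same number of posts are resolved; this sketch simply follows the order returned by `Counter.most_common`.

```python
from collections import Counter


def split_top_users(posts, n_users=20):
    """Split posts into (user_dataset, remaining_posts).

    The user dataset holds every post written by the `n_users` users
    with the most posts; "user_id" is a hypothetical field name.
    """
    counts = Counter(post["user_id"] for post in posts)
    top_users = {user for user, _ in counts.most_common(n_users)}
    user_dataset = [p for p in posts if p["user_id"] in top_users]
    remaining = [p for p in posts if p["user_id"] not in top_users]
    return user_dataset, remaining
```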
After removing all posts by these 20 users from the
dataset, the next step was to randomly select 1001
posts from each group. This number was chosen so that
every group contributes the same number of posts. These
posts are used to obtain information about each group.
These data are called the group dataset.
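The balanced sampling can be sketched as below. Drawing without replacement, the fixed seed, and the field name `group_name` are assumptions of this sketch; with six groups and 1001 posts per group it returns 6006 posts.

```python
import random


def sample_group_dataset(posts, per_group=1001, seed=0):
    """Draw `per_group` posts at random from each group.

    Sampling is without replacement (an assumption); the seed is fixed
    only to make the sketch reproducible. Raises ValueError if a group
    has fewer than `per_group` posts.
    """
    rng = random.Random(seed)
    by_group = {}
    for post in posts:
        by_group.setdefault(post["group_name"], []).append(post)
    sample = []
    for name in sorted(by_group):
        sample.extend(rng.sample(by_group[name], per_group))
    return sample
```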
For each post of the user and group datasets, a
representation of the message in the vector space model
was obtained, as described in Section 2.1. The messages
of the user dataset therefore form a matrix of
4830 × 6183, that is, 4830 posts and 6183 different
terms, while the messages of the group dataset form a
matrix of 6006 × 6183.
For each dataset, the vector representation of each
image included in the posts was obtained using the
procedure described in Section 2.2. This yielded a
4830 × 512 matrix for the user dataset and a
6006 × 512 matrix for the group dataset.
Thus, each post in each dataset is represented by two
vectors: one representing the message and the other
representing the image. The experiments are carried out
with these matrices.
Recommending Groups to Users based Both on Their Textual and Image Posts